LinuxContainerExecutor mkdir failures causing NodeManagers to become unhealthy

LinuxContainerExecutor mkdir failures causing NodeManagers to become unhealthy

Jonathan Bender
Hello,

We recently started using CGroups with the LinuxContainerExecutor, running Apache Hadoop 3.0.0. Occasionally (once in many millions of tasks) a YARN container will fail with a message like the following:
    WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 35. Privileged Execution Operation Stderr:
    Could not create container dirs
    Could not create local files and directories

The root failure seems to be in the underlying mkdir call, but that exit code / errno is swallowed, so we don't have more detail. We tend to see this when many containers start at the same time for the same application on a host, and we suspect a race condition around the directories that containers of the same application share.
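
To illustrate the kind of race we suspect, here is a minimal sketch in C (this is not the actual container-executor code, and the path is made up). Treating EEXIST as success and logging errno on real failures would both tolerate the race and surface the underlying cause instead of the generic "could not create dirs" message:

    /* Sketch only: race-tolerant directory creation with errno reporting. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int mkdir_if_missing(const char *path, mode_t mode) {
        if (mkdir(path, mode) == 0) {
            return 0;                /* we created it */
        }
        if (errno == EEXIST) {
            return 0;                /* another container won the race; that's fine */
        }
        /* Surface the real cause instead of a bare "could not create dirs". */
        fprintf(stderr, "mkdir(%s) failed: %s (errno=%d)\n",
                path, strerror(errno), errno);
        return -1;
    }

    int main(void) {
        /* Hypothetical shared application directory, for illustration only. */
        return mkdir_if_missing("/tmp/nm-local-dir/usercache/foo/appcache/app_0001", 0750) ? 1 : 0;
    }

Obviously the real fix would belong in container-executor itself, but that is the shape of the check we would hope to see.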

Has anyone seen similar failures when using the LinuxContainerExecutor?


When this happens, the NodeManager also gets marked unhealthy. Under some circumstances that seems appropriate, but since this is a transient failure (none of these machines were near capacity for disks, inodes, etc.), taking down the NodeManager seems too drastic. The blacklisting behavior was added in https://issues.apache.org/jira/browse/YARN-6302, which seems perfectly valid, but perhaps it should be made configurable so that certain users can opt out?

Cheers,
Jon
Re: LinuxContainerExecutor mkdir failures causing NodeManagers to become unhealthy

Eric Badger
Hi Jonathan,

Have you opened up a YARN JIRA with your findings? If not, that would be the next step in debugging the issue and coding up a fix. This certainly sounds like a bug and something that we should get to the bottom of.

As far as NodeManagers becoming unhealthy, a config could be added to prevent this. But if you're only seeing one failure out of millions of tasks, such a config seems like it would mask more problems than it fixes. One container failing is bad, but a node going bad and failing every container that runs on it until it is shut down is much, much worse. However, if you think you have a use case that would benefit from making this behavior configurable, that is something we could also look into. That would be a separate YARN JIRA as well.

Thanks,

Eric

Re: LinuxContainerExecutor mkdir failures causing NodeManagers to become unhealthy

Jeff Hubbs
I would also suggest moving up to 3.1.1 and trying again. Barring that, maybe take the error message at its word. My experience with running Hadoop 3.x jobs is a little limited, but I know that jobs can write a lot of data into /tmp/hadoop-yarn, and if your nodes can't absorb a lot of expansion in that directory, things will error out, albeit softly.

Noting the way the terasort example behaves in that regard, I set up my worker nodes to make /tmp/hadoop-yarn a mount point for its own disk volume, whose size I can preset and on which I can optionally enable transparent compression via btrfs. Most of the time I'd expect to give that volume some token small size, but when trying a 1/5-scale (i.e., 200 GB) terasort run, 128 GiB with compression enabled across five workers wasn't enough. At 1/10 scale I could manage, but at 1/5 it would fill up one node's /tmp/hadoop-yarn, then the next, then the next, and so on. That makes me think terasort tries to write the whole dang thing out to the local (non-HDFS) file system before creating an output file in HDFS.
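
If you want to keep an eye on how close a node is to filling that volume, here's a rough sketch of mine (not anything from Hadoop) in C using statvfs; the /tmp/hadoop-yarn path is just an example, so point it at whatever your yarn.nodemanager.local-dirs actually uses:

    /* Sketch: report how full the filesystem backing a YARN local dir is. */
    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(void) {
        const char *dir = "/tmp/hadoop-yarn";   /* example path; adjust to your setup */
        struct statvfs vfs;
        if (statvfs(dir, &vfs) != 0) {
            perror("statvfs");
            return 1;
        }
        double total = (double)vfs.f_blocks * vfs.f_frsize;
        double avail = (double)vfs.f_bavail * vfs.f_frsize;
        double used_pct = total > 0 ? 100.0 * (1.0 - avail / total) : 0.0;
        printf("%s: %.1f%% used, %.1f GiB available\n",
               dir, used_pct, avail / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

The NodeManager's own disk health checker does something along these lines when deciding whether a local dir is still usable.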

Re: LinuxContainerExecutor mkdir failures causing NodeManagers to become unhealthy

Shane Kumpf
Hey Jon,

YARN-8751 takes care of the issue that marks the NM unhealthy under these conditions. If you can open a JIRA with details on the swallowed error, that would be appreciated. As noted, 3.1.1 has a number of fixes to the YARN containerization features, so it would be great if you can see if the issue still occurs with that release.

Thanks,
-Shane

Re: LinuxContainerExecutor mkdir failures causing NodeManagers to become unhealthy

Jonathan Bender
Thanks for the responses, all!

@Shane - that's great; we planned to move to 3.1.x soon anyway, so this is all the more reason to do it.

@Eric - I opened a JIRA here with my findings: https://issues.apache.org/jira/browse/YARN-8786
