yarn usercache dir not resolved properly when running an example application

Vinay Kashyap

I am using Hadoop 3.2.0 and trying to run a simple application in a Docker container. I have made the required configuration changes in both yarn-site.xml and container-executor.cfg to select LinuxContainerExecutor and the Docker runtime.

I am following the distributed shell example from one of the Hortonworks blog posts: https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/

When the application is submitted to YARN, it fails with a directory creation error, shown below:

2019-02-14 20:51:16,450 INFO distributedshell.Client: Got application report from ASM for, appId=2, clientToAMToken=null, appDiagnostics=Application application_1550156488785_0002 failed 2 times due to AM Container for appattempt_1550156488785_0002_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2019-02-14 20:51:16.282]Application application_1550156488785_0002 initialization failed (exitCode=20) with output:
main : command provided 0
main : user is myuser
main : requested yarn user is myuser
Failed to create directory /data/yarn/local/nmPrivate/container_1550156488785_0002_02_000001.tokens/usercache/myuser - Not a directory

I have configured yarn.nodemanager.local-dirs in yarn-site.xml, and I can see the same value reflected in the YARN web UI at localhost:8088/conf:

<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/yarn/local</value>
    <final>false</final>
    <source>yarn-site.xml</source>
</property>

I do not understand why it is trying to create the usercache directory inside the nmPrivate directory.

Note: I have verified that myuser has permissions on these directories, and I have also tried clearing the directories manually as suggested in a related post, but that did not help. I do not see any additional information about the container launch failure in any other logs.

How do I debug why the usercache dir is not resolved properly?
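
One way to narrow down a "Not a directory" failure like this is to walk the path from the error message and find the first component that exists but is not a directory. The sketch below is only an illustration: first_non_directory is a hypothetical helper name, not part of Hadoop, and it assumes a POSIX shell and an absolute path. In the error above, the component ending in .tokens looks like a container token file rather than a directory, which would explain why usercache/myuser cannot be created beneath it, though nothing in this thread confirms that.

```shell
# Hypothetical helper (not part of Hadoop): print the first component of an
# absolute path that exists but is not a directory. Returns 1 if none is found.
# The body runs in a subshell so its variables do not leak into the caller.
first_non_directory() (
  rest=${1#/}          # remaining path, without the leading "/"
  current=""           # prefix checked so far
  while [ -n "$rest" ]; do
    part=${rest%%/*}   # next component, up to the following "/"
    case "$rest" in
      */*) rest=${rest#*/} ;;
      *)   rest="" ;;
    esac
    [ -z "$part" ] && continue   # skip empty components from doubled slashes
    current="$current/$part"
    if [ -e "$current" ] && [ ! -d "$current" ]; then
      printf '%s\n' "$current"
      exit 0
    fi
  done
  exit 1
)
```

Running first_non_directory on the NodeManager host with the full path from the error message would show whether the .tokens component is a regular file.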

Really appreciate any help on this.

Thanks

Vinay Kashyap

Re: yarn usercache dir not resolved properly when running an example application

Prabhu Josephraj
Hi Vinay,

    Can you try specifying the configs below under the Docker section of container-executor.cfg? They allow Docker containers to use the NM local dirs.

      docker.allowed.ro-mounts=/data/yarn/local,/usr/jdk64/jdk1.8.0_112/bin
      docker.allowed.rw-mounts=/data/yarn/local,/data/yarn/log
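
Before relying on these settings, it can help to confirm that every path listed under an allowed-mounts key actually exists on the NodeManager host. The following is a rough sketch, assuming a POSIX shell and a plain key=value layout in container-executor.cfg, with every entry treated as a literal path. list_mounts and missing_mounts are hypothetical helper names, not Hadoop tools.

```shell
# Hypothetical helpers (not part of Hadoop).

# Print each comma-separated entry of key "$2" in cfg file "$1", one per line.
# Empty entries (for example from a doubled comma) are dropped.
list_mounts() (
  cfg=$1 key=$2
  # Strip leading whitespace, then keep only lines that start with "key=".
  sed 's/^[[:space:]]*//' "$cfg" | while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      "$key="*) printf '%s\n' "${line#*=}" | tr ',' '\n' | sed '/^$/d' ;;
    esac
  done
)

# Print the entries of key "$2" in cfg file "$1" that do not exist on this host.
missing_mounts() (
  list_mounts "$1" "$2" | while IFS= read -r entry; do
    [ -e "$entry" ] || printf '%s\n' "$entry"
  done
)
```

For example, calling missing_mounts with the location of container-executor.cfg and docker.allowed.rw-mounts prints any configured mount that is absent; no output means every entry was found.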

Thanks,
Prabhu Joseph

Re: yarn usercache dir not resolved properly when running an example application

Vinay Kashyap
Hi Prabhu,

Thanks for your reply.
I tried the configuration you suggested, but I get the same error.
Is this related to container localization by any chance?
Also, is there any log or output that shows the Docker container runtime has been picked up?

Re: yarn usercache dir not resolved properly when running an example application

Prabhu Josephraj
For a Distributed Shell job, the ApplicationMaster runs in a normal Linux container and only the subsequent shell command runs inside a Docker container. This job fails before the AM is launched, that is, before any Docker container is started. I think the Distributed Shell job would fail even without the Docker settings.

Exit code 20 is most likely related to accessing the NM local directory:

20    INITIALIZE_USER_FAILED    Couldn't get, stat, or secure the per-user NodeManager directory.

Can you try the steps below on every NodeManager machine?

Remove all contents under /data/yarn, and make sure that /data and /data/yarn have permission 755 with owner root:root and that the local directory is owned by yarn:hadoop.

[root@tparimi-tarunhdp26-4 ~]# ls -lrt /
drwxr-xr-x.   5 root root    44 Oct 24 11:47 data

[root@tparimi-tarunhdp26-4 ~]# ls -lrt /data/
drwxr-xr-x. 4 root      root   28 Oct 24 14:30 yarn

[root@tparimi-tarunhdp26-4 ~]# ls -lrt /data/yarn/
total 4
drwxr-xr-x.  5 yarn hadoop   54 Feb 14 17:32 local
drwxrwxr-x. 10 yarn hadoop 4096 Feb 14 17:32 log
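
The ownership and mode checks shown in these listings can be scripted. This is a minimal sketch, assuming a POSIX shell and an ls whose long listing puts mode, owner, and group in fields 1, 3, and 4, as in the output above. check_dir is a hypothetical helper name, and the expected values in the usage comments are the ones suggested in this reply.

```shell
# Hypothetical helper (not part of Hadoop): compare a path's mode, owner, and
# group against expected values, using the long listing from ls.
# Usage: check_dir PATH MODE OWNER GROUP   (MODE as printed by ls, e.g. drwxr-xr-x)
check_dir() (
  path=$1 want_mode=$2 want_owner=$3 want_group=$4
  listing=$(ls -ld "$path" 2>/dev/null) || { echo "MISSING $path"; exit 1; }
  set -f                 # no globbing while splitting the listing into fields
  set -- $listing        # fields: mode, link count, owner, group, ...
  mode=${1%[.@+]}        # drop a trailing ".", "@", or "+" marker if present
  owner=$3
  group=$4
  if [ "$mode" = "$want_mode" ] && [ "$owner" = "$want_owner" ] && [ "$group" = "$want_group" ]; then
    echo "OK $path"
  else
    echo "MISMATCH $path: found $mode $owner:$group, expected $want_mode $want_owner:$want_group"
    exit 1
  fi
)

# Example calls with the values from this reply:
#   check_dir /data            drwxr-xr-x root root
#   check_dir /data/yarn       drwxr-xr-x root root
#   check_dir /data/yarn/local drwxr-xr-x yarn hadoop
```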

Also check whether the Distributed Shell job runs fine without the Docker settings.

Re: yarn usercache dir not resolved properly when running an example application

Vinay Kashyap
I am running Hadoop on my Mac, and all the folders are owned by myuser:staff. I have verified that the permissions on the local dirs are 755.
I run all Hadoop services as myuser, and I have set yarn.nodemanager.linux-container-executor.group=staff accordingly in both yarn-site.xml and container-executor.cfg.

1. Is the container-executor binary certified to work as expected on OS X?
2. When the Linux container executor is configured, is there a hard expectation that the users running the Hadoop services are among [root, hdfs, yarn...] and that the group is hadoop, so that the directory permissions fall in line accordingly?

Can you please help me understand this? I could not find any write-up on it.

Re: yarn usercache dir not resolved properly when running an example application

Jeff Hubbs
On 2/14/19 11:09 PM, Vinay Kashyap wrote:
I am running hadoop on my mac and all the folders have myuser:staff as the owner. I have verified the permissions for the local dirs to be 755.

This doesn't sound right. By-the-book, there are supposed to be separate "users" for hdfs, yarn, and mapred to run their respective daemons. The directories they read/write in are supposed to be permed and owned to expect that. One possible approach for purposes of log-writing etc. is to put those user accounts in a group (perhaps named "hadoop") so that read/written areas in common are owned by that group and permed accordingly.

If you're going to ad-lib that arrangement then you'll have to ad-lib a lot of the rest of how worker nodes and edge nodes behave accordingly.
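
A quick way to see whether the accounts on a node follow this arrangement is to check that each daemon user exists and belongs to the shared group. The sketch below assumes a POSIX shell with the id utility. user_in_group is a hypothetical helper name, and the hdfs, yarn, mapred, and hadoop names in the usage comments are the ones mentioned above, used here only as an example.

```shell
# Hypothetical helper: succeed if USER exists and GROUP is among its groups.
# Usage: user_in_group USER GROUP
user_in_group() (
  user=$1 group=$2
  names=$(id -Gn "$user" 2>/dev/null) || exit 1   # fails if the user is unknown
  set -f                                          # no globbing while splitting
  for g in $names; do
    [ "$g" = "$group" ] && exit 0
  done
  exit 1
)

# Example loop (user and group names are assumptions):
#   for u in hdfs yarn mapred; do
#     user_in_group "$u" hadoop || echo "$u is missing or not in group hadoop"
#   done
```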

Re: yarn usercache dir not resolved properly when running an example application

Vinay Kashyap
Perfect, Jeff, I understand clearly.
After changing the setup to use the appropriate users and folder permissions, I can see some progress.

Cheers.

Re: yarn usercache dir not resolved properly when running an example application

Jeff Hubbs
Great, Vinay - I'm glad that made a difference. When you get to the point where you are running a cluster, the same sort of thing will have to carry over to all nodes, with the added issue that ssh and keys must be configured such that each of those users can shell to other nodes without supplying a password.

On 2/18/19 11:41 PM, Vinay Kashyap wrote:
Perfect Jeff, I clearly understand. 
After changing the setup to the appropriate users and folder permissions, I can see some progress.. 

Cheers.. 

On Fri, Feb 15, 2019 at 10:05 AM Jeff Hubbs <[hidden email]> wrote:
On 2/14/19 11:09 PM, Vinay Kashyap wrote:
I am running hadoop on my mac and all the folders have myuser:staff as the owner. I have verified the permissions for the local dirs to be 755.

This doesn't sound right. By-the-book, there are supposed to be separate "users" for hdfs, yarn, and mapred to run their respective daemons. The directories they read/write in are supposed to be permed and owned to expect that. One possible approach for purposes of log-writing etc. is to put those user accounts in a group (perhaps named "hadoop") so that read/written areas in common are owned by that group and permed accordingly.

If you're going to ad-lib that arrangement then you'll have to ad-lib a lot of the rest of how worker nodes and edge nodes behave accordingly.

I run all hadoop services with myuser and I have configured yarn.nodemanager.linux-container-executor.group=staff accordingly both in yarn-site.xml and container-executor.cfg

1. Is the container-executor binary certified to work as expected on OSX.? 
2. When linux container executor is configured, is there any hard expectation that users of the running hadoop services to be part of [root, hdfs, yarn...] and group to be hadoop.? So that the directory permissions fall in line accordingly?

Can you please help me understand this.? Could not find any write up on this.

On Thu, Feb 14, 2019 at 11:13 PM Prabhu Josephraj <[hidden email]> wrote:
In case of Distributed Shell Job - ApplicationMaster runs in normal linux container and the subsequent shell command runs inside Docker 
container. The job fails even before launching AM, that is before starting Docker Container. I think the Distributed Shell job will fail even 
without Docker Settings.

As per the error code 20 , it is mostly related to accessing of NM local directory.  

20

INITIALIZE_USER_FAILED

Couldn't get, stat, or secure the per-user NodeManager directory.


Can we try below steps on (all) NodeManager machine.

Remove all contents under /data/yarn and make sure the /data and /data/yarn directory permission is 755 with owner root:root and local directory 
is owned by yarn:hadoop.

[root@tparimi-tarunhdp26-4 ~]# ls -lrt /
drwxr-xr-x.   5 root root    44 Oct 24 11:47 data

[root@tparimi-tarunhdp26-4 ~]# ls -lrt /data/
drwxr-xr-x. 4 root      root   28 Oct 24 14:30 yarn

[root@tparimi-tarunhdp26-4 ~]# ls -lrt /data/yarn/
total 4
drwxr-xr-x.  5 yarn hadoop   54 Feb 14 17:32 local
drwxrwxr-x. 10 yarn hadoop 4096 Feb 14 17:32 log

And also check if Distributed Shell jobs runs fine without Docker Settings.





On Thu, Feb 14, 2019 at 10:15 PM Vinay Kashyap <[hidden email]> wrote:
Hi Prabhu,

Thanks for your reply. 
I tried the configurations as per your suggestion. But I get the same error.
Is this related to container localization by any chance?. 
Also, is there any log or out information which says that the docker container runtime has been picked up.?



On Thu, Feb 14, 2019 at 9:38 PM Prabhu Josephraj <[hidden email]> wrote:
Hi Vinay,

    Can you try specifying the configs below under the Docker section in container-executor.cfg? They allow Docker containers to use the NM local dirs.

      docker.allowed.ro-mounts=/data/yarn/local,/usr/jdk64/jdk1.8.0_112/bin
      docker.allowed.rw-mounts=/data/yarn/local,/data/yarn/log
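A quick way to sanity-check these two settings is to parse them and confirm that the NM local and log directories appear in the allowed lists. The sketch below is an illustration only: the helper names are invented, and the "covered" rule (a path equals or lies under an allowed entry) is an assumption made for this example, not a statement of how container-executor evaluates mounts.

```python
import os

# The two settings suggested above, as they would appear in the file.
CFG_LINES = [
    "docker.allowed.ro-mounts=/data/yarn/local,/usr/jdk64/jdk1.8.0_112/bin",
    "docker.allowed.rw-mounts=/data/yarn/local,/data/yarn/log",
]


def parse_mounts(lines):
    """Map each docker.allowed.*-mounts key to its list of paths.

    Empty entries (for example from a doubled comma) are skipped.
    """
    mounts = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("docker.allowed.") and key.endswith("-mounts"):
            mounts[key] = [p.strip() for p in value.split(",") if p.strip()]
    return mounts


def is_covered(path, allowed):
    """True if path equals or lies under one of the allowed paths.

    This prefix rule is an assumption made for the example.
    """
    path = os.path.normpath(path)
    for base in allowed:
        base = os.path.normpath(base)
        if os.path.commonpath([path, base]) == base:
            return True
    return False


mounts = parse_mounts(CFG_LINES)
rw = mounts["docker.allowed.rw-mounts"]
for needed in ("/data/yarn/local", "/data/yarn/log"):
    print(needed, is_covered(needed, rw))
```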

Thanks,
Prabhu Joseph

Re: yarn usercache dir not resolved properly when running an example application

Vinay Kashyap
Yes Jeff, thanks again.
I could successfully run a standalone TF training application with TensorBoard in a Docker container. I will definitely take care of silent ssh once I start with distributed TF.



On Tue, Feb 19, 2019 at 9:44 PM Jeff Hubbs <[hidden email]> wrote:
Great, Vinay - I'm glad that made a difference. When you get to the point where you are running a cluster, the same sort of thing will have to carry over to all nodes, with the added issue that ssh and keys must be configured such that each of those users can shell to other nodes without supplying a password.

On 2/18/19 11:41 PM, Vinay Kashyap wrote:
Perfect Jeff, I clearly understand.
After changing the setup to the appropriate users and folder permissions, I can see some progress.

Cheers.

On Fri, Feb 15, 2019 at 10:05 AM Jeff Hubbs <[hidden email]> wrote:
On 2/14/19 11:09 PM, Vinay Kashyap wrote:
I am running Hadoop on my Mac and all the folders have myuser:staff as the owner. I have verified that the permissions for the local dirs are 755.

This doesn't sound right. By-the-book, there are supposed to be separate "users" for hdfs, yarn, and mapred to run their respective daemons. The directories they read/write in are supposed to be permed and owned to expect that. One possible approach for purposes of log-writing etc. is to put those user accounts in a group (perhaps named "hadoop") so that read/written areas in common are owned by that group and permed accordingly.

If you're going to ad-lib that arrangement then you'll have to ad-lib a lot of the rest of how worker nodes and edge nodes behave accordingly.

I run all Hadoop services as myuser and I have configured yarn.nodemanager.linux-container-executor.group=staff accordingly in both yarn-site.xml and container-executor.cfg.
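Since the group has to be set in two files, a small check that both carry the same value can rule out one source of mismatch. This is a sketch under assumptions: the sample file contents below are made up to mirror the setting described above, and the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

GROUP_KEY = "yarn.nodemanager.linux-container-executor.group"

# Hypothetical file contents mirroring the setting described above.
YARN_SITE = """
<configuration>
  <property>
    <name>yarn.nodemanager.linux-container-executor.group</name>
    <value>staff</value>
  </property>
</configuration>
"""

EXECUTOR_CFG = """
# container-executor.cfg (excerpt)
yarn.nodemanager.linux-container-executor.group=staff
"""


def group_from_yarn_site(xml_text):
    """Return the group configured in yarn-site.xml, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == GROUP_KEY:
            return prop.findtext("value", "").strip()
    return None


def group_from_cfg(cfg_text):
    """Return the group configured in container-executor.cfg, or None."""
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == GROUP_KEY:
            return value.strip()
    return None


site_group = group_from_yarn_site(YARN_SITE)
cfg_group = group_from_cfg(EXECUTOR_CFG)
print(site_group, cfg_group, site_group == cfg_group)
```

In practice the two strings would be read from the real files on the NodeManager host rather than from inline samples.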

1. Is the container-executor binary certified to work as expected on OS X?
2. When the Linux container executor is configured, is there a hard expectation that the users running the Hadoop services are among [root, hdfs, yarn...] and that the group is hadoop, so that the directory permissions fall in line accordingly?

Can you please help me understand this? I could not find any write-up on it.
