Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Vinay Kashyap
Hi all,

I am using Hadoop 3.2.0. I am trying a few examples using Submarine to run TensorFlow jobs in a Docker container.
I would like to understand a few details regarding reading/writing HDFS data during/after application launch/execution. I have highlighted the questions inline.

When launching an application which reads input from HDFS, we configure --input_path to an HDFS path, as in the standard example.

yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
 --name tf-job-001 --docker_image <your docker image> \
 --input_path hdfs://default/dataset/cifar-10-data \
 --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
 --num_workers 2 \
 --worker_resources memory=8G,vcores=2,gpu=1 --worker_launch_cmd "cmd for worker ..." \
 --num_ps 2 \
 --ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps"

Question 1: What if I have more than one dataset, each in a separate HDFS path? Can --input_path take multiple paths in some fashion, or is it expected that all the datasets are maintained under one path?

"DOCKER_JAVA_HOME points to JAVA_HOME inside Docker image" and "DOCKER_HADOOP_HDFS_HOME points to HADOOP_HDFS_HOME inside Docker image".

Question 2: What exactly is expected here? That is, is there any relation/connection to the Hadoop running outside the Docker container? I guess reading HDFS data into the Docker container happens during container localization, but how does output data get written back to the HDFS running outside the Docker container?

Assume a scenario where Application 1 creates a model and Application 2 performs scoring, and the two applications run in separate Docker containers. I would like to understand how data is read and written across applications in this case.
It would be of great help if anyone could guide me in understanding this, or direct me to a blog or write-up which explains the above.
 
Thanks and regards
Vinay Kashyap
Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

zhankun tang
Hi Vinay,

For question one, IIRC, we cannot set the "--input_path" flag multiple times at present. "--input_path" was originally designed as a placeholder to store a path, which is then substituted for "%input_path%" in the worker command, as in "python worker.sh -input %input_path% ..".
So from this perspective, you can directly append the other input paths to your worker command in your own way, as sketched below.
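For illustration, a minimal sketch of that idea, assuming a hypothetical training script (train.py) and a hypothetical second dataset path, neither of which comes from Submarine itself. The first dataset still goes through --input_path, while the second is passed as an ordinary flag that your own worker script parses:

yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
 --name tf-job-002 --docker_image <your docker image> \
 --input_path hdfs://default/dataset/cifar-10-data \
 --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
 --num_workers 2 \
 --worker_resources memory=8G,vcores=2,gpu=1 \
 --worker_launch_cmd "python train.py --input %input_path% --extra_input hdfs://default/dataset/other-data --job_dir %checkpoint_path%"

Submarine only substitutes the %input_path% and %checkpoint_path% placeholders; the --extra_input flag is opaque text to Submarine, so train.py must open that second HDFS path itself.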

For question two: YARN might set a wrong HADOOP_COMMON_HOME inside the container by default, so Submarine provides these environment variables to be set in the worker's launch script if the worker wants to access HDFS.
And there is no data-plane relation between the outside Hadoop and the container, except that YARN localizes resources for the container.

Hope this can answer your questions.

Best Regards,
Zhankun

Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Vinay Kashyap
Hi Zhankun,
Thanks for the reply.

Regarding Question 1: Okay, I understand. Let me try passing the additional input paths directly in the worker launch command.

Regarding Question 2:
What I did not understand is why YARN has to set anything related to the Hadoop that runs inside the container. The Hadoop environment, and the worker code that reads it, are completely isolated within the Docker container; in that case, the worker scripts should already know where HADOOP_HOME is inside the container, right? There is another argument, --checkpoint_path, which acts as the path for all the outputs (models or datasets) produced by the worker code inside the Docker container. Hence --input_path acts as the entry point (which gets localized) and --checkpoint_path acts as the exit point, and both are paths on the HDFS that runs outside the Docker container. So why should YARN know the Hadoop configuration inside the container?

Thanks and regards
Vinay Kashyap

Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

zhankun tang
Hi Vinay,

IIRC, YARN sets the host's Hadoop environment variables in the container launch script by default. And in the Submarine case, the user's worker command is used to generate a worker script which is invoked from that container launch script. If Submarine did not override the default Hadoop environment variables, HDFS reads/writes in the container might fail because the Hadoop location is missing or incorrect.
So even if a Docker image is built with the correct Hadoop environment set, this override still seems to be needed to use the HDFS libraries inside the container. This appears to be caused by YARN's Docker support, and Submarine is doing a workaround here.
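To make the workaround concrete, here is a simplified, illustrative sketch of what the generated launch script effectively does (the commented host paths are invented examples, not the exact script Submarine emits; the exported values are the ones passed via --env in the job run example above):

# YARN's default container launch script exports the *host's* locations,
# which usually do not exist inside the Docker image, e.g.:
#   export JAVA_HOME=/usr/jdk64/jdk1.8.0            # host path
#   export HADOOP_HDFS_HOME=/usr/hdp/current/hadoop # host path
# Submarine overrides them with the values given via --env, so the worker
# resolves the copies that actually exist inside the image:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export HADOOP_HDFS_HOME=/hadoop-3.1.0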

Submarine is evolving rapidly, so please share your thoughts if anything is inconvenient for you.

Thanks,
Zhankun

Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Vinay Kashyap
Thanks, Zhankun, for the clarification.
Also, is my understanding of --checkpoint_path correct, as I mentioned earlier in the thread? Quoting the comment again:

[There is another argument, --checkpoint_path, which acts as the path for all the outputs (models or datasets) produced by the worker code inside the Docker container. Hence --input_path acts as the entry point (which gets localized) and --checkpoint_path acts as the exit point, and both are paths on the HDFS that runs outside the Docker container.]
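For what it's worth, a hedged sketch of how that entry/exit flow would look in a worker launch command; the script name and flags are hypothetical, and only the %input_path% and %checkpoint_path% placeholders come from Submarine:

 --worker_launch_cmd "python train.py --data_dir=%input_path% --job_dir=%checkpoint_path%"

Here both placeholders are replaced with the HDFS URIs given on the job run command line, so the training code reads the input and writes its models/checkpoints through the in-container HDFS client, which is why the HADOOP_HDFS_HOME override discussed above matters.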

I will continue my exercises with Submarine and would love to discuss more.

--
Thanks and regards
Vinay Kashyap