Application Master machine affinity/preference settings?


Application Master machine affinity/preference settings?

Everett Anderson
Hi!

We've been using Hadoop MapReduce and Spark on YARN on AWS Elastic MapReduce (EMR). EMR has a concept of Core versus Task nodes, where Core nodes participate in HDFS but Task nodes don't, so the number of Task nodes can be scaled up or down more easily based on load.

Most applications we run are batch and tolerate machines going away well, but some of them are ad hoc interactive Spark sessions. Spark seems to handle executors (workers) going away okay, but if the Application Master for a user's session goes away, the user loses their session state.

Is there a mechanism in YARN such that we could prioritize launching Application Masters on the Core machine pool in a cluster when resources are available?

I know there are scheduling queues we could use to isolate entire applications -- such as batch versus interactive ones -- but I'm not sure there's a way to ensure that just the AM of a given application is prioritized onto a specific set of machines.

Thanks!

- Everett


Re: Application Master machine affinity/preference settings?

Naganarasimha Garla-2
Hi Everett Anderson,
     I can think of two ways to do it:
1. Create a label for the Core machine pool (as an exclusive or non-exclusive partition) and submit the AM request with that label expression. That way the AM is launched in the Core machine pool itself (see the first sketch below). Refer to http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
2. After YARN-6050, if you know which nodes are Core machines, you can submit the AM with multiple ResourceRequests, each with a ResourceName pointing to a different node (see the second sketch below).
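
To make option 1 concrete, here is a rough sketch against the YARN Java client API. It assumes a node label named "core" has already been added to the cluster and assigned to the Core nodes (via yarn rmadmin -addToClusterNodeLabels / -replaceLabelsOnNode, as described in the NodeLabel doc above); the label name, application name and AM resource sizes are just placeholders:

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmOnCoreLabelSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("interactive-session");

    // Resource request for the AM container only, constrained to the "core"
    // node label. Containers requested later by the AM are not affected,
    // so executors can still land on Task nodes.
    ResourceRequest amRequest = ResourceRequest.newInstance(
        Priority.newInstance(0),        // AM request priority
        ResourceRequest.ANY,            // any node carrying the label
        Resource.newInstance(1024, 1),  // 1 GB, 1 vcore for the AM
        1,                              // one AM container
        true,                           // relax locality
        "core");                        // node label expression
    appContext.setAMContainerResourceRequest(amRequest);

    // ... set the queue, AM ContainerLaunchContext, etc., then:
    // yarnClient.submitApplication(appContext);
    yarnClient.stop();
  }
}

If the interactive sessions are plain Spark on YARN rather than a custom YARN client, I believe Spark exposes the same knob as a config (spark.yarn.am.nodeLabelExpression, if I remember the name right), which would avoid touching the submission code at all.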
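
For option 2, here is a rough sketch of the AM requests themselves, assuming a Hadoop version that includes YARN-6050 (which replaces the single AM ResourceRequest with a list) and that you already know the Core hostnames; the hostnames and sizes are again placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class AmOnNamedNodesSketch {

  // Pin the AM request of an application that is about to be submitted to the
  // given hosts (e.g. the Core nodes reported by the EMR API).
  static void requestAmOnHosts(ApplicationSubmissionContext appContext,
                               List<String> coreHosts) {
    Priority amPriority = Priority.newInstance(0);
    Resource amResource = Resource.newInstance(1024, 1);

    List<ResourceRequest> amRequests = new ArrayList<>();
    // One node-level request per known Core host.
    for (String host : coreHosts) {
      amRequests.add(ResourceRequest.newInstance(amPriority, host, amResource, 1));
    }
    // An ANY-level request is still needed; relaxLocality=false on it should
    // keep the scheduler from falling back to arbitrary nodes (depending on
    // your topology you may want rack-level requests as well).
    ResourceRequest any =
        ResourceRequest.newInstance(amPriority, ResourceRequest.ANY, amResource, 1);
    any.setRelaxLocality(false);
    amRequests.add(any);

    appContext.setAMContainerResourceRequests(amRequests);
  }
}

You would call this on the ApplicationSubmissionContext from the previous sketch, right before yarnClient.submitApplication(appContext).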

Regards,
+ Naga



