new to hadoop, jobs never leaving accepted

new to hadoop, jobs never leaving accepted

Jason Laughman
I’ve been setting up a Hadoop 2.9.1 cluster and have data replicating through HDFS, but when I try to run a job via Hive (I see that it’s deprecated, but it’s what I’m working with for now) it never gets out of accepted state in the web tool.  I’ve done some Googling and the general consensus is that it’s resource constraints, so can someone tell me if I’ve got enough horsepower here?

I’ve got one small name server, three small data servers, and two larger data servers.  I figured out that the small data servers were too small, because even when I tried to tweak the YARN parameters for RAM and CPU, the resource managers would immediately shut down.  I added the two larger data servers, and now I see two active nodes, but only one container in total:

$ yarn node -list
19/07/09 23:54:11 INFO client.RMProxy: Connecting to ResourceManager at <resource_manager>:8032
Total Nodes:2
         Node-Id     Node-State Node-Http-Address Number-of-Running-Containers
node1:40079        RUNNING node1:8042                           1
node2:36311        RUNNING node2:8042                           0

There are a ton of what look like automated jobs backed up on there, and when I try to run anything through Hive it just sits there and eventually times out (I do see it get accepted).  My larger nodes have 4 GB RAM and 2 vcores, and I set YARN to do automatic resource allocation with yarn.nodemanager.resource.detect-hardware-capabilities.  Is that enough to even get a POC lab working?  I don’t care about having the three smaller servers running as resource nodes, but I’d like a better understanding of what’s going on with the larger servers, because it seems like they’re close to working.
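
For reference, the relevant yarn-site.xml settings on the larger nodes look roughly like this (a sketch from memory rather than a copy of the actual file; as I understand it, the -1 values tell YARN to auto-calculate memory and vcores from the detected hardware):

<!-- rough sketch of the NodeManager resource settings described above -->
<property>
  <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
  <value>true</value>
</property>
<property>
  <!-- -1 = calculate from detected hardware when detection is enabled -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>-1</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>-1</value>
</property>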

Here’s the metrics data from the website, hopefully somebody can parse it.
Cluster Metrics
  Apps Submitted: 292
  Apps Pending: 284
  Apps Running: 1
  Apps Completed: 7
  Containers Running: 1
  Memory Used: 1 GB
  Memory Total: 3.38 GB
  Memory Reserved: 0 B
  VCores Used: 1
  VCores Total: 4
  VCores Reserved: 0

Cluster Nodes Metrics
  Active Nodes: 2
  Decommissioning Nodes: 0
  Decommissioned Nodes: 0
  Lost Nodes: 0
  Unhealthy Nodes: 0
  Rebooted Nodes: 0
  Shutdown Nodes: 4

Scheduler Metrics
  Scheduler Type: Capacity Scheduler
  Scheduling Resource Type: [MEMORY]
  Minimum Allocation: <memory:1024, vCores:1>
  Maximum Allocation: <memory:1732, vCores:2>
  Maximum Cluster Application Priority: 0

Re: new to hadoop, jobs never leaving accepted

yangtao.yt
Hi, Jason.

According to the information you provided, your cluster has two nodes, each with the same resources <memory:1732, vCores:2>, and the single running container is the AM container, which already takes <memory:1024, vCores:1>.
One possible cause is that the available resources of your cluster are insufficient to satisfy new container requests; please check the application attempt UI (http://<RM-HOST>:<RM-HTTP-PORT>/cluster/appattempt/<APP-ATTEMPT-ID>), where you can find the outstanding requests and their required resources.  Another possible cause is a queue/user limit; please check the scheduler UI (http://<RM-HOST>:<RM-HTTP-PORT>/cluster/scheduler) for the resource quotas and usage of the queue.
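
If it is more convenient than the web UI, the same information can also be checked from the command line or the RM REST API, for example (a rough sketch, assuming the default RM web port 8088 and the default queue):

# show applications stuck in the ACCEPTED state
yarn application -list -appStates ACCEPTED

# show the state, capacity and current usage of the default queue
yarn queue -status default

# dump the scheduler info (queues, quotas, usage) as JSON from the RM REST API
curl "http://<RM-HOST>:8088/ws/v1/cluster/scheduler"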
Hope it helps.

Best,
Tao Yang

Re: new to hadoop, jobs never leaving accepted

Jason Laughman
I added a couple of bigger servers and now I see multiple containers running, but I still can’t get a job to run.  The job details now say:

Diagnostics: [Wed Jul 10 17:27:59 +0000 2019] Application is added to the scheduler and is not yet activated. Queue's AM resource limit exceeded. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:2048, vCores:1>; Queue Resource Limit for AM = <memory:3072, vCores:1>; User AM Resource Limit of the queue = <memory:3072, vCores:1>; Queue AM Resource Usage = <memory:2048, vCores:2>;

I understand WHAT that’s saying, but I don’t understand WHY.  Here’s what my scheduler details look like; I don’t see why it’s complaining about the AM, unless something isn’t talking to something else correctly:

Queue State: RUNNING
Used Capacity: 12.5%
Configured Capacity: 100.0%
Configured Max Capacity: 100.0%
Absolute Used Capacity: 12.5%
Absolute Configured Capacity: 100.0%
Absolute Configured Max Capacity: 100.0%
Used Resources: <memory:3072, vCores:3>
Configured Max Application Master Limit: 10.0
Max Application Master Resources: <memory:3072, vCores:1>
Used Application Master Resources: <memory:3072, vCores:3>
Max Application Master Resources Per User: <memory:3072, vCores:1>
Num Schedulable Applications: 3
Num Non-Schedulable Applications: 38
Num Containers: 3
Max Applications: 10000
Max Applications Per User: 10000
Configured Minimum User Limit Percent: 100%
Configured User Limit Factor: 1.0
Accessible Node Labels: *
Ordering Policy: FifoOrderingPolicy
Preemption: disabled
Intra-queue Preemption: disabled
Default Node Label Expression: <DEFAULT_PARTITION>
Default Application Priority: 0

User: hdfs
  Max Resource: <memory:0, vCores:0>
  Weight: 1.0
  Used Resource: <memory:0, vCores:0>
  Max AM Resource: <memory:3072, vCores:1>
  Used AM Resource: <memory:0, vCores:0>
  Schedulable Apps: 0
  Non-Schedulable Apps: 1

User: dr.who
  Max Resource: <memory:24576, vCores:1>
  Weight: 1.0
  Used Resource: <memory:3072, vCores:3>
  Max AM Resource: <memory:3072, vCores:1>
  Used AM Resource: <memory:3072, vCores:3>
  Schedulable Apps: 3
  Non-Schedulable Apps: 37

Re: new to hadoop, jobs never leaving accepted

Sunil Govindan
Hi Jason
The default value of yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent is 0.1 (i.e., 10% of the queue's resources).
If you want to run more apps in that queue, you need to raise this limit so that the apps' Application Masters can be launched.
Please make sure it is not configured too high, though, or you may end up with too many Application Masters and too few workers.
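
For example, something like the following in capacity-scheduler.xml raises the limit for the default queue (a sketch only; 0.5 is an illustrative value, not a recommendation), followed by a queue refresh so the ResourceManager picks it up:

<property>
  <!-- fraction of the queue's resources that Application Masters may use;
       the default is 0.1, and 0.5 here is only an example value -->
  <name>yarn.scheduler.capacity.root.default.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>

$ yarn rmadmin -refreshQueues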

Thanks
Sunil
  



