Data node not able to contact the resource manager

Data node not able to contact the resource manager

Daniel Santos
Hello,

I have a cluster with one machine holding the name nodes (primary and secondary), a YARN node (resource manager), and four data nodes.
I am running Hadoop 2.7.0.

When I submit a job to the cluster I can see it in the scheduler web page. If I go to the container page and check the logs, the syslog file ends with the following:

2019-08-05 14:58:05,962 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:06,962 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:07,963 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:08,965 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:09,966 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:10,967 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:11,968 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:12,969 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

I have checked the configuration of the resource manager and of the data node where the application is running, and the yarn.resourcemanager.hostname property that I set in yarn-site.xml is shown.
I have disabled IPv6 on the YARN machine, as some posts on the internet suggested. All the configuration files are the same on every node of the cluster.

Still I am getting these errors, and the application ends with a timeout.

What am I doing wrong?

Thanks
Regards

Re: Data node not able to contact the resource manager

Jon Mack
Looks to me like it's missing the resource manager configuration, based on the port it's trying to connect to.
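For context (general YARN behavior, not something stated in this thread): 0.0.0.0 is the default value of yarn.resourcemanager.hostname, so a container retrying 0.0.0.0:8030 usually means its effective configuration fell back to the defaults. A minimal sketch of a sanity check, using a hypothetical sample file in place of a node's real $HADOOP_CONF_DIR/yarn-site.xml (a plain grep ignores how Hadoop resolves defaults and overrides, so treat it only as a first-pass check):

```shell
# Create a stand-in for a node's yarn-site.xml (hypothetical content).
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoopresourcemanager</value>
  </property>
</configuration>
EOF

# Report whether a property name appears in the file at all.
check() {
  if grep -q "<name>$1</name>" "$conf"; then
    echo "found: $1"
  else
    echo "MISSING: $1"
  fi
}

check yarn.resourcemanager.hostname          # found in this sample
check yarn.resourcemanager.scheduler.address # MISSING in this sample
```

On a real node you would point conf at the actual file; a missing hostname (with no explicit scheduler address) is enough to produce the 0.0.0.0:8030 retries shown in the log.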


Re: Data node not able to contact the resource manager

Daniel Santos
Hello Jon,

I have the following yarn-site.xml :

<configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.acl.enable</name>
                <value>0</value>
        </property>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoopresourcemanager</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>1536</value>
        </property>
        <property>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>1536</value>
        </property>
        <property>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>128</value>
        </property>
        <property>
                <name>yarn.nodemanager.vmem-check-enabled</name>
                <value>false</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>hadoopresourcemanager:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>hadoopresourcemanager:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>hadoopresourcemanager:8031</value>
        </property>
</configuration>

So I can say I have already tried your suggestion.

Cheers



Re: Data node not able to contact the resource manager

Jeff Hubbs
Does "hadoopresourcemanager" resolve to a machine that's a Hadoop resource manager? In Hadoop, it's absolutely vital that all names resolve correctly in both directions.
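One way to illustrate the both-directions point for a hosts-file setup (the entries here are made up for the example): if two hostnames share one address, the reverse (address-to-name) mapping is ambiguous, which Hadoop components can trip over.

```shell
# Hypothetical hosts-file entries; on a real node, read /etc/hosts instead.
hosts='10.0.0.10 hadoopnamenode
10.0.0.11 hadoopresourcemanager
10.0.0.11 hadooprm-alias'

# Flag every hostname whose address is shared with another name,
# i.e. where the reverse (address -> name) mapping is ambiguous.
echo "$hosts" | awk '
  { addr[$2] = $1; names_per_addr[$1]++ }
  END {
    for (h in addr)
      if (names_per_addr[addr[h]] > 1)
        printf "WARN: %s and another name both map to %s\n", h, addr[h]
  }' | sort
```

Running the same scan over each node's real hosts file (and comparing the files across nodes) is a cheap way to confirm the forward and reverse mappings agree everywhere.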




Re: Data node not able to contact the resource manager

Daniel Santos
Hello,
I am using hosts files on all machines, centrally managed through Puppet. When I run the YARN startup script on the hadoopresourcemanager machine it creates the node managers, one on each slave.

Regards

Sent from my iPhone




Re: Data node not able to contact the resource manager

Daniel Santos
Hello

I found out the cause of the error. When I submit a job to the cluster, I supply an XML configuration file with properties of the cluster I am connecting to.
I had to replicate some of the YARN address properties in that configuration file.

I thought that the cluster-side configuration would be sufficient, but no.
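A hedged reconstruction, not the actual file: the YARN address properties described here would look something like the following in the client-side job configuration, reusing the hostname from the cluster's yarn-site.xml.

```xml
<?xml version="1.0"?>
<configuration>
        <!-- Hypothetical client-side additions; values must match the cluster. -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoopresourcemanager</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>hadoopresourcemanager:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>hadoopresourcemanager:8030</value>
        </property>
</configuration>
```

Setting yarn.resourcemanager.hostname alone is often enough, since the per-service addresses default to that hostname plus the standard ports.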

Thanks for your interest
Regards


On 5 Aug 2019, at 19:21, Jon Mack <[hidden email]> wrote:

Doesn't look like the client is resolving the IP address correctly (i.e. 0.0.0.0/0.0.0.0:8030). Try an nslookup on one of the clients (i.e. nslookup hadoopresourcemanager) to see what the client resolves it to. Change the configuration to use the IP address instead of the hostname if possible.

Also do a netstat -an | grep 8030 on hadoopresourcemanager to verify the resource manager service is running.
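A sketch combining both of these checks into one script; rm_host is an assumption, so substitute the name from your own yarn-site.xml:

```shell
# Hypothetical wrapper around the two checks above.
rm_host=hadoopresourcemanager

# 1) Forward resolution as this client sees it (hosts file or DNS).
getent hosts "$rm_host" 2>/dev/null || echo "no address found for $rm_host"

# 2) On the RM machine: is anything listening on the scheduler port?
#    The awk filter keeps only LISTEN sockets whose local address ends
#    in :8030.
netstat -an 2>/dev/null | awk '$NF == "LISTEN" && $4 ~ /:8030$/ { print }'
```

If the first check prints nothing useful, the name is the problem; if the second prints nothing when run on the RM host itself, the scheduler service is the problem.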






Re: Data node not able to contact the resource manager

Jon Mack
Can you share with the group what the XML configuration file was? Maybe it could help someone in the future.

Thanks for letting us know the outcome.

On Mon, Aug 5, 2019 at 6:00 PM Daniel Santos <[hidden email]> wrote:
Hello

I found out the cause of the error. When I submit a job to the cluster, I supply a xml configuration file with properties of the cluster I am connecting to.
I had to replicate some properties related to addresses of yarn on that configuration file.

I though that the cluster configuration would be sufficient, but no.

Thanks for your interest
Regards


On 5 Aug 2019, at 19:21, Jon Mack <[hidden email]> wrote:

Doesn't look the client is resolving the IP Address correctly (IE 0.0.0.0/0.0.0.0:8030 ), try a nslookup on one of the clients (IE nslookup  hadoopresourcemanager ) to see what the client is resolving it to. Change the configuration to use the IP Address instead of the hostname if possible.

Also do a netstat -an | grep 8030 on hadoopresourcemanager to verify the resource manager service is running.


On Mon, Aug 5, 2019 at 12:38 PM Daniel Santos <[hidden email]> wrote:
Hello,
I am using hosts files on all machines that are centrally managed through puppet. When I run the yarn startup script on the hadoopresourcemanager machine it creates the node managers one each slave. 

Regards

Sent from my iPhone

On 5 Aug 2019, at 16:01, Jeff Hubbs <[hidden email]> wrote:

Does "hadoopresourcemanager" resolve to a machine that's a Hadoop resource manager? In Hadoop, it's absolutely vital that all names resolve correctly in both directions.

On 8/5/19 10:55 AM, Daniel Santos wrote:
Hello Jon,

I have the following yarn-site.xml :

<configuration>
? ? ? ? <!-- Site specific YARN configuration properties -->
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.acl.enable</name>
? ? ? ? ? ? ? ? <value>0</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.resourcemanager.hostname</name>
? ? ? ? ? ? ? ? <value>hadoopresourcemanager</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.nodemanager,aux-services</name>
? ? ? ? ? ? ? ? <value>mapreduce_shuffle</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.nodemanager.resource.memory-mb</name>
? ? ? ? ? ? ? ? <value>1536</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.scheduler.maximum-allocation-mb</name>
? ? ? ? ? ? ? ? <value>1536</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.scheduler.minimum-allocation-mb</name>
? ? ? ? ? ? ? ? <value>128</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.nodemanager.vmem-check-enabled</name>
? ? ? ? ? ? ? ? <value>false</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.resourcemanager.address</name>
? ? ? ? ? ? ? ? <value>hadoopresourcemanager:8032</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.resourcemanager.scheduler.address</name>
? ? ? ? ? ? ? ? <value>hadoopresourcemanager:8030</value>
? ? ? ? </property>
? ? ? ? <property>
? ? ? ? ? ? ? ? <name>yarn.resourcemanager.resource-tracker.address</name>
? ? ? ? ? ? ? ? <value>hadoopresourcemanager:8031</value>
? ? ? ? </property>
</configuration>

So I can say, I already tried your suggestion

Cheers

On 5 Aug 2019, at 15:22, Jon Mack <[hidden email]> wrote:

Looks to me it's missing the resource manager configuration based on the port it's trying to connect to..

On Mon, Aug 5, 2019 at 9:15 AM Daniel Santos <[hidden email]> wrote:
Hello,

I have a cluster with one machine holding the name nodes (primary and secondary) a yarn node (resource manager) and four data nodes.
I am running hadoop 2.7.0.

When I submit a job to the cluster I can see it in the scheduler webpage. If I go to the container page and check the logs, in the syslog file i have in the end the following :

2019-08-05 14:58:05,962 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:06,962 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:07,963 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:08,965 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:09,966 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:10,967 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:11,968 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:12,969 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

I have checked the configuration of the resource manager and of the data node where the application is running, and the property yarn.resourcemanager.hostname that I have set in yarn-site.xml is present.
I have disabled IPv6 on the YARN machine, as some posts on the internet suggested. All the configuration files are the same on every node of the cluster.

Still I am getting these errors, and the application ends with a timeout.

What am I doing wrong?

Thanks
Regards




Re: Data node not able to contact the resource manager

Daniel Santos
Hello,

Of course I will be most happy to share it here. Here goes the configuration file I am using on the client :

<?xml version="1.0"?>
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoopnamenode:9000/</value>
        </property>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>hadoopresourcemanager:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>hadoopresourcemanager:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>hadoopresourcemanager:8031</value>
        </property>
</configuration>

I then supply the file in the command used to run the job on the cluster :

~/devtools/hadoop-2.7.0/bin/yarn \
jar avg_imgsize.jar net.xekmypic.hadoop.avgfilesize.JobDriver \
-conf ../clusterconfig/hadoop-cluster.xml \
hdfs://hadoopnamenode:9000/input/avgfilesize hdfs://hadoopnamenode:9000/output_avgfilesize

In the above command, the file supplied to the -conf parameter is the one containing the XML above.
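For anyone curious about the file format: a -conf file is Hadoop's standard configuration format, just name/value pairs inside property elements. Below is a minimal stdlib-only Java sketch of reading such a file, purely to illustrate the format (this is not Hadoop's own parser, and the sample file it writes is made up for the demo):

```java
import java.io.File;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ConfDump {
    // Parse a Hadoop-style configuration file into a name -> value map.
    static Map<String, String> parse(File f) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(f);
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Write a small sample file so the example is self-contained.
        File f = File.createTempFile("hadoop-cluster", ".xml");
        f.deleteOnExit();
        try (PrintWriter w = new PrintWriter(f)) {
            w.println("<?xml version=\"1.0\"?>");
            w.println("<configuration>");
            w.println("  <property>");
            w.println("    <name>yarn.resourcemanager.scheduler.address</name>");
            w.println("    <value>hadoopresourcemanager:8030</value>");
            w.println("  </property>");
            w.println("</configuration>");
        }
        System.out.println(parse(f).get("yarn.resourcemanager.scheduler.address"));
    }
}
```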

Cheers

On 6 Aug 2019, at 20:14, Jon Mack <[hidden email]> wrote:

Can you share with the group what the XML configuration file was? It might help someone in the future.

Thanks for letting us know the outcome.

On Mon, Aug 5, 2019 at 6:00 PM Daniel Santos <[hidden email]> wrote:
Hello

I found out the cause of the error. When I submit a job to the cluster, I supply an XML configuration file with the properties of the cluster I am connecting to.
I had to replicate some of the YARN address properties in that configuration file.

I thought the cluster-side configuration would be sufficient, but it is not.
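That also explains the 0.0.0.0:8030 in the log: when yarn.resourcemanager.scheduler.address is missing from the configuration the client actually sees, Hadoop falls back to its shipped default of 0.0.0.0:8030. A stdlib-only sketch of that lookup-with-default behaviour (an illustration only, not Hadoop's real Configuration class):

```java
import java.util.HashMap;
import java.util.Map;

public class DefaultFallback {
    static final String SCHEDULER_KEY = "yarn.resourcemanager.scheduler.address";
    static final String SCHEDULER_DEFAULT = "0.0.0.0:8030"; // Hadoop's shipped default

    // Mimics Configuration.get(key, default): use the configured value if present.
    static String resolve(Map<String, String> conf) {
        return conf.getOrDefault(SCHEDULER_KEY, SCHEDULER_DEFAULT);
    }

    public static void main(String[] args) {
        Map<String, String> clientConf = new HashMap<>(); // property missing on the client
        System.out.println(resolve(clientConf)); // falls back to the 0.0.0.0 default

        clientConf.put(SCHEDULER_KEY, "hadoopresourcemanager:8030");
        System.out.println(resolve(clientConf)); // configured value wins
    }
}
```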

Thanks for your interest
Regards


On 5 Aug 2019, at 19:21, Jon Mack <[hidden email]> wrote:

It doesn't look like the client is resolving the IP address correctly (i.e. 0.0.0.0/0.0.0.0:8030). Try an nslookup on one of the clients (e.g. nslookup hadoopresourcemanager) to see what the client resolves it to. Change the configuration to use the IP address instead of the hostname if possible.

Also do a netstat -an | grep 8030 on hadoopresourcemanager to verify the resource manager service is running.
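If nslookup or netstat aren't handy, the same two checks can be scripted. A stdlib-only Java sketch (the demo below probes a local placeholder port so it is self-contained; on the cluster you would call check("hadoopresourcemanager", 8030)):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ReachCheck {
    // Roughly what nslookup plus a port probe do: resolve, then try a TCP connect.
    static void check(String host, int port) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            System.out.println(host + " resolves to " + addr.getHostAddress());
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(addr, port), 2000); // 2 s timeout
                System.out.println("port " + port + " is reachable");
            }
        } catch (Exception e) {
            System.out.println("check failed: " + e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Self-contained demo: listen on an ephemeral local port, then probe it.
        try (ServerSocket srv = new ServerSocket(0)) {
            check("localhost", srv.getLocalPort());
        }
    }
}
```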


On Mon, Aug 5, 2019 at 12:38 PM Daniel Santos <[hidden email]> wrote:
Hello,
I am using hosts files on all machines, centrally managed through Puppet. When I run the YARN startup script on the hadoopresourcemanager machine it starts the node managers, one on each slave.

Regards

Sent from my iPhone

On 5 Aug 2019, at 16:01, Jeff Hubbs <[hidden email]> wrote:

Does "hadoopresourcemanager" resolve to a machine that's a Hadoop resource manager? In Hadoop, it's absolutely vital that all names resolve correctly in both directions.

On 8/5/19 10:55 AM, Daniel Santos wrote:
Hello Jon,

I have the following yarn-site.xml :

<configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.acl.enable</name>
                <value>0</value>
        </property>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoopresourcemanager</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>1536</value>
        </property>
        <property>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>1536</value>
        </property>
        <property>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>128</value>
        </property>
        <property>
                <name>yarn.nodemanager.vmem-check-enabled</name>
                <value>false</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>hadoopresourcemanager:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>hadoopresourcemanager:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>hadoopresourcemanager:8031</value>
        </property>
</configuration>

So I can say I have already tried your suggestion.

Cheers

On 5 Aug 2019, at 15:22, Jon Mack <[hidden email]> wrote:

Looks to me like it's missing the resource manager configuration, based on the port it's trying to connect to.

On Mon, Aug 5, 2019 at 9:15 AM Daniel Santos <[hidden email]> wrote:
Hello,

I have a cluster with one machine holding the name nodes (primary and secondary), a YARN node (resource manager), and four data nodes.
I am running hadoop 2.7.0.

When I submit a job to the cluster I can see it in the scheduler webpage. If I go to the container page and check the logs, the syslog file ends with the following:

2019-08-05 14:58:05,962 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:06,962 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:07,963 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:08,965 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:09,966 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:10,967 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:11,968 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2019-08-05 14:58:12,969 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

I have checked the configuration of the resource manager and of the data node where the application is running, and the property yarn.resourcemanager.hostname that I have set in yarn-site.xml is present.
I have disabled IPv6 on the YARN machine, as some posts on the internet suggested. All the configuration files are the same on every node of the cluster.

Still I am getting these errors, and the application ends with a timeout.

What am I doing wrong?

Thanks
Regards