3.1.0 MR work won't distribute after dual-homing NameNode


Jeff Hubbs

Hi -

I have a three-node Hadoop 3.1.0 cluster on which the daemons are distributed like so:

Daemons on msba02a...
20112 NameNode
20240 DataNode
24101 JobHistoryServer
20918 WebAppProxyServer
20743 NodeManager
20476 SecondaryNameNode

Daemons on msba02b...
22547 DataNode
22734 ResourceManager
23007 NodeManager

Daemons on msba02c...
10005 NodeManager
9818 DataNode

All three nodes run Gentoo Linux and have either one or two volumes devoted to HDFS; HDFS reports a size of 5.7TiB.

Previously, HDFS and MapReduce (testing with the archetypal "wordcount" job on a 5.8GiB XML file) worked fine in an environment where all three machines were on the same office LAN and got their IP addresses from DHCP; dynamic DNS created network host names from the host names reported by the machines' DHCP clients. FQDNs were used for all intra- and inter-machine references in the Hadoop configuration files.

Since then, I've changed things so that msba02a has a second NIC connected to an independent LAN, which the other two machines join via their built-in NICs as before; msba02b and msba02c reach the Internet through NAT on msba02a. /etc/hosts on all three machines has been populated with the static IPs I gave them, like so:

127.0.0.1  localhost
1.0.0.1 msba02a
1.0.0.10 msba02b
1.0.0.20 msba02c

So now if I shell into msba02a and run the wordcount job with the test XML file sitting in HDFS with replication set to 3, the job *does* run and gives me the expected output file...but the workload doesn't distribute to all cores on all nodes like before; it all executes on msba02a. In fact, it doesn't even run on all cores of msba02a; it seems to light up just one core at any given moment. The job used to run on the cluster in 1m48s; now it takes 5m56s (a ratio I can't explain: these are all four-core, eight-thread machines, so going from 24 threads down to one I'd expect a slowdown closer to 24:1, not roughly 3:1). The only time the other two nodes light up at all is near the end of the job, when the output file (770MiB) is written out to HDFS.

I've gone through https://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html and set the values shown there to 1.0.0.1 in hdfs-site.xml on msba02a, in hopes of getting the daemons to bind to the cluster-facing NIC instead of the outward-facing NIC, but as far as I can tell HDFS is working exactly as it's supposed to. Note that the ResourceManager daemon runs on msba02b and therefore doesn't need to be bound to a particular NIC; it still uses that machine's only NIC like before, except now its IP address is static and is resolved via its local /etc/hosts.
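
For concreteness, the bind settings in hdfs-site.xml on msba02a look roughly like the sketch below. The property names are the ones from that multihoming page; the values here are illustrative (the page suggests 0.0.0.0 if you want the daemons to listen on all interfaces):

<!-- hdfs-site.xml on msba02a: NameNode bind addresses (illustrative sketch) -->
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>1.0.0.1</value> <!-- or 0.0.0.0 to listen on every interface -->
</property>
<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>1.0.0.1</value>
</property>
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>1.0.0.1</value>
</property>
<property>
  <name>dfs.namenode.https-bind-host</name>
  <value>1.0.0.1</value>
</property>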

The only errors showing up in any node's daemon logs are along the lines of "org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted" in hadoop-yarn-resourcemanager-msba02b.log and hadoop-mapred-historyserver-msba02a.log.

As for the hadoop run output, previously when everything was working it would get to the point where it printed a series of lines like

map 0% reduce 0%

and that line would repeat, with the "map" percentage climbing first and then the "reduce" percentage, until both reached 100% and the job wrapped up soon afterward. Now it intersperses those lines with other output, and the percentages skip around, like this:

2018-06-05 17:45:34,338 INFO mapreduce.Job:  map 100% reduce 0%
2018-06-05 17:45:36,295 INFO mapred.MapTask: Finished spill 0
2018-06-05 17:45:36,295 INFO mapred.MapTask: (RESET) equator 61480136 kv 15370028(61480112) kvi 13480948(53923792)
2018-06-05 17:45:36,882 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:36,882 INFO mapred.MapTask: bufstart = 61480136; bufend = 10372007; bufvoid = 104857566
2018-06-05 17:45:36,882 INFO mapred.MapTask: kvstart = 15370028(61480112); kvend = 7835876(31343504); length = 7534153/6553600
2018-06-05 17:45:36,882 INFO mapred.MapTask: (EQUATOR) 17997991 kvi 4499492(17997968)
2018-06-05 17:45:38,774 INFO mapred.MapTask: Finished spill 1
2018-06-05 17:45:38,774 INFO mapred.MapTask: (RESET) equator 17997991 kv 4499492(17997968) kvi 2642780(10571120)
2018-06-05 17:45:38,910 INFO mapred.LocalJobRunner:
2018-06-05 17:45:38,910 INFO mapred.MapTask: Starting flush of map output
2018-06-05 17:45:38,910 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:38,911 INFO mapred.MapTask: bufstart = 17997991; bufend = 40956853; bufvoid = 104857600
2018-06-05 17:45:38,911 INFO mapred.MapTask: kvstart = 4499492(17997968); kvend = 1327036(5308144); length = 3172457/6553600
2018-06-05 17:45:39,340 INFO mapreduce.Job:  map 4% reduce 0%
2018-06-05 17:45:39,684 INFO mapred.MapTask: Finished spill 2
2018-06-05 17:45:39,788 INFO mapred.Merger: Merging 3 sorted segments
2018-06-05 17:45:39,788 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 34645401 bytes
2018-06-05 17:45:40,251 INFO mapred.Task: Task:attempt_local1155504279_0001_m_000002_0 is done. And is in the process of committing
2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: map > sort
2018-06-05 17:45:40,253 INFO mapred.Task: Task 'attempt_local1155504279_0001_m_000002_0' done.
2018-06-05 17:45:40,253 INFO mapred.Task: Final Counters for attempt_local1155504279_0001_m_000002_0: Counters: 23
    File System Counters
        FILE: Number of bytes read=106419805
        FILE: Number of bytes written=202253153
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=410006948
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=1
    Map-Reduce Framework
        Map input records=2653033
        Map output records=4553651
        Map output bytes=130562451
        Map output materialized bytes=31060160
        Input split bytes=95
        Combine input records=5425504
        Combine output records=1618222
        Spilled Records=1618222
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=114
        Total committed heap usage (bytes)=1301807104
    File Input Format Counters
        Bytes Read=134348800
2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: Finishing task: attempt_local1155504279_0001_m_000002_0
2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: Starting task: attempt_local1155504279_0001_m_000003_0
2018-06-05 17:45:40,254 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-06-05 17:45:40,254 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-06-05 17:45:40,254 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-06-05 17:45:40,255 INFO mapred.MapTask: Processing split: hdfs://msba02a:9000/allcat.xml:268435456+134217728
2018-06-05 17:45:40,265 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2018-06-05 17:45:40,266 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2018-06-05 17:45:40,266 INFO mapred.MapTask: soft limit at 83886080
2018-06-05 17:45:40,266 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2018-06-05 17:45:40,266 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2018-06-05 17:45:40,266 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2018-06-05 17:45:40,341 INFO mapreduce.Job:  map 100% reduce 0%
2018-06-05 17:45:41,079 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:41,079 INFO mapred.MapTask: bufstart = 0; bufend = 53799451; bufvoid = 104857600
2018-06-05 17:45:41,079 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 18692744(74770976); length = 7521653/6553600
2018-06-05 17:45:41,079 INFO mapred.MapTask: (EQUATOR) 61425451 kvi 15356356(61425424)
2018-06-05 17:45:43,110 INFO mapred.MapTask: Finished spill 0
2018-06-05 17:45:43,110 INFO mapred.MapTask: (RESET) equator 61425451 kv 15356356(61425424) kvi 13514352(54057408)
2018-06-05 17:45:43,687 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:43,687 INFO mapred.MapTask: bufstart = 61425451; bufend = 10294846; bufvoid = 104857586
2018-06-05 17:45:43,687 INFO mapred.MapTask: kvstart = 15356356(61425424); kvend = 7816592(31266368); length = 7539765/6553600
2018-06-05 17:45:43,687 INFO mapred.MapTask: (EQUATOR) 17920846 kvi 4480204(17920816)
2018-06-05 17:45:46,275 INFO mapred.MapTask: Finished spill 1
2018-06-05 17:45:46,275 INFO mapred.MapTask: (RESET) equator 17920846 kv 4480204(17920816) kvi 2573716(10294864)
2018-06-05 17:45:46,423 INFO mapred.LocalJobRunner:
2018-06-05 17:45:46,423 INFO mapred.MapTask: Starting flush of map output
2018-06-05 17:45:46,423 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:46,423 INFO mapred.MapTask: bufstart = 17920846; bufend = 41420321; bufvoid = 104857600
2018-06-05 17:45:46,423 INFO mapred.MapTask: kvstart = 4480204(17920816); kvend = 1126824(4507296); length = 3353381/6553600


Any hints as to why work isn't distributing? It seems to me this kind of network configuration would be more the norm for Hadoop clusters than one where every node sits on the general office network (in our case, one driver for isolating cluster traffic is that the data files may contain NDA-bound data that shouldn't travel the office LAN unencrypted).

Thanks!





Re: 3.1.0 MR work won't distribute after dual-homing NameNode

Gour Saha

Looks like the YARN/MR multihoming doc patch never got committed, so it is not available in the site documentation. You can look at the doc patch in https://issues.apache.org/jira/browse/YARN-2384 (maybe use an online markdown tool to view it more easily) and see whether you followed the configuration described there. Another comprehensive multihoming document which might help you is here.
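
For reference, the kind of settings that patch covers are the YARN bind-host properties — a minimal yarn-site.xml sketch, not a definitive list of the patch's contents:

<!-- yarn-site.xml: YARN multihoming bind addresses (sketch; see YARN-2384) -->
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.nodemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.timeline-service.bind-host</name>
  <value>0.0.0.0</value>
</property>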

 

-Gour

 



Re: 3.1.0 MR work won't distribute after dual-homing NameNode

Jeff Hubbs

Gour -

Thank you; I'll certainly look into that. On Monday I performed an experiment where I reduced the cluster to two nodes, moving all the daemons that were unique to msba02a onto msba02b and reconstructing HDFS as appropriate. This way, no active machine was dual-homed; the nodes ran as they had before I changed the network, except for having static IPs and name resolution via the hosts file. When I ran the wordcount MapReduce job in this configuration, I observed the same behavior: everything ran on just one core of msba02b until the output file (with all the found words and their instance counts - it's a 770-MiB file) was written back out to HDFS.

I'm about to start part of the way over with a fresh binary distribution of 3.1.0 and see what happens. I thought I would also look into the systems' name resolution priority and make /etc/hosts come first.
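
On glibc systems that lookup order is controlled by /etc/nsswitch.conf; as a minimal sketch, the relevant line is:

# /etc/nsswitch.conf - consult /etc/hosts before DNS
hosts: files dns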





Re: 3.1.0 MR work won't distribute after dual-homing NameNode

Jeff Hubbs

I looked through that material and used it to start over with a new binary distribution of 3.1.0 on all three machines, and the situation is unchanged. Again, HDFS works perfectly for both writes and reads, but the wordcount MapReduce job still only runs on one thread at a time on the machine on which I execute the job (doing so doesn't exercise the other two machines' DataNode daemons, because all three machines run the DataNode daemon and replication is set to three, so the job doesn't need to read from the other DataNodes).

I tried using the start-yarn.sh script instead of starting the ResourceManager and NodeManager daemons separately. It didn't make any difference, but the script ran the ResourceManager on msba02a instead of msba02b as intended, which eventually made the NodeManagers throw warnings and shut down: start-yarn.sh apparently paid no attention to the value of yarn.resourcemanager.hostname (msba02b), while the NodeManagers did.
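
For reference, this is roughly the setting in question, plus — as a hedged aside — the mapred-site.xml property that decides whether jobs are handed to YARN at all; the mapred.LocalJobRunner lines in the console output are what you'd expect to see if that property resolves to "local" rather than "yarn":

<!-- yarn-site.xml (sketch) -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>msba02b</value>
</property>

<!-- mapred-site.xml (sketch; the default for this property is "local",
     which runs the whole job in-process via LocalJobRunner) -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>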

Since the wordcount job takes about six minutes to finish, I have plenty of time to knock around the web interfaces. While the ResourceManager app reports three healthy nodes with 48GiB total RAM and 24 VCores like I expect, everything else is zeroed out while the job is running: Apps Running, Containers Running, Apps Submitted, Memory Used, Memory Reserved, and VCores Reserved are all zeros, and the table underneath Scheduler Metrics just says "No data available in table."

I think it's noteworthy that not only does the work not distribute across machines, it doesn't even distribute across threads and cores on the one machine the job runs on. As I said, before I moved the three nodes to static IPs on the captive LAN, all 24 threads would light up running this job.





Re: 3.1.0 MR work won't distribute after dual-homing NameNode

Jeff Hubbs

Yet at the same time, I'm looking at the log for the ResourceManager daemon (which runs on msba02b along with the NodeManager and DataNode daemons, just to take some load off msba02a), and it knows all three nodes are there:

INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node msba02a:38409 cluster capacity: <memory:16384, vCores:8>
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node msba02b:38809 cluster capacity: <memory:32768, vCores:16>
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node msba02c:38025 cluster capacity: <memory:49152, vCores:24>

And the NodeManager daemon on that same machine (which won't do any MapReduce work) is indeed connecting to the ResourceManager:

INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as msba02b:38809 with total resource of <memory:16384, vCores:8>

In fact it looks as though the whole cluster is just getting along with itself swimmingly:

$ for d in msba02a msba02b msba02c; do echo $d; ssh root@"$d" "grep FATAL /var/log/hadoop/*.log" ; done
msba02a
msba02b
msba02c
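
A quick cross-check from the command line (assuming the yarn client on these hosts points at the same ResourceManager) is to list the registered NodeManagers; all three should show up as RUNNING:

$ yarn node -list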

Do I need to try different example code? Any other ideas for me to look into?
On 6/14/18 6:26 PM, Jeff Hubbs wrote:
I looked through that material and used it to start over with a new binary distribution of 3.1.0 on all three machines and the situation is unchanged. Again, HDFS works perfectly in write and read, but the wordcount mapreduce job will still only run on one thread at a time on the machine on which I execute the job (doing so doesn't exercise the other two machines' DataNode daemons because all three machines run the DataNode daemon and replication is set to three, so the work doesn't need to read from the other datanodes.

I tried using the start-yarn.sh script instead of starting the ResourceManager daemon and the NodeManager daemons separately; it didn't make any difference but the script ran ResourceManager on msba02a instead of msba02b as intended, which eventually made the NodeManagers throw warnings and shut down because whereas start-yarn.sh apparently didn't pay any attention to the value yarn.resourcemanager.hostname (msba02b), the NodeManagers did.

Since I have about six minutes while the wordcount job finishes, I have plenty of time to knock around the web interfaces and while the ResourceManager app reports three healthy nodes with 48GiB total RAM and 24 VCores like I expect, everything else is zeroed out while the job is running; Apps Running, Containers Running, Apps Submitted, Memory Used, Memory Reserved, VCores Reserved are all balls and the table underneath Scheduler Metrics just says "No data available in table."

I think it's noteworthy that not only does work not distribute across machines, it doesn't even distribute across threads and cores in the one machine the job runs on. As I said, before I went static IPs and captive-LAN for the three nodes, all 24 threads would light up running this job.

On 6/13/18 1:59 PM, Jeff Hubbs wrote:
Gour -

Thank you; I'll certainly look into that. On Monday I performed an experiment where I reduced the cluster down to a two-node, putting all the daemons that were unique to msba02 onto msba02b and reconstructing HDFS as appropriate. This way, no active machine was dual-homed; they ran as they had before I changed the network except for having static IPs and name resolution via host table. When I did this and ran the wordcount mapreduce job, I observed the same behavior: everything ran on just one core of msba02b until the output file (with all the found words and their number of instances - it's a 770-MiB file) back out to HDFS.

I'm about to start part of the way over with a fresh binary distribution of 3.1.0 and see what happens. I thought I would also look into the systems' name resolution priority and make /etc/hosts come first.


On 6/13/18 11:02 AM, Gour Saha wrote:

Looks like the YARN/MR multihoming doc patch never got committed and hence not available in the site documentation. You can look into the doc patch in https://issues.apache.org/jira/browse/YARN-2384 (may be use an online markdown tool to view it better) and see if you followed the configuration mentioned there. Another comprehensive multihoming document which might help you is here.

 

-Gour

 

From: Jeff Hubbs [hidden email]
Date: Tuesday, June 5, 2018 at 2:57 PM
To: [hidden email] [hidden email]
Subject: 3.1.0 MR work won't distribute after dual-homing NameNode

 

Hi -

I have a three node Hadoop 3.1.0 cluster on which the daemons are distributed like so:

Daemons on msba02a...
20112 NameNode
20240 DataNode
24101 JobHistoryServer
20918 WebAppProxyServer
20743 NodeManager
20476 SecondaryNameNode

Daemons on msba02b...
22547 DataNode
22734 ResourceManager
23007 NodeManager

Daemons on msba02c...
10005 NodeManager
9818 DataNode

All three nodes run Gentoo Linux and have either one or two volumes devoted to HDFS; HDFS reports a size of 5.7TiB.

Previously, HDFS and MapReduce (testing with the archetypical "wordcount" job on a 5.8GiB XML file) worked fine in an environment where all three machines are on the same office LAN and get their IP addresses from DHCP; dynamic DNS creates network host names based on the machines' host names as reported by the machines' DHCP clients. FQDNs were used for all intra- and inter-machine references in the Hadoop configuration files.

Since then, I've changed things so that msba02a now has a second NIC that connects to an independent LAN along with the other two machines using their built-in NICs like before; msba02b and msba02c reach the Internet by going through NAT on msba02a. /etc/hosts on all three machines has been populated with the static IPs I gave them like so:

127.0.0.1  localhost
1.0.0.1 msba02a
1.0.0.10 msba02b
1.0.0.20 msba02c

So now if I shell into msba02a and run the wordcount job with the test XML file sitting in HDFS with replication set to 3, the job *does* run and gives me the expected output file...but the workload doesn't distribute to all cores on all nodes like before; it all executes on msba02a. In fact, it doesn't even run on all cores on msba02a; it seems to light up just one core at any given moment. The job used to run on the cluster in 1m48s; now it takes 5m56 (a ratio I can't understand; these are all four-core, eight-thread machines so I'd expect a ratio of close to 24:1, not 3:1). The only time the other two nodes light up at all is near the end of the job when the output file (770MiB) is written out to HDFS.

I've gone through https://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html and set the values shown there to 1.0.0.1 in hdfs-site.xml on msba02a in hopes of getting the daemons to bind to the cluster-facing NIC instead of the outward-facing NIC, but it seems to me like HDFS is working exactly like it's supposed to. Note that the ResourceManager daemon runs on msba02b and therefore doesn't need to be bound to a particular NIC; it still uses that machine's only NIC like before except now its IP address is static and is resolved via its local /etc/hosts.

The only errors showing up in the daemon logs of any nodes seem to be e.g. "org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted" in hadoop-yarn-resourcemanager-msba02b.log and hadoop-mapred-historyserver-msba02a.log.

As for the hadoop run output, previously when everything was working things would get to point where it would print out a series of lines like

map 0% reduce 0%

and that line would repeat with "map" percentage climbing first and then the "reduce" percentage would climb until both numbers reached 100% and the job would wrap up soon afterward. Now, it intersperses those lines with other output and it skips around, like this:

2018-06-05 17:45:34,338 INFO mapreduce.Job:  map 100% reduce 0%
2018-06-05 17:45:36,295 INFO mapred.MapTask: Finished spill 0
2018-06-05 17:45:36,295 INFO mapred.MapTask: (RESET) equator 61480136 kv 15370028(61480112) kvi 13480948(53923792)
2018-06-05 17:45:36,882 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:36,882 INFO mapred.MapTask: bufstart = 61480136; bufend = 10372007; bufvoid = 104857566
2018-06-05 17:45:36,882 INFO mapred.MapTask: kvstart = 15370028(61480112); kvend = 7835876(31343504); length = 7534153/6553600
2018-06-05 17:45:36,882 INFO mapred.MapTask: (EQUATOR) 17997991 kvi 4499492(17997968)
2018-06-05 17:45:38,774 INFO mapred.MapTask: Finished spill 1
2018-06-05 17:45:38,774 INFO mapred.MapTask: (RESET) equator 17997991 kv 4499492(17997968) kvi 2642780(10571120)
2018-06-05 17:45:38,910 INFO mapred.LocalJobRunner:
2018-06-05 17:45:38,910 INFO mapred.MapTask: Starting flush of map output
2018-06-05 17:45:38,910 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:38,911 INFO mapred.MapTask: bufstart = 17997991; bufend = 40956853; bufvoid = 104857600
2018-06-05 17:45:38,911 INFO mapred.MapTask: kvstart = 4499492(17997968); kvend = 1327036(5308144); length = 3172457/6553600
2018-06-05 17:45:39,340 INFO mapreduce.Job:  map 4% reduce 0%
2018-06-05 17:45:39,684 INFO mapred.MapTask: Finished spill 2
2018-06-05 17:45:39,788 INFO mapred.Merger: Merging 3 sorted segments
2018-06-05 17:45:39,788 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 34645401 bytes
2018-06-05 17:45:40,251 INFO mapred.Task: Task:attempt_local1155504279_0001_m_000002_0 is done. And is in the process of committing
2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: map > sort
2018-06-05 17:45:40,253 INFO mapred.Task: Task 'attempt_local1155504279_0001_m_000002_0' done.
2018-06-05 17:45:40,253 INFO mapred.Task: Final Counters for attempt_local1155504279_0001_m_000002_0: Counters: 23
    File System Counters
        FILE: Number of bytes read=106419805
        FILE: Number of bytes written=202253153
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=410006948
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=1
    Map-Reduce Framework
        Map input records=2653033
        Map output records=4553651
        Map output bytes=130562451
        Map output materialized bytes=31060160
        Input split bytes=95
        Combine input records=5425504
        Combine output records=1618222
        Spilled Records=1618222
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=114
        Total committed heap usage (bytes)=1301807104
    File Input Format Counters
        Bytes Read=134348800
2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: Finishing task: attempt_local1155504279_0001_m_000002_0
2018-06-05 17:45:40,253 INFO mapred.LocalJobRunner: Starting task: attempt_local1155504279_0001_m_000003_0
2018-06-05 17:45:40,254 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-06-05 17:45:40,254 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-06-05 17:45:40,254 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-06-05 17:45:40,255 INFO mapred.MapTask: Processing split: hdfs://msba02a:9000/allcat.xml:268435456+134217728
2018-06-05 17:45:40,265 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2018-06-05 17:45:40,266 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2018-06-05 17:45:40,266 INFO mapred.MapTask: soft limit at 83886080
2018-06-05 17:45:40,266 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2018-06-05 17:45:40,266 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2018-06-05 17:45:40,266 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2018-06-05 17:45:40,341 INFO mapreduce.Job:  map 100% reduce 0%
2018-06-05 17:45:41,079 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:41,079 INFO mapred.MapTask: bufstart = 0; bufend = 53799451; bufvoid = 104857600
2018-06-05 17:45:41,079 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 18692744(74770976); length = 7521653/6553600
2018-06-05 17:45:41,079 INFO mapred.MapTask: (EQUATOR) 61425451 kvi 15356356(61425424)
2018-06-05 17:45:43,110 INFO mapred.MapTask: Finished spill 0
2018-06-05 17:45:43,110 INFO mapred.MapTask: (RESET) equator 61425451 kv 15356356(61425424) kvi 13514352(54057408)
2018-06-05 17:45:43,687 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:43,687 INFO mapred.MapTask: bufstart = 61425451; bufend = 10294846; bufvoid = 104857586
2018-06-05 17:45:43,687 INFO mapred.MapTask: kvstart = 15356356(61425424); kvend = 7816592(31266368); length = 7539765/6553600
2018-06-05 17:45:43,687 INFO mapred.MapTask: (EQUATOR) 17920846 kvi 4480204(17920816)
2018-06-05 17:45:46,275 INFO mapred.MapTask: Finished spill 1
2018-06-05 17:45:46,275 INFO mapred.MapTask: (RESET) equator 17920846 kv 4480204(17920816) kvi 2573716(10294864)
2018-06-05 17:45:46,423 INFO mapred.LocalJobRunner:
2018-06-05 17:45:46,423 INFO mapred.MapTask: Starting flush of map output
2018-06-05 17:45:46,423 INFO mapred.MapTask: Spilling map output
2018-06-05 17:45:46,423 INFO mapred.MapTask: bufstart = 17920846; bufend = 41420321; bufvoid = 104857600
2018-06-05 17:45:46,423 INFO mapred.MapTask: kvstart = 4480204(17920816); kvend = 1126824(4507296); length = 3353381/6553600

 

Any hints as to why work isn't distributing? It seems to me that this kind of network configuration would be more the norm for Hadoop clusters than one where every node sits on the same LAN as everything else in the environment (in our situation, one driver for isolating cluster traffic is that the data files used may contain NDA-bound data that shouldn't travel the office LAN unencrypted).

Thanks!

Re: 3.1.0 MR work won't distribute after dual-homing NameNode

Jeff Hubbs
Just tried the teragen example (worked fine) and terasort is running now. Work is still executing on one core of msba02a.

There is one thing I see in the ResourceManager log on msba02b that is suspect: "WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: fair-scheduler.xml not found on the classpath." Fair-scheduler is what's indicated as the scheduler type in the ResourceManager web app, which still shows zero activity even while terasort is running. NodeHealthyStatus indicates "true" for all nodes.
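
As far as I can tell, the FairScheduler just falls back to its built-in defaults when that file is missing, so the warning is probably harmless. If I wanted to silence it, my understanding is that a minimal allocation file along these lines would do, placed wherever yarn.scheduler.fair.allocation.file points (the "default" queue and its settings here are just a made-up sketch, not something from my config):

<?xml version="1.0"?>
<!-- fair-scheduler.xml (sketch): a single default queue with fair scheduling;
     everything else is left at the FairScheduler's defaults -->
<allocations>
  <queue name="default">
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>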

This is what is showing up on the console in more or less of a loop while terasort executes (note that "map 100% reduce 0%" alternates with "map 33% reduce 0%"; the 33% number ticks up gradually as the job runs).

2018-06-15 16:02:25,382 INFO mapred.LocalJobRunner: Finishing task: attempt_local1529961718_0001_m_000242_0
2018-06-15 16:02:25,382 INFO mapred.LocalJobRunner: Starting task: attempt_local1529961718_0001_m_000243_0
2018-06-15 16:02:25,382 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-06-15 16:02:25,382 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-06-15 16:02:25,382 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-06-15 16:02:25,383 INFO mapred.MapTask: Processing split: hdfs://msba02a:9000/terasort/input/part-m-00000:68853694464+134217728
2018-06-15 16:02:25,397 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2018-06-15 16:02:25,397 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2018-06-15 16:02:25,397 INFO mapred.MapTask: soft limit at 83886080
2018-06-15 16:02:25,397 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2018-06-15 16:02:25,397 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2018-06-15 16:02:25,398 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2018-06-15 16:02:25,686 INFO mapreduce.Job:  map 100% reduce 0%
2018-06-15 16:02:26,234 INFO mapred.MapTask: Spilling map output
2018-06-15 16:02:26,234 INFO mapred.MapTask: bufstart = 0; bufend = 72511698; bufvoid = 104857600
2018-06-15 16:02:26,234 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 23370804(93483216); length = 2843593/6553600
2018-06-15 16:02:26,234 INFO mapred.MapTask: (EQUATOR) 75355282 kvi 18838816(75355264)
2018-06-15 16:02:27,758 INFO mapred.MapTask: Finished spill 0
2018-06-15 16:02:27,758 INFO mapred.MapTask: (RESET) equator 75355282 kv 18838816(75355264) kvi 18127932(72511728)
2018-06-15 16:02:28,130 INFO mapred.LocalJobRunner:
2018-06-15 16:02:28,130 INFO mapred.MapTask: Starting flush of map output
2018-06-15 16:02:28,130 INFO mapred.MapTask: Spilling map output
2018-06-15 16:02:28,130 INFO mapred.MapTask: bufstart = 75355282; bufend = 34888038; bufvoid = 104857600
2018-06-15 16:02:28,130 INFO mapred.MapTask: kvstart = 18838816(75355264); kvend = 16313708(65254832); length = 2525109/6553600
2018-06-15 16:02:28,687 INFO mapreduce.Job:  map 33% reduce 0%
2018-06-15 16:02:29,179 INFO mapred.MapTask: Finished spill 1
2018-06-15 16:02:30,222 INFO mapred.Merger: Merging 2 sorted segments
2018-06-15 16:02:30,223 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 139586394 bytes
2018-06-15 16:02:30,971 INFO mapred.Task: Task:attempt_local1529961718_0001_m_000243_0 is done. And is in the process of committing
2018-06-15 16:02:30,972 INFO mapred.LocalJobRunner: map > sort
2018-06-15 16:02:30,972 INFO mapred.Task: Task 'attempt_local1529961718_0001_m_000243_0' done.
2018-06-15 16:02:30,972 INFO mapred.Task: Final Counters for attempt_local1529961718_0001_m_000243_0: Counters: 22
        File System Counters
                FILE: Number of bytes read=34084801145
                FILE: Number of bytes written=68135324939
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=32766918300
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=513
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Map-Reduce Framework
                Map input records=1342177
                Map output records=1342177
                Map output bytes=136902054
                Map output materialized bytes=139586414
                Input split bytes=112
                Combine input records=0
                Spilled Records=2684354
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=16
                Total committed heap usage (bytes)=1830289408
        File Input Format Counters
                Bytes Read=134217700
2018-06-15 16:02:30,972 INFO mapred.LocalJobRunner: Finishing task: attempt_local1529961718_0001_m_000243_0
2018-06-15 16:02:30,972 INFO mapred.LocalJobRunner: Starting task: attempt_local1529961718_0001_m_000244_0
2018-06-15 16:02:30,972 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-06-15 16:02:30,972 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-06-15 16:02:30,972 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-06-15 16:02:30,973 INFO mapred.MapTask: Processing split: hdfs://msba02a:9000/terasort/input/part-m-00000:68719476736+134217728
2018-06-15 16:02:30,983 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2018-06-15 16:02:30,983 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2018-06-15 16:02:30,983 INFO mapred.MapTask: soft limit at 83886080
2018-06-15 16:02:30,983 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2018-06-15 16:02:30,983 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2018-06-15 16:02:30,984 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2018-06-15 16:02:31,687 INFO mapreduce.Job:  map 100% reduce 0%
2018-06-15 16:02:31,885 INFO mapred.MapTask: Spilling map output
2018-06-15 16:02:31,885 INFO mapred.MapTask: bufstart = 0; bufend = 72511698; bufvoid = 104857600
2018-06-15 16:02:31,885 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 23370804(93483216); length = 2843593/6553600
2018-06-15 16:02:31,885 INFO mapred.MapTask: (EQUATOR) 75355282 kvi 18838816(75355264)
2018-06-15 16:02:33,418 INFO mapred.MapTask: Finished spill 0
2018-06-15 16:02:33,418 INFO mapred.MapTask: (RESET) equator 75355282 kv 18838816(75355264) kvi 18127932(72511728)
2018-06-15 16:02:33,774 INFO mapred.LocalJobRunner:
2018-06-15 16:02:33,774 INFO mapred.MapTask: Starting flush of map output
2018-06-15 16:02:33,774 INFO mapred.MapTask: Spilling map output
2018-06-15 16:02:33,774 INFO mapred.MapTask: bufstart = 75355282; bufend = 34888038; bufvoid = 104857600
2018-06-15 16:02:33,774 INFO mapred.MapTask: kvstart = 18838816(75355264); kvend = 16313708(65254832); length = 2525109/6553600
2018-06-15 16:02:34,688 INFO mapreduce.Job:  map 33% reduce 0%
2018-06-15 16:02:34,962 INFO mapred.MapTask: Finished spill 1
2018-06-15 16:02:35,047 INFO mapred.Merger: Merging 2 sorted segments
2018-06-15 16:02:35,047 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 139586394 bytes
2018-06-15 16:02:35,748 INFO mapred.Task: Task:attempt_local1529961718_0001_m_000244_0 is done. And is in the process of committing
2018-06-15 16:02:35,750 INFO mapred.LocalJobRunner: map > sort
2018-06-15 16:02:35,750 INFO mapred.Task: Task 'attempt_local1529961718_0001_m_000244_0' done.
2018-06-15 16:02:35,750 INFO mapred.Task: Final Counters for attempt_local1529961718_0001_m_000244_0: Counters: 22
        File System Counters
                FILE: Number of bytes read=34224444536
                FILE: Number of bytes written=68414497805
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=32901136000
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=515
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Map-Reduce Framework
                Map input records=1342177
                Map output records=1342177
                Map output bytes=136902054
                Map output materialized bytes=139586414
                Input split bytes=112
                Combine input records=0
                Spilled Records=2684354
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=18
                Total committed heap usage (bytes)=1847066624
        File Input Format Counters
                Bytes Read=134217700
2018-06-15 16:02:35,750 INFO mapred.LocalJobRunner: Finishing task: attempt_local1529961718_0001_m_000244_0
2018-06-15 16:02:35,750 INFO mapred.LocalJobRunner: Starting task: attempt_local1529961718_0001_m_000245_0
2018-06-15 16:02:35,750 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2018-06-15 16:02:35,750 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-06-15 16:02:35,750 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2018-06-15 16:02:35,751 INFO mapred.MapTask: Processing split: hdfs://msba02a:9000/terasort/input/part-m-00000:68585259008+134217728
2018-06-15 16:02:35,762 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2018-06-15 16:02:35,762 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2018-06-15 16:02:35,762 INFO mapred.MapTask: soft limit at 83886080
2018-06-15 16:02:35,762 INFO mapred.


On 6/15/18 11:44 AM, Jeff Hubbs wrote:
Yet at the same time, I'm looking at the log for the ResourceManager daemon (which runs on msba02b along with the NodeManager and DataNode daemons just to take some load off of msba02a) and it knows all three nodes are there:

INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node msba02a:38409 cluster capacity: <memory:16384, vCores:8>
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node msba02b:38809 cluster capacity: <memory:32768, vCores:16>
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node msba02c:38025 cluster capacity: <memory:49152, vCores:24>
And the NodeManager daemon on that same machine (which won't do any mapreduce work) is indeed connecting to the ResourceManager:

INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as msba02b:38809 with total resource of <memory:16384, vCores:8>

In fact it looks as though the whole cluster is just getting along with itself swimmingly:

$ for d in msba02a msba02b msba02c; do echo $d; ssh root@"$d" "grep FATAL /var/log/hadoop/*.log" ; done
msba02a
msba02b
msba02c

Do I need to try different example code? Any other ideas for me to look into?

On 6/14/18 6:26 PM, Jeff Hubbs wrote:
I looked through that material and used it to start over with a new binary distribution of 3.1.0 on all three machines, and the situation is unchanged. Again, HDFS works perfectly for both writes and reads, but the wordcount mapreduce job will still only run on one thread at a time on the machine where I execute the job (doing so doesn't exercise the other two machines' DataNode daemons: all three machines run the DataNode daemon and replication is set to three, so the work doesn't need to read from the other DataNodes).

I tried using the start-yarn.sh script instead of starting the ResourceManager daemon and the NodeManager daemons separately. It didn't make any difference, but the script started the ResourceManager on msba02a instead of on msba02b as intended, which eventually made the NodeManagers throw warnings and shut down: start-yarn.sh apparently ignored the value of yarn.resourcemanager.hostname (msba02b), while the NodeManagers honored it.
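
For what it's worth, the yarn-site.xml stanza the NodeManagers are honoring looks like this (a minimal sketch of just that one property; the rest of my yarn-site.xml is omitted):

<!-- yarn-site.xml: pin the ResourceManager to msba02b so the NodeManagers
     (and, ideally, start-yarn.sh) agree on where it runs -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>msba02b</value>
</property>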

Since the wordcount job takes about six minutes to finish, I have plenty of time to knock around the web interfaces. While the ResourceManager app reports three healthy nodes with 48GiB total RAM and 24 VCores like I expect, everything else is zeroed out while the job is running: Apps Running, Containers Running, Apps Submitted, Memory Used, Memory Reserved, and VCores Reserved are all zero, and the table underneath Scheduler Metrics just says "No data available in table."

I think it's noteworthy that not only does work not distribute across machines, it doesn't even distribute across threads and cores within the one machine the job runs on. As I said, before I moved the three nodes to static IPs and the captive LAN, all 24 threads would light up running this job.

On 6/13/18 1:59 PM, Jeff Hubbs wrote:
Gour -

Thank you; I'll certainly look into that. On Monday I performed an experiment where I reduced the cluster down to two nodes, putting all the daemons that were unique to msba02a onto msba02b and reconstructing HDFS as appropriate. This way, no active machine was dual-homed; the machines ran as they had before I changed the network, except for having static IPs and name resolution via the hosts table. When I did this and ran the wordcount mapreduce job, I observed the same behavior: everything ran on just one core of msba02b until the output file (with all the found words and their number of instances - it's a 770-MiB file) was written back out to HDFS.

I'm about to start part of the way over with a fresh binary distribution of 3.1.0 and see what happens. I thought I would also look into the systems' name resolution priority and make /etc/hosts come first.


On 6/13/18 11:02 AM, Gour Saha wrote:

Looks like the YARN/MR multihoming doc patch never got committed and hence is not available in the site documentation. You can look at the doc patch in https://issues.apache.org/jira/browse/YARN-2384 (maybe use an online markdown tool to view it better) and see if you followed the configuration mentioned there. Another comprehensive multihoming document which might help you is here.
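
The main YARN-side knobs in that patch are the bind-host properties. Roughly (a sketch only; 0.0.0.0 is the usual "listen on all interfaces" value, adjust to your setup):

<!-- yarn-site.xml (sketch of the YARN-2384 multihoming settings);
     bind-host controls which interface each daemon listens on -->
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.nodemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>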

 

-Gour

 



Re: 3.1.0 MR work won't distribute after dual-homing NameNode

Jeff Hubbs
It seems that the reason I was having the problem is that mapreduce.framework.name wasn't explicitly set to "yarn" in mapred-site.xml. I added that and a few other settings to that file, and now I've got distributing behavior again.
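
For anyone who hits the same thing, the piece that mattered in mapred-site.xml is just this (a sketch showing only that one property; the other settings I added aren't reproduced here):

<!-- mapred-site.xml: without this, jobs run in the LocalJobRunner
     (one JVM, one map task at a time) instead of on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>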

An incidental observation: with a 1TB test file, the terasort example tries to eat up a lot of space in each worker node's /tmp/hadoop-yarn. I was only getting about 7% of map completed before everything ground to a halt without erroring, and it seems that was why, as / on my workers is a 32-GiB btrfs partition that is nominally only about one-quarter full. For the time being, all I've done in the interest of getting terasort to complete is to reboot the workers with the filesystem mounted with -o compress=lzo. The files buried deep under /tmp/hadoop-yarn are segments of the teragen-generated file; they look like they'll compress pretty well, but I don't know if it'll be enough.
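
If the compression trick doesn't cut it, the more direct fix would be to point the NodeManagers' local directories away from /tmp in yarn-site.xml (a sketch; /data/yarn-local is a made-up path, substitute whatever big volume the workers actually have):

<!-- yarn-site.xml (sketch): keep YARN's intermediate/container data off
     the small 32-GiB root filesystem -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/yarn-local</value>
</property>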
