Hadoop error in shuffle in fetcher: Exceeded MAX_FAILED_UNIQUE_FETCHES

Hadoop error in shuffle in fetcher: Exceeded MAX_FAILED_UNIQUE_FETCHES

Seonyeong Bak
Hi all,

We run a Hadoop cluster (Apache Hadoop 2.7.1) with 40 DataNodes.
We currently use the Fair Scheduler,
and there is no limit on the number of concurrently running jobs:
around 30-50 I/O-heavy jobs run concurrently in the early morning.

Recently, we have been getting shuffle errors like the following while running the HDFS Balancer or Spark Streaming jobs:

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2 
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134) 
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:354) 
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



I also noticed that a SocketTimeoutException occurred in some tasks of the same job,
but we have found no network problems.
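
For reference, these are the reducer-side shuffle timeouts we have been looking at raising in mapred-site.xml; the values below are illustrative, not our production settings:

    <!-- mapred-site.xml: reducer-side shuffle timeouts (illustrative values) -->
    <property>
      <!-- Time allowed to connect to a map-side shuffle server; default is 180000 ms. -->
      <name>mapreduce.reduce.shuffle.connect.timeout</name>
      <value>300000</value>
    </property>
    <property>
      <!-- Time allowed to read map output once connected; default is 180000 ms. -->
      <name>mapreduce.reduce.shuffle.read.timeout</name>
      <value>300000</value>
    </property>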


Someone said that we need to increase the value of the "mapreduce.tasktracker.http.threads" property.
However, no code uses that property after the commit starting with hash 80a05764be5c4f517.


Here are my questions:

1. Is that property currently being used?
2. If so, will it really help solve our problem?
3. Do we need to fine-tune the settings of our NodeManagers and DataNodes?
4. Is there any better solution?


Thanks,
Pak

Re: Hadoop error in shuffle in fetcher: Exceeded MAX_FAILED_UNIQUE_FETCHES

Ravi Prakash
The shuffle is handled by an auxiliary service that runs inside the NodeManager, and it is this service that serves the intermediate data.
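
If the concern is the number of server-side shuffle threads, that knob now lives in the NodeManager configuration rather than the old TaskTracker. A minimal sketch, assuming the standard mapreduce_shuffle auxiliary service (the values shown are the usual defaults, for illustration):

    <!-- yarn-site.xml on each NodeManager -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
      <!-- Max threads serving shuffle connections; 0 means 2 x available processors. -->
      <name>mapreduce.shuffle.max.threads</name>
      <value>0</value>
    </property>

As far as I can tell, mapreduce.tasktracker.http.threads belonged to the MRv1 TaskTracker, so it has no effect under YARN; mapreduce.shuffle.max.threads is its rough counterpart in the ShuffleHandler.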

Cheers
Ravi
