Re: Does the map task push map output to reduce task or reduce task pull it from map task
Well, I'm not sure, but I think it is a pull, because physically
the mappers and the reducers run on the same nodes. If the mappers had to
push, it could happen that all nodes are busy mapping and there are no
reducers running to accept the output. Maybe for this reason the reducers
do not start reducing anything until all of the map tasks have finished.
There is also the sort/shuffle layer between mapping and reducing, which
clearly demarcates the phases. That seems to suggest that it is a pull
rather than a push.
You might think of this as a performance bottleneck, but in practice it
does not seem to be one.
By the way, wait for an expert to answer; I'm a beginner too!
The reduce task looks at the map tasks for the partition it requires and pulls it (the number of parallel copies is controlled by mapred.reduce.parallel.copies). As partitions are fetched by the reduce task, it merge-sorts them; this forms your shuffle and sort (S&S) phase. Typically your mappers and reducers are O(n) while S&S is O(n log n), so if the amount of intermediate data is huge you will see a relative drop in performance.
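To make the pull-then-merge idea concrete, here is a rough sketch in plain Python. It is not Hadoop code: MAP_OUTPUTS, fetch_partition, run_reducer and the parallel_copies argument are all made-up names for illustration, and a real fetch goes over the network rather than reading a dict. It only shows the shape of the flow: the reducer pulls its partition from every map output with a bounded number of parallel copies, merges the already-sorted segments, then reduces each key group.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for finished map tasks: one dict per map task,
# mapping partition number -> list of (key, value) pairs sorted by key.
MAP_OUTPUTS = [
    {0: [("apple", 1), ("cherry", 1)], 1: [("banana", 1)]},
    {0: [("apple", 1)], 1: [("banana", 1), ("date", 1)]},
    {0: [("cherry", 1)], 1: []},
]


def fetch_partition(map_output, partition):
    """Pull one partition from one map task's output (hypothetical helper)."""
    return map_output.get(partition, [])


def run_reducer(partition, parallel_copies=2):
    # Pull phase: the reducer asks every map output for its own partition,
    # with at most `parallel_copies` fetches in flight at a time.
    with ThreadPoolExecutor(max_workers=parallel_copies) as pool:
        segments = list(
            pool.map(lambda m: fetch_partition(m, partition), MAP_OUTPUTS)
        )

    # Merge phase: each segment is already sorted by key, so a k-way merge
    # yields one sorted stream without re-sorting everything.
    merged = heapq.merge(*segments, key=itemgetter(0))

    # Reduce phase: sum the values for each key, in key order.
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


if __name__ == "__main__":
    for p in (0, 1):
        print(p, run_reducer(p))
```

Raising parallel_copies in this sketch only changes how many fetches overlap; the merged result is the same either way.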
> know what the equivalent would be in the mapreduce package
> in 0.20.x.
> dave bayer
The framework code to do with fetching of map outputs is the same for
both the mapred and mapreduce based reducers.