There's a known problem when the mapper #setup phase is heavy (e.g. loading files from HDFS) and the #map operations are fast, so a task spends something like 5 minutes in #setup and 30 seconds in #map. I have an HBase MR job with 1000 regions => 1000 mappers, and I see that it spends most of its time in the setup phase.
The obvious solution is to initialize the data once and reuse it until the end of the job, but I'm not sure how to implement that within the current framework restrictions.
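To make the "init once and reuse" idea concrete, here is a minimal plain-Java sketch of a JVM-wide cache that #setup could consult before loading anything. The class name SharedSetupCache, the LOADS counter and the key format are hypothetical, not part of Hadoop or HBase. It only helps if several map tasks actually run inside the same JVM, which depends on the Hadoop version and job configuration, so treat that as an assumption to verify on your cluster.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: keeps expensive setup data in a static map for the life of the JVM. */
final class SharedSetupCache {
    // One entry per resource key (for example, the HDFS path that was loaded).
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    // Counts how many times a loader actually ran; only here to make the reuse visible.
    static final AtomicInteger LOADS = new AtomicInteger();

    private SharedSetupCache() {}

    /**
     * Returns the cached value for key, running the loader only on the first call.
     * computeIfAbsent runs the loader at most once per key, even with concurrent callers.
     */
    @SuppressWarnings("unchecked")
    static <T> T getOrLoad(String key, Function<String, T> loader) {
        return (T) CACHE.computeIfAbsent(key, k -> {
            LOADS.incrementAndGet();
            return loader.apply(k);
        });
    }
}
```

A mapper's #setup would then call something like SharedSetupCache.getOrLoad(path, p -> loadFromHdfs(p)), where loadFromHdfs stands for whatever expensive loading the job does today. If every task gets a fresh JVM, the cache is empty each time and nothing is saved.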
Is it theoretically possible to assign several local input splits to the same mapper (e.g. by returning a bigger multi-region split from a custom TableInputFormat)? Or are there other best practices for this problem? I'm asking here because I suspect there could be hidden problems I'm not aware of, and it would be better to locate or avoid them at the beginning.
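The grouping step behind a "bigger multi-region split" might look roughly like the plain-Java sketch below. RegionSplit is a hypothetical stand-in for the per-region split (the region name plus the host serving it), and RegionSplitGrouper and maxPerGroup are made-up names. In a real job this logic would presumably sit in an overridden getSplits of the custom input format, with each group wrapped in one InputSplit and a record reader that scans the group's regions one after another. That part is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: packs per-region splits into per-host groups, one group per mapper. */
final class RegionSplitGrouper {

    /** Stand-in for a per-region split; not an HBase class. */
    static final class RegionSplit {
        final String region;
        final String host;

        RegionSplit(String region, String host) {
            this.region = region;
            this.host = host;
        }
    }

    /**
     * Groups splits by host so a mapper only reads regions local to one server,
     * then cuts each host's list into chunks of at most maxPerGroup regions.
     */
    static List<List<RegionSplit>> groupByHost(List<RegionSplit> splits, int maxPerGroup) {
        if (maxPerGroup < 1) {
            throw new IllegalArgumentException("maxPerGroup must be >= 1");
        }
        // LinkedHashMap keeps hosts in first-seen order, so the output is deterministic.
        Map<String, List<RegionSplit>> byHost = new LinkedHashMap<>();
        for (RegionSplit s : splits) {
            byHost.computeIfAbsent(s.host, h -> new ArrayList<>()).add(s);
        }
        List<List<RegionSplit>> groups = new ArrayList<>();
        for (List<RegionSplit> hostSplits : byHost.values()) {
            for (int i = 0; i < hostSplits.size(); i += maxPerGroup) {
                int end = Math.min(i + maxPerGroup, hostSplits.size());
                groups.add(new ArrayList<>(hostSplits.subList(i, end)));
            }
        }
        return groups;
    }
}
```

With 1000 regions and maxPerGroup = 10, this yields at most a few hundred mappers instead of 1000, so #setup runs correspondingly fewer times. The trade-offs to watch for are reduced parallelism, a failed task redoing its whole group, and locality being lost if a region moves to another server after the splits were computed.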