How much overhead does a DistributedCache file have in a MapReduce program?

I think you may try using archive files (archives are unpacked on the task node automatically). You can add archive files to the DistributedCache in two ways:

  • With a tool that uses GenericOptionsParser (typically a driver run through ToolRunner; see the driver sketch after this list). You can then specify the files to be distributed as a comma-separated list of URIs as the argument to the -archives option. If you don't specify a scheme, the files are assumed to be local, so when you launch the job the local file is copied to the distributed filesystem (often HDFS):



    $> hadoop jar foo.jar ClassUsingDistributedCacheFile -archives archive.jar input output


  • With the DistributedCache API (see the Javadoc). With the API, the files specified by the URIs must already be in a shared filesystem (the Java API does not copy the files for you); the sketch below shows the equivalent call.
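
Here is a minimal driver sketch combining both approaches, assuming the classic org.apache.hadoop.filecache.DistributedCache API (later deprecated in favour of Job.addCacheArchive, but the mechanics are the same); the HDFS path and job name are made up for illustration:

    import java.net.URI;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ClassUsingDistributedCacheFile extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // ToolRunner/GenericOptionsParser has already consumed -archives
            // by this point, so args only holds the input and output paths.
            Job job = new Job(getConf(), "distributed-cache-example");
            job.setJarByClass(ClassUsingDistributedCacheFile.class);

            // Alternative to -archives: add the archive programmatically.
            // The URI must point at a shared filesystem (e.g. HDFS); this
            // call does not upload local files for you.
            DistributedCache.addCacheArchive(
                    new URI("hdfs:///user/foo/archive.jar"), job.getConfiguration());

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new ClassUsingDistributedCacheFile(), args));
        }
    }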




Before a task is run, the tasktracker copies the files from the distributed filesystem to a local disk, as you say. I think the overhead comes from retrieving all your little files from HDFS; bundling them into a single archive means only one file has to be fetched per node.
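
On the task side, the localized copies can then be read from the local disk. A rough sketch of that, again with the classic API (the mapper's key/value types and class name are just placeholders):

    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheReadingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Path[] localArchives;

        @Override
        protected void setup(Context context) throws IOException {
            // The tasktracker has already copied and unpacked the archives
            // onto this node's local disk; these paths point at the local copies.
            localArchives = DistributedCache.getLocalCacheArchives(
                    context.getConfiguration());
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... read from localArchives[i] with plain java.io here ...
        }
    }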

