






Let's start with the basics of an ext filesystem (forget HDFS for the time being).
 1. On your hard disk, data is stored in tracks and sectors. When a file is stored, a complete record is not necessarily saved within a single block (4 KB); it can span across blocks.
 2. The process that reads the file reads the blocks and finds the record boundaries.
 3. The file is saved to the hard disk as bytes, with no notion of records or file format. File formats and records are logical entities.

Apply the same logic to HDFS.
 1. The block size is 128 MB (the default in Hadoop 2).
 2. Just like the ext filesystem, HDFS has no clue about record boundaries.
 3. Mappers find the record boundaries logically, as follows:
    a. The mapper that reads from file offset 0 starts at the beginning of the file and reads until it finds \n.
    b. Every mapper that does not start at offset 0 skips bytes until it reaches \n and then continues reading. The byte sequence up to that newline is omitted. It may be a complete record or a partial record, and it is consumed by the mapper that handles the preceding block.
    c. Each mapper reads the block it is assigned and keeps reading until it finds \n, which may lie in the next block rather than in the block local to it.
    d. So every mapper except the one reading the last block reads its local block plus the byte sequence from the next block up to \n.
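
The rules above can be tried out on a small in-memory "file". This is only a rough sketch, not Hadoop's actual reader: the class SplitRecordDemo and the method readSplit are made-up names, the file is a plain byte array, and the split size is 8 bytes instead of 128 MB so that records visibly cross split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it only imitates the skip/read-past-the-end rules.
class SplitRecordDemo {

    // Returns the records one "mapper" would emit for the split [start, start + length).
    static List<String> readSplit(byte[] file, int start, int length) {
        int end = Math.min(start + length, file.length);
        int pos = start;

        // Rule b: a mapper that does not start at offset 0 skips everything
        // up to and including the first '\n'; the previous mapper owns those bytes.
        if (start != 0) {
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }

        List<String> records = new ArrayList<>();
        // Rule c: a record is read in full as long as it STARTS at or before the
        // end of the split, even if its '\n' lies in the next block.
        while (pos <= end && pos < file.length) {
            int recordStart = pos;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            records.add(new String(file, recordStart, pos - recordStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 8; // stand-in for the 128 MB block size
        for (int start = 0; start < file.length; start += splitSize) {
            System.out.println("split@" + start + " -> " + readSplit(file, start, splitSize));
        }
    }
}
```

With this sample the four splits yield [alpha, bravo], [charlie], [delta] and an empty list, so every record is read exactly once even though "bravo" and "charlie" each straddle a split boundary.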


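How the data is divided into blocks is decided by the Hadoop HDFS gateway at storage time. It depends on the Hadoop version (1.x or 2.x) and on the size of the file you put from local into the Hadoop gateway. After the -put command, the Hadoop gateway splits the file into blocks and stores them in your datanode directory /data/dfs/data/current/ (if you are running on a single node, this is inside the Hadoop directory) in the form blk_<job_process_id>, together with the block's metadata under the same blk_ job_id name with a .meta extension.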

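The default data block size in Hadoop 1 is 64 MB; in Hadoop 2 it was increased to 128 MB. Beyond that, the file is split according to its size as I said above, so as far as I know there is no tool in Hadoop HDFS to control this. If there is one, please let me know!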

In Hadoop 1 we simply put a file into the cluster as shown below. Suppose the file size is 100 MB:

bin/hadoop fs -put <full-path of the input file till ext> </user/datanode/(target-dir)>

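The Hadoop 1 gateway will divide the file into two blocks (64 MB and 36 MB), whereas Hadoop 2 simply keeps it as a single block. These blocks are then replicated according to your configuration.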
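
The arithmetic behind that split can be sketched in a few lines. This is just an illustration of the division, not HDFS code: BlockSplitDemo and blockSizes are made-up names, and the two block sizes are the defaults mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that only illustrates the block arithmetic.
class BlockSplitDemo {
    static final long MB = 1024L * 1024L;

    // Sizes in bytes of the blocks that a file of fileSize bytes occupies.
    static List<Long> blockSizes(long fileSize, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Long> sizes = new ArrayList<>();
        // Every block is full except possibly the last one.
        for (long remaining = fileSize; remaining > 0; remaining -= blockSize) {
            sizes.add(Math.min(remaining, blockSize));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(blockSizes(100 * MB, 64 * MB));  // Hadoop 1 default block size
        System.out.println(blockSizes(100 * MB, 128 * MB)); // Hadoop 2 default block size
    }
}
```

For the 100 MB file this gives two entries (64 MB and 36 MB, printed in bytes) with the 64 MB block size and a single 100 MB entry with the 128 MB block size.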

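If you are using hadoop to put a jar for a MapReduce job, you can set the number of reduce tasks to 1 on org.apache.hadoop.mapreduce.Job (presumably via its setNumReduceTasks method) in your Java mapper-reducer class, and then export it as a jar to test the MR job, as below.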

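//Setting the Results to Single Target File in Java File inside main method
job.setNumReduceTasks(1); // assumed call; "job" stands for your org.apache.hadoop.mapreduce.Job instance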

Then run the hadoop script, for example:

bin/hadoop jar <full class path of your jar file> <full class path of Main class inside jar> <input directory or file path> <give the output target directory>

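If you are using Sqoop to import data from an RDBMS engine, you can use "-m 1" to get a single-file result, but that is a different matter from your question.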

