Hadoop data blocks and data content



Hadoop breaks the input data into blocks without any regard for the content.
As described in this post:

HDFS does not know (and does not care) what is stored inside the file, so the raw file is not split according to rules a human would understand. A human, for example, would want record boundaries, i.e. the lines showing where a record starts and ends.

The part that is still unclear to me is this: if the data is split purely by size, with no regard for the content, won't that hurt the correctness of later queries? Take the frequently cited example of a list of cities and their daily temperatures. A city could end up in one block and its temperature in another, so how does a map operation query that information correctly? There seems to be something fundamental about blocks and splits that I am missing.
Any help would be appreciated.

A city could be in one block and its temperature could be elsewhere

Yes, that can happen. In this case the record boundary crosses two blocks, and both of them are gathered.

Accuracy is not lost, but performance certainly is, in terms of disk and network IO. When the end of a block is reached before the end of the current record, the next block is read. Even if the split point lies only a few bytes into the following block, those bytes are still streamed and processed.
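
To make the IO point concrete, here is a rough sketch (plain Java; the 128 MB block size and the record offsets are made-up illustrative values) of working out which blocks a record's byte range touches:

// Sketch: which HDFS blocks does a record touch, given its byte range?
// The block size and record offsets below are illustrative values only.
public class RecordToBlocks {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB, the Hadoop 2 default
        long recordStart = blockSize - 10;     // record begins 10 bytes before the block boundary
        long recordLength = 50;                // and runs 50 bytes, so it crosses into the next block

        long firstBlock = recordStart / blockSize;
        long lastBlock = (recordStart + recordLength - 1) / blockSize;

        if (firstBlock == lastBlock) {
            System.out.println("Record sits entirely in block " + firstBlock);
        } else {
            // Both blocks must be read; if the second block lives on another node,
            // that is extra network IO, but the record itself is reassembled intact.
            System.out.println("Record spans blocks " + firstBlock + " to " + lastBlock);
        }
    }
}

The extra read into the second block is at most one record's worth of bytes, which is why correctness is preserved at a small IO cost.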

Let's get into the basics of an ext filesystem (forget HDFS for the time being).
 1. On your hard disk, data is stored in the form of tracks and sectors. When a file is stored, it is not guaranteed that a complete record will fit inside one block (4 KB); a record can span blocks.
 2. The process that reads the file reads a block and then finds the record boundary; a record is a logical entity.
 3. The file saved to the hard disk as bytes has no notion of records or file formats. File formats and records are logical entities.

Apply the same logic to HDFS.
 1. The block size is 128 MB.
 2. Just like the ext filesystem, HDFS has no clue about record boundaries.
 3. What the mappers do is find the record boundaries logically (a simplified sketch follows this list):
    a. The mapper that reads from file offset 0 starts reading from the start of the file, until it finds a '\n'.
    b. Every mapper that does not read the file from offset 0 skips bytes until it reaches a '\n' and only then starts reading. The skipped byte sequence up to that newline may be a complete record or a partial one, and it is consumed by another mapper.
    c. Mappers read the block they are assigned and keep reading until they find a '\n', which may sit in another block rather than the block local to them.
    d. Except for the first mapper, every mapper therefore reads its local block plus the byte sequence from the next block up to the first '\n'.
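
Here is a simplified sketch of that boundary-finding logic, written as plain Java over an in-memory byte array rather than against the real Hadoop record-reader API (the class name, the sample data and the pretend block boundary are all illustrative):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of how a mapper's record reader handles a byte-range split:
// skip a partial first line unless the split starts at offset 0, and keep reading
// past the split's end until the final record's newline is found.
public class SplitReaderSketch {

    static List<String> readSplit(byte[] file, int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;

        // Rule (b): a split that does not start at offset 0 skips bytes up to and
        // including the first '\n'; that partial line belongs to the previous split.
        if (splitStart != 0) {
            while (pos < file.length && file[pos] != '\n') pos++;
            pos++; // step over the '\n'
        }

        // Rules (c)/(d): emit records as long as the record *starts* inside the
        // split; the last record may run past splitEnd into the "next block".
        while (pos < file.length && pos < splitEnd) {
            int lineStart = pos;
            while (pos < file.length && file[pos] != '\n') pos++;
            records.add(new String(file, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "London,12\nParis,9\nTokyo,15\n".getBytes(StandardCharsets.UTF_8);
        int boundary = 12; // pretend the block boundary falls in the middle of "Paris,9"
        System.out.println(readSplit(data, 0, boundary));           // [London,12, Paris,9]
        System.out.println(readSplit(data, boundary, data.length)); // [Tokyo,15]
    }
}

This mirrors the behaviour of Hadoop's own TextInputFormat / LineRecordReader pair: the record that straddles the boundary is processed exactly once, by the split in which it starts.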

Hi Ali,

How the file is split into data blocks is decided at storage time by the Hadoop HDFS gateway. It depends on the Hadoop version (1.x or 2.x) and on the size of the file you are pushing from your local machine to the Hadoop gateway. After the -put command, the gateway breaks the file into blocks and stores them in your DataNode directory /data/dfs/data/current/ (if you are running on a single node, this sits inside the Hadoop directory), in the form blk_<job_process_id>, together with a metadata file that carries the same blk_<job_id> name and a .meta extension.

The data block size in Hadoop 1 is 64 MB; in Hadoop 2 it was increased to 128 MB. Beyond that it depends on the file size, as I said above, so as far as I know there is no tool for this in Hadoop HDFS; if there is one, please let me know!

In Hadoop 1, suppose we simply put a 100 MB file into the cluster as shown below. What happens?

bin/hadoop fs -put <full-path of the input file till ext> </user/datanode/(target-dir)>

The Hadoop 1 gateway will split the file into two blocks (64 MB & 36 MB), while Hadoop 2 will simply keep it as a single block; these blocks are then replicated in sequence according to your configuration.
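
To make the arithmetic explicit, here is a tiny sketch (plain Java, reusing the 100 MB example above) of how many blocks a file occupies under each block size:

// Sketch: number of HDFS blocks for a given file size, using the 100 MB example above.
public class BlockCount {
    static void show(long fileBytes, long blockBytes, String label) {
        long fullBlocks = fileBytes / blockBytes;
        long remainder = fileBytes % blockBytes;
        long blocks = fullBlocks + (remainder > 0 ? 1 : 0);
        System.out.println(label + ": " + blocks + " block(s), last block "
                + (remainder > 0 ? remainder : blockBytes) / (1024 * 1024) + " MB");
    }

    public static void main(String[] args) {
        long file = 100L * 1024 * 1024;                              // the 100 MB file from the example
        show(file, 64L * 1024 * 1024, "Hadoop 1 (64 MB blocks)");    // 2 blocks: 64 MB + 36 MB
        show(file, 128L * 1024 * 1024, "Hadoop 2 (128 MB blocks)");  // 1 block of 100 MB
    }
}

Note that HDFS does not pad the last block: the 100 MB file under Hadoop 2 occupies only 100 MB per replica, not a full 128 MB.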

If you are submitting a jar with Hadoop for a MapReduce job, you can set the number of reduce tasks to 1 through the org.apache.hadoop.mapreduce.Job API in your Java mapper-reducer class, then build the jar and test the MR job as below.

// Write the results to a single target file; this goes inside the main method of the Java driver
job.setNumReduceTasks(1);
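
For context, that call normally sits in the driver's main method. Below is a minimal driver sketch under that assumption; the class name SingleOutputDriver and the word-count-style MyMapper/MyReducer bodies are placeholders, not anything from the question:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal MapReduce driver sketch; the mapper/reducer bodies are word-count-style placeholders.
public class SingleOutputDriver {

    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // placeholder: emit each input line with a count of 1
            context.write(value, ONE);
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "single output example");
        job.setJarByClass(SingleOutputDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One reducer means one output file (part-r-00000) in the target directory.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a single reduce task, the whole result lands in one part-r-00000 file inside the output directory.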

Then run the hadoop jar command, for example:

bin/hadoop jar <full class path of your jar file> <full class path of Main class inside jar> <input directory or file path> <give the output target directory>

If you are importing data from an RDBMS engine with Sqoop, you can use "-m 1" to get a single-file result, but that is a different matter from your question.

I hope my answer gives you some insight into the question. Thanks.
