MapReduce job:
Key1 in file_one = a1, a2, a3, a10, a11, a12; Key2 in file_two is persona1, persona1, persona2, persona3, persona12, persona12, persona3, persona11, persona10.
Merge_file = JOIN file_one BY Key1, file_two BY Key2? (How do I write this…)
The second key contains duplicates. Does that matter?
Thanks
My suggestion is to create a new column in each dataset and join on that column, for example:
A = foreach file_one generate *, SUBSTRING(key1, 1, 100) as join_key1;  -- drop the leading 'a'
B = foreach file_two generate *, SUBSTRING(key2, 7, 100) as join_key2;  -- drop the leading 'persona'
C = join A by join_key1, B by join_key2;
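
On the duplicates question: an inner join emits one output row for every matching pair, so repeated values in Key2 just produce repeated rows in the result. Here is a rough plain-Python simulation of the join above using the keys from the question. It is not Pig, and the function name join_on_suffix and its prefix parameters are made up for illustration.

```python
from collections import defaultdict

# Keys taken from the question.
file_one = ["a1", "a2", "a3", "a10", "a11", "a12"]
file_two = ["persona1", "persona1", "persona2", "persona3", "persona12",
            "persona12", "persona3", "persona11", "persona10"]


def join_on_suffix(left, right, left_prefix="a", right_prefix="persona"):
    """Inner join two key lists on the text that follows each prefix.

    Hypothetical helper: it mimics stripping the prefix with SUBSTRING
    and then joining on the derived column.
    """
    # Group the right-hand rows by their derived join key.
    right_by_key = defaultdict(list)
    for key2 in right:
        right_by_key[key2[len(right_prefix):]].append(key2)

    # Emit one output row per matching (left, right) pair.
    joined = []
    for key1 in left:
        for key2 in right_by_key.get(key1[len(left_prefix):], []):
            joined.append((key1, key2))
    return joined


joined = join_on_suffix(file_one, file_two)
for row in joined:
    print(row)
```

The match is on the exact derived key, so a1 pairs only with persona1 and never with persona10, persona11 or persona12.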