I am trying to do MapReduce on Riak with Erlang. I have data like the following:
Bucket = "Numbers"
{Keys,values} = {Random key,1},{Random key,2}........{Random key,1000}.
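I have stored 1000 values, from 1 to 1000. All the keys were autogenerated by passing the term undefined as the key parameter, so the values across all keys run from 1 to 1000.
I want only the data whose values are even numbers. How can I achieve this using MapReduce?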
You would construct phase functions as described in http://docs.basho.com/riak/latest/dev/advanced/mapreduce/
One possible map function:
Mapfun = fun(Object, _KeyData, _Arg) ->
    %% get the object value, convert to integer and check if even
    Value = list_to_integer(binary_to_term(riak_object:get_value(Object))),
    case Value rem 2 of
        0 -> [Value];
        1 -> []
    end
end.
However, you probably don't want it to fail outright when it encounters siblings:
Mapfun = fun(Object, _KeyData, _Arg) ->
    Values = riak_object:get_values(Object),
    case length(Values) of %% checking for siblings
        1 -> %% only 1 value == no siblings
            I = list_to_integer(binary_to_term(hd(Values))),
            case I rem 2 of
                0 -> [I]; %% value is even
                1 -> []   %% value is odd
            end;
        _ -> [] %% What should happen with siblings?
    end
end.
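There may also be other cases you need to prevent or check for: values containing non-numeric characters, empty values, and deleted values (tombstones), to name a few.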
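A minimal sketch of how such guards might look, kept free of any Riak dependency so it can be tried in a plain Erlang shell. The module name even_filter and both function names are hypothetical. It assumes values were stored as term_to_binary of a digit string, which is what the map functions above expect, and it assumes that a tombstone reaches the map function with an empty value.

```erlang
-module(even_filter).
-export([decode_int/1, even_or_empty/1]).

%% Decode one stored value into an integer.
%% Assumption: the value is term_to_binary/1 of a digit string such as "42".
%% Returns {ok, Int}, or error for empty, corrupt or non-numeric values.
decode_int(Bin) when is_binary(Bin), byte_size(Bin) > 0 ->
    try list_to_integer(binary_to_term(Bin, [safe])) of
        Int -> {ok, Int}
    catch
        error:_ -> error
    end;
decode_int(_) ->
    %% Empty binary (assumed to cover tombstones) or not a binary at all.
    error.

%% Map-phase style helper: [Int] for a decodable even value, [] otherwise.
even_or_empty(Bin) ->
    case decode_int(Bin) of
        {ok, I} when I rem 2 =:= 0 -> [I];
        _ -> []
    end.
```

Inside a map function this could be applied to every sibling with lists:flatmap(fun even_filter:even_or_empty/1, riak_object:get_values(Object)), although the module would then most likely have to be on the code path of each Riak node.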
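Edit:
Note: a full-bucket MapReduce job requires Riak to read every value from disk, which can cause extreme latency and timeouts on a sizeable data set. You probably don't want to do this in production.
Full example of performing MapReduce (limited to the numbers 1 to 200 for space considerations):
Assuming that you have cloned and built the riak-erlang-client
Using the second Mapfun from above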
erl -pa {path-to-riak-erlang-client}/ebin
Define a reduce function to sort the list
Reducefun = fun(List, _) ->
    lists:sort(List)
end.
Attach to the local Riak server
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
Generate some test data
[ riakc_pb_socket:put(
    Pid,
    riakc_obj:new(
      <<"numbers">>,
      list_to_binary("Key" ++ V), V
    )
  ) || V <- [ integer_to_list(Itr) || Itr <- lists:seq(1,200) ] ].
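The function this client uses to perform MapReduce is mapred(pid(), mapred_inputs(), [mapred_queryterm()])
The third argument is a list of phase specifications, each a mapred_queryterm of the form {Type, FunTerm, Arg, Keep} as defined in the readme. For this example, there are two phases: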
- a map phase that selects only the even numbers
{map, {qfun, Mapfun}, none, true}
- a reduce phase that sorts the result
{reduce, {qfun, Reducefun}, none, true}
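Before sending the phases to a cluster, the logic can be sanity checked locally. The following is only a sketch under stated assumptions: the module local_mr and its functions are hypothetical, it works on plain integers rather than riak_object records, and it simply mimics the two phases in order rather than reproducing how Riak actually schedules them.

```erlang
-module(local_mr).
-export([run/1]).

%% Map step: mirrors the even/odd test in Mapfun, but takes a plain
%% integer instead of a riak_object.
map_even(I) when is_integer(I), I rem 2 =:= 0 -> [I];
map_even(_) -> [].

%% Reduce step: same body as Reducefun; the second argument is ignored.
reduce_sort(List, _Arg) ->
    lists:sort(List).

%% Run both steps over a list of integers and return the outputs as
%% [{0, MapOutput}, {1, ReduceOutput}], with both phases treated as kept.
run(Ints) ->
    MapOut = lists:flatmap(fun map_even/1, Ints),
    [{0, MapOut}, {1, reduce_sort(MapOut, none)}].
```

For example, local_mr:run([5,4,3,2,1]) returns [{0,[4,2]},{1,[2,4]}].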
Execute the MapReduce query
{ok,Results} = riakc_pb_socket:mapred(
    Pid,                                  %% The socket pid from above
    <<"numbers">>,                        %% Input is the bucket
    [{map,{qfun,Mapfun},none,true},
     {reduce,{qfun,Reducefun},none,true}]
).
Results will be a list of {PhaseIndex, PhaseOutput} tuples, with a separate entry for each phase where Keep is true. In this example both phases are marked keep, so Results will be [{0, [map phase result]}, {1, [reduce phase result]}]
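Since Results is a list of tuples keyed by phase index, the output of a single phase can be picked out with lists:keyfind/3. This is a small sketch, not part of the client library; the module name mr_results and the function phase_output/2 are hypothetical.

```erlang
-module(mr_results).
-export([phase_output/2]).

%% Return the output of the phase with the given index from a list of
%% {PhaseIndex, PhaseOutput} tuples, or [] if that phase is not present
%% (for example because it was not marked keep).
phase_output(Index, Results) ->
    case lists:keyfind(Index, 1, Results) of
        {Index, Output} -> Output;
        false -> []
    end.
```

With the two-phase query above, mr_results:phase_output(1, Results) would give just the sorted list from the reduce phase.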
Print out the results of each phase:
[ io:format("MapReduce Result of phase ~p:~n~P~n",[P,Result,500])
|| {P,Result} <- Results ].
When I run this, my output is:
MapReduce Result of phase 0:
[182,132,174,128,8,146,18,168,70,98,186,118,50,28,22,112,82,160,114,106,12,26,
124,14,194,64,122,144,172,96,126,162,58,170,108,44,90,104,6,196,40,154,94,
120,76,48,150,52,4,62,140,178,2,142,100,166,192,66,16,36,38,88,102,68,34,32,
30,164,110,42,92,138,86,54,152,116,156,72,134,200,148,46,10,176,198,84,56,78,
130,136,74,190,158,24,184,180,80,60,20,188]
MapReduce Result of phase 1:
[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,
56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,
104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,
142,144,146,148,150,152,154,156,158,160,162,164,166,168,170,172,174,176,178,
180,182,184,186,188,190,192,194,196,198,200]
[ok,ok]