A simple MapReduce function in Python



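I'm trying to get a better understanding of big-data programming, but I know almost nothing about Python. I'm using the MapReduce paradigm in Python to process some text files stored under a directory, say mydir, so my data source is: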

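import glob

global_file = glob.glob("mydir/*")

def file_contents(file_name):
    # Read the whole file; the with block closes it even if read() fails
    with open(file_name) as f:
        return f.read()

# Map each file name to the full text of that file
datasource = dict((file_name, file_contents(file_name)) for file_name in global_file)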

The map function is then:

# each line in each text file is structured as follows: paper-id:::author1::author2::…. ::authorN:::title
def mapfn(k, v):
    for w in v.splitlines():
        separator = w.split('::|:::')
        for x in separator[1:len(separator)-1]:
            for y in separator[-1].split():
                yield x + y, 1

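Here k and v form a key-value pair, where k is the id of a file and v is the contents of that file. (In the end I want the number of occurrences of each word, grouped by author.)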

The problem is that when I run the algorithm, I get an empty array as the result. Is my Python syntax correct?

I partially rewrote your mapfn function with better naming and a correct splitting regular expression, and added a simple test:

import re

# One fake "file" holding three records, one per line
datasource = {
    "foo":(
        "paper-1:::author1::author2::authorN:::title1\n"
        "paper-2:::author21::author22::author23::author2N:::title2\n"
        "paper-3:::author31::author32:::title3"
        )
    }

def mapfn(k, v):
    for line in v.splitlines():
        # Split on runs of two or three colons: id, authors..., title
        data = re.split(r":{2,3}", line)
        words = data[-1].split()
        for author in data[1:-1]:
            for word in words:
                yield author + word, 1

def main():
    for k, v in datasource.items():
        for result in mapfn(k, v):
            print(result)

if __name__ == "__main__":
    main()

This produces the following output:

bruno@betty ~/Work/playground $ python mapf.py 
('author1title1', 1)
('author2title1', 1)
('authorNtitle1', 1)
('author21title2', 1)
('author22title2', 1)
('author23title2', 1)
('author2Ntitle2', 1)
('author31title3', 1)
('author32title3', 1)
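The difference from the original mapfn comes down to the split call. str.split takes a literal separator rather than a pattern, so '::|:::' never matches anything in the line. A small sketch of this, using a made-up sample line (the variable names here are only for illustration):

```python
import re

# Hypothetical sample record in the paper-id:::authors:::title layout
line = "paper-1:::author1::author2:::some title"

# str.split looks for the literal text '::|:::', which never occurs,
# so the whole line comes back as a single element.
literal = line.split('::|:::')
print(literal)

# With only one element, the authors slice is empty and nothing is yielded.
print(literal[1:len(literal) - 1])

# re.split treats the argument as a pattern; ":{2,3}" matches "::" or ":::".
fields = re.split(r":{2,3}", line)
print(fields)

# Note: the pattern '::|:::' would try '::' first and leave a stray ':'
# at the start of some fields, hence the {2,3} form.
stray = re.split('::|:::', line)
print(stray)
```

That empty slice is why the original generator never yields a pair.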

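Not sure this is what you expected, but at least it produces some output. I have no practical experience with MapReduce so far, so you'll have to tell us more about the context and how you run the code, or wait for one of the local MapReduce experts to chime in.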
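In the meantime, here is a rough plain-Python sketch of how the counting per author could fit together, without any real MapReduce framework. It assumes the key is an (author, word) tuple instead of a concatenated string, so the two parts can be told apart later. The names mapfn_pairs, reducefn and run_mapreduce and the sample data are made up for illustration; an actual framework will have its own driver and calling conventions.

```python
import re
from collections import defaultdict

def mapfn_pairs(k, v):
    # Same parsing as mapfn above, but the key is an (author, word) tuple
    for line in v.splitlines():
        data = re.split(r":{2,3}", line)
        for author in data[1:-1]:
            for word in data[-1].split():
                yield (author, word), 1

def reducefn(key, values):
    # Sum the 1s emitted for a given (author, word) key
    return sum(values)

def run_mapreduce(datasource):
    # Hypothetical in-memory driver: group mapped values by key, then reduce
    grouped = defaultdict(list)
    for k, v in datasource.items():
        for key, value in mapfn_pairs(k, v):
            grouped[key].append(value)
    return {key: reducefn(key, values) for key, values in grouped.items()}

# Made-up sample data in the paper-id:::authors:::title layout
sample = {"foo": "p1:::alice::bob:::big data big\np2:::alice:::data"}
counts = run_mapreduce(sample)
for (author, word), n in sorted(counts.items()):
    print(author, word, n)
```

The grouping step in the middle is what a framework would normally do between the map and reduce phases.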
