Python MapReduce: converting text to an array

I have a file like this:

0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

I want the first item of each line to be the key and the remaining items to be the value, as an array. My code doesn't work:

mRDD = rRDD.map(lambda line: (line[0], (np.array(int(line))))).collect()

Expected output:

(3, (1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
(4, (1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))

My latest attempt:

import os.path
import numpy as np
baseDir = os.path.join('data')
inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
fileName = os.path.join(baseDir, inputPath)
reglasRDD = (sc.textFile(fileName, 8)
               .cache()
            )
regRDD = reglasRDD.map(lambda line: line.split('n'))
print regRDD.take(5)
movRDD = regRDD.map(lambda line: (line[0], (int(x) for x in line[1:] if x))).collect()
print movRDD.take(5)

And the error:

PicklingError: Can't pickle <type 'generator'>: attribute lookup __builtin__.generator failed

I finally found a solution:

    import os.path
    import re
    import numpy as np
    baseDir = os.path.join('data')
    inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
    fileName = os.path.join(baseDir, inputPath)
    # One or more non-word characters (commas, spaces) separate the tokens
    split_regex = r'\W+'
    def tokenize(string):
        """ An implementation of input string tokenization
        Args:
            string (str): input string
        Returns:
            list: a list of integer tokens
        """
        s = re.split(split_regex, string)
        # Skip the empty strings that re.split can leave at the ends
        return [int(word) for word in s if word]

    # sc is the SparkContext provided by the PySpark shell
    reglasRDD = (sc.textFile(fileName, 8)
                   .map(tokenize)
                   .cache()
                )
    # First token is the key, the rest of the list is the value
    movRDD = reglasRDD.map(lambda line: (line[0], line[1:]))
    print(movRDD.take(5))

Output:

[(0, (1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
 (1, (1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
 (2, (1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
 (3, (1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
 (4, (1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))]

Thanks!!
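The PicklingError in the question most likely comes from the generator expression inside the lambda: Spark pickles the results of `map`, and a generator object cannot be pickled, while a list can. The sketch below reproduces that difference in plain Python, without Spark. `parse_line` and the sample line are hypothetical names and data used only for illustration.

```python
import pickle


def parse_line(line):
    """Split a comma-separated line into (key, list_of_ints)."""
    # Empty tokens (for example from a trailing comma) are skipped.
    tokens = [int(tok) for tok in line.split(",") if tok.strip()]
    return tokens[0], tokens[1:]


line = "3,1,1,1,0,0,0,1"  # shortened sample row

# A generator expression is lazy and cannot be pickled.
lazy = (int(tok) for tok in line.split(",")[1:])
try:
    pickle.dumps(lazy)
    generator_pickled = True
except (pickle.PicklingError, TypeError):
    generator_pickled = False

# A list holds concrete values, so the (key, values) pair round-trips.
key, values = parse_line(line)
restored = pickle.loads(pickle.dumps((key, values)))

print(generator_pickled)  # False
print(restored)           # (3, [1, 1, 1, 0, 0, 0, 1])
```

Assuming `sc` is an existing SparkContext, the same function could presumably be passed straight to the RDD, as in `sc.textFile(fileName).map(parse_line)`.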

I'm not sure about the rRDD.map().collect() part, but you can easily read the file with np.genfromtxt() and do the mapping with a dict comprehension.

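import numpy as np
# dtype=int keeps the keys and values as integers instead of the default floats
data_array = np.genfromtxt('data.csv', delimiter=',', dtype=int)
data_dict = {first: rest for first, *rest in data_array}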

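The for loop iterates over the rows of the array (one row per line of the file). Unpacking assigns the first element to first and the rest of the row to rest. Note that this extended unpacking is new in Python 3! If you are using Python 2, change the dict comprehension slightly: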

data_dict = {row[0]:row[1:] for row in data_array}
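To see what the two comprehensions above produce without needing a file, here is a small sketch on made-up rows. The plain lists in `rows` are an assumption standing in for `data_array`; with NumPy the values would be arrays rather than lists.

```python
# Made-up rows standing in for data_array; the first column is the key.
rows = [
    [3, 1, 1, 0],
    [4, 1, 0, 1],
]

# Python 3 only: extended unpacking splits each row into head and tail.
by_unpacking = {first: rest for first, *rest in rows}

# Python 2 and 3: plain indexing and slicing give the same mapping.
by_slicing = {row[0]: row[1:] for row in rows}

print(by_unpacking)                # {3: [1, 1, 0], 4: [1, 0, 1]}
print(by_unpacking == by_slicing)  # True
```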

The following (unoptimized) code may point you in the right direction:

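with open("tmp.txt", "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split off the key at the first comma so multi-digit keys work too
        first, _, rest = line.partition(",")
        first = int(first)
        # Convert the remaining fields, ignoring empty ones
        rest = tuple(int(x) for x in rest.split(",") if x.strip())
        tup = (first, rest)
        print(tup)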
