I have a file like this:
0, 1, 1, 1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
1, 1, 1, -1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
3, 1, 1, -1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
I want to turn each line into a pair: the first item as the key, and the remaining items as an array of values. My code doesn't work:
mRDD = rRDD.map(lambda line: (line[0], (np.array(int(line))))).collect()
Expected output:
(3, (1,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
(4, (1,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
My latest attempt:
import os.path
import numpy as np
baseDir = os.path.join('data')
inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
fileName = os.path.join(baseDir, inputPath)
reglasRDD = (sc.textFile(fileName, 8)
.cache()
)
regRDD = reglasRDD.map(lambda line: line.split('\n'))
print regRDD.take(5)
movRDD = regRDD.map(lambda line: (line[0], (int(x) for x in line[1:] if x))).collect()
print movRDD.take(5)
And the error:
PicklingError: Can't pickle <type 'generator'>: attribute lookup __builtin__.generator failed
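The error comes from the generator expression in the lambda: Spark has to pickle the values it ships between processes, and Python generators cannot be pickled. A minimal sketch outside Spark (plain pickle, no SparkContext needed, shown in Python 3 syntax) illustrates the difference between a generator and a materialized tuple:

```python
import pickle

line = ['3', '1', '1', '1', '0']

# A generator expression cannot be pickled -- this is what the
# lambda inside the map() produced, hence the PicklingError.
gen = (int(x) for x in line[1:] if x)
try:
    pickle.dumps(gen)
except (TypeError, pickle.PicklingError) as exc:
    print("generator: %s" % exc)

# Materializing the values as a tuple (or list) makes the pair picklable.
pair = (line[0], tuple(int(x) for x in line[1:] if x))
print(pickle.loads(pickle.dumps(pair)))  # ('3', (1, 1, 1, 0))
```

The same fix applies inside the Spark lambda: wrap the generator in `tuple(...)` or use a list comprehension.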
In the end I arrived at this solution:
import os.path
import re
import numpy as np

baseDir = os.path.join('data')
inputPath = os.path.join('mydata', 'matriz_reglas_test.csv')
fileName = os.path.join(baseDir, inputPath)

split_regex = r'\W+'

def tokenize(string):
    """ An implementation of input string tokenization
    Args:
        string (str): input string
    Returns:
        list: a list of tokens
    """
    s = re.split(split_regex, string)
    return [int(word) for word in s if word]
reglasRDD = (sc.textFile(fileName, 8)
.map(tokenize)
.cache()
)
movRDD = reglasRDD.map(lambda line: (line[0], (line[1:])))
print movRDD.take(5)
Output: [(0, (1, 1, 1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), (1, (1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), (2, (1, -1, 0, 0, 0, 0, 0, 0, 0)), (3, (1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), (4, (1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))]
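Without a SparkContext at hand, the tokenize-then-pair pipeline can be checked on plain Python lists. This sketch reuses the r'\W+' regex and two shortened sample lines (an assumption, not the real file); note that \W+ also swallows a leading minus sign, so a -1 in the data would come through as 1:

```python
import re

split_regex = r'\W+'

def tokenize(string):
    """Split a CSV line on non-word characters, keep the integer tokens."""
    return [int(word) for word in re.split(split_regex, string) if word]

# Shortened stand-ins for lines of matriz_reglas_test.csv.
lines = [
    "0, 1, 1, 1, 1, 0, 0, 0",
    "3, 1, 1, 1, 0, 0, 1, 0",
]

# Equivalent of reglasRDD.map(tokenize) followed by the (key, values) map.
pairs = [(toks[0], tuple(toks[1:])) for toks in map(tokenize, lines)]
print(pairs)  # [(0, (1, 1, 1, 1, 0, 0, 0)), (3, (1, 1, 1, 0, 0, 1, 0))]
```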
Thanks!!
I'm not sure about the rRDD.map().collect() part, but you can easily read the file with np.genfromtxt() and do the mapping with a dictionary comprehension.
data_array = np.genfromtxt('data.csv', delimiter=',')
data_dict = {first:rest for first, *rest in data_array}
The for loop iterates over the rows of the array (one row per line of the file). The unpacking assigns the first element to first and the rest of the row to rest. Note that this star-unpacking is new in Python 3! If you use Python 2, you can change the dictionary comprehension slightly:
data_dict = {row[0]:row[1:] for row in data_array}
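As a self-contained check of the genfromtxt route (using io.StringIO in place of the real file, which is an assumption here), keep in mind that np.genfromtxt yields float rows, so the keys come out as 0.0, 1.0, ... unless you cast them:

```python
import io
import numpy as np

text = "0, 1, 1, 0\n1, 1, 0, 1\n"  # stand-in for data.csv
data_array = np.genfromtxt(io.StringIO(text), delimiter=',')

# Python 2 style comprehension; int() casts the float keys back to ints.
data_dict = {int(row[0]): tuple(int(v) for v in row[1:]) for row in data_array}
print(data_dict)  # {0: (1, 1, 0), 1: (1, 0, 1)}
```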
The following (unoptimized) code may set you on the right path:
with open("tmp.txt", "r") as f:
    for line in f:
        line = line.strip()
        first = int(line[0])
        rest = line[1:].split(",")
        rest = tuple([int(x) for x in rest if x])
        tup = (first, (rest))
        print tup