Filter an RDD based on some condition



I have an RDD that looks like this:

[[u'100=NO', u'101=OR', u'102=-0.00955461556684', u'103=0.799738137456', u'104=-0.619426440691', u'105=-0.505799761741', u'106=1.06018348173', u'107=-0.203731351216', u'108=0.242253668965', u'109=20170411', u'110=14:47:54'], [u'100=NO', u'101=OR', u'102=1.09790894815', u'103=-0.591742622246', u'104=0.60404467739', u'105=-0.729487378829', u'106=-0.41507842821', u'107=-1.01921955205', u'108=-0.153191948561', u'109=20170411', u'110=14:47:56'], [u'100=NO', u'101=OR', u'102=-0.0845031955962', u'103=0.428040384808', u'104=0.0579505934162', u'105=0.893705789837', u'106=-0.544258436965', u'107=1.10990090862', u'108=0.740638990995', u'109=20170411', u'110=14:47:58'], [u'100=NO', u'101=ORCL', u'102=1.20406493416', u'103=-0.275962563807', u'104=-0.728142212616', u'105=2.04751448847', u'106=2.10361125056', u'107=0.588650303087', u'108=-0.693327897822', u'109=20170411', u'110=14:48:00']]

I want to remove everything before the "=" sign (and the sign itself) from every element of the RDD.

I tried the following:

rdd.filter(lambda x : str(x[6]).split("=",1)[-1])

But that only looks at index 6, and I want to strip these characters from every index of the RDD.

Expected RDD:

[[u'NO', u'OR', u'-0.00955461556684', u'0.799738137456', u'-0.619426440691', u'-0.505799761741', u'1.06018348173', u'-0.203731351216', u'0.242253668965', u'20170411', u'14:47:54'], [u'NO', u'OR', u'1.09790894815', u'-0.591742622246', u'0.60404467739', u'-0.729487378829', u'-0.41507842821', u'-1.01921955205', u'-0.153191948561', u'20170411', u'14:47:56'], [u'NO', u'OR', u'-0.0845031955962', u'0.428040384808', u'0.0579505934162', u'0.893705789837', u'-0.544258436965', u'1.10990090862', u'0.740638990995', u'20170411', u'14:47:58'], [u'100=NO', u'101=ORCL', u'102=1.20406493416', u'-0.275962563807', u'-0.728142212616', u'2.04751448847', u'2.10361125056', u'0.588650303087', u'-0.693327897822', u'20170411', u'14:48:00']]

This is more than filtering: the data has to be modified, so filter does not seem to be the right tool.

Try a nested list comprehension with sc.parallelize:

 RDD = sc.parallelize([[i.split('=')[1] for i in j] for j in RDD.toLocalIterator()])
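As a possible alternative, here is a sketch that uses map instead. toLocalIterator() brings every row back to the driver before the new RDD is built, so on a large dataset it may be preferable to let the executors do the work. The helper names strip_keys and strip_keys_rdd are hypothetical, and the code assumes each row is a list of strings. split('=', 1) splits only at the first "=", and taking [-1] leaves an item unchanged if it has no "=" at all.

```python
def strip_keys(row):
    """Drop everything up to and including the first '=' in each item."""
    return [item.split('=', 1)[-1] for item in row]


def strip_keys_rdd(rdd):
    """Apply strip_keys to every row (assumes rdd is a PySpark RDD of lists)."""
    return rdd.map(strip_keys)


# Plain-Python check on part of one row from the question; no Spark needed.
sample = [u'100=NO', u'101=OR', u'109=20170411', u'110=14:47:54']
print(strip_keys(sample))
```

With a real RDD this would be used as RDD = strip_keys_rdd(RDD), followed by RDD.collect() to inspect the result.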

Hi, I'm new to programming, but I think this could also be solved with a regular expression. I tried something like this:

import re
test=[[u'100=NO', u'101=OR', u'102=-0.00955461556684', u'103=0.799738137456', u'104=-0.619426440691', u'105=-0.505799761741', u'106=1.06018348173', u'107=-0.203731351216', u'108=0.242253668965', u'109=20170411', u'110=14:47:54'], [u'100=NO', u'101=OR', u'102=1.09790894815', u'103=-0.591742622246', u'104=0.60404467739', u'105=-0.729487378829', u'106=-0.41507842821', u'107=-1.01921955205', u'108=-0.153191948561', u'109=20170411', u'110=14:47:56'], [u'100=NO', u'101=OR', u'102=-0.0845031955962', u'103=0.428040384808', u'104=0.0579505934162', u'105=0.893705789837', u'106=-0.544258436965', u'107=1.10990090862', u'108=0.740638990995', u'109=20170411', u'110=14:47:58'], [u'100=NO', u'101=ORCL', u'102=1.20406493416', u'103=-0.275962563807', u'104=-0.728142212616', u'105=2.04751448847', u'106=2.10361125056', u'107=0.588650303087', u'108=-0.693327897822', u'109=20170411', u'110=14:48:00']]
result = re.sub(r"[u]'d+", r"", test)
print(result)

But it raises an error along the lines of: expected string or bytes-like object. I'd be glad if someone could explain.
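A likely explanation, offered as a sketch rather than a definitive answer: re.sub expects a string (or bytes) as its third argument, but test is a list of lists, which is what the error message is complaining about. The pattern also looks off: [u]'d+ matches a literal u, a quote and one or more letters d, whereas the u'...' prefix is only how Python 2 displays unicode strings and is not part of the data. The version below applies the substitution to each item separately. The pattern ^[^=]*= is my assumption of what is wanted (everything up to and including the first "="), and the data is a shortened copy of the sample above.

```python
import re

# Shortened sample of the data from the question.
test = [[u'100=NO', u'101=OR', u'102=-0.00955461556684', u'110=14:47:54'],
        [u'100=NO', u'101=ORCL', u'102=1.20406493416', u'110=14:48:00']]

# Matches everything from the start of the string through the first '='.
prefix = re.compile(r'^[^=]*=')

# re.sub works on one string at a time, so walk the nested list item by item.
result = [[prefix.sub('', item) for item in row] for row in test]
print(result)
```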
