I have a DataFrame with a column that looks like this:
colA
['work', 'time', 'money', 'home', 'good', 'financial']
['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']
['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']
['first', 'eat', 'good', 'hungry', 'empty', 'fool']
['meet', 'risk', 'fire', 'angry', 'go']
colA contains strings, NOT lists. I have a list like this:
word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool', 'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']
I want to keep only the words that appear in that list. So the result should look like this:
colA
['time', 'good']
['lazy', 'good', 'sad', 'dizzy', 'go']
['happy', 'feel', 'youth', 'past']
['eat', 'good', 'hungry', 'empty', 'fool']
['angry', 'go']
I tried using str.contains, but got an error:
contains() takes from 2 to 6 positional arguments but 18 were given
I'm just a beginner, sorry.
Use ast.literal_eval with a list comprehension to filter the matching values:
import ast
s = set(word)
df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
print (df)
colA
0 ['work', 'time', 'money', 'home', 'good', 'fin...
1 ['school', 'lazy', 'good', 'math', 'sad', 'imp...
2 ['frame', 'happy', 'feel', 'youth', 'change', ...
3 ['first', 'eat', 'good', 'hungry', 'empty', 'f...
4 ['meet', 'risk', 'fire', 'angry', 'go']
new
0 [time, good]
1 [lazy, good, sad, dizzy, go]
2 [happy, feel, youth, past]
3 [eat, good, hungry, empty, fool]
4 [angry, go]
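For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It assumes, as the question states, that each cell of colA holds the string representation of a list; the sample DataFrame is rebuilt by hand from the question's data.

```python
import ast

import pandas as pd

# Sample data rebuilt from the question: every cell is a *string*, not a list.
df = pd.DataFrame({
    "colA": [
        "['work', 'time', 'money', 'home', 'good', 'financial']",
        "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
        "['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']",
        "['first', 'eat', 'good', 'hungry', 'empty', 'fool']",
        "['meet', 'risk', 'fire', 'angry', 'go']",
    ]
})

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy',
        'fool', 'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free',
        'time', 'hungry']

# A set gives constant-time membership checks.
s = set(word)

# Parse each string into a real list, then keep only the words found in s.
df["new"] = df["colA"].apply(
    lambda x: [y for y in ast.literal_eval(x) if y in s]
)

print(df["new"].tolist())
```

The original word order within each row is preserved, because the comprehension walks the parsed list rather than the set.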
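Performance comparison: with this data, apply is marginally faster than a pure list comprehension, although the difference is within the measurement noise: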
df = pd.concat([df] * 10000, ignore_index=True)
In [26]: %timeit df['colB'] = [[w for w in ast.literal_eval(l) if w in s] for l in df['colA']]
845 ms ± 32.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %timeit df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
826 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use ast.literal_eval in a list comprehension (faster than apply):
from ast import literal_eval
# use a set for efficient membership tests (`x in some_list` scans the whole list)
S = set(word)
df['colA'] = [str([w for w in literal_eval(l) if w in S]) for l in df['colA']]
Note: the output here is a string. If you want a list, use: df['colA'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']]
Output:
colA
0 ['time', 'good']
1 ['lazy', 'good', 'sad', 'dizzy', 'go']
2 ['happy', 'feel', 'youth', 'past']
3 ['eat', 'good', 'hungry', 'empty', 'fool']
4 ['angry', 'go']
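To make the string-versus-list distinction from the note concrete, here is a small sketch in plain Python (no pandas needed). The names `cell`, `as_list`, and `as_str` are only illustrative, and the word list is shortened for the example.

```python
from ast import literal_eval

word = ['good', 'time', 'hungry']  # shortened word list, for illustration only
S = set(word)

# One cell of colA: a string that merely looks like a list.
cell = "['work', 'time', 'money', 'home', 'good', 'financial']"

as_list = [w for w in literal_eval(cell) if w in S]  # a real Python list
as_str = str(as_list)                                # its string representation

print(type(as_list).__name__, as_list)
print(type(as_str).__name__, as_str)

# Indexing behaves differently: the list yields a word, the string a character.
print(as_list[0], as_str[0])
```

Storing real lists is usually more convenient if the column will be processed further; storing strings keeps the column in the same format as the input.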
Timing
The list comprehension is noticeably faster than apply (tested on pandas 1.5):
df = pd.concat([df]*10000, ignore_index=True)
%%timeit
df['new'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']]
674 ms ± 69.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['new'] = df['colA'].apply(lambda x: [y for y in literal_eval(x) if y in S])
1.04 s ± 67.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
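The two answers report different results, so it may be worth measuring on your own machine. Below is a rough sketch using the standard timeit module instead of the IPython magics; the helper names `with_comprehension` and `with_apply` are made up for this example, and the frame is repeated only 1000 times to keep the run short. Absolute numbers will vary with hardware and pandas version.

```python
import timeit
from ast import literal_eval

import pandas as pd

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy',
        'fool', 'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free',
        'time', 'hungry']
S = set(word)

base = pd.DataFrame({
    "colA": [
        "['work', 'time', 'money', 'home', 'good', 'financial']",
        "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
        "['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']",
        "['first', 'eat', 'good', 'hungry', 'empty', 'fool']",
        "['meet', 'risk', 'fire', 'angry', 'go']",
    ]
})
# Smaller than the 10000 repetitions used in the timings above.
df = pd.concat([base] * 1000, ignore_index=True)


def with_comprehension():
    """Filter every row with a plain list comprehension."""
    return [[w for w in literal_eval(l) if w in S] for l in df["colA"]]


def with_apply():
    """Filter every row with Series.apply."""
    return df["colA"].apply(
        lambda x: [y for y in literal_eval(x) if y in S]
    ).tolist()


# Both approaches should produce identical results.
assert with_comprehension() == with_apply()

t_comp = timeit.timeit(with_comprehension, number=3)
t_apply = timeit.timeit(with_apply, number=3)
print(f"list comprehension: {t_comp:.3f} s for 3 runs")
print(f"apply:              {t_apply:.3f} s for 3 runs")
```

In both approaches most of the work is the per-row literal_eval call, so any difference between them comes from the overhead around that call.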