How to keep only specific strings in a pandas DataFrame column



I have a DataFrame with a column that looks like this:

colA    
['work', 'time', 'money', 'home', 'good', 'financial']    
['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']    
['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']    
['first', 'eat', 'good', 'hungry', 'empty', 'fool']    
['meet', 'risk', 'fire', 'angry', 'go']    

The values in colA are strings, NOT lists. I also have a list like this:

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool', 'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']   

I want to keep only the words that appear in that list. So the result should look like this:

colA    
['time', 'good']    
['lazy', 'good', 'sad', 'dizzy', 'go']    
['happy', 'feel', 'youth', 'past']     
['eat', 'good', 'hungry', 'empty', 'fool']    
['angry', 'go']    

I tried using str.contains, but I got an error:

contains() takes from 2 to 6 positional arguments but 18 were given    

I'm just a beginner, sorry.

Use ast.literal_eval with a list comprehension to filter the matching values:

import ast
s = set(word)
df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
print (df)
colA  
0  ['work', 'time', 'money', 'home', 'good', 'fin...   
1  ['school', 'lazy', 'good', 'math', 'sad', 'imp...   
2  ['frame', 'happy', 'feel', 'youth', 'change', ...   
3  ['first', 'eat', 'good', 'hungry', 'empty', 'f...   
4        ['meet', 'risk', 'fire', 'angry', 'go']       
new  
0                      [time, good]  
1      [lazy, good, sad, dizzy, go]  
2        [happy, feel, youth, past]  
3  [eat, good, hungry, empty, fool]  
4                       [angry, go] 
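For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question; the helper name `keep_listed_words` and the variable `keep` are made up for illustration and are not part of the answer above.

```python
import ast

import pandas as pd

# Sample data copied from the question: each cell is a *string* that looks like a list.
df = pd.DataFrame({
    "colA": [
        "['work', 'time', 'money', 'home', 'good', 'financial']",
        "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
        "['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']",
        "['first', 'eat', 'good', 'hungry', 'empty', 'fool']",
        "['meet', 'risk', 'fire', 'angry', 'go']",
    ]
})

word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool',
        'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']

# A set gives constant-time membership tests.
keep = set(word)


def keep_listed_words(cell: str) -> list:
    """Parse the list-like string and keep only the words found in `keep`."""
    return [w for w in ast.literal_eval(cell) if w in keep]


# Hypothetical helper applied row by row; the original order of words is preserved.
df["new"] = df["colA"].apply(keep_listed_words)
print(df["new"].tolist())
```

Note that ast.literal_eval only parses Python literals, so it should raise an error on a cell that is not a valid list literal instead of silently skipping it.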

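Performance comparison: with this data, apply is marginally faster than a plain list comprehension (the gap is within the reported standard deviation):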

df = pd.concat([df] * 10000, ignore_index=True)

In [26]: %timeit df['colB'] = [[w for w in ast.literal_eval(l) if w in s] for l in df['colA']]
845 ms ± 32.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %timeit df['new'] = df['colA'].apply(lambda x: [y for y in ast.literal_eval(x) if y in s])
826 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can use ast.literal_eval in a list comprehension (faster than apply):

from ast import literal_eval
# use a set for efficiency (membership tests such as `x in some_list` are slow)
S = set(word)
df['colA'] = [str([w for w in literal_eval(l) if w in S]) for l in df['colA']]

Note: the output here is a string. If you want a list instead, use: df['colA'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']]

Output:

                                         colA
0                            ['time', 'good']
1      ['lazy', 'good', 'sad', 'dizzy', 'go']
2          ['happy', 'feel', 'youth', 'past']
3  ['eat', 'good', 'hungry', 'empty', 'fool']
4                             ['angry', 'go']
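To make the string-versus-list distinction concrete, here is a small sketch on two made-up rows. The column names `as_text` and `as_list` and the shortened word set are assumptions for illustration only.

```python
from ast import literal_eval

import pandas as pd

# Two illustrative rows (not the full data from the question).
df = pd.DataFrame({"colA": ["['work', 'time', 'good', 'go']", "['meet', 'angry', 'go']"]})
S = {"time", "good", "angry", "go"}

# Variant 1: wrap the filtered list in str(), so each cell stays a string.
df["as_text"] = [str([w for w in literal_eval(l) if w in S]) for l in df["colA"]]

# Variant 2: store the filtered list itself, so each cell is a Python list.
df["as_list"] = [[w for w in literal_eval(l) if w in S] for l in df["colA"]]

print(type(df.loc[0, "as_text"]).__name__)
print(type(df.loc[0, "as_list"]).__name__)
```

The string variant keeps the column in the same form as the input, while the list variant is usually more convenient if you plan to process the words further.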

Timing

The list comprehension is noticeably faster than apply (tested on pandas 1.5):

df = pd.concat([df]*10000, ignore_index=True)
%%timeit
df['new'] = [[w for w in literal_eval(l) if w in S] for l in df['colA']]
674 ms ± 69.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['new'] = df['colA'].apply(lambda x: [y for y in literal_eval(x) if y in S])
1.04 s ± 67.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
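Since the two sets of timings in this thread point in different directions, it may be worth measuring on your own machine. The following is a rough sketch using the standard timeit module instead of the IPython magics; the function names and the 1,000x repetition factor (smaller than the 10,000x used above) are assumptions chosen to keep the run short.

```python
import timeit
from ast import literal_eval

import pandas as pd

rows = [
    "['work', 'time', 'money', 'home', 'good', 'financial']",
    "['school', 'lazy', 'good', 'math', 'sad', 'important', 'dizzy', 'go']",
    "['frame', 'happy', 'feel', 'youth', 'change', 'home', 'past']",
    "['first', 'eat', 'good', 'hungry', 'empty', 'fool']",
    "['meet', 'risk', 'fire', 'angry', 'go']",
]
word = ['good', 'sad', 'angry', 'feel', 'empty', 'dizzy', 'go', 'happy', 'fool',
        'eat', 'past', 'lazy', 'youth', 'old', 'enjoy', 'free', 'time', 'hungry']
S = set(word)

# Assumed size: 5 rows repeated 1,000 times.
df = pd.DataFrame({"colA": rows * 1000})


def with_comprehension():
    return [[w for w in literal_eval(l) if w in S] for l in df["colA"]]


def with_apply():
    return df["colA"].apply(lambda x: [y for y in literal_eval(x) if y in S]).tolist()


# Report the best of five single runs for each variant.
for fn in (with_comprehension, with_apply):
    best = min(timeit.repeat(fn, number=1, repeat=5))
    print(f"{fn.__name__}: {best:.3f} s")
```

Both variants spend most of their time inside literal_eval, so the difference between them is likely to be small and may vary with the pandas version and the data.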
