I have two DataFrames, for example:
df1:
Category Keywords
0 Fruit ['apple', 'pear', 'plum', 'grape']
1 Color ['red', 'purple', 'green']
df2:
Items
0 plum
1 purple
2 pear
3 orange
4 apple
5 rainbow
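Whenever a value in df2 matches any keyword in the df1 keyword lists, I want to move that value into a new list or DataFrame; that is, take it out of df2 and put it into df3. The result would look like this: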
df2:
Items
0 orange
1 rainbow
df3:
Items
0 plum
1 purple
2 pear
3 apple
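or a list of items, such as [plum, purple, pear, apple].
A similar but not identical question is: using keywords from a DataFrame to detect whether they are present in another DataFrame or string.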
Edit: items such as "pears" or "pearl" should still be identified for the keyword "pear".
items_list = df1['Keywords'].tolist()
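# Flatten the list of keyword lists into one flat list
items_list = [item for sub_list in items_list for item in sub_list]
# Build the mask once, before df2 is reassigned (exact matches only)
mask = df2['Items'].isin(items_list)
df3 = df2.loc[mask]    # items found among the keywords
df2 = df2.loc[~mask]   # items left over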
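Note that isin() tests exact equality only, so it would not pick up an item such as "pearl" for the keyword "pear". A minimal sketch of that limitation, using a small made-up df2 and a hand-written keyword list (both are assumptions for illustration, not the data above):

```python
import pandas as pd

# Hypothetical sample data, chosen only to show exact vs. partial matches.
df2 = pd.DataFrame({"Items": ["plum", "pearl", "orange"]})
keywords = ["apple", "pear", "plum", "grape"]

# isin() compares whole values, so "pearl" is not matched by "pear".
exact = df2["Items"].isin(keywords)

found = df2.loc[exact, "Items"].tolist()
left_over = df2.loc[~exact, "Items"].tolist()
print(found)      # ['plum']
print(left_over)  # ['pearl', 'orange']
```

If partial matches are required, a substring or regular-expression test is needed instead of isin().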
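You can use str.contains() with a regular expression that joins the keywords with |, so that an item matches if it contains any one of them. I am also using explode() to flatten the keyword lists into a single list.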
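import re
import pandas as pd
c = ['Category', 'Keywords']
d = [['Fruit', ['apple', 'pear', 'plum', 'grape']],
     ['Color', ['red', 'purple', 'green']]]
df1 = pd.DataFrame(d, columns=c)
df2 = pd.DataFrame({'Items': ['plum', 'purple', 'pear', 'orange',
                              'apple', 'rainbow', 'pearl', 'pears',
                              'peary', 'pineapple', 'plumber']})
print(df1)
print(df2)
# Flatten the keyword lists; a single explode() is enough
keywords = df1.Keywords.explode().to_list()
# Build an alternation such as apple|pear|...; re.escape guards against
# regex metacharacters, and leaving out the capture group avoids a pandas warning
pattern = '|'.join(re.escape(k) for k in keywords)
# True where an item contains any keyword as a substring
mask = df2.Items.str.contains(pattern)
df3 = df2[mask]
df2 = df2[~mask]
print(df2)
print(df3)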
This will give you:
Original df1:
Category Keywords
0 Fruit [apple, pear, plum, grape]
1 Color [red, purple, green]
Original df2:
Items
0 plum
1 purple
2 pear
3 orange
4 apple
5 rainbow
6 pearl
7 pears
8 peary
9 pineapple
10 plumber
New df3 (all items that contain a keyword):
Items
0 plum
1 purple
2 pear
4 apple
6 pearl
7 pears
8 peary
9 pineapple
10 plumber
Updated df2:
Items
3 orange
5 rainbow
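If it is also useful to know which keyword (and which Category) caused each match, or to get the plain list of items asked for in the question, the same alternation pattern can be used with str.extract(). The following is only a sketch under stated assumptions: the shortened df2, the Keyword and Category output columns, and the keyword_to_category helper are made up for illustration and are not part of the answer above.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "Category": ["Fruit", "Color"],
    "Keywords": [["apple", "pear", "plum", "grape"],
                 ["red", "purple", "green"]],
})
# Shortened, hypothetical df2 for illustration.
df2 = pd.DataFrame({"Items": ["plum", "purple", "orange", "pearl", "pineapple"]})

# One row per keyword, keeping the category it came from.
lookup = df1.explode("Keywords")
keyword_to_category = dict(zip(lookup["Keywords"], lookup["Category"]))

# A single capture group so that str.extract() can return the matched keyword.
pattern = "(" + "|".join(re.escape(k) for k in keyword_to_category) + ")"

# expand=False yields a Series: the first keyword found, or NaN if none.
matched = df2["Items"].str.extract(pattern, expand=False)

df3 = df2[matched.notna()].assign(
    Keyword=matched,
    Category=matched.map(keyword_to_category),
)
df2 = df2[matched.isna()]

items_found = df3["Items"].tolist()
print(df3)
print(items_found)
```

Only the first keyword found in each item is recorded, so an item containing several keywords would be labeled with the leftmost match.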