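I have a DataFrame with a column named text, and I want to assign a value in a new column when the text in that column contains one or more substrings from a dictionary. If the text column contains one of the substrings, the corresponding dictionary key should be assigned to a new column, category.
This is my code: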
import pandas as pd

some_strings = ['Apples and pears and cherries and bananas',
                'VW and Ford and Lamborghini and Chrysler and Hyundai',
                'Berlin and Paris and Athens and London']
categories = ['fruits', 'cars', 'capitals']
test_df = pd.DataFrame(some_strings, columns=['text'])
cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}
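The dictionary cat_map has sets of strings as its values. If the text column of test_df contains any of these words, I want the matching dictionary key to be assigned as the value in the new category column. The output DataFrame should look like this: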
output_frame = pd.DataFrame({'text': some_strings,
                             'category': categories})
Any help would be appreciated.
You can try
d = {v: k for k, s in cat_map.items() for v in s}
test_df['category'] = (test_df['text'].str.extractall('(' + '|'.join(d) + ')')
                       [0].map(d)
                       .groupby(level=0).agg(set))
print(d)
{'cherries': 'fruits', 'pears': 'fruits', 'bananas': 'fruits', 'apples': 'fruits', 'Chrysler': 'cars', 'Hyundai': 'cars', 'Lamborghini': 'cars', 'Ford': 'cars', 'VW': 'cars', 'Berlin': 'capitals', 'Athens': 'capitals', 'London': 'capitals', 'Paris': 'capitals'}
print(test_df)
                                                   text    category
0             Apples and pears and cherries and bananas    {fruits}
1  VW and Ford and Lamborghini and Chrysler and Hyundai      {cars}
2                Berlin and Paris and Athens and London  {capitals}
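To see what each step of this answer contributes, the chain can be split into intermediate results. The following is only a sketch that rebuilds the question's data so it runs on its own; the names `pattern`, `matches` and `labels` are introduced here for illustration, and `re.escape` is an added precaution that the answer itself does not use.

```python
import re
import pandas as pd

test_df = pd.DataFrame({"text": [
    "Apples and pears and cherries and bananas",
    "VW and Ford and Lamborghini and Chrysler and Hyundai",
    "Berlin and Paris and Athens and London",
]})
cat_map = {
    "fruits": {"apples", "pears", "cherries", "bananas"},
    "cars": {"VW", "Ford", "Lamborghini", "Chrysler", "Hyundai"},
    "capitals": {"Berlin", "Paris", "Athens", "London"},
}

# Invert the mapping so every word points to its category.
d = {word: label for label, words in cat_map.items() for word in words}

# One alternation over all words; re.escape (an addition) guards against
# words that contain regex metacharacters.
pattern = "(" + "|".join(re.escape(word) for word in d) + ")"

# extractall yields one row per match, indexed by (original row, match number).
matches = test_df["text"].str.extractall(pattern)[0]

# Translate each matched word to its label, then collect the labels per original row.
labels = matches.map(d).groupby(level=0).agg(set)
test_df["category"] = labels
print(test_df)
```

The match is case-sensitive, so "Apples" in the first row is not matched by 'apples'; that row is still labelled because 'pears', 'cherries' and 'bananas' match.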
I'm not sure what you are trying to achieve, but if I understand correctly, you can check whether any word in each string exists in your cat_map:
import pandas as pd

results = {"text": [], "category": []}
for element in some_strings:
    # .items() yields (key, value) pairs; iterating the dict alone yields only keys
    for key, value in cat_map.items():
        # Check whether any word of the current string is in the current category
        if set(element.split(' ')).intersection(value):
            results["text"].append(element)
            results["category"].append(key)
df = pd.DataFrame.from_dict(results)
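The membership test at the heart of this loop can also be pulled out into a small helper, which makes it easy to try on single strings. This is a sketch only; `matching_categories` is a hypothetical name and not part of the answer.

```python
def matching_categories(text, cat_map):
    """Return the keys of cat_map whose word set shares at least one word with text."""
    words = set(text.split())
    # A non-empty set intersection means at least one word belongs to that category.
    return [key for key, vocab in cat_map.items() if words & vocab]


cat_map = {
    "fruits": {"apples", "pears", "cherries", "bananas"},
    "cars": {"VW", "Ford", "Lamborghini", "Chrysler", "Hyundai"},
    "capitals": {"Berlin", "Paris", "Athens", "London"},
}

print(matching_categories("Berlin and Paris and Athens and London", cat_map))
print(matching_categories("A Ford parked in Berlin", cat_map))
```

Because it returns a list, a string that mentions words from several categories yields several keys, in the order the keys appear in cat_map.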
One approach:
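# Invert cat_map so each word maps to its category label.
lookup = {word: label for label, words in cat_map.items() for word in words}
# \b anchors each alternative at word boundaries.
pattern = fr"\b({'|'.join(lookup)})\b"
test_df["category"] = test_df["text"].str.extract(pattern, expand=False).map(lookup)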
print(test_df)
Output
                                                text  category
0          Apples and pears and cherries and bananas    fruits
1  VW and Ford and Lamborghini and Chrysler and H...      cars
2             Berlin and Paris and Athens and London  capitals
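Note that this pattern is case-sensitive: the first row is labelled because 'pears' matches, not because of "Apples". If rows may contain only differently cased words, one possible variant (a sketch, not part of the answer) lowercases the lookup keys and passes `flags=re.IGNORECASE` to `str.extract`. The frame `demo_df` and its rows are made up here for illustration.

```python
import re
import pandas as pd

cat_map = {
    "fruits": {"apples", "pears", "cherries", "bananas"},
    "cars": {"VW", "Ford", "Lamborghini", "Chrysler", "Hyundai"},
    "capitals": {"Berlin", "Paris", "Athens", "London"},
}

# Hypothetical rows in which the only keyword differs in case from cat_map.
demo_df = pd.DataFrame({"text": ["Apples only", "a ford truck", "no match here"]})

# Lowercase the keys so lookups do not depend on the original casing.
lookup = {word.lower(): label for label, words in cat_map.items() for word in words}
pattern = r"\b(" + "|".join(re.escape(word) for word in lookup) + r")\b"

# Extract the first keyword regardless of case, then normalise it before mapping.
first_match = demo_df["text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
demo_df["category"] = first_match.str.lower().map(lookup)
print(demo_df)
```

Rows without any keyword keep a missing value in category, as they would with the case-sensitive pattern.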
You can try this:
results = {"text": [], "category": []}
for text in some_strings:
    for key in cat_map.keys():
        for word in set(text.split(" ")):
            if word in cat_map[key]:
                results["text"].append(text)
                results["category"].append(key)
df = pd.DataFrame.from_dict(results)
# drop_duplicates returns a new frame, so keep the result
df = df.drop_duplicates()