How to map a dictionary with sets of strings onto a DataFrame column



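I have a DataFrame with a column named text. If the text in that column contains one or more of the substrings stored in a dictionary, I want to assign a value in a new column. Specifically, when the text column contains one of the substrings, the dictionary key that substring belongs to should be assigned to a new column, category.

This is my code: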

import pandas as pd
some_strings = ['Apples and pears and cherries and bananas',
                'VW and Ford and Lamborghini and Chrysler and Hyundai',
                'Berlin and Paris and Athens and London']
categories = ['fruits', 'cars', 'capitals']
test_df = pd.DataFrame(some_strings, columns=['text'])
cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

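The dictionary cat_map has sets of strings as its values. If the text column of test_df contains any of these words, I want the corresponding dictionary key to be assigned as the value in the new category column. The output DataFrame should look like this: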

output_frame = pd.DataFrame({'text': some_strings, 
'category': categories})

Any help would be appreciated.

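You could try:

# Invert cat_map so that each word points to its category label
d = {v: k for k, s in cat_map.items() for v in s}
# Extract every matching word, map it to its label, then collect the labels per row
test_df['category'] = (test_df['text'].str.extractall('(' + '|'.join(d) + ')')
                       [0].map(d)
                       .groupby(level=0).agg(set))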
print(d)
{'cherries': 'fruits', 'pears': 'fruits', 'bananas': 'fruits', 'apples': 'fruits', 'Chrysler': 'cars', 'Hyundai': 'cars', 'Lamborghini': 'cars', 'Ford': 'cars', 'VW': 'cars', 'Berlin': 'capitals', 'Athens': 'capitals', 'London': 'capitals', 'Paris': 'capitals'}

print(test_df)
                                                   text    category
0             Apples and pears and cherries and bananas    {fruits}
1  VW and Ford and Lamborghini and Chrysler and Hyundai      {cars}
2                Berlin and Paris and Athens and London  {capitals}

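I'm not sure exactly what you want to achieve, but if I understand correctly, you can check whether any word in each string is present in your cat_map: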

import pandas as pd
results = {"text": [], "category": []}
for element in some_strings:
    # .items() is required to iterate over (key, value) pairs
    for key, value in cat_map.items():
        # Check whether any word of the current string is in the current category
        if set(element.split(' ')).intersection(value):
            results["text"].append(element)
            results["category"].append(key)
df = pd.DataFrame.from_dict(results)

One approach:

lookup = { word : label for label, words in cat_map.items() for word in words }
pattern = fr"\b({'|'.join(lookup)})\b"
test_df["category"] = test_df["text"].str.extract(pattern, expand=False).map(lookup)
print(test_df)

Output

                                                text  category
0          Apples and pears and cherries and bananas    fruits
1  VW and Ford and Lamborghini and Chrysler and H...      cars
2             Berlin and Paris and Athens and London  capitals
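
Note that matching here is case-sensitive: Apples in the text does not match apples in cat_map, and the first row gets its category only because pears also appears. A minimal sketch of a case-insensitive variant follows. It assumes that lowercasing the dictionary words is acceptable for your data; the name matched is introduced only for this illustration, and re.escape is added as a precaution in case a word contains regex metacharacters.

```python
import re
import pandas as pd

some_strings = ['Apples and pears and cherries and bananas',
                'VW and Ford and Lamborghini and Chrysler and Hyundai',
                'Berlin and Paris and Athens and London']
cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}
test_df = pd.DataFrame(some_strings, columns=['text'])

# Lowercase the lookup keys so a match can be normalised before mapping
lookup = {word.lower(): label for label, words in cat_map.items() for word in words}
# Escape each word and anchor the alternation on word boundaries
pattern = r"\b(" + "|".join(re.escape(word) for word in lookup) + r")\b"

# First matching word per row, found regardless of case
matched = test_df["text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
# Normalise the match to lowercase, then translate it into its category label
test_df["category"] = matched.str.lower().map(lookup)
print(test_df)
```

As with the original, only the first matching word in each row decides the category.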

You can try this:

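results = {"text": [], "category": []}
for text in some_strings:
    for key in cat_map.keys():
        # One row is appended for every word that belongs to the category
        for word in set(text.split(" ")):
            if word in cat_map[key]:
                results["text"].append(text)
                results["category"].append(key)
df = pd.DataFrame.from_dict(results)
# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates()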
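
If a text can contain words from more than one category, the loop above yields one row per text and category pair. Here is a rough sketch of an alternative that keeps one row per text and lists every matching category. The helper name categories_for and the mixed sample sentence are made up for illustration, and matching is on whitespace-separated words, case-sensitively, as in the loop.

```python
import pandas as pd

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

def categories_for(text, mapping=cat_map):
    """Return the sorted labels whose word set shares at least one word with text."""
    words = set(text.split())
    return sorted(label for label, vocab in mapping.items() if words & vocab)

# The second sentence is a hypothetical example that mixes two categories
df = pd.DataFrame({"text": ['Apples and pears and cherries and bananas',
                            'Ford and Paris']})
df["category"] = df["text"].apply(categories_for)
print(df)
```

Returning a sorted list rather than a set keeps the column contents deterministic, which makes the result easier to compare or test.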
