根据列值创建列表,并使用该列表从df中的字符串列中提取单词,而不使用for循环覆盖行值



好吧,我承认我被困住了。希望有人能帮我解决这个问题!我将尽我所能解释清楚。我有两个df。其中一个有字符串列和市政当局,另一个df有市政当局和街道。我想为每行创建一个街道列表(针对特定的城市),因此它只提取该特定城市的字符串列中的街道。我现在的代码有点工作,但它不断迭代所有的市政当局,因此提取其他市政当局的街道,并将街道添加到错误的行。我希望下面的代码示例能让我的问题更清楚一些。

创建dataframes:

import pandas as pd
import re
# Sample dataframe with the municipality and string column
data1 = {'municipality': ['Urk','Utrecht','Almere','Utrecht','Huizen'],
'text': ["I'm going to Plantage, Pollux and Oostvaardersdiep","Tomorrow I'm going to Hoog Catharijne", 
"I'm not going to the Balijelaan","I'm not going to Socrateshof today",
"Next week I'll be going to Socrateshof"]}
df = pd.DataFrame(data1, columns = ['municipality','text'])
print(df)

输出:

municipality                                               text
0          Urk  I'm going to Plantage, Pollux and Oostvaarders...
1      Utrecht              Tomorrow I'm going to Hoog Catharijne
2       Almere                    I'm not going to the Balijelaan
3      Utrecht                 I'm not going to Socrateshof today
4       Huizen             Next week I'll be going to Socrateshof
# Sample dataframe with the municipality and street 
data2 = {'municipality': ['Urk','Urk','Utrecht','Almere','Almere','Huizen'],
'street_name': ['Plantage','Pollux','Balijelaan','Oostvaardersdiep','Catharijne','Socrateshof']}
df2 = pd.DataFrame(data2, columns = ['municipality','street_name'])
print(df2)

输出:

municipality       street_name
0          Urk          Plantage
1          Urk            Pollux
2      Utrecht        Balijelaan
3       Almere  Oostvaardersdiep
4       Almere        Catharijne
5       Huizen       Socrateshof

运行下面的函数:

# Function
street = []
def extract_street(txt):
mun_list_filter = df['municipality'] # I want the streets for this municipality
df_bag_filter_mun = df2[df2['municipality'].isin(mun_list_filter)] # Filter second df on the wanted municipality
street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
st = re.findall(r"b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
street.append(st) # Append to empty street list
return street # As you can see it keeps iterating over all municipalities 
# Call function by iterating over rows in string column
for txt in df['text']:
extract_street(txt)
# Add street list to df
df = df.assign(**{'street_match': street})
df['street_match'] = [', '.join(map(str, l)) for l in df['street_match']]
df

输出:

municipality text                                                street_match
0   Urk          I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux, Oostvaardersdiep
1   Utrecht      Tomorrow I'm going to Hoog Catharijne               Catharijne
2   Almere       I'm not going to the Balijelaan                     Balijelaan
3   Utrecht      I'm not going to Socrateshof today                  Socrateshof
4   Huizen       Next week I'll be going to Socrateshof              Socrateshof

如您所见,在第一行的城市"Urk"中,函数添加了街道"Oostvaardersdiep",尽管只有在第一行的城市是"Almere"时才应该匹配这条街道。只有最后一行是正确的,因为' socrates '实际上在'Huizen'市。

预期的结果:

municipality text                                                street_match
0   Urk          I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1   Utrecht      Tomorrow I'm going to Hoog Catharijne              
2   Almere       I'm not going to the Balijelaan                    
3   Utrecht      I'm not going to Socrateshof today                 
4   Huizen       Next week I'll be going to Socrateshof              Socrateshof

我知道问题是什么,只是不知道如何解决。我尝试过apply/lambda,但也没有运气。谢谢!

添加另一个答案,以显示更短/更简单的方法来完成您想要的操作。(第一个只是修复代码中不工作的地方。)

使用.apply(),您可以调用df的每行的函数的修改版本,然后使用df2中的街道名称进行检查。

def extract_street(row):
street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
streets_regex = r'b(' + '|'.join(street_list_mun) + r')b'
streets_found = set(re.findall(streets_regex, row['text']))
return ', '.join(streets_found)
## or if you want this to return a list of streets
# return list(streets_found)
df['street_match'] = df.apply(extract_street, axis=1)
df

输出:

municipality                                                text      street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  
2       Almere                     I'm not going to the Balijelaan                  
3      Utrecht                  I'm not going to Socrateshof today                  
4       Huizen              Next week I'll be going to Socrateshof       Socrateshof

注意:

  1. 您的正则表达式有一个问题-表达式的join部分生成字符串,如Plantageb|Pollux。如果(a)街道的最后一个名字在另一个单词的开头,或者(b)街道的最后一个名字在另一个单词的末尾:"我要去NotPlantage, Polluxsss和Oostvaardersdiep"两条街都能匹配,但不应该匹配。相反,单词边界b应该位于选项列表的末尾,并用括号分隔它们。它应该生成如下字符串:b(Plantage|Pollux)b。这与"polluxss"不匹配。或";NotPlantage"。我已经在上面的代码中做了修改。

  2. 我使用set来获得街道匹配的唯一列表。如果台词是"我要去波鲁克斯,波鲁克斯,波鲁克斯"它会给出3次结果,而不是一次。

@aneroid我现在想从类似的文本列中提取多个精确匹配(在列表中)。下面的代码(基于你的正则表达式)适用于这个简单的例子,但在我更大更复杂的数据集上,我得到一堆元组和空字符串。你知道我该如何改进这段代码吗?

# String column
data1 = {'text': ["Today I'm going to Utrecht","Tomorrow I'm going to Utrecht and Urk", 
"Next week I'll be going to the Amsterdamsestraatweg"]}
df = pd.DataFrame(data1, columns = ['text'])
print(df)
# City column in other df
data2 = {'city': ['Urk','Utrecht','Almere','Huizen','Amsterdam','Urk']}
df2 = pd.DataFrame(data2, columns = ['city'])
print(df2)
# I create a list of all the unique cities in df2
city_list = list(df2['city'].unique())
len(city_list)
len(set(city_list))
# Extract the words if there is an exact match 
df['city_match'] = df['text'].str.findall(r'b(' + '|'.join(city_list) + r')b')
df['city_match'] = [', '.join(map(str, l)) for l in df['city_match']]
print(df)
# Output
text    city_match
0                         Today I'm going to Utrecht       Utrecht
1              Tomorrow I'm going to Utrecht and Urk  Utrecht, Urk
2  Next week I'll be going to the Amsterdamsestra...      

正如你所看到的,它是有效的。"amsterdam straatweg"不是完全匹配的,所以不匹配。奇怪的是,在我的大df中,我得到一堆元组和空字符串作为输出,如下所示:

0                        ('Wijk bij Duurstede', '', '')
6                                   ('Utrecht', '', '')
7     ('Huizen', '', ''), ('Huizen', '', ''), ('Huiz...
9     ('Utrecht', '', ''), ('Utrecht', '', ''), ('Ut...
10                     ('Urk', '', ''), ('Urk', '', '')
11    ('Amersfoort', '', ''), ('Amersfoort', '', '')...
12                                 ('Lelystad', '', '')
13             ('Utrecht', '', ''), ('Utrecht', '', '')
16    ('Hilversum', '', ''), ('Hilversum', '', ''), ...
18             ('De Bilt', '', ''), ('De Bilt', '', '')
19                                      ('Urk', '', '')

再次表示感谢

只传入text的一个问题是,您无法进行市政过滤。这就是为什么你会把"Urk"写成"Oostvaardersdiep",尽管它在"Almere"。你得到它是因为"Oostvaardersdiep"这个名字出现在"Urk"条目的文本中。您的extract_streets()函数不知道要与哪个城市匹配。

让你的代码工作的最小的改变是:

  1. muntxt一起传递到extract_street()
  2. mun_list_filter应该使用mun而不是所有的直辖市
street = []
def extract_street(txt, mun):  # Pass in municipality
df_bag_filter_mun = df2[df2['municipality'] == mun]
### everything below is COPY-PASTED from your question
street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
st = re.findall(r"b|".join(street_list_mun), txt) # Find all the streets in the string column 'tekst'
street.append(st) # Append to empty street list
return street # As you can see it keeps iterating over all municipalities 
# add the 'municipality' for the extract loop
for txt, mun in zip(df['text'], df['municipality']):  
extract_street(txt, mun)
# Add street list to df
df = df.assign(**{'street_match': street})

输出:

municipality                                                text        street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  [Plantage, Pollux]
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  []
2       Almere                     I'm not going to the Balijelaan                  []
3      Utrecht                  I'm not going to Socrateshof today                  []
4       Huizen              Next week I'll be going to Socrateshof       [Socrateshof]

然后连接列表使其成为字符串:

df['street_match'] = df['street_match'].str.join(', ')

输出:

municipality                                                text      street_match
0          Urk  I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1      Utrecht               Tomorrow I'm going to Hoog Catharijne                  
2       Almere                     I'm not going to the Balijelaan                  
3      Utrecht                  I'm not going to Socrateshof today                  
4       Huizen              Next week I'll be going to Socrateshof       Socrateshof

最新更新