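OK, I admit I'm stuck. Hopefully someone can help me solve this! I'll explain it as best I can. I have two DataFrames. One has a municipality column and a string column; the other has municipalities and streets. For each row, I want to build a list of the streets that occur in the string column, restricted to that row's municipality. My current code sort of works, but it keeps iterating over all municipalities, so it extracts streets that belong to other municipalities and adds them to the wrong rows. I hope the code example below makes the problem clearer.
Creating the DataFrames: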
import pandas as pd
import re
# Sample dataframe with the municipality and string column
data1 = {'municipality': ['Urk','Utrecht','Almere','Utrecht','Huizen'],
         'text': ["I'm going to Plantage, Pollux and Oostvaardersdiep","Tomorrow I'm going to Hoog Catharijne",
                  "I'm not going to the Balijelaan","I'm not going to Socrateshof today",
                  "Next week I'll be going to Socrateshof"]}
df = pd.DataFrame(data1, columns = ['municipality','text'])
print(df)
Output:
   municipality  text
0  Urk           I'm going to Plantage, Pollux and Oostvaarders...
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof
# Sample dataframe with the municipality and street
data2 = {'municipality': ['Urk','Urk','Utrecht','Almere','Almere','Huizen'],
         'street_name': ['Plantage','Pollux','Balijelaan','Oostvaardersdiep','Catharijne','Socrateshof']}
df2 = pd.DataFrame(data2, columns = ['municipality','street_name'])
print(df2)
Output:
   municipality  street_name
0  Urk           Plantage
1  Urk           Pollux
2  Utrecht       Balijelaan
3  Almere        Oostvaardersdiep
4  Almere        Catharijne
5  Huizen        Socrateshof
Running the function below:
# Function
street = []
def extract_street(txt):
    mun_list_filter = df['municipality'] # I want the streets for this municipality
    df_bag_filter_mun = df2[df2['municipality'].isin(mun_list_filter)] # Filter second df on the wanted municipality
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'text'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities
# Call function by iterating over rows in string column
for txt in df['text']:
    extract_street(txt)
# Add street list to df
df = df.assign(**{'street_match': street})
df['street_match'] = [', '.join(map(str, l)) for l in df['street_match']]
df
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux, Oostvaardersdiep
1  Utrecht       Tomorrow I'm going to Hoog Catharijne               Catharijne
2  Almere        I'm not going to the Balijelaan                     Balijelaan
3  Utrecht       I'm not going to Socrateshof today                  Socrateshof
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
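As you can see, in the first row (municipality 'Urk') the function added the street 'Oostvaardersdiep', even though that street should only be matched when the row's municipality is 'Almere'. Only the last row is correct, because 'Socrateshof' really is in the municipality 'Huizen'.
Expected result:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
I know what the problem is, I just don't know how to solve it. I've tried apply/lambda as well, but had no luck. Thanks!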
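Adding another answer to show a shorter/simpler way to do what you want. (The first one only fixes what wasn't working in your code.)
Using .apply(), you can call a modified version of the function for each row of df, which then checks the text against the street names in df2 for that row's municipality.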
def extract_street(row):
    street_list_mun = df2.loc[df2['municipality'] == row['municipality'], 'street_name'].unique()
    streets_regex = r'\b(' + '|'.join(street_list_mun) + r')\b'
    streets_found = set(re.findall(streets_regex, row['text']))  # set drops duplicates; its order is not guaranteed
    return ', '.join(streets_found)
    ## or if you want this to return a list of streets
    # return list(streets_found)
df['street_match'] = df.apply(extract_street, axis=1)
df
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof
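If df2 is large, filtering it again for every row can get slow. A possible variation, sketched here under assumptions rather than taken from the answer: build one compiled pattern per municipality up front and look it up per row. The dict name `patterns`, the use of `re.escape`, the non-capturing group, and the order-preserving `dict.fromkeys` are all additions for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'municipality': ['Urk', 'Utrecht', 'Almere', 'Utrecht', 'Huizen'],
    'text': ["I'm going to Plantage, Pollux and Oostvaardersdiep",
             "Tomorrow I'm going to Hoog Catharijne",
             "I'm not going to the Balijelaan",
             "I'm not going to Socrateshof today",
             "Next week I'll be going to Socrateshof"],
})
df2 = pd.DataFrame({
    'municipality': ['Urk', 'Urk', 'Utrecht', 'Almere', 'Almere', 'Huizen'],
    'street_name': ['Plantage', 'Pollux', 'Balijelaan',
                    'Oostvaardersdiep', 'Catharijne', 'Socrateshof'],
})

# Hypothetical helper: one compiled whole-word pattern per municipality
patterns = {
    mun: re.compile(r'\b(?:' + '|'.join(map(re.escape, names.unique())) + r')\b')
    for mun, names in df2.groupby('municipality')['street_name']
}

def extract_street(row):
    pattern = patterns.get(row['municipality'])
    if pattern is None:  # municipality has no streets listed in df2
        return ''
    # dict.fromkeys removes duplicates while keeping first-seen order
    return ', '.join(dict.fromkeys(pattern.findall(row['text'])))

df['street_match'] = df.apply(extract_street, axis=1)
print(df[['municipality', 'street_match']])
```

Unlike the set-based version, this keeps the streets in the order they appear in the text.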
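Note:
Your regex has a problem: the join part of the expression generates strings like Plantage\b|Pollux. That gives a match if (a) the last street name is at the start of another word, or (b) any street name except the last is at the end of another word. For example, "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep" matches both streets, but it shouldn't. Instead, the word boundary \b should be at both ends of the list of options, with parentheses grouping them. It should generate strings like \b(Plantage|Pollux)\b. That won't match "Polluxsss" or "NotPlantage". I've already made that change in the code above.
I use set to get a unique list of street matches. If the text were "I'm going to Pollux, Pollux, Pollux", it would otherwise give the result 3 times instead of just once.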
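A small check of the two patterns, using only the standard re module. The variable names are made up for this sketch; the strings are the ones from the note.

```python
import re

streets = ['Plantage', 'Pollux']
tricky = "I'm going to NotPlantage, Polluxsss and Oostvaardersdiep"

loose = r"\b|".join(streets)                  # 'Plantage\b|Pollux'
strict = r'\b(' + '|'.join(streets) + r')\b'  # '\b(Plantage|Pollux)\b'

loose_hits = re.findall(loose, tricky)    # also matches inside the longer words
strict_hits = re.findall(strict, tricky)  # whole words only

repeated = "I'm going to Pollux, Pollux, Pollux"
all_hits = re.findall(strict, repeated)   # one entry per occurrence
unique_hits = set(all_hits)               # duplicates collapsed

print(loose_hits)   # ['Plantage', 'Pollux']
print(strict_hits)  # []
print(all_hits)     # ['Pollux', 'Pollux', 'Pollux']
print(unique_hits)  # {'Pollux'}
```

One thing to keep in mind: a set does not preserve order, so with several different streets the joined string may come out in a different order than in the text.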
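@aneroid I now want to extract multiple exact matches (from a list) out of a similar text column. The code below (based on your regex) works for this simple example, but on my larger, more complex dataset I get a bunch of tuples and empty strings. Do you know how I can improve this code?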
# String column
data1 = {'text': ["Today I'm going to Utrecht","Tomorrow I'm going to Utrecht and Urk",
                  "Next week I'll be going to the Amsterdamsestraatweg"]}
df = pd.DataFrame(data1, columns = ['text'])
print(df)
# City column in other df
data2 = {'city': ['Urk','Utrecht','Almere','Huizen','Amsterdam','Urk']}
df2 = pd.DataFrame(data2, columns = ['city'])
print(df2)
# I create a list of all the unique cities in df2
city_list = list(df2['city'].unique())
len(city_list)
len(set(city_list))
# Extract the words if there is an exact match
df['city_match'] = df['text'].str.findall(r'\b(' + '|'.join(city_list) + r')\b')
df['city_match'] = [', '.join(map(str, l)) for l in df['city_match']]
print(df)
# Output
   text                                               city_match
0  Today I'm going to Utrecht                         Utrecht
1  Tomorrow I'm going to Utrecht and Urk              Utrecht, Urk
2  Next week I'll be going to the Amsterdamsestra...
As you can see, it works. "Amsterdamsestraatweg" is not an exact match, so it doesn't match. Strangely, on my big df I get a bunch of tuples and empty strings as output, like this:
0 ('Wijk bij Duurstede', '', '')
6 ('Utrecht', '', '')
7 ('Huizen', '', ''), ('Huizen', '', ''), ('Huiz...
9 ('Utrecht', '', ''), ('Utrecht', '', ''), ('Ut...
10 ('Urk', '', ''), ('Urk', '', '')
11 ('Amersfoort', '', ''), ('Amersfoort', '', '')...
12 ('Lelystad', '', '')
13 ('Utrecht', '', ''), ('Utrecht', '', '')
16 ('Hilversum', '', ''), ('Hilversum', '', ''), ...
18 ('De Bilt', '', ''), ('De Bilt', '', '')
19 ('Urk', '', '')
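One possible explanation for the tuples, offered as a guess since the large dataset isn't shown: re.findall returns tuples whenever the pattern contains more than one capture group, and that happens if any name in the list itself contains parentheses. The sketch below reproduces the shape of that output with made-up names ('Stad (A)' and 'Stad (B)' are hypothetical) and shows a variant that escapes each name with re.escape and uses a non-capturing group.

```python
import re

# Hypothetical list; the two names with parentheses are invented for illustration
city_list = ['Urk', 'Utrecht', 'Stad (A)', 'Stad (B)']
text = "Tomorrow I'm going to Utrecht and Urk"

# Unescaped: "(A)" and "(B)" become extra capture groups, so findall yields tuples
unescaped = r'\b(' + '|'.join(city_list) + r')\b'
tuples = re.findall(unescaped, text)

# Escaped names inside a non-capturing group: findall yields plain strings
escaped = r'\b(?:' + '|'.join(re.escape(c) for c in city_list) + r')\b'
names = re.findall(escaped, text)

print(tuples)  # [('Utrecht', '', ''), ('Urk', '', '')]
print(names)   # ['Utrecht', 'Urk']
```

If this is the cause, the escaped pattern can be passed to df['text'].str.findall(...) in the same way as before.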
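Thanks again!
One problem with passing in only text is that you can't do the municipality filtering. That's why you get "Oostvaardersdiep" for "Urk" even though it's in "Almere". You get it because the name "Oostvaardersdiep" appears in the text of the "Urk" entry. Your extract_street() function doesn't know which municipality to match against.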
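To see why no filtering happens, here is a minimal sketch using the municipality and street data from the question (the names `all_rows` and `urk_only` are just for illustration). Passing the whole df['municipality'] column to isin() keeps every row of df2 whose municipality occurs anywhere in df, which here is all of them; comparing against a single municipality is what actually narrows it down.

```python
import pandas as pd

df = pd.DataFrame({'municipality': ['Urk', 'Utrecht', 'Almere', 'Utrecht', 'Huizen']})
df2 = pd.DataFrame({
    'municipality': ['Urk', 'Urk', 'Utrecht', 'Almere', 'Almere', 'Huizen'],
    'street_name': ['Plantage', 'Pollux', 'Balijelaan',
                    'Oostvaardersdiep', 'Catharijne', 'Socrateshof'],
})

# isin() against the whole column: every street survives the "filter"
all_rows = df2[df2['municipality'].isin(df['municipality'])]

# Equality against one municipality: only that municipality's streets remain
urk_only = df2[df2['municipality'] == 'Urk']

print(len(all_rows), len(df2))          # 6 6
print(urk_only['street_name'].tolist()) # ['Plantage', 'Pollux']
```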
The minimal change to make your code work is:
- Pass mun along with txt into extract_street()
- mun_list_filter should use mun instead of all the municipalities
street = []
def extract_street(txt, mun): # Pass in municipality
    df_bag_filter_mun = df2[df2['municipality'] == mun]
    ### everything below is COPY-PASTED from your question
    street_list_mun = list(df_bag_filter_mun['street_name'].unique()) # Select all unique streets for the specific municipality
    st = re.findall(r"\b|".join(street_list_mun), txt) # Find all the streets in the string column 'text'
    street.append(st) # Append to empty street list
    return street # As you can see it keeps iterating over all municipalities
# add the 'municipality' for the extract loop
for txt, mun in zip(df['text'], df['municipality']):
    extract_street(txt, mun)
# Add street list to df
df = df.assign(**{'street_match': street})
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  [Plantage, Pollux]
1  Utrecht       Tomorrow I'm going to Hoog Catharijne               []
2  Almere        I'm not going to the Balijelaan                     []
3  Utrecht       I'm not going to Socrateshof today                  []
4  Huizen        Next week I'll be going to Socrateshof              [Socrateshof]
Then join the lists to turn them into strings:
df['street_match'] = df['street_match'].str.join(', ')
Output:
   municipality  text                                                street_match
0  Urk           I'm going to Plantage, Pollux and Oostvaardersdiep  Plantage, Pollux
1  Utrecht       Tomorrow I'm going to Hoog Catharijne
2  Almere        I'm not going to the Balijelaan
3  Utrecht       I'm not going to Socrateshof today
4  Huizen        Next week I'll be going to Socrateshof              Socrateshof