str.extract starting from the end of the string in a pandas DataFrame

I have a DataFrame with several thousand rows and two columns, like this:

                                         string  state
0     the best new york cheesecake rochester ny     ny
1     the best dallas bbq houston tx random str     tx
2  la jolla fish shop of san diego san diego ca     ca
3                                  nothing here     dc

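For each state I have a regular expression containing all of its city names (lowercase), structured like (city1|city2|city3|...), where the order of the cities is arbitrary (but can be changed if necessary). For example, the regular expression for New York State contains 'new york' and 'rochester' (likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).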

I want to find the last city that appears in each string. For the four rows above, I want 'rochester', 'houston', 'san diego', and NaN (or some other placeholder), respectively.

I started with str.extract and tried a few ideas, such as reversing the string, but got stuck.

Any help is greatly appreciated!

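You can use str.findall, but it returns an empty list when there is no match, so an apply is needed to substitute a placeholder. Finally, select the last item of each list with .str[-1]: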

cities = r"new york|dallas|rochester|houston|san diego"
print (df['string'].str.findall(cities)
                   .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                   .str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object

(Corrected > 1 to >= 1.)
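If NaN is acceptable for rows without a match, as the question allows, the apply step may not be needed. As far as I can tell, .str[-1] on a Series of lists yields NaN when the list is empty, although this is worth confirming on your pandas version. A minimal sketch, with the DataFrame rebuilt from the question's sample rows:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego"

# findall returns a list of matches per row; .str[-1] picks the last one
# and is expected to give NaN where the list is empty
last_city = df["string"].str.findall(cities).str[-1]
print(last_city)
```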

Another solution is a bit of a hack: prepend the no-match string to every string with radd, and also add that string to the cities pattern:

a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object
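To see why the hack works, it may help to look at the intermediate values. The sketch below uses a small made-up Series and the hypothetical names sentinel and prefixed. It assumes the sentinel contains no regex metacharacters and never occurs in the real data:

```python
import pandas as pd

s = pd.Series(["nothing here", "dallas bbq houston tx"])
sentinel = "no match val"
cities = r"dallas|houston" + "|" + sentinel

# radd computes sentinel + value, so every string now starts with the sentinel
prefixed = s.radd(sentinel)

# Every row has at least one match (the sentinel); real cities come after it,
# so the last match is the sentinel only when no city was found
matches = prefixed.str.findall(cities)
last = matches.str[-1]
print(last.tolist())
```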

import re

cities = r"new york|dallas|..."

def last_match(s):
    # Return the last city found in s, or an empty string if none matches
    found = re.findall(cities, s)
    return found[-1] if found else ""

df['string'].apply(last_match)
#0    rochester
#1      houston
#2    san diego
#3
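Since the question has a separate pattern for each state, the same idea can be applied row by row using the state column. The following is a minimal sketch rather than part of the original answers. The names cities_by_state, patterns, and last_city are hypothetical, and the city lists contain only the examples mentioned in the question:

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

# Hypothetical per-state city lists (only the cities named in the question)
cities_by_state = {
    "ny": ["new york", "rochester"],
    "tx": ["dallas", "houston"],
    "ca": ["san diego", "la jolla"],
}

# One compiled pattern per state; longer names come first so that a city
# whose name starts with another city's name is not cut short
patterns = {
    state: re.compile(
        "|".join(re.escape(c) for c in sorted(cities, key=len, reverse=True))
    )
    for state, cities in cities_by_state.items()
}

def last_city(text, state):
    """Return the last city of `state` found in `text`, or NaN if none."""
    pattern = patterns.get(state)
    if pattern is None:  # no city list for this state
        return np.nan
    found = pattern.findall(text)
    return found[-1] if found else np.nan

df["last_city"] = [last_city(t, s) for t, s in zip(df["string"], df["state"])]
print(df[["state", "last_city"]])
```

A list comprehension over zip is used here instead of DataFrame.apply with axis=1 because it avoids building a Series for each row, which tends to matter with thousands of rows.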
