Get the last word of a location string, except for special cases such as "New York", "North Dakota", and "South Carolina"

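I am trying to create a new field in a pandas DataFrame. The existing field is "location", and it contains city and state information. I used str.split().str[-1] to get the last word of the location, which is usually the full state name.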

The problem is that states like "north carolina" become just "carolina". I want to account for special cases, such as when .str[-2] is "north", "new", "south", or "west".

Here is a sample of my code:

df["state"] = df.location.str.split().str[-1]
print(df.state.value_counts().reset_index())

Here is the output:

index  state  
0      california  59855  
1            york     17  
2        illinois      8  
3   massachusetts      5

You can see that "york" should be "new york".

I think I should write a function for the location field, like this:

def get_location(x):
    if x.str.split().str[-2] in ["new", "north", "south", "west"]:
        return x.str.split().str[-2:]
    else:
        return x.str.split().str[-1]

The problem is that I get the following error message when I call get_location(df.location):

"The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

Am I on the right track? What should I do so that my new df.state field returns output like this:

index   state  
0       california   59855  
1         new york      17  
2         illinois       8  
3    massachusetts       5  
4   north carolina       3

Thanks!

You can use the split method to count the words in the string, something like this:

import pandas as pd

# Dummy DataFrame based on your data:
your_df = pd.DataFrame({'location': ['New York', 'North Carolina', 'South Illinois', 'Texas', 'Florida'], 'another_field': [1000, 2000, 3000, 4000, 5000]})
# Count the words in each location; if there are two or more, keep the full string.
your_df['state'] = your_df['location'].apply(lambda your_location: your_location if len(your_location.split(" ")) > 1 else your_location.split(" ")[-1])
your_df

Output:

location       another_field    state
0   New York                1000    New York
1   North Carolina          2000    North Carolina
2   South Illinois          3000    South Illinois
3   Texas                   4000    Texas
4   Florida                 5000    Florida
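
Note that the approach above keeps the whole string whenever it has more than one word, so a location that holds both a city and a state would keep the city as well. A minimal sketch of the prefix check described in the question follows. The names `get_state` and `PREFIXES` and the sample rows are made up for illustration, and it assumes the location column has no missing values. Because `apply` passes one string at a time, the `if` sees a plain boolean rather than a whole Series, which is what triggered the "truth value of a Series is ambiguous" error.

```python
import pandas as pd

# Illustrative set of first words of two-word state names (taken from the question).
PREFIXES = {"new", "north", "south", "west"}


def get_state(location: str) -> str:
    """Return the last word, or the last two words if the second-to-last is a known prefix."""
    words = location.lower().split()
    if len(words) >= 2 and words[-2] in PREFIXES:
        return " ".join(words[-2:])
    return words[-1] if words else ""


# Made-up sample rows; real data would come from the existing DataFrame.
df = pd.DataFrame({
    "location": [
        "los angeles california",
        "buffalo new york",
        "charlotte north carolina",
        "chicago illinois",
    ]
})

# apply() calls get_state once per row value (a plain str).
df["state"] = df["location"].apply(get_state)
print(df["state"].value_counts().reset_index())
```

One caveat: a prefix check like this can misfire when a city name ends in one of the prefixes (for example "key west florida" would give "west florida"), so matching against a full list of state names may be more robust.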
