在panda中拆分地址列



我有一个熊猫数据集,如下所示:

import pandas as pd
data = {'id':  ['001', '002', '003','004'],
'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA",
"1180 Shelard Tower\nMinneapolis, MN 55426\nUSA",
"William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA",
"215 S 11th ST"],
'locality': [None, None, None,'Laramie'],
'region': [None, None, None, 'WY'],
'Zipcode': [None, None, None, '87656'],
'Country': [None, None, None, 'US']
}
df = pd.DataFrame(data)

我试着用新行分隔地址列,但由于它有两个\后面跟着n,所以我做不到。请帮我把地址和出口分成地区、邮编和国家。

样本输出:

id  address locality    region  Zipcode Country
1   290 Valley Dr.  Casper  WY  82604   USA
2   1180 Shelard Tower  Minneapolis MN  55426   USA
3   145 S. Durbin   Casper  WY  82601   USA
4   215 S 11th ST   Laramie WY  87656   US

我使用split命令尝试了不同的拆分方法,但它给了我额外的\。我正试图将它保存在熊猫数据帧中,以便进行进一步的分析。

这里有一种使用extract而不是拆分和update的方法:

df.update(df['address'].str.extract(r'([^,]+)(?:,s(w+)s*(d+)\n(w+))?$')
.set_axis(["address", "region", "Zipcode", "Country"], axis=1) 
)
df['name'] = pd.NA
df.update(df['address'].str.extract(r'(?:(.*?)\n)?(.*)\n(.+)')
.set_axis(['name', 'address', 'locality'], axis=1)
)

输出:

id             address     locality region Zipcode Country                name
0  001      290 Valley Dr.       Casper     WY   82604     USA    William J. Clare
1  002  1180 Shelard Tower  Minneapolis     MN   55426     USA                <NA>
2  003       145 S. Durbin       Casper     WY   82601     USA  William N. Barnard
3  004       215 S 11th ST      Laramie     WY   87656      US                <NA>

regex演示

第二行没有名称,因此可能会引起一些麻烦。请检查你的例子是否正确。

假设你的df是这样的:

data = {'id':  ['001', '002', '003','004'],
'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA",
"name\n1180 Shelard Tower\nMinneapolis, MN 55426\nUSA",
"William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA",
"215 S 11th ST"],
'locality': [None, None, None,'Laramie'],
'region': [None, None, None, 'WY'],
'Zipcode': [None, None, None, '87656'],
'Country': [None, None, None, 'US']
}
df = pd.DataFrame(data)

您可以创建一个过滤条件:

cond = df[["locality", "region", "Zipcode", "Country"]].isna()

然后使用splitexpand=True

df.loc[cond.all(axis=1)].address.str.split(r"\n", expand=True)
0    William J. Clare      290 Valley Dr.       Casper, WY 82604  USA
1                name  1180 Shelard Tower  Minneapolis, MN 55426  USA
2  William N. Barnard       145 S. Durbin       Casper, WY 82601  USA

最新更新