Why does pandas Series.str.extract(regex) print the correct values, but the result cannot be assigned to an existing column using indexing or np.where?
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', "Street", 'Postcode', 'FullAddress']
)
m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),')) # prints street
print(df["FullAddress"].str.extract(r'\b(\d{5})\b')) # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),') # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))
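What I want to do is fill the empty Street and Postcode cells with values taken from FullAddress, without disturbing the existing Street and Postcode values.
There is nothing wrong with the indexing, the regex, or even the extraction itself. I have read the docs and searched for similar questions. What is everyone else getting that I am not?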
You are missing expand=False as an argument to str.extract:
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')
             0  # <- it's not a Series but a DataFrame with one column
0  1 Banana St
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
0    1 Banana St
Name: FullAddress, dtype: object  # <- now it's a Series
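The difference in return type can also be checked directly. The following is a minimal sketch on a standalone Series (the variable names as_frame and as_series are only illustrative), comparing the default call with expand=False:

```python
import pandas as pd

s = pd.Series(
    ['1 Banana St, 69126 Heidelberg', '2 Choco Rd, 69412 Eberbach'],
    name='FullAddress',
)

# Default (expand=True): one capture group still yields a DataFrame,
# whose single column is labelled 0.
as_frame = s.str.extract(r'(.+?),')

# expand=False: one capture group yields a Series instead.
as_series = s.str.extract(r'(.+?),', expand=False)

print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_series).__name__, as_series.tolist())
```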
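In the first version, pandas cannot align the column label Street with the label 0. In the second version, the result is a Series that fits the Street Series, so: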
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
print(df)
# Output
  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St  67898.0     2 Choco Rd, 69412 Eberbach
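The same idea should also resolve the "cannot align with a higher dimensional NDFrame" error from the where attempt in the question, since where then receives a Series rather than a DataFrame. A small sketch, assuming a reduced copy of the example frame (street_from_address and street_np are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['1', np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'FullAddress'],
)

# expand=False yields a Series, which can be aligned with df['Street'] by index.
street_from_address = df['FullAddress'].str.extract(r'(.+?),', expand=False)

# np.where variant: take the extracted value only where Street is missing.
street_np = np.where(df['Street'].isna(), street_from_address, df['Street'])

# Series.where variant: keep existing values, fall back to the extracted ones.
df['Street'] = df['Street'].where(df['Street'].notna(), street_from_address)

print(df['Street'].tolist())
```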
*: It is also possible to use extract without expand=False by using named groups (?P<xxx>...) so that the column labels align:
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(?P<Postcode>\d{5})\b')
# OR
pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
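To see what the named groups change, it can help to inspect the extracted frame on its own. This sketch only looks at the result of extract with the combined pattern (the name extracted is illustrative) and does not perform the assignment:

```python
import pandas as pd

addresses = pd.Series(
    ['1 Banana St, 69126 Heidelberg', '2 Choco Rd, 69412 Eberbach'],
    name='FullAddress',
)

# Each named group becomes a column carrying the group's name.
pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
extracted = addresses.str.extract(pattern)

print(list(extracted.columns))
print(extracted.iloc[0].tolist())
```

Note that the extracted postcodes are strings at this point, not numbers.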
You can use .fillna to fill the NaN values in the DataFrame:
df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'\b(\d{5})\b')[0])
This fills all the null values with the result of extract while preserving all existing values:
  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St    67898     2 Choco Rd, 69412 Eberbach
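One thing to watch is that extract returns strings, while the existing Postcode value is numeric, so the filled column can end up with mixed types. A possible variant, sketched under the assumption that integer postcodes are acceptable (leading zeros would be lost, so keeping strings is the alternative when they matter), converts the extracted values with pd.to_numeric before filling:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', 67898, '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'Postcode', 'FullAddress'],
)

# expand=False returns a Series, so no [0] column lookup is needed.
street = df['FullAddress'].str.extract(r'(.+?),', expand=False)
postcode = pd.to_numeric(
    df['FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
)

# fillna only touches missing cells; existing values are left alone.
df['Street'] = df['Street'].fillna(street)
df['Postcode'] = df['Postcode'].fillna(postcode).astype('int64')

print(df[['Street', 'Postcode']])
```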