Why does pandas.Series.str.extract not work here when it works elsewhere?



Why does pandas Series.str.extract(regex) print the correct values, yet fail to assign them to existing columns when I use indexing or np.where?

import pandas as pd
import numpy as np
df = pd.DataFrame(
[
['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']], 
columns=['id', "Street", 'Postcode', 'FullAddress']
)
m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),'))                        # prints street
print(df["FullAddress"].str.extract(r'\b(\d{5})\b'))                   # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),')  # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))

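What I want to do is fill the empty Street and Postcode cells with values taken from FullAddress, without touching the Street and Postcode values that already exist.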

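There is nothing wrong with the indexing, the regex, or even the extraction itself. I have read the docs and searched for similar questions. What is everyone else getting that I am not?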

You are missing expand=False as an argument to str.extract:

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')
             0  # <- it's not a Series but a DataFrame with one column
0  1 Banana St
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
0    1 Banana St
Name: FullAddress, dtype: object  # <- now it's a Series

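In the first version, pandas cannot align the column labels: Street versus 0 (*). In the second version, the result is a Series that lines up with the Street Series, so: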

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
print(df)
# Output
  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St  67898.0     2 Choco Rd, 69412 Eberbach
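
The difference between the two return types can be checked directly. The following is a minimal sketch, assuming a cut-down version of the question's frame without the Postcode column; the names as_frame and as_series are only illustrative.

```python
import numpy as np
import pandas as pd

# Cut-down version of the question's data (Postcode left out for brevity).
df = pd.DataFrame(
    [
        ['1', np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'FullAddress'],
)
m = df['Street'].isna()

# Default (expand=True): a one-column DataFrame whose column label is 0.
as_frame = df.loc[m, 'FullAddress'].str.extract(r'(.+?),')
# expand=False with a single group: a Series that keeps the row index.
as_series = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)

print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_series).__name__, as_series.name)

# The Series aligns on the row index, so only the masked rows are filled.
df.loc[m, 'Street'] = as_series
print(df['Street'].tolist())
```

Only the row selected by m changes; the existing 'Doloros St' value is left alone.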

(*) It is also possible to use extract without expand=False by using named groups (?P<xxx>...), so that the column labels align:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(?P<Postcode>\d{5})\b')
# OR
pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
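
As a rough illustration of the combined pattern, the sketch below extracts both named groups in one pass and assigns them together. It assumes, unlike the question's data, that Postcode is stored as strings so the extracted text and the existing values share one type; the variable name extracted is illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: Postcode is kept as text here, not as a number.
df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', '67898', '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'Postcode', 'FullAddress'],
)
m = df['Street'].isna()

# Named groups become the column labels of the extracted DataFrame.
pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
extracted = df.loc[m, 'FullAddress'].str.extract(pattern)
print(list(extracted.columns))

# Row and column labels now match, so the assignment can align on both.
df.loc[m, ['Street', 'Postcode']] = extracted
print(df[['Street', 'Postcode']])
```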

You can use .fillna to fill the NaN values in the DataFrame:

df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'\b(\d{5})\b')[0])

This fills every missing value with the result of extract while keeping all existing values:

  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St    67898     2 Choco Rd, 69412 Eberbach
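
A variant of the same idea, sketched under one assumption: because the Postcode column in the question holds numbers, the extracted strings are converted with pd.to_numeric before filling, which keeps the column numeric instead of mixing strings and floats. If postcodes with leading zeros matter, storing them as text would be the safer choice. The names street_from_address and postcode_from_address are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', 'Doloros St', 67898, '2 Choco Rd, 69412 Eberbach'],
    ],
    columns=['id', 'Street', 'Postcode', 'FullAddress'],
)

# expand=False returns a Series directly, so no [0] column lookup is needed.
street_from_address = df['FullAddress'].str.extract(r'(.+?),', expand=False)
postcode_from_address = pd.to_numeric(
    df['FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
)

# fillna aligns on the row index and only touches the missing cells.
df['Street'] = df['Street'].fillna(street_from_address)
df['Postcode'] = df['Postcode'].fillna(postcode_from_address)
print(df)
```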
