How to split a combined column when the data is inconsistent



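I am preparing for a new job in which I will receive data submissions of varying quality. Times, dates, character strings and so on are often combined in ways that make no sense and have to be separated before analysis. I am thinking ahead about how to solve this.

In the made-up example below, I have combined Region, Sales Rep and Product into a single column.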

file['combine'] = file['Region'] + file['Sales Rep'] + file['Product']

Shift Region Sales Rep  Product  Cost per  Units Sold              combine
0       3   East  Shirlene   Pencil         5          71   EastShirlenePencil
1       3  South  Anderson   Folder        17          69  SouthAndersonFolder
2       3   West    Shelli   Folder        17         185     WestShelliFolder
3       3  South    Damion   Binder        30         159    SouthDamionBinder
4       3   West  Shirlene  Stapler        25          41  WestShirleneStapler

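Assuming no other data is available, how can the "combine" column be split back apart?

Thanks very much!

If you want spaces between the strings, you can do: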

df["combine"] = df[["Region", "Sales Rep", "Product"]].apply(" ".join, axis=1)
print(df)

Prints:

Shift Region Sales Rep  Product  Cost per  Units Sold                combine
0      3   East  Shirlene   Pencil         5          71   East Shirlene Pencil
1      3  South  Anderson   Folder        17          69  South Anderson Folder
2      3   West    Shelli   Folder        17         185     West Shelli Folder
3      3  South    Damion   Binder        30         159    South Damion Binder
4      3   West  Shirlene  Stapler        25          41  West Shirlene Stapler

Or, if you want to split the strings that have already been combined:

import re

# Each match starts at an uppercase letter and runs up to the next uppercase letter
df["separated"] = df["combine"].apply(lambda x: re.findall(r"[A-Z][^A-Z]*", x))
print(df)

Prints:

Shift Region Sales Rep  Product  Cost per  Units Sold              combine                  separated
0      3   East  Shirlene   Pencil         5          71   EastShirlenePencil   [East, Shirlene, Pencil]
1      3  South  Anderson   Folder        17          69  SouthAndersonFolder  [South, Anderson, Folder]
2      3   West    Shelli   Folder        17         185     WestShelliFolder     [West, Shelli, Folder]
3      3  South    Damion   Binder        30         159    SouthDamionBinder    [South, Damion, Binder]
4      3   West  Shirlene  Stapler        25          41  WestShirleneStapler  [West, Shirlene, Stapler]
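
Since the question is about inconsistent data, it may help to check that the split really produced three parts before trusting it. Below is a minimal sketch using the same regex; the function name split_combined and the rule "anything other than three parts returns None" are my own assumptions for illustration, not something from the question.

```python
import re

# Same pattern as above: an uppercase letter followed by non-uppercase characters
CAPITALIZED_WORD = re.compile(r"[A-Z][^A-Z]*")

# Target column names, taken from the example table
FIELDS = ("Region", "Sales Rep", "Product")


def split_combined(value):
    """Split a combined string into the three fields.

    Returns None when the string does not break into exactly three
    capitalized parts, so that problem rows can be reviewed by hand.
    (Hypothetical helper; the None convention is an assumption.)
    """
    parts = CAPITALIZED_WORD.findall(value)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


print(split_combined("EastShirlenePencil"))
# A made-up value with a two-word region does not split cleanly
print(split_combined("NewYorkShirlenePencil"))
```

With pandas, something like df["combine"].apply(split_combined) should give one dict (or None) per row. Rows that come back as None are the ones where the capital-letter rule breaks down, for example names that contain more than one capital letter or values that are entirely lowercase.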
