Split a string column based on values in a list and add the result to another column in a Pandas DataFrame



I have a pandas DataFrame:

import pandas as pd
data = {"Column1": ["258 E SONORA ST SAN",
"57474 SAXONY WAY APT 223 WESLEY",  
"62748 CALIFORNIA ST APT 2 SAN",    
"3211 LONGLAKE DR FERN",    
"420 S PLYMOUTH CT APT 265",    
"AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG", 
"224-22 141 STREET RICHMOND",   
"15624 274TH ST CAMBRIA",   
"778 SANTO DOMINGO AVE SW PALM",    
"261 BROADMOOR DR SOUTH SIOUX"],    
"Colum2" : ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]}
df = pd.DataFrame(data)
df

Output

Column1                                 Colum2
0   258 E SONORA ST SAN                                 BERNARDINO
1   57474 SAXONY WAY APT 223 WESLEY                     CHAPEL
2   62748 CALIFORNIA ST APT 2 SAN                       FRANCISCO
3   3211 LONGLAKE DR FERN                               CREEK
4   420 S PLYMOUTH CT APT 265                           CHICAGO
5   AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG    VALLEY
6   224-22 141 STREET RICHMOND                          HILL
7   15624 274TH ST CAMBRIA                              HEIGHTS
8   778 SANTO DOMINGO AVE SW PALM                       BAY
9   261 BROADMOOR DR SOUTH SIOUX                        CITY

I have a list of values on which I need to split the strings in Column1:
split_city = ["ST","DR", "STREET", "AVE SW"]

I would also like to split after APT and the number that follows it.

How can I split a string column based on the values in a list, and add the split-off part to another column in a Pandas DataFrame?

Required output


Column1                                 Colum2
0   258 E SONORA ST                                     SAN BERNARDINO
1   57474 SAXONY WAY APT 223                            WESLEY CHAPEL
2   62748 CALIFORNIA ST APT 2                           SAN FRANCISCO
3   3211 LONGLAKE DR                                    FERN CREEK
4   420 S PLYMOUTH CT APT 265                           CHICAGO
5   AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR         LONG VALLEY
6   224-22 141 STREET                                   RICHMOND HILL
7   15624 274TH ST                                      CAMBRIA HEIGHTS
8   778 SANTO DOMINGO AVE SW                            PALM BAY
9   261 BROADMOOR DR                                    SOUTH SIOUX CITY

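I don't know whether there is a good way to do this in Pandas itself, but because there are so many edge cases here, I would combine the two columns into full addresses and then use a regex, rather than trying to split on the values in the split list (the regex also accounts for apartments):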

Solution
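import re
import pandas as pd

# Group 1: street text, group 2: last street suffix/direction or "APT <number>", group 3: city
pattern = re.compile(r"([-\d\w ]*)\s(ST|WAY|DR|STREET|AVE|N|S|E|W|SW|SE|NW|NE|APT \d*)\s([\w ]*)")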

column1 = ["258 E SONORA ST SAN",
"57474 SAXONY WAY APT 223 WESLEY",
"62748 CALIFORNIA ST APT 2 SAN",
"3211 LONGLAKE DR FERN",
"420 S PLYMOUTH CT APT 265",
"AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG",
"224-22 141 STREET RICHMOND",
"15624 274TH ST CAMBRIA",
"778 SANTO DOMINGO AVE SW PALM",
"261 BROADMOOR DR SOUTH SIOUX"]

column2 = ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]

combined = [" ".join(t) for t in zip(column1, column2)]
streets = []
cities = []
for t in (pattern.findall(s) for s in combined):
    # t[0] is the first match: (street text, suffix, city)
    *street, city = t[0]
    streets.append(" ".join(street))
    cities.append(city)

df = pd.DataFrame({"street": streets, "city": cities})

Output:

In [10]: pd.DataFrame({"street": streets, "city": cities})
Out[10]:
street              city
0                              258 E SONORA ST    SAN BERNARDINO
1                     57474 SAXONY WAY APT 223     WESLEY CHAPEL
2                    62748 CALIFORNIA ST APT 2     SAN FRANCISCO
3                             3211 LONGLAKE DR        FERN CREEK
4                    420 S PLYMOUTH CT APT 265           CHICAGO
5  AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR       LONG VALLEY
6                            224-22 141 STREET     RICHMOND HILL
7                               15624 274TH ST   CAMBRIA HEIGHTS
8                     778 SANTO DOMINGO AVE SW          PALM BAY
9                             261 BROADMOOR DR  SOUTH SIOUX CITY
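If you would rather stay inside Pandas, the same idea can probably be written with `Series.str.extract`. The following is only a sketch, not part of the answer above: it reuses the same pattern, but the named groups `street` and `city` and the variable names `combined` and `result` are my own choices, and only four of the sample rows are shown to keep it short.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["258 E SONORA ST SAN",
                "57474 SAXONY WAY APT 223 WESLEY",
                "420 S PLYMOUTH CT APT 265",
                "778 SANTO DOMINGO AVE SW PALM"],
    "Colum2": ["BERNARDINO", "CHAPEL", "CHICAGO", "BAY"],
})

# Same regex as above, but the street text and its suffix are wrapped in one
# named group so that str.extract returns exactly two columns.
pattern = (r"(?P<street>[-\d\w ]*\s"
           r"(?:ST|WAY|DR|STREET|AVE|N|S|E|W|SW|SE|NW|NE|APT \d*))"
           r"\s(?P<city>[\w ]*)")

# Join the two columns into full addresses, then split them again at the
# last suffix that the pattern recognises.
combined = df["Column1"] + " " + df["Colum2"]
result = combined.str.extract(pattern)
print(result)
```

Rows that do not match the pattern come back as NaN in both columns, so it is worth checking `result.isna().any()` before overwriting the original columns.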
