I have a pandas DataFrame:
import pandas as pd
data = {"Column1": ["258 E SONORA ST SAN",
"57474 SAXONY WAY APT 223 WESLEY",
"62748 CALIFORNIA ST APT 2 SAN",
"3211 LONGLAKE DR FERN",
"420 S PLYMOUTH CT APT 265",
"AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG",
"224-22 141 STREET RICHMOND",
"15624 274TH ST CAMBRIA",
"778 SANTO DOMINGO AVE SW PALM",
"261 BROADMOOR DR SOUTH SIOUX"],
"Colum2" : ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]}
df = pd.DataFrame(data)
df
Output:
                                            Column1      Colum2
0                               258 E SONORA ST SAN  BERNARDINO
1                   57474 SAXONY WAY APT 223 WESLEY      CHAPEL
2                     62748 CALIFORNIA ST APT 2 SAN   FRANCISCO
3                             3211 LONGLAKE DR FERN       CREEK
4                         420 S PLYMOUTH CT APT 265     CHICAGO
5  AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG      VALLEY
6                        224-22 141 STREET RICHMOND        HILL
7                            15624 274TH ST CAMBRIA     HEIGHTS
8                     778 SANTO DOMINGO AVE SW PALM         BAY
9                      261 BROADMOOR DR SOUTH SIOUX        CITY
I have a list of values that occur in Column1:
split_city = ["ST","DR", "STREET", "AVE SW"]
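I also want to split after APT and the digits that follow it.
How can I split a string column on the values in a list and move the remainder into another column of a pandas DataFrame?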
Required output:
                                       Column1            Colum2
0                              258 E SONORA ST    SAN BERNARDINO
1                     57474 SAXONY WAY APT 223     WESLEY CHAPEL
2                    62748 CALIFORNIA ST APT 2     SAN FRANCISCO
3                             3211 LONGLAKE DR        FERN CREEK
4                    420 S PLYMOUTH CT APT 265           CHICAGO
5  AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR       LONG VALLEY
6                            224-22 141 STREET     RICHMOND HILL
7                               15624 274TH ST   CAMBRIA HEIGHTS
8                     778 SANTO DOMINGO AVE SW          PALM BAY
9                             261 BROADMOOR DR  SOUTH SIOUX CITY
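I don't know whether there is a good way to do this in pandas itself, but because there are so many edge cases here, I would join the two address columns back together and then use a regex, rather than trying to split on the values in the split list (this also takes care of the apartments):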
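Solution:
import re
import pandas as pd

# Group 1: street text; group 2: the last street suffix, direction, or "APT <digits>"; group 3: city.
pattern = re.compile(r"([-\d\w ]*)\s(ST|WAY|DR|STREET|AVE|N|S|E|W|SW|SE|NW|NE|APT \d*)\s([\w ]*)")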
column1 = ["258 E SONORA ST SAN",
"57474 SAXONY WAY APT 223 WESLEY",
"62748 CALIFORNIA ST APT 2 SAN",
"3211 LONGLAKE DR FERN",
"420 S PLYMOUTH CT APT 265",
"AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG",
"224-22 141 STREET RICHMOND",
"15624 274TH ST CAMBRIA",
"778 SANTO DOMINGO AVE SW PALM",
"261 BROADMOOR DR SOUTH SIOUX"]
column2 = ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO", "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"]
combined = [" ".join(t) for t in zip(column1, column2)]
streets = []
cities = []
for t in (pattern.findall(s) for s in combined):
    # t[0] is the (street text, suffix, city) tuple of the first match
    *street, city = t[0]
    streets.append(" ".join(street))
    cities.append(city)
df = pd.DataFrame({"street": streets, "city": cities})
Output:
In [10]: pd.DataFrame({"street": streets, "city": cities})
Out[10]:
                                        street              city
0                              258 E SONORA ST    SAN BERNARDINO
1                     57474 SAXONY WAY APT 223     WESLEY CHAPEL
2                    62748 CALIFORNIA ST APT 2     SAN FRANCISCO
3                             3211 LONGLAKE DR        FERN CREEK
4                    420 S PLYMOUTH CT APT 265           CHICAGO
5  AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR       LONG VALLEY
6                            224-22 141 STREET     RICHMOND HILL
7                               15624 274TH ST   CAMBRIA HEIGHTS
8                     778 SANTO DOMINGO AVE SW          PALM BAY
9                             261 BROADMOOR DR  SOUTH SIOUX CITY
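The same combine-then-regex idea can probably be kept inside pandas as well. What follows is only a sketch under some assumptions: the pattern is built from the question's split_city list plus an "APT <digits>" alternative (so, unlike the pattern above, WAY and the compass directions are not treated as split points), and the names keywords, combined and result are made up for illustration. It relies on Series.str.cat to join the two columns and Series.str.extract with named groups to split them again.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Column1": ["258 E SONORA ST SAN",
                "57474 SAXONY WAY APT 223 WESLEY",
                "62748 CALIFORNIA ST APT 2 SAN",
                "3211 LONGLAKE DR FERN",
                "420 S PLYMOUTH CT APT 265",
                "AHLONA L LABARRE POA -L 274 NESTLINGWOOD DR LONG",
                "224-22 141 STREET RICHMOND",
                "15624 274TH ST CAMBRIA",
                "778 SANTO DOMINGO AVE SW PALM",
                "261 BROADMOOR DR SOUTH SIOUX"],
    "Colum2": ["BERNARDINO", "CHAPEL", "FRANCISCO", "CREEK", "CHICAGO",
               "VALLEY", "HILL", "HEIGHTS", "BAY", "CITY"],
})
split_city = ["ST", "DR", "STREET", "AVE SW"]

# Escape each keyword and list longer ones first, so a longer keyword
# is tried before a shorter one that shares its prefix.
keywords = "|".join(re.escape(k) for k in sorted(split_city, key=len, reverse=True))

# The greedy ".*" makes the street run up to the LAST keyword (or "APT <digits>")
# that stands as a whole word; whatever follows is taken as the city.
pattern = rf"^(?P<street>.*\b(?:APT \d+|{keywords})\b)\s*(?P<city>.*)$"

# Re-join the two columns, then split them again at the right place.
combined = df["Column1"].str.cat(df["Colum2"], sep=" ")
result = combined.str.extract(pattern)  # DataFrame with columns "street" and "city"
```

Rows that do not match the pattern come back as NaN in both columns, so it is worth checking result.isna() on real data. Like the regex above, this will cut in the wrong place when the city itself contains one of the keywords as a separate word (ST PAUL, for example).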