I have a field in a DataFrame that contains names like the following:
Joseph Sam Smith
Angela Savage
James Taylor
William Smith Jr
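I want to split it into four columns: first_name, middle_name, last_name, suffix. For this dataset it is acceptable (though not ideal) to assume that the only possible suffix is Jr.
I originally assumed there would only be a first and a last name, but then I realized I need more than that.
df[['first_name', 'last_name']] = df['name'].str.split(" ", n=1, expand=True)
Thanks in advance!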
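Not a vectorized approach, but it gets the job done. It assumes every person has at least a first and a last name, i.e. there are no single-word names such as "Prince".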
Setup
import pandas as pd

data = pd.Series([
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
])
suffixes = ["Jr", "III"]
Solution
def decipher(name):
    l = [None] * 4  # placeholder list: first, middle, last, suffix
    tokens = name.split()
    l[0] = tokens.pop(0)  # first name
    if tokens[-1] in suffixes:
        l[-1] = tokens.pop()  # add suffix to end of list
    l[2] = tokens.pop()  # last element of tokens must be last name
    if len(tokens) > 0:  # if there are any elements left they are a middle name
        l[1] = tokens.pop()
    return pd.Series(l)

result = data.apply(decipher)
result
Output
         0     1       2     3
0   Joseph   Sam   Smith  None
1   Angela  None  Savage  None
2    James  None  Taylor  None
3  William  None   Smith    Jr
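Since the answer above is explicitly not vectorized, here is a rough sketch of how the same split might be done with `Series.str.extract` and a regular expression with named groups. The pattern is my own illustration, not part of the answer, and it assumes at most one middle name and that the only suffixes are Jr and III. Unmatched groups come back as missing values rather than None.

```python
import pandas as pd

data = pd.Series([
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
])

# Assumption: at most one middle name; suffix is either "Jr" or "III".
pattern = (
    r"^(?P<first_name>\S+)"                   # first token
    r"(?:\s+(?P<middle_name>\S+))?"           # optional middle token
    r"\s+(?P<last_name>(?!(?:Jr|III)$)\S+)"   # last name, must not be a trailing suffix
    r"(?:\s+(?P<suffix>Jr|III))?$"            # optional suffix at the end
)

# Named groups become the column names of the resulting DataFrame.
result = data.str.extract(pattern)
```

The negative lookahead stops a trailing "Jr" from being captured as the last name, so "William Smith Jr" yields Smith as last_name and Jr as suffix. Names that do not fit the pattern (for example, two middle names) would produce a row of missing values, so this is only a starting point.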
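name = "Joseph Sam Smith"
df = [["first_name", "middle_name", "last_name", "suffix"]]  # header row
nameLis = name.split(" ")
# Pull a trailing "Jr" out first so it is not mistaken for the last name
suffix = nameLis.pop() if nameLis[-1] == "Jr" else ""
if len(nameLis) == 2:
    nameLis.insert(1, "")  # no middle name
nameLis.append(suffix)
df.append(nameLis)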
>>> import pandas as pd
>>> x = pd.Series(["Joseph Sam Smith","Angela Savage", "James Taylor", "William Smith Jr"])
>>> x
0    Joseph Sam Smith
1       Angela Savage
2        James Taylor
3    William Smith Jr
dtype: object
>>> d = x.str.split(expand=True)
>>> d['suffix'] = None
>>> d.columns = ['FirstName', 'MiddleName', 'LastName', 'suffix']
>>> matched = d.loc[d.LastName.eq("Jr")]
>>> d.iloc[matched.index, 3] = d.iloc[matched.index, 2].to_list()
>>> d.iloc[matched.index, 2] = None
>>> d
  FirstName MiddleName LastName suffix
0    Joseph        Sam    Smith   None
1    Angela     Savage     None   None
2     James     Taylor     None   None
3   William      Smith     None     Jr
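Note that in the output above, names with only two parts (and "William Smith Jr") end up with the last name in the MiddleName column. One possible way to shift those values over is sketched below. It is a variation on the session above, not the original answer's code, and it assumes no name has more than three tokens, so that the split produces exactly three columns.

```python
import pandas as pd

x = pd.Series(["Joseph Sam Smith", "Angela Savage", "James Taylor", "William Smith Jr"])

# Assumption: at most three tokens per name, so split yields three columns.
d = x.str.split(expand=True)
d.columns = ["FirstName", "MiddleName", "LastName"]

# Move a trailing "Jr" out of LastName into its own suffix column.
is_suffix = d["LastName"].eq("Jr")
d["suffix"] = d["LastName"].where(is_suffix)
d["LastName"] = d["LastName"].mask(is_suffix)

# Rows that now lack a last name had only two name tokens, so the
# value sitting in MiddleName is really the last name: shift it over.
two_part = d["LastName"].isna()
d["LastName"] = d["LastName"].fillna(d["MiddleName"])
d["MiddleName"] = d["MiddleName"].mask(two_part)
```

After this, only "Joseph Sam Smith" keeps a middle name, and every row has its last name in LastName. Missing entries are stored as missing values (NaN or None, depending on the pandas version).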