How to split strings in a dataframe based on the number of spaces



My dataframe has a column containing names like the following:

Joseph Sam Smith
Angela Savage
James Taylor
William Smith Jr

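I want to split it into four columns: first_name, middle_name, last_name, and suffix. For this dataset, it is acceptable (though not ideal) to assume that the only possible suffix is Jr.

I initially assumed there would only be a first and last name, but then I realized I need more.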

df[['first_name', 'last_name']] = df['name'].str.split(" ", n=1, expand=True)

Thanks in advance!

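Not a vectorized approach, but it gets the job done. It assumes that everyone has at least a first and a last name, i.e. no single-word names such as "Prince".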

Setup

import pandas as pd

data = pd.Series([
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
])
suffixes = ["Jr", "III"]

Solution

def decipher(name):
    l = [None] * 4  # placeholder list: first, middle, last, suffix
    tokens = name.split()
    l[0] = tokens.pop(0)  # first name
    if tokens[-1] in suffixes:
        l[-1] = tokens.pop()  # move the suffix to the end of the list
    l[2] = tokens.pop()  # the last remaining token must be the last name
    if len(tokens) > 0:  # any token left over is a middle name
        l[1] = tokens.pop()
    return pd.Series(l)

result = data.apply(decipher)

result

         0     1       2     3
0   Joseph   Sam   Smith  None
1   Angela  None  Savage  None
2    James  None  Taylor  None
3  William  None   Smith    Jr
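
To get the four column names asked for in the question and attach them to the original dataframe, something along these lines should work. This is only a sketch under a few assumptions, not the answer's exact code: `split_name`, `SUFFIXES`, and `COLUMNS` are hypothetical names introduced here, several middle tokens are joined with a space, and a single-word name yields a missing last name instead of raising an error.

```python
import pandas as pd

SUFFIXES = {"Jr", "III"}  # assumed suffix list, mirroring the setup above
COLUMNS = ["first_name", "middle_name", "last_name", "suffix"]


def split_name(name):
    """Return [first, middle, last, suffix]; missing parts are None."""
    tokens = name.split()
    first = tokens.pop(0)
    # Only treat the final token as a suffix if it is in the known list.
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else None
    last = tokens.pop() if tokens else None
    # Whatever remains between first and last is the middle name.
    middle = " ".join(tokens) if tokens else None
    return [first, middle, last, suffix]


df = pd.DataFrame({"name": [
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
]})

# Build the new columns on the same index, then join them to the original frame.
parts = pd.DataFrame(
    [split_name(n) for n in df["name"]], columns=COLUMNS, index=df.index
)
df = df.join(parts)
```

After this, `df` keeps the original `name` column and gains the four named columns.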
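
name = "Joseph Sam Smith"
df = [["first_name", "middle_name", "last_name", "suffix"]]  # header row of a plain list of lists
nameLis = name.split(" ")
if len(nameLis) == 3:
    # three tokens are treated as first/middle/last, so a "Jr" in third place would end up in last_name
    nameLis.append("")  # empty suffix
elif len(nameLis) == 2:
    nameLis.insert(1, "")  # empty middle name
    nameLis.insert(3, "")  # empty suffix
df.append(nameLis)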
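
The snippet above handles one name and does not check for a suffix. A possible plain-Python extension that loops over all the names and peels off a trailing "Jr" first might look like this. `to_row`, `HEADER`, and `SUFFIXES` are hypothetical names, and the sketch assumes every name is a non-empty string and that "Jr" is the only suffix, as stated in the question.

```python
HEADER = ["first_name", "middle_name", "last_name", "suffix"]
SUFFIXES = {"Jr"}  # assumption taken from the question: "Jr" is the only suffix


def to_row(name):
    """Split one name into [first, middle, last, suffix], using "" for gaps."""
    tokens = name.split()
    # A suffix is only recognised when at least a first and last name remain.
    suffix = tokens.pop() if len(tokens) > 2 and tokens[-1] in SUFFIXES else ""
    first = tokens[0]
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])  # empty string when there is no middle name
    return [first, middle, last, suffix]


names = ["Joseph Sam Smith", "Angela Savage", "James Taylor", "William Smith Jr"]
table = [HEADER] + [to_row(n) for n in names]
```

Here `table` is a list of lists with the header as its first row, the same shape the snippet above builds by hand.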
>>> import pandas as pd
>>> x = pd.Series(["Joseph Sam Smith","Angela Savage", "James Taylor", "William Smith Jr"])
>>> x
0    Joseph Sam Smith
1       Angela Savage
2        James Taylor
3    William Smith Jr
dtype: object
>>> d = x.str.split(expand=True)
>>> d['suffix'] = None
>>> d.columns = ['FirstName', 'MiddleName', 'LastName', 'suffix']
>>> matched = d.loc[d.LastName.eq("Jr")]
>>> d.iloc[matched.index, 3] = d.iloc[matched.index, 2].to_list()
>>> d.iloc[matched.index, 2] = None
>>> d

FirstName   MiddleName  LastName    suffix
0   Joseph      Sam         Smith       None
1   Angela      Savage      None        None
2   James       Taylor      None        None
3   William     Smith       None        Jr
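
Note that in the output above, the surname of the two-word names ends up under MiddleName and LastName stays empty. If a vectorized approach is wanted that puts each part in its own column, one option is `Series.str.extract` with named groups. This is a sketch, not part of the answer above: the pattern assumes at most one single-word middle name and "Jr" as the only suffix, and a name that does not fit the pattern (a single word, for example) comes back as an all-missing row.

```python
import pandas as pd

x = pd.Series(["Joseph Sam Smith", "Angela Savage", "James Taylor", "William Smith Jr"])

# Named groups become the column names of the resulting DataFrame.
# The (?!Jr$) lookahead stops a trailing "Jr" from being taken as the last name.
pattern = (
    r"^(?P<first_name>\S+)"
    r"(?:\s+(?P<middle_name>\S+))?"
    r"\s+(?P<last_name>(?!Jr$)\S+)"
    r"(?:\s+(?P<suffix>Jr))?$"
)

parts = x.str.extract(pattern)
```

Optional groups that do not take part in a match are returned as missing values, so `parts` has the same four columns for every row.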
