I have a DataFrame with columns in the following format:
df=
ID    Folder Name  Country
300   ABC 12345    CANADA
1000  NaN          USA
450   AML 2233     USA
111   ABC 2234     USA
550   AML 3312     AFRICA
The output needs to be in the following format:
ID    Folder Name  Country  Folder Name - ABC  Folder Name - AML
300   ABC 12345    CANADA   ABC 12345          NaN
1000  NaN          USA      NaN                NaN
450   AML 2233     USA      NaN                AML 2233
111   ABC 2234     USA      ABC 2234           NaN
550   AML 3312     AFRICA   NaN                AML 3312
I tried the following Python code:
df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x.str.startswith('ABC',na = False))
Can you help me figure out what went wrong?
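You shouldn't use apply here. Series.apply passes each element (a plain str or NaN) to the lambda, so x.str does not exist, and startswith would only give you True/False anyway. Use boolean indexing instead: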
df.loc[df['Folder Name'].str.startswith('ABC', na=False),
'Folder Name - ABC'] = df['Folder Name']
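As a minimal sketch of the boolean-indexing approach, here it is applied to every prefix in a loop. The sample frame is rebuilt from the question, and the list of prefixes is an assumption about which codes you care about:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "ID": [300, 1000, 450, 111, 550],
    "Folder Name": ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    "Country": ["CANADA", "USA", "USA", "USA", "AFRICA"],
})

# Assumed list of prefixes; adjust to your data
for code in ["ABC", "AML"]:
    # na=False turns missing folder names into False instead of NaN
    mask = df["Folder Name"].str.startswith(code, na=False)
    # Only the rows selected by the mask are filled; the rest stay NaN
    df.loc[mask, f"Folder Name - {code}"] = df["Folder Name"]

print(df)
```

This works, but it needs one pass per prefix, which is what the next approach avoids.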
However, a better approach, which doesn't require looping over all possible codes, is to extract the code and then use pivot_table and merge:
out = df.merge(
    # r'(\w+)' captures the leading code (e.g. 'ABC') of each folder name
    df.assign(col=df['Folder Name'].str.extract(r'(\w+)', expand=False))
      .pivot_table(index='ID', columns='col',
                   values='Folder Name', aggfunc='first')
      .add_prefix('Folder Name - '),
    on='ID', how='left'
)
Output:
     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312
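To make the pipeline easier to follow, here is a sketch of the same idea split into steps on the sample data from the question. The intermediate names codes and wide are mine, and I call reset_index() before merging only to make the join key explicit:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [300, 1000, 450, 111, 550],
    "Folder Name": ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    "Country": ["CANADA", "USA", "USA", "USA", "AFRICA"],
})

# Step 1: pull out the leading code of each folder name (NaN stays NaN)
codes = df["Folder Name"].str.extract(r"(\w+)", expand=False)

# Step 2: one column per code, holding the full folder name for that ID
wide = (
    df.assign(col=codes)
      .pivot_table(index="ID", columns="col",
                   values="Folder Name", aggfunc="first")
      .add_prefix("Folder Name - ")
)

# Step 3: attach the wide columns back onto the original rows;
# how="left" keeps IDs without a folder name (here ID 1000)
out = df.merge(wide.reset_index(), on="ID", how="left")
print(out)
```

Rows whose folder name is missing never reach the pivot, which is why the left merge is needed to keep them in the result.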
If you have a list of the substrings to match at the start of each string in df['Folder Name'], you can also get the result as follows:
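lst = ['ABC','AML']
# anchor every alternative: in '^(ABC.*)|(AML.*)' the ^ would apply to the first group only
pat = '|'.join(f'^({x}.*)' for x in lst)
# '^(ABC.*)|^(AML.*)'
df[[f'Folder Name - {x}' for x in lst]] = (
    df['Folder Name'].str.extract(pat, expand=True)
)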
print(df)
     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312
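If the prefixes might contain regex metacharacters, a slightly more defensive sketch could look like this. build_prefix_pattern is a hypothetical helper of my own, not a pandas function; it escapes each prefix and anchors each alternative at the start of the string:

```python
import re

import numpy as np
import pandas as pd


def build_prefix_pattern(prefixes):
    """Hypothetical helper: one capture group per prefix, each anchored with ^."""
    return "|".join(f"^({re.escape(p)}.*)" for p in prefixes)


df = pd.DataFrame({
    "ID": [300, 1000, 450, 111, 550],
    "Folder Name": ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    "Country": ["CANADA", "USA", "USA", "USA", "AFRICA"],
})

lst = ["ABC", "AML"]
pat = build_prefix_pattern(lst)

# str.extract returns one column per capture group, in pattern order
extracted = df["Folder Name"].str.extract(pat, expand=True)
extracted.columns = [f"Folder Name - {x}" for x in lst]
df = df.join(extracted)
print(df)

# A prefix that appears mid-string is not picked up, because every
# alternative is anchored at the start
stray = pd.Series(["XX AML 1"])
no_match = stray.str.extract(pat).isna().all().all()
```

Naming the extracted columns first and then joining is just a design choice here; assigning directly to a list of new column names, as above, gives the same frame.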
If you don't have the list yet, you can simply create it first:
lst = df['Folder Name'].dropna().str.extract('^([A-Z]{3})')[0].unique()
# this will be an array, not a list,
# but that doesn't affect the functionality here
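Note: if your list contains items that don't match anything, you will end up with extra columns filled entirely with NaN values. You can drop those at the end. For example: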
lst = ['ABC','AML','NON']
# 'NON' won't match
pat = '|'.join(f'^({x}.*)' for x in lst)
df[[f'Folder Name - {x}' for x in lst]] = (
    df['Folder Name'].str.extract(pat, expand=True)
)
df = df.dropna(axis=1, how='all')
# dropping column `Folder Name - NON` with only `NaN` values
The startswith method returns True or False, so your column will only contain boolean values. You could try the following instead:
df['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x if isinstance(x, str) and x.startswith('ABC') else float('nan'))
Does this code work for you?
df['Folder Name - ABC'] = df['Folder Name'].where(df['Folder Name'].str.startswith('ABC', na=False))
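For what it's worth, here is a quick check of the where approach on the sample data from the question. Passing na=False is my assumption, to keep the mask strictly boolean when Folder Name is missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [300, 1000, 450, 111, 550],
    "Folder Name": ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    "Country": ["CANADA", "USA", "USA", "USA", "AFRICA"],
})

# True only where the folder name starts with 'ABC'; missing values become False
mask = df["Folder Name"].str.startswith("ABC", na=False)

# where() keeps values where the mask is True and puts NaN everywhere else
df["Folder Name - ABC"] = df["Folder Name"].where(mask)
print(df)
```

Unlike the apply version, this stays vectorized and never calls a Python function per row.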