Filter a column containing values and NaN by specific characters and create separate columns



I have a DataFrame with columns in the following format:

df=

ID     Folder Name    Country
300    ABC 12345      CANADA
1000   NaN            USA
450    AML 2233       USA
111    ABC 2234       USA
550    AML 3312       AFRICA

The output needs to be in the following format:

ID     Folder Name    Country    Folder Name - ABC   Folder Name - AML
300    ABC 12345      CANADA      ABC 12345             NaN
1000     NaN          USA         NaN                   NaN
450    AML 2233       USA         NaN                   AML 2233
111    ABC 2234       USA         ABC 2234              NaN
550    AML 3312       AFRICA      NaN                   AML 3312

I tried the following Python code:

df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x.str.startswith('ABC',na = False))

Can you help me figure out what went wrong?

You should not use apply here; use boolean indexing instead:

df.loc[df['Folder Name'].str.startswith('ABC', na=False),
       'Folder Name - ABC'] = df['Folder Name']
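As a minimal, self-contained sketch of the boolean-indexing approach, the snippet below rebuilds the sample frame from the question and repeats the assignment for each prefix. The `prefixes` list and the loop are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [300, 1000, 450, 111, 550],
    "Folder Name": ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    "Country": ["CANADA", "USA", "USA", "USA", "AFRICA"],
})

prefixes = ["ABC", "AML"]  # assumed list of codes to split out

for prefix in prefixes:
    # na=False turns the NaN row into False, so the mask is purely boolean.
    mask = df["Folder Name"].str.startswith(prefix, na=False)
    # Rows where the mask is False are left as NaN in the new column.
    df.loc[mask, f"Folder Name - {prefix}"] = df["Folder Name"]

print(df)
```

This still needs one pass per prefix, which is the loop the next approach avoids.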

However, a better approach that does not require looping over every possible code is to extract the code, then use pivot_table and merge:

out = df.merge(
    # r'(\w+)' captures the leading code of each folder name, e.g. 'ABC'
    df.assign(col=df['Folder Name'].str.extract(r'(\w+)'))
      .pivot_table(index='ID', columns='col',
                   values='Folder Name', aggfunc='first')
      .add_prefix('Folder Name - '),
    on='ID', how='left'
)

Output:

     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312
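To see what each stage contributes, the same pipeline can be split into named steps. This is only a sketch on the question's sample data; the intermediate names `codes` and `wide` are hypothetical and were chosen here for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [300, 1000, 450, 111, 550],
    "Folder Name": ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    "Country": ["CANADA", "USA", "USA", "USA", "AFRICA"],
})

# Step 1: pull the leading word characters (the code) out of each name.
# expand=False returns a Series; the NaN row stays NaN.
codes = df["Folder Name"].str.extract(r"(\w+)", expand=False)

# Step 2: pivot so that each distinct code becomes its own column.
# Rows whose code is NaN are dropped by the grouping inside pivot_table.
wide = (
    df.assign(col=codes)
      .pivot_table(index="ID", columns="col",
                   values="Folder Name", aggfunc="first")
      .add_prefix("Folder Name - ")
)

# Step 3: a left merge on ID restores the dropped row and the original order.
out = df.merge(wide, on="ID", how="left")
print(out)
```

Because the codes are discovered from the data, no list of prefixes has to be maintained by hand.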

If you have a list of the substrings to match at the start of each string in df['Folder Name'], you can also get the result like this:

lst = ['ABC','AML']
pat = f'^({".*)|(".join(lst)}.*)'
# '^(ABC.*)|(AML.*)'
df[[f'Folder Name - {x}' for x in lst]] = (
    df['Folder Name'].str.extract(pat, expand=True)
)
print(df)
     ID Folder Name Country Folder Name - ABC Folder Name - AML
0   300   ABC 12345  CANADA         ABC 12345               NaN
1  1000         NaN     USA               NaN               NaN
2   450    AML 2233     USA               NaN          AML 2233
3   111    ABC 2234     USA          ABC 2234               NaN
4   550    AML 3312  AFRICA               NaN          AML 3312
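One detail of the pattern above is worth checking: in '^(ABC.*)|(AML.*)' the `^` anchor applies only to the first alternative, so a later alternative can match in the middle of a string. The sketch below shows a possible stricter variant; the helper name `build_prefix_pattern` is hypothetical, and `re.escape` is added on the assumption that a prefix might contain regex metacharacters.

```python
import re

import numpy as np
import pandas as pd


def build_prefix_pattern(prefixes):
    """Build one capture group per prefix, all anchored at the string start."""
    groups = "|".join(f"({re.escape(p)}.*)" for p in prefixes)
    # The non-capturing group makes ^ apply to every alternative.
    return f"^(?:{groups})"


pat = build_prefix_pattern(["ABC", "AML"])
# '^(?:(ABC.*)|(AML.*))'

# The last value is an invented edge case with the code in the middle.
names = pd.Series(["ABC 12345", np.nan, "AML 2233", "XX AML 9"])

strict = names.str.extract(pat, expand=True)
loose = names.str.extract("^(ABC.*)|(AML.*)", expand=True)

print(strict)  # the last row has no match in either column
print(loose)   # the last row matches 'AML 9' in the second column
```

For the sample data in the question both patterns give the same result, since every folder name starts with its code.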

If you do not have this list yet, you can simply create it first:

lst = df['Folder Name'].dropna().str.extract('^([A-Z]{3})')[0].unique()
# this will be an array, not a list, 
# but that doesn't affect the functionality here

Note: if your list contains items that match nothing, you end up with extra columns filled entirely with NaN. You can drop these at the end. For example:

lst = ['ABC','AML','NON']
# 'NON' won't match
pat = f'^({".*)|(".join(lst)}.*)'
df[[f'Folder Name - {x}' for x in lst]] = (
    df['Folder Name'].str.extract(pat, expand=True)
)
df = df.dropna(axis=1, how='all')
# dropping column `Folder Name - NON` with only `NaN` values
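Note that dropna(axis=1, how='all') removes every all-NaN column in the frame, not only the ones just created. If the frame may contain other empty columns that should be kept, a narrower cleanup could look like the sketch below. The `Notes` column is invented here purely to illustrate that case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [300, 1000, 450, 111, 550],
    "Folder Name": ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    "Country": ["CANADA", "USA", "USA", "USA", "AFRICA"],
    "Notes": [np.nan] * 5,  # hypothetical pre-existing empty column
})

lst = ["ABC", "AML", "NON"]  # 'NON' matches nothing
pat = "^(?:" + "|".join(f"({p}.*)" for p in lst) + ")"
new_cols = [f"Folder Name - {x}" for x in lst]

# Extract one column per prefix, name the columns, and attach them by index.
extracted = df["Folder Name"].str.extract(pat, expand=True)
extracted.columns = new_cols
df = df.join(extracted)

# Drop only those newly created columns that turned out to be empty.
empty_new = [c for c in new_cols if df[c].isna().all()]
df = df.drop(columns=empty_new)
print(df.columns.tolist())
```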

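The startswith method returns True or False, so your column would contain only boolean values. Also, inside apply each element is a plain Python value (a str or NaN), so call x.startswith directly rather than x.str.startswith. You could try the following instead:

import numpy as np
df['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x if isinstance(x, str) and x.startswith('ABC') else np.nan)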

Would this code work?

df['Folder Name - ABC'] = df['Folder Name'].where(df['Folder Name'].str.startswith('ABC'))
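A quick way to check the where-based idea is to run it on the question's values. The sketch below passes na=False so that the mask is explicitly boolean; whether the shorter form without it behaves identically may depend on the pandas version, so treat that as an assumption to verify locally.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    ["ABC 12345", np.nan, "AML 2233", "ABC 2234", "AML 3312"],
    name="Folder Name",
)

# na=False maps the missing value to False instead of NaN.
mask = s.str.startswith("ABC", na=False)

# where keeps values where the mask is True and fills NaN elsewhere.
result = s.where(mask)
print(result)
```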
