Pandas: converting the str values in a column to lists



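I have a pandas DataFrame column that contains both list and str values. I am trying to convert the str values into proper list format so that they match the other values, which are already in list form. I found a workaround, but I would like to know whether there is a better way. My questions are: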

  1. Is there a built-in pandas feature I can use instead of writing a long regex/replace chain?
  2. How can I convert [nan] to [] without a regular expression?

Here is my attempt.

Data file:

StudentName,CourseID
Alan,"['abc-12-0878', 'abc-12-45', 'abc-12-232342']"
Tim,"['abc-12-0878', 'abc-12-45']"
David,abc-12-1147
Martha,
Matt,"['abc-12-0878', 'abc-12-45']"
Abby,abc-12-1148

My code so far:

import pandas as pd
df = pd.read_csv('sample_students.csv')
df

df['result'] = df['CourseID'].astype(str).apply(lambda x: x.strip('[]').replace("'","").split(',')) 
# Regex route.
# Is there a built-in pandas function for this?
# gives `[nan]` instead of `[]`
# `to_list` and `tolist` didn't work.

The result I am looking for:

print(df[['CourseID','result']]) 
CourseID                                        result
['abc-12-0878', 'abc-12-45', 'abc-12-232342']   ['abc-12-0878', 'abc-12-45', 'abc-12-232342']
['abc-12-0878', 'abc-12-45']                    ['abc-12-0878', 'abc-12-45']
abc-12-1147                                     ['abc-12-1147']
NaN                                             []
['abc-12-0878', 'abc-12-45']                    ['abc-12-0878', 'abc-12-45']
abc-12-1148                                     ['abc-12-1148']

You can apply ast.literal_eval() to parse the string representation of a list.

import ast
def f(s):
    if pd.isna(s):     # case 1: nan
        return []
    elif s[0] == "[":  # case 2: string of list
        return ast.literal_eval(s)
    else:              # case 3: string
        return [s]
df["result"] = df["CourseID"].apply(f)

Or in one line:

df["result"] = df["CourseID"].apply(lambda s: [] if pd.isna(s) else ast.literal_eval(s) if s[0] == "[" else [s])

Result:

print(df[["CourseID","result"]])
CourseID                                   result
0  ['abc-12-0878', 'abc-12-45', 'abc-12-232342']  [abc-12-0878, abc-12-45, abc-12-232342]
1                   ['abc-12-0878', 'abc-12-45']                 [abc-12-0878, abc-12-45]
2                                    abc-12-1147                            [abc-12-1147]
3                                            NaN                                       []
4                   ['abc-12-0878', 'abc-12-45']                 [abc-12-0878, abc-12-45]
5                                    abc-12-1148                            [abc-12-1148]
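For anyone who wants to try this without the CSV file, here is a minimal self-contained sketch of the same idea. The names `CSV_TEXT` and `to_list` are made up for illustration; the data is copied inline from the question. As a side note, the `[nan]` in the original attempt most likely comes from `astype(str)` rendering the missing value as the text `nan` before the split, so checking `pd.isna` first avoids it.

```python
import ast
import io

import pandas as pd

# Inline copy of the sample data from the question, so no file is needed.
CSV_TEXT = '''StudentName,CourseID
Alan,"['abc-12-0878', 'abc-12-45', 'abc-12-232342']"
Tim,"['abc-12-0878', 'abc-12-45']"
David,abc-12-1147
Martha,
Matt,"['abc-12-0878', 'abc-12-45']"
Abby,abc-12-1148
'''


def to_list(value):
    """Normalize one CourseID cell to a Python list (hypothetical helper)."""
    if pd.isna(value):           # missing cell -> empty list
        return []
    if value.startswith("["):    # text such as "['a', 'b']" -> real list
        return ast.literal_eval(value)
    return [value]               # bare string -> one-element list


df = pd.read_csv(io.StringIO(CSV_TEXT))
df["result"] = df["CourseID"].apply(to_list)
print(df["result"].tolist())
```

Every element of `result` is then an actual Python list, which is easy to confirm with `df["result"].map(type)`.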

If you don't want to import another library, you can do this:

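def update_data(val):
    if pd.isna(val):       # missing value -> empty list
        return []
    if val[0] == '[':      # string that looks like a list: parse it by hand
        # Assumes the IDs themselves contain no commas or quotes.
        return [item.strip(" '") for item in val.strip('[]').split(',')]
    return [val]           # plain string -> one-element list
df['Result'] = df['CourseID'].apply(update_data)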
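If avoiding extra imports is the goal, the conversion could also happen while the file is being read, using the `converters` parameter of `pd.read_csv`. The sketch below is only an illustration: `CSV_TEXT` and `parse_courses` are hypothetical names, and the hand-rolled parsing assumes the course IDs never contain commas or quotes.

```python
import io

import pandas as pd

# Inline copy of the sample data from the question.
CSV_TEXT = '''StudentName,CourseID
Alan,"['abc-12-0878', 'abc-12-45', 'abc-12-232342']"
Tim,"['abc-12-0878', 'abc-12-45']"
David,abc-12-1147
Martha,
Matt,"['abc-12-0878', 'abc-12-45']"
Abby,abc-12-1148
'''


def parse_courses(text):
    """Turn one raw CSV field into a list (hypothetical helper, no ast)."""
    # Treat anything that is not a non-blank string as "no courses".
    if not isinstance(text, str) or not text.strip():
        return []
    text = text.strip()
    if text.startswith("["):
        # Drop the brackets, split on commas, then trim spaces and quotes.
        pieces = (item.strip(" '") for item in text.strip("[]").split(","))
        return [piece for piece in pieces if piece]
    return [text]


# The converter is applied to every value of the CourseID column on load.
df = pd.read_csv(io.StringIO(CSV_TEXT), converters={"CourseID": parse_courses})
print(df["CourseID"].tolist())
```

With this approach the column already holds lists when the DataFrame is created, so no separate `apply` step is needed afterwards.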

You can simply check the type of the value:

df['result'] = df['CourseID'].apply(lambda x: x.strip('[]').replace("'","").split(',') if type(x) == str else [])
>>> print(df[['CourseID','result']])
CourseID                                     result
0  ['abc-12-0878', 'abc-12-45', 'abc-12-232342']  [abc-12-0878,  abc-12-45,  abc-12-232342]
1                   ['abc-12-0878', 'abc-12-45']                  [abc-12-0878,  abc-12-45]
2                                    abc-12-1147                              [abc-12-1147]
3                                            NaN                                         []
4                   ['abc-12-0878', 'abc-12-45']                  [abc-12-0878,  abc-12-45]
5                                    abc-12-1148                              [abc-12-1148]
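One detail worth noting in the output above: the double space after each comma suggests that every element after the first keeps a leading space, because the string is split on "," alone. A small sketch of the effect and one possible fix, using a single sample value taken from the question (the variable names are illustrative only):

```python
raw = "['abc-12-0878', 'abc-12-45']"

# Splitting on "," alone keeps the space that followed each comma.
loose = raw.strip("[]").replace("'", "").split(",")
print(loose)

# Stripping each piece afterwards removes that leading space.
clean = [item.strip() for item in loose]
print(clean)
```

The same `strip()` call could be folded into the lambda if the stray whitespace matters for later comparisons or merges.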
