Recoding list values in a DataFrame column



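I am trying to recode values in a DataFrame column whose entries are organized as lists. I know how to replace string values in a DataFrame column, but I am struggling to do the same thing inside a list.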

Here is a snippet of my data:

{0: '[Crime, Drama]',
 1: '[Crime, Drama]',
 2: '[Crime, Drama]',
 3: '[Action, Crime, Drama, Thriller]',
 4: '[Crime, Drama]',
 5: '[Biography, Drama, History]',
 6: '[Crime, Drama]',
 7: '[Adventure, Drama, Fantasy]',
 8: '[Western]',
 9: '[Drama]'}

For example, I want to recode every Crime as Thriller and every Biography as History.

I know that the following works for replacing string values:

df.loc[df['genre'] == 'Crime', 'genre'] = 'Thriller'

But how do I adapt this for lists?

Thanks!

Edit: the code used to create this DataFrame (using data pulled from the IMDB database) is:

import numpy as np
import pandas as pd

# these are the variables we want to (i.e. are able to) extract from the movie object
metadata = ('title', 'rating', 'genre', 'plot', 'language', 'runtime', 'year', 'color', 'country', 'votes')
# create a DataFrame with the variable names as column headers
df = pd.DataFrame(np.random.randn(250, len(metadata)), columns=metadata)
# the values have different data types, including lists; casting to object lets them be assigned
df = df.astype('object')
# populate df from the movie objects (movies_list is built earlier and is not shown here)
for i in range(250):
    for j in metadata:
        df.loc[i, j] = movies_list[i].get(j)
# convert to the right data types:
metadata_dict_dtypes = {"title": str,
                        "rating": float,
                        "genre": list,
                        "plot": str,
                        "language": list,
                        "runtime": list,
                        "year": int,
                        "color": list,
                        "country": list,
                        "votes": int}
for colname, my_dtype in metadata_dict_dtypes.items():
    df[colname] = df[colname].astype(my_dtype)

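This assumes the column really holds lists in the DataFrame. You can write a function that takes a row and a mapping of genre-name changes as arguments, then apply it to the DataFrame. For example: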

name_map = {'Crime': 'Thriller', 'Biography': 'History'}
def change_names(row, name_map):
    for name in name_map:
        if name in row.genre:
            row.genre[row.genre.index(name)] = name_map[name]
    return row
df = df.apply(lambda row: change_names(row, name_map), axis=1)

It is not vectorized, but it gets the job done.
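If the row-wise apply becomes a bottleneck, one alternative is to flatten the lists, recode, and regroup. The following is only a sketch under a few assumptions: the sample frame is made up for illustration, the `genre` column holds real Python lists, and `Series.explode` is available (pandas 0.25 or later).

```python
import pandas as pd

# Hypothetical sample frame for illustration; 'genre' holds real Python lists.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": [["Crime", "Drama"], ["Biography", "Drama", "History"], ["Western"]],
})
name_map = {"Crime": "Thriller", "Biography": "History"}

df["genre"] = (
    df["genre"]
    .explode()           # one row per genre, original index repeated
    .replace(name_map)   # recode every matching value in one call
    .groupby(level=0)    # regroup by the original row index
    .agg(list)           # rebuild one list per row
)
print(df)
```

Note that recoding can create duplicates within a row (Biography becomes History next to an existing History), and an empty list would come back as a list containing NaN, so both cases may need extra handling.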

Consider doing the update with a list comprehension. The example below uses a single-column DataFrame of genre lists.

import pandas as pd

df = pd.DataFrame({'Genre': [['Crime', 'Drama'],
                             ['Crime', 'Drama'],
                             ['Crime', 'Drama'],
                             ['Action', 'Crime', 'Drama', 'Thriller'],
                             ['Crime', 'Drama'],
                             ['Biography', 'Drama', 'History'],
                             ['Crime', 'Drama'],
                             ['Adventure', 'Drama', 'Fantasy'],
                             ['Western'],
                             ['Drama']]})    
print(df)
#                               Genre
# 0                    [Crime, Drama]
# 1                    [Crime, Drama]
# 2                    [Crime, Drama]
# 3  [Action, Crime, Drama, Thriller]
# 4                    [Crime, Drama]
# 5       [Biography, Drama, History]
# 6                    [Crime, Drama]
# 7       [Adventure, Drama, Fantasy]
# 8                         [Western]
# 9                           [Drama]
df['Genre'] = [['Thriller' if i=='Crime' else i for i in m] for m in df['Genre']]
print(df)
#                                  Genre
# 0                    [Thriller, Drama]
# 1                    [Thriller, Drama]
# 2                    [Thriller, Drama]
# 3  [Action, Thriller, Drama, Thriller]
# 4                    [Thriller, Drama]
# 5          [Biography, Drama, History]
# 6                    [Thriller, Drama]
# 7          [Adventure, Drama, Fantasy]
# 8                            [Western]
# 9                              [Drama]
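The same comprehension can cover several recodings at once by looking each genre up in a dictionary. The sketch below is an illustration, not a definitive solution: `to_list` is a hypothetical helper added here in case some cells hold strings such as '[Crime, Drama]' (as the snippet in the question suggests) rather than real lists, and the small frame is made-up sample data.

```python
import pandas as pd

name_map = {"Crime": "Thriller", "Biography": "History"}

def to_list(value):
    """Hypothetical helper: return a list of genres from a list or a '[A, B]'-style string."""
    if isinstance(value, str):
        inner = value.strip().strip("[]")
        return [part.strip() for part in inner.split(",") if part.strip()]
    return list(value)

# Made-up sample mixing string-formatted and real list cells.
df = pd.DataFrame({"Genre": ["[Crime, Drama]",
                             ["Biography", "Drama", "History"],
                             "[Western]"]})

# name_map.get(g, g) returns the replacement if one exists, otherwise the genre itself.
df["Genre"] = [[name_map.get(g, g) for g in to_list(m)] for m in df["Genre"]]
print(df["Genre"].tolist())
```

Using `dict.get` with the genre as its own default keeps unmapped genres unchanged, so adding another recoding only means adding an entry to `name_map`.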
