How do I remove the first 8 characters from specific columns of a pandas DataFrame without knowing their exact names?



I have a pandas DataFrame created as follows:

df = pandas.DataFrame({"imdbPage": emptyWebPageSet,
                       "title": emptySetTitle,
                       "genre1": lst1,
                       "genre2": lst2,
                       "genre3": lst3,
                       "genre4": lst4,
                       "info": infoSet,
                       "Runtime(mins)": movieTime,
                       "releaseData": releaseDateSet,
                       "imdbRating": ratingSet,
                       "numberOfVotes": votesList,
                       "numberOfEpisodes": noOfEpisodesSet,
                       "TotalRunTime(mins)": totalRunTimeSet
                       })
df = pandas.get_dummies(data=df, columns=['genre1', 'genre2', 'genre3', 'genre4'])

The column headers in the output look like this:

output = ["imdbPage", "title", "info", "Runtime(mins)", "releaseData", "imdbRating", "numberOfVotes",
          "numberOfEpisodes", "genre1_Action", "genre1_Adventure", "genre1_Animation",
          "genre1_Biography", "genre1_Comedy", ...]  # etc.

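What I want to do is remove every "genre1_", "genre2_", ... prefix from the column names in the output. I don't know the exact column names or how many there are; I only know that they start with "genre1_", "genre2_", "genre3_" or "genre4_".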

Use str.replace:

import pandas as pd
output = ["imdbPage", "title", "info", "Runtime(mins)", "releaseData", "imdbRating", "numberOfVotes",
"numberOfEpisodes", "genre1_Action", "genre1_Adventure", "genre1_Animation", "genre1_Biography",
"genre1_Comedy"]
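# \d+ matches the digit(s) after "genre"; regex=True is required on recent pandas versions
print(pd.Series(data=output).str.replace(r'^genre\d+_', '', regex=True))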

Output

0             imdbPage
1                title
2                 info
3        Runtime(mins)
4          releaseData
5           imdbRating
6        numberOfVotes
7     numberOfEpisodes
8               Action
9            Adventure
10           Animation
11           Biography
12              Comedy
dtype: object
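
The same replacement can be applied to the DataFrame's column labels directly rather than to a separate Series. The following is only a minimal sketch: the small `df` is a hypothetical stand-in for the DataFrame in the question, with made-up titles and genres.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's DataFrame
df = pd.DataFrame({
    "title": ["A", "B"],
    "genre1": ["Action", "Comedy"],
    "genre2": ["Drama", "Horror"],
})
df = pd.get_dummies(data=df, columns=["genre1", "genre2"])

# Strip the "genreN_" prefix from every column label that has one;
# labels without the prefix are left unchanged
df.columns = df.columns.str.replace(r"^genre\d+_", "", regex=True)
print(list(df.columns))
```

Assigning to `df.columns` renames in place, so no mapping from old names to new names is needed.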

You can try the following (see the reference here):

import re

newcols = {}
for col in df.columns:
    # re.match returns None for columns without a genreN_ prefix, so skip those
    m = re.match(r"^genre\d+_(.*)$", col)
    if m:
        newcols[col] = m.group(1)
df.rename(columns=newcols, inplace=True)
print(df)

Or, more concisely:

df.rename(columns=lambda x: re.sub(r"^genre\d+_", "", x), inplace=True)
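
One thing worth checking after stripping the prefixes: if the same genre appears under more than one of genre1 to genre4, the renamed columns will share a label. The sketch below illustrates this on hypothetical toy data; `GENRE_PREFIX` and `strip_genre_prefix` are illustrative names, not part of pandas.

```python
import re
import pandas as pd

# Illustrative helper names (assumptions, not from the original post)
GENRE_PREFIX = re.compile(r"^genre\d+_")

def strip_genre_prefix(name):
    """Return the column name without a leading 'genreN_' prefix."""
    return GENRE_PREFIX.sub("", name)

# Hypothetical toy data in which both genre columns share the same values
df = pd.DataFrame({
    "title": ["A", "B"],
    "genre1": ["Action", "Comedy"],
    "genre2": ["Comedy", "Action"],
})
df = pd.get_dummies(data=df, columns=["genre1", "genre2"])

# rename accepts a function and applies it to every column label
renamed = df.rename(columns=strip_genre_prefix)

# True when two or more columns ended up with the same label
has_duplicates = bool(renamed.columns.duplicated().any())
print(list(renamed.columns), has_duplicates)
```

If `has_duplicates` is true, selecting `renamed["Action"]` returns a DataFrame with several columns rather than a single Series, so the duplicates may need to be combined or kept distinct depending on the analysis.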
