I have a pandas DataFrame created as follows:
df = pandas.DataFrame({"imdbPage": emptyWebPageSet,
"title": emptySetTitle,
"genre1": lst1,
"genre2": lst2,
"genre3": lst3,
"genre4": lst4,
"info":infoSet,
"Runtime(mins)":movieTime,
"releaseData":releaseDateSet,
"imdbRating":ratingSet,
"numberOfVotes":votesList,
"numberOfEpisodes":noOfEpisodesSet,
"TotalRunTime(mins)":totalRunTimeSet
})
df = pandas.get_dummies(data=df, columns=['genre1', 'genre2', 'genre3', 'genre4'])
The column headers in the output look like this:
output = ["imdbPage", "title", "info", "Runtime(mins)", "releaseData", "imdbRating", "numberOfVotes",
"numberOfEpisodes", "genre1_Action", "genre1_Adventure", "genre1_Animation",
"genre1_Biography", "genre1_Comedy".... etc]
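What I want to do is strip the "genre1_", "genre2_" parts from the output, but I obviously don't know the exact column names or how many there are. I only know that they start with "genre1_", "genre2_", "genre3_" or "genre4_".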
Use str.replace:
import pandas as pd
output = ["imdbPage", "title", "info", "Runtime(mins)", "releaseData", "imdbRating", "numberOfVotes",
"numberOfEpisodes", "genre1_Action", "genre1_Adventure", "genre1_Animation", "genre1_Biography",
"genre1_Comedy"]
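# \d+ matches the digits after "genre"; regex=True is required on recent pandas versions
print(pd.Series(data=output).str.replace(r'^genre\d+_', '', regex=True))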
Output:
0 imdbPage
1 title
2 info
3 Runtime(mins)
4 releaseData
5 imdbRating
6 numberOfVotes
7 numberOfEpisodes
8 Action
9 Adventure
10 Animation
11 Biography
12 Comedy
dtype: object
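The same replacement can also be applied directly to the DataFrame's column index instead of a separate list. A minimal sketch, assuming a small stand-in DataFrame (the toy values below are made up for illustration and are not the asker's data):

```python
import pandas as pd

# Toy stand-in for the real DataFrame (values are made up for illustration)
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre1": ["Action", "Comedy", "Action"],
    "genre2": ["Drama", "Drama", "Thriller"],
})
df = pd.get_dummies(data=df, columns=["genre1", "genre2"])

# Strip the "genre<digits>_" prefix from every column name that has it;
# names without the prefix (e.g. "title") are left untouched
df.columns = df.columns.str.replace(r"^genre\d+_", "", regex=True)
print(list(df.columns))
```

Assigning to df.columns avoids building an intermediate Series and renames the columns in one step.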
You can try the following:
import re

newcols = {}
for col in df.columns:
    # Strip a leading "genre<digits>_" prefix; columns without it keep their name
    newcols[col] = re.sub(r"^genre\d+_", "", col)
df.rename(columns=newcols, inplace=True)
print(df)
Or, more concisely:
df.rename(columns=lambda x: re.sub(r"^genre\d+_", "", x), inplace=True)
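One thing to watch for: once the prefixes are stripped, a genre that appears in more than one of the genre1 to genre4 slots produces duplicate column names. The sketch below shows one possible way to collapse them, assuming a title should be flagged for a genre if it appears in any slot. The merging step and the toy data are assumptions for illustration, not part of the answers above.

```python
import re
import pandas as pd

# Toy stand-in data (made up); "Comedy" shows up in both genre slots
df = pd.DataFrame({
    "title": ["A", "B"],
    "genre1": ["Action", "Comedy"],
    "genre2": ["Comedy", "Drama"],
})
df = pd.get_dummies(data=df, columns=["genre1", "genre2"])
df = df.rename(columns=lambda x: re.sub(r"^genre\d+_", "", x))
print(list(df.columns))  # "Comedy" now appears twice

# Collapse duplicate genre columns: transpose so the names become the index,
# group identical names, take the max (1 if any slot had the genre), transpose back
genres = df.drop(columns=["title"]).astype(int)
merged = genres.T.groupby(level=0).max().T

result = pd.concat([df[["title"]], merged], axis=1)
print(result)
```

If the duplicates should stay separate instead, keeping the original prefixed names is probably the safer choice.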