Remove common words from the headers of a pandas DataFrame



Suppose I have the following DataFrame:

import pandas as pd
data = [['Mallika', 23, 'Student'], ['Yash', 25, 'Tutor'], ['Abc', 14, 'Clerk']]
data_frame = pd.DataFrame(data, columns=['Student.first.name.word', 'Student.Current.Age.word', 'Student.Current.Profession.word'])
   Student.first.name.word  Student.Current.Age.word  Student.Current.Profession.word
0                  Mallika                        23                          Student
1                     Yash                        25                            Tutor
2                      Abc                        14                            Clerk

How would I remove the common column header words "Student" and "word"

so that I get the following DataFrame:

   first.name  Current.Age  Current.Profession
0     Mallika           23             Student
1        Yash           25               Tutor
2         Abc           14               Clerk

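You can use a regular expression to remove these words and the dots from the column names, then assign the result back:

data_frame.columns = data_frame.columns.str.replace(r"(Student|word|\.)", "", regex=True)

The dot has to be escaped, because an unescaped . matches any character and would empty every header. regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default. For headers of the form Student.name.word this gives

>>> data_frame
      name  Age  Profession
0  Mallika   23     Student
1     Yash   25       Tutor
2      Abc   14       Clerk

With the headers shown in the question, the same call yields firstname, CurrentAge and CurrentProfession, because the dots between the remaining words are removed as well.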
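
If the dots between the remaining words should survive, one option is to anchor the pattern so that only the leading Student. and the trailing .word are removed. This is a minimal sketch, assuming every header starts and ends with exactly those two tokens:

```python
import pandas as pd

data = [['Mallika', 23, 'Student'], ['Yash', 25, 'Tutor'], ['Abc', 14, 'Clerk']]
data_frame = pd.DataFrame(
    data,
    columns=['Student.first.name.word',
             'Student.Current.Age.word',
             'Student.Current.Profession.word'],
)

# ^ and $ pin each alternative to one end of the header; \. is a literal dot.
data_frame.columns = data_frame.columns.str.replace(
    r"^Student\.|\.word$", "", regex=True
)

print(list(data_frame.columns))
# ['first.name', 'Current.Age', 'Current.Profession']
```

Because of the anchors, a Student or word that appears in the middle of a header is left alone.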

Update

You can split, slice and join:

data_frame.columns = data_frame.columns.str.split(r".").str[1:-1].str.join(".")

That is, split on the literal dot, drop the first and last elements, and join what remains with a dot

to get

   first.name  Current.Age  Current.Profession
0     Mallika           23             Student
1        Yash           25               Tutor
2         Abc           14               Clerk
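
To see what each step of the split, slice and join chain produces, the intermediate results can be printed one at a time. This is only an illustrative sketch; the names parts, inner and joined are made up for the example:

```python
import pandas as pd

cols = pd.Index(['Student.first.name.word', 'Student.Current.Age.word'])

parts = cols.str.split(".")      # each header becomes a list of tokens
inner = parts.str[1:-1]          # drop the first and the last token
joined = inner.str.join(".")     # glue the remaining tokens back together

print(parts[0])      # ['Student', 'first', 'name', 'word']
print(inner[0])      # ['first', 'name']
print(list(joined))  # ['first.name', 'Current.Age']
```

The slice works on position only, so it removes the first and last token whatever they happen to be.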

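Here is an extension of my answer for removing a common prefix. The advantage of this approach is that it finds the prefix and the suffix in a generic way, so no pattern has to be hard-coded.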

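import os
import re
cols = data_frame.columns
common_prefix = os.path.commonprefix(cols.tolist())
common_suffix = os.path.commonprefix([col[::-1] for col in cols])[::-1]
# escape the dots and anchor each part to its own end of the name
pattern = f"^{re.escape(common_prefix)}|{re.escape(common_suffix)}$"
data_frame.columns = cols.str.replace(pattern, "", regex=True)

For headers of the form Student.name.word this gives

      name  Age  Profession
0  Mallika   23     Student
1     Yash   25       Tutor
2      Abc   14       Clerk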

Update: for the updated question, the same solution works in a generic way:

   first.name  Current.Age  Current.Profession
0     Mallika           23             Student
1        Yash           25               Tutor
2         Abc           14               Clerk
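
One caveat: os.path.commonprefix compares character by character, so when the first differing words share their leading letters it can cut into a word. For Student.first.name.word and Student.final.grade.word, for example, it returns Student.fi. A possible workaround is to compare whole dot-separated tokens instead. The sketch below assumes headers are dot-separated; shared_token_count and strip_common_affixes are hypothetical helper names, not part of pandas or the standard library:

```python
def shared_token_count(token_lists):
    """Count how many leading tokens are identical across all lists."""
    count = 0
    for group in zip(*token_lists):
        if len(set(group)) != 1:
            break
        count += 1
    return count


def strip_common_affixes(names, sep="."):
    """Drop the leading and trailing tokens shared by every name."""
    tokens = [name.split(sep) for name in names]
    n_pre = shared_token_count(tokens)
    # Reversing each token list turns the shared suffix into a shared prefix.
    n_suf = shared_token_count([t[::-1] for t in tokens])
    return [sep.join(t[n_pre:len(t) - n_suf]) for t in tokens]


headers = ['Student.first.name.word', 'Student.final.grade.word']
print(strip_common_affixes(headers))
# ['first.name', 'final.grade']
```

It could be applied with data_frame.columns = strip_common_affixes(data_frame.columns). If all names are identical, every token is shared and the result is a list of empty strings.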

To remove all common words, not just hard-coded ones, you can try

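from functools import reduce

df = data_frame
# split every header into its words
common_words = [i.split(".") for i in df.columns.tolist()]
# keep only the words that occur in every header
common_words = reduce(lambda x, y: set(x).intersection(y), common_words)
pat = r'\b(?:{})\b'.format('|'.join(common_words))
# remove those words, then drop the first and last character (the leftover dots)
df.columns = df.columns.str.replace(pat, "", regex=True).str[1:-1]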

Output:

print(df)

   first.name  Current.Age  Current.Profession
0     Mallika           23             Student
1        Yash           25               Tutor
2         Abc           14               Clerk
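
The same idea can also be written without a regular expression, by filtering the shared words out of each header's token list. This avoids escaping special characters and does not depend on the shared words sitting at the ends. It is a sketch under the assumption that headers are dot-separated and that there is at least one column; drop_shared_tokens is a hypothetical helper name:

```python
import pandas as pd


def drop_shared_tokens(names, sep="."):
    """Remove every token that occurs in all of the given names."""
    token_lists = [name.split(sep) for name in names]
    # Tokens present in the first header and in every other header.
    shared = set(token_lists[0]).intersection(*token_lists[1:])
    return [
        sep.join(tok for tok in tokens if tok not in shared)
        for tokens in token_lists
    ]


data = [['Mallika', 23, 'Student'], ['Yash', 25, 'Tutor'], ['Abc', 14, 'Clerk']]
df = pd.DataFrame(data, columns=['Student.first.name.word',
                                 'Student.Current.Age.word',
                                 'Student.Current.Profession.word'])
df.columns = drop_shared_tokens(df.columns)
print(list(df.columns))
# ['first.name', 'Current.Age', 'Current.Profession']
```

With a single column every token counts as shared, so the header would become an empty string.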
