How to convert text to a DataFrame in Python (applying some rules)

I'm new to Python and need to build this fairly complex function, but I don't know how.

I have a DataFrame of text:

RepID     RepText
---------------------------
1         Math Math Math  English Physics Sport Sport English English English English 
2         Sport English English English Math Math Physics Physics Physics Computer Computer Computer Computer 
3         Chemistry Chemistry Math Math Math English English English Math Math Math Math Math Sport Sport

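The function I need to create is called fnClusters.

It simply finds the words that are repeated N times in RepText and returns them in a DataFrame.

If N is 3, any word that appears 3 or more times is counted.

So Math Math Math Math English Physics English English English English Math counts as: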

Math  English  Physics
------------------------
4       0       0

English English English English English Math Math English Math Sport Sport counts as:

Math  English  Sports
------------------------
4       6       0

How can I build this function in Python?

One approach, using pandas.Series.str.split and value_counts:

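import pandas as pd

# Split each RepText on runs of whitespace, then count the words in each row
new_df = df["RepText"].str.strip().str.split(r"\s+").apply(pd.Series.value_counts)
n = 3  # minimum number of occurrences for a word to be kept
print(new_df[new_df.ge(n)].fillna(0))  # counts below n become 0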

Output:

   English  Math  Sport  Physics  Computer  Chemistry
0      5.0   3.0    0.0      0.0       0.0        0.0
1      3.0   0.0    0.0      3.0       4.0        0.0
2      3.0   8.0    0.0      0.0       0.0        0.0
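
If a reusable function is wanted, the same idea can be wrapped up roughly as follows. This is only a sketch: the signature fnClusters(df, n) and the use of collections.Counter are my assumptions, not something given in the question, and like the approach above it counts total occurrences of each word per row rather than consecutive runs.

```python
from collections import Counter

import pandas as pd


def fnClusters(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Count words in each RepText row; counts below n are set to 0.

    Hypothetical signature: the question only names the function.
    """
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so no empty tokens appear.
    counts = [Counter(text.split()) for text in df["RepText"]]
    # One row per report, one column per distinct word; absent words become 0.
    table = pd.DataFrame(counts, index=pd.Index(df["RepID"], name="RepID"))
    table = table.fillna(0).astype(int)
    # Keep counts that reach the threshold, zero out the rest.
    return table.where(table >= n, 0)


# Sample data copied from the question.
df = pd.DataFrame({
    "RepID": [1, 2, 3],
    "RepText": [
        "Math Math Math  English Physics Sport Sport English English English English ",
        "Sport English English English Math Math Physics Physics Physics Computer Computer Computer Computer ",
        "Chemistry Chemistry Math Math Math English English English Math Math Math Math Math Sport Sport",
    ],
})

result = fnClusters(df, n=3)
print(result)
```

Here the result is indexed by RepID and holds integers instead of floats, but the counts should agree with the table above, for example English is 5 and Math is 3 for RepID 1.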
