Find the set of rows with the most instances of some feature in a column

I have the following table:

word      count  feature
my        0      pronoun
favorite  0      preferences
food      0      object
is        0      being
ice       0      dessert
cream     1      dessert

Have you tried the groupby() function? In this case:

In [1]: df.groupby(["feature", "word"]).size()
Out[1]:
feature      word
being        is          1
dessert      cream       1
             ice         1
object       food        1
preferences  favorite    1
pronoun      my          1
dtype: int64
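As a rough sketch of how groupby() can be pushed one step further, the snippet below rebuilds the six-row example inline (an assumption, so that it runs without a file) and counts rows per feature across the whole table. The names `feature_counts` and `top_feature` are illustrative only. Note that this counts over the entire table, not over a window of consecutive rows.

```python
import pandas as pd

# The six-row example from the question, built inline so the snippet is self-contained.
df = pd.DataFrame({
    "word": ["my", "favorite", "food", "is", "ice", "cream"],
    "count": [0, 0, 0, 0, 0, 1],
    "feature": ["pronoun", "preferences", "object", "being", "dessert", "dessert"],
})

# Number of rows for each feature over the whole table.
feature_counts = df.groupby("feature").size()

# The feature that occurs most often, and how many rows it has.
top_feature = feature_counts.idxmax()
print(top_feature, feature_counts[top_feature])
```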

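Start by using the pandas library. It provides vectorized implementations of many of the functions you will end up needing, so it is much faster than looping over hundreds of rows yourself.

First, read the CSV file into a pandas DataFrame: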

import pandas as pd

df = pd.read_csv('csv_file.csv')

For the given example, this produces a DataFrame that looks like this:

       word  count      feature
0        my      0      pronoun
1  favorite      0  preferences
2      food      0       object
3        is      0        being
4       ice      0      dessert
5     cream      1      dessert
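To reproduce this frame without a file on disk, one option is to feed the same rows to read_csv from an in-memory string. This is only a sketch: the comma-separated layout and the header names `word`, `count`, and `feature` are assumptions based on the printed frame above.

```python
import io

import pandas as pd

# Assumed CSV layout: a header row followed by the six example rows.
csv_text = """word,count,feature
my,0,pronoun
favorite,0,preferences
food,0,object
is,0,being
ice,0,dessert
cream,1,dessert
"""

# read_csv accepts any file-like object, so a StringIO stands in for the file.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```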

Now define a function that takes a row and counts how many times a keyword appears in the next 100 rows:

def count_in_next_100(row, keyword):
    # The index is the default numeric one, so row.name is the row number.
    row_index = row.name
    # .loc slices by label and includes both endpoints, so
    # row_index:row_index + 99 covers 100 rows starting at this row.
    window = df.loc[row_index:row_index + 99, "feature"]
    # Comparing with keyword gives True/False values; their sum is the number of matches.
    return (window == keyword).sum()

Next, apply this function to every row of the DataFrame, i.e. with axis=1:

count_dessert = df.apply(count_in_next_100, axis=1, args=("dessert",))

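Then count_dessert.idxmax() gives you the row number whose following 100 rows contain the most occurrences of dessert. I'll leave the "find the top 3" part as an exercise for you; let me know if you need help.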
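For larger tables, a vectorized variant may be worth trying. The sketch below is not the method from the answer above: it swaps the row-wise apply for a rolling sum, using the six example features and a window of 3 rows in place of 100 so the toy data stays readable. The names `window`, `is_dessert`, `forward`, `best_row`, and `top3` are illustrative.

```python
import pandas as pd

# Feature column of the six-row example, built inline so the snippet is self-contained.
df = pd.DataFrame({
    "feature": ["pronoun", "preferences", "object", "being", "dessert", "dessert"],
})

window = 3  # the answer uses 100; 3 is an assumption that suits the toy data

# 1 where the feature matches the keyword, 0 elsewhere.
is_dessert = (df["feature"] == "dessert").astype(int)

# rolling() looks backwards, so reverse the series, take a rolling sum,
# and reverse again to get a forward-looking count for each row.
# min_periods=1 lets the last rows use a shorter window instead of NaN.
forward = is_dessert[::-1].rolling(window, min_periods=1).sum()[::-1]

# Row whose window (starting at that row) holds the most matches.
best_row = forward.idxmax()

# The three rows with the highest forward-looking counts.
top3 = forward.nlargest(3)
print(best_row)
print(top3)
```

Because the windows overlap, the top rows will often be neighbours that describe almost the same stretch of the table. If distinct stretches are wanted, some extra filtering of `top3` would be needed.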
