我有一个熊猫数据框,DF,看起来像这样:
column1
0 apple is a fruit
1 fruit sucks
2 apple tasty fruit
3 fruits what else
4 yup apple map
5 fire in the hole
6 that is true
我想产生一个列2,这是行中每个单词的列表,以及整列中每个单词的总数。因此,输出将是这样的。
column1 column2
0 apple is a fruit [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1 fruit sucks [('fruit', 3),('sucks', 1)]
我尝试使用Sklearn,但无法实现上述。需要帮助。
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['text'])
这是一种给出您想要的结果的一种方法,尽管完全避免了sklearn
:
def counts(data, column):
full_list = []
datr = data[column].tolist()
total_words = " ".join(datr).split(' ')
# per rows
for i in range(len(datr)):
#first per row get the words
word_list = re.sub("[^w]", " ", datr[i]).split()
#cycle per word
total_row = []
for word in word_list:
count = []
count = total_words.count(word)
val = (word, count)
total_row.append(val)
full_list.append(total_row)
return full_list
df['column2'] = counts(df,'column1')
df
column1 column2
0 apple is a fruit [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1 fruit sucks [(fruit, 3), (sucks, 1)]
2 apple tasty fruit [(apple, 3), (tasty, 1), (fruit, 3)]
3 fruits what else [(fruits, 1), (what, 1), (else, 1)]
4 yup apple map [(yup, 1), (apple, 3), (map, 1)]
5 fire in the hole [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6 that is true [(that, 1), (is, 2), (true, 1)]
我不知道您是否可以使用scikit-learn
进行此操作,但是您可以编写功能,然后使用apply()
将其应用于DataFrame
或Series
。
这是您的示例的方法:
test = pd.DataFrame(['apple is a fruit', 'fruit sucks', 'apple tasty fruit'], columns = ['A'])
def a_function(row):
splitted_row = str(row.values[0]).split()
word_occurences = []
for word in splitted_row:
column_occurences = test.A.str.count(word).sum()
word_occurences.append((word, column_occurences))
return word_occurences
test.apply(a_function, axis = 1)
# Output
0 [(apple, 2), (is, 1), (a, 4), (fruit, 3)]
1 [(fruit, 3), (sucks, 1)]
2 [(apple, 2), (tasty, 1), (fruit, 3)]
dtype: object
您可以看到,主要问题是test.A.str.count(word)
将计算word
的所有出现,而分配给word
的模式在字符串中。这就是为什么"a"
显示为4次的原因。这可能应该用一些言论轻松修复(我不太擅长)。
,如果您愿意丢失一些单词,可以在上面的功能中使用此解决方法:
if word not in ['a', 'is']: # you can add here more useless words
word_occurences.append((word, column_occurences))