Creating a term frequency matrix from a Python DataFrame



I'm doing some natural language processing on Twitter data. I've managed to successfully load and clean up some tweets into the DataFrame below.

id                    text                                                                          
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t

The problem is that I'm trying to build a term frequency matrix where each row is a tweet and each column is how many times that word occurs in that particular row. My only issue is that other posts only cover term frequency distributions of text files. Here is the code I used to generate the DataFrame above:

import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

df_tweetText = df_tweet
# Makes a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '', regex=True).str.lower())
# Removing stop words
# nltk.download('stopwords')
stop = stopwords.words('english')
# df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
# Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}", '', regex=True, inplace=False)
# Tokenize the words
df_tweetText
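For reference, the tokenization step the last comment refers to is the same apply(word_tokenize) call I describe trying further down; a minimal sketch, assuming the text column has already been cleaned:

from nltk.tokenize import word_tokenize

# Split each cleaned tweet string into a list of word tokens
df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize)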

At first I tried using the function word_dist = nltk.FreqDist(df_tweetText['text']), but that ends up counting each whole sentence instead of each word in the row.

Another thing I tried was df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then calling FreqDist again, but that gives me an error saying unhashable type: 'list' (see the sketch after the sample below):

1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
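From what I can tell, the error happens because FreqDist hashes each element it counts; after apply(word_tokenize) every element of the column is a list, and lists are not hashable. A minimal sketch of two workarounds, assuming the tokenized column shown above:

from nltk.probability import FreqDist

# Count words within each tweet (one FreqDist per row)
per_row_counts = df_tweetText['text'].apply(FreqDist)

# Or flatten the token lists to count words across the whole corpus
all_words = [word for tokens in df_tweetText['text'] for word in tokens]
word_dist = FreqDist(all_words)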

Is there some alternative way of building this term frequency matrix? Ideally I'd like my data to look like this:

id                  | collusion | president
--------------------|-----------|----------
1104159474368024599 |     1     |     0
1104155456019357703 |     0     |     2
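For reference, one pandas-only sketch that seems to produce exactly this shape from the tokenized column, assuming df_tweetText is indexed by tweet id and pandas >= 0.25 for Series.explode:

import pandas as pd

# Explode each token list into one row per (tweet id, word) pair,
# then cross-tabulate ids against words to get per-tweet counts
exploded = df_tweetText['text'].explode()
term_matrix = pd.crosstab(exploded.index, exploded)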

Edit 1: So I decided to take a look at textmining and recreated one of its examples. The only problem is that it builds the term-document matrix with every tweet lumped into a single row.

import textmining

# Creates the term-document matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
# Note: this loops over the *columns* of df_tweetText and adds the entire
# 'text' column as one big string each time, not one document per tweet
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
#    print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
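A sketch of what I suspect the loop should have been, adding one document per tweet instead of the whole column at once (this assumes 'text' holds one cleaned string per tweet; join the token lists first if the column was already tokenized):

import textmining

tweetDocumentmatrix = textmining.TermDocumentMatrix()
# One add_doc call per tweet so each tweet becomes its own row
for text in df_tweetText['text']:
    tweetDocumentmatrix.add_doc(text)

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)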

Edit 2: So I tried sklearn, and that kind of worked, but the problem is that I'm finding Chinese/Japanese characters in my columns that shouldn't exist. Also, for some reason my columns are showing up as numbers.

from sklearn.feature_extraction.text import CountVectorizer

corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
# On scikit-learn >= 1.0 use vec.get_feature_names_out() instead
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)
      00  007cigarjoe  08  10  100  1000  10000  100000  1000000  10000000  
0      0            0   0   0    0     0      0       0        0         0   
1      0            0   0   0    0     0      0       0        0         0   
2      0            0   0   0    0     0      0       0        0         0  
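A sketch of how the numeric and non-Latin columns could be filtered out, by restricting CountVectorizer's token_pattern to alphabetic tokens and reattaching the tweet ids as the index (the particular pattern here is my assumption, not something from the original examples):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = df_tweetText['text'].tolist()
# Only keep tokens made of two or more ASCII letters, which drops the
# purely numeric columns and any CJK characters from the vocabulary
vec = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(),
                  columns=vec.get_feature_names_out(),  # scikit-learn >= 1.0
                  index=df_tweetText.index)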

Probably not optimal since it iterates over every row, but it works. Mileage may vary based on how long the tweets are and how many tweets are being processed.

import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]

# result dataframe: one row of per-word counts for each tweet
df2 = pd.DataFrame()
for i, row in df.iterrows():
    word_counts = pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose()
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    df2 = pd.concat([df2, word_counts], ignore_index=True)
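Words that never occur in a given tweet come out as NaN after the concatenation; a quick follow-up turns the result into a pure count matrix:

# Replace missing counts with zeros and cast to integers
df2 = df2.fillna(0).astype(int)
print(df2)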
