I'm doing some natural language processing on some Twitter data. I managed to successfully load and clean up some tweets and put them into the data frame below.
id text
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t
The problem is that I'm trying to build a term-frequency matrix where each row is a tweet and each column is the count of that word in that particular row. My only problem is that every other post I've found covers term frequency over a text file instead. Here's the code I used to generate the data frame above:
import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

df_tweetText = df_tweet

# Make a dataframe of just the text and ID to make it easier to tokenize;
# strip everything that is not a word character or whitespace
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '', regex=True).str.lower())

# Removing stop words
# nltk.download('stopwords')
stop = stopwords.words('english')
# df_tweetText['text'] = df_tweetText['text'].apply(lambda x: [item for item in x if item not in stop])

# Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace(r'http\S+', '', regex=True)

# Tokenize the words
df_tweetText
At first I tried using the function word_dist = nltk.FreqDist(df_tweetText['text']), but that ends up counting each whole sentence as a value instead of each word in the row.
The other thing I tried was df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then calling FreqDist again, but that gives me an error saying unhashable type: 'list'.
1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
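The unhashable-list error comes from FreqDist trying to hash each row of the Series, which fails once the rows are lists of tokens. Counting per row instead sidesteps it; here is a minimal sketch on two made-up tweets, using Counter as a stand-in for FreqDist (FreqDist subclasses Counter, so the same .apply pattern works with either):

```python
from collections import Counter

import pandas as pd

# Hypothetical mini-frame standing in for df_tweetText
df_tweetText = pd.DataFrame(
    {"text": ["no collusion and i did not tell him to lie",
              "president trump and first lady melania trump"]}
)

# Counting within each row instead of over the whole Series avoids
# hashing the token lists themselves
tokens = df_tweetText["text"].str.split()
per_tweet_counts = tokens.apply(Counter)
print(per_tweet_counts.iloc[1]["trump"])  # 2
```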
Is there some alternative way of constructing this term-frequency matrix? Ideally, I want my data to look like this:
id                  | collusion | president
--------------------|-----------|----------
1104159474368024599 |         1 |         0
1104155456019357703 |         0 |         2
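One way to get exactly that shape with plain pandas, sketched here on a hypothetical two-tweet frame, is to split each tweet into tokens, explode to one token per row, and cross-tabulate id against token:

```python
import pandas as pd

# Hypothetical mini-frame; ids mirror the tweets above
df = pd.DataFrame({
    "id": [1104159474368024599, 1104155456019357703],
    "text": ["no collusion and i did not tell him to lie",
             "president trump and first lady melania trump"],
})

# One row per (id, token), then count occurrences per id
tokens = df.assign(token=df["text"].str.split()).explode("token")
tf = pd.crosstab(tokens["id"], tokens["token"])
print(tf.loc[1104155456019357703, "trump"])  # 2
```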
EDIT 1: So I decided to take a look at the textmining library and recreated one of its examples. The only problem is that it builds the term-document matrix with the text of every tweet lumped into each row instead of one tweet per row.
import textmining

# Creates the term-document matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
    # print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
EDIT 2: So I tried sklearn, and it kind of worked, but the problem is that I'm finding Chinese/Japanese characters in columns that shouldn't exist. Also, for some reason my columns are showing up as numbers.
from sklearn.feature_extraction.text import CountVectorizer

corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)
00 007cigarjoe 08 10 100 1000 10000 100000 1000000 10000000
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
Probably not optimal, since it iterates over each row, but it works. Your mileage may vary based on how long the tweets are and how many tweets are being processed.
import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]

# result dataframe; DataFrame.append was removed in pandas 2.0,
# so collect the per-row frames and concatenate them once
rows = []
for i, row in df.iterrows():
    rows.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
df2 = pd.concat(rows)
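If the per-row concatenation gets slow, one alternative under the same assumptions is to build the whole frame from a list of Counters in a single pass, which also fills missing terms with 0 automatically:

```python
from collections import Counter

import pandas as pd

# Same example data as above
df = pd.DataFrame()
df["tweets"] = [["test", "xd"], ["hehe", "xd"], ["sam", "xd", "xd"]]

# One Counter (i.e. dict) per tweet becomes one row per tweet;
# terms absent from a tweet come out as NaN, filled with 0 here
df2 = pd.DataFrame([Counter(t) for t in df["tweets"]]).fillna(0).astype(int)
print(df2["xd"].tolist())  # [1, 1, 2]
```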