Counting the frequency of specific words across several articles in a text file



I want to count, for each article contained in a single text file, the occurrences of a list of words. Each article can be identified because they all start with a common tag, "<p>Advertisement".

Here is a sample of the text file:

"[<p>Advertisement ,   By   TIM ARANGO  ,     SABRINA TAVERNISE   and     CEYLAN YEGINSU    JUNE 28, 2016 
 ,Credit Ilhas News Agency, via Agence France-Presse — Getty Images,ISTANBUL ......]
[<p>Advertisement ,   By  MILAN SCHREUER  and     ALISSA J. RUBIN    OCT. 5, 2016 
 ,  BRUSSELS — A man wounded two police officers with a knife in Brussels around noon 
on Wednesday in what the authorities called “a potential terrorist attack.” ,  
The two ......]" 

What I want to do is count the frequency of each word I have in a csv file (20 words in total) and write the output like this:

  id, attack, war, terrorism, people, killed, said
  article_1, 45, 5, 4, 6, 2, 1
  article_2, 10, 3, 2, 1, 0, 0

The words in the csv are stored like this:

attack
people
killed
attacks
state
islamic

Following the suggestions, I first tried to split the whole text file on the tag <p> and then start counting the words. I then tokenized the resulting list of articles.

Here is what I have written so far:

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# read the list of words to count
opener = open("News_words_most_common.csv")
words = opener.read()
my_pattern = r'\w+'
x = re.findall(my_pattern, words)

# read the articles file, lowercase it and split it into articles on the '<p>' tag
file_open = open("Training_News_6.csv")
files = file_open.read()
r = files.lower()
stops = set(stopwords.words("english"))
words = r.split("<p>")

# tokenize the (stringified) list of articles
string = str(words)
token = word_tokenize(string)
print(token)

Output:

['[', "'", "''", '|', '[', "'", ',', "'advertisement", 
',', 'by', 'milan', 'schreuer'.....']', '|', "''", '\n', "'", ']']

The next step will be to loop over the split articles (now held in a tokenized list of words) and count the frequency of the words from the first file. If you have any advice on how to iterate over them and count, please let me know!

I am using Python 3.5 on Anaconda.

You could try using pandas and sklearn:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# one keyword per line in the vocabulary file
vocabulary = [word.strip() for word in open('vocabulary.txt').readlines()]
# every article starts with the '<p>Advertisement' tag
corpus = open('articles.txt').read().split('<p>Advertisement')
vectorizer = CountVectorizer(min_df=1, vocabulary=vocabulary)
words_matrix = vectorizer.fit_transform(corpus)
df = pd.DataFrame(data=words_matrix.todense(),
                  index=('article_%s' % i for i in range(words_matrix.shape[0])),
                  columns=vectorizer.get_feature_names())  # on newer scikit-learn use get_feature_names_out()
df.index.name = 'id'
df.to_csv('articles.csv')

The file articles.csv:

$ cat articles.csv
id,attack,people,killed,attacks,state,islamic
article_0,0,0,0,0,0,0
article_1,0,0,0,0,0,0
article_2,1,0,0,0,0,0

You could try reading your text file and then splitting it on '<p>' (if, as you say, that tag marks the start of a new article); you then have a list of articles. A simple loop with count will do it.
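
A minimal sketch of that approach, reusing the file names from the question and assuming the keyword file holds one word per line (note that str.count also matches substrings, so 'attack' would also be counted inside 'attacks'):

keywords = [w.strip() for w in open("News_words_most_common.csv") if w.strip()]
articles = open("Training_News_6.csv").read().lower().split("<p>")

for i, article in enumerate(a for a in articles if a.strip()):
    # rough counts: str.count matches substrings too
    counts = {word: article.count(word) for word in keywords}
    print("article_%d" % i, counts)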

I would suggest looking at the nltk module. I am not sure what your end goal is, but nltk has very easy-to-use functions for doing this and more (for example, instead of just looking at how many times a word appears in each article, you can compute its frequency, or even scale it by inverse document frequency, the so-called tf-idf).
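
For instance, here is a small sketch of per-article keyword counts with nltk's FreqDist (which counts whole tokens, so it does not match substrings); the file names are reused from the question:

from nltk import FreqDist, word_tokenize

keywords = [w.strip() for w in open("News_words_most_common.csv") if w.strip()]
articles = open("Training_News_6.csv").read().lower().split("<p>")

for i, article in enumerate(a for a in articles if a.strip()):
    freq = FreqDist(word_tokenize(article))                  # frequency of every token in this article
    print("article_%d" % i, {w: freq[w] for w in keywords})  # keep only the keywords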

Maybe I have not understood the task very well…

If you are doing text classification, it might be handy to use one of the standard scikit-learn vectorizers, such as the Bag of Words one, which takes a text and returns an array of the words it contains. You can use it directly in a classifier, or output it to csv if you really need the csv. It is already included in scikit-learn and in Anaconda.

Another approach is to do the splitting manually. You can load the data, split it into words, count them, exclude stop words (which ones?) and put the counts into the output file. For example:

    import re
    from collections import Counter
    from nltk.corpus import stopwords

    # count every word in the file, skipping the English stop words
    stop_words = set(stopwords.words('english'))
    txt = open('file.txt', 'r').read()
    words = re.findall('[a-z]+', txt, re.I)
    cnt = Counter(w for w in words if w.lower() not in stop_words)
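
To get from there to the per-article csv layout shown in the question, the same Counter idea can be applied to every '<p>' chunk and written out with the csv module. A sketch, reusing the question's input file names and writing to a hypothetical word_counts.csv:

    import csv
    import re
    from collections import Counter

    keywords = [w.strip() for w in open('News_words_most_common.csv') if w.strip()]
    articles = open('Training_News_6.csv').read().lower().split('<p>')

    with open('word_counts.csv', 'w', newline='') as out:    # hypothetical output file
        writer = csv.writer(out)
        writer.writerow(['id'] + keywords)                   # header: id, attack, people, ...
        for i, article in enumerate(a for a in articles if a.strip()):
            cnt = Counter(re.findall('[a-z]+', article))     # word counts for this article
            writer.writerow(['article_%s' % i] + [cnt[w] for w in keywords])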

How about this:

import re
from collections import Counter

# csv_data mimics the tokenized output above: punctuation, a literal '\n' token, and words
csv_data = [["'", "\\n", ","], ['fox'],
            ['the', 'fox', 'jumped'],
            ['over', 'the', 'fence'],
            ['fox'], ['fence']]
key_words = ['over', 'fox']

# keep only the alphabetic part of every token
words_list = []
for row in csv_data:
    for token in row:
        line_of_words = ",".join(re.findall("[a-zA-Z]+", token))
        words_list.append(line_of_words)

# count every token, then keep just the key words
word_count = Counter(words_list)
match_dict = {}
for aword, word_freq in word_count.items():
    if aword in key_words:
        match_dict[aword] = word_freq

Printing the results:

print('Article words: ', words_list)
print('Article Word Count: ', word_count)
print('Matches: ', match_dict)

gives:

Article words:  ['', 'n', '', 'fox', 'the', 'fox', 'jumped', 'over', 'the', 'fence', 'fox', 'fence']
Article Word Count:  Counter({'fox': 3, '': 2, 'the': 2, 'fence': 2, 'n': 1, 'over': 1, 'jumped': 1})
Matches:  {'over': 1, 'fox': 3}
