Count the occurrences of each word in a file and load them into pandas



How can I count the occurrences of each word in a .txt file, load them into a pandas DataFrame, and sort the DataFrame by the count column?

Using nltk:

# pip install nltk
from nltk.tokenize import RegexpTokenizer
from nltk import FreqDist
import pandas as pd
text = """How do I count the number of occurrences of each word in a .txt file and also load it into the pandas dataframe with columns name and count, also sort the dataframe on column count?"""
# \w+ matches runs of word characters, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
sr = pd.Series(FreqDist(words))

Output:

>>> sr
How            1
do             1
I              1
count          3
the            3
number         1
of             2
occurrences    1
each           1
word           1
in             1
a              1
txt            1
file           1
and            2
also           2
load           1
it             1
into           1
pandas         1
dataframe      2
with           1
columns        1
name           1
sort           1
on             1
column         1
dtype: int64
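
The Series above keeps the words in its index and is not yet sorted. A minimal sketch of turning it into a two-column DataFrame sorted by count, assuming the column names `name` and `count` from the question. The small `counts` dict is only a stand-in for the `FreqDist` result, so the snippet runs without nltk:

```python
import pandas as pd

# Stand-in for FreqDist(words): any mapping of word -> count behaves the same way.
counts = {"How": 1, "count": 3, "the": 3, "of": 2, "dataframe": 2}

sr = pd.Series(counts)

# Name the index "name", then move it into a regular column;
# the Series values become the "count" column.
df = sr.rename_axis("name").reset_index(name="count")

# Sort by the count column, highest first, and renumber the rows.
df = df.sort_values("count", ascending=False, ignore_index=True)
print(df)
```

Passing the real `FreqDist(words)` to `pd.Series` in place of `counts` should work the same way, since `FreqDist` is a dict-like mapping.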

Suppose test.txt contains the following data:

stack monkey zimbra
flow zimbra zimbra help Edit Name
Name

You can do it like this:

import pandas as pd

# Create an empty dictionary
dic = dict()

# Open the file in read mode; the with block closes it when done
with open("test.txt", "r") as text:
    # Loop through each line of the file
    for line in text:
        # Remove leading and trailing whitespace, including the newline
        line = line.strip()

        # Convert the characters in line to
        # lowercase to avoid case mismatch
        line = line.lower()

        # Split the line into words
        words = line.split(" ")

        # Iterate over each word in line
        for word in words:
            # Check if the word is already in dictionary
            if word in dic:
                # Increment count of word by 1
                dic[word] = dic[word] + 1
            else:
                # Add the word to dictionary with count 1
                dic[word] = 1

# Convert dict into a dataframe (named df so the pandas module is not shadowed)
df = pd.DataFrame(dic.items(), columns=['Name', 'Occurrence'])
print(df)

Output:

     Name  Occurrence
0   stack           1
1  monkey           1
2  zimbra           3
3    flow           1
4    help           1
5    edit           1
6    name           2
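
The output above is in first-seen order, not sorted. For comparison, here is a more compact sketch of the same counting using `collections.Counter` from the standard library, followed by the sort on the count column that the question asks for. The helper name `count_words` is hypothetical, and the snippet writes the sample data to a temporary file so that it runs on its own. It uses `str.split()` with no argument, which is a change from the loop above, so that repeated spaces and blank lines do not produce empty-string entries:

```python
import os
import tempfile
from collections import Counter

import pandas as pd


def count_words(path):
    """Return a DataFrame of lowercase word counts, most frequent first."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            # split() with no argument splits on any run of whitespace
            counts.update(line.lower().split())
    df = pd.DataFrame(list(counts.items()), columns=["Name", "Occurrence"])
    return df.sort_values("Occurrence", ascending=False, ignore_index=True)


# Write the sample data to a temporary file and count it.
sample = "stack monkey zimbra\nflow zimbra zimbra help Edit Name\nName\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    df = count_words(path)

print(df)
```

With a real file, calling `count_words("test.txt")` directly would replace the temporary-file setup.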