Storing output in a format that can be sorted in Python



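I am learning Python and trying to convert a comma-delimited list stored in a file into a data store that can be sorted in Python, with each row holding one subject and its four scores. For example, if I have the following in the file: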

'apples are great'
,'neg': 0.485, 'neu': 0.392, 'pos': 0.123, 'compound': -0.812,
'crayons are waxy'
,'neg': 0.302, 'neu': 0.698, 'pos': 0.0, 'compound': -0.3818,
'a happy girl'
,'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0,
'a phone is alive' 
,'neg': 0.0, 'neu': 0.737, 'pos': 0.263, 'compound': 0.3612,..........

I would like a DataFrame with the following contents:

Subject           Neg    Neu    Pos    Compound 
apples are great  0.485  0.392  0.123  -0.812 
crayons are waxy  0.302  0.698  0.0    -0.3818 
a happy girl      0.0    1.0    0.0    0.0 
a phone is alive  0.0    0.737  0.263  0.3612 

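My goal is to sort by the Compound column and also to find the frequency of the words in the first column. I feel this should be relatively easy, but when I tried reading the file into a DataFrame, it ended up as a single column with each value on its own row. I also tried turning it into a block of text containing the sentences, but again got the wrong result.

Thanks in advance for any help.

A sample of what I have tried: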

test = open('Heros_toSort.txt')
test2=test.readlines()
df = pd.DataFrame(test2, columns = ['name'])
df.assign(name=df.name.str.split(','))

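Python has a built-in library, csv, that makes it easy to read this kind of data. It could probably also be done with the standard I/O tools, but this is my solution.

In the CSV, the first entry of each row is a name wrapped in quotes, and the second through fifth entries are the values, always in the same order. Each value is a number that follows a label and the pattern ': '. We can strip the quotes from around the name, take the number after the colon, and use the results to build a pandas DataFrame.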

import csv
import pandas as pd

data = []
with open('data.csv', 'r') as in_file:
    c_reader = csv.reader(in_file, delimiter=',')
    for row in c_reader:
        this_row = []
        # from each row, get the name first, stripping the leading and trailing ' single quotes.
        this_row.append(row[0].strip().lstrip("'").rstrip("'"))
        # get the remaining values
        for i in range(1, len(row)):
            # all values appear after a ': ' pattern, in the same order. split on ': '
            # and get the second half of the split - it's the value we're looking for
            this_row.append(row[i].split(": ", 1)[1])
        # add this row to the list
        data.append(this_row)

# make a dataframe out of the csv file
df = pd.DataFrame(data, columns=['Name', 'neg', 'neu', 'pos', 'compound'])
print(df)
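
The loop above assumes each record sits on a single line and leaves the scores as strings. In the sample from the question, the name and its scores are on separate lines, so a regular expression applied to the whole text may be easier. What follows is only a sketch under those assumptions: `NUM`, `RECORD` and `parse_records` are hypothetical names, and the sample text is copied from the question.

```python
import re
import pandas as pd

# Hypothetical pattern: a quoted subject followed by the four labelled scores.
# \s* also matches newlines, so a record may span several lines.
NUM = r"(-?\d+(?:\.\d+)?)"
RECORD = re.compile(
    r"'([^']*)'\s*,\s*'neg':\s*" + NUM
    + r"\s*,\s*'neu':\s*" + NUM
    + r"\s*,\s*'pos':\s*" + NUM
    + r"\s*,\s*'compound':\s*" + NUM
)

def parse_records(text):
    """Return a DataFrame with one row per subject and numeric score columns."""
    rows = [
        (subject.strip(), float(neg), float(neu), float(pos), float(compound))
        for subject, neg, neu, pos, compound in RECORD.findall(text)
    ]
    return pd.DataFrame(rows, columns=["Subject", "Neg", "Neu", "Pos", "Compound"])

sample = """'apples are great'
,'neg': 0.485, 'neu': 0.392, 'pos': 0.123, 'compound': -0.812,
'crayons are waxy'
,'neg': 0.302, 'neu': 0.698, 'pos': 0.0, 'compound': -0.3818,
'a happy girl'
,'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0,
'a phone is alive'
,'neg': 0.0, 'neu': 0.737, 'pos': 0.263, 'compound': 0.3612,
"""

df = parse_records(sample)
# The scores are floats, so this sorts numerically rather than alphabetically.
print(df.sort_values("Compound", ascending=False))
```

Because the scores are converted with `float`, `sort_values` compares numbers instead of text. To read from the real file, the text could be loaded with `open(...).read()` and passed to `parse_records`.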

Demo

import re
import collections
import pandas as pd

# The input is a single line of text
stringinput = ("'apples are great','neg': 0.485, 'neu': 0.392, 'pos': 0.123,"
               + " 'compound': -0.812,'crayons are waxy','neg': 0.302, "
               + "'neu': 0.698, 'pos': 0.0, 'compound': -0.3818,'a happy girl',"
               + "'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0,"
               + "'a phone is alive' ,'neg': 0.0, 'neu': 0.737, 'pos': 0.263,"
               + " 'compound': 0.3612")

# clean up the text: remove the single quotes
remap = {
    ord("'"): ''
}
cleaned_text = stringinput.translate(remap)

# remove the labels in front of the values
f_test = re.sub("neg:|neu:|pos:|compound:", '', cleaned_text)

# break the text into a list, trimming the whitespace around each item
string_to_list = [item.strip() for item in f_test.split(',')]

# create a list of lists with a list comprehension.
# Each inner list contains 5 elements:
# 'Subject', 'Neg', 'Neu', 'Pos', 'Compound'
list_to_df = [string_to_list[i : i + 5]
              for i in range(0, len(string_to_list), 5)]

# generate the pandas dataframe
df = pd.DataFrame(
    list_to_df,
    columns=['Subject', 'Neg', 'Neu', 'Pos', 'Compound']
)

# convert the score columns from strings to floats so they sort numerically
value_cols = ['Neg', 'Neu', 'Pos', 'Compound']
df[value_cols] = df[value_cols].astype(float)

# sort the dataframe by Compound
df_sorted = df.sort_values(['Compound'], ascending=False)

# word frequency
subjects = df_sorted['Subject'].to_list()
freq_dict = collections.defaultdict(int)
for text in subjects:
    for word in text.split():
        freq_dict[word] += 1

for word, count in freq_dict.items():
    print(word, count, sep='\t')
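
As an alternative to the manual `defaultdict` loop, the standard library's `collections.Counter` can do the counting. This is a small sketch, assuming the subjects have already been extracted into a list; the `subjects` list here is copied from the question's sample rather than read from the DataFrame.

```python
from collections import Counter

# Subjects copied from the question's sample data.
subjects = [
    "apples are great",
    "crayons are waxy",
    "a happy girl",
    "a phone is alive",
]

# split() with no argument splits on any run of whitespace and
# ignores leading or trailing spaces, so no empty "words" are counted.
word_counts = Counter(word for text in subjects for word in text.split())

# most_common() yields (word, count) pairs from most to least frequent.
for word, count in word_counts.most_common():
    print(word, count, sep="\t")
```

With a DataFrame, the same list could be obtained from the subject column, for example with `df['Subject'].to_list()`.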
