The CSV looks like this; '|' separates the columns.
2014-09-01 | I love chicken
2014-09-01 | I eat chicken
2014-09-02 | She loves chicken
2014-09-02 | Ha ha ha I love chicken
2014-09-03 | Blah Blah Blah
I want to process the data into this form:
2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |
2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'I', 1 |
2014-09-03 | 'blah', 3 |
DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...
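What approach should I use here? In the end I want to draw a chart with the dates on the x-axis and the word counts (frequencies) on the y-axis.
Here is my best attempt so far.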
import csv
import MeCab

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"

tagger = MeCab.Tagger()
with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # str(row[0:1])[2:12] is the first 10 characters of the first column (the date)
        if str(row[0:1])[2:12] == TestStartDate:
            # str(row[1:2])[2:str(row[1:2]).find('"')-1] is the second column
            rose = tagger.parse(str(row[1:2])[2:str(row[1:2]).find('"')-1])
            #print rose
            wordCount = {}
            # every other whitespace-separated item, skipping the final EOS marker
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print('"%s, %i"' % (word, count))
I plan to add the words and counts to the data.
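This works for me. Do you really need the trailing '|'? If you feed the result to matplotlib or something else and split it on '|' again, you will get an extra empty string at the end.
The code below does not append a '|' to each result line. If you think it is necessary, just append a '|' in the function d, like this: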
return '%s| %s|'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))
================
def d(s):
    # s looks like 'DATE | all the text for that date'
    tokens = s.split('|')
    words = tokens[-1].strip().lower().split(' ')
    return '%s| %s'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))

def wordcount():
    lines=[
        '2014-09-01 | I love chicken',
        '2014-09-01 | I eat chicken',
        '2014-09-02 | She loves chicken',
        '2014-09-02 | Ha ha ha I love chicken',
        '2014-09-03 | Blah Blah Blah'
    ]
    # concatenate the text of all lines that share a date
    rows={}
    for line in lines:
        t_line = line.split(' | ')
        if t_line[0] not in rows:
            rows[t_line[0]]=''
        rows[t_line[0]]+=(' '+t_line[-1])
    newrows=[]
    for k,v in rows.items():
        newrows.append(d('%s | %s'%(k,v)))
    print('\n'.join(newrows))

wordcount()
>>2014-09-02 | 'love',1|'i',1|'she',1|'loves',1|'chicken',2|'ha',3
>>2014-09-03 | 'blah',3
>>2014-09-01 | 'i',2|'chicken',2|'love',1|'eat',1
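To illustrate the point about the trailing '|', here is a minimal sketch of what happens when an output line is split again. The sample lines and the helper name parse_fields are made up for illustration and are not part of the answer's code.

```python
# Hypothetical output lines: one without and one with a trailing '|'.
without_trailing = "2014-09-03 | 'blah',3"
with_trailing = "2014-09-03 | 'blah',3|"

fields_a = without_trailing.split('|')
fields_b = with_trailing.split('|')

# The trailing delimiter produces one extra, empty field at the end.
print(fields_a)
print(fields_b)


def parse_fields(line):
    """Split a line on '|' and drop fields that are empty after stripping."""
    return [field.strip() for field in line.split('|') if field.strip()]


print(parse_fields(with_trailing))
```

If the trailing '|' has to stay in the file, filtering out empty fields when reading, as parse_fields does, is one way to cope with it.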
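Read the input CSV and build a dictionary that maps each date to a Counter. For every row, update the counter for that row's date with the words in the row. Then write rows of the form [date, (word1, count1), (word2, count2), ...]. This example sorts the dates and the words, but the sorting can be omitted for better performance.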
from collections import Counter
import csv

data = {}
with open('my_data.csv') as f:
    for date, words in csv.reader(f, delimiter='|'):
        # strip the spaces around the '|' delimiter from the date
        data.setdefault(date.strip(), Counter()).update(words.split())

with open('my_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    for date in sorted(data):
        counts = data[date]
        # one "'word', count" field per word, in alphabetical order
        writer.writerow([date] + ["'{0}', {1}".format(word, counts[word]) for word in sorted(counts)])
I suggest using Counter for the counting.
import re
from collections import Counter

stats = {}
with open('in.txt', 'r') as fin:
    for line in fin:
        # split on '|' and whitespace, dropping the empty pieces
        tokens = [t for t in re.split(r'[|\s]+', line.strip()) if t]
        if not tokens:
            continue  # skip blank lines
        key = tokens.pop(0)  # the first token is the date
        counter = Counter(tokens)
        if key in stats:
            stats[key] = stats[key] + counter
        else:
            stats[key] = counter

for key, counter in stats.items():
    print('%s | %s |' % (key, '|'.join(['"%s", %s' % (k, v) for k, v in counter.items()])))
Here is a solution that uses defaultdict and Counter from collections.
import csv
from collections import defaultdict
from collections import Counter

# each new date automatically gets an empty Counter
date_words = defaultdict(Counter)
with open('test.csv') as psvfile:
    reader = csv.reader(psvfile, delimiter="|")
    for line in reader:
        date = line[0].strip()
        words = line[1].split()
        date_words[date].update(words)
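Since the end goal is a chart with dates on the x-axis and frequencies on the y-axis, a mapping from date to Counter can be turned into two parallel lists. The following is only a sketch: the in-memory rows stand in for the CSV file, lowercasing the words is an assumption, and frequency_series is a made-up helper name.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for the parsed CSV: (date, text).
rows = [
    ("2014-09-01", "I love chicken"),
    ("2014-09-01", "I eat chicken"),
    ("2014-09-02", "She loves chicken"),
    ("2014-09-02", "Ha ha ha I love chicken"),
    ("2014-09-03", "Blah Blah Blah"),
]

# Build the date -> Counter mapping (words lowercased by assumption).
date_words = defaultdict(Counter)
for date, text in rows:
    date_words[date].update(text.lower().split())


def frequency_series(date_words, word):
    """Return (dates, counts) for one word, with the dates sorted ascending."""
    dates = sorted(date_words)  # ISO dates sort chronologically as strings
    counts = [date_words[d][word] for d in dates]  # a Counter yields 0 if absent
    return dates, counts


dates, counts = frequency_series(date_words, "chicken")
print(dates)
print(counts)
```

The two lists can then be handed to a plotting library, for example matplotlib's plt.plot(dates, counts), with one call per word of interest.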
You might also want to consider the pandas library, which is good at handling dates and plotting.