How to read and group this CSV data



The CSV looks like this. '|' separates the columns.

2014-09-01 | I love chicken
2014-09-01 | I eat chicken
2014-09-02 | She loves chicken
2014-09-02 | Ha ha ha I love chicken
2014-09-03 | Blah Blah Blah

I want to process the data into this form.

2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |
2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'I', 1 |
2014-09-03 | 'blah', 3 |
DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...

What approach should I use here? Ultimately I want to draw a chart with the dates on the x-axis and the word counts (frequency) on the y-axis.

Here is my best attempt so far.

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"
with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        if str(row[0:1])[2:12] == TestStartDate:
            #str(row[1:2])[2:str(row[1:2]).find('"')-1] is the second column
            tagger = MeCab.Tagger()
            rose = tagger.parse(str(row[1:2])[2:str(row[1:2]).find('"')-1])
            #print rose
            wordCount = {}
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print '"%s, %i"' % (word, count)

I plan to add the words and counts to the data.

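This works for me. Do you really need the trailing '|'? If you pass the line to matplotlib or anything else and split it on '|' again, you end up with an extra empty string at the end.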

The code below does not append a '|' to each result line. If you think you need it, append a '|' in the return statement of function d, like this:

return '%s| %s|'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))

================

def d(s):
    tokens = s.split('|')
    words = tokens[-1].strip().lower().split(' ')
    return '%s| %s'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))
def wordcount():
    lines=[
        '2014-09-01 | I love chicken',
        '2014-09-01 | I eat chicken',
        '2014-09-02 | She loves chicken',
        '2014-09-02 | Ha ha ha I love chicken',
        '2014-09-03 | Blah Blah Blah'
    ]
    rows={}
    for line in lines:
        t_line = line.split(' | ')
        if t_line[0] not in rows:
            rows[t_line[0]]=''
        rows[t_line[0]]+=(' '+t_line[-1])
    newrows=[]
    for k,v in rows.items():
        newrows.append(d('%s | %s'%(k,v)))
    print('\n'.join(newrows))

>>2014-09-02 | 'love',1|'i',1|'she',1|'loves',1|'chicken',2|'ha',3
>>2014-09-03 | 'blah',3
>>2014-09-01 | 'i',2|'chicken',2|'love',1|'eat',1

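Read the input CSV and build a dictionary that maps each date to a Counter. For every row, update that date's counter with the words in the row. Then write rows in the format [date, (word1, count1), (word2, count2), ...]. This example sorts the dates and the words, but you can drop the sorting for better performance.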

from collections import Counter
import csv
data = {}
with open('my_data.csv') as f:
    for date, words in csv.reader(f, delimiter='|'):
        data.setdefault(date, Counter()).update(word for word in words.split())
with open('my_counts.csv', 'w') as f:
    writer = csv.writer(f, delimiter='|')
    for date in sorted(data.keys()):
        # one cell per word counted for this date: 'word', count
        writer.writerow([date] + ["'{0}', {1}".format(word, count) for word, count in sorted(data[date].items())])

I suggest using Counter for the counting.

import re
from collections import Counter
stats = {}
with open('in.txt' ,'r') as fin:
    for line in fin:
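        # split on runs of '|' and whitespace so empty tokens and the trailing newline are not counted
        tokens = re.split(r'[|\s]+', line.strip())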
        key = tokens.pop(0)
        counter = Counter()
        for token in tokens:
            counter[token] = counter[token] + 1
        if key in stats:
            stats[key] = stats[key] + counter
        else:
            stats[key] = counter
for key, counter in stats.items():
    print('%s | %s |' % (key, '|'.join(['"%s", %s' % (k, v) for k, v in counter.items()])))

Here is a solution that uses defaultdict and Counter from collections.

import csv
from collections import defaultdict
from collections import Counter

date_words = defaultdict(lambda: Counter())

with open('test.csv') as psvfile:
    reader = csv.reader(psvfile, delimiter="|")
    for line in reader:
        date = line[0]
        words = line[1].split()
        date_words[date].update(Counter(words))
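
The snippet above stops once the counts are collected. As a minimal sketch of the remaining step (turning the per-date counters into the pipe-delimited rows the question asks for), something like the following might work. The helper names `count_words_by_date` and `format_row` are made up for illustration, the sample lines are the ones from the question, and lowercasing the words is an assumption based on the desired output.

```python
from collections import Counter, defaultdict


def count_words_by_date(lines):
    """Map each date to a Counter of the lowercased words seen on that date."""
    date_words = defaultdict(Counter)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # everything before the first '|' is the date, the rest is the text
        date, _, text = line.partition("|")
        date_words[date.strip()].update(text.lower().split())
    return date_words


def format_row(date, counter):
    """Build one output row: DATE | 'word', count | 'word2', count2 ..."""
    cells = ["'{0}', {1}".format(word, count) for word, count in counter.most_common()]
    return " | ".join([date] + cells)


lines = [
    "2014-09-01 | I love chicken",
    "2014-09-01 | I eat chicken",
    "2014-09-02 | She loves chicken",
    "2014-09-02 | Ha ha ha I love chicken",
    "2014-09-03 | Blah Blah Blah",
]

date_words = count_words_by_date(lines)
for date in sorted(date_words):
    print(format_row(date, date_words[date]))
```

`most_common()` puts the most frequent words first in each row; if the order does not matter, `counter.items()` would do as well.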

You might also want to consider the pandas library, which is good at handling dates and plotting.
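
A rough sketch of what the pandas route could look like, assuming pandas is installed. It reads the pipe-delimited data, splits the text into one row per word, and builds a date-by-word count table. The function name `word_table` and the column names `date`, `text` and `word` are invented for this example, and the in-memory sample stands in for the real file.

```python
import io

import pandas as pd


def word_table(source):
    """Return a DataFrame with one row per date and one column per word."""
    df = pd.read_csv(source, sep="|", header=None, names=["date", "text"])
    # the fields are padded with spaces around '|', so strip before parsing the date
    df["date"] = pd.to_datetime(df["date"].str.strip())
    # lowercase, split on whitespace, then give every word its own row
    df["word"] = df["text"].str.lower().str.split()
    df = df.explode("word")
    # count each (date, word) pair and pivot the words into columns
    return df.groupby(["date", "word"]).size().unstack(fill_value=0)


sample = io.StringIO(
    "2014-09-01 | I love chicken\n"
    "2014-09-01 | I eat chicken\n"
    "2014-09-02 | She loves chicken\n"
    "2014-09-02 | Ha ha ha I love chicken\n"
    "2014-09-03 | Blah Blah Blah\n"
)

table = word_table(sample)
print(table)
```

With matplotlib installed, `table.plot(kind="bar")` should then draw the dates on the x-axis and the counts on the y-axis, with one bar per word.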
