CSV append overwrites everything except the column headers



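I'm trying to fetch online reviews (multiple pages), extract the parts of each review (title, user, text, ...) and write that information to a CSV file. Yes, questions like this have been asked many times, but I couldn't find one that solves my problem below: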

First, I create the CSV file and prepare its column headers at the start:

with open('review-raw-data.csv', 'wb') as output:
    fieldnames = ['title', 'text', 'starRating', 'helpfulScore', 'date', 'user', 'id', 'url']
    writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')

That works fine. After that, I try to append the extracted information to the CSV file:

def extract(data):
    with open('review-raw-data.csv', 'ab') as output:
        fieldnames = ['title', 'text', 'starRating', 'helpfulScore', 'date', 'user', 'id', 'url']
        writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, lineterminator='\n', quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
        for review in data:
            # extraction happening...
            reviewobj = Review(title, text, helpfulscore, rating, date, user, reviewid, url)
            writer.writerow({'title': reviewobj.title, 'text': reviewobj.text, 'starRating': reviewobj.rating,
                         'helpfulScore': reviewobj.helpfulscore, 'date': reviewobj.date, 'user': reviewobj.user,
                         'id': reviewobj.reviewid, 'url': reviewobj.url})

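This function is called after each page of reviews has been received. It may not be the smartest or simplest approach, but it works. The problem is that the append part does not work as expected the second, third, ... time this code is called: all rows appended in previous iterations get overwritten.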

Example of what I want (columns separated by ','):

title, user, id
title1, user1, id1
title2, user2, id2
title3, user3, id3

Example of what I get after the second iteration:

title, user, id
title2, user2, id2  # row 1 is missing...

Example of what I get after the third iteration:

title, user, id
title3, user3, id3  # rows 1 & 2 are missing...

What am I doing wrong?

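Without the full code, and without knowing how it is called, it's impossible to tell exactly what goes wrong. But you are obviously running the "create & prepare the column headers" part of your code more than once, because the following code works as expected: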

bruno@bigb:~/Work/playground$ cat appcsv.py
import csv
with open('review-raw-data.csv', 'wb') as output:
    fieldnames = ['a', 'b', 'c']
    writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
    writer.writeheader()

def extract(data):
    with open('review-raw-data.csv', 'ab') as output:
        fieldnames = ['a', 'b', 'c']
        writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
        for row in data:
            writer.writerow(dict(zip(fieldnames, row)))

dataset = [
    [(1, 2, 3), (4, 5, 6)],
    [(5, 6, 7),]
    ]
for data in dataset:
    extract(data)

bruno@bigb:~/Work/playground$ python appcsv.py
bruno@bigb:~/Work/playground$ cat review-raw-data.csv 
"a","b","c"
"1","2","3"
"4","5","6"
"5","6","7"

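To make the cause concrete, here is a minimal sketch of what happens when the header step runs again between appends. It assumes Python 3 (text mode with `newline=''` rather than the `'wb'`/`'ab'` modes used above), and the helper names `write_header`, `append_row` and `read_lines`, as well as the throwaway temp file, are made up for the illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ['a', 'b', 'c']  # same illustrative columns as the transcript above


def write_header(path):
    # 'w' truncates the file: anything written earlier is lost.
    with open(path, 'w', newline='') as output:
        csv.DictWriter(output, fieldnames=FIELDNAMES).writeheader()


def append_row(path, row):
    # 'a' never truncates; it only adds to the end of the file.
    with open(path, 'a', newline='') as output:
        writer = csv.DictWriter(output, fieldnames=FIELDNAMES)
        writer.writerow(dict(zip(FIELDNAMES, row)))


def read_lines(path):
    with open(path, newline='') as f:
        return f.read().splitlines()


path = os.path.join(tempfile.mkdtemp(), 'demo.csv')  # throwaway file

write_header(path)
append_row(path, (1, 2, 3))
after_first = read_lines(path)

write_header(path)  # header step repeated by mistake: the first row disappears
append_row(path, (4, 5, 6))
after_second = read_lines(path)

print(after_first)
print(after_second)
```

The second listing contains only the header and the latest row, which matches the symptom described in the question.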
Now, it's easy to avoid overwriting an existing file: just check whether it exists before opening it:

import os
filename = 'review-raw-data.csv'
flag = "ab" if os.path.exists(filename) else "wb"
with open(filename, flag) as output:
   # etc
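A slightly more defensive variant of the same check, as a sketch only: it assumes Python 3, always opens in append mode (which creates the file if needed), and also treats an empty file as needing a header. The helper name `append_rows` and the shortened column list are assumptions for the example, not part of the original code.

```python
import csv
import os
import tempfile

FIELDNAMES = ['title', 'user', 'id']  # shortened column list, as in the question's example


def append_rows(path, rows):
    """Append dict rows to path, writing the header only for a new or empty file."""
    needs_header = not os.path.isfile(path) or os.path.getsize(path) == 0
    # 'a' creates the file when it is missing and never truncates it.
    with open(path, 'a', newline='') as output:
        writer = csv.DictWriter(output, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL,
                                restval='unknown', extrasaction='ignore')
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), 'reviews.csv')  # throwaway file
append_rows(path, [{'title': 'title1', 'user': 'user1', 'id': 'id1'}])
append_rows(path, [{'title': 'title2', 'user': 'user2', 'id': 'id2'}])

# Read the file back to confirm both rows survived.
with open(path, newline='') as f:
    rows = list(csv.DictReader(f))
```

Because the mode is always `'a'`, there is no code path left that can truncate the file.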

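As a side note: you have quite a lot of duplicated code (the fieldnames definition, opening the file and creating the DictWriter). You should factor this out into a function, and/or do it only once and pass the writer to extract: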

import csv
import os

def get_writer(outfile):
    fieldnames = [
        # etc
    ]
    writer = csv.DictWriter(outfile, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
    return writer  # without this, main() would get None

def extract(data, writer):
    for review in data:
        # extraction happening...
        reviewobj = Review(title, text, helpfulscore, rating, date, user, reviewid, url)
        writer.writerow({
            'title': reviewobj.title, 'text': reviewobj.text,
            'starRating': reviewobj.rating,
            'helpfulScore': reviewobj.helpfulscore,
            'date': reviewobj.date, 'user': reviewobj.user,
            'id': reviewobj.reviewid, 'url': reviewobj.url
        })

def main():
    filename = 'review-raw-data.csv'
    exists = os.path.exists(filename)
    # binary modes, as the csv module expects on Python 2
    flag = "ab" if exists else "wb"
    with open(filename, flag) as outfile:
        writer = get_writer(outfile)
        # write the header only when the file is newly created
        if not exists:
            writer.writeheader()
        for data in whereever_you_get_your_data_from():
            extract(data, writer)
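For readers on Python 3, the same structure might look roughly like the sketch below. The `Review` namedtuple, the `fetch_pages` generator and all the values it yields are hypothetical stand-ins for the question's real class and scraper. The file is opened in text mode with `newline=''`, which is what the `csv` module expects on Python 3.

```python
import csv
import os
import tempfile
from collections import namedtuple

FIELDNAMES = ['title', 'text', 'starRating', 'helpfulScore', 'date', 'user', 'id', 'url']

# Hypothetical stand-in for the question's Review class.
Review = namedtuple('Review', 'title text helpfulscore rating date user reviewid url')


def get_writer(outfile):
    return csv.DictWriter(outfile, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL,
                          restval='unknown', extrasaction='ignore')


def extract(data, writer):
    for review in data:
        writer.writerow({
            'title': review.title, 'text': review.text,
            'starRating': review.rating, 'helpfulScore': review.helpfulscore,
            'date': review.date, 'user': review.user,
            'id': review.reviewid, 'url': review.url,
        })


def fetch_pages():
    # Hypothetical data source: yields one list of dummy reviews per page.
    for n in (1, 2, 3):
        yield [Review('title%d' % n, 'text', 0, 5, '2016-01-01',
                      'user%d' % n, 'id%d' % n, 'url')]


def main(filename):
    exists = os.path.exists(filename)
    with open(filename, 'a' if exists else 'w', newline='') as outfile:
        writer = get_writer(outfile)
        if not exists:
            writer.writeheader()  # header goes in once, when the file is created
        for data in fetch_pages():
            extract(data, writer)


filename = os.path.join(tempfile.mkdtemp(), 'review-raw-data.csv')  # throwaway file
main(filename)
main(filename)  # a second run appends instead of overwriting

with open(filename, newline='') as f:
    titles = [row['title'] for row in csv.DictReader(f)]
```

The file is opened once per run and the writer is shared across all pages, so no page can clobber what an earlier page wrote.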
