将两个CSV文件合并到JSON中



我想组合两个CSV文件并将其转换为JSON。困难的是,两个文件中的同一列在行中不是唯一的,如下所示:

gpo_full.csv:
Date           hearing_sub_type   Old_ID   witnesses
January,1997   Oversight          22       Babbitt
June,2000      General            21       Blitzer
January,1997   General            23       Alldridge
April,2001     Oversight          25       Michael 
June,2000      Oversight          24       Bruce
CAP_cols.csv:
majortopic   id     Chamber   subtopic   Date
21           79846  1         2103       January,1997
4            79847  2         705        June,2000
13           79848  1         1802       May,2001
7            79849  2         201        June,2000
21           79850  1         401        January,1997

我想要的输出应该是这样的:

[{
"Date": "January,1997",
"GPO": [{
"hearing_sub_type": "Oversight",
"Old_ID": "22",
"witnesses": "Babbitt"
}, {
"hearing_sub_type": "General",
"Old_ID": "23",
"witnesses": "Alldridge"
}]
"CAP": [{
"majortopic": "21",
"id": "79846",
"Chamber": "1"
"subtopic": "2103"
}, {
"majortopic": "21",
"id": "79850",
"Chamber": "1"
"subtopic": "401"
}]
},
and similar for others]

如果G.csv中没有像2001年4月和C.csv中2001年5月这样的匹配数据,格式将相同,但值不是这样的:

[{
"Date": "April,2001",
"GPO": [{
"hearing_sub_type": None,
"Old_ID": None,
"witnesses": None
}]
"CAP": [{
"majortopic": None,
"id": None,
"Chamber": None, 
"subtopic": None
}]
},
and similar for others]

我写的代码是

GPO = open("gpo_full.csv", "r")
CAP = open("CAP_cols.csv", "r")
jsonfile = open('datamerge.json', 'w')
gporeader = csv.DictReader(GPO)
capreader = csv.DictReader(CAP)
output = []
for gporow in gporeader:
output.append(gporow["Date"])
gporow["CAP"] = []
gporow["GPO"] = []
if gporow["Date"] in output:
gporow["GPO"].append(gporow)
CAP.seek(0)
for caprow in capreader:
if (gporow["Date"] == caprow["Date"]):
gporow["CAP"].append(caprow)
print(output)
json.dump(output, jsonfile, sort_keys=True)

它不起作用。更具体地说,我不知道如何提取日期作为关键字并在GPO和CAP中删除它们
我真的很感激martineau为我描绘了一幅完整的画面!

猜测你所说的"它不起作用";通过检查您的代码和所需的输出,我想我已经能够编写一些这样的代码。

首先,这里是我使用的数据,其中»右截断字符用于表示制表符't'字符——换句话说,您有制表符分隔的CSV文件,在使用csv.DictReader读取这些文件时必须指定这些文件,因为默认情况下,值之间有逗号。

输入文件

gpo_full.csv:

Date»hearing_sub_type»Old_ID»witnesses
January,1997»Oversight»22»Babbitt
June,2000»General»21»Blitzer
January,1997»General»23»Alldridge
April,2001»Oversight»25»Michael
June,2000»Oversight»24»Bruce

CAP_cols.csv:

majortopic»id»Chamber»subtopic»Date
21»79846»1»2103»January,1997
4»79847»2»705»June,2000
13»79848»1»1802»May,2001
7»79849»2»201»June,2000
21»79850»1»401»January,1997

代码:

所做的最重要的更改是使output成为一个字典子类,其关键字是日期(请参阅下面关于日期的注意事项(,每个关键字都与嵌套字典相关联,该字典包含自己的一对"CAP""GPO"列表。这样做可以很容易地检测以前是否遇到过日期,并初始化它的嵌套字典——它可以在其他CSV文件中搜索匹配的日期(使其自动成为定义和使用字典子类的原因(。

我认为值得一提的是"重绕";通过CCD_ 9重新读取其中的数据行的CSV文件不起作用。它们必须关闭、重新打开,并重新创建csv.DictReader。这至少部分是因为——在这种情况下——文件的开头有一个特殊的头行,用于定义其字段的名称。

我想提到的一个警告是,由于日期字段在";月、年";无法用于按时间顺序排序的格式,因为它们将以字符串的形式进行字典比较,而不是以数字进行比较——即June,2021将被认为在December,2001之前出现——这是我禁用排序的主要原因。这是可以解决的,但这不是你问题的主要主题。

import csv
import json
from pprint import pprint

class OutputDict(dict):
""" Dictionary subclass to create initial entry for each date. """
def __init__(self, cap_filename):
super().__init__()  # Base class initialization.
self.cap_filename = cap_filename
def __missing__(self, date):
""" Initialize entry for a date when it's first encountered. """
self[date] = {'Date': date, 'GPO': [], 'CAP': []}
# Scan cap csv file for matching dates - occurs once per unique date added.
with open(self.cap_filename) as cap_file:  # Add matches from other CSV file.
cap_reader = csv.DictReader(cap_file, delimiter='t')
for cap_row in cap_reader:
if date == cap_row['Date']:
del cap_row['Date']
self[date]['CAP'].append(cap_row)
# If there weren't any matching dates, create and add an empty row.
if not self[date]['CAP']:  # No matching CAP dates?
cap_row = {field: None for field in cap_reader.fieldnames if field != 'Date'}
self[date]['CAP'].append(cap_row)
return self[date]

# Add all the dates in the gpo_full.csv file to an output dictionary.
output = OutputDict('CAP_cols.csv')  # Automatically add matching dates in CAP_cols file.
with open('gpo_full.csv') as gpo_file:
gpo_reader = csv.DictReader(gpo_file, delimiter='t')
for gpo_row in gpo_reader:
date = gpo_row.pop('Date')
output[date]['GPO'].append(gpo_row)
# Add any dates in CAP_cols.csv that weren't in gpo_full.csv file.
with open('CAP_cols.csv') as cap_file:
cap_reader = csv.DictReader(cap_file, delimiter='t')
for cap_row in cap_reader:
date = cap_row.pop('Date')
if date not in output:  # Create entry for it with a null gpo row.
gpo_row = {field: None for field in gpo_reader.fieldnames if field != 'Date'}
output[date]['GPO'].append(gpo_row)  # Automatically adds cap_row.
output = list(output.values())  # Only want a list of the values.
#pprint(output, sort_dicts=0)  # Show what's going into JSON file.
with open('datamerge.json', 'w') as json_file:
json.dump(output, json_file, indent=4, sort_keys=False)
print('-Fini-')

结果

以下是它在datamerge.json文件中的内容:

[
{
"Date": "January,1997",
"GPO": [
{
"hearing_sub_type": "Oversight",
"Old_ID": "22",
"witnesses": "Babbitt"
},
{
"hearing_sub_type": "General",
"Old_ID": "23",
"witnesses": "Alldridge"
}
],
"CAP": [
{
"majortopic": "21",
"id": "79846",
"Chamber": "1",
"subtopic": "2103"
},
{
"majortopic": "21",
"id": "79850",
"Chamber": "1",
"subtopic": "401"
}
]
},
{
"Date": "June,2000",
"GPO": [
{
"hearing_sub_type": "General",
"Old_ID": "21",
"witnesses": "Blitzer"
},
{
"hearing_sub_type": "Oversight",
"Old_ID": "24",
"witnesses": "Bruce"
}
],
"CAP": [
{
"majortopic": "4",
"id": "79847",
"Chamber": "2",
"subtopic": "705"
},
{
"majortopic": "7",
"id": "79849",
"Chamber": "2",
"subtopic": "201"
}
]
},
{
"Date": "April,2001",
"GPO": [
{
"hearing_sub_type": "Oversight",
"Old_ID": "25",
"witnesses": "Michael"
}
],
"CAP": [
{
"majortopic": null,
"id": null,
"Chamber": null,
"subtopic": null
}
]
},
{
"Date": "May,2001",
"GPO": [
{
"hearing_sub_type": null,
"Old_ID": null,
"witnesses": null
}
],
"CAP": [
{
"majortopic": "13",
"id": "79848",
"Chamber": "1",
"subtopic": "1802"
}
]
}
]

相关内容

  • 没有找到相关文章

最新更新