在python中分析CSV,其中一列表示层次结构的父/子结构



我不确定解析CSV数据的最佳方式,其中第一列用于表示层次结构。

以下是CSV数据的示例。

Symbol,SecurityType,Quantity,Mid,Strike,Expiration
STONK1 (2 options),--,1,21.115,--,--
STONK1 Debit Spread1,Opt Adv,1,7.255,15.00/23.00,01/20/2023 / 08/20/2021
STONK1 01/20/2023 15.00 C,Call,1,8.05,15,1/20/2023
STONK1 08/20/2021 23.00 C,Call,-1,0.795,23,8/20/2021
STONK2 (2 options),--,1,64.66,--,--
STONK2 Debit Spread1,Opt Adv,1,23.05,70.00/50.00,08/20/2021 / 01/20/2023
STONK2 01/20/2023 50.00 C,Call,1,27.05,50,1/20/2023
STONK2 08/20/2021 70.00 C,Call,-1,4,70,8/20/2021

第2行是第3、4和5行的父记录。第3行是第4行和第5行的父记录。";符号";列可用于评估父/子关系。

我如何将这些数据解析为一种可以循环遍历父符号,然后循环遍历子元素的格式?

import csv
import io
import re
csv_content = """
Symbol,SecurityType,Quantity,Mid,Strike,Expiration
STONK1 (2 options),--,1,21.115,--,--
STONK1 Debit Spread1,Opt Adv,1,7.255,15.00/23.00,01/20/2023 / 08/20/2021
STONK1 01/20/2023 15.00 C,Call,1,8.05,15,1/20/2023
STONK1 08/20/2021 23.00 C,Call,-1,0.795,23,8/20/2021
STONK2 (2 options),--,1,64.66,--,--
STONK2 Debit Spread1,Opt Adv,1,23.05,70.00/50.00,08/20/2021 / 01/20/2023
STONK2 01/20/2023 50.00 C,Call,1,27.05,50,1/20/2023
STONK2 08/20/2021 70.00 C,Call,-1,4,70,8/20/2021
"""
parents = {}  # using a dict for convenience, more structured data may be preferable
csv_file = io.StringIO(initial_value=csv_content)  # you can use a `open()` file instead
reader = csv.reader(csv_file)
parent_regex = re.compile(r"([A-Z0-9]+) ((d+) options)")
child_regex = re.compile(r"([A-Z0-9]+) (d{2}/d{2}/d{4}) (d+.d+) (w)")
for line in reader:
if line[0].startswith("Symbol"):
continue  # skip the first line
parent_match = parent_regex.match(line[0])
child_match = child_regex.match(line[0])
if parent_match:  # can use the walrus operator here in Python 3.8+
parent_name = parent_match.group(1)
parents[parent_name] = {"children": []}
elif child_match:
child_name = child_match.group(1)
child_date = child_match.group(2)
child_value = child_match.group(3)
child_letter = child_match.group(4)
parents[child_name]["children"].append({"date": child_date, "value": child_value, "letter": child_letter})
else:
name = line[0].split(" ")[0]
parents[name]["dates"] = line[5]
print(parents)
{'STONK1': {'children': [{'date': '01/20/2023', 'value': '15.00', 'letter': 'C'},
{'date': '08/20/2021', 'value': '23.00', 'letter': 'C'}],
'dates': '01/20/2023 / 08/20/2021'},
'STONK2': {'children': [{'date': '01/20/2023', 'value': '50.00', 'letter': 'C'},
{'date': '08/20/2021', 'value': '70.00', 'letter': 'C'}],
'dates': '08/20/2021 / 01/20/2023'}
}

我使用正则表达式来确定它是哪一行。这很简单,但如果你不相信你的输入文件是正确的,你可能更喜欢使用状态机来知道每一行的期望值并进行验证
此解决方案对行排序更宽容。

当列表中有要分组处理的连续记录时,可以使用itertools.groupby。在您的情况下,您需要对groupby进行嵌套调用,因为您有嵌套的组。实际的CSV解析可以使用普通的csv模块——它是对csv.DictReader返回的dicts进行迭代的方式,将确定组。itertools.groupby将对dict进行低级迭代,根据用于确定组的关键函数将dict分组。

import csv
import itertools
from io import StringIO
data = """
Symbol,SecurityType,Quantity,Mid,Strike,Expiration
STONK1 (2 options),--,1,21.115,--,--
STONK1 Debit Spread1,Opt Adv,1,7.255,15.00/23.00,01/20/2023 / 08/20/2021
STONK1 01/20/2023 15.00 C,Call,1,8.05,15,1/20/2023
STONK1 08/20/2021 23.00 C,Call,-1,0.795,23,8/20/2021
STONK2 (2 options),--,1,64.66,--,--
STONK2 Debit Spread1,Opt Adv,1,23.05,70.00/50.00,08/20/2021 / 01/20/2023
STONK2 01/20/2023 50.00 C,Call,1,27.05,50,1/20/2023
STONK2 08/20/2021 70.00 C,Call,-1,4,70,8/20/2021
STONK2 (1 option),--,1,64.66,--,--
STONK2 Debit Spread1,Opt Adv,1,23.05,70.00/50.00,08/20/2021 / 01/20/2023
STONK2 01/20/2024 50.00 C,Call,1,27.05,50,1/20/2023
"""
# normally this would be
#   reader = csv.DictReader(open("some_file.csv"))
# but I'm using a StringIO so I don't need to have a separate file for this example
reader = csv.DictReader(StringIO(data))

# group all records that have the same starting word in Symbol
def symbol_grouping_fn(rec):
return rec["Symbol"].split()[0]

# group all records within a symbol group with a leading '--' SecurityType
# this is an example of what I call "Fuzzy Groupby", which I describe in 
# this blog post: https://thingspython.wordpress.com/2020/11/11/fuzzy-groupby-unusual-restaurant-part-ii/
cur_option = object()
def option_grouping_fn(rec):
global cur_option
if rec["SecurityType"] == "--":
# start new group by assigning new object to 'cur_option'
cur_option = object()
return cur_option

# groupby will give you groups of records
# use a nested groupby to get records grouped by symbol, and then 
# grouped by option subgroup within those records
for symbol, recs_for_symbol in itertools.groupby(reader, symbol_grouping_fn):
# now look for groups of options, looking for '--' as SecurityType
for _, recs_for_option in itertools.groupby(recs_for_symbol, option_grouping_fn):
header = next(recs_for_option)
print(header)
subheader = next(recs_for_option)
print("-", subheader)
for rec in recs_for_option:
print("- -", rec)

打印:

{'Symbol': 'STONK1 (2 options)', 'SecurityType': '--', 'Quantity': '1', 'Mid': '21.115', 'Strike': '--', 'Expiration': '--'}
- {'Symbol': 'STONK1 Debit Spread1', 'SecurityType': 'Opt Adv', 'Quantity': '1', 'Mid': '7.255', 'Strike': '15.00/23.00', 'Expiration': '01/20/2023 / 08/20/2021'}
- - {'Symbol': 'STONK1 01/20/2023 15.00 C', 'SecurityType': 'Call', 'Quantity': '1', 'Mid': '8.05', 'Strike': '15', 'Expiration': '1/20/2023'}
- - {'Symbol': 'STONK1 08/20/2021 23.00 C', 'SecurityType': 'Call', 'Quantity': '-1', 'Mid': '0.795', 'Strike': '23', 'Expiration': '8/20/2021'}
{'Symbol': 'STONK2 (2 options)', 'SecurityType': '--', 'Quantity': '1', 'Mid': '64.66', 'Strike': '--', 'Expiration': '--'}
- {'Symbol': 'STONK2 Debit Spread1', 'SecurityType': 'Opt Adv', 'Quantity': '1', 'Mid': '23.05', 'Strike': '70.00/50.00', 'Expiration': '08/20/2021 / 01/20/2023'}
- - {'Symbol': 'STONK2 01/20/2023 50.00 C', 'SecurityType': 'Call', 'Quantity': '1', 'Mid': '27.05', 'Strike': '50', 'Expiration': '1/20/2023'}
- - {'Symbol': 'STONK2 08/20/2021 70.00 C', 'SecurityType': 'Call', 'Quantity': '-1', 'Mid': '4', 'Strike': '70', 'Expiration': '8/20/2021'}
{'Symbol': 'STONK2 (1 option)', 'SecurityType': '--', 'Quantity': '1', 'Mid': '64.66', 'Strike': '--', 'Expiration': '--'}
- {'Symbol': 'STONK2 Debit Spread1', 'SecurityType': 'Opt Adv', 'Quantity': '1', 'Mid': '23.05', 'Strike': '70.00/50.00', 'Expiration': '08/20/2021 / 01/20/2023'}
- - {'Symbol': 'STONK2 01/20/2024 50.00 C', 'SecurityType': 'Call', 'Quantity': '1', 'Mid': '27.05', 'Strike': '50', 'Expiration': '1/20/2023'}

最新更新