Regex and Python to split a line, but keep the first identifier on every new line

My input CSV looks like this (two columns, separated by "|"):

group_1|{""id"":1,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]},{""id"":2,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_2|{""id"":5,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}

I would like to get this output:

group_1|{""id"":1,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_1|{""id"":2,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_2|{""id"":5,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}

My usual solution, and why it doesn't work now:

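Previously I would use Notepad++ with a regex: search for ({""id"") and replace with |\r\n\1. After that I would import the file into Excel and fill in "group_x" for every empty cell in that column, to get the output shown above. My problem is that I now have a huge file (several GB), and this approach just freezes my computer. I'm also fairly sure Excel can't even handle that many rows (several million).

So I'm stuck, and I hope someone can point me in the right direction. Maybe a Python script combined with regex? That would be especially useful because I have a basic understanding of those tools, and there are a few more regex transformations I need to run after this first big one, which could then be merged into the same script. I would appreciate any kind of help. Thanks in advance.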

Try this:

import re

# Open the input and the output file; both are streamed line by line,
# so the whole file never has to fit in memory
with open("in.csv", "r") as f, open("out.csv", "w") as outfile:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split the line into the group id and the data column
        _id, data = line.split('|', 1)

        # Put a newline before every {""id"" and split on it,
        # which gives one record per list element
        data = [r.strip(",") for r in re.sub(r'\{""id""', '\n{""id""', data).split("\n") if r]
        # Write one output line per record, repeating the group id.
        # Plain writes keep the "|" separator and leave the quotes untouched
        for d in data:
            outfile.write(f"{_id}|{d}\n")

I hope the comments are enough to understand the code.
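
A variation on the same idea, as a minimal sketch: assuming every record starts with {""id"" and records are separated by a single comma, re.split with a lookahead can cut the data column directly, without inserting temporary newlines. The names split_records and explode_file, the file paths, and the UTF-8 encoding are assumptions made for illustration.

```python
import re

# Assumption: a comma followed by {""id"" marks the boundary between records.
# The lookahead (?=...) leaves {""id"" in the next record instead of consuming it.
RECORD_BOUNDARY = re.compile(r',(?=\{""id"")')


def split_records(line):
    """Yield one 'group|record' string per record found on an input line."""
    group, sep, data = line.rstrip("\r\n").partition("|")
    if not sep:
        return  # blank or malformed line: nothing to emit
    for record in RECORD_BOUNDARY.split(data):
        yield f"{group}|{record}"


def explode_file(src_path, dst_path):
    """Stream src_path to dst_path line by line so memory use stays small."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            for out_line in split_records(line):
                dst.write(out_line + "\n")
```

Calling explode_file("in.csv", "out.csv") would then process the file one line at a time. Any follow-up regex transformations could be applied to each out_line before it is written.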

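Hmm... reading line by line may sound like a non-developer's way of solving this problem. However, I think it is the intuitive way to handle it.

First, read the file one line at a time. You can separate the column name ("group_x") from the records with "|". Next, split the records at the comma between them (the code below splits on "]},"), which gives you the individual data records you want.

This approach can raise errors such as "list index out of range", so wrap the processing in a try/except for IndexError.

Finally, the code below gets you the data you want. It may not be the exact result you are after, but I wanted to share how I would approach it. Thanks.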

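with open("file.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split("|")
        try:
            records = parts[1].split("]},")
        except IndexError:
            continue  # no "|" on this line (e.g. a blank line)
        for n, record in enumerate(records):
            # split() drops the "]}" that closes every record but the last; put it back
            closing = "]}" if n < len(records) - 1 else ""
            print(parts[0], record + closing, sep="|")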
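
If you would rather not rely on catching IndexError, str.partition is one alternative: it always returns three parts, and the middle one is empty when the line contains no "|". The sketch below assumes, as in the code above, that records are separated by "]},". The helper name explode_line is hypothetical.

```python
def explode_line(line):
    """Return a list of 'group|record' strings for one input line."""
    group, sep, data = line.rstrip("\r\n").partition("|")
    if not sep:
        # No "|" on this line (blank or malformed): nothing to emit.
        return []
    pieces = data.split("]},")
    rows = []
    for n, piece in enumerate(pieces):
        # split() consumed the "]}" that closed every record except the last.
        if n < len(pieces) - 1:
            piece += "]}"
        rows.append(f"{group}|{piece}")
    return rows
```

Because the function returns a list instead of printing, the rows can be written to a file or transformed further by the caller.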

If you know the exact structure of the string, I suggest using split on it.

import csv

# list that collects the whole result
full_table = []

# read your csv with delimiter="|"
with open('your_data.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='|')

    # cut every row just before the next "id" and take the first element
    # (note: this keeps only the first record of each group)
    for row in reader:
        col_1 = row[0]
        col_2 = row[1].split(',{""id""')[0]

        # build row
        row = [col_1, col_2]

        # add this row to the whole table. Here you can also modify the extracted columns.
        full_table.append(row)

# create a new file and write full_table into it
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerows(full_table)

If your dataset is large enough, take a look at pandas.

PS: output.csv will be:

group_1|"{""""id"""":1,""""name"""":""""name_here"""":[{""""field_1"""":""""xx"""",""""value"""":""""""""},{""""field_2"""":...}]}"
group_2|"{""""id"""":5,""""name"""":""""name_here"""":[{""""field_1"""":""""xx"""",""""value"""":""""""""},{""""field_2"""":...}]}"
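
The doubled quotes in this output come from csv.writer: with its default settings it wraps any field that contains a quote character in quotes and doubles every quote inside it. A small sketch of that effect, using an in-memory buffer and a shortened, made-up record. Joining the fields by hand is shown as one possible way to leave them unchanged:

```python
import csv
import io

record = '{""id"":1}'  # shortened, made-up record for illustration

# csv.writer quotes the field and doubles each embedded quote character
buf = io.StringIO()
csv.writer(buf, delimiter="|").writerow(["group_1", record])
quoted_line = buf.getvalue()

# Joining by hand leaves the record exactly as it was read
plain_line = "|".join(["group_1", record]) + "\n"

print(quoted_line, end="")
print(plain_line, end="")
```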
