Regex and Python to split a line, but keep the first identifier on every new line



My input CSV looks like this (two columns, separated by "|"):

group_1|{""id"":1,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]},{""id"":2,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_2|{""id"":5,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}

I want to get output like this:

group_1|{""id"":1,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_1|{""id"":2,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}
group_2|{""id"":5,""name"":""name_here"":[{""field_1"":""xx"",""value"":""""},{""field_2"":...}]}

My usual solution, and why it doesn't work now:

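Previously I would use Notepad++ with a regex: search for ({""id"") and replace with |\r\n\1. After that, I would import the file into Excel and fill in "group_x" for every empty cell in that column.

My problem is that I now have a huge file (a few GB), and this method just freezes my PC. I'm sure Excel can't even handle that many rows (a few million). So I'm asking here in the hope that someone can point me in the right direction. Maybe a Python script combined with regex? That would be especially useful because I have a basic understanding of these tools, and several more regex transformations are needed after this first big one, which could then be merged into the same script. I would appreciate any kind of help. Thanks in advance.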

Try this:

import csv
import re

# Open the input file and the output CSV together so both are closed on exit
with open("in.csv", "r") as f, open("out.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    # Read line by line, for large file processing
    for line in f:
        # Split the line into the group id and the record data
        _id, data = line.strip().split("|", 1)

        # Put a newline before every {""id"" and split on it to get one record per item
        data = [r.strip(",") for r in re.sub(r'\{""id""', '\n{""id""', data).split("\n") if r]
        # Create one new row per record, each carrying the same id
        new_rows = [[_id, d] for d in data]
        # Write the new rows to the output CSV
        writer.writerows(new_rows)

I hope the comments are enough to understand the code.
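If the output has to match the pipe-delimited layout from the question exactly, one option is to skip csv.writer and write each line by hand. The following is only a minimal sketch: it assumes every record starts with {""id"", that records never contain a newline or a "|", and the names split_groups and RECORD_START are made up for illustration.

```python
import io
import re

RECORD_START = '{""id""'  # assumed marker that begins every record


def split_groups(src, dst):
    """Copy src to dst, writing one 'group|record' line per record."""
    for line in src:
        line = line.rstrip("\r\n")
        if "|" not in line:
            continue  # skip blank or malformed lines
        group, data = line.split("|", 1)
        # Same idea as the answer: put a newline before each record start, then split on it
        marked = re.sub(re.escape(RECORD_START), "\n" + RECORD_START, data)
        for record in marked.split("\n"):
            record = record.strip(",")
            if record:
                dst.write(f"{group}|{record}\n")


# Small in-memory demonstration; for real files pass open(...) handles instead
sample = io.StringIO(
    'group_1|{""id"":1,""name"":""a""},{""id"":2,""name"":""b""}\n'
    'group_2|{""id"":5,""name"":""c""}\n'
)
result = io.StringIO()
split_groups(sample, result)
print(result.getvalue(), end="")
```

Because the function only ever holds one input line at a time, it should cope with a multi-GB file as long as no single line is itself too large for memory. Any follow-up regex transformations could be applied to record just before the write.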

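Hmm.. reading line by line may sound like a non-developer way to solve this. Still, I think it is the intuitive way to handle it.

First, try reading the file one line at a time. You can separate the column name ("group_x") from the records with "|". Next, you can split your records on the comma between them, which gives you the plain data records you want.

This approach can raise errors such as "list index out of range", so wrap the processing in a try/except for IndexError.

Finally, you can get the data you want with the code below. It may not be the exact result you are after, but I wanted to show you how I see the problem. Thanks.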

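with open("file.txt") as f:
    for line in f:
        try:
            # parts[0] is the group name, parts[1] holds the records
            parts = line.rstrip("\n").split("|")
            # Note: splitting on "]}," consumes it, so every record
            # except the last one loses its closing "]}"
            for record in parts[1].split("]},"):
                print(parts[0], "|", record)
        except IndexError:
            # Lines without a "|" have no second column; skip them
            continue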
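Splitting on "]}," removes those characters from the records. A possible alternative, sketched below, is re.split with a lookahead so that only the comma between two records is consumed. It assumes a new record always begins with {""id"" directly after that comma and that no nested object starts the same way; split_records and RECORD_BOUNDARY are illustrative names, not part of the answer above.

```python
import re

# Match only the comma that sits directly before a record start.
# The lookahead (?=...) checks for {""id"" without consuming it.
RECORD_BOUNDARY = re.compile(r',(?=\{""id"")')


def split_records(line):
    """Return (group, [records]) for one 'group|data' line."""
    group, data = line.rstrip("\r\n").split("|", 1)
    return group, RECORD_BOUNDARY.split(data)


group, records = split_records(
    'group_1|{""id"":1,""x"":[{""f"":""""}]},{""id"":2,""x"":[{""f"":""""}]}\n'
)
for record in records:
    print(f"{group}|{record}")
```

Each record keeps its closing "]}" this way, and the group name is repeated in front of every record.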

If you know the exact structure of your string, I recommend using split on the string.

import csv

# Accumulates every output row (note: this keeps the whole result in memory)
full_table = []

# Read your CSV with delimiter="|"
with open('your_data.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='|')

    # Split each row just before the next "id" and keep only the first element
    for row in reader:
        col_1 = row[0]
        col_2 = row[1].split(',{""id""')[0]

        # Build the row
        row = [col_1, col_2]

        # Add this row to the table. You can also modify the extracted columns here.
        full_table.append(row)

# Create a new file and write full_table into it
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerows(full_table)

If your dataset is large enough, take a look at pandas.

PS: output.csv will be:

group_1|"{""""id"""":1,""""name"""":""""name_here"""":[{""""field_1"""":""""xx"""",""""value"""":""""""""},{""""field_2"""":...}]}"
group_2|"{""""id"""":5,""""name"""":""""name_here"""":[{""""field_1"""":""""xx"""",""""value"""":""""""""},{""""field_2"""":...}]}"
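The quadrupled quotes in that output appear to come from csv.writer's default quoting: a field that contains the quote character is wrapped in quotes and every quote inside it is doubled. A small sketch of that behaviour, using a shortened, made-up record, next to a hand-written line that leaves the text untouched:

```python
import csv
import io

field = '{""id"":1}'  # shortened record text, as it would appear in the input file

# Default quoting: the field contains quote characters, so csv.writer wraps it
# in quotes and doubles every quote inside it
quoted = io.StringIO()
csv.writer(quoted, delimiter="|").writerow(["group_1", field])
print(quoted.getvalue(), end="")

# Writing the line by hand leaves the text exactly as it was
plain = io.StringIO()
plain.write(f"group_1|{field}\n")
print(plain.getvalue(), end="")
```

If the doubled quotes are not wanted, writing the lines directly, as in the second half, is probably the simplest way to keep the records byte-for-byte identical to the input.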
