Python从大型文本文件中读取完整行的块(列值拆分为多行)



我想分块读取一个大的.txt文件(c.2.5GB(,然后在加载到数据库之前执行一些操作。

该文件只有2列(列分隔符为¬(,并使用double quotes限定。第二列中的值可以跨越多行(示例如下(。我曾想过使用这个答案,但问题是它可能会处理不完整的行,因为它取决于预设的chunk size。有人能帮忙吗?我在下面包含了示例数据和代码。

样本数据(Sample_load_file.txt(

"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
*******************************************************************
This line also includes the column delimiter within text qualifier
*******************************************************************
# !¬!¬!¬|
"

代码

import pandas as pd
import os
from dbconnection import DBConnection
path = r'C:Sample_load_file.txt'
db = DBConnection(server ='XXXX', database='XXXX')
def read_in_chunks(file_object, chunk_size=1024):
#Lazy load to read a file piece by piece (avoiding moemory issues)
#Default chunk size: 1k.
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data

def process_chunk(data=piece):
#Build a list of lines based on ' "n" ' as custom separator
data = data.split('"n"')

#Split each line based on ' "¬" ' as custom separator
data = [line.split('"¬"') for line in data]

#Cleanup remaining double quotes
data = [[e.replace('"', '') for e in line] for line in data]

#Check the number of columns
number_of_cols = len(str(data[0]).split('¬'))
number_of_cols

#Load data into a dataframe
df = pd.DataFrame(data)

#Reformat dataframe
df.columns = df.iloc[0] # Set first row as column index
df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index

#Split the first column into two
try:
df[['LINE_ID', 'LINE_TEXT']] = df['LINE_ID¬LINE_TEXT'].str.split('¬',expand=True)
except:
print('Error')
del df['LINE_ID¬LINE_TEXT']

#Add metadata
df['loaded_by'] = 'XXXX'
df['file_line_number'] = range(2,len(df)+2)
df['load_date'] = pd.datetime.now()
df['source_file'] = path
df['loading_script'] = r'Load_Extracts.ipynb'    

#Load in SQL db
df.to_sql('SQL_table_name', db.engine, schema='dbo', index=False, if_exists='append')

#Load text file
with open(path) as f:
for piece in read_in_chunks(f):
process_data(piece)

如果LINE_ID适合一行,您可以尝试使用生成器,利用多行记录的第一行包含"¬":

def make_records(file):
current = []
for line in file:
line = line.rstrip()
if '"¬"' in line:
if current:
yield " ".join(current)
current = [line]
else:
current.append(line)
yield " ".join(current)

示例输入:

>>> import io
>>> 
>>> s = '''"LINE_ID"¬"LINE_TEXT"
... "C1111-G00-BC222"¬"this line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
... "
... "C22-f0-333"¬"2nd row. This line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
...   *******************************************************************
...   This line also includes the column delimiter within text qualifier
...   *******************************************************************
...   # !¬!¬!¬|
... "'''
>>> f = io.StringIO(s)
>>> for record in make_records(f):
...    print(record)
... 
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening "
"C22-f0-333"¬"2nd row. This line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening   *******************************************************************   This line also includes the column delimiter within text qualifier   *******************************************************************   # !¬!¬!¬| "

注意:您可能需要根据需要更改生成器yield的内容,例如listtuple而不是str,删除双引号,跳过第一行。我使用io.StringIO仅用于说明目的;正常的";文件