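I would like to ask for your help. First, let me describe my problem. I have two files, each of which is like an array: every line is a row, and the words in a row are separated by spaces.
First: [9 columns], of which 3 columns are important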
2001 5276 data3 data4 data5 data6 data7 data8 data9
2001 23243 data3 data4 data5 data6 data7 data8 data9
....
2001 434343 data3 data4 data5 data6 data7 data8 data9
2002 233 data3 data4 data5 data6 data7 data8 data9
....
2002 23232 data3 data4 data5 data6 data7 data8 data9
Second:[5 columns]
2001 23243 data3' data4' data5'
2001 5276 data3' data4' data5'
....
2001 434343 data3' data4' data5'
2002 23232 data3' data4' data5'
....
2002 233 data3' data4' data5'
I would like to create one file from the two above, containing an array like this, for example:
2001 5276 data3 data3' data4' data5'
2001 23243 data3 data3' data4' data5'
....
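I have to check whether the data in the first two columns are equal in both files, and then join the matching rows :) So far I have found this program, but I don't know how to change it in the right way: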
file1 = open('file1', 'r')
file2 = open('file2', 'r')
matrix1 = [line.rstrip().split(' ') for line in file1.readlines()]
matrix2 = [line.rstrip().split(' ') for line in file2.readlines()]
file1.close()
file2.close()
#combine
t_matrix1 = [[r[col] for r in matrix1] for col in range(len(matrix1[0]))]
t_matrix2 = [[r[col] for r in matrix2] for col in range(len(matrix2[0]))]
final_t_matrix = []
for i in (t_matrix1 + t_matrix2):
    if i not in final_t_matrix:
        final_t_matrix.append(i)
final_matrix = [[r[col] for r in final_t_matrix] for col in range(len(final_t_matrix[0]))]
#output
outfile = open('out.txt', 'w')
for i in final_matrix:
    for j in i[:-1]:
        outfile.write(j+', ')
    outfile.write(i[-1]+'\n')
outfile.close()
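What you want here is a dictionary that maps the first two columns of each row in First to the whole row. That way, as you go through Second, you can look up the first two columns and append to the row you find there.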
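There are a few questions to answer, which will determine exactly what kind of dictionary you need:
- Do the rows have to stay in the same order as they are in First?
- What happens if a row in Second has no matching row in First?
- And vice versa?
- What if there are multiple rows with the same first two columns in either file?
Let's assume the answers are "no" to the first and "can't happen" to the rest. Then you can use a simple dict: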
with open('file1') as file1:
    lines = (line.rstrip().split() for line in file1)
    # Map the first two columns of each row to its first three columns.
    rows = {tuple(line[:2]): line[:3] for line in lines}
with open('file2') as file2:
    for line in file2:
        row = line.rstrip().split()
        # extend, not append: add the remaining columns one by one.
        rows[tuple(row[:2])].extend(row[2:])
with open('out.txt', 'w') as outfile:
    for row in rows.values():
        outfile.write(', '.join(row) + '\n')
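If a row in Second may have no match in First, or the order of First has to be kept, the same idea still works with a little more care. The following is only a minimal sketch under stated assumptions: merge_rows is a hypothetical helper name, it takes lists of lines rather than open files, unmatched rows in Second are silently skipped, unmatched rows in First keep just their three columns, and a repeated key in First keeps only the last row. Since Python 3.7 a plain dict preserves insertion order, so the result follows the order of First.

```python
def merge_rows(first_lines, second_lines):
    """Join rows on their first two columns (hypothetical helper)."""
    rows = {}
    for line in first_lines:
        row = line.split()
        if len(row) < 3:
            continue  # skip blank or too-short lines
        rows[tuple(row[:2])] = row[:3]
    for line in second_lines:
        row = line.split()
        # dict.get returns None instead of raising KeyError on a miss.
        target = rows.get(tuple(row[:2]))
        if target is not None:
            target.extend(row[2:])
    return list(rows.values())


first = ["2001 5276 a3 a4 a5 a6 a7 a8 a9",
         "2002 233 b3 b4 b5 b6 b7 b8 b9"]
second = ["2002 233 b3' b4' b5'",
          "2001 5276 a3' a4' a5'",
          "2003 1 x y z"]          # no match in first: skipped

merged = merge_rows(first, second)
for row in merged:
    print(' '.join(row))
```

With real files, `first_lines` and `second_lines` can be the open file objects themselves, since a file is an iterable of lines.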
The first part might be easier for a novice to understand if I spell it out more explicitly, so let me do that:
rows = {}
with open('file1') as file1:
    for line in file1:
        row = line.rstrip().split()
        first_two_columns = tuple(row[:2])
        first_three_columns = row[:3]
        rows[first_two_columns] = first_three_columns
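I made a few other simplifications:
- Use a with statement so you don't have to call close.
- Don't use readlines; a file is already an iterable of lines, and all you're doing is making Python read the whole file into memory and split it into lines, using even more memory, before you can start processing those lines.
- split() splits on any run of whitespace, which is probably what you want here, rather than split(' '), which splits only on single space characters.
- ', '.join(i) gives you all the members of i with ', ' between each pair, just like what you were doing with that inner loop.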
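To make the difference between the two split calls concrete, here is a small illustration. The sample line is made up for the demonstration: it contains two consecutive spaces, a tab and a trailing newline.

```python
line = "2001  5276\tdata3 \n"  # two spaces, then a tab, then a trailing space

# split(' ') cuts at every single space: the double space yields an
# empty string, and the tab is not treated as a separator at all.
by_space = line.rstrip().split(' ')
print(by_space)        # ['2001', '', '5276\tdata3']

# split() with no argument cuts on any run of whitespace and ignores
# leading and trailing whitespace, so rstrip() is not even needed.
by_whitespace = line.split()
print(by_whitespace)   # ['2001', '5276', 'data3']

# join puts the separator between each pair of members.
joined = ', '.join(by_whitespace)
print(joined)          # 2001, 5276, data3
```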
>>> f = open('FileA').readlines()
>>> f1 = open('FileB').readlines()
>>> for i in range(len(f)):
...     x = f[i].strip().split()
...     for j in range(len(f1)):
...         y = f1[j].strip().split()
...         if x[0] == y[0] and x[1] == y[1]:
...             print(x[0], x[1], x[2], " ".join(y[2:]))
...
2001 5276 data3 data3' data4' data5'
2001 23243 data3 data3' data4' data5'
2001 434343 data3 data3' data4' data5'
2002 233 data3 data3' data4' data5'
2002 23232 data3 data3' data4' data5'
I have just printed the result; you can write it to a file instead.
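Writing to a file instead of printing could look roughly like this. It is a minimal sketch, not part of the answer above: write_matches is a hypothetical helper name, the sample rows are made up, and io.StringIO stands in for a real output file so the example runs without touching the disk.

```python
import io


def write_matches(lines_a, lines_b, out):
    """Write joined rows to `out`, any object with a write() method."""
    for line_a in lines_a:
        x = line_a.split()
        if len(x) < 3:
            continue  # skip blank or too-short lines
        for line_b in lines_b:
            y = line_b.split()
            if x[:2] == y[:2]:
                out.write(' '.join(x[:3] + y[2:]) + '\n')


lines_a = ["2001 5276 data3 data4", "2002 233 data3 data4"]
lines_b = ["2002 233 data3' data4' data5'", "2001 5276 data3' data4' data5'"]

buffer = io.StringIO()  # stands in for open('out.txt', 'w')
write_matches(lines_a, lines_b, buffer)
result = buffer.getvalue()
print(result, end='')
```

With real files, pass the lists returned by readlines() and an output file opened in a with statement. Note that the nested loop compares every row of one file with every row of the other, so for large files a dictionary lookup is likely to be much faster.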
file1 = open('file1', 'r')
file2 = open('file2', 'r')
rows = 0
finalfile = None
for lineno, line in enumerate(file1):
    row1 = line.rstrip().split()
    first_column1 = row1[0]
    second_column1 = row1[1]
    #print(str(first_two_columns1)+ " "+ str(first_three_columns1)+ "\n")
    for lineno, line in enumerate(file2):
        row2 = line.rstrip().split()
        first_column2 = row2[0]
        second_column2 = row2[1]
        #print(str(first_two_columns1)+ " "+ str(first_two_columns2)+ "\n")
        if(float(first_column1) == float(first_column2)) and (second_column1 == second_column2):
            new_line = row1[0] + " " + row1[1] + " " + row1[2] + " " + row2[2] + " " + row2[3] + "\n"
            rows = new_line
            final_filename = 'final_file_{}.txt'.format(row1[0])
            finalfile = open(final_filename, "w")
            finalfile.write(line)
if finalfile:
    finalfile.close()
file1.close()
file2.close()
file1.close()
file2.close()
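Abarnet, thank you for your advice; thanks to it I developed my script :) I have one problem: my program creates the array, but it is always the same row :) How can I fix it?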
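Three things in the script above could plausibly produce that symptom. A file object can only be iterated once, so the inner loop over file2 yields nothing after the first row of file1. The output file is reopened in "w" mode for every match, which truncates whatever was written before. And the script writes line, not new_line. The following minimal sketch avoids all three. It makes several assumptions: group_by_first_column is a hypothetical helper name, it works on lists of lines, it appends every remaining column of the second file (as in the desired output of the question) rather than only two, and it collects one group of rows per value of the first column.

```python
from collections import defaultdict


def group_by_first_column(lines1, lines2):
    """Return {first column: [joined lines]} (hypothetical helper)."""
    # Read the second file once into a lookup table keyed on two columns.
    lookup = {}
    for line in lines2:
        row = line.split()
        if len(row) >= 2:
            lookup[tuple(row[:2])] = row[2:]
    grouped = defaultdict(list)
    for line in lines1:
        row = line.split()
        extra = lookup.get(tuple(row[:2]))
        if len(row) >= 3 and extra is not None:
            grouped[row[0]].append(' '.join(row[:3] + extra))
    return dict(grouped)


lines1 = ["2001 5276 a3 a4", "2001 23243 b3 b4", "2002 233 c3 c4"]
lines2 = ["2002 233 c3' c4' c5'", "2001 5276 a3' a4' a5'",
          "2001 23243 b3' b4' b5'"]

grouped = group_by_first_column(lines1, lines2)
for key, joined in grouped.items():
    # In a real script, open 'final_file_{}.txt'.format(key) once per key
    # here and write '\n'.join(joined) + '\n' to it.
    print(key, joined)
```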