I have a 2-column TSV file that looks like this (the real one is much longer):
ntxt    comment
0001    space delim string 1
0001    space delim string 2
0001    space delim string 3
0001    space delim string 4
0001    space delim string 5
0002    space delim string 6
0002    space delim string 7
0003    space delim string 8
0003    space delim string 9
0003    space delim string 10
0003    space delim string 11
Try defaultdict from collections:
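from collections import defaultdict

new_data = defaultdict(list)
with open('readme.txt') as f:
    heading = f.readline()  # read (and skip) the header row
    # split each remaining line into its two tab-separated columns
    lines = [line.strip().split("\t") for line in f]

# collect every second-column value under its first-column key
for row in lines:
    new_data[row[0]].append(row[1])

for key, values in new_data.items():
    print(key, ','.join(values))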
This gives you the following output:
0001 space delim string 1,space delim string 2,space delim string 3,space delim string 4,space delim string 5
0002 space delim string 6,space delim string 7
0003 space delim string 8,space delim string 9,space delim string 10,space delim string 11
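One property of the dictionary approach is that rows with the same key do not have to be adjacent in the file. The sketch below is only an illustration of that point: it works on in-memory lines instead of a file, and the function name group_rows and the sample rows are made up for the example.

```python
from collections import defaultdict


def group_rows(lines, has_header=True):
    """Collect second-column values under each first-column key.

    Hypothetical helper: assumes every non-blank line contains one tab.
    """
    rows = iter(lines)
    if has_header:
        next(rows, None)  # skip the heading row
    grouped = defaultdict(list)
    for line in rows:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        key, value = line.split("\t", 1)
        grouped[key].append(value)
    return dict(grouped)


# Deliberately unsorted sample data (made up for this example)
sample = [
    "ntxt\tcomment\n",
    "0002\tspace delim string 6\n",
    "0001\tspace delim string 1\n",
    "0002\tspace delim string 7\n",
    "0001\tspace delim string 2\n",
]

result = group_rows(sample)
for key, values in result.items():
    print(key, ",".join(values))
```

Keys come out in the order they are first seen, so sort the items before printing if the output order matters.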
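This should do it. It is basically the "report writer" pattern with one level of grouping. Because it only compares each row's key with the previous row's key, it assumes rows with the same key are adjacent (for example, the file is sorted by the first column).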
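col1 = ''     # key of the group currently being collected
columns = []  # values collected for that key
with open('x.txt', 'r') as f:
    for line in f:
        parts = line.strip().split('\t')
        if parts[0] != col1:
            # the key changed: print the finished group, then start a new one
            if col1:
                print(col1+'\t'+(', '.join(columns)))
            col1 = parts[0]
            columns = []
        columns.append( parts[1] )
# print the last group
if col1:
    print(col1+'\t'+(', '.join(columns)))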
Input (the two columns are separated by a tab):
0001 space delim string 1
0001 space delim string 2
0001 space delim string 3
0001 space delim string 4
0001 space delim string 5
0002 space delim string 6
0002 space delim string 7
0002 space delim string 8
0002 space delim string 9
0003 space delim string 10
0003 space delim string 11
0003 space delim string 12
0003 space delim string 13
This produces:
0001 space delim string 1, space delim string 2, space delim string 3, space delim string 4, space delim string 5
0002 space delim string 6, space delim string 7, space delim string 8, space delim string 9
0003 space delim string 10, space delim string 11, space delim string 12, space delim string 13
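For comparison, the same one-level grouping can be sketched with itertools.groupby from the standard library, which also merges only adjacent rows with equal keys. This is an alternative illustration, not the code above: the function name group_adjacent and the in-memory sample are made up, and every non-blank line is assumed to contain a tab.

```python
from itertools import groupby


def group_adjacent(lines):
    """Group consecutive rows that share the same first column.

    Hypothetical helper: assumes each non-blank line has exactly two
    tab-separated columns and that equal keys are adjacent.
    """
    rows = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    return [
        (key, [value for _, value in group])
        for key, group in groupby(rows, key=lambda row: row[0])
    ]


# Sample rows (a shortened, made-up subset of the input format)
sample = [
    "0001\tspace delim string 1\n",
    "0001\tspace delim string 2\n",
    "0002\tspace delim string 6\n",
    "0003\tspace delim string 10\n",
    "0003\tspace delim string 11\n",
]

result = group_adjacent(sample)
for key, values in result:
    print(key + "\t" + ", ".join(values))
```

As with the loop version, an unsorted file would yield the same key more than once, so sort the rows first if that can happen.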