Python: iterate over a list and join lines without a special character to the previous item



I was wondering if anyone has a hacky/cool solution to this problem. I have a text file like this:

NAME:name
ID:id
PERSON:person
LOCATION:location
NAME:name
morenamestuff
ID:id
PERSON:person
LOCATION:location
JUNK

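So I have chunks of data: some consist entirely of lines that can be split into a dictionary, while others contain lines that cannot. How can I join the lines that have no : character to the previous line? Here is what I am currently doing: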

# loop through chunk
    # the first element of dat is a Title, so skip that
    key_map = dict(x.split(':') for x in dat[1:])

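But of course I get an error, because the second chunk has a line without a : character. I would like my dictionary to look like this once it has been split correctly: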

# there will be a key_map for each chunk of data
key_map['NAME'] == 'name morenamestuff' # 3rd line appended to previous
key_map['ID'] == 'id'
key_map['PERSON'] == 'person'
key_map['LOCATION'] == 'location'
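For reference, the error arises because `dict()` receives a one-element list for the line that has no colon. Below is a minimal sketch (Python 3) of one way to fold such lines into the preceding entry before building the dictionary. The helper name `merge_continuations` is hypothetical, and joining with a single space is an assumption taken from the desired output above.

```python
def merge_continuations(lines, sep=':'):
    """Append every line lacking `sep` to the line before it."""
    merged = []
    for line in lines:
        if sep in line or not merged:
            # A keyed line (or the very first line) starts a new entry.
            merged.append(line)
        else:
            # A continuation line is joined to the previous entry.
            merged[-1] = merged[-1] + ' ' + line
    return merged

chunk = ['NAME:name', 'morenamestuff', 'ID:id',
         'PERSON:person', 'LOCATION:location']
key_map = dict(x.split(':', 1) for x in merge_continuations(chunk))
print(key_map['NAME'])
```

A colon-less line at the very start of a chunk has nothing to attach to and is kept as is, so it would still need to be skipped separately (as the title line is in the snippet above).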

Solution edit: this is my final solution on GitHub; the full code is here:

parseScript.py

import re
import string
bad_chars = '(){}"<>[] '     # characters we want to strip from the string
key_map = []
# parse file
with open("dat.txt") as f:
    data = f.read()
    data = data.strip('\n')
    data = re.split(r'\}|\[\{', data)
# format file
with open("format.dat") as f:
    formatData = [x.strip('\n') for x in f.readlines()]
data = filter(len, data)
# strip and split each station
for dat in data[1:-1]:
    # perform black magic, don't even try to understand this
    dat = dat.translate(string.maketrans("", ""), bad_chars).split(',')
    key_map.append(dict(x.split(':') for x in dat if ':' in x ))
    if ':' not in dat[1]:key_map[-1]['NAME']+=dat[1]

for station in range(0, len(key_map)):
    for opt in formatData:
        print opt,":",key_map[station][opt]
    print ""
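The script above is Python 2. The "black magic" line simply deletes every character listed in `bad_chars` and then splits on commas. Under Python 3, `string.maketrans` no longer exists, so an equivalent might look like the sketch below; the `sample` string is made up for illustration and is not taken from dat.txt.

```python
bad_chars = '(){}"<>[] '

# With three arguments, str.maketrans builds a table in which every
# character of the third argument is mapped to None, i.e. deleted.
table = str.maketrans('', '', bad_chars)

def strip_bad_chars(text):
    """Return text with all characters in bad_chars removed."""
    return text.translate(table)

sample = '{"NAME": "Some Station", "ID": [42]}'   # hypothetical input
cleaned = strip_bad_chars(sample)
fields = cleaned.split(',')
print(fields)
```

Note that the space character is in `bad_chars`, so spaces inside values are removed as well.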

dat.txt

See the original file here

format.dat

NAME
STID
LONGITUDE
LATITUDE
ELEVATION
STATE
ID

out.dat

See the raw output here

When in doubt, write your own generator.

Then add itertools.groupby to group the chunks of text that are separated by blank lines.

def chunker(s):
     it = iter(s)
     out = [next(it)]
     for line in it:
         if ':' in line or not line:
             yield ' '.join(out)
             out = []
         out.append(line)
     if out:
         yield ' '.join(out)

Usage:

from itertools import groupby
[dict(x.split(':') for x in g) for k,g in groupby(chunker(lines), bool) if k]
Out[65]: 
[{'ID': 'id', 'LOCATION': 'location', 'NAME': 'name', 'PERSON': 'person'},
 {'ID': 'id',
  'LOCATION': 'location',
  'NAME': 'name morenamestuff',
  'PERSON': 'person'}]
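In case the `groupby(..., bool)` part looks opaque: `chunker` passes an empty string through for each blank line, and `bool('')` is False, so `groupby` starts a new group whenever truthiness flips. Keeping only the groups whose key is True drops the separators. A small Python 3 illustration with a hand-written list (the values are placeholders):

```python
from itertools import groupby

# '' stands in for a blank separator line between two chunks.
items = ['NAME:a', 'ID:1', '', 'NAME:b', 'ID:2']

# groupby yields (key, group) pairs; the key is bool(item) here.
groups = [list(g) for k, g in groupby(items, bool) if k]
print(groups)
```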

(If the fields are always the same, I would go for some namedtuples instead of a bunch of dicts.)

from collections import namedtuple
Thing = namedtuple('Thing', 'ID LOCATION NAME PERSON')
[Thing(**dict(x.split(':') for x in g)) for k,g in groupby(chunker(lines), bool) if k]
Out[76]: 
[Thing(ID='id', LOCATION='location', NAME='name', PERSON='person'),
 Thing(ID='id', LOCATION='location', NAME='name morenamestuff', PERSON='person')]

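Here is something that meets all of your requirements. It handles joining multiple lines, ignores blank lines, and ignores junk lines that appear outside of a chunk. It is implemented as a generator that yields each dictionary as it is completed.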

def parser(data):
    d = {}
    for line in data:
        line = line.strip()
        if not line:
            if d:
                yield d
            d = {}
        else:
            if ':' in line:
                key, value = line.split(':')
                d[key] = value
            else:
                if d:
                    d[key] = '{} {}'.format(d[key], line)
    if d:
        yield d

When run with this data:

ignore me
NAME:name1
ID:id1
PERSON:person1
LOCATION:location1

NAME:name2
morenamestuff
ID:id2
PERSON:person2
LOCATION:location2

junk and other stuff
NAME:name3
morenamestuff
and more
ID:id3
PERSON:person3
more person stuff
LOCATION:location3

junk
more junk

>>> for d in parser(open('data')):
...     print d
{'PERSON': 'person1', 'LOCATION': 'location1', 'NAME': 'name1', 'ID': 'id1'}
{'PERSON': 'person2', 'LOCATION': 'location2', 'NAME': 'name2 morenamestuff', 'ID': 'id2'}
{'PERSON': 'person3 more person stuff', 'LOCATION': 'location3', 'NAME': 'name3 morenamestuff and more', 'ID': 'id3'}

You can grab the lot as a list:

>>> results = list(parser(open('data')))
>>> results
[{'PERSON': 'person1', 'LOCATION': 'location1', 'NAME': 'name1', 'ID': 'id1'}, {'PERSON': 'person2', 'LOCATION': 'location2', 'NAME': 'name2 morenamestuff', 'ID': 'id2'}, {'PERSON': 'person3 more person stuff', 'LOCATION': 'location3', 'NAME': 'name3 morenamestuff and more', 'ID': 'id3'}]
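One caveat: `line.split(':')` returns more than two pieces when the value itself contains a colon (a time such as 12:30, for instance), and the two-name unpacking then raises ValueError. The sketch below is a Python 3 variant that uses `str.partition`, which splits only at the first colon. The name `parse_chunks` and the sample lines are made up for illustration.

```python
def parse_chunks(lines):
    """Yield one dict per blank-line-separated chunk of KEY:value lines."""
    d, key = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current chunk, if there is one.
            if d:
                yield d
            d, key = {}, None
        elif ':' in line:
            # partition splits at the first ':' only.
            key, _, value = line.partition(':')
            d[key] = value
        elif key is not None:
            # Continuation line: join it to the most recent key's value.
            d[key] = d[key] + ' ' + line
        # Otherwise it is junk outside a chunk and is ignored.
    if d:
        yield d

sample = ['junk', 'NAME:name1', 'TIME:12:30', '',
          'NAME:name2', 'more', 'ID:id2']
records = list(parse_chunks(sample))
print(records)
```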

I don't find itertools or regexes particularly pleasant to use, so here is a pure-Python solution:

separator = ':'
output = []
chunk = None
with open('/tmp/stuff.txt') as f:
    for line in (x.strip() for x in f):
        if not line:
            # we are between 'chunks'
            chunk, key = None, None
            continue
        if chunk is None:
            # we are at the beginning of a new 'chunk'
            chunk, key = {}, None
            output.append(chunk)
        if separator in line:
            key, val = line.split(separator)
            chunk[key] = val
        elif key is not None:
            # continuation line: join it to the previous key's value
            chunk[key] += ' ' + line

Not as elegant as what you asked for, but it works:

dat=[['NAME:name',
      'ID:id',
      'PERSON:person',
      'LOCATION:location'],
      ['NAME:name',
      'morenamestuff',
      'ID:id',
      'PERSON:person',
      'LOCATION:location']]
k=1
key_map = dict(x.split(':') for x in dat[k] if ':' in x )
if ':' not in dat[k][1]:key_map['NAME']+=dat[k][1]
# key_map is now:
{'ID': 'id',
'LOCATION': 'location',
'NAME': 'namemorenamestuff',
'PERSON': 'person'}

Just add something to the lines that don't have a ":".

if line.find(':') == -1:
    line=line+':None'
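Be aware of what this produces: the colon-less line becomes a key of its own with the string 'None' as its value, rather than being joined to the previous entry. A quick Python 3 illustration, using a shortened sample chunk:

```python
chunk = ['NAME:name', 'morenamestuff', 'ID:id']

# Pad every line that has no ':' so that split(':') yields two parts.
padded = [line if ':' in line else line + ':None' for line in chunk]
key_map = dict(x.split(':') for x in padded)
print(key_map)
```

This avoids the error, but `key_map['NAME']` stays 'name', so it does not give the joined value the question asks for.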
