Parse text file data lines and output them as columns



I am trying to parse a test file. The file has usernames, addresses, and phone numbers in the following format:

Name: John Doe1
address : somewhere
phone: 123-123-1234

Name: John Doe2
address : somewhere
phone: 123-123-1233

Name: John Doe3
address : somewhere
phone: 123-123-1232

Only for close to 10k users :) What I would like to do is convert those rows to columns, for example:

Name: John Doe1                address : somewhere          phone: 123-123-1234
Name: John Doe2                address : somewhere          phone: 123-123-1233
Name: John Doe3                address : somewhere          phone: 123-123-1232

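I would prefer to do it in bash, but if you know how to do it in Python that would be great too. The file that contains this information is in /root/docs/information. Any tips or help would be much appreciated.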

One way with GNU awk:

awk 'BEGIN { FS="\n"; RS=""; OFS="\t\t" } { print $1, $2, $3 }' file.txt

Results:

Name: John Doe1     address : somewhere     phone: 123-123-1234
Name: John Doe2     address : somewhere     phone: 123-123-1233
Name: John Doe3     address : somewhere     phone: 123-123-1232

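Note that I've set the output field separator (OFS) to two tab characters (\t\t). You can change this to whatever character or set of characters you please. Setting RS to the empty string makes awk treat each blank-line-separated block as one record. HTH.

With a short Perl one-liner:

$ perl -ne 'END{print "\n"}chomp; /^$/ ? print "\n" : print "$_\t\t"' file.txt

Output: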

Name: John Doe1         address : somewhere             phone: 123-123-1234
Name: John Doe2         address : somewhere             phone: 123-123-1233
Name: John Doe3         address : somewhere             phone: 123-123-1232

Using paste, we can join the lines in the file:

$ paste -s -d"\t\t\t\n" file
Name: John Doe1 address : somewhere     phone: 123-123-1234
Name: John Doe2 address : somewhere     phone: 123-123-1233
Name: John Doe3 address : somewhere     phone: 123-123-1232

This seems to basically do what you want:

information = 'information'  # file path
with open(information, 'rt') as infile:
    data = infile.read()
data = data.split('\n\n')  # records are separated by blank lines
for group in data:
    print(group.replace('\n', '     '))

Output:

Name: John Doe1     address : somewhere     phone: 123-123-1234
Name: John Doe2     address : somewhere     phone: 123-123-1233
Name: John Doe3     address : somewhere     phone: 123-123-1232     
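
I know you didn't mention awk, but it solves your problem nicely:

awk 'BEGIN {RS="";FS="\n"} {print $1,$2,$3}' data.txt

Most of the solutions here just reformat the data in the file you are reading. Maybe that is all you want.

If you actually want to parse the data, put it in a data structure.

This example in Python: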

data="""\
Name: John Doe2
address : 123 Main St, Los Angeles, CA 95002
phone: 213-123-1234

Name: John Doe1
address : 145 Pearl St, La Jolla, CA 92013
phone: 858-123-1233

Name: Billy Bob Doe3
address : 454 Heartland St, Mobile, AL 00103
phone: 205-123-1232""".split('\n\n')      # just a fill-in for your file
                                          # you would use `with open(file) as data:`
addr={}
w0,w1,w2=0,0,0             # these keep track of the max width of the field
for line in data:
    # keep only the text after the label of each of the three lines
    fields=[e.split(':')[1].strip() for e in line.split('\n')]
    nam=fields[0].split()
    name=nam[-1]+', '+' '.join(nam[0:-1])    # "Last, First" for sorting
    addr[(name,fields[2])]=fields            # key: (name, phone)
    w0,w1,w2=[max(t) for t in zip(map(len,fields),(w0,w1,w2))]

Now you are free to sort, change the format, put the data in a database, etc.

This prints your format with that data, sorted:

for add in sorted(addr.keys()):
    print('Name: {0:{w0}} Address: {1:{w1}} phone: {2:{w2}}'.format(*addr[add],w0=w0,w1=w1,w2=w2))

Prints:

Name: John Doe1      Address: 145 Pearl St, La Jolla, CA 92013   phone: 858-123-1233
Name: John Doe2      Address: 123 Main St, Los Angeles, CA 95002 phone: 213-123-1234
Name: Billy Bob Doe3 Address: 454 Heartland St, Mobile, AL 00103 phone: 205-123-1232

That is sorted by the "last name, first name" string used in the dict key.

Now print it sorted by area code:

for add in sorted(addr.keys(),key=lambda x: addr[x][2] ):
    print('Name: {0:{w0}} Address: {1:{w1}} phone: {2:{w2}}'.format(*addr[add],w0=w0,w1=w1,w2=w2))

Prints:

Name: Billy Bob Doe3 Address: 454 Heartland St, Mobile, AL 00103 phone: 205-123-1232
Name: John Doe2      Address: 123 Main St, Los Angeles, CA 95002 phone: 213-123-1234
Name: John Doe1      Address: 145 Pearl St, La Jolla, CA 92013   phone: 858-123-1233

But since the data is in an indexed dictionary, you can instead print it as a table sorted by zip code:

# print table header
print('|{0:^{w0}}|{1:^{w1}}|{2:^{w2}}|'.format('Name','Address','Phone',w0=w0+2,w1=w1+2,w2=w2+2))
print('|{0:^{w0}}|{1:^{w1}}|{2:^{w2}}|'.format('----','-------','-----',w0=w0+2,w1=w1+2,w2=w2+2))
# print data sorted by last field of the address - probably a zip code
for add in sorted(addr.keys(),key=lambda x: addr[x][1].split()[-1]):
    print('|{0:>{w0}}|{1:>{w1}}|{2:>{w2}}|'.format(*addr[add],w0=w0+2,w1=w1+2,w2=w2+2))

Prints:

|      Name      |              Address               |    Phone     |
|      ----      |              -------               |    -----     |
|  Billy Bob Doe3|  454 Heartland St, Mobile, AL 00103|  205-123-1232|
|       John Doe1|    145 Pearl St, La Jolla, CA 92013|  858-123-1233|
|       John Doe2|  123 Main St, Los Angeles, CA 95002|  213-123-1234|
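
The answer above notes that parsed records can go on to a database or another format. As one possible illustration, here is a minimal Python 3 sketch that writes the records out as CSV. The helper names parse_records and to_csv are hypothetical, and the sketch assumes records are separated by blank lines and that each line has the form "label: value".

```python
import csv
import io

def parse_records(text):
    """Split blank-line-separated records into dicts keyed by field label."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            # partition splits on the first colon only
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        records.append(record)
    return records

def to_csv(records, fieldnames=("Name", "address", "phone")):
    """Return the records as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Sample data taken from the answer above; a real script would read the file.
sample = (
    "Name: John Doe1\n"
    "address : 145 Pearl St, La Jolla, CA 92013\n"
    "phone: 858-123-1233\n"
    "\n"
    "Name: John Doe2\n"
    "address : 123 Main St, Los Angeles, CA 95002\n"
    "phone: 213-123-1234\n"
)
csv_text = to_csv(parse_records(sample))
print(csv_text)
```

The csv module quotes the address field because it contains commas, so the file can be read back without ambiguity.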

You should be able to parse it using the split() method on strings:

line = "Name: John Doe1"
key, value = line.split(":")
print(key) # Name
print(value.strip()) # John Doe1 (strip() drops the space after the colon)
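
One caveat with a plain split(":"): if a value ever contains a colon itself, the two-name unpacking raises a ValueError. A small sketch that splits on the first colon only and trims both parts follows; split_field is a hypothetical helper name, and the "hours" line is made-up input used only to show the colon case.

```python
def split_field(line):
    """Split 'label: value' on the first colon only and trim both parts."""
    key, value = line.split(":", 1)
    return key.strip(), value.strip()

print(split_field("Name: John Doe1"))        # ('Name', 'John Doe1')
print(split_field("address : somewhere"))    # ('address', 'somewhere')
# A hypothetical value that itself contains a colon stays intact:
print(split_field("hours: 09:00-17:00"))     # ('hours', '09:00-17:00')
```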

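You can iterate over the lines and print them in columns like this:

# Python 3; in Python 2 use a trailing comma after print instead of end=""
for line in open("/path/to/data"):
    if len(line) != 1:
        # remove \n from the line's end and stop print from adding
        # its own \n, so the next field continues on the same row
        print("%s\t\t" % line.strip(), end="")
    else:
        # re-use the blank lines to end the row
        print()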

#!/usr/bin/env python
def parse(inputfile, outputfile):
    dictInfo = {'Name': None, 'address': None, 'phone': None}
    for line in inputfile:
        if line.startswith('Name'):
            dictInfo['Name'] = line.split(':')[1].strip()
        elif line.startswith('address'):
            dictInfo['address'] = line.split(':')[1].strip()
        elif line.startswith('phone'):
            dictInfo['phone'] = line.split(':')[1].strip()
            # phone is the last field of a record, so write the row now
            s = ('Name: ' + dictInfo['Name'] + '\t' + 'address: ' + dictInfo['address']
                 + '\t' + 'phone: ' + dictInfo['phone'] + '\n')
            outputfile.write(s)

if __name__ == '__main__':
    with open('output.txt', 'w') as outputfile:
        with open('infomation.txt') as inputfile:
            parse(inputfile, outputfile)

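A solution using sed.

cat input.txt | sed '/^$/d' | sed 'N; s:\n:\t\t:; N; s:\n:\t\t:'
  1. The first pipe, sed '/^$/d', removes the blank lines.
  2. The second pipe, sed 'N; s:\n:\t\t:; N; s:\n:\t\t:', combines the lines.

Name: John Doe1         address : somewhere             phone: 123-123-1234
Name: John Doe2         address : somewhere             phone: 123-123-1233
Name: John Doe3         address : somewhere             phone: 123-123-1232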
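
For comparison, the same two steps (drop blank lines, then join every three lines) can be sketched in Python 3. This is only an illustration under the assumption that every record has exactly three lines; rows_of_three is a hypothetical name, and an incomplete trailing record is silently dropped by zip in this sketch.

```python
def rows_of_three(lines, sep="\t\t"):
    """Drop blank lines, then join every three consecutive lines with sep."""
    fields = [line.strip() for line in lines if line.strip()]
    it = iter(fields)
    # zipping the same iterator three times yields consecutive triples
    return [sep.join(group) for group in zip(it, it, it)]

sample = """Name: John Doe1
address : somewhere
phone: 123-123-1234

Name: John Doe2
address : somewhere
phone: 123-123-1233

Name: John Doe3
address : somewhere
phone: 123-123-1232
""".splitlines()

for row in rows_of_three(sample):
    print(row)
```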

In Python:

results = []
cur_item = None
with open('/root/docs/information') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank separator lines
        key, value = line.split(':', 1)
        key = key.strip()
        value = value.strip()
        if key == "Name":
            # a Name line starts a new record
            cur_item = {}
            results.append(cur_item)
        cur_item[key] = value
for item in results:
    print(item)
