Odd .txt report into a Pandas DataFrame



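I have a report containing account numbers, addresses, and credit limits, and it is in .txt format.

It has page breaks, but it generally looks like this: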

Customer  Address          Credit limit
A001      Wendy's          20000
          123 Main Street
          City, State Zip

I would like my DataFrame to look like this:

Customer  Address                                   Credit Limit
A001      Wendy's 123 Main Street, City, Statement  20000

Here is a link to the sample csv I am working with:

http://faculty.tlu.edu/mthompson/IDEA%20files/Customer.txt

I tried skipping rows, but that did not work.

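OK, there is nothing difficult about this format, but it is not csv. So neither the Python csv module nor pandas read_csv can be used. We will have to parse it by hand.

The most complex decision is identifying the first and last line of each customer. I would use:

  • The first line is more than 100 characters long, starts with a word made up only of uppercase letters and digits, and ends with a word made up only of digits
  • The block ends at the first empty line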
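
The two rules above can be sketched as a small predicate plus a loop that groups lines into blocks. This is only an illustration: the sample lines are synthetic (built to mimic the layout, not copied from the real report), `is_first_line` and `split_blocks` are hypothetical helper names, and testing the whole line with a single regex is a simplification chosen for this sketch.

```python
import re

# First word: uppercase letters then digits (e.g. A001); last word: digits only.
FIRST_LINE = re.compile(r'\s*[A-Z]+\d+\s.*\s\d+\s*$')

def is_first_line(line):
    """Return True if the line looks like the first line of a customer block."""
    return len(line) > 100 and FIRST_LINE.match(line) is not None

def split_blocks(lines):
    """Group lines into blocks: start at a first line, stop at the first blank line."""
    blocks, current = [], None
    for line in lines:
        if current is None:
            if is_first_line(line):
                current = [line]
        elif line.strip():
            current.append(line)
        else:
            blocks.append(current)
            current = None
    if current is not None:  # input ended without a trailing blank line
        blocks.append(current)
    return blocks

# Synthetic sample (assumption): account at column 5, limit right-aligned past column 90.
first = (' ' * 5 + 'A001'.ljust(18) + "Wendy's".ljust(34)
         + '123 Main Street'.ljust(33) + '20000'.rjust(15))
sample = ['Page 1 header', first, ' ' * 57 + 'City, State Zip', '', 'footer']
blocks = split_blocks(sample)
print(len(blocks), len(blocks[0]))
```

Short header and footer lines fail the length test, so page breaks are skipped without any special handling.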

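Once that is done:

  • The first line contains the account number, the name, the first line of the address, and the credit limit
  • Subsequent lines contain the other lines of the address
  • Fields are at fixed positions: [5,19), [23,49), [57,77), [90,end_of_line)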
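
As a minimal sketch of the fixed-position slicing described above, the snippet below cuts one synthetic first line into its four fields and then appends a continuation line to the address. The line is built to match those offsets rather than taken from the real file, and `slice_fields` is a hypothetical helper name.

```python
# Half-open (start, end) column ranges; None means "to the end of the line".
FIELDPOS = [(5, 19), (23, 49), (57, 77), (90, None)]

def slice_fields(line, fieldpos=FIELDPOS):
    """Cut a fixed-width line into stripped fields."""
    return [line[start:end].strip() for start, end in fieldpos]

# Synthetic line (assumption): account at column 5, name at 23,
# address at 57, credit limit somewhere after column 90.
line = (' ' * 5 + 'A001'.ljust(18) + "Wendy's".ljust(34)
        + '123 Main Street'.ljust(33) + '20000'.rjust(15) + '\n')
fields = slice_fields(line)

# Later lines of the block only carry more address text; append them.
for extra in [' ' * 57 + 'City, State Zip\n']:
    fields[2] += ', ' + extra.strip()

print(fields)
```

Stripping each slice also removes the trailing newline from the last field, so no separate end-of-line handling is needed.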

In Python, that gives:

import re
import pandas as pd

file = 'Customer.txt'                            # assumed local path to the downloaded report

# position of fields in the initial line; None means "up to the end of the line"
fieldpos = [(5,19), (23,49), (57,77), (90, None)]
inblock = False                                  # we do not start inside a block
account_pat = re.compile(r'[A-Z]+\d+\s*$')       # regex patterns are compiled once for performance
limit_pat = re.compile(r'\s*\d+$')
data = []                                        # a list for the accounts
with open(file) as fd:
    for line in fd:
        if not inblock:
            # candidate first line of a block: long enough, account number first, digits last
            if (len(line) > 100):
                row = [line[f[0]:f[1]].strip() for f in fieldpos]
                if account_pat.match(row[0]) and limit_pat.match(row[-1]):
                    inblock = True
                    data.append(row)
        else:
            line = line.strip()
            if len(line) > 0:
                row[2] += ', ' + line            # continuation line: more address text
            else:
                inblock = False                  # an empty line ends the block
# we can now build a dataframe
df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])

It finally gives:

   Account Number                 Name                                            Address Credit Limit
0            A001          Dan Ackroyd  Audenshaw, 125 New Street, Montreal, Quebec, H...        20000
1            A123           Mike Atsil  The Vetinary House, 123 Dog Row, Thunder Bay, ...        20000
2            A128            Ivan Aker            The Old House, Ottawa, Ontario, P1D 8D4        10000
3            B001         Kim Basinger    Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9        12000
4            B002       Richard Burton  Eagle Castle, Leafy Lane, Sudbury, Ontario, L3...         9000
5            B004         Jeff Bridges  Arrow Road North, Lakeside, Kenora, Ontario, N...        20000
6            B008          Denise Bent  The Dance Studio, Covent Garden, Montreal, Que...        20000
7            B010          Carter Bout  Removals Close, No Fixed Abode Road, Toronto, ...        20000
8            B022         Ronnie Biggs     Gotaway Cottage, Thunder Bay, Ontario, K3A 6F3         5000
9            C001           Tom Cruise  The Firm, Gunnersbury, Waskaganish, Quebec, G1...        25000
10           C003           John Candy  The Sweet Shop, High Street, Trois Rivieres, Q...        15000
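
Every column in that frame holds strings, and the default display shortens long addresses with "...". A possible follow-up, sketched here on a two-row stand-in frame copied from the output above rather than on the parsed file, converts the limit to integers and prints the cells in full. It assumes a reasonably recent pandas, where `to_string()` does not truncate cells by default.

```python
import pandas as pd

# Two rows copied from the result above, standing in for the parsed data.
df = pd.DataFrame(
    [['A128', 'Ivan Aker', 'The Old House, Ottawa, Ontario, P1D 8D4', '10000'],
     ['B001', 'Kim Basinger', 'Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9', '12000']],
    columns=['Account Number', 'Name', 'Address', 'Credit Limit'])

# The parser stores every field as text, so make the limit numeric.
df['Credit Limit'] = pd.to_numeric(df['Credit Limit'])

# to_string() renders the whole frame as text.
print(df.to_string())
total = df['Credit Limit'].sum()
```

With a numeric column, sums, sorting, and comparisons on the credit limit behave as expected.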
