I have a report in .txt format that contains account numbers, addresses and credit limits.
It has page breaks, but it generally looks like this:
Customer  Address          Credit limit
A001      Wendy's          20000
          123 Main Street
          City, State
          Zip
I would like my dataframe to look like this:
Customer  Address                                  Credit Limit
A001      Wendy's 123 Main Street, City, State     20000
Here is a link to the sample file I am working with:
http://faculty.tlu.edu/mthompson/IDEA%20files/Customer.txt
I tried skipping rows, but that did not work.
OK, there is nothing difficult in this format, but it is not CSV. So neither the Python csv module nor pandas read_csv
can be used. We will have to parse it by hand.
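The trickiest decision is identifying the first and last line for each customer. I would use:
- the first line starts with a word made only of uppercase letters and digits, ends with a word made only of digits, and is longer than 100 characters
- the block ends at the first empty line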
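The two rules above can be sketched as a pair of small helpers. This is only an illustration: the names `is_first_line` and `iter_blocks` are hypothetical, and the 100-character threshold is taken from the rule as stated rather than verified against every page of the report.

```python
import re

FIRST_WORD = re.compile(r'[A-Z0-9]+')  # uppercase letters and digits only
LAST_WORD = re.compile(r'[0-9]+')      # digits only

def is_first_line(line, min_length=100):
    """Return True if the line looks like the first line of a customer block."""
    words = line.split()
    if len(line) <= min_length or len(words) < 2:
        return False
    return bool(FIRST_WORD.fullmatch(words[0]) and LAST_WORD.fullmatch(words[-1]))

def iter_blocks(lines):
    """Yield one list of raw lines per customer block."""
    block = None
    for line in lines:
        if block is None:
            # Outside a block: wait for a line that opens one.
            if is_first_line(line):
                block = [line]
        elif line.strip():
            # Inside a block: every non-empty line belongs to it.
            block.append(line)
        else:
            # The first empty line closes the block.
            yield block
            block = None
    if block is not None:
        # The file ended without a trailing empty line.
        yield block
```

Since `iter_blocks` accepts any iterable of strings, it can be fed an open file object or a plain list of lines when experimenting.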
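Once that is done:
- the first line contains the account number, the name, the first line of the address and the credit limit
- subsequent lines contain the remaining lines of the address
- fields sit at fixed positions: [5,19), [23,49), [57,77), [90,end_of_line)
In Python this gives: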
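As a quick illustration of the fixed positions, here is a minimal sketch that slices one line at those offsets. The sample line is synthetic, built to match the offsets rather than copied from the real file, and `split_fields` is a hypothetical helper name.

```python
# Half-open (start, end) offsets from the list above; None means "to the end of the line".
FIELD_POS = [(5, 19), (23, 49), (57, 77), (90, None)]

def split_fields(line):
    """Cut a first line into its four fields and trim the padding."""
    return [line[start:end].strip() for start, end in FIELD_POS]

# Synthetic line: account at column 5, name at 23, address at 57, limit at 90.
sample = (" " * 5 + "A001".ljust(18) + "Dan Ackroyd".ljust(34)
          + "Audenshaw".ljust(33) + "20000\n")

print(split_fields(sample))
# ['A001', 'Dan Ackroyd', 'Audenshaw', '20000']
```

Python slices are already half-open, so each `(start, end)` pair maps directly onto `line[start:end]`.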
import re
import pandas as pd

file = 'Customer.txt'  # path to the downloaded report
fieldpos = [(5, 19), (23, 49), (57, 77), (90, None)]  # position of fields in the initial line
inblock = False  # we do not start inside a block
account_pat = re.compile(r'[A-Z]+\d+\s*$')  # regex patterns are compiled once for performance
limit_pat = re.compile(r'\s*\d+$')
data = []  # a list for the accounts
with open(file) as fd:
    for line in fd:
        if not inblock:
            # look for the first line of a block
            if len(line) > 100:
                row = [line[f[0]:f[1]].strip() for f in fieldpos]
                if account_pat.match(row[0]) and limit_pat.match(row[-1]):
                    inblock = True
                    data.append(row)
        else:
            line = line.strip()
            if len(line) > 0:
                # continuation line: append it to the address
                row[2] += ', ' + line
            else:
                # an empty line ends the block
                inblock = False
# we can now build a dataframe
df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])
It finally gives:
    Account Number  Name            Address                                            Credit Limit
 0  A001            Dan Ackroyd     Audenshaw, 125 New Street, Montreal, Quebec, H...         20000
 1  A123            Mike Atsil      The Vetinary House, 123 Dog Row, Thunder Bay, ...         20000
 2  A128            Ivan Aker       The Old House, Ottawa, Ontario, P1D 8D4                   10000
 3  B001            Kim Basinger    Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9           12000
 4  B002            Richard Burton  Eagle Castle, Leafy Lane, Sudbury, Ontario, L3...          9000
 5  B004            Jeff Bridges    Arrow Road North, Lakeside, Kenora, Ontario, N...         20000
 6  B008            Denise Bent     The Dance Studio, Covent Garden, Montreal, Que...         20000
 7  B010            Carter Bout     Removals Close, No Fixed Abode Road, Toronto, ...         20000
 8  B022            Ronnie Biggs    Gotaway Cottage, Thunder Bay, Ontario, K3A 6F3             5000
 9  C001            Tom Cruise      The Firm, Gunnersbury, Waskaganish, Quebec, G1...         25000
10  C003            John Candy      The Sweet Shop, High Street, Trois Rivieres, Q...         15000
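The `...` in the Address column is only pandas shortening long strings for display; the dataframe itself holds the full text. The sketch below shows one way to print the full addresses and to turn the credit limit into a number. It assumes pandas 1.0 or later (where `display.max_colwidth` accepts `None`) and, to stay self-contained, rebuilds a two-row frame from rows shown above instead of reading the file.

```python
import pandas as pd

# Two rows copied from the output above; the parser stores every field as a string.
rows = [
    ['A128', 'Ivan Aker', 'The Old House, Ottawa, Ontario, P1D 8D4', '10000'],
    ['B001', 'Kim Basinger', 'Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9', '12000'],
]
df = pd.DataFrame(rows, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])

# Convert the limit to a numeric column so it can be summed or compared.
df['Credit Limit'] = pd.to_numeric(df['Credit Limit'])

# Lift the column-width cap for this print only, so long addresses are not shortened.
with pd.option_context('display.max_colwidth', None, 'display.width', 200):
    print(df)
```

Using `pd.option_context` keeps the change local to the `with` block, so the rest of the session keeps the default display settings.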