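I need to slice the rows of a pandas column of strings at different positions, and I would like to use vectorization to do it. Can anyone help? Every row follows this pattern: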
012016010402AAPL34 010APPLE DRN R$ 000000000415000000000042200000000004150000000000421300000000042080000000003950000000000435000005000000000000012500000000000052664400000000000000009999123100000010000000000000BRAAPLBDR004115
Each row holds 26 different pieces of information, delimited by character position, for example:
['01', '2016/01/04', '02', 'AAPL34 ', ...,'115']
The string position of each field is defined by this array: [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]
I tried to apply this function to the DataFrame, without success:
def row_slice(s, indices):
    return pd.Series([s[i:j] for i, j in zip(indices, indices[1:] + [None])])
The data I am using can be downloaded from this link:
Can anyone help me?
It looks like you need pandas.read_fwf, which reads your fixed-width file directly:
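import numpy as np
import pandas as pd

# Field boundaries from the question; np.diff turns them into 26 column widths.
l = [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]
df = pd.read_fwf('filename', widths=np.diff(l), header=None)
Output: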
0 1 2 3 4 5 6 7 8 9 ... 16 17
0 1 20160104 2 AAPL34 10 APPLE DRN NaN R$ 4150 ... 5 12500
18 19 20 21 22 23
0 52664400 000000000000d i fferent3 1000000 1000000000000
24 25
0 0BRAAPLBDR00 411
[1 rows x 26 columns]
To keep the leading zeros (by reading every field as a string), add the dtype=str parameter:
0 1 2 3 4 5 6 7 8 9 ...
0 01 20160104 02 AAPL34 010 APPLE DRN NaN R$ 0000000004150 ...
16 17 18 19 20 21
0 00005 000000000000012500 000000000052664400 000000000000d i fferent3
22 23 24 25
0 1000000 1000000000000 0BRAAPLBDR00 411
[1 rows x 26 columns]
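As a minimal sketch of the same idea, read_fwf also accepts colspecs, a list of half-open (start, end) pairs, so the boundary list can be used without converting it to widths. The two-line sample, the three-field layout, and the name bounds below are made up for illustration and are not the asker's data:

```python
import io
import pandas as pd

# Synthetic sample (an assumption, not the real file): a 2-char code,
# an 8-char date, and a 6-char ticker padded with spaces.
text = "0120160104AAPL34\n0220160105MSFT  \n"

# Field boundaries in the same style as the array in the question.
bounds = [0, 2, 10, 16]

# Pair each boundary with the next one to get half-open (start, end) spans.
colspecs = list(zip(bounds, bounds[1:]))

# dtype=str keeps leading zeros; read_fwf strips the padding spaces.
df = pd.read_fwf(io.StringIO(text), colspecs=colspecs, header=None, dtype=str)
print(df)
```

Passing widths=np.diff(bounds) should describe the same columns; colspecs is mainly convenient when the fields of interest are not contiguous.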
You can use code like this, with one capture group per field:
df['col name'].str.extractall(r'(\d{2})(\d{8})(\d{2})([A-Z]{4}\d{2})')
Output:
0 1 2 3
match
0 0 01 20160104 02 AAPL34
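If each row contains exactly one record, Series.str.extract may be a simpler fit than extractall, because it returns one row per input string without the extra match index level. The sketch below assumes that; the sample strings and the group names (code, date, flag, ticker) are hypothetical labels chosen for illustration:

```python
import pandas as pd

# Synthetic rows covering only the first four fields (not the real data).
s = pd.Series(["012016010402AAPL34", "012016010502MSFT11"])

# Named groups become the column names of the result.
pattern = (
    r"(?P<code>\d{2})"
    r"(?P<date>\d{8})"
    r"(?P<flag>\d{2})"
    r"(?P<ticker>[A-Z]{4}\d{2})"
)

# extract returns the first match per row as a DataFrame.
parts = s.str.extract(pattern)
print(parts)
```

A regex only works well while every field has a recognizable shape; for all 26 positional fields, slicing by position is likely easier to maintain.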
Here is a way to split a string s into a list of substrings according to indices:
lst = [s[indices[i]:indices[i+1]].strip() for i in range(len(indices) - 1)]
Output:
['01', '20160104', '02', 'AAPL34', '010', 'APPLE', 'DRN', '', 'R$', '0000000004150', '0000000004220', '0000000004150', '0000000004213', '0000000004208', '0000000003950', '0000000004350', '00005', '000000000000012500', '000000000052664400', '000000000000d', 'i', 'fferent3', '1000000', '1000000000000', '0BRAAPLBDR00', '411']
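The list comprehension above handles a single string. To apply the same slicing to a whole column without a Python-level loop over rows, one option is to loop over the field boundaries instead and let Series.str.slice cut every row at once. This is only a sketch on made-up data; the names lines and fields are assumptions for illustration:

```python
import pandas as pd

# Synthetic column of fixed-width strings (not the asker's data).
lines = pd.Series(["0120160104AAPL34", "0220160105MSFT  "])
indices = [0, 2, 10, 16]

# One str.slice call per field; each call processes all rows together.
fields = pd.DataFrame({
    k: lines.str.slice(start, stop).str.strip()
    for k, (start, stop) in enumerate(zip(indices, indices[1:]))
})
print(fields)
```

The loop runs once per field (26 times for the layout in the question) rather than once per row, which is usually the cheaper direction when the data has many rows.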