Slicing each string row of a pandas DataFrame according to an array of positions

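I need to slice the string rows of a pandas DataFrame at several different positions, and I would like to do it in a vectorized way. Can anyone help? Every row follows this pattern: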

012016010402AAPL34      010APPLE       DRN          R$  000000000415000000000042200000000004150000000000421300000000042080000000003950000000000435000005000000000000012500000000000052664400000000000000009999123100000010000000000000BRAAPLBDR004115

This row holds 26 separate pieces of information, delimited by character position, for example:

['01', '2016/01/04', '02', 'AAPL34      ', ...,'115']

The string position of each field is defined by this array: [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]

I tried to apply this function to the DataFrame, without success:

def row_slice(s, indices):
    return pd.Series([s[i:j] for i, j in zip(indices, indices[1:] + [None])])

The data I am working with can be downloaded from this link:

Can anyone help me?

It looks like you need pandas.read_fwf, which reads a fixed-width file directly into a DataFrame:

import numpy as np
import pandas as pd

# Field boundaries; np.diff turns them into the 26 column widths
l = [0,2,10,12,24,27,39,49,52,56,69,82,95,108,121,134,147,152,170,188,201,202,210,217,230,242,245]
df = pd.read_fwf('filename', widths=np.diff(l), header=None)

Output:

0         1   2       3   4      5    6   7   8     9   ...  16     17  
0   1  20160104   2  AAPL34  10  APPLE  DRN NaN  R$  4150  ...   5  12500   
18             19  20        21       22             23  
0  52664400  000000000000d   i  fferent3  1000000  1000000000000   
24   25  
0  0BRAAPLBDR00  411  
[1 rows x 26 columns]

To keep the leading zeros (by reading every field as a string), add the dtype=str argument:

0         1   2       3    4      5    6    7   8              9   ...  
0  01  20160104  02  AAPL34  010  APPLE  DRN  NaN  R$  0000000004150  ...   
16                  17                  18             19 20        21  
0  00005  000000000000012500  000000000052664400  000000000000d  i  fferent3   
22             23            24   25  
0  1000000  1000000000000  0BRAAPLBDR00  411  
[1 rows x 26 columns]
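The call that produced this second output is not shown. A minimal sketch of what it might look like follows, assuming only the first five fields of the sample row (its first 27 characters) and using io.StringIO as a stand-in for the real file. The names `positions` and `sample` are introduced here for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Assumption: only the first five field boundaries, to keep the example short.
positions = [0, 2, 10, 12, 24, 27]

# First 27 characters of the sample row; io.StringIO replaces the file path.
sample = "012016010402AAPL34      010\n"

df = pd.read_fwf(
    io.StringIO(sample),
    widths=np.diff(positions).tolist(),  # boundaries -> column widths
    header=None,                         # the data has no header row
    dtype=str,                           # keep every field as text
)
print(df.iloc[0].tolist())
```

Because every column is read as a string, values such as 01 and 010 should keep their leading zeros instead of being parsed as integers.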

You can use code along these lines:

df['col name'].str.extractall(r'(\d{2})(\d{8})(\d{2})([A-Z]{4}\d{2})')

Output:

0   1   2   3
match               
0   0   01  20160104    02  AAPL34
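When each row contains exactly one record, a variation on this idea is Series.str.extract with named groups, which returns one row per input row and labels the columns. This is a sketch rather than the answer's own code: the column name `raw` and the group names `kind`, `date`, `code` and `ticker` are hypothetical, since the source does not name the fields.

```python
import pandas as pd

# Hypothetical one-column frame holding the start of the sample row.
df = pd.DataFrame({"raw": ["012016010402AAPL34      010"]})

# Named groups become column names; the names themselves are assumptions.
pattern = (
    r"(?P<kind>\d{2})"           # 2-digit record type
    r"(?P<date>\d{8})"           # 8-digit date
    r"(?P<code>\d{2})"           # 2-digit code
    r"(?P<ticker>[A-Z]{4}\d{2})" # 4 letters followed by 2 digits
)

# str.extract keeps the first match per row, unlike str.extractall.
fields = df["raw"].str.extract(pattern)
print(fields)
```

A regular expression is stricter than positional slicing: a row whose ticker does not fit the 4-letters-plus-2-digits shape would produce NaN in every column.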

Here is a way to split a string into a list of substrings according to indices:

# s is one raw row (a str); indices is the list of field boundaries from the question
lst = [s[indices[i]:indices[i+1]].strip() for i in range(len(indices) - 1)]

Output:

['01', '20160104', '02', 'AAPL34', '010', 'APPLE', 'DRN', '', 'R$', '0000000004150', '0000000004220', '0000000004150', '0000000004213', '0000000004208', '0000000003950', '0000000004350', '00005', '000000000000012500', '000000000052664400', '000000000000d', 'i', 'fferent3', '1000000', '1000000000000', '0BRAAPLBDR00', '411']
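The same slicing can be applied to a whole column at once with the vectorized .str accessor, which is closer to what the question asks for. The sketch below assumes a shortened position list (the first five fields) and a Series named `rows`. Its second entry is a made-up row added only to show that several rows are handled together.

```python
import pandas as pd

# Assumption: first five field boundaries only.
indices = [0, 2, 10, 12, 24, 27]

rows = pd.Series([
    "012016010402AAPL34      010",  # start of the sample row
    "012016010402MSFT34      010",  # made-up row, for illustration only
])

# One vectorized slice per field; columns are numbered 0..n-1.
fields = pd.DataFrame({
    k: rows.str[i:j].str.strip()
    for k, (i, j) in enumerate(zip(indices, indices[1:]))
})
print(fields)
```

This loops over the fields rather than over the rows, so the Python-level work grows with the number of fields and not with the number of rows.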
