熊猫字符串以列的值结尾,然后将字符串的开头转换为日期进行比较



我有以下df

cluster_id    amount    inv_id            inv_date
1             309.9     07121830990       2018-07-12
1             309.9     07121830990       2018-07-12
2             3130.0    20180501313000B   2018-05-01
2             3130.0    20180501313000B   2018-05-01
3             3330.50   201804253330.50   2018-04-25             
3             3330.50   201804253330.50   2018-04-25
4             70.0      61518             2018-06-15
4             70.0      61518             2018-06-15
5             100.0     011318            2018-01-13
5             100.0     011318            2018-01-13
6             50.0      12202017          2017-12-20
6             50.0      12202017          2017-12-20
7             101.0     0000014482        2017-10-01
7             101.0     0000014482        2017-10-01

我想创建一个由groupbycluster_iddummy_inv_id的布尔列,并将每个组的dummy_invoice_id设置为Trueif,

1. inv_id (stripped non-numerics) ends with amount and the remaining part of inv_id can be coerced into a valid date which is +/- 180 days of the inv_date

2. inv_id (stripped non-numerics) can be coerced into a date which is +/- 180 days of the inv_date 

首先,我将从inv_id中删除任何非数字字符并groupbycluster_id

df['inv_id_stp'] = df.inv_id.str.replace(r'D+', '')
grouped = df.groupby('cluster_id')

然后将amount* 100转换为字符串以方便匹配

df['amount'] = df['amount']*100
df['amt_str'] = df['amount'].apply(str) 

例如309.9'30990'3130.0'313000',这里我想知道如何检查inv_idamount结尾,然后如何检查inv_id的剩余部分是否可以在 +/-180 天内转换为datetime,并且在inv_date的 +/-180 天内,或者是否可以直接转换为日期inv_id。特别是有一些日期格式,即

071218 - 2018-07-12
20180501 - 2018-05-01
61518 - 2018-06-15
12202017 - 2017-12-20
0000014482 - cannot be converted to date 

结果df如下所示,

cluster_id    amount    inv_id            inv_date      dummy_inv_id
1             309.9     07121830990       2018-07-12    True
1             309.9     07121830990       2018-07-12    True
2             3130.0    20180501313000B   2018-05-01    True
2             3130.0    20180501313000B   2018-05-01    True
3             3330.50   201804253330.50   2018-04-25    True          
3             3330.50   201804253330.50   2018-04-25    True
4             70.0      61518             2018-06-15    True
4             70.0      61518             2018-06-15    True
5             100.0     011318            2018-01-13    True
5             100.0     011318            2018-01-13    True
6             50.0      12202017          2017-12-20    True
6             50.0      12202017          2017-12-20    True
7             101.0     0000014482        2017-10-01    False
7             101.0     0000014482        2017-10-01    False

想法是创建具有可能格式的日期时间的帮助程序字典,其中包含用于切片和列表理解转换的字母数 -errors='coerce'为不匹配的值创建NaT

from functools import reduce
#add zeros to length 6
s = df.inv_id.str.replace(r'D+', '').str.zfill(6)
formats = {'%m%d%y':6, 
'%y%m%d':6,
'%Y%m%d':8, 
'%m%d%Y':8}
L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats.items()]

但是某些格式应该转换不好,因此这些超出范围的日期时间转换为NaT

L = [x.where(x.between('2000-01-01', pd.datetime.now())) for x in L]

并将所有非 NaT 值组合Series.combine_first

s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
print (s2)
0    2018-07-12
1    2018-07-12
2    2018-05-01
3    2018-05-01
4    2018-04-25
5    2018-04-25
6    2018-06-15
7    2018-06-15
8    2018-01-13
9    2018-01-13
10   2017-12-20
11   2017-12-20
12          NaT
13          NaT
Name: inv_id, dtype: datetime64[ns]

最后检查+-180天:

df['new'] = s2.between(s2 - pd.Timedelta(180, unit='d'), s2 + pd.Timedelta(180, unit='d'))

print (df)
cluster_id  amount           inv_id    inv_date    new
0            1   309.9      07121830990  2018-07-12   True
1            1   309.9      07121830990  2018-07-12   True
2            2  3130.0  20180501313000B  2018-05-01   True
3            2  3130.0  20180501313000B  2018-05-01   True
4            3  3330.5  201804253330.50  2018-04-25   True
5            3  3330.5  201804253330.50  2018-04-25   True
6            4    70.0            61518  2018-06-15   True
7            4    70.0            61518  2018-06-15   True
8            5   100.0           011318  2018-01-13   True
9            5   100.0           011318  2018-01-13   True
10           6    50.0         12202017  2017-12-20   True
11           6    50.0         12202017  2017-12-20   True
12           7   101.0       0000014482  2017-10-01  False
13           7   101.0       0000014482  2017-10-01  False

编辑:

添加了从末端删除子字符串的解决方案:

import re
from functools import reduce
df['amt_str'] = (df['amount']*100).round().astype(int).astype(str) 
df['inv_str'] = df.inv_id.str.replace(r'D+', '').str.zfill(6)
#https://stackoverflow.com/a/1038845/2901002
df['inv_str'] = df.apply(lambda x: re.sub('{}$'.format(x['amt_str']),'', x['inv_str']),axis=1)
print (df)
cluster_id  amount           inv_id    inv_date amt_str     inv_str
0            1   309.9      07121830990  2018-07-12   30990      071218
1            1   309.9      07121830990  2018-07-12   30990      071218
2            2  3130.0  20180501313000B  2018-05-01  313000    20180501
3            2  3130.0  20180501313000B  2018-05-01  313000    20180501
4            3  3330.5  201804253330.50  2018-04-25  333050    20180425
5            3  3330.5  201804253330.50  2018-04-25  333050    20180425
6            4    70.0            61518  2018-06-15    7000      061518
7            4    70.0            61518  2018-06-15    7000      061518
8            5   100.0           011318  2018-01-13   10000      011318
9            5   100.0           011318  2018-01-13   10000      011318
10           6    50.0         12202017  2017-12-20    5000    12202017
11           6    50.0         12202017  2017-12-20    5000    12202017
12           7   101.0       0000014482  2017-10-01   10100  0000014482
13           7   101.0       0000014482  2017-10-01   10100  0000014482

formats = {'%m%d%y':6, 
'%y%m%d':6,
'%Y%m%d':8, 
'%m%d%Y':8}
L=[pd.to_datetime(df['inv_str'].str[:v],format=k, errors='coerce') for k,v in formats.items()]
L = [x.where(x.between('2000-01-01', pd.datetime.now())) for x in L]
s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
df['new'] = s2.between(s2 - pd.Timedelta(180, unit='d'), s2 + pd.Timedelta(180, unit='d'))
print (df)
cluster_id  amount           inv_id    inv_date amt_str     inv_str    new
0            1   309.9      07121830990  2018-07-12   30990      071218   True
1            1   309.9      07121830990  2018-07-12   30990      071218   True
2            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
3            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
4            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
5            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
6            4    70.0            61518  2018-06-15    7000      061518   True
7            4    70.0            61518  2018-06-15    7000      061518   True
8            5   100.0           011318  2018-01-13   10000      011318   True
9            5   100.0           011318  2018-01-13   10000      011318   True
10           6    50.0         12202017  2017-12-20    5000    12202017   True
11           6    50.0         12202017  2017-12-20    5000    12202017   True
12           7   101.0       0000014482  2017-10-01   10100  0000014482  False
13           7   101.0       0000014482  2017-10-01   10100  0000014482  False

最新更新