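I want to create a column in a pandas DataFrame `df` based on another column, `ID`. For `ID` values that contain the string `SAT`, I want to extract the floats connected by the special character "-" and put the extracted result in a new column named `new_col`. If `ID` does not contain the string `SAT`, the value should stay NaN.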
`df` looks like this:
         Date                   ID  Time
0  2007-01-10           SAT 1 HHSP   900
1  2007-01-10        DOUBLE 7 HHSP   900
2  2007-01-10  SAT GF-06-5CSBG.431  1000
3  2007-01-10        MA HYDRO HHSP   900
4  2007-01-10           2.233 HHSP   900
5  2007-01-10  SAT L2-15-3CSB1.252  1000
6  2007-01-10       SECTION 6 HHSP   900
Expected output:
         Date                   ID  Time  new_col
0  2007-01-10           SAT 1 HHSP   900      NaN
1  2007-01-10        DOUBLE 7 HHSP   900      NaN
2  2007-01-10  SAT GF-06-5CSBG.431  1000     06-5
3  2007-01-10        MA HYDRO HHSP   900      NaN
4  2007-01-10           2.233 HHSP   900      NaN
5  2007-01-10  SAT L2-15-3CSB1.252  1000     15-3 *
6  2007-01-10       SECTION 6 HHSP   900      NaN

* In row 5, 15-3 is extracted rather than 2-15, because L2 does not consist entirely of digits.
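Use `Series.str.extract` to capture digits joined by `-` that have a `-` (or whitespace, or `+`) directly before them, and apply it only to the values containing `SAT`, filtered with `Series.str.contains`: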
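# Boolean mask: True for rows whose ID contains "SAT"
m = df['ID'].str.contains('SAT')
# Capture digits-digits preceded by "-", whitespace, or "+"; rows outside the mask stay NaN
df['new_col'] = df.loc[m, 'ID'].str.extract(r'[-\s+](\d+-\d+)', expand=False)
print(df)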
         Date                   ID  Time  new_col
0  2007-01-10           SAT 1 HHSP   900      NaN
1  2007-01-10        DOUBLE 7 HHSP   900      NaN
2  2007-01-10  SAT GF-06-5CSBG.431  1000     06-5
3  2007-01-10        MA HYDRO HHSP   900      NaN
4  2007-01-10           2.233 HHSP   900      NaN
5  2007-01-10  SAT L2-15-3CSB1.252  1000     15-3
6  2007-01-10       SECTION 6 HHSP   900      NaN
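To see why row 5 yields 15-3 and not 2-15, it can help to try the pattern on a few of the `ID` strings with Python's standard `re` module. This is only a sketch: it assumes the pattern is meant to read `[-\s+](\d+-\d+)` and that `str.extract` keeps the first match, as `re.search` does. The helper name `first_match` is made up for illustration.

```python
import re

# Assumed pattern: one of "-", whitespace, or "+", then digits-digits (captured).
pattern = re.compile(r'[-\s+](\d+-\d+)')

ids = ['SAT 1 HHSP', 'SAT GF-06-5CSBG.431', 'SAT L2-15-3CSB1.252']


def first_match(text):
    """Return the first captured group, or None when nothing matches."""
    found = pattern.search(text)
    return found.group(1) if found else None


results = {text: first_match(text) for text in ids}
for text, value in results.items():
    print(f'{text!r} -> {value}')
```

In `SAT L2-15-3CSB1.252` the candidate `2-15` is preceded by `L`, which is not in the character class, so the first position that qualifies is the `-` in front of `15-3`.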
If `SAT` should be at the start of the values in the column, it is also possible to use:
df['new_col'] = df['ID'].str.extract(r'^SAT.*[-\s+](\d+-\d+)', expand=False)
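The two approaches can be compared side by side with a small, self-contained sketch. It rebuilds only the `ID` column from the question and adds one hypothetical row, `HHSP SAT-12-3`, which is not in the original data and is there only to show a value where `SAT` is not at the start. The column names `by_mask` and `anchored` are also made up for this comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [
        'SAT 1 HHSP',
        'DOUBLE 7 HHSP',
        'SAT GF-06-5CSBG.431',
        'MA HYDRO HHSP',
        '2.233 HHSP',
        'SAT L2-15-3CSB1.252',
        'SECTION 6 HHSP',
        'HHSP SAT-12-3',  # hypothetical row: SAT present, but not at the start
    ]
})

# Approach 1: filter with str.contains, then extract on the filtered rows only.
m = df['ID'].str.contains('SAT')
df['by_mask'] = df.loc[m, 'ID'].str.extract(r'[-\s+](\d+-\d+)', expand=False)

# Approach 2: one pattern anchored with ^SAT, applied to every row.
df['anchored'] = df['ID'].str.extract(r'^SAT.*[-\s+](\d+-\d+)', expand=False)

print(df)
```

On the question's seven rows both columns agree. They differ only on the hypothetical last row, where the mask-based version extracts a value and the anchored version leaves NaN, so the choice depends on whether `SAT` anywhere in the string should count.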