按条件[Low,Hit,High]分组并创建三个新列



我有一个大型数据集(约500万行(,其中包含机器学习训练的结果。现在我想检查一下结果是否达到";目标范围";是否。假设这个范围包含-0.25+0.25之间的所有值。如果它在这个范围内,它就是Hit,如果它在Low以下,在High的另一边。

我现在将创建这三列Hit、Low、High,并为每一行计算适用的条件,并将一个1放入该列中,另外两列将成为0。之后,我会将这些值分组并进行汇总。但我怀疑一定有更好更快的方法,比如分组时直接计算。


数据

import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
+----+--------+---------+
|    | Type   |   Value |
|----+--------+---------|
|  0 | RF     |    -1.5 | <- Low
|  1 | RF     |    -0.1 | <- Hit
|  2 | RF     |     1.7 | <- High
|  3 | MLP    |     0.2 | <- Hit
|  4 | MLP    |    -0.7 | <- Low
|  5 | MLP    |    -0.6 | <- Low
+----+--------+---------+

预期输出

pd.DataFrame({"Type":["RF", "MLP"], "Low":[1,2], "Hit":[1,1], "High":[1,0]})
+----+--------+-------+-------+--------+
|    | Type   |   Low |   Hit |   High |
|----+--------+-------+-------+--------|
|  0 | RF     |     1 |     1 |      1 |
|  1 | MLP    |     2 |     1 |      0 |
+----+--------+-------+-------+--------+

您可以使用cut来定义组,使用pivot_table来重塑:

(df.assign(group=pd.cut(df['Value'],
[float('-inf'), -0.25, 0.25, float('inf')],
labels=['Low', 'Hit', 'High']))
.pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
.reset_index()
.rename_axis(None, axis=1)
)

crosstab:

(pd.crosstab(df['Type'],
pd.cut(df['Value'],
[float('-inf'), -0.25, 0.25, float('inf')],
labels=['Low', 'Hit', 'High'])
)
.reset_index().rename_axis(None, axis=1)
)

输出:

Type  Low  Hit  High
0  MLP    2    1     0
1   RF    1    1     1

您可以先用np.selectassign,然后用crosstab

c1 = df.Value<=-0.25
c2 = df.Value>=0.25
out = pd.crosstab(df['Type'], np.select([c1,c2], ['Low','High'], default='Hit'))
out
Out[32]: 
col_0  High  Hit  Low
Type                 
MLP       0    1    2
RF        1    1    1

你可以试试这个:

# Your code
import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
# Set your range 
RANGE_MIN = -0.25
RANGE_MAX = 0.25
# --- define functions to be applied to df ---
# evaluate if value is a low
def eval_low(value):
if value < RANGE_MIN:
return 1
else:
return 0
# evaluate if value is a high
def eval_high(value):
if value > RANGE_MAX:
return 1
else:
return 0
# evaluate if value is a hit
def eval_hit(value):
if value >= RANGE_MIN and value <= RANGE_MAX:
return 1
else:
return 0
# Evaluate the functions in new columns
df['Low'] = df.Value.apply(eval_low)
df['Hit'] = df.Value.apply(eval_hit)
df['High'] = df.Value.apply(eval_high)
# get the summary
df.groupby('Type').sum()
df.assign(Value=pd.cut(df.Value,[-np.inf,-0.25,0.25,np.inf],labels=['Low','Hit','High']))
.groupby('Type').value_counts().reset_index()
.pivot_table(index='Type',columns='Value',values=0)
Value  Low  Hit  High
Type                 
MLP      2    1     0
RF       1    1     1
import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
df ['target'] = df.agg({'Value' : lambda x: 'hit' if abs(x)<0.25 else 'low' if x < -0.25 else 'high'})
df['Value'] = 1 
r = df.groupby(['Type', 'target']).count().unstack().fillna(0)
r.columns = list(r.columns.droplevel(0))
print(r)

结果

high  hit  low
Type                
MLP    0.0  1.0  2.0
RF     1.0  1.0  1.0

以下是使用pd.cut().str.get_dummies()的方法

(pd.cut(df.set_index('Type')['Value'],bins = [-np.inf,-.25,.25,np.inf],labels = ['low','hit','high'])
.str.get_dummies()
.groupby(level=0).sum()
.reset_index())

输出:

Type  high  hit  low
0  MLP     0    1    2
1   RF     1    1    1

最新更新