我有一个大型数据集(约500万行(,其中包含机器学习训练的结果。现在我想检查一下结果是否达到";目标范围";是否。假设这个范围包含-0.25
和+0.25
之间的所有值。如果它在这个范围内,它就是Hit
,如果它在Low
以下,在High
的另一边。
我现在将创建这三列Hit、Low、High,并为每一行计算适用的条件,并将一个1
放入该列中,另外两列将成为0
。之后,我会将这些值分组并进行汇总。但我怀疑一定有更好更快的方法,比如分组时直接计算。
数据
import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
+----+--------+---------+
| | Type | Value |
|----+--------+---------|
| 0 | RF | -1.5 | <- Low
| 1 | RF | -0.1 | <- Hit
| 2 | RF | 1.7 | <- High
| 3 | MLP | 0.2 | <- Hit
| 4 | MLP | -0.7 | <- Low
| 5 | MLP | -0.6 | <- Low
+----+--------+---------+
预期输出
pd.DataFrame({"Type":["RF", "MLP"], "Low":[1,2], "Hit":[1,1], "High":[1,0]})
+----+--------+-------+-------+--------+
| | Type | Low | Hit | High |
|----+--------+-------+-------+--------|
| 0 | RF | 1 | 1 | 1 |
| 1 | MLP | 2 | 1 | 0 |
+----+--------+-------+-------+--------+
您可以使用cut
来定义组,使用pivot_table
来重塑:
(df.assign(group=pd.cut(df['Value'],
[float('-inf'), -0.25, 0.25, float('inf')],
labels=['Low', 'Hit', 'High']))
.pivot_table(index='Type', columns='group', values='Value', aggfunc='count')
.reset_index()
.rename_axis(None, axis=1)
)
或crosstab
:
(pd.crosstab(df['Type'],
pd.cut(df['Value'],
[float('-inf'), -0.25, 0.25, float('inf')],
labels=['Low', 'Hit', 'High'])
)
.reset_index().rename_axis(None, axis=1)
)
输出:
Type Low Hit High
0 MLP 2 1 0
1 RF 1 1 1
您可以先用np.select
assign
,然后用crosstab
c1 = df.Value<=-0.25
c2 = df.Value>=0.25
out = pd.crosstab(df['Type'], np.select([c1,c2], ['Low','High'], default='Hit'))
out
Out[32]:
col_0 High Hit Low
Type
MLP 0 1 2
RF 1 1 1
你可以试试这个:
# Your code
import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
# Set your range
RANGE_MIN = -0.25
RANGE_MAX = 0.25
# --- define functions to be applied to df ---
# evaluate if value is a low
def eval_low(value):
if value < RANGE_MIN:
return 1
else:
return 0
# evaluate if value is a high
def eval_high(value):
if value > RANGE_MAX:
return 1
else:
return 0
# evaluate if value is a hit
def eval_hit(value):
if value >= RANGE_MIN and value <= RANGE_MAX:
return 1
else:
return 0
# Evaluate the functions in new columns
df['Low'] = df.Value.apply(eval_low)
df['Hit'] = df.Value.apply(eval_hit)
df['High'] = df.Value.apply(eval_high)
# get the summary
df.groupby('Type').sum()
df.assign(Value=pd.cut(df.Value,[-np.inf,-0.25,0.25,np.inf],labels=['Low','Hit','High']))
.groupby('Type').value_counts().reset_index()
.pivot_table(index='Type',columns='Value',values=0)
Value Low Hit High
Type
MLP 2 1 0
RF 1 1 1
import pandas as pd
df = pd.DataFrame({"Type":["RF", "RF", "RF", "MLP", "MLP", "MLP"], "Value":[-1.5,-0.1,1.7,0.2,-0.7,-0.6]})
df ['target'] = df.agg({'Value' : lambda x: 'hit' if abs(x)<0.25 else 'low' if x < -0.25 else 'high'})
df['Value'] = 1
r = df.groupby(['Type', 'target']).count().unstack().fillna(0)
r.columns = list(r.columns.droplevel(0))
print(r)
结果
high hit low
Type
MLP 0.0 1.0 2.0
RF 1.0 1.0 1.0
以下是使用pd.cut()
和.str.get_dummies()
的方法
(pd.cut(df.set_index('Type')['Value'],bins = [-np.inf,-.25,.25,np.inf],labels = ['low','hit','high'])
.str.get_dummies()
.groupby(level=0).sum()
.reset_index())
输出:
Type high hit low
0 MLP 0 1 2
1 RF 1 1 1