I have this DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Client':np.random.choice(['Customer_A', 'Customer_B'], 1000),
'Product':np.random.choice( ['Guns', 'Ammo', 'Armour'], 1000),
'Value':(np.random.randn(1000))
})
Categoricals = ['Client', 'Product']
df[Categoricals] = df[Categoricals].astype('category')
df = df.drop_duplicates()
df
I want this result:
# Non-anonymous function for Anomaly limit
def Anomaly(x):
    Q3 = np.nanpercentile(x, q = 75)
    Q1 = np.nanpercentile(x, q = 25)
    IQR = (Q3 - Q1)
    return (Q3 + (IQR * 2.0))
# Non-anonymous function for CriticalAnomaly limit
def CriticalAnomaly(x):
    Q3 = np.nanpercentile(x, q = 75)
    Q1 = np.nanpercentile(x, q = 25)
    IQR = (Q3 - Q1)
    return (Q3 + (IQR * 3.0))
# Define metrics
Metrics = {'Value':['count', Anomaly, CriticalAnomaly]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
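But this is slow on large datasets, because the functions Anomaly and CriticalAnomaly each recompute Q1, Q3 and IQR, so those values are calculated twice instead of once. Combining the two functions makes it faster, but the result is then output to 1 column instead of 2.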
# Combined anomaly functions
def CombinedAnom(x):
    Q3 = np.nanpercentile(x, q = 75)
    Q1 = np.nanpercentile(x, q = 25)
    IQR = (Q3 - Q1)
    Anomaly = (Q3 + (IQR * 2.0))
    CriticalAnomaly = (Q3 + (IQR * 3.0))
    return (Anomaly, CriticalAnomaly)
# Define metrics
Metrics = {'Value':['count', CombinedAnom]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
How can I write a combined function so that the result is split into two columns?
If you use apply instead of agg, you can return a Series, which gets unpacked into columns:
def f(g):
    return pd.Series({
        'c1': np.sum(g.b),
        'c2': np.prod(g.b)
    })
df = pd.DataFrame({'a': list('aabbcc'), 'b': [1,2,3,4,5,6]})
df.groupby('a').apply(f)
This goes from:
   a  b
0  a  1
1  a  2
2  b  3
3  b  4
4  c  5
5  c  6
to:
   c1  c2
a
a   3   2
b   7  12
c  11  30
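Applied to the limits from the question, the same idea might look like the sketch below. The names combined_anom, demo and limits are made up for illustration, and the small demo frame stands in for the random data so the numbers are easy to check by hand. It selects the 'Value' column before calling apply, so the function receives a Series per group, and then uses unstack() to turn the returned labels into columns.

```python
import numpy as np
import pandas as pd

# Hypothetical combined function: computes the quartiles once per group
# and returns both limits (plus the count) as a labelled Series.
def combined_anom(x):
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return pd.Series({
        'count': x.count(),
        'Anomaly': q3 + 2.0 * iqr,
        'CriticalAnomaly': q3 + 3.0 * iqr,
    })

# Small illustrative frame (not the random data from the question)
demo = pd.DataFrame({
    'Client': ['Customer_A'] * 5 + ['Customer_B'] * 5,
    'Product': ['Guns'] * 5 + ['Ammo'] * 5,
    'Value': [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# apply returns a Series indexed by (Client, Product, label);
# unstack() moves the innermost level (the labels) into columns.
limits = (
    demo.groupby(['Client', 'Product'])['Value']
        .apply(combined_anom)
        .unstack()
)
print(limits)
```

Because everything is returned in one Series, count comes back as a float here; cast it with astype(int) if that matters.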