What pandas data types are passed to transform or apply in a groupby



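While trying to debug a function applied through groupby, someone suggested I use a dummy function to "see what is being passed" into the function for each group. Of course, I'm game: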

import numpy as np
import pandas as pd

np.random.seed(0)  # so we can all play along at home
categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))
df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})

def f(x):
    # dummy function: report the type it receives, return the input unchanged
    print(type(x))
    return x

print('single column transform')
df.groupby(['category'])['data_1'].transform(f)
print('\n')
print('single column (nested) transform')
df.groupby(['category'])[['data_1']].transform(f)
print('\n')
print('multiple column transform')
df.groupby(['category'])[['data_1', 'data_2']].transform(f)
print('\n')
print('\n')
print('single column apply')
df.groupby(['category'])['data_1'].apply(f)
print('\n')
print('single column (nested) apply')
df.groupby(['category'])[['data_1']].apply(f)
print('\n')
print('multiple column apply')
df.groupby(['category'])[['data_1', 'data_2']].apply(f)

This puts the following on my stdout:

single column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

single column (nested) transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


single column apply
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

single column (nested) apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

multiple column apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

So it looks like:

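  • transform
    • single column: 3 Series
    • single column (nested): 2 Series and 3 DataFrames
    • multiple column: 3 Series and 3 DataFrames
  • apply
    • single column: 3 Series
    • single column (nested): 4 DataFrames
    • multiple column: 4 DataFrames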
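
The tally above can also be collected programmatically instead of being read off stdout. The following is only a sketch: `tally_types` and `probe` are hypothetical helper names, not part of pandas, and the exact counts you get depend on the pandas version, since the number of internal probing calls has changed over time.

```python
from collections import Counter

import numpy as np
import pandas as pd


def tally_types(grouped, method):
    """Count the type names of the objects handed to a probe function.

    `grouped` is a GroupBy object and `method` is "transform" or "apply".
    This helper is hypothetical, written only for this illustration.
    """
    counts = Counter()

    def probe(x):
        # Record what we were given, then return the input unchanged.
        counts[type(x).__name__] += 1
        return x

    getattr(grouped, method)(probe)
    return counts


np.random.seed(0)
df = pd.DataFrame({
    "category": list("abc") * 4,
    "data_1": np.random.randn(12),
    "data_2": np.random.randn(12),
})
grouped = df.groupby("category")

for method in ("transform", "apply"):
    print(method, "single column:", dict(tally_types(grouped["data_1"], method)))
    print(method, "multiple column:",
          dict(tally_types(grouped[["data_1", "data_2"]], method)))
```

Running this against your own installation shows directly whether the Series and DataFrame counts match the ones listed above.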

What's going on here? Can anyone explain why each of these 6 calls results in the above sequence of objects being passed to the given function?

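GroupBy.transform tries both a fast_path and a slow_path for your function.

  • fast_path: calls the function with the DataFrame object itself
  • slow_path: calls the function through DataFrame.apply, so it receives one Series per column

When the result of fast_path is the same as that of slow_path, it chooses fast_path.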
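
As a rough illustration of that selection, here is a minimal sketch. It is not the pandas source: `choose_path` and `record` are hypothetical names, and the comparison via `DataFrame.equals` is an assumption standing in for whatever check pandas performs internally.

```python
import pandas as pd


def choose_path(func, group):
    """Hypothetical sketch: try both paths on one group and pick one."""
    # slow path: DataFrame.apply hands func one column (a Series) at a time
    slow = group.apply(func)
    try:
        # fast path: func is called once with the whole group DataFrame
        fast = func(group)
    except Exception:
        return "slow_path", slow
    # keep the fast path only if it reproduces the slow-path result
    if isinstance(fast, pd.DataFrame) and fast.equals(slow):
        return "fast_path", fast
    return "slow_path", slow


seen = []


def record(x):
    # remember the type of every object passed in, return it unchanged
    seen.append(type(x).__name__)
    return x


group = pd.DataFrame({"data_1": [1.0, 2.0], "data_2": [3.0, 4.0]})
path, result = choose_path(record, group)
print(path)
print(seen)
```

In this sketch the probe sees Series objects first (from the column-wise attempt) and a DataFrame afterwards (from the whole-frame attempt), which is the same ordering that appears in the question's transform output.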

The following output shows that it ended up choosing fast_path:

multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

Here is the link to the code:

https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2277

Edit

To inspect the call stack, do the following:

import inspect
import itertools
import traceback

import numpy as np
import pandas as pd

np.random.seed(0)  # so we can all play along at home
categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))
df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})

def f(x):
    # keep only the stack entries from the line marked "#stop here" onward
    stack = itertools.dropwhile(lambda line: "#stop here" not in line,
                                traceback.format_stack(inspect.currentframe().f_back))
    print("*" * 20)
    print(x)
    print(type(x))
    print()
    print("\n".join(stack))
    return x

df.groupby(['category'])[['data_1', 'data_2']].transform(f)  #stop here
