Using the Pandas cut function in Dask



How can I use pd.cut() in Dask? The dataset is large, so I cannot load all of it into memory before running pd.cut().

The current code works in Pandas but needs to be converted to Dask:

import pandas as pd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
# Group by name and add the sum of amounts and the count of grouped rows
df = (df.groupby('name')['amount'].agg(['sum', 'count']).reset_index().sort_values(by='name', ascending=True))
print(df.head(15))
# Group by bins and total the sum and count columns within each bin
# (a list of columns needs double brackets; the tuple form fails on pandas 2.x)
df = df.groupby(pd.cut(df['name'],
                       bins=[0, 4, 8, 100],
                       labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
print(df.head(15))

Output:

       name  sum  count
0  namebin1    5      3
1  namebin2    9      2
2  namebin3    8      1

I tried:

import pandas as pd
import dask.dataframe as dd
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
print(df.head(15))
df = df.groupby(df.map_partitions(pd.cut,
                                  df['name'],
                                  bins=[0, 4, 8, 100],
                                  labels=['namebin1', 'namebin2', 'namebin3']))['sum', 'count'].sum().reset_index()
print(df.head(15))

This gives the error: TypeError("cut() got multiple values for argument 'bins'",)

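You see this error because map_partitions calls pd.cut() with each partition as the first argument, which pd.cut() does not expect (see the documentation). The partition fills x, df['name'] then lands positionally in bins, and the bins= keyword collides with it.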
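The same collision can be reproduced without Dask. The snippet below is only a sketch of roughly what happens for one partition in the failing attempt; the small `partition` frame is made up for illustration and is not taken from the question's data.

```python
import pandas as pd

# Hypothetical stand-in for a single Dask partition (illustrative values).
partition = pd.DataFrame({"name": [1, 5, 10], "sum": [5, 9, 8], "count": [3, 2, 1]})

message = None
try:
    # The partition is bound to `x`, partition["name"] is bound positionally
    # to `bins`, and the keyword bins=... then supplies `bins` a second time.
    pd.cut(partition, partition["name"],
           bins=[0, 4, 8, 100],
           labels=["namebin1", "namebin2", "namebin3"])
except TypeError as exc:
    message = str(exc)

print(message)
```

Python rejects the call while binding arguments, before pd.cut() does any work, which is why the message names 'bins' rather than the data.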

You can wrap it in a custom function and call that instead, like this:

import pandas as pd
import dask.dataframe as dd
def custom_cut(partition, bins, labels):
    # map_partitions passes each partition (a pandas DataFrame) as the first argument
    result = pd.cut(x=partition["name"], bins=bins, labels=labels)
    return result
d = {'name': [1, 5, 1, 10, 5, 1], 'amount': [1, 5, 3, 8, 4, 1]}
df = pd.DataFrame(data=d)
df = dd.from_pandas(df, npartitions=2)
df = df.groupby('name')['amount'].agg(['sum', 'count']).reset_index()
df = df.groupby(df.map_partitions(custom_cut,
                                  bins=[0, 4, 8, 100],
                                  labels=['namebin1', 'namebin2', 'namebin3']))[['sum', 'count']].sum().reset_index()
df.compute()
name        sum    count
namebin1    5      3
namebin2    9      2
namebin3    8      1
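
Cutting one partition at a time is safe here because the bin edges are fixed, so the bin assigned to a row does not depend on any other partition. The following is a pandas-only sketch of that idea, not Dask's actual implementation: the two hand-made chunks stand in for partitions, and `summarize_chunk` is a hypothetical helper name.

```python
import pandas as pd

BINS = [0, 4, 8, 100]
LABELS = ["namebin1", "namebin2", "namebin3"]


def summarize_chunk(chunk):
    """Bin one chunk by 'name' and return the per-bin partial sum and count."""
    binned = pd.cut(chunk["name"], bins=BINS, labels=LABELS).rename("bin")
    return chunk.groupby(binned, observed=True)["amount"].agg(["sum", "count"])


df = pd.DataFrame({"name": [1, 5, 1, 10, 5, 1], "amount": [1, 5, 3, 8, 4, 1]})

# Two hand-made chunks stand in for Dask partitions (an assumption for illustration).
chunks = [df.iloc[:3], df.iloc[3:]]
partials = pd.concat([summarize_chunk(c) for c in chunks])

# Partial sums and counts are additive, so adding them up per bin gives the totals.
combined = partials.groupby(level=0, observed=True).sum()

# Reference result computed on the whole frame at once.
whole = summarize_chunk(df)

print(combined)
```

In this sketch the combined per-chunk result matches the whole-frame result, and only one chunk needs to be in memory at a time.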
