I am doing the following with Dask.
import dask.dataframe as dd
import pandas as pd
salary_df = pd.DataFrame({"Salary":[10000, 50000, 25000, 30000, 7000]})
salary_category = pd.DataFrame({
    "Hi": [5000, 20000, 25000, 30000, 90000],
    "Low": [0, 5001, 20001, 25001, 30001],
    "category": ["Very Poor", "Poor", "Medium", "Rich", "Super Rich"],
})
sal_ddf = dd.from_pandas(salary_df, npartitions=10)
salary_category.index = pd.IntervalIndex.from_arrays(salary_category['Low'],salary_category['Hi'],closed='both')
sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])
I do get the result, but the following line produces a warning:
sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
Before: .apply(func)
After: .apply(func, meta=('Salary', 'object'))
What am I missing here?
The keyword argument missing here is meta. Dask generates a suggestion automatically (it is in the warning message):
After: .apply(func, meta=('Salary', 'object'))
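Since this is only a warning, specifying meta is optional for many use cases: without it, Dask runs your function on a small sample to guess the output type. Passing meta is still useful if you want to state the name and dtype of the computed result explicitly rather than rely on that guess.
Running the following snippet should not produce the warning: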
# extracted your code into `func` for readability only
func = lambda x: salary_category.iloc[salary_category.index.get_loc(x)]['category']
sal_ddf['Category'] = sal_ddf['Salary'].apply(func, meta=('Salary', 'object'))
For more details, this link may be useful: meta.