Dask warning asks me to provide explicit output types



I am doing the following with Dask.

import dask.dataframe as dd
import pandas as pd

salary_df = pd.DataFrame({"Salary":[10000, 50000, 25000, 30000, 7000]})
salary_category = pd.DataFrame({"Hi": [5000, 20000, 25000, 30000, 90000],
                                "Low": [0, 5001, 20001, 25001, 30001],
                                "category": ["Very Poor", "Poor", "Medium", "Rich", "Super Rich"]
                                })
sal_ddf = dd.from_pandas(salary_df, npartitions=10)
salary_category.index = pd.IntervalIndex.from_arrays(salary_category['Low'],salary_category['Hi'],closed='both')
sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])

I do get the result, but the following line produces a warning:

sal_ddf['Category'] = sal_ddf['Salary'].apply(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'])
You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
Before: .apply(func)
After:  .apply(func, meta=('Salary', 'object'))

What am I missing here?

The missing keyword argument here is meta. Dask includes an automatic suggestion in the warning message:

After:  .apply(func, meta=('Salary', 'object'))

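Since this is only a warning, specifying meta is optional for many use cases. Without it, Dask runs your function on a small sample to guess the output type; passing meta skips that guess and lets you state the dtype of the computed result explicitly.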

Running the following snippet should not produce the warning:

# extracted your code into `func` for readability only
func = lambda x: salary_category.iloc[salary_category.index.get_loc(x)]['category']
sal_ddf['Category'] = sal_ddf['Salary'].apply(func, meta=('Salary', 'object'))

For more details, this link may be useful: meta.
