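My question is similar to this thread: creating dummies from a column with multiple values in pandas.
Goal: I want to produce a result like the one below, but using dask.
In pandas: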
import pandas as pd
df = pd.DataFrame({'fruit': ['Banana, , Apple, Dragon Fruit,,,', 'Kiwi,', 'Lemon, Apple, Banana', ',']})
df['fruit'].str.get_dummies(sep=',')
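It outputs the following (the first column's name is a single space, and several names keep the leading space left over from the split):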
       Apple   Banana   Dragon Fruit  Banana  Kiwi  Lemon
0  1       1        0              1       1     0      0
1  0       0        0              0       0     1      0
2  0       1        1              0       0     0      1
3  0       0        0              0       0     0      0
The get_dummies() above is a method of <pandas.core.strings.StringMethods>.
The problem is that the dask equivalent, <dask.dataframe.accessor.StringAccessor>, has no get_dummies().
How can I solve this with dask?
Apparently this is not possible in dask, because the output columns are not known beforehand. See https://github.com/dask/dask/issues/4403.
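To illustrate the concern raised in this comment, here is a minimal pandas-only sketch. It treats two slices of a small, made-up Series as stand-ins for dask partitions (the data and the names `part_a` and `part_b` are assumptions for illustration) and shows that each slice, encoded on its own, yields a different set of columns.

```python
import pandas as pd

# Hypothetical data; the two slices play the role of two dask partitions.
s = pd.Series(["Apple,Banana", "Kiwi", "Lemon,Apple", "Banana"])
part_a, part_b = s.iloc[:2], s.iloc[2:]

# Each slice only sees its own values, so the dummy columns differ.
cols_a = list(part_a.str.get_dummies(sep=",").columns)
cols_b = list(part_b.str.get_dummies(sep=",").columns)

print(cols_a)  # ['Apple', 'Banana', 'Kiwi']
print(cols_b)  # ['Apple', 'Banana', 'Lemon']
```

Because no single partition sees every value, the full column set has to be established some other way before encoding.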
It is possible, @Santosh_Kumar_Janumahanti.
And, @anwari, there is a dask equivalent of .get_dummies.
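Dask is lazy about computation, so it does not know all of the unique values in a column until that column has been computed. For this reason, dask requires the column to be categorized before one-hot encoding it (i.e. before calling dd.get_dummies).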
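str.split must be run first; .get_dummies can then be applied to the new columns, and the newly encoded columns can be joined back onto the original DataFrame.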
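The three steps can be sketched in plain pandas first, which is roughly what happens inside one partition. This is an illustrative simplification on made-up data, not the dask solution itself; merging the per-position dummies with a grouped `max` is one possible choice.

```python
import pandas as pd

# Hypothetical data without stray spaces, to keep the sketch small.
df = pd.DataFrame({"fruit": ["Lemon,Apple", "Kiwi", "Apple,Kiwi"]})

# Step 1: split into positional columns (0, 1, ...).
parts = df["fruit"].str.split(",", expand=True)

# Step 2: one-hot encode each positional column under a shared prefix.
encoded = [
    pd.get_dummies(parts[c], prefix="fruit").astype(int) for c in parts.columns
]

# Step 3: merge columns with the same name; a fruit is present in a row
# if any position holds it.
dummies = pd.concat(encoded, axis=1).T.groupby(level=0).max().T

# Join the encoded columns back onto the original frame.
result = df.join(dummies)
print(result)
```

In dask, the same flow additionally needs the categorization step so that the set of dummy columns is known up front.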
This is how I solved it:
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'fruit': ['Banana, , Apple, Dragon Fruit,,,', 'Kiwi,', 'Lemon, Apple, Banana', ',']})
df = dd.from_pandas(df, npartitions=2)  # I made sure to choose a partition count lower than my row count.
col_name_to_split = 'fruit'

def split_col(df: pd.DataFrame, col: str) -> pd.DataFrame:
    tmp_df = df[col].str.split(',', expand=True)
    df = df.drop(columns=[col])
    # Double underscore, to tell these columns apart when dropping them later
    # and to avoid mixing them up with the encoded integers.
    tmp_df.columns = [f'{col}__{x}' for x in tmp_df.columns]
    # Dask will not know how many splits result from mapping this function
    # (it can only see the values within the partition it is currently iterating over),
    # so you must inform dask in advance (similar to supplying `meta`).
    anticipated_splits = 7
    for col1 in [f'{col}__{num}' for num in range(anticipated_splits)]:
        # Fill in with NaN if this partition happens not to have all the newly split columns.
        tmp_df[col1] = tmp_df.get(col1, float('nan'))
    df = df.join(tmp_df)
    return df

df = df.map_partitions(split_col, col_name_to_split)
split_cols = [col for col in df.columns if f"{col_name_to_split}__" in col]
# This will trigger all queued computations so that dask will know how many dummy columns to make.
df = df.categorize(split_cols)
tmp_dfs = {}
for col in split_cols:
    tmp_df = dd.get_dummies(df[col], prefix=col_name_to_split)
    # Turn falsy values into NaN so that `.combine_first` doesn't overwrite
    # existing `True`s with incoming `False`s.
    # Caveat: the one-hot columns render as `1.0` in some columns and as `True` in others.
    # Dunno why, but these values are treated as equivalent.
    tmp_df = tmp_df.map(lambda v: float('nan') if not v else v)
    df = df.combine_first(tmp_df)
    tmp_dfs[col] = tmp_df
df = df.drop(columns=split_cols)
Result:
   fruit_  fruit_   fruit_ Apple  fruit_ Banana  fruit_ Dragon Fruit  fruit_Banana
0    True     True          True            NaN                 True          True
1    True      NaN           NaN            NaN                  NaN           NaN
2     NaN      NaN          True           True                  NaN           NaN
3    True      NaN           NaN            NaN                  NaN           NaN

   fruit_Kiwi  fruit_Lemon
0         NaN          NaN
1        True          NaN
2         NaN         True
3         NaN          NaN
(This is nearly identical to my answer here.)