How to extract 50 rows from a Dask DataFrame



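I want to extract 50 rows from a dask DataFrame, but I can't get it to work. Ultimately, I want to build a new DataFrame with 50 rows per class.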

When I run this code,

import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    print(len(tmpdf))

the result is

1048
359
182
149
94
57
78
157
.
.
.

So every tmpdf has more than 50 rows. But when I run this code,

import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf[:50]
    print(len(tmpdf))

the result is

1
1
1
1
1
.
.
.

I thought the index might be the problem, so I reset it:

import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf.reset_index()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))

But the result is

1048
359
182
149
94
57
78
.
.
.

What is happening?

I also tried .compute(). I ran this code:

import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf.compute()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))

Now I get the correct result,

50
50
50
50
50
.
.
.

But the execution time is too long. The reason I used dask in the first place was speed...

This line, for cl in tqdm(classes):, gives me an error:

  0%|          | 0/5 [00:00<?, ?it/s]Traceback (most recent call last):
  File "....py", line ...., in <module>
    for cl in tqdm(classes):
  File "...tqdm_tqdm.py", line 1000, in __iter__
    for obj in iterable:
  File "...daskdataframecore.py", line 2046, in __getitem__
    raise NotImplementedError()
NotImplementedError

So I'm not sure how your code manages to print integers in the loop.

Anyway, if you print classes, you will see that it is a delayed object (a dask Series):

print(classes)
Dask Series Structure:
npartitions=1
    object
       ...
Name: landmark_id, dtype: object
Dask Name: unique-agg, xx tasks

So, IIUC, you need to compute classes before the loop. Use one of the following:

for cl in tqdm(classes.compute()):

for cl in classes.compute():
