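I want to extract 50 rows from a dask dataframe, but I can't. Ultimately, I want to build a new dataframe that has 50 rows for each class.
When I run this code,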
from tqdm import tqdm
import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    print(len(tmpdf))
the result is
1048
359
182
149
94
57
78
157
.
.
.
So every tmpdf should have more than 50 rows. But when I run this code,
from tqdm import tqdm
import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf[:50]
    print(len(tmpdf))
the result is
1
1
1
1
1
.
.
.
I thought the index might be the problem, so I tried resetting it:
from tqdm import tqdm
import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf.reset_index()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))
But the result is
1048
359
182
149
94
57
78
.
.
.
What is going on?
I also tried .compute().
I ran this code:
from tqdm import tqdm
import dask.dataframe as dd
ddf = dd.from_pandas(train_csv, npartitions=30)
classes = train_csv.landmark_id.unique()
for cl in tqdm(classes):
    tmpdf = ddf.loc[ddf["landmark_id"] == cl]
    tmpdf = tmpdf.compute()
    tmpdf = tmpdf[:50]
    print(len(tmpdf))
Now I get the correct result,
50
50
50
50
50
.
.
.
But the execution takes far too long. The whole reason I started using dask was speed...
This line
for cl in tqdm(classes):
gives me an error:
0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "....py", line ...., in <module>
    for cl in tqdm(classes):
  File "...\tqdm\_tqdm.py", line 1000, in __iter__
    for obj in iterable:
  File "...\dask\dataframe\core.py", line 2046, in __getitem__
    raise NotImplementedError()
NotImplementedError
So I'm not sure how your code manages to print integers inside the loop.
Anyway, if you print classes, you will see that it is a lazy object (a dask Series):
print(classes)
Dask Series Structure:
npartitions=1
object
...
Name: landmark_id, dtype: object
Dask Name: unique-agg, xx tasks
So, IIUC, you need to compute classes before the loop. Use
for cl in tqdm(classes.compute()):
or
for cl in classes.compute():