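I need to add a column to a Dask DataFrame that contains an auto-incrementing ID. I know how to do this in Pandas, since I found Pandas solutions on SO, but I can't work out how to do it in Dask. My best attempt is below. It turns out the autoincrement function ran only twice for my 100-row test file, and every ID is 2.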
def autoincrement(self):
    print('*')
    self.report_line = self.report_line + 1
    return self.report_line

self.df = self.df.map_partitions(
    lambda df: df.assign(raw_report_line=self.autoincrement())
)
The Pandas way looks something like this:
df.insert(0, 'New_ID', range(1, 1 + len(df)))
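Alternatively, it would be great if I could get the row number of each CSV line and add that to a column, but at this stage that doesn't seem easy to do.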
You can assign a dummy column of all ones and take its cumulative sum:
In [1]: import dask.datasets
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: df = dask.datasets.timeseries()
In [5]: df
Out[5]:
Dask DataFrame Structure:
                   id    name        x        y
npartitions=30
2000-01-01      int64  object  float64  float64
2000-01-02        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-01-30        ...     ...      ...      ...
2000-01-31        ...     ...      ...      ...
Dask Name: make-timeseries, 30 tasks
In [6]: df['row_number'] = df.assign(partition_count=1).partition_count.cumsum()
In [7]: df.compute()
Out[7]:
                       id      name         x         y  row_number
timestamp
2000-01-01 00:00:00   928     Sarah -0.597784  0.160908           1
2000-01-01 00:00:01  1000     Zelda -0.034756 -0.073912           2
2000-01-01 00:00:02  1028  Patricia -0.962331 -0.458834           3
2000-01-01 00:00:03  1010    Hannah -0.225759 -0.227945           4
2000-01-01 00:00:04   958   Charlie  0.223131 -0.672307           5
...                   ...       ...       ...       ...         ...
2000-01-30 23:59:55  1052     Jerry -0.636159  0.683076     2591996
2000-01-30 23:59:56   973     Quinn -0.575324  0.272144     2591997
2000-01-30 23:59:57  1049     Jerry  0.143286 -0.122490     2591998
2000-01-30 23:59:58   971    Victor -0.866174  0.751534     2591999
2000-01-30 23:59:59   966     Edith -0.718382 -0.333261     2592000
[2592000 rows x 5 columns]