Error when concatenating dask Series into a DataFrame



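I have several dask Series (dask.dataframe.core.Series) that I want to combine into a single DataFrame so that I can write it out to a CSV file. How can I do this? When I try, I get the error shown below. Any suggestions would be appreciated.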

Data

1,2014-04-07T10:51:09.277Z,214536502,0
1,2014-04-07T10:54:09.868Z,214536500,0
1,2014-04-07T10:54:46.998Z,214536506,0
1,2014-04-07T10:57:00.306Z,214577561,0
2,2014-04-07T13:56:37.614Z,214662742,0
2,2014-04-07T13:57:19.373Z,214662742,0
2,2014-04-07T13:58:37.446Z,214825110,0
2,2014-04-07T13:59:50.710Z,214757390,0
2,2014-04-07T14:00:38.247Z,214757407,0
2,2014-04-07T14:02:36.889Z,214551617,0

Code

import dask
import datetime as dt
clicksdat = dd.read_csv('C:UsersTGDownloadsyoochoose-dataFullyoochoose-clicks100.dat', names=['Sid','Timestamp','itemid','itemcategory'], dtype={'sid':np.int64,'timestamp':np.object,'itemid':np.object,'itemcategory':np.object})
clicksdat['Timestamp']=clicksdat.Timestamp.apply(pd.to_datetime)
segment = ['EM']*24
segment[7:10] = ['M']*3
segment[10:13] = ['A']*3
segment[13:18] = ['E']*5
segment[18:23] = ['N']*5
segment[23] = 'MN'
maxtemp=clicksdat.groupby('Sid')['Timestamp'].max()
mintemp=clicksdat.groupby('Sid')['Timestamp'].min()
duration=(maxtemp.sub(mintemp).apply(lambda x:  x.total_seconds() ))
day=maxtemp.apply(lambda x:  x.day )
month=maxtemp.apply(lambda x:  x.month)
noofnavigations=[clicksdat.groupby('Sid').count().Timestamp][0]
totalitems=clicksdat.groupby('Sid')['itemid'].nunique()
totalcats=clicksdat.groupby('Sid')['itemcategory'].nunique()
timesegment= maxtemp.apply(lambda x:  segment[x.hour])
segmentchange=((maxtemp.apply(lambda x:  segment[x.hour])!=mintemp.apply(lambda x:  segment[x.hour])))
purchased=(clicksdat['Sid'].unique()).apply(lambda x: x in buyersession)
print(type(maxtemp),type(mintemp),type(duration),type(day),type(month),type(noofnavigations),type(totalitems),type(totalcats),type(timesegment),type(segmentchange),type(purchased))
#percentile_list = pd.DataFrame({'purchased' : purchased,'duration':duration,'day':day,'month':month,'noofnavigations':noofnavigations,'totalitems':totalitems,'totalcats':totalcats,'timesegment':timesegment,'segmentchange':segmentchange  },index=noofnavigations.index)
percentile_list = dd.concat([purchased,duration,day,month,noofnavigations,totalitems,totalcats,timesegment,segmentchange],axis=1)                          
percentile_list.to_csv('C:UsersTGDownloadsyoochoose-dataFullyoochoose-clicks1001-727.csv')

Error

(<class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>, <class 'dask.dataframe.core.Series'>)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-121-ad7fc3cf8839> in <module>()
     25 print(type(maxtemp),type(mintemp),type(duration),type(day),type(month),type(noofnavigations),type(totalitems),type(totalcats),type(timesegment),type(segmentchange),type(purchased))
     26 #percentile_list = pd.DataFrame({'purchased' : purchased,'duration':duration,'day':day,'month':month,'noofnavigations':noofnavigations,'totalitems':totalitems,'totalcats':totalcats,'timesegment':timesegment,'segmentchange':segmentchange  },index=noofnavigations.index)
---> 27 percentile_list = dd.concat([purchased,duration,day,month,noofnavigations,totalitems,totalcats,timesegment,segmentchange],axis=1)
     28 
     29 percentile_list.to_csv('C:UsersTGDownloadsyoochoose-dataFullyoochoose-clicks1001-727.csv')
C:UsersTGAnaconda3envsdato-envlibsite-packagesdaskdataframemulti.pyc in concat(dfs, axis, join, interleave_partitions)
    576     else:
    577         if axis == 1:
--> 578              raise ValueError('Unable to concatenate DataFrame with unknown '
    579                               'division specifying axis=1')
    580         else:
ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1

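First of all, your code does not run as posted because it contains undefined references (dd, np), so I cannot reproduce your problem without spending unnecessary time on it.
But since I ran into a similar problem, I have an idea: try setting an index on the DataFrame. (In my case everything worked fine as long as there was a valid index. Using .drop_duplicates() somehow broke the index or the divisions, and I got the same error as you.)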
