Take all of the following as an example.
I have the following pandas DataFrame:
name col1 col2 col3 col4 col5 col6 col7
---- ---- ---- ---- ---- ---- ---- ----
doc1 100 1 1 1 1 1 1
doc1 200 2 2 2 2 2 2
doc1 300 3 3 3 3 3 3
doc2 100 11 11 11 11 11 11
doc2 200 21 21 21 21 21 21
doc2 300 31 31 31 31 31 31
doc2 300 31 31 31 31 31 31
doc3 100 12 12 12 12 12 12
doc3 100 12 12 12 12 12 12
doc3 200 22 22 22 22 22 22
doc3 300 32 32 32 32 32 32
The name column is used to group (aggregate) the data.
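For anyone who wants to reproduce this, the frame can be rebuilt roughly as follows. This is only a sketch: the helper name `rows` is made up for illustration, and it relies on col2 through col7 holding the same value in every row of the table above.

```python
import pandas as pd

# Rows copied from the table above as (name, col1, value repeated in col2..col7).
# `rows` is a hypothetical helper name, not part of the original question.
rows = [
    ("doc1", 100, 1), ("doc1", 200, 2), ("doc1", 300, 3),
    ("doc2", 100, 11), ("doc2", 200, 21), ("doc2", 300, 31), ("doc2", 300, 31),
    ("doc3", 100, 12), ("doc3", 100, 12), ("doc3", 200, 22), ("doc3", 300, 32),
]
columns = ["name"] + [f"col{i}" for i in range(1, 8)]

# Expand each tuple into a full 8-field record: name, col1, then six copies of the value.
df = pd.DataFrame([(name, c1, *[v] * 6) for name, c1, v in rows], columns=columns)
print(df.shape)  # (11, 8)
```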
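Now I need to convert all the data in all the colX columns for a given docX into an array,
so that I end up with one array per document.
But each object (each individual array) must have 5 rows, so any document with fewer than 5 rows should be padded with zeros.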
For the example above, I would expect the following:
data = [
[
[100, 1, 1, 1, 1, 1, 1],
[200, 2, 2, 2, 2, 2, 2],
[300, 3, 3, 3, 3, 3, 3],
[ 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0]
],
[
[100, 11, 11, 11, 11, 11, 11],
[200, 21, 21, 21, 21, 21, 21],
[300, 31, 31, 31, 31, 31, 31],
[300, 31, 31, 31, 31, 31, 31],
[ 0, 0, 0, 0, 0, 0, 0]
],
[
[100, 12, 12, 12, 12, 12, 12],
[100, 12, 12, 12, 12, 12, 12],
[200, 22, 22, 22, 22, 22, 22],
[300, 32, 32, 32, 32, 32, 32],
[ 0, 0, 0, 0, 0, 0, 0]
]
]
data.shape == (3, 5, 7)
How can I do this in a smart way?
I'm not sure about "smart", but you can try pivot_table combined with reindex:
tmp = (df.assign(row=df.groupby('name').cumcount())
.pivot_table(index=['row'],columns=['name'],fill_value=0)
.reindex(np.arange(5), fill_value=0).T
.unstack(level=0).to_numpy()
)
# tmp has one row per name, so len(tmp) is the number of documents
out = tmp.reshape(len(tmp), 5, -1)
Output:
array([[[100, 1, 1, 1, 1, 1, 1],
[200, 2, 2, 2, 2, 2, 2],
[300, 3, 3, 3, 3, 3, 3],
[ 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0]],
[[100, 11, 11, 11, 11, 11, 11],
[200, 21, 21, 21, 21, 21, 21],
[300, 31, 31, 31, 31, 31, 31],
[300, 31, 31, 31, 31, 31, 31],
[ 0, 0, 0, 0, 0, 0, 0]],
[[100, 12, 12, 12, 12, 12, 12],
[100, 12, 12, 12, 12, 12, 12],
[200, 22, 22, 22, 22, 22, 22],
[300, 32, 32, 32, 32, 32, 32],
[ 0, 0, 0, 0, 0, 0, 0]]])
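The `row` helper column in the pivot comes from `groupby('name').cumcount()`, which numbers the rows inside each group starting at 0. That per-group counter is what lets every document line up on the same 0..4 row index before reindexing. A tiny illustration on made-up data (the frame `names` is hypothetical):

```python
import pandas as pd

# Hypothetical mini-frame: two rows for doc1, three for doc2.
names = pd.DataFrame({"name": ["doc1", "doc1", "doc2", "doc2", "doc2"]})

# cumcount() restarts at 0 for every group, in order of appearance.
row = names.groupby("name").cumcount()
print(row.tolist())  # [0, 1, 0, 1, 2]
```

Note that pivot_table aggregates with the mean by default, so this only reproduces the data unchanged because each (row, name) pair occurs exactly once.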
Get the positions where the name column changes (from doc1 to doc2 to doc3). These will be used to split the dataframe:
import pandas as pd
import numpy as np
split = df.index[~df.name.eq(df.name.shift())][1:]
split
Int64Index([3, 7], dtype='int64')
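To see why the shift comparison finds the boundaries, here is a small sketch on a made-up Series (`names` is hypothetical). It assumes a default 0..n-1 index, because the resulting labels are later used as positions by numpy.split, and it assumes rows of the same document are contiguous.

```python
import pandas as pd

names = pd.Series(["doc1", "doc1", "doc1", "doc2", "doc2", "doc3"])

# True wherever a value differs from the one before it. The first row is
# always True, because shift() puts a missing value in front of it.
is_start = ~names.eq(names.shift())

starts = names.index[is_start]
boundaries = starts[1:]  # drop position 0, which is not a split point

print(list(starts))      # [0, 3, 5]
print(list(boundaries))  # [3, 5]
```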
Use numpy.split to split the dataframe's values at the positions stored in the split variable:
df_split = np.split(df.iloc[:, 1:].to_numpy(), split)
df_split
[array([[100, 1, 1, 1, 1, 1, 1],
[200, 2, 2, 2, 2, 2, 2],
[300, 3, 3, 3, 3, 3, 3]]),
array([[100, 11, 11, 11, 11, 11, 11],
[200, 21, 21, 21, 21, 21, 21],
[300, 31, 31, 31, 31, 31, 31],
[300, 31, 31, 31, 31, 31, 31]]),
array([[100, 12, 12, 12, 12, 12, 12],
[100, 12, 12, 12, 12, 12, 12],
[200, 22, 22, 22, 22, 22, 22],
[300, 32, 32, 32, 32, 32, 32]])]
Get the length of each individual array; value_counts or a list comprehension both work well:
split = [len(arr) for arr in df_split]
split
[3, 4, 4]
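For the value_counts route, one possible sketch is below. The Series `names` is a stand-in for df['name'], and the reindex step is my own precaution: value_counts sorts by frequency by default, so the counts are put back in order of first appearance to match the order of the split arrays.

```python
import pandas as pd

# Stand-in for df['name'] from the example above.
names = pd.Series(["doc1"] * 3 + ["doc2"] * 4 + ["doc3"] * 4, name="name")

# Count rows per document, then restore order of first appearance.
counts = names.value_counts().reindex(names.unique())
lengths = counts.tolist()
print(lengths)  # [3, 4, 4]
```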
Create an array of zeros with the target shape:
zeros = np.zeros((3, 5, 7))
Finally, fill the zeros with the values from df_split:
zeros[0, : split[0]] = df_split[0]
zeros[1, : split[1]] = df_split[1]
zeros[2, : split[2]] = df_split[2]
zeros
array([[[100., 1., 1., 1., 1., 1., 1.],
[200., 2., 2., 2., 2., 2., 2.],
[300., 3., 3., 3., 3., 3., 3.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.]],
[[100., 11., 11., 11., 11., 11., 11.],
[200., 21., 21., 21., 21., 21., 21.],
[300., 31., 31., 31., 31., 31., 31.],
[300., 31., 31., 31., 31., 31., 31.],
[ 0., 0., 0., 0., 0., 0., 0.]],
[[100., 12., 12., 12., 12., 12., 12.],
[100., 12., 12., 12., 12., 12., 12.],
[200., 22., 22., 22., 22., 22., 22.],
[300., 32., 32., 32., 32., 32., 32.],
[ 0., 0., 0., 0., 0., 0., 0.]]])
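The three hard-coded assignments can be generalized with a loop. The function below is only a sketch under a few assumptions: `pad_groups` and its parameters are names I made up, all value columns share one numeric dtype, and a group longer than n_rows is treated as an error rather than truncated. It uses groupby with sort=False instead of the shift trick, so it also copes with rows of one document that are not contiguous, and it keeps the integer dtype instead of producing floats.

```python
import numpy as np
import pandas as pd

def pad_groups(frame, key="name", n_rows=5):
    """Stack each group's value rows into one zero-padded 3-D array."""
    value_cols = [c for c in frame.columns if c != key]
    # sort=False keeps the groups in order of first appearance.
    groups = [g[value_cols].to_numpy() for _, g in frame.groupby(key, sort=False)]
    dtype = frame[value_cols].to_numpy().dtype
    out = np.zeros((len(groups), n_rows, len(value_cols)), dtype=dtype)
    for i, block in enumerate(groups):
        if len(block) > n_rows:
            raise ValueError(f"group {i} has {len(block)} rows, more than {n_rows}")
        out[i, : len(block)] = block  # remaining rows stay zero
    return out

# Small made-up frame to show the behaviour.
demo = pd.DataFrame({"name": ["a", "a", "b"], "x": [1, 2, 3], "y": [4, 5, 6]})
result = pad_groups(demo, n_rows=3)
print(result.shape)  # (2, 3, 2)
```

With the dataframe from the question, pad_groups(df) should give an array of shape (3, 5, 7).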