Efficiently creating a sparse pivot table in pandas

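I am working on converting a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function in pandas, but the result ends up being quite large. Does pandas support pivoting into a sparse format? I know I can pivot it and then convert the result to some kind of sparse representation, but that is not as elegant as I would like. My end goal is to use it as input to a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

Edit: here is an example of a non-sparse pivot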

import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]
frame
  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1
frame.pivot(index='person', columns='thing')
        count            
thing       a   b   c   d
person                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This creates a matrix that can hold every possible combination of person and thing, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

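Sparse matrices take up less space because they can leave values such as NaN or 0 implicit. If I have a very large dataset, this pivot function produces a matrix that should be sparse, given the large number of NaNs or 0s. I was hoping to save a lot of space/memory by generating something sparse right away, rather than creating a dense matrix and then converting it to sparse.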
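To make the memory argument concrete, here is a rough sketch that compares the storage of a dense array with a CSR matrix built directly from (row, col, value) records. The sizes (2,000 x 1,000 with 5,000 records) and the random indices are made up for illustration and are not from the question.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical sizes, chosen only for illustration.
n_rows, n_cols, n_records = 2_000, 1_000, 5_000

rng = np.random.default_rng(0)
row = rng.integers(0, n_rows, size=n_records)
col = rng.integers(0, n_cols, size=n_records)
data = np.ones(n_records, dtype=np.int64)

# Build the sparse matrix straight from the records; no dense array is created.
# Duplicate (row, col) pairs are summed during the conversion to CSR.
sparse = coo_matrix((data, (row, col)), shape=(n_rows, n_cols)).tocsr()

# Bytes a dense int64 array of the same shape would need.
dense_bytes = n_rows * n_cols * data.itemsize
# Bytes actually held by the three CSR arrays.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The saving grows with the share of empty cells, which is exactly the situation a pivot of a long record list tends to produce.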

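Here is a method that creates a sparse scipy matrix from the data and the indices of person and thing. person_u and thing_u are lists of the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already contains the values you want.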

from scipy.sparse import csr_matrix
person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))
data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))
>>> sparse_matrix 
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])
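For comparison, the row and column codes can also be obtained with pd.factorize, which returns integer codes and the unique values in one call. This is a swapped-in technique rather than the approach used in the answer above; the sketch assumes the same example frame as in the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix

frame = pd.DataFrame({
    'person': ['me', 'you', 'him', 'you', 'him', 'me'],
    'thing': ['a', 'a', 'b', 'c', 'd', 'd'],
    'count': [1, 1, 1, 1, 1, 1],
})

# sort=True makes the codes follow the sorted order of the unique values,
# so row 0 is 'him', row 1 is 'me', row 2 is 'you'.
row, person_u = pd.factorize(frame['person'], sort=True)
col, thing_u = pd.factorize(frame['thing'], sort=True)

sparse_matrix = csr_matrix(
    (frame['count'].to_numpy(), (row, col)),
    shape=(len(person_u), len(thing_u)),
)

print(sparse_matrix.toarray())
```

person_u and thing_u keep the labels, so they can be reused later as the index and columns of a sparse DataFrame.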

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but if you want a sparse DataFrame, you can do the following:

dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0) 
                              for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)
>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0
>>> type(dfs)
pandas.sparse.frame.SparseDataFrame

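The answer posted previously by @khammel was useful, but unfortunately it no longer works because of changes in pandas and Python. The code below should produce the same output: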

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)
row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), 
                           shape=(person_c.categories.size, thing_c.categories.size))
>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>
>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)

dfs = pd.SparseDataFrame(sparse_matrix, 
                         index=person_c.categories, 
                         columns=thing_c.categories, 
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0

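The main changes are:

.astype() no longer accepts "category" together with a categories argument. You have to create a CategoricalDtype object instead.
sort() no longer works; the code uses sorted() instead

Other changes are more superficial:

Using the size of the categories instead of the length of the unique Series object, simply because I did not want to create another object unnecessarily
The data input to csr_matrix (frame["count"]) does not need to be a list object
pandas SparseDataFrame accepts a scipy.sparse object directly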

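I had a similar problem and stumbled upon this post. The only difference was that I had two columns in the DataFrame that together define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalization, so I used the grouper: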

# function
import pandas as pd
from scipy.sparse import csr_matrix
def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper
    idx_i = grpr_i.group_info[0]
    grpr_j = data.groupby(vars_j).grouper
    idx_j = grpr_j.group_info[0]
    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))
    return data_sm, grpr_i, grpr_j

# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val' : [1, 2, 3, 4, 5, 6]})
data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
data_sm.todense()
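The grouper and group_info attributes used above are pandas internals and may not be available in every version. As a hedged alternative, the sketch below gets the same group codes from the public GroupBy.ngroup() method. The function name df_to_sm_ngroup and the val parameter are hypothetical names introduced here, and unlike df_to_sm it returns only the matrix.

```python
import pandas as pd
from scipy.sparse import csr_matrix

def df_to_sm_ngroup(data, vars_i, vars_j, val='val'):
    """Build a CSR matrix whose rows/columns are the groups of vars_i/vars_j."""
    gb_i = data.groupby(vars_i)
    gb_j = data.groupby(vars_j)
    # ngroup() labels each record with the number of its group,
    # numbered in sorted key order (groupby sorts keys by default).
    idx_i = gb_i.ngroup().to_numpy()
    idx_j = gb_j.ngroup().to_numpy()
    return csr_matrix((data[val].to_numpy(), (idx_i, idx_j)),
                      shape=(gb_i.ngroups, gb_j.ngroups))

data = pd.DataFrame({'var_i_1': ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2': ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1': ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val': [1, 2, 3, 4, 5, 6]})

data_sm = df_to_sm_ngroup(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
print(data_sm.toarray())
```

With this data the rows correspond to the sorted pairs (a1, b1), (a1, b2), (a2, b1), (a3, b4) and the columns to c1, c2, c3.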

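This answer updates the approach in @Alnilam's answer to work with the latest pandas library, which no longer includes all of the functions used in that answer.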

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
rcLabel, vLabel = ('person', 'thing'), 'count'
rcCat = [CategoricalDtype(sorted(frame[col].unique()), ordered=True) for col in rcLabel]
rc = [frame[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
mat = csr_matrix((frame[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfPivot = ( pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories) )
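As a quick check of what from_spmatrix returns, here is a self-contained sketch that rebuilds the pivot from the question's example frame and inspects it through the DataFrame.sparse accessor. The variable names (person_c, thing_c, df_pivot) are chosen here for readability and are not from the answer above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

frame = pd.DataFrame({
    'person': ['me', 'you', 'him', 'you', 'him', 'me'],
    'thing': ['a', 'a', 'b', 'c', 'd', 'd'],
    'count': [1, 1, 1, 1, 1, 1],
})

# Ordered categoricals give stable integer codes for rows and columns.
person_c = CategoricalDtype(sorted(frame['person'].unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame['thing'].unique()), ordered=True)
row = frame['person'].astype(person_c).cat.codes.to_numpy()
col = frame['thing'].astype(thing_c).cat.codes.to_numpy()

mat = csr_matrix((frame['count'].to_numpy(), (row, col)),
                 shape=(len(person_c.categories), len(thing_c.categories)))

df_pivot = pd.DataFrame.sparse.from_spmatrix(
    mat, index=person_c.categories, columns=thing_c.categories)

print(df_pivot.dtypes)          # every column has a sparse dtype
print(df_pivot.sparse.density)  # share of cells that are actually stored
coo = df_pivot.sparse.to_coo()  # back to a scipy sparse matrix if needed
```

Label-based lookups such as df_pivot.loc['him', 'b'] work as on a regular DataFrame, while cells that were never stored read back as the fill value 0.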
