使用rpy2的Panda DataFrame分位数回归模型中的不一致数组



我正在使用rpy2(2.7.6)对engel数据集进行分位数回归:

import statsmodels as sm
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
quantreg = importr('quantreg')
data = sm.datasets.engel.load_pandas().data
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

然而,这会产生以下错误:

qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
Traceback (most recent call last):
  File "<ipython-input-22-02ee1015737c>", line 1, in <module>
    quantreg.rq('foodexp ~ income', data=data, tau=0.5)
  File "C:Anacondalibsite-packagesrpy2robjectsfunctions.py", line 178, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "C:Anacondalibsite-packagesrpy2robjectsfunctions.py", line 106, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
RRuntimeError: Error in y - x %*% z$coef : non-conformable arrays

据我所知,在这种情况下,不一致的数组意味着缺少一些值,或者使用的"数组"大小不同。我可以确认事实并非如此:

data.count()
Out[26]: 
income     235
foodexp    235
dtype: int64
data.shape
Out[27]: (235, 2)

这个错误还意味着什么?rpy2中从DataFrame到data.frame的转换是否可能工作不正常,或者我在这里遗漏了什么?其他人能确认这个错误吗?

以防万一,这里有一些关于R和Python版本的信息。

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]
on win32

如有任何帮助,我们将不胜感激。

编辑1:

如果我直接从R加载数据集,我不会得到错误:

from rpy2.robjects import r
r.data('engel')
data = r['engel']
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

所以我认为pandas2ri的转换有问题。当我尝试用pandas2ri.py2ri手动将DataFrame转换为data.frame时,也会发生同样的错误。

编辑2:

有趣的是,如果我使用了不推荐使用的pandas.rpy.common.convert_to_r_dataframe,错误就消失了:

import pandas.rpy.common as com
rdata = com.convert_to_r_dataframe(data)
qreg = quantreg.rq('foodexp ~ income', data=rdata, tau=0.5)

pandas2ri中肯定有一个错误,这里也证实了这一点。

如rpy2问题跟踪器上的回答:

问题的根源似乎是pandas数据帧中的列被转换为每个只有一列的Array对象。

>>> pandas2ri.py2ri_pandasdataframe(data) 
<DataFrame - Python:0x7f8af3c2afc8 / R:0x92958b0>
[Array, Array]
  income: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x7f8af57ef908 / R:0x92e1bf0>
[420.157651, 541.411707, 901.157457, ..., 581.359892, 743.077243, 1057.676711]
  foodexp: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x7f8af3c2ab88 / R:0x92e7600>
[255.839425, 310.958667, 485.680014, ..., 468.000798, 522.601906, 750.320163]

区别是微妙的,但这似乎混淆了quantreg包。还有其他R函数似乎独立于对象是一个带一列的数组还是一个向量而工作。

将列转向R矢量似乎是解决问题所需要的:

from rpy2.robjects.vectors import FloatVector
mydata=pandas2ri.py2ri_pandasdataframe(data)
from rpy2.robjects.packages import importr
base=importr('base')
mydata[0]=base.as_vector(mydata[0])
mydata[1]=base.as_vector(mydata[1])
# now this is working
qreg = quantreg.rq('foodexp ~ income', data=mydata, tau=0.5)

现在,我想收集更多的数据,了解这是否可以在不破坏其他事情的情况下解决问题。为此,我将修复程序转换为从pandas转换器派生的自定义转换器:

from rpy2.robjects import default_converter
from rpy2.robjects.conversion import Converter, localconverter
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri, pandas2ri, vectors
import numpy
my_converter = Converter('my converter',
                         template=pandas2ri.converter)
base=importr('base')
def ndarray_forcevector(obj):
    func=numpy2ri.converter.py2ri.registry[numpy.ndarray]
    # current conversion as performed by numpy
    res=func(obj)
    if len(obj.shape) == 1:
        # force into an R vector
        res=base.as_vector(res)
    return res
@my_converter.py2ri.register(pandas2ri.PandasSeries)
def py2ri_pandasseries(obj):
    # this is a copy of the function with the same name in pandas2ri, with
    # the call to ndarray_forcevector() as the only difference
    if obj.dtype == '<M8[ns]':
        # time series
        d = [vectors.IntVector([x.year for x in obj]),
             vectors.IntVector([x.month for x in obj]),
             vectors.IntVector([x.day for x in obj]),
             vectors.IntVector([x.hour for x in obj]),
             vectors.IntVector([x.minute for x in obj]),
             vectors.IntVector([x.second for x in obj])]
        res = vectors.ISOdatetime(*d)
        #FIXME: can the POSIXct be created from the POSIXct constructor ?
        # (is '<M8[ns]' mapping to Python datetime.datetime ?)
        res = vectors.POSIXct(res)
    else:
        # converted as a numpy array
        res = ndarray_forcevector(obj)
    # "index" is equivalent to "names" in R
    if obj.ndim == 1:
        res.do_slot_assign('names',
                           vectors.StrVector(tuple(str(x) for x in obj.index)))
    else:
        res.do_slot_assign('dimnames',
                          vectors.SexpVector(conversion.py2ri(obj.index)))
    return res

使用这种新转换器的最简单方法可能是在上下文管理器中:

with localconverter(default_converter + my_converter) as cv:
    qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

最新更新