我正在使用rpy2(2.7.6)对engel数据集进行分位数回归:
import statsmodels as sm
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
quantreg = importr('quantreg')
data = sm.datasets.engel.load_pandas().data
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
然而,这会产生以下错误:
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
Traceback (most recent call last):
File "<ipython-input-22-02ee1015737c>", line 1, in <module>
quantreg.rq('foodexp ~ income', data=data, tau=0.5)
File "C:Anacondalibsite-packagesrpy2robjectsfunctions.py", line 178, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "C:Anacondalibsite-packagesrpy2robjectsfunctions.py", line 106, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
RRuntimeError: Error in y - x %*% z$coef : non-conformable arrays
据我所知,在这种情况下,不一致的数组意味着缺少一些值,或者使用的"数组"大小不同。我可以确认事实并非如此:
data.count()
Out[26]:
income 235
foodexp 235
dtype: int64
data.shape
Out[27]: (235, 2)
这个错误还意味着什么?rpy2中从DataFrame到data.frame的转换是否可能工作不正常,或者我在这里遗漏了什么?其他人能确认这个错误吗?
以防万一,这里有一些关于R和Python版本的信息。
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Dec 7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]
on win32
如有任何帮助,我们将不胜感激。
编辑1:
如果我直接从R加载数据集,我不会得到错误:
from rpy2.robjects import r
r.data('engel')
data = r['engel']
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
所以我认为pandas2ri
的转换有问题。当我尝试用pandas2ri.py2ri
手动将DataFrame转换为data.frame时,也会发生同样的错误。
编辑2:
有趣的是,如果我使用了不推荐使用的pandas.rpy.common.convert_to_r_dataframe
,错误就消失了:
import pandas.rpy.common as com
rdata = com.convert_to_r_dataframe(data)
qreg = quantreg.rq('foodexp ~ income', data=rdata, tau=0.5)
pandas2ri
中肯定有一个错误,这里也证实了这一点。
如rpy2问题跟踪器上的回答:
问题的根源似乎是pandas数据帧中的列被转换为每个只有一列的Array对象。
>>> pandas2ri.py2ri_pandasdataframe(data)
<DataFrame - Python:0x7f8af3c2afc8 / R:0x92958b0>
[Array, Array]
income: <class 'rpy2.robjects.vectors.Array'>
<Array - Python:0x7f8af57ef908 / R:0x92e1bf0>
[420.157651, 541.411707, 901.157457, ..., 581.359892, 743.077243, 1057.676711]
foodexp: <class 'rpy2.robjects.vectors.Array'>
<Array - Python:0x7f8af3c2ab88 / R:0x92e7600>
[255.839425, 310.958667, 485.680014, ..., 468.000798, 522.601906, 750.320163]
区别是微妙的,但这似乎混淆了quantreg
包。还有其他R函数似乎独立于对象是一个带一列的数组还是一个向量而工作。
将列转向R矢量似乎是解决问题所需要的:
from rpy2.robjects.vectors import FloatVector
mydata=pandas2ri.py2ri_pandasdataframe(data)
from rpy2.robjects.packages import importr
base=importr('base')
mydata[0]=base.as_vector(mydata[0])
mydata[1]=base.as_vector(mydata[1])
# now this is working
qreg = quantreg.rq('foodexp ~ income', data=mydata, tau=0.5)
现在,我想收集更多的数据,了解这是否可以在不破坏其他事情的情况下解决问题。为此,我将修复程序转换为从pandas转换器派生的自定义转换器:
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import Converter, localconverter
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri, pandas2ri, vectors
import numpy
my_converter = Converter('my converter',
template=pandas2ri.converter)
base=importr('base')
def ndarray_forcevector(obj):
func=numpy2ri.converter.py2ri.registry[numpy.ndarray]
# current conversion as performed by numpy
res=func(obj)
if len(obj.shape) == 1:
# force into an R vector
res=base.as_vector(res)
return res
@my_converter.py2ri.register(pandas2ri.PandasSeries)
def py2ri_pandasseries(obj):
# this is a copy of the function with the same name in pandas2ri, with
# the call to ndarray_forcevector() as the only difference
if obj.dtype == '<M8[ns]':
# time series
d = [vectors.IntVector([x.year for x in obj]),
vectors.IntVector([x.month for x in obj]),
vectors.IntVector([x.day for x in obj]),
vectors.IntVector([x.hour for x in obj]),
vectors.IntVector([x.minute for x in obj]),
vectors.IntVector([x.second for x in obj])]
res = vectors.ISOdatetime(*d)
#FIXME: can the POSIXct be created from the POSIXct constructor ?
# (is '<M8[ns]' mapping to Python datetime.datetime ?)
res = vectors.POSIXct(res)
else:
# converted as a numpy array
res = ndarray_forcevector(obj)
# "index" is equivalent to "names" in R
if obj.ndim == 1:
res.do_slot_assign('names',
vectors.StrVector(tuple(str(x) for x in obj.index)))
else:
res.do_slot_assign('dimnames',
vectors.SexpVector(conversion.py2ri(obj.index)))
return res
使用这种新转换器的最简单方法可能是在上下文管理器中:
with localconverter(default_converter + my_converter) as cv:
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)