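I have a dataset with 32 variables and roughly 900 observations that I want to test in a multiple linear regression model (statsmodels OLS). I want to see which variables work best together. I am essentially brute-forcing this, because the relationship isn't clear to anyone. Unfortunately, it takes hours to complete, so I decided to try multiprocessing to speed it up. For each combination of variables, the script will:
- Build a statement
- Perform the linear regression
- Extract the summary values (p-values / BIC / R-squared)
- Store them in a dataframe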
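The "build a statement for each combination" step above can be sketched roughly as follows. This is only an illustration: `build_formula` and `all_formulas` are hypothetical helper names (the question's own `createStmt` is not shown), and the column names are placeholders.

```python
from itertools import combinations


def build_formula(response, predictors):
    """Return a patsy-style formula string such as 'y ~ a + b'."""
    return f"{response} ~ " + " + ".join(predictors)


def all_formulas(response, variables, min_size=1, max_size=None):
    """Yield one formula per subset of `variables`, smallest subsets first."""
    if max_size is None:
        max_size = len(variables)
    for size in range(min_size, max_size + 1):
        for combo in combinations(variables, size):
            yield build_formula(response, combo)


# Placeholder variable names; the real dataset has 32 of them.
formulas = list(all_formulas("adjusted_value", ["a", "b", "c"]))
for formula in formulas:
    print(formula)
```

With 32 variables there are 2^32 - 1 (about 4.3 billion) non-empty subsets, so capping `max_size` or pre-filtering the candidates is probably necessary whatever the degree of parallelism.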
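I have the first three steps working, but when I try to store the results in the dataframe and write it out at the end, nothing is returned. Any ideas? I have declared the dataframe as global. I am confident the function works, because I used a modified version of it in the original model.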
import pandas as pd
import random
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statistics import mean
from statistics import median
from multiprocessing import Pool
import datetime
import os
#Create the dataframe
inLoc='C:\\temp\\retailer_cost\\'
inName='raw_data_v1_2.csv'
inFile = inLoc + inName
df=pd.read_csv(inFile)
#Create the dataframe to store the summary results in
summaryDF = pd.DataFrame(columns=['modelID','statement','num_vars','BIC','AIC','RSQ','RSQ_ADJ','CONDITION','AVG_PVALUE','MEDIAN_PVALUE','POSITIVE_VALUES'])
combList = [['a','b','c','d','e'],
['a','b','c','d',],
['a','b','c','e'],
['a','b','d','e'],
['a','c','d','e'],
['b','c','d','e']]
################################################################
#Function
################################################################
def processor(combin):
    date_time = str(datetime.datetime.now().time())
    #Declare SummaryDF as global
    global summaryDF
    stmt,interceptOut = createStmt('adjusted_value', combin)
    print(stmt)
    mod = smf.ols(formula=stmt, data=df)
    results = mod.fit()
    modID = str(date_time) + '_' + str(interceptOut)
    avg = mean(list(results.pvalues))
    mdn = median(list(results.pvalues))
    #Extract coefficients
    pVals = list(dict(results.pvalues).items())
    coeffs = list(dict(results.params).items())
    tVals = list(dict(results.tvalues).items())
    #Create the record to add
    summOut = {'modelID': modID, 'statement': stmt, 'num_vars': str(len(combin)), 'BIC': str(results.bic) ,'AIC': str(results.aic) ,'RSQ': str(results.rsquared) ,'RSQ_ADJ': str(results.rsquared_adj),'CONDITION': str(results.condition_number),'AVG_PVALUE': str(avg),'MEDIAN_PVALUE': str(mdn)}
    summaryDF = summaryDF.append(summOut, ignore_index = True)
if __name__ == '__main__':
    pool = Pool()
    pool.map(processor, combList)
    #Produces nothing
    summaryDF.to_csv(r'c:\temp\olsModelOut.csv', index=False)
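Each worker process gets its own copy of the global summaryDF, so rows appended inside the workers never reach the parent process. You have to return summOut from the processor function and store the returned values in a list (here, data). Afterwards, you can convert the list of summOut records into the dataframe summaryDF. You can do it like this: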
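def processor(combin):
    ...
    return summOut  # return the record instead of appending to a global

if __name__ == '__main__':
    with Pool() as pool:
        # pool.map gathers each worker's return value, in input order
        data = pool.map(processor, combList)
    # Build the dataframe once, in the parent process
    summaryDF = pd.DataFrame(data)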
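For a version that runs without the original data, here is a minimal sketch of the same return-and-collect pattern using only the standard library. The `summarize` worker is a stand-in, not the real `processor`: it only builds the formula string, where the actual code would fit the OLS model and read off BIC, AIC and so on. The names `summarize` and `collect` are made up for this illustration.

```python
from multiprocessing import Pool


def summarize(combin):
    """Stand-in worker: return one summary record as a plain dict."""
    # Placeholder work; the real function would fit smf.ols(...) here.
    stmt = "adjusted_value ~ " + " + ".join(combin)
    return {"statement": stmt, "num_vars": len(combin)}


def collect(combinations, processes=2):
    """Run summarize() in worker processes and return the list of records."""
    with Pool(processes=processes) as pool:
        # Return values are pickled and sent back to the parent process.
        return pool.map(summarize, combinations)


if __name__ == "__main__":
    comb_list = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
    data = collect(comb_list)
    for record in data:
        print(record)
```

The list of dicts in `data` is the shape that `pd.DataFrame(data)` accepts, with one row per record and one column per key.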