如何在Spark2中实际应用保存的RF模型并进行预测?



这是一个新手问题,因为我似乎找不到一个简单的方法。

我正在使用天气数据做航空公司数据集,如果>15分钟预测延误。

航空公司数据集(2007年和2008年(:http://stat-computing.org/dataexpo/2009/the-data.html

天气:

wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2007.csv.gz -O /tmp/weather_2007.csv.gz
wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2008.csv.gz -O /tmp/weather_2008.csv.gz

我的代码来自这个 URL https://github.com/neil90/spark_airline_delays/blob/master/spark_airplane.ipynb 但我为 Spark 2.3 更改了它:

df_airline_2007 = sqlContext.read.format("csv").option("header", "true").load("/ACMEAirDB/2007/2007.csv")
df_weather_2007 = sqlContext.read.format("csv").option("header", "false").load("/ACMEAirDB/weather_2007/weather_2007.csv")
df_airline_2008 = sqlContext.read.format("csv").option("header", "true").load("/ACMEAirDB/2008/2008.csv")
df_weather_2008 = sqlContext.read.format("csv").option("header", "false").load("/ACMEAirDB/weather_2008/weather_2008.csv")
df_airline_raw = df_airline_2007.unionAll(df_airline_2008)
df_weather_raw = df_weather_2007.unionAll(df_weather_2008)

#Function to create year,month,day into date for airline to join on to weather
def to_date(year,month,day): 
dt = "%04d%02d%02d" % (year, month, day)
return dt
sqlContext.udf.register("to_date", to_date)
#Function to discrentize time in airline
def discretize_tod(val):
hour = int(val[:2])
if hour < 8:
return 0
if hour < 16:
return 1
return 2
sqlContext.udf.register("discretize_tod", discretize_tod)
df_airline_raw.registerTempTable("df_airpline_raw")
df_weather_raw.registerTempTable("df_weather_raw")
#Create Final Airline transformation
df_airline = sqlContext.sql("""SELECT 
Year as year, Month as month, DayofMonth as day, DayOfWeek as dow,
CarrierDelay as carrier, Origin as origin, Dest as dest, Distance as distance, 
discretize_tod(DepTime) as tod, CASE WHEN DepDelay >= 15 THEN 1 ELSE 0 END as delay, 
to_date(cast(Year as int), cast(Month as int), cast(DayofMonth as int)) As date 
FROM df_airpline_raw
WHERE Cancelled = 0 AND Origin = 'ORD'""")
#Create Base Weather Transformation Table
df_weather = sqlContext.sql("""SELECT 
_C0 AS station,
_C1 As date,
_C2 As metric,
_C3 As value, 
_C4 As t1, 
_C5 As t2, 
_C6 As t3, 
_C7 As time
FROM df_weather_raw
""")

# df_weather.show(10)    

#Create Tmin and Tmax Weather DF
df_weather.registerTempTable("df_weather")
#Create DFs for Weather Tmin and Tmax Values 
df_weather_tmin = sqlContext.sql("""SELECT 
date, 
value as temp_min 
FROM df_weather 
WHERE station = 'USW00094846' 
AND metric = 'TMIN'""")
df_weather_tmax = sqlContext.sql("""SELECT 
date, 
value as temp_max 
FROM df_weather 
WHERE station = 'USW00094846' 
AND metric = 'TMAX'""")
#Join Airline with Weather Tmin and Tmax Dataframes
df_airline_tmin = df_airline.join(df_weather_tmin, 
df_weather_tmin.date == df_airline.date, 
"inner").drop(df_weather_tmin.date)
df_airline_tmin_and_tmax = df_airline_tmin.join(df_weather_tmax, 
df_weather_tmax.date == df_airline_tmin.date, 
"inner").drop(df_weather_tmax.date)
df_airline_tmin_and_tmax.registerTempTable("df_airline_tmin_and_tmax")
df_all = sqlContext.sql("""SELECT 
delay,
year,
month, 
day, 
dow, 
cast (tod AS int) tod, 
distance, 
temp_min, 
temp_max
FROM df_airline_tmin_and_tmax""")
#Cache Dataframe because we split it later on
df_all.cache()
#Linear Regression
#import necessary librarys
from pyspark.mllib.regression import  LabeledPoint
# from pyspark.mllib.tree import DecisionTree, RandomForest
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.linalg import DenseVector

#Create labeledPoint Parser
def parseDF(row):
values = [row.delay, row.month, row.day, row.dow, row.tod, row.distance, row.temp_min, row.temp_max]
return LabeledPoint(values[0], DenseVector(values[1:]))
#Convert Dataframes to LabeledPoint for modeling
train_data = df_all.filter("year=2007").rdd.map(parseDF)
test_data = df_all.filter("year=2008").rdd.map(parseDF)

#Train Models
modelRF = RandomForest.trainClassifier(train_data, numClasses=2, categoricalFeaturesInfo={},
numTrees=500, impurity='gini', maxDepth=5)

#Apply CART model on Test Data
predictionsRF = modelRF.predict(test_data.map(lambda x: x.features))
predictionsAndLabelsRFRDD = predictionsRF.zip(test_data.map(lambda lp: lp.label))
predictionsAndLabelsRF = predictionsAndLabelsRFRDD.collect()
import pandas as pd
#Create function
def confusion_matrix(predAndLabel):
y_actual = pd.Series([x for x, y in predAndLabel], name = 'Actual')
y_pred = pd.Series([y for x, y in predAndLabel], name = 'Predicted')
matrix = pd.crosstab(y_actual,y_pred)
accuracy = float(matrix[0][0] + matrix[1][1])/(matrix[0][0] + matrix[0][1] + matrix[1][0] + matrix[1][1])
return matrix, accuracy

#RandomForest Confusion Matrix and Model Accuracy
df_confusion_RF, accuracy_RF = confusion_matrix(predictionsAndLabelsRF)
print('RF Confusion Matrix:')
print(df_confusion_RF)
print('nRF Model Accuracy: {0}'.format(accuracy_RF))

我在下面得到正确的输出:

RF Confusion Matrix:
Predicted     0.0    1.0
Actual                  
0.0        237594  93003
1.0          2300   2433
RF Model Accuracy: 0.715793397549

所以我的问题是:现在我有了模型predictionsRF,我如何将其应用于"现实世界"的记录?

这是我的新手尝试:

df_validation = sqlContext.sql("""SELECT 
1 delay,
2008 year,
6 month, 
19 day, 
4 dow, 
1 tod, 
925 distance, 
111 temp_min, 
272 temp_max
""")
validation_data = df_validation.rdd.map(parseDF)
df_validation.show(1)
validationsRF = modelRF.predict(validation_data.map(lambda x: x.features))
validationsAndLabelsRFRDD = validationsRF.zip(validation_data.map(lambda lp: lp.label))
validationsAndLabelsRF = validationsAndLabelsRFRDD.collect()
print(validationsRF.collect())

1. 我使用validationsRF.collect()作为预测的延迟结果是否正确?

2. 如何从df_validation中删除delay列而不会收到错误(如下(?

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 284.0 failed 4 times, most recent failure: Lost task 0.3 in stage 284.0 (TID 4544, ip-172-31-40-184.us-west-2.compute.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 229, in main
process()
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<stdin>", line 5, in parseDF
File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 1561, in __getattr__
raise AttributeError(item)
AttributeError: delay
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.GeneratedMethodAccessor184.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 229, in main
process()
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<stdin>", line 5, in parseDF
File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 1561, in __getattr__
raise AttributeError(item)

如何从df_validation中删除延迟列并且没有收到错误(如下(?

不要假设它存在于您的parseDF函数中。具体来说,失败的原因是:

values = [row.delay, ...]

但老实说,只需切换到ML Pipeline。

.我使用 validationsRF.collect(( 作为预测的延迟结果是否正确?

为什么要这样做?

from pyspark.mllib.linalg import Vectors
modelRF.predict(Vectors.dense([2008, 6, 19, 4, 1, 925, 111, 272]))

最新更新