I am trying to run the basic example provided in the "Inferring the Schema Using Reflection" section of the Apache Spark documentation.
I am doing this on the Cloudera QuickStart VM (CDH5).
The example I am trying to run is as follows:
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
I ran the code exactly as shown above, but I always get the error "IndexError: list index out of range" when I execute the last command (the for loop).
The input file book6_sample is available at book6_sample.csv.
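For reference, the code above assumes every non-empty line of the input contains at least two comma-separated fields, a name followed by an integer age, along these lines (hypothetical contents for illustration, not the actual book6_sample.csv):

John,28
Mary,15
Peter,17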
Please let me know where I am going wrong.
Thanks in advance.
Regards, Sri
Your file has an empty line at the end, and that is what causes this error: when sc.textFile reads a blank line, l.split(",") returns a single-element list, so p[1] raises "IndexError: list index out of range". Open the file in a text editor, delete that line, and it should work.
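If you prefer not to edit the file by hand, a minimal sketch of a more defensive version of the parsing step (using the same file path as in the question) would skip blank or incomplete lines before building the Rows:

# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
# Drop blank lines (such as a trailing newline) and lines that do not have
# both fields, so p[1] always exists when the Row is built.
parts = lines.filter(lambda l: l.strip()) \
             .map(lambda l: l.split(",")) \
             .filter(lambda p: len(p) >= 2)
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")

The rest of the example (the SQL query and the final loop) then runs unchanged, since malformed lines never reach the Row constructor.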