I am trying to read Student records from a CSV with Spark (Java), and I am running into errors when converting them to a typed Dataset<Student>.
SparkSession spark = SparkSession.builder().appName("testingSQL").master("local[*]").getOrCreate();
Encoder<Student> studentEncoder = Encoders.bean(Student.class);
Dataset<Student> df = spark.read().option("header", true)
//.schema(studentEncoder.schema())
.csv("src/main/resources/exams/students2.csv")
.as(studentEncoder);
df.show();
Student class
public class Student implements Serializable {
int studentId;
int examCenterId;
String subject;
int year;
int quarter;
int score;
String grade;
// getters and setters
}
CSV file
studentId,examCenterId,subject,year,quarter,score,grade
1,1,Math,2005,1,41,D
1,1,Spanish,2005,1,51,C
1,1,German,2005,1,39,D
1,1,Physics,2005,1,35,D
However, when I try to read these records, I run into the following two problems:
Problem 1: when the .schema(studentEncoder.schema()) line in the reader code is commented out (as above), execution throws an upCastFailureError:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast score from string to int.
The type path of the target object is:
- field (class: "int", name: "score")
- root class: "com.virtualpairprogrammers.model.Student"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
at org.apache.spark.sql.errors.QueryCompilationErrors$.upCastFailureError(QueryCompilationErrors.scala:154)
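(As the message itself suggests, one workaround would be to add explicit casts to the input columns before applying the encoder. A rough sketch of that idea, assuming the same reader code and encoder as above; the variable name is just a placeholder:)
// Sketch: cast the string columns produced by the CSV reader to int explicitly,
// so that .as(studentEncoder) no longer needs to up-cast string -> int.
// Requires: import static org.apache.spark.sql.functions.col;
Dataset<Student> casted = spark.read().option("header", true)
        .csv("src/main/resources/exams/students2.csv")
        .withColumn("studentId", col("studentId").cast("int"))
        .withColumn("examCenterId", col("examCenterId").cast("int"))
        .withColumn("year", col("year").cast("int"))
        .withColumn("quarter", col("quarter").cast("int"))
        .withColumn("score", col("score").cast("int"))
        .as(studentEncoder);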
Problem 2: when the .schema(studentEncoder.schema()) line is uncommented, Spark does show Student rows, but the values end up under the wrong columns and some columns are entirely null. For example, the score shows up in the subject column:
+------------+-----+-------+-----+---------+-------+----+
|examCenterId|grade|quarter|score|studentId|subject|year|
+------------+-----+-------+-----+---------+-------+----+
|           1|    1|   null| 2005|        1|     41|null|
|           1|    1|   null| 2005|        1|     51|null|
|           1|    1|   null| 2005|        1|     39|null|
|           1|    1|   null| 2005|        1|     35|null|
+------------+-----+-------+-----+---------+-------+----+
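The column order in that output matches the bean encoder's schema, which lists the fields alphabetically (examCenterId, grade, quarter, score, studentId, subject, year); the CSV reader appears to apply a user-supplied schema by position rather than by header name, which would shift every column exactly as shown. A quick way to inspect the order the encoder expects:
// Print the schema that Encoders.bean(Student.class) produces;
// note that the fields come out in alphabetical order, not CSV column order.
studentEncoder.schema().printTreeString();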
Solution to Problem 1: adding the option to infer the schema fixes it.
Dataset<Student> df = spark.read().option("header", true)
.option("inferSchema", true)
.csv("src/main/resources/exams/students2.csv")
.as(studentEncoder);
Warning: inferSchema requires an extra pass over the data.
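An alternative sketch that avoids both the extra pass and the column shift would be to supply an explicit schema whose fields are declared in the CSV column order (field names and types taken from the CSV header above; the variable names are just placeholders):
// Requires: import org.apache.spark.sql.types.DataTypes;
//           import org.apache.spark.sql.types.StructType;
// Declare the columns in the same order as the CSV file, with the target types,
// so nothing has to be inferred and .as(studentEncoder) can match fields by name.
StructType csvSchema = new StructType()
        .add("studentId", DataTypes.IntegerType)
        .add("examCenterId", DataTypes.IntegerType)
        .add("subject", DataTypes.StringType)
        .add("year", DataTypes.IntegerType)
        .add("quarter", DataTypes.IntegerType)
        .add("score", DataTypes.IntegerType)
        .add("grade", DataTypes.StringType);

Dataset<Student> dfExplicit = spark.read().option("header", true)
        .schema(csvSchema)
        .csv("src/main/resources/exams/students2.csv")
        .as(studentEncoder);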