Spark Java | Error reading a CSV into a typed Dataset



I am trying to read Student records from a CSV using Spark (Java), and I am running into errors converting them into a typed Dataset<Student>.

Reader code:
SparkSession spark = SparkSession.builder().appName("testingSQL").master("local[*]").getOrCreate();
Encoder<Student> studentEncoder = Encoders.bean(Student.class);
Dataset<Student> df = spark.read().option("header", true)
//.schema(studentEncoder.schema())
.csv("src/main/resources/exams/students2.csv")
.as(studentEncoder);
df.show();
Student class:
public class Student implements Serializable {
int studentId;
int examCenterId;
String subject;
int year;
int quarter;
int score;
String grade;
// getters and setters
}
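For context, Encoders.bean follows JavaBean conventions: the class must be public with a public no-arg constructor, and each field needs a public getter and setter. A minimal sketch of the expected shape (only two fields shown; the rest follow the same pattern):

import java.io.Serializable;

// Sketch of the bean shape Encoders.bean expects: public class,
// public no-arg constructor, public getter/setter per field.
public class Student implements Serializable {
    private int studentId;
    private String subject;   // remaining fields follow the same pattern

    public Student() {}       // no-arg constructor required by the bean encoder

    public int getStudentId() { return studentId; }
    public void setStudentId(int studentId) { this.studentId = studentId; }

    public String getSubject() { return subject; }
    public void setSubject(String subject) { this.subject = subject; }
}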

CSV file:

studentId,examCenterId,subject,year,quarter,score,grade
1,1,Math,2005,1,41,D
1,1,Spanish,2005,1,51,C
1,1,German,2005,1,39,D
1,1,Physics,2005,1,35,D

However, when I try to read these records, I run into the following two problems:

Problem 1: With the .schema(studentEncoder.schema()) line in the reader code commented out, execution throws an upCastFailureError:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast score from string to int.
The type path of the target object is:
- field (class: "int", name: "score")
- root class: "com.virtualpairprogrammers.model.Student"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
at org.apache.spark.sql.errors.QueryCompilationErrors$.upCastFailureError(QueryCompilationErrors.scala:154)
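For what it's worth, the "explicit cast" route the error message suggests would look roughly like this (a sketch, not from the original post; column names come from the CSV header above): read untyped, cast the string columns to the bean's int types, then apply the encoder by name.

import static org.apache.spark.sql.functions.col;

// Sketch: CSV columns arrive as strings; cast the numeric ones to int
// so they match the bean's fields before converting to Dataset<Student>.
Dataset<Student> students = spark.read()
        .option("header", true)
        .csv("src/main/resources/exams/students2.csv")
        .withColumn("studentId", col("studentId").cast("int"))
        .withColumn("examCenterId", col("examCenterId").cast("int"))
        .withColumn("year", col("year").cast("int"))
        .withColumn("quarter", col("quarter").cast("int"))
        .withColumn("score", col("score").cast("int"))
        .as(studentEncoder);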

Problem 2: With the .schema(studentEncoder.schema()) line uncommented, Spark does show Student rows, but the values appear under the wrong columns and some columns are entirely null. For example, the score shows up in the subject column:

+------------+-----+-------+-----+---------+-------+----+
|examCenterId|grade|quarter|score|studentId|subject|year|
+------------+-----+-------+-----+---------+-------+----+
|           1|    1|   null| 2005|        1|     41|null|
|           1|    1|   null| 2005|        1|     51|null|
|           1|    1|   null| 2005|        1|     39|null|
|           1|    1|   null| 2005|        1|     35|null|
+------------+-----+-------+-----+---------+-------+----+
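The shifted columns in Problem 2 come from how the two schemas line up: Encoders.bean builds its schema with fields in alphabetical order, while the CSV reader applies a supplied schema to columns by position, not by name. A quick check (my addition, not from the original post) makes the mismatch visible:

// The bean encoder's schema lists fields alphabetically:
//   examCenterId, grade, quarter, score, studentId, subject, year
// ...but the CSV columns are in file order:
//   studentId, examCenterId, subject, year, quarter, score, grade
// So "subject" text lands in the int column "quarter" (parsed as null),
// the score values land under "subject", and so on.
studentEncoder.schema().printTreeString();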

Solution to Problem 1: adding an option to infer the schema fixes it:

Dataset<Student> df = spark.read().option("header", true)
.option("inferSchema", true)
.csv("src/main/resources/exams/students2.csv")
.as(studentEncoder);

Warning: inferSchema requires an extra pass over the data.
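If that extra pass matters, one alternative (a sketch under the assumption that the CSV layout is fixed) is to declare the schema explicitly in the CSV's column order. Since the column names then match the bean's fields, .as(studentEncoder) can map them by name:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Explicit schema in the CSV's own column order; no inference pass needed.
StructType csvSchema = new StructType()
        .add("studentId", DataTypes.IntegerType)
        .add("examCenterId", DataTypes.IntegerType)
        .add("subject", DataTypes.StringType)
        .add("year", DataTypes.IntegerType)
        .add("quarter", DataTypes.IntegerType)
        .add("score", DataTypes.IntegerType)
        .add("grade", DataTypes.StringType);

Dataset<Student> students = spark.read()
        .option("header", true)
        .schema(csvSchema)
        .csv("src/main/resources/exams/students2.csv")
        .as(studentEncoder);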
