Spark Java | Error reading a CSV into a typed Dataset



I am trying to read Student records from a CSV using Spark (Java), and I am running into errors converting them into a typed Dataset<Student>.

Reader code:
SparkSession spark = SparkSession.builder().appName("testingSQL").master("local[*]").getOrCreate();
Encoder<Student> studentEncoder = Encoders.bean(Student.class);
Dataset<Student> df = spark.read().option("header", true)
//.schema(studentEncoder.schema())
.csv("src/main/resources/exams/students2.csv")
.as(studentEncoder);
df.show();
Student class:
public class Student implements Serializable {
int studentId;
int examCenterId;
String subject;
int year;
int quarter;
int score;
String grade;
// getters and setters
}
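For context, Encoders.bean follows JavaBean conventions: the class must be public with a public no-arg constructor, and each field needs a public getter and setter. A minimal sketch of the expected shape (only two fields shown; the rest follow the same pattern):

import java.io.Serializable;

// Sketch of the bean shape Encoders.bean expects: public class,
// public no-arg constructor, public getter/setter per field.
public class Student implements Serializable {
    private int studentId;
    private String subject;   // remaining fields follow the same pattern

    public Student() {}       // no-arg constructor required by the bean encoder

    public int getStudentId() { return studentId; }
    public void setStudentId(int studentId) { this.studentId = studentId; }

    public String getSubject() { return subject; }
    public void setSubject(String subject) { this.subject = subject; }
}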

CSV file:

studentId,examCenterId,subject,year,quarter,score,grade
1,1,Math,2005,1,41,D
1,1,Spanish,2005,1,51,C
1,1,German,2005,1,39,D
1,1,Physics,2005,1,35,D

However, when I try to read these records, I run into the following two problems:

Problem 1: With the .schema(studentEncoder.schema()) line in the reader code commented out, execution throws an upCastFailureError:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast score from string to int.
The type path of the target object is:
- field (class: "int", name: "score")
- root class: "com.virtualpairprogrammers.model.Student"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
at org.apache.spark.sql.errors.QueryCompilationErrors$.upCastFailureError(QueryCompilationErrors.scala:154)
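For what it's worth, the "explicit cast" route the error message suggests would look roughly like this (a sketch, not from the original post; column names come from the CSV header above): read untyped, cast the string columns to the bean's int types, then apply the encoder by name.

import static org.apache.spark.sql.functions.col;

// Sketch: CSV columns arrive as strings; cast the numeric ones to int
// so they match the bean's fields before converting to Dataset<Student>.
Dataset<Student> students = spark.read()
        .option("header", true)
        .csv("src/main/resources/exams/students2.csv")
        .withColumn("studentId", col("studentId").cast("int"))
        .withColumn("examCenterId", col("examCenterId").cast("int"))
        .withColumn("year", col("year").cast("int"))
        .withColumn("quarter", col("quarter").cast("int"))
        .withColumn("score", col("score").cast("int"))
        .as(studentEncoder);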

Problem 2: With the .schema(studentEncoder.schema()) line uncommented, Spark does show Student rows, but the values appear under the wrong columns and some columns are entirely null. For example, the score shows up in the subject column:

+------------+-----+-------+-----+---------+-------+----+
|examCenterId|grade|quarter|score|studentId|subject|year|
+------------+-----+-------+-----+---------+-------+----+
|           1|    1|   null| 2005|        1|     41|null|
|           1|    1|   null| 2005|        1|     51|null|
|           1|    1|   null| 2005|        1|     39|null|
|           1|    1|   null| 2005|        1|     35|null|
+------------+-----+-------+-----+---------+-------+----+
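The shifted columns in Problem 2 come from how the two schemas line up: Encoders.bean builds its schema with fields in alphabetical order, while the CSV reader applies a supplied schema to columns by position, not by name. A quick check (my addition, not from the original post) makes the mismatch visible:

// The bean encoder's schema lists fields alphabetically:
//   examCenterId, grade, quarter, score, studentId, subject, year
// ...but the CSV columns are in file order:
//   studentId, examCenterId, subject, year, quarter, score, grade
// So "subject" text lands in the int column "quarter" (parsed as null),
// the score values land under "subject", and so on.
studentEncoder.schema().printTreeString();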

Solution to Problem 1: adding an option to infer the schema fixes it:

Dataset<Student> df = spark.read().option("header", true)
.option("inferSchema", true)
.csv("src/main/resources/exams/students2.csv")
.as(studentEncoder);

Warning: inferSchema requires an extra pass over the data.
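If that extra pass matters, one alternative (a sketch under the assumption that the CSV layout is fixed) is to declare the schema explicitly in the CSV's column order. Since the column names then match the bean's fields, .as(studentEncoder) can map them by name:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Explicit schema in the CSV's own column order; no inference pass needed.
StructType csvSchema = new StructType()
        .add("studentId", DataTypes.IntegerType)
        .add("examCenterId", DataTypes.IntegerType)
        .add("subject", DataTypes.StringType)
        .add("year", DataTypes.IntegerType)
        .add("quarter", DataTypes.IntegerType)
        .add("score", DataTypes.IntegerType)
        .add("grade", DataTypes.StringType);

Dataset<Student> students = spark.read()
        .option("header", true)
        .schema(csvSchema)
        .csv("src/main/resources/exams/students2.csv")
        .as(studentEncoder);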
