I have a text file in the following format:
first line
column1;column2;column3
column1;column2;column3
last line
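I want to convert it to a DataFrame without the first and last lines. I managed to skip the first and last lines, but the rest of the text then ends up in a single row and a single column. How can I split it into proper rows and columns? I also have a schema for the DataFrame: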
var textFile = sc.textFile("*.txt")
val header = textFile.first()
val total = textFile.count()
var rows = textFile.zipWithIndex().filter(x => x._2 < total - 1).map(x => x._1).filter(x => x != header)
val schema = StructType(Array(
StructField("col1", IntegerType, true),
StructField("col2", StringType, true),
StructField("col3", StringType, true),
StructField("col4", StringType, true)
))
You should do the following (commented for clarity):
//creating schema
import org.apache.spark.sql.types._
val schema = StructType(Array(
StructField("col1", StringType, true),
StructField("col2", StringType, true),
StructField("col3", StringType, true)
))
//reading text file and finding total lines
val textFile = sc.textFile("*.txt")
val total = textFile.count()
//indexing lines for filtering the first and the last lines
import org.apache.spark.sql.Row
val rows = textFile.zipWithIndex()
  .filter(x => x._2 != 0 && x._2 < total - 1)
  .map(x => Row.fromSeq(x._1.split(";").toSeq)) //splitting each line on ";" and converting the fields to a Row
//finally creating dataframe
val df = sqlContext.createDataFrame(rows, schema)
df.show(false)
which should give you
+-------+-------+-------+
|col1   |col2   |col3   |
+-------+-------+-------+
|column1|column2|column3|
|column1|column2|column3|
+-------+-------+-------+
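One thing to watch with the split(";") step above: each Row needs as many fields as the schema has columns, and Java's String.split drops trailing empty strings by default. The sketch below is plain Scala with no Spark dependency, so it only illustrates the filtering and splitting logic. The sample lines, including the one with empty trailing fields, and the object name SplitSketch are made up for illustration.

```scala
object SplitSketch {
  // Hypothetical sample standing in for the text file's contents.
  val lines = Seq("first line", "column1;column2;column3", "column1;;", "last line")
  val total = lines.length

  // Same idea as the RDD filter: drop index 0 and the last index.
  val body = lines.zipWithIndex
    .filter { case (_, idx) => idx != 0 && idx < total - 1 }
    .map { case (line, _) => line }

  // Default split discards trailing empty fields; a limit of -1 keeps them.
  val defaultSplit = body.map(_.split(";").length)
  val keepEmpty = body.map(_.split(";", -1).length)

  def main(args: Array[String]): Unit = {
    println(defaultSplit) // prints List(3, 1)
    println(keepEmpty)    // prints List(3, 3)
  }
}
```

If your real data can contain empty trailing columns, using split(";", -1) in the map step is probably the safer choice, since a Row with fewer fields than the schema is likely to fail once the DataFrame is evaluated.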