我正在尝试读取Spark 2.4中的.txt文件并将其加载到数据帧中。文件数据看起来像:-
在一个经理的领导下,有许多员工
Manager_21: Employee_575,Employee_2703,
Manager_11: Employee_454,Employee_158,
Manager_4: Employee_1545,Employee_1312
我用Scala Spark 2.4编写的代码:-
val df = spark.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.load("D:/path/myfile.txt")
df.printSchema()
不幸的是,在打印模式时,在单个Manager_21下所有Employee都可见。
root
|-- Manager_21: servant_575: string (nullable = true)
|-- Employee_454: string (nullable = true)
|-- Employee_1312 string (nullable = true)
……等
我不确定这在星火标量中是否可行。。。。
预期输出:
同一列中经理的所有员工例如:21经理有2名员工,他们都在同一列。或者,我们如何才能看到哪个员工都在某个特定的经理手下。
Manager_21 |Manager_11 |Manager_4
Employee_575 |Employee_454 |Employee_1545
Employee_2703|Employee_158|Employee_1312
有其他办法吗。。。。。请推荐
感谢
尝试使用spark.read.text
,然后使用groupBy
和.pivot
来获得所需的结果。
Example:
val df=spark.read.text("<path>")
df.show(10,false)
//+--------------------------------------+
//|value |
//+--------------------------------------+
//|Manager_21: Employee_575,Employee_2703|
//|Manager_11: Employee_454,Employee_158 |
//|Manager_4: Employee_1545,Employee_1312|
//+--------------------------------------+
import org.apache.spark.sql.functions._
df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))").
select("col.*").
show()
//+-------------+-------------+--------------+
//| Manager_11| Manager_21| Manager_4|
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//| Employee_158|Employee_2703| Employee_1312|
//+-------------+-------------+--------------+
UPDATE:
val df=spark.read.text("<path>")
val df1=df.withColumn("mid",monotonically_increasing_id).
withColumn("col1",split(col("value"),":")(0)).
withColumn("col2",split(split(col("value"),":")(1),",")).
groupBy("mid").
pivot(col("col1")).
agg(min(col("col2"))).
select(max("Manager_11").alias("Manager_11"),max("Manager_21").alias("Manager_21") ,max("Manager_4").alias("Manager_4")).
selectExpr("explode(arrays_zip(Manager_11,Manager_21,Manager_4))")
//create temp table
df1.createOrReplaceTempView("tmp_table")
sql("select col.* from tmp_table").show(10,false)
//+-------------+-------------+--------------+
//|Manager_11 |Manager_21 |Manager_4 |
//+-------------+-------------+--------------+
//| Employee_454| Employee_575| Employee_1545|
//|Employee_158 |Employee_2703|Employee_1312 |
//+-------------+-------------+--------------+