创建Spark Dataframe摘要



我有一个Spark Dataframe,我试图总结,以找到过长的列:

// Set up test data
// Look for long columns (>=3), ie 1 is ok row,, 2 is bad on column 3, 3 is bad on column 2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).
toDF("rowId", "col1", "col2", "col3", "filename")

我可以总结列的长度,找出过长的,像这样:

// Look for long columns (>=3), ie 1 is ok row,, 2 is bad on column 3, 3 is bad on column 2
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)

如果我尝试添加现有的文件名列到映射,我得到一个错误:

val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)

我已经尝试了$"filename"语法的一些变体。如何将filename列合并到摘要中?

tbody> <<tr>
columnNamemaxLengthfilename
col24file3
col33file2

@wBob是否达到了您的目标?

  1. 按文件名分组并获得每列的最大值:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
.sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()`

输出:

+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
|   file1|    1|   1|   2|   2|
|   file2|    1|   1|   2|   3|
|   file3|    1|   1|   4|   2|
+--------+-----+----+----+----+
  1. 使用子查询获取每列的最大值,并使用union将结果连接起来:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName,  $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()

结果:

+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
|     rowId|        1|   file1|
|     rowId|        1|   file2|
|     rowId|        1|   file3|
|      col1|        1|   file1|
|      col1|        1|   file2|
|      col1|        1|   file3|
|      col2|        4|   file3|
|      col3|        3|   file2|
+----------+---------+--------+

注意rowIdcol1有多个条目,因为最大值不是唯一的。

可能有一种更优雅的方式来写它,但我现在正在努力寻找一个。

再推一点,效果更好。

df.select(  
col("*"), 
array( // make array of columns name/value/length
(for{ col_name <- df.columns  } yield 
struct(
length(col(col_name)).as("length"),
lit(col_name).as("col"),
col(col_name).cast("String").as("col_value")
)  
).toSeq:_* ).alias("rowInfo") 
)
.select(
col("rowId"),
explode( // explode array into rows
expr("filter(rowInfo, x -> x.length >= 3)") //filter the array for the length your interested in
).as("rowInfo") 
)
.select(
col("rowId"),
col("rowInfo.*") // turn struct fields into columns
)
.sort("length").show
+-----+------+--------+---------+
|rowId|length|     col|col_value|
+-----+------+--------+---------+
|    2|     3|    col3|      fff|
|    3|     4|    col2|     hhhh|
|    3|     5|filename|    file3|
|    1|     5|filename|    file1|
|    2|     5|filename|    file2|
+-----+------+--------+---------+

按总文本长度对表进行排序可能就足够了。这可以快速而简洁地实现。

df.select( 
col("*"), 
length( // take the length
concat(   //slap all the columns together
(for( col_name <- df.columns ) yield col(col_name)).toSeq:_*  
)
)
.as("length") 
)
.sort( //order by total length
col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
|    3|   g|hhhh|  ii|   file3|    13|
|    2|   d|  ee| fff|   file2|    12|
|    1|   a|  bb|  cc|   file1|    11|
+-----+----+----+----+--------+------+

对数组[struct]进行排序,首先对第一个字段进行排序,然后对第二个字段进行排序。当我们把刺的大小放在前面时,这个方法就有效了。如果对字段重新排序,就会得到不同的结果。如果你愿意,你可以很容易地接受多个结果,但我认为发现一行是具有挑战性的,可能已经足够了。

df.select(  
col("*"), 
reverse( //sort ascending
sort_array( //sort descending
array( // add all columns lengths to an array
(for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
)
)(0) // grab the row max
.alias("rowMax") )
.sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename|              rowMax|
+-----+----+----+----+--------+--------------------+
|    1|   a|  bb|  cc|   file1|[5, filename, file1]|
|    2|   d|  ee| fff|   file2|[5, filename, file2]|
|    3|   g|hhhh|  ii|   file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+

最新更新