需要帮助自动化下面的 Spark 逻辑以在 python 中获取列详细信息



我需要帮助来自动化下面的代码,以便它应该自动从表中选择标题并将所有标题列转换为行,然后需要执行基本聚合(最大值,最小值。等(

任何线索都会有很大帮助。

下面我从表中手动传递列名,从表中选择标题并将其转换为列。

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = sc.parallelize([ 
Row(name='Alice', age=5, height=80), 
Row(name='Mujkesh', age=10, height=90), 
Row(name='Ganesh', age=15, height=100)]).toDF()
df.createOrReplaceTempView("Test")
df3 = spark.sql("desc Test" )
df4=df.selectExpr("stack(3,'name',bigint(name),'age',bigint(age),'height',bigint(height)) as (col_name,data)").groupBy(col("col_name")).agg(max(col("data")).alias("Max_val"),min(col("data")).alias("Min_val"))
df5=df3.join(df4,['col_name'],'inner').orderBy("col_name").drop(col("comment"))
df5.show()
Input Data :
+---+------+-------+
|age|height|   name|
+---+------+-------+
|  5|    80|  Alice|
| 10|    90|Mujkesh|
| 15|   100| Ganesh|
+---+------+-------+
output DATA: 
+--------+---------+-------+-------+
|col_name|data_type|Max_val|Min_val|
+--------+---------+-------+-------+
|     age|   bigint|     15|      5|
|  height|   bigint|    100|     80|
|    name|   string|   null|   null|
+--------+---------+-------+-------+

您可以使用 -获取列的名称 - spark_df.schema.names

您也可以尝试在该特定列上使用透视。

下面的代码是用scala编写的。

scala> df.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = false)
|-- height: integer (nullable = false)
scala> df.show(false)
+-------+---+------+
|name   |age|height|
+-------+---+------+
|Alice  |5  |80    |
|Mujkesh|10 |90    |
|Ganesh |15 |100   |
+-------+---+------+
scala> 
val expr = array(
df
.schema
.map(c => (c.name,c.dataType.typeName)) // column name & its data type
.map(c => 
struct(
lit(c._1).as("column_name"),
if(c._2 != "string") max(col(c._1)).as("max") else lit(null).cast("int").as("max"),
if(c._2 != "string") min(col(c._1)).as("min") else lit(null).cast("int").as("min"),
lit(c._2).as("datatype")
)
):_*
)
scala> df
.select(expr.as("summary"))
.select(explode($"summary").as("summary"))
.select("summary.*")
.show(false)
+-----------+----+----+--------+
|column_name|max |min |datatype|
+-----------+----+----+--------+
|name       |null|null|string  |
|age        |15  |5   |integer |
|height     |100 |80  |integer |
+-----------+----+----+--------+

最新更新