使用火花数据帧转置



我正试图弄清楚如何使用spark数据帧来解决这个用例。

在下面的谷歌表中,我有源数据,人们回答的调查问题将存储在其中。此外,问题栏大约将超过1000栏,并且更具动态性,而不是固定的。

有一个元数据表,它解释了这个问题、它的描述以及它可以包含的选项。

输出表应该和我在表格中提到的一样。关于如何实现这一目标,有什么建议或想法吗?

https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0

假设您的主表名为df:

+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz      |xyz        |john       |1     |2     |1     |
|abc      |abc        |foo        |3     |1     |1     |
|def      |def        |bar        |2     |2     |2     |
+---------+-----------+-----------+------+------+------+

映射表称为df2:

+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text      |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102     |Gender       |What is your gender|1        |Male       |
|Q1D102     |Gender       |What is your gender|2        |Female     |
|Q1D102     |Gender       |What is your gender|3        |Diverse    |
|Q1D103     |Age          |What is your age   |1        |20 - 50    |
|Q1D103     |Age          |What is your age   |2        |50 >       |
|Q1D105     |work_status  |Do you work        |1        |Yes        |
|Q1D105     |work_status  |Do you work        |2        |No         |
+-----------+-------------+-------------------+---------+-----------+

我们可以构建一个动态的unpivot表达式,如下所示:

val columns = df.columns.filter(c => c.startsWith("Q1D"))
val data = columns.map(c => s"'$c', $c").mkString(",")
val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"

对于3个问题,我们得到以下表达式(Q1D102Q1D103Q1D105(:stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)

最后,我们使用构造的变量:

df = df
.selectExpr("survey_id", "response_id", "person_name", finalExpr)
.join(df2, Seq("question_id", "choice_id"), "left")

你会得到这样的结果:

+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text      |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102     |1        |xyz      |xyz        |john       |Gender       |What is your gender|Male       |
|Q1D102     |2        |def      |def        |bar        |Gender       |What is your gender|Female     |
|Q1D102     |3        |abc      |abc        |foo        |Gender       |What is your gender|Diverse    |
|Q1D103     |1        |abc      |abc        |foo        |Age          |What is your age   |20 - 50    |
|Q1D103     |2        |xyz      |xyz        |john       |Age          |What is your age   |50 >       |
|Q1D103     |2        |def      |def        |bar        |Age          |What is your age   |50 >       |
|Q1D105     |1        |xyz      |xyz        |john       |work_status  |Do you work        |Yes        |
|Q1D105     |1        |abc      |abc        |foo        |work_status  |Do you work        |Yes        |
|Q1D105     |2        |def      |def        |bar        |work_status  |Do you work        |No         |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+

我认为这正是你所需要的(只是无序的(,祝你好运!

最新更新