我正试图弄清楚如何使用spark数据帧来解决这个用例。
在下面的谷歌表中,我有源数据,人们回答的调查问题将存储在其中。此外,问题栏大约将超过1000栏,并且更具动态性,而不是固定的。
有一个元数据表,它解释了这个问题、它的描述以及它可以包含的选项。
输出表应该和我在表格中提到的一样。关于如何实现这一目标,有什么建议或想法吗?
https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0
假设您的主表名为df
:
+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz |xyz |john |1 |2 |1 |
|abc |abc |foo |3 |1 |1 |
|def |def |bar |2 |2 |2 |
+---------+-----------+-----------+------+------+------+
映射表称为df2
:
+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102 |Gender |What is your gender|1 |Male |
|Q1D102 |Gender |What is your gender|2 |Female |
|Q1D102 |Gender |What is your gender|3 |Diverse |
|Q1D103 |Age |What is your age |1 |20 - 50 |
|Q1D103 |Age |What is your age |2 |50 > |
|Q1D105 |work_status |Do you work |1 |Yes |
|Q1D105 |work_status |Do you work |2 |No |
+-----------+-------------+-------------------+---------+-----------+
我们可以构建一个动态的unpivot表达式,如下所示:
val columns = df.columns.filter(c => c.startsWith("Q1D"))
val data = columns.map(c => s"'$c', $c").mkString(",")
val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"
对于3个问题,我们得到以下表达式(Q1D102
、Q1D103
和Q1D105
(:stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)
最后,我们使用构造的变量:
df = df
.selectExpr("survey_id", "response_id", "person_name", finalExpr)
.join(df2, Seq("question_id", "choice_id"), "left")
你会得到这样的结果:
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102 |1 |xyz |xyz |john |Gender |What is your gender|Male |
|Q1D102 |2 |def |def |bar |Gender |What is your gender|Female |
|Q1D102 |3 |abc |abc |foo |Gender |What is your gender|Diverse |
|Q1D103 |1 |abc |abc |foo |Age |What is your age |20 - 50 |
|Q1D103 |2 |xyz |xyz |john |Age |What is your age |50 > |
|Q1D103 |2 |def |def |bar |Age |What is your age |50 > |
|Q1D105 |1 |xyz |xyz |john |work_status |Do you work |Yes |
|Q1D105 |1 |abc |abc |foo |work_status |Do you work |Yes |
|Q1D105 |2 |def |def |bar |work_status |Do you work |No |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
我认为这正是你所需要的(只是无序的(,祝你好运!