Pyspark Join Tables

我是Pyspark中的新手。我有"表A"one_answers"表B"，我需要加入两个以获取"表C"。谁能帮助我？

我正在使用dataframes ...

我不知道如何以正确的方式加入这些桌子...

表A ：

+--+----------+-----+    
|id|year_month| qt  |
+--+----------+-----+
| 1|   2015-05| 190 |
| 2|   2015-06| 390 |
+--+----------+-----+

表B ：

+---------+-----+
year_month| sem |
+---------+-----+
|  2016-01| 1   |
|  2015-02| 1   |
|  2015-03| 1   |
|  2016-04| 1   |
|  2015-05| 1   |
|  2015-06| 1   |
|  2016-07| 2   |
|  2015-08| 2   |
|  2015-09| 2   |
|  2016-10| 2   |
|  2015-11| 2   |
|  2015-12| 2   |
+---------+-----+

表C ：加入添加列，还添加行...

+--+----------+-----+-----+    
|id|year_month| qt  | sem |
+--+----------+-----+-----+
| 1| 2015-05  | 0   | 1   | 
| 1| 2016-01  | 0   | 1   |
| 1| 2015-02  | 0   | 1   |
| 1| 2015-03  | 0   | 1   |
| 1| 2016-04  | 0   | 1   |
| 1| 2015-05  | 190 | 1   |
| 1| 2015-06  | 0   | 1   | 
| 1| 2016-07  | 0   | 2   |
| 1| 2015-08  | 0   | 2   |
| 1| 2015-09  | 0   | 2   |
| 1| 2016-10  | 0   | 2   |
| 1| 2015-11  | 0   | 2   |
| 1| 2015-12  | 0   | 2   |
| 2| 2015-05  | 0   | 1   | 
| 2| 2016-01  | 0   | 1   |
| 2| 2015-02  | 0   | 1   |
| 2| 2015-03  | 0   | 1   |
| 2| 2016-04  | 0   | 1   |
| 2| 2015-05  | 0   | 1   |
| 2| 2015-06  | 390 | 1   | 
| 2| 2016-07  | 0   | 2   |
| 2| 2015-08  | 0   | 2   |
| 2| 2015-09  | 0   | 2   |
| 2| 2016-10  | 0   | 2   |
| 2| 2015-11  | 0   | 2   |
| 2| 2015-12  | 0   | 2   |
+--+----------+-----+-----+

代码：

from pyspark import HiveContext
sqlContext =  HiveContext(sc)
lA = [(1,"2015-05",190),(2,"2015-06",390)]
tableA = sqlContext.createDataFrame(lA, ["id","year_month","qt"])
tableA.show()
lB = [("2016-01",1),("2015-02",1),("2015-03",1),("2016-04",1),
      ("2015-05",1),("2015-06",1),("2016-07",2),("2015-08",2),
      ("2015-09",2),("2016-10",2),("2015-11",2),("2015-12",2)]
tableB = sqlContext.createDataFrame(lB,["year_month","sem"])
tableB.show()

它不是真正的join更多的笛卡尔产品（cross join）

Spark 2

import pyspark.sql.functions as psf
tableA.crossJoin(tableB)
    .withColumn(
        "qt", 
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))
    .drop(tableA.year_month)

Spark 1.6

tableA.join(tableB)
    .withColumn(
        "qt", 
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))
    .drop(tableA.year_month)
+---+---+----------+---+
| id| qt|year_month|sem|
+---+---+----------+---+
|  1|  0|   2015-02|  1|
|  1|  0|   2015-03|  1|
|  1|190|   2015-05|  1|
|  1|  0|   2015-06|  1|
|  1|  0|   2016-01|  1|
|  1|  0|   2016-04|  1|
|  1|  0|   2015-08|  2|
|  1|  0|   2015-09|  2|
|  1|  0|   2015-11|  2|
|  1|  0|   2015-12|  2|
|  1|  0|   2016-07|  2|
|  1|  0|   2016-10|  2|
|  2|  0|   2015-02|  1|
|  2|  0|   2015-03|  1|
|  2|  0|   2015-05|  1|
|  2|390|   2015-06|  1|
|  2|  0|   2016-01|  1|
|  2|  0|   2016-04|  1|
|  2|  0|   2015-08|  2|
|  2|  0|   2015-09|  2|
|  2|  0|   2015-11|  2|
|  2|  0|   2015-12|  2|
|  2|  0|   2016-07|  2|
|  2|  0|   2016-10|  2|
+---+---+----------+---+

相关内容

最新更新

热门标签：