PySpark: subtracting DataFrames while ignoring some columns

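I want to subtract one DataFrame from another in PySpark. The challenge is that some columns must be ignored during the subtraction, yet the resulting DataFrame should still contain all columns, including the ignored ones.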

Here is an example:

from pyspark.sql import Row

# Assumes an active SparkContext named sc (as in the pyspark shell).
userLeft = sc.parallelize([
    Row(id=u'1', 
        first_name=u'Steve', 
        last_name=u'Kent', 
        email=u's.kent@email.com',
        date1=u'2017-02-08'),
    Row(id=u'2', 
        first_name=u'Margaret', 
        last_name=u'Peace', 
        email=u'marge.peace@email.com',
        date1=u'2017-02-09'),
    Row(id=u'3', 
        first_name=None, 
        last_name=u'hh', 
        email=u'marge.hh@email.com',
        date1=u'2017-02-10')
]).toDF()
userRight = sc.parallelize([
    Row(id=u'2', 
        first_name=u'Margaret', 
        last_name=u'Peace', 
        email=u'marge.peace@email.com',
        date1=u'2017-02-11'),
    Row(id=u'3', 
        first_name=None, 
        last_name=u'hh', 
        email=u'marge.hh@email.com',
        date1=u'2017-02-12')
]).toDF()

Expected:

ActiveDF = userLeft.subtract(userRight)  # ignore the "date1" column while subtracting

The final result should look like this, including the "date1" column:

+----------+--------------------+----------+---+---------+
|     date1|               email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08|    s.kent@email.com|     Steve|  1|     Kent|
+----------+--------------------+----------+---+---------+

It seems you need an anti-join:

userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+  
|     date1|           email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent@email.com|     Steve|  1|     Kent|
+----------+----------------+----------+---+---------+

You can also use a full join and keep only the rows with null values:

import pyspark.sql.functions as psf

# Full outer join on every column except date1; unmatched rows have a null date1 on one side.
userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
|             email|first_name| id|last_name|     date1|     date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh@email.com|      null|  3|       hh|2017-02-10|      null|
|marge.hh@email.com|      null|  3|       hh|      null|2017-02-12|
|  s.kent@email.com|     Steve|  1|     Kent|2017-02-08|      null|
+------------------+----------+---+---------+----------+----------+

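Whichever join you use, leftanti or full, you need a default value for the nulls in the join columns (I think we discussed this in a previous thread). Null never compares equal to null in a join condition, which is why the id 3 rows in the full join output appear as unmatched even though they agree on every column except date1.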
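
As an alternative to filling nulls with a default value, the join condition itself can be made null-safe. The following is only a sketch, not part of the original answer: the helper name anti_join_ignoring is hypothetical, and it assumes Spark 2.3 or later (for Column.eqNullSafe), at least one compared column, and two DataFrames that are not derived from one another.

```python
from functools import reduce


def anti_join_ignoring(left, right, ignored):
    """Rows of `left` with no match in `right`, comparing every column except `ignored`.

    Hypothetical helper for illustration; `left` and `right` are PySpark DataFrames.
    """
    # Columns that take part in the comparison (assumed to be non-empty).
    compared = [c for c in left.columns if c not in ignored]
    # eqNullSafe is null-safe equality (<=> in Spark SQL): null <=> null is true.
    condition = reduce(
        lambda acc, cond: acc & cond,
        [left[c].eqNullSafe(right[c]) for c in compared],
    )
    # A left anti join returns only the columns of `left`, so ignored columns are kept.
    return left.join(right, condition, "leftanti")
```

With the DataFrames from the question, anti_join_ignoring(userLeft, userRight, ["date1"]).show() should leave only the id 1 row, since the id 3 rows now match despite the null first_name.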

You can also simply drop the columns that get in the way, subtract, and then join back:

df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
+----------------+----------+---+---------+----------+
|           email|first_name| id|last_name|     date1|
+----------------+----------+---+---------+----------+
|s.kent@email.com|     Steve|  1|     Kent|2017-02-08|
+----------------+----------+---+---------+----------+
