如何使用Scala删除数据帧火花中的重复行



我想从连接2个数据帧的结果中删除重复项,如下所示。有没有更好的方法来加入或删除重复项,这样我就可以实现所需的数据帧?

我有如下数据帧A(df_A(:

+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_A     |1ST_NM_A|2ND_NM_A|ADDRESS1_A             |ADDRESS2_A|CITY_A  |ZIP_A    |PHONE_A   |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS  |802035401|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS  |802053436|9232122398|
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS  |862365146|9232122398|
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+

以及数据帧B(df_B(如下:

+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_B     |1ST_NM_B|2ND_NM_B|ADDRESS1_B             |ADDRESS2_B|CITY_B  |ZIP_B    |PHONE_B   |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS  |862365146|9232122398|
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS  |802035401|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS  |802053436|9232122398|
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+

我有数百万条记录,这个样本只是一条ID为134305632的记录。正如您所看到的,两个数据帧都包含相同的数据。但是,给定ID的两个数据帧中的数据可能不相同。对于这个特定的ID 来说是一样的

我想要的结果如下:

+---------+--------+--------+-----------------------+----------+--------+---------+----------+-------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_A     |1ST_NM_A|2ND_NM_A|ADDRESS1_A             |ADDRESS2_A|CITY_A  |ZIP_A    |PHONE_A   |ID_B   |1ST_NM_B|2ND_NM_B|ADDRESS1_B             |ADDRESS2_B|CITY_B  |ZIP_B    |PHONE_B   |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+-------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS |862365146|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS   |862365146|9232122398|
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA       |WESTPORT|530298435|9232122398|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS |802035401|9232122398|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS    |802035401|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS |802053436|9232122398|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS    |802053436|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-----------------------+----------+--------+---------+----------+

我执行了连接操作并进行了dropDuplicates,我得到了下面的结果,这不是我想要的结果。

val dedupped_df = df_A.join(df_B, col("ID_A") === col("ID_B"), "inner")
.dropDuplicates("ID_A", "ADDRESS1_A", "ADDRESS2_A", "CITY_A", "ZIP_A")
dedupped_df.show()
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-------------+----------+------+---------+----------+
|ID_A     |1ST_NM_A|2ND_NM_A|ADDRESS1_A             |ADDRESS2_A|CITY_A  |ZIP_A    |PHONE_A   |ID_B    |1ST_NM_B|2ND_NM_B|ADDRESS1_B   |ADDRESS2_B|CITY_B|ZIP_B    |PHONE_B   |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-------------+----------+------+---------+----------+
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA       |WESTPORT|530298435|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST|NA        |DALLAS|862365146|9232122398|
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS |802035401|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST|NA        |DALLAS|862365146|9232122398|
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS |862365146|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST|NA        |DALLAS|862365146|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS |802053436|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST|NA        |DALLAS|862365146|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-------------+----------+------+---------+----------+

以下是加入后删除重复之前的结果数据帧

+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_A     |1ST_NM_A|2ND_NM_A|ADDRESS1_A             |ADDRESS2_A|CITY_A  |ZIP_A    |PHONE_A   |ID_B    |1ST_NM_B|2ND_NM_B|ADDRESS1_B             |ADDRESS2_B|CITY_B  |ZIP_B    |PHONE_B   |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS |802035401|9232122398|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS |802035401|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS  |862365146|9232122398|
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS |802035401|9232122398|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS  |802053436|9232122398|
|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS |802035401|9232122398|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS  |802035401|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS |802053436|9232122398|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS |802053436|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS  |862365146|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS |802053436|9232122398|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS  |802053436|9232122398|
|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS |802053436|9232122398|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS  |802035401|9232122398|
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS |862365146|9232122398|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS |862365146|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS  |862365146|9232122398|
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS |862365146|9232122398|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS  |802053436|9232122398|
|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS |862365146|9232122398|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS  |802035401|9232122398|
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|134305632|BAVAL   |WELLS   |786 BULIAN ST          |NA        |DALLAS  |862365146|9232122398|
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|134305632|BAVAL   |WELLS   |8898 N SIMS ST         |NA        |DALLAS  |802053436|9232122398|
|134305632|BAVAL   |WELLS   |324 DROOP STREET APT 33|NA        |WESTPORT|530298435|9232122398|134305632|BAVAL   |WELLS   |143 N 24TH BLV         |NA        |DALLAS  |802035401|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+------ --+--------+--------+-----------------------+----------+--------+---------+----------+
df.dropDuplicates()
// also you can set columns

最新更新