我想从连接2个数据帧的结果中删除重复项,如下所示。有没有更好的方法来加入或删除重复项,这样我就可以实现所需的数据帧?
我有如下数据帧A(df_A(:
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_A |1ST_NM_A|2ND_NM_A|ADDRESS1_A |ADDRESS2_A|CITY_A |ZIP_A |PHONE_A |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
以及数据帧B(df_B(如下:
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_B |1ST_NM_B|2ND_NM_B|ADDRESS1_B |ADDRESS2_B|CITY_B |ZIP_B |PHONE_B |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+
我有数百万条记录,这个样本只是一条ID为134305632的记录。正如您所看到的,两个数据帧都包含相同的数据。但是,给定ID的两个数据帧中的数据可能不相同。对于这个特定的ID 来说是一样的
我想要的结果如下:
+---------+--------+--------+-----------------------+----------+--------+---------+----------+-------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_A |1ST_NM_A|2ND_NM_A|ADDRESS1_A |ADDRESS2_A|CITY_A |ZIP_A |PHONE_A |ID_B |1ST_NM_B|2ND_NM_B|ADDRESS1_B |ADDRESS2_B|CITY_B |ZIP_B |PHONE_B |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+-------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-----------------------+----------+--------+---------+----------+
我执行了连接操作并进行了dropDuplicates,我得到了下面的结果,这不是我想要的结果。
val dedupped_df = df_A.join(df_B, col("ID_A") === col("ID_B"), "inner")
.dropDuplicates("ID_A", "ADDRESS1_A", "ADDRESS2_A", "CITY_A", "ZIP_A")
dedupped_df.show()
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-------------+----------+------+---------+----------+
|ID_A |1ST_NM_A|2ND_NM_A|ADDRESS1_A |ADDRESS2_A|CITY_A |ZIP_A |PHONE_A |ID_B |1ST_NM_B|2ND_NM_B|ADDRESS1_B |ADDRESS2_B|CITY_B|ZIP_B |PHONE_B |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-------------+----------+------+---------+----------+
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST|NA |DALLAS|862365146|9232122398|
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST|NA |DALLAS|862365146|9232122398|
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST|NA |DALLAS|862365146|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST|NA |DALLAS|862365146|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-------------+----------+------+---------+----------+
以下是加入后删除重复之前的结果数据帧
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-----------------------+----------+--------+---------+----------+
|ID_A |1ST_NM_A|2ND_NM_A|ADDRESS1_A |ADDRESS2_A|CITY_A |ZIP_A |PHONE_A |ID_B |1ST_NM_B|2ND_NM_B|ADDRESS1_B |ADDRESS2_B|CITY_B |ZIP_B |PHONE_B |
+---------+--------+--------+-----------------------+----------+--------+---------+----------+--------+--------+--------+-----------------------+----------+--------+---------+----------+
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|
|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|
|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|
|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|134305632|BAVAL |WELLS |786 BULIAN ST |NA |DALLAS |862365146|9232122398|
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|134305632|BAVAL |WELLS |8898 N SIMS ST |NA |DALLAS |802053436|9232122398|
|134305632|BAVAL |WELLS |324 DROOP STREET APT 33|NA |WESTPORT|530298435|9232122398|134305632|BAVAL |WELLS |143 N 24TH BLV |NA |DALLAS |802035401|9232122398|
+---------+--------+--------+-----------------------+----------+--------+---------+----------+------ --+--------+--------+-----------------------+----------+--------+---------+----------+
df.dropDuplicates()
// also you can set columns