How do I join two dataframes without messing up the order? (PySpark SQL)



So I have this dataframe, hotel_weather_top_ten:

+-------------+-----+--------+----------+
|           id|month|abs_diff|row_number|
+-------------+-----+--------+----------+
| 996432412678|    8|     9.0|         1|
| 970662608897|    8|     9.0|         2|
| 730144440321|    8|     9.0|         3|
| 824633720837|    8|     9.0|         4|
| 953482739712|    8|     8.0|         5|
|1013612281861|    8|     8.0|         6|
|1288490188802|    8|     8.0|         7|
| 429496729601|    8|     8.0|         8|
| 429496729604|    8|     8.0|         9|
|1262720385026|    8|     8.0|        10|
| 146028888072|    9|    19.0|         1|
|1571958030336|    9|    19.0|         2|
| 146028888068|    9|    16.0|         3|
|  25769803779|    9|    15.0|         4|
|1099511627777|    9|    13.0|         5|
| 206158430212|    9|    13.0|         6|
|1262720385024|    9|    13.0|         7|
... and so on

Basically, it shows the ten largest abs_diff values (one row per hotel) for each month.
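(For reference, a top-N-per-group dataframe like this is typically built with a window function. A minimal sketch, assuming a hypothetical source dataframe hotel_weather with id, month, and abs_diff columns:)

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each month by abs_diff, largest first,
# then keep the ten highest-ranked rows per month.
w = Window.partitionBy("month").orderBy(F.desc("abs_diff"))
hotel_weather_top_ten = (hotel_weather
    .withColumn("row_number", F.row_number().over(w))
    .filter(F.col("row_number") <= 10))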

Now I want to join in hotel_weather_names, so that I also have the hotel names:

+-------------+--------------------+
|           id|                name|
+-------------+--------------------+
| 858993459206|   2608 E Malone Ave|
|1400159338497|     345 Chestnut St|
| 463856467968|      45400 Park Ave|
| 481036337152|  1118 Government St|
|1709396983810|165 Saint Emanuel St|
| 790273982465|1237 Us Highway 4...|
| 231928233986|       Maslinica Bay|
| 592705486849|   1223 Radford Blvd|
| 515396075520|    3750 Meridian St|
| 352187318275|      1407 E Rusk St|
|1202590842884|     2030 Formosa Rd|
| 747324309509|        602 E 4th St|
| 695784701955|10821 Caribbean Blvd|
| 644245094400|1725 Long Beach Blvd|
|1176821039105| 15902 S Western Ave|
|1365799600130|3201 E Pacific Co...|
| 309237645313|   8541 S Hampton Rd|
| 678604832769|       221 Shultz Rd|
| 755914244097|1320 Harrisburg Pike|
| 523986010113|        615 Clark Rd|
+-------------+--------------------+
only showing top 20 rows

But somehow, when I join them, the result is not what I expect. For example:

+------------+-----+--------+----------+-----------------+
|          id|month|abs_diff|row_number|             name|
+------------+-----+--------+----------+-----------------+
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|996432412678|    8|     9.0|         1|     736 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|970662608897|    8|     9.0|         2|    1755 Idaho St|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
|730144440321|    8|     9.0|         3|2620 Jennings Way|
+------------+-----+--------+----------+-----------------+
only showing top 20 rows

So there is no top ten anymore. How can I keep the order and add the names accordingly? This is how I join them: result = hotel_weather_top_ten.join(hotel_weather_names, on='id', how='inner')

Thanks in advance!

You can achieve this as follows: tag each row with monotonically_increasing_id() before the join, so the original row order can be restored afterwards.

from pyspark.sql import functions

# Rename "id" so the two join keys stay distinguishable, and tag every row
# with a monotonically increasing id that records the current row order.
hotel_weather_top_ten = (hotel_weather_top_ten
    .withColumnRenamed("id", "id1")
    .withColumn("order", functions.monotonically_increasing_id()))

# Join on the renamed key, restore the original order,
# and keep only the columns of interest.
hotel_weather_top_ten = (hotel_weather_top_ten
    .join(hotel_weather_names,
          on=hotel_weather_top_ten["id1"] == hotel_weather_names["id"],
          how="inner")
    .orderBy("order")
    .select("id", "month", "abs_diff", "row_number", "name"))
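Note that monotonically_increasing_id() does not produce consecutive numbers, but the generated ids do increase with the dataframe's current row order, so sorting by them restores that order after the join. Alternatively, since the desired order is already encoded in the month and row_number columns, you could skip the helper column and simply re-sort after the join. A minimal sketch, assuming the original (un-renamed) dataframes:

result = (hotel_weather_top_ten
    .join(hotel_weather_names, on="id", how="inner")
    .orderBy("month", "row_number"))

This sorts on real data columns rather than row position, so the result is deterministic even after shuffles.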

The output you are getting is most likely caused by duplicate id → name mappings in hotel_weather_names: each matching row in the lookup table produces one output row, which is why every top-ten row appears several times. Applying hotel_weather_names.distinct() may be a potential solution.
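A minimal sketch of that fix (the names hotel_weather_names_dedup and result are mine, not from the original post; if the same id can appear with different name values, dropDuplicates on the id column keeps one row per id instead):

# Keep one row per distinct (id, name) pair before joining.
hotel_weather_names_dedup = hotel_weather_names.select("id", "name").distinct()

# If one id can map to several names, keep a single (arbitrary) row per id:
# hotel_weather_names_dedup = hotel_weather_names.dropDuplicates(["id"])

result = hotel_weather_top_ten.join(hotel_weather_names_dedup, on="id", how="inner")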
