所以我有这个数据帧hotel_weather_top_ten
:
----------+-----+--------+----------+
| id|month|abs_diff|row_number|
+-------------+-----+--------+----------+
| 996432412678| 8| 9.0| 1|
| 970662608897| 8| 9.0| 2|
| 730144440321| 8| 9.0| 3|
| 824633720837| 8| 9.0| 4|
| 953482739712| 8| 8.0| 5|
|1013612281861| 8| 8.0| 6|
|1288490188802| 8| 8.0| 7|
| 429496729601| 8| 8.0| 8|
| 429496729604| 8| 8.0| 9|
|1262720385026| 8| 8.0| 10|
| 146028888072| 9| 19.0| 1|
|1571958030336| 9| 19.0| 2|
| 146028888068| 9| 16.0| 3|
| 25769803779| 9| 15.0| 4|
|1099511627777| 9| 13.0| 5|
| 206158430212| 9| 13.0| 6|
|1262720385024| 9| 13.0| 7|
... and so on
基本上,它只显示每个酒店每个月的前10个abs_diff
值。
现在我想加入hotel_weather_names
,所以我有了酒店名称:
+-------------+--------------------+
| id| name|
+-------------+--------------------+
| 858993459206| 2608 E Malone Ave|
|1400159338497| 345 Chestnut St|
| 463856467968| 45400 Park Ave|
| 481036337152| 1118 Government St|
|1709396983810|165 Saint Emanuel St|
| 790273982465|1237 Us Highway 4...|
| 231928233986| Maslinica Bay|
| 592705486849| 1223 Radford Blvd|
| 515396075520| 3750 Meridian St|
| 352187318275| 1407 E Rusk St|
|1202590842884| 2030 Formosa Rd|
| 747324309509| 602 E 4th St|
| 695784701955|10821 Caribbean Blvd|
| 644245094400|1725 Long Beach Blvd|
|1176821039105| 15902 S Western Ave|
|1365799600130|3201 E Pacific Co...|
| 309237645313| 8541 S Hampton Rd|
| 678604832769| 221 Shultz Rd|
| 755914244097|1320 Harrisburg Pike|
| 523986010113| 615 Clark Rd|
+-------------+--------------------+
only showing top 20 rows
但不知怎么的,当我加入时,结果并不是我所期望的。例如:
| id|month|abs_diff|row_number| name|
+------------+-----+--------+----------+-----------------+
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|996432412678| 8| 9.0| 1| 736 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|970662608897| 8| 9.0| 2| 1755 Idaho St|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
|730144440321| 8| 9.0| 3|2620 Jennings Way|
+------------+-----+--------+----------+-----------------+
only showing top 20 rows
所以没有前十名了。如何保持订单并相应地添加名称?这就是我加入他们的方式:result = hotel_weather_top_ten.join(hotel_weather_names, on='id', how='inner')
事先谢谢!
您可以通过以下操作来实现。
from pyspark.sql import functions
hotel_weather_top_ten = (hotel_weather_top_ten
.withColumnRenamed("id", "id1")
.withColumn("order", functions.monotonically_increasing_id()))
hotel_weather_top_ten = (hotel_weather_top_ten
.join(hotel_weather_names,
on=hotel_weather_top_ten["id1"] == hotel_weather_names["id"],
how="inner")
.orderBy("order")
.select("id","month","abs_diff","row_number","name")
您得到的输出可能是由于hotel_weather_names
中存在重复的id
和name
映射。应用hotel_weather_names.distinct()
可能是一种潜在的解决方案。