PySpark: setting an alias, or restricting the selection, for identically named columns in a join



I perform a left join, selecting columns from the dataframes:

temp_join = ldt_ffw_course_attendee[["languages_id", "course_attendee_status",
                                     "course_attendee_completed_flag",
                                     "course_video_id", "mem_id", "course_id"]] \
    .join(languages[["languages_id"]],
          ldt_ffw_course_attendee.languages_id == languages.languages_id,
          "left")

Printing the columns stored in temp_join:

for col in temp_join.dtypes:
    print(col[0] + " , " + col[1])

languages_id , int
course_attendee_status , int
course_attendee_completed_flag , int
course_video_id , int
mem_id , int
course_id , int
languages_id , int

How can I create an alias for languages_id in either dataframe? Or, how can I restrict the selection so that languages_id is taken from only one of the dataframes?

You can use .alias() to name your dataframes:

df1 = spark.createDataFrame([('a', 'b')], schema=['col1', 'col2'])
df2 = spark.createDataFrame([('a', 'c')], schema=['col1', 'col2'])
df3 = df1.alias('df1').join(df2.alias('df2'), on='col1', how='inner')
df3.printSchema()
df3.show(1, False)
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col2: string (nullable = true)
+----+----+----+
|col1|col2|col2|
+----+----+----+
|a   |b   |c   |
+----+----+----+

When querying, you can then reference a column as df.column_name:
df3.select(
    'df1.col1',
    'df1.col2',
    'df2.col2'
).show(3, False)
+----+----+----+
|col1|col2|col2|
+----+----+----+
|a   |b   |c   |
+----+----+----+
