I am performing a left join, selecting columns from my dataframes:
temp_join = ldt_ffw_course_attendee[["languages_id", "course_attendee_status",
                                     "course_attendee_completed_flag",
                                     "course_video_id", "mem_id", "course_id"]] \
    .join(languages[["languages_id"]],
          ldt_ffw_course_attendee.languages_id == languages.languages_id,
          "left")
Printing the schema stored in temp_join:
for col in temp_join.dtypes:
    print(col[0] + " , " + col[1])
languages_id , int
course_attendee_status , int
course_attendee_completed_flag , int
course_video_id , int
mem_id , int
course_id , int
languages_id , int
How can I create an alias for languages_id in either dataframe? Alternatively, how can I restrict the select so that languages_id comes from only one of the dataframes?
You can use .alias() to name your dataframes:
df1 = spark.createDataFrame([('a', 'b')], schema=['col1', 'col2'])
df2 = spark.createDataFrame([('a', 'c')], schema=['col1', 'col2'])
df3 = df1.alias('df1').join(df2.alias('df2'), on='col1', how='inner')
df3.printSchema()
df3.show(1, False)
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col2: string (nullable = true)
+----+----+----+
|col1|col2|col2|
+----+----+----+
|a |b |c |
+----+----+----+
When querying, you can then refer to a specific column as 'alias.column_name':
df3.select(
'df1.col1',
'df1.col2',
'df2.col2'
).show(3, False)
+----+----+----+
|col1|col2|col2|
+----+----+----+
|a |b |c |
+----+----+----+