我有一个名为teams的表,其中有一列名为team_name
CHN
MUM
HYD
BNG
我想得到以下结果。
scala> val df = hc.sql("""select a.team_name,b.team_name from
| (select team_name,'dummy' as join_key from landing.teams)a
| inner join
| (select team_name,'dummy' as join_key from landing.teams)b
| on(a.join_key = b.join_key)
| where a.team_name != b.team_name
| """)
df: org.apache.spark.sql.DataFrame = [team_name: string, team_name: string]
scala> df.show(100,false)
+---------+---------+
|team_name|team_name|
+---------+---------+
|MUM |CHN |
|MUM |BNG |
|MUM |HYD |
|CHN |BNG |
|CHN |HYD |
|CHN |MUM |
|BNG |CHN |
|BNG |HYD |
|BNG |MUM |
|HYD |CHN |
|HYD |BNG |
|HYD |MUM |
+---------+---------+
我能够实现这个解决方案。但是,是否可以在不使用我在上面使用查询的伪列的情况下获得相同的结果。请帮助
您只需要执行一个数据帧交叉连接。
构建数据帧:
>>> from pyspark.sql.types import (
... StringType,
... StructField,
... StructType
... )
>>>
>>> city_list = [["CHN"], ["MUM"], ["HYD"], ["BNG"]]
>>> city_schema = StructType([
... StructField('city_name', StringType(), True)
... ])
>>>
>>> df = spark.createDataFrame(city_list, schema=city_schema)
交叉连接:
>>> df2 = df.crossJoin(df.withColumnRenamed('city_name', 'city_name1'))
>>> df2.filter(df2['city_name'] != df2['city_name1']).show()
+---------+----------+
|city_name|city_name1|
+---------+----------+
| CHN| MUM|
| CHN| HYD|
| CHN| BNG|
| MUM| CHN|
| MUM| HYD|
| MUM| BNG|
| HYD| CHN|
| HYD| MUM|
| HYD| BNG|
| BNG| CHN|
| BNG| MUM|
| BNG| HYD|
+---------+----------+
这是一个pyspark示例,但scala的语法也是一样的。