Case 1: When I try to fetch "b.no" I get an error; the code and error message are shared below. How can I fetch values from the second dataframe (aliased as b)? Is selecting values from b allowed here at all? If I remove b.no it runs fine.
df1.csv
no,name,sal
1,sri,3000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4000
df2.csv
no,name,sal
1,sri,3000
1,vas,4000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4500
5,toy,4200
5,koy,4999
6,jim,3090
7,kim,2080
Code:
from pyspark.shell import spark
from pyspark.sql import SQLContext
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df11 = spark.read.option("header","true").option("delimiter", ",").csv(r"C:\inputs\df1.csv")
df22 = spark.read.option("header","true").option("delimiter", ",").csv(r"C:\inputs\df2.csv")
print("df11", df11.count())
print("df22", df22.count())
resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.no")
print("resDF", resDF.count())
print("resDF", resDF.distinct().show())
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`b.no`' given input columns: [b.sal, a.no, b.name, a.sal, a.name];;
pyspark.sql.utils.AnalysisException: "cannot resolve '`b.no`' given input columns: [b.sal, a.no, b.name, a.sal, a.name];;\n'Project [no#10, name#11, 'b.no]\n+- AnalysisBarrier\n   +- Project [no#10, name#11, sal#12, name#27, sal#28]\n      +- Join Inner, (no#10 = no#26)\n         :- SubqueryAlias a\n         :  +- Relation[no#10,name#11,sal#12] csv\n         +- SubqueryAlias b\n            +- Relation[no#26,name#27,sal#28] csv\n"
Case 2: When I fetch b.sal I get duplicate values, and distinct() does not filter them out.
resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.sal")
print("resDF", resDF.distinct().show())
In this case, how do I get distinct values based on "no" alone?
The problem in case 1 is that when you use a string (or a list of column names) as the join argument, Spark keeps only a.no and drops b.no, to avoid duplicate columns after the join (see the link for details). You can avoid this by defining a join expression such as F.col('a.no') == F.col('b.no'):
from pyspark.sql import types as T
from pyspark.sql import functions as F
columns1 = ['no','name','sal']
columns2 = ['no','name','sal']
vals1 = [(1,'sri',3000) ,(2,'ram',2000) ,(3,'sam',2500) ,(4,'kri',5000) ,(5,'tom',4000)]
vals2 = [(1,'sri',3000) ,(1,'vas',4000) ,(2,'ram',2000) ,(3,'sam',2500), (4,'kri',5000) ,(5,'tom',4500) ,(5,'toy',4200) ,(5,'koy',4999) ,(6,'jim',3090) ,(7,'kim',2080)]
df1 = spark.createDataFrame(vals1, columns1)
df2 = spark.createDataFrame(vals2, columns2)
# here I use an expression instead of a string
resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.no")
resDF.show()
Output:
+---+----+---+
| no|name| no|
+---+----+---+
|  1| sri|  1|
|  1| sri|  1|
|  2| ram|  2|
|  3| sam|  3|
|  4| kri|  4|
|  5| tom|  5|
|  5| tom|  5|
|  5| tom|  5|
+---+----+---+
For your case 2: the DataFrame distinct method compares every column of each row. When you want the unique values of only one column, you have to do a select first:
resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.sal")
resDF.select('no').distinct().show()