I'm writing a simple DataFrame script in PySpark, but I can't "alias" the DataFrame. What am I doing wrong?
```python
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('myDFApp').master('local').getOrCreate()
sc = spark.sparkContext

input_data = [('retail', '2017-01-03T13:21:00', 134),
              ('marketing', '2017-01-03T13:21:00', 100)]
rdd_schema = StructType([StructField('business', StringType(), True),
                         StructField('date', StringType(), True),
                         StructField("US.sales", IntegerType(), True)])
input_df = spark.createDataFrame(input_data, rdd_schema)
print('Count= ', input_df.count())

# this line below works
df_1 = input_df.select((input_df.business).alias('partnership'))

# this line does not work
df_2 = input_df.alias("s") \
               .where(s.date > "2016-01-03")
df_2.show()
```
The error I get is:

```
Count=  2
Traceback (most recent call last):
  File "/home/hadoop/opt/inscape/test_dataframe.py", line 22, in <module>
    where(s.date > "2016-01-03")
NameError: name 's' is not defined
```

What am I doing wrong? Thanks.
When you alias a DataFrame, you change the reference name in Spark's metadata, not the referencing variable in Python — in Python, the DataFrame is still named `input_df`, and no variable `s` ever exists. You can use a `col` object to refer to columns through the `s` alias. Try the following to fix it:
```python
from pyspark.sql.functions import col

df_2 = input_df.alias("s") \
               .where(col("s.date") > "2016-01-03")
```