I am using pyspark 1.6.1, and I created a DataFrame as follows:
toy_df = sqlContext.createDataFrame([('blah',10)], ['name', 'age'])
Now look what happens when I try to query this DataFrame for 'blah', first using where and then using select:
toy_df_where = toy_df.where(toy_df['name'] != 'blah')
toy_df_where.count()
0
toy_df_select = toy_df.select(toy_df['name'] != 'blah')
toy_df_select.count()
1
Why do these two options give different results?
Thanks.
where (and its alias filter) is used to filter rows, while select is used to select columns. So in your select statement, toy_df['name'] != 'blah' constructs a new column of Boolean values, and select projects that column into the resulting DataFrame: every row is kept, which is why the count is still 1. This example makes it clearer:
>>> toy_df = sqlContext.createDataFrame([('blah',10), ('foo', 20)], ['name', 'age'])
>>> toy_df_where = toy_df.where(toy_df['name'] != 'blah')
>>> toy_df_where.show()
+----+---+
|name|age|
+----+---+
| foo| 20|
+----+---+
# filter works the same way as where
>>> toy_df_filter = toy_df.filter(toy_df['name'] != 'blah')
>>> toy_df_filter.show()
+----+---+
|name|age|
+----+---+
| foo| 20|
+----+---+
>>> toy_df_select = toy_df.select((toy_df['name'] != 'blah').alias('cond'))
# give the column a new name with alias
>>> toy_df_select.show()
+-----+
| cond|
+-----+
|false|
| true|
+-----+
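The difference can also be sketched in plain Python (a rough analogy using a list of tuples, not actual PySpark internals): filtering a predicate changes the number of rows, while projecting an expression evaluates it once per row and keeps them all.

```python
# Plain-Python analogy of the two operations (not PySpark itself).
rows = [('blah', 10), ('foo', 20)]

# where/filter: keep only the rows where the predicate holds -> row count shrinks
where_result = [r for r in rows if r[0] != 'blah']

# select: evaluate the expression for every row -> row count is unchanged
select_result = [(r[0] != 'blah',) for r in rows]

print(len(where_result))   # 1 row survives the filter
print(select_result)       # [(False,), (True,)] -- one Boolean per original row
```

This mirrors why toy_df_where.count() returned 0 on the single-row DataFrame (the only row was filtered out) while toy_df_select.count() returned 1 (the row stayed, carrying a False).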