假设我有以下pyspark dataframe:
>>> df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
>>> df.show()
+---+---------+----+
| c1| c2| c3|
+---+---------+----+
| A|Amsterdam| 3.4|
| B| London|null|
| C| null|null|
| D| null|11.1|
+---+---------+----+
现在,我如何选择或过滤任何行,包含至少一个 null值,例如?:
>>> df.SOME-COMMAND-HERE.show()
+---+---------+----+
| c1| c2| c3|
+---+---------+----+
| B| London|null|
| C| null|null|
| D| null|11.1|
+---+---------+----+
通过删除所需行来创建原始中间数据框架。然后从原始:
中"减去"它# Create the data frame
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
df.show()
+---+---------+----+
| c1| c2| c3|
+---+---------+----+
| A|Amsterdam| 3.4|
| B| London|null|
| C| null|null|
| D| null|11.1|
+---+---------+----+
# Construct an intermediate dataframe without the desired rows
df_drop = df.dropna('any')
df_drop.show()
+---+---------+---+
| c1| c2| c3|
+---+---------+---+
| A|Amsterdam|3.4|
+---+---------+---+
# Then subtract it from the original to reveal the desired rows
df.subtract(df_drop).show()
+---+------+----+
| c1| c2| c3|
+---+------+----+
| B|London|null|
| C| null|null|
| D| null|11.1|
+---+------+----+
构造适当的原始SQL查询并应用:
# Create the data frame
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
df.show()
+---+---------+----+
| c1| c2| c3|
+---+---------+----+
| A|Amsterdam| 3.4|
| B| London|null|
| C| null|null|
| D| null|11.1|
+---+---------+----+
# Compose the approprate raw SQL query
sql_query_base = 'SELECT * FROM df WHERE '
sql_query_apps = ['{} IS NULL'.format(col_name) for col_name in df.columns]
sql_query = str_base + ' OR '.join(sql_query_apps)
sql_query
'SELECT * FROM df WHERE c1 IS NULL OR c2 IS NULL OR c3 IS NULL'
# Register the dataframe as a SQL table
sqlContext.registerDataFrameAsTable(df, 'df')
# Apply raw SQL
sqlContext.sql(sql_query).show()
+---+------+----+
| c1| c2| c3|
+---+------+----+
| B|London|null|
| C| null|null|
| D| null|11.1|
+---+------+----+