Filter a PySpark DataFrame to keep rows containing at least one null value (keep, not drop)



Suppose I have the following PySpark DataFrame:

>>> df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
>>> df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

Now, how do I select or filter all rows that contain at least one null value? For example:

>>> df.SOME-COMMAND-HERE.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+

Create an intermediate DataFrame from the original by dropping the desired rows, then "subtract" it from the original:
# Create the data frame
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+
# Construct an intermediate dataframe without the desired rows
df_drop = df.dropna('any')
df_drop.show()
+---+---------+---+
| c1|       c2| c3|
+---+---------+---+
|  A|Amsterdam|3.4|
+---+---------+---+
# Then subtract it from the original to reveal the desired rows
df.subtract(df_drop).show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+
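
As a side note (not part of the original answer), the same rows can also be selected directly with the DataFrame API by OR-ing together an isNull() check for every column. This is a minimal sketch, assuming the same df as above:

# Sketch: build one OR-chained condition over all columns and filter on it
from functools import reduce
from pyspark.sql.functions import col

# True for a row if any of its columns is null
null_condition = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
df.filter(null_condition).show()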

Construct the appropriate raw SQL query and apply it:

# Create the data frame
df = spark.createDataFrame([('A', 'Amsterdam', 3.4), ('B', 'London', None), ('C', None, None), ('D', None, 11.1)], ['c1', 'c2', 'c3'])
df.show()
+---+---------+----+
| c1|       c2|  c3|
+---+---------+----+
|  A|Amsterdam| 3.4|
|  B|   London|null|
|  C|     null|null|
|  D|     null|11.1|
+---+---------+----+
# Compose the appropriate raw SQL query
sql_query_base = 'SELECT * FROM df WHERE '
sql_query_apps = ['{} IS NULL'.format(col_name) for col_name in df.columns]
sql_query = sql_query_base + ' OR '.join(sql_query_apps)
sql_query
'SELECT * FROM df WHERE c1 IS NULL OR c2 IS NULL OR c3 IS NULL'
# Register the dataframe as a SQL table
sqlContext.registerDataFrameAsTable(df, 'df')
# Apply raw SQL
sqlContext.sql(sql_query).show()
+---+------+----+
| c1|    c2|  c3|
+---+------+----+
|  B|London|null|
|  C|  null|null|
|  D|  null|11.1|
+---+------+----+
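
For reference, in newer PySpark versions the same raw SQL can be run through the SparkSession instead of the older sqlContext API. A minimal sketch, assuming the spark session and the sql_query built above:

# Register the DataFrame as a temporary view and run the same raw SQL
df.createOrReplaceTempView('df')
spark.sql(sql_query).show()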
