PySpark DataFrame: getting the top 5 rows per group using SQL or a pandas DataFrame



I am trying to get the top 5 items for each district based on rate_increase. I am trying to use spark.sql as shown below:

Input:

district   item   rate_increase(%)
Arba     coil    500
Arba     pen    -85
Arba     hat     50
Cebu     oil    -40
Cebu     pen     1100

Top5item = spark.sql('select district, item , rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0')

This works. How can I filter the top 5 products in the same statement? I tried the following. Is spark.sql the better approach?

Top5item = spark.sql('select district, item from (select district, item, rate_increase, ROW_NUMBER() OVER (PARTITION BY district ORDER BY rate_increase DESC) AS RowNum from rateTable where rate_increase > 0) where RowNum <= 5 order by district')

Output:

district   item   rate_increase(%)
Arba     coil    500
Arba     hat     50
Cebu     pen     1100

Thanks.

Lilly, you can read the data from a CSV file with pandas, or create a pandas DataFrame as shown below and then convert it to a Spark DataFrame:

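import pandas as pd
from pyspark.sql import SparkSession

# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Sample data from the question
data_1 = {
'district': ["Arba", "Arba", "Arba","Cebu", "Cebu"],
'item': ['coil', 'pen', 'hat','oil','pen'],
'rate_increase(%)': [500,-85,50,-40,1100]}
pandas_df = pd.DataFrame(data_1)
# Convert to a Spark DataFrame and register it as a temp view for SQL
ddf_1 = spark.createDataFrame(pandas_df)
ddf_1.createOrReplaceTempView("ddf_1")
# Number the rows within each district (highest rate first), then keep the first 5
output = spark.sql("""
select district, item , `rate_increase(%)` from (
select row_number() over (partition by district order by `rate_increase(%)` desc) as RowNum, district,item, `rate_increase(%)`  from ddf_1  where  `rate_increase(%)` > 0 )
where RowNum <= 5 order by district, RowNum
""")
output.show()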
+--------+----+----------------+
|district|item|rate_increase(%)|
+--------+----+----------------+
|    Arba|coil|             500|
|    Arba| hat|              50|
|    Cebu| pen|            1100|
+--------+----+----------------+

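Keep in mind the order in which a query is evaluated:

FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT

A clause such as where RowNum <= 5 does not work when it sits at the same query level as the window function: WHERE is evaluated before SELECT, so at that point it does not know what RowNum is.

Try using a subquery block: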

spark.sql("""
select district, item , `rate_increase(%)` from (
select row_number() over (partition by district order by `rate_increase(%)` desc) as RowNum, district,item, `rate_increase(%)`  from ddf_1  where  `rate_increase(%)` > 0 )
where RowNum <= 5 order by district, RowNum
""").show()

Output:

+--------+----+----------------+
|district|item|rate_increase(%)|
+--------+----+----------------+
|    Arba|coil|             500|
|    Arba| hat|              50|
|    Cebu| pen|            1100|
+--------+----+----------------+

I tried using pandas as a simple solution.

Top5item = df.sort_values('rate_increase(%)', ascending = True).groupby(['district']).head(5)

Sorting rate_increase(%) in ascending order and then grouping by district still does not work. Thanks.
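
One likely cause is the sort direction: with ascending=True the smallest values come first, so head(5) returns the lowest rates in each district rather than the highest. The following is a minimal pandas sketch under that assumption, using the sample data from the question; it also filters out non-positive rates to match the SQL versions, and the name top5 is illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "district": ["Arba", "Arba", "Arba", "Cebu", "Cebu"],
    "item": ["coil", "pen", "hat", "oil", "pen"],
    "rate_increase(%)": [500, -85, 50, -40, 1100],
})

top5 = (
    df[df["rate_increase(%)"] > 0]                 # keep positive increases only
    .sort_values(["district", "rate_increase(%)"],
                 ascending=[True, False])          # highest rate first within each district
    .groupby("district")
    .head(5)                                       # first 5 rows of each group
    .reset_index(drop=True)
)
print(top5)
```

groupby(...).head(5) keeps the row order produced by sort_values, so sorting in descending order before grouping is what makes the first five rows of each group the top five.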
