pyspark - Cannot get quarter and week of year from a date column



I have a pyspark dataframe that looks like this:

+--------+----------+---------+----------+-----------+--------------------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|
+--------+----------+---------+----------+-----------+--------------------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|
+--------+----------+---------+----------+-----------+--------------------+

I want to create two columns: one with the quarter of the year and one with the week of the year. Here is what I did, following the documentation for weekofyear and quarter:

from pyspark.sql import functions as F
sales_table = sales_table.withColumn("week_year", F.date_format(F.to_date("date", "yyyy-mm-dd"),
F.weekofyear("d")))
sales_table = sales_table.withColumn("quarter", F.date_format(F.to_date("date", "yyyy-mm-dd"),
F.quarter("d")))
sales_table.show(10)

Here is the error:

Column is not iterable
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 945, in date_format
return Column(sc._jvm.functions.date_format(_to_java_column(date), format))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
args_command, temp_args = self._build_args(*args)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1260, in _build_args
(new_args, temp_args) = self._get_args(args)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1247, in _get_args
temp_arg = converter.convert(arg, self.gateway_client)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 510, in convert
for element in object:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 353, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable

How can I create and append these two columns?

Is there a better or more efficient way to create these columns, without having to convert the date column to the yyyy-mm-dd format every time, and ideally creating both columns in a single command?

You can use the functions directly on the string column date.

from pyspark.sql import functions as F
df = df.select(
    '*',
    F.weekofyear('date').alias('week_year'),
    F.quarter('date').alias('quarter')
)
df.show()
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|week_year|quarter|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|        3|      1|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|       35|      3|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|       34|      3|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|        9|      1|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|       48|      4|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|       16|      2|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|       10|      1|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|       43|      4|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|        4|      1|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|       36|      3|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
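
This works because Spark implicitly casts a yyyy-MM-dd string to a date when it is passed to date functions such as weekofyear and quarter, so no explicit conversion is needed. If the string were in some other format, you would have to parse it first with to_date. A minimal sketch, assuming a hypothetical dd/MM/yyyy string column:

from pyspark.sql import functions as F
# Hypothetical case: 'date' stored as a dd/MM/yyyy string rather than yyyy-MM-dd
df = df.withColumn("date_parsed", F.to_date("date", "dd/MM/yyyy"))
df = df.select(
    '*',
    F.weekofyear('date_parsed').alias('week_year'),
    F.quarter('date_parsed').alias('quarter')
)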

You don't have to use the date_format function here, since your date is already in yyyy-MM-dd format; use weekofyear and quarter directly on the date column. The original code fails because date_format expects a literal format string as its second argument, and passing it a Column (the result of F.weekofyear or F.quarter) is what raises TypeError: Column is not iterable.

Example:

df.show()
#+----------+
#|      date|
#+----------+
#|2015-01-14|
#+----------+
from pyspark.sql import functions as F
df.withColumn("week_year", F.weekofyear(F.col("date"))).
withColumn("quarter", F.quarter(F.col("date"))).
show()
#+----------+---------+-------+
#|      date|week_year|quarter|
#+----------+---------+-------+
#|2015-01-14|        3|      1|
#+----------+---------+-------+
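
Since weekofyear and quarter are also available as Spark SQL functions, an equivalent SQL-style sketch derives both columns in one expression list via selectExpr:

df.selectExpr(
    "date",
    "weekofyear(date) AS week_year",  # ISO week of the year, 1-53
    "quarter(date) AS quarter"        # calendar quarter, 1-4
).show()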