How to add up the numbers inside arrays (the bracketed values) in PySpark



This is my dataframe:

+--------------------+--------------------+--------------------+--------------------+--------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|                 ifa|           longitude|            latitude|          first_seen|           last_seen|hour_00|hour_01|hour_02|hour_03|hour_04|hour_05|hour_06|hour_07|hour_08|hour_09|hour_10|hour_11|hour_12|hour_13|hour_14|hour_15|hour_16|hour_17|hour_18|hour_19|hour_20|hour_21|hour_22|hour_23|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|00000000-240f-e42...|          [106.6284]|            [-6.177]|[2021-06-18 02:35...|[2021-06-18 02:35...|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [0]|    [1]|    [0]|    [0]|    [0]|    [0]|
|00045a11-3666-48b...|[103.519611, 103....|[-3.779082, -3.78...|[2021-06-17 18:39...|[2021-06-17 18:44...| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]|[55, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 4]| [0, 0]| [0, 0]| [0, 0]| [0, 0]| [0, 0]|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

This is my code:

%%time
A = A.withColumn('Morning', (col('hour_05') + col('hour_06') + col('hour_07')))
A = A.withColumn('Office Hour', (col('hour_08')+ col('hour_09') + col('hour_10') + col('hour_11') + col('hour_12')+ col('hour_13') + col('hour_14') + col('hour_15') + col('hour_16') + col('hour_17')))
#A = A.withColumn('Lunch Time', (col('hour_12')+ col('hour_13')))
A = A.withColumn('After Office', (col('hour_18')+ col('hour_19') + col('hour_20') + col('hour_21') + col('hour_22') + col('hour_23')))
A = A.withColumn('Bed Time', (col('hour_00')+ col('hour_01') + col('hour_02') + col('hour_03') + col('hour_04') + col('hour_22') + col('hour_23')))

This is the error message:

AnalysisException: "cannot resolve '(`hour_05` + `hour_06`)' due to data type mismatch: '(`hour_05` + `hour_06`)' requires (numeric or calendarinterval) type, not array<bigint>;;
'Project [ifa#1, longitude#24, latitude#25, first_seen#26, last_seen#27, hour_00#28, hour_01#29, hour_02#30, hour_03#31, hour_04#32, hour_05#33, hour_06#34, hour_07#35, hour_08#36, hour_09#37, hour_10#38, hour_11#39, hour_12#40, hour_13#41, hour_14#42, hour_15#43, hour_16#44, hour_17#45, hour_18#46, ... 6 more fields]
+- Filter isnotnull(hour_00#28)
   +- Project [ifa#1, gps#7.longitude AS longitude#24, gps#7.latitude AS latitude#25, gps#7.first_seen AS first_seen#26, gps#7.last_seen AS last_seen#27, gps#7.hour_00 AS hour_00#28, gps#7.hour_01 AS hour_01#29, gps#7.hour_02 AS hour_02#30, gps#7.hour_03 AS hour_03#31, gps#7.hour_04 AS hour_04#32, gps#7.hour_05 AS hour_05#33, gps#7.hour_06 AS hour_06#34, gps#7.hour_07 AS hour_07#35, gps#7.hour_08 AS hour_08#36, gps#7.hour_09 AS hour_09#37, gps#7.hour_10 AS hour_10#38, gps#7.hour_11 AS hour_11#39, gps#7.hour_12 AS hour_12#40, gps#7.hour_13 AS hour_13#41, gps#7.hour_14 AS hour_14#42, gps#7.hour_15 AS hour_15#43, gps#7.hour_16 AS hour_16#44, gps#7.hour_17 AS hour_17#45, gps#7.hour_18 AS hour_18#46, ... 5 more fields]
      +- Relation[id#0,ifa#1,source#2,device#3,app#4,connection#5,geoip_connection#6,gps#7,geoip_location#8,user#9,brq_count#10L,day#11] parquet
"

The hour_XX columns are of type `array<bigint>`, so they cannot be added with `+` directly. You first need to sum the elements inside each array column; only then can the resulting scalars be added together.

You can use the built-in higher-order function `aggregate` to compute the sum of an array's elements.

Example:

>>> from pyspark.sql.functions import aggregate, array, lit
>>> df = spark.createDataFrame(([1,2],[5,6]),"array<int>").withColumn("c2",array(lit(1)))
>>> df.show()
+------+---+
| value| c2|
+------+---+
|[1, 2]|[1]|
|[5, 6]|[1]|
+------+---+
>>> df.withColumn("array_sum",aggregate("value",lit(0),lambda x,y:x+y)+aggregate("c2",lit(0),lambda x,y:x+y)).show()
+------+---+---------+
| value| c2|array_sum|
+------+---+---------+
|[1, 2]|[1]|        4|
|[5, 6]|[1]|       12|
+------+---+---------+
