Parse the occurrences of each value in a list of JSON strings with PySpark



I am new to PySpark, and I am trying to find how many times each IP address appears in the following list:

sampleJson = [
    ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',),
    ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',),
    ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',),
    ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',),
    ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',),
    ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',),
]

Ideally, the result I need would look something like this:

+---------------+-----+
|             ip|count|
+---------------+-----+
| 191.168.192.96|    3|
| 191.168.192.99|    6|
|191.168.192.100|    2|
|191.168.192.101|    3|
|191.168.192.102|    2|
|191.168.192.103|    2|
|191.168.192.105|    3|
|191.168.192.107|    3|
+---------------+-----+

You can use the explode function to turn the array elements into rows:

from pyspark.sql.functions import col, count, explode, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

json_df = spark.createDataFrame(sampleJson)

# Schema of the JSON payload: a user id plus an array of IP strings
sch = StructType([StructField('user', StringType(), False),
                  StructField('ips', ArrayType(StringType()))])

# Parse the raw JSON column and flatten the struct into top-level columns
json_df = json_df.withColumn("n", from_json(col("_1"), sch)).select("n.*")

# One row per (user, ip), then count how often each ip occurs
json_df = (json_df
           .withColumn('ip', explode("ips"))
           .groupby('ip')
           .agg(count('*').alias('count')))

json_df.show()
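
Running json_df.show() on the sample data yields the counts shown above (the row order may differ). As a quick sanity check outside Spark, the same tallies can be computed in plain Python; here is a minimal sketch using only the standard json and collections modules:

import json
from collections import Counter

# Count every ip across all records in sampleJson (defined above)
counts = Counter(ip for (raw,) in sampleJson
                 for ip in json.loads(raw)["ips"])
print(counts)
# e.g. Counter({'191.168.192.99': 6, '191.168.192.96': 3, ...})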