PySpark: Split one column into multiple new columns



I have a network.log file on Hadoop:

{"Source":"Network","Detail":"Event=01|Device=Mobile|ClientIP=10.0.0.0|URL=example.com"}

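I want to load it as a DataFrame and split Detail on |. Then I want to split each resulting column further on =, using the left side as the column name and the right side as the value.

The expected result is: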

Source  | Event | Device | ClientIP | URL
Network | 01    | Mobile | 10.0.0.0 | example.com

I have already done the first split, as follows:

from pyspark import SparkContext
from pyspark.sql import functions, SQLContext
INPUT_PATH = 'network.log'
sc = SparkContext("local", "NetworkEvent")
sqlContext = SQLContext(sc)
raw = sqlContext.read.json(INPUT_PATH)
# split() takes a regular expression, so the pipe must be escaped
detail_col = functions.split(raw['Detail'], r'\|')
for i in range(4):
    raw = raw.withColumn('col_' + str(i), detail_col.getItem(i))
raw.show()

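My question is: can I do the second split directly on top of detail_col.getItem(i), in the same pass? I could create another UDF for each column of the new DataFrame, but is there a more elegant way to do it within a single UDF? Many thanks!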

Note: I am using Spark 1.5.0, so pandas UDFs are not available.

In 1.5.0, you can use regexp_extract:

from pyspark.sql import functions as F
# df is the DataFrame loaded from network.log (called raw in the question)
for i in ['Event', 'Device', 'ClientIP', 'URL']:
    df = df.withColumn(i, F.regexp_extract('Detail',"{}=([^|]+)".format(i),1))
df.show()
+-------+--------------------+-----+------+--------+-----------+
| Source|              Detail|Event|Device|ClientIP|        URL|
+-------+--------------------+-----+------+--------+-----------+
|Network|Event=01|Device=M...|   01|Mobile|10.0.0.0|example.com|
+-------+--------------------+-----+------+--------+-----------+
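
To see what the pattern captures without starting Spark, here is a minimal plain-Python sketch that applies the same regular expression with the standard re module. The helper name extract is hypothetical and is not part of any Spark API. Spark evaluates the pattern with Java's regex engine, but this particular pattern should behave the same way in both. Returning an empty string on a missing key is meant to mirror how regexp_extract is documented to behave when nothing matches.

```python
import re

detail = "Event=01|Device=Mobile|ClientIP=10.0.0.0|URL=example.com"


def extract(detail, key):
    """Hypothetical helper: return the value that follows 'key=' up to the next pipe."""
    # Same pattern as in the answer; inside [...] the pipe is a literal character.
    match = re.search("{}=([^|]+)".format(key), detail)
    # Assumption: an empty string stands in for "no match".
    return match.group(1) if match else ""


extracted = {key: extract(detail, key) for key in ['Event', 'Device', 'ClientIP', 'URL']}
print(extracted)
```

Because each key is looked up by name, this approach does not depend on the order of the pairs inside Detail.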

There is no need to write a UDF for this. Several alternatives can achieve the same result; here is one of them:

from pyspark import SparkContext
from pyspark.sql import functions, SQLContext
INPUT_PATH = 'network.log'
sc = SparkContext("local", "NetworkEvent")
sqlContext = SQLContext(sc)
raw = sqlContext.read.json(INPUT_PATH)
# split() takes a regular expression, so the pipe must be escaped
detail_col = functions.split(raw['Detail'], r'\|')
cols_to_be = raw.select([functions.split(detail_col.getItem(i), "=").getItem(0).alias("col_"+str(i)) for i in range(4)]).first()
for i in range(4):
    raw = raw.withColumn(
        cols_to_be["col_"+str(i)], 
        functions.split(detail_col.getItem(i), "=").getItem(1)
    )

raw.show()
+--------------------+-------+-----+------+--------+-----------+
|              Detail| Source|Event|Device|ClientIP|        URL|
+--------------------+-------+-----+------+--------+-----------+
|Event=01|Device=M...|Network|   01|Mobile|10.0.0.0|example.com|
+--------------------+-------+-----+------+--------+-----------+

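Note that this takes the column names from the first row, so it assumes every Detail value follows the same pattern: the same keys in the same order.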
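
The two-stage split used in this answer can be traced on a single string in plain Python. This is only a sketch of the logic, not Spark code, and the variable names pairs and parsed are illustrative. One deliberate difference: it splits each pair on the first = only, whereas taking getItem(1) after a split on "=" would cut off a value that itself contains an = sign (for example, a URL with a query string).

```python
import re

detail = "Event=01|Device=Mobile|ClientIP=10.0.0.0|URL=example.com"

# Stage 1: split on a literal pipe. The pipe is escaped because an
# unescaped '|' means alternation in a regular expression.
pairs = re.split(r"\|", detail)

# Stage 2: split each "key=value" pair on the first '=' only,
# so a value that contains '=' stays intact.
parsed = dict(pair.split("=", 1) for pair in pairs)

print(pairs)
print(parsed)
```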
