PySpark: create a Spark DataFrame from a nested dictionary



How do I create a Spark DataFrame from a nested dictionary? I'm new to PySpark, and I don't want to use a pandas DataFrame.

My dictionary looks like this:

{'prathameshsalap@gmail.com': {'Date': datetime.date(2019, 10, 21), 'idle_time': datetime.datetime(2019, 10, 21, 1, 50)},
 'vaishusawant143@gmail.com': {'Date': datetime.date(2019, 10, 21), 'idle_time': datetime.datetime(2019, 10, 21, 1, 35)},
 'you@example.com': {'Date': datetime.date(2019, 10, 21), 'idle_time': datetime.datetime(2019, 10, 21, 1, 55)}}

I want to convert this dictionary into a Spark DataFrame using PySpark.

My expected output:

Date    idle_time
user_name       
prathameshsalap@gmail.com   2019-10-21  2019-10-21 01:50:00
vaishusawant143@gmail.com   2019-10-21  2019-10-21 01:35:00
you@example.com             2019-10-21  2019-10-21 01:55:00

You need to reshape the dictionary and build Rows so that the schema is inferred correctly.

import datetime

from pyspark.sql import Row, SparkSession

# Obtain (or create) the active SparkSession; in spark-shell or a
# notebook the `spark` variable usually already exists.
spark = SparkSession.builder.getOrCreate()

data_dict = {
    'prathameshsalap@gmail.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 50)
    },
    'vaishusawant143@gmail.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 35)
    },
    'you@example.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 55)
    }
}

# Promote each top-level key to a 'user_name' field and merge it with the
# inner dict, so every Row carries all three columns.
data_as_rows = [Row(**{'user_name': k, **v}) for k, v in data_dict.items()]
data_df = spark.createDataFrame(data_as_rows).select('user_name', 'Date', 'idle_time')
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name                |Date      |idle_time          |
+-------------------------+----------+-------------------+
|prathameshsalap@gmail.com|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143@gmail.com|2019-10-21|2019-10-21 01:35:00|
|you@example.com          |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+

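To double-check what was inferred, you can print the schema. Spark maps datetime.date to DateType and datetime.datetime to TimestampType, so the output should look roughly like this:

data_df.printSchema()
>>>
root
 |-- user_name: string (nullable = true)
 |-- Date: date (nullable = true)
 |-- idle_time: timestamp (nullable = true)
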
Note: if you already know the schema and don't need it inferred, simply pass the schema to the createDataFrame function:

import pyspark.sql.types as T

schema = T.StructType([
    T.StructField('user_name', T.StringType(), False),
    T.StructField('Date', T.DateType(), False),
    T.StructField('idle_time', T.TimestampType(), False)
])
# Flatten each (key, inner dict) pair into a plain tuple matching the schema.
data_as_tuples = [(k, v['Date'], v['idle_time']) for k, v in data_dict.items()]
data_df = spark.createDataFrame(data_as_tuples, schema=schema)
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name                |Date      |idle_time          |
+-------------------------+----------+-------------------+
|prathameshsalap@gmail.com|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143@gmail.com|2019-10-21|2019-10-21 01:35:00|
|you@example.com          |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+
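
As a side note, createDataFrame also accepts a list of plain dicts and will infer the schema from them (some Spark 2.x releases print a warning recommending Row for this). A minimal one-liner sketch; the .select is needed because inference from dicts sorts the fields alphabetically:

data_df = spark.createDataFrame(
    [{'user_name': k, **v} for k, v in data_dict.items()]
).select('user_name', 'Date', 'idle_time')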

Convert the dictionary into a list of tuples; each tuple will then become one row of the Spark DataFrame:

rows = []
for key, value in data.items():
    # The top-level key becomes the user_name column.
    row = (key, value['Date'], value['idle_time'])
    rows.append(row)

Define a schema for the data:

from pyspark.sql.types import *

sch = StructType([
    StructField('user_name', StringType()),
    StructField('date', DateType()),
    StructField('idle_time', TimestampType())
])

Create the Spark DataFrame:

df = spark.createDataFrame(rows, sch)
df.show()
+--------------------+----------+-------------------+
|           user_name|      date|          idle_time|
+--------------------+----------+-------------------+
|prathameshsalap@g...|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143@g...|2019-10-21|2019-10-21 01:35:00|
|     you@example.com|2019-10-21|2019-10-21 01:55:00|
+--------------------+----------+-------------------+
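
By default show() truncates string values to 20 characters, which is why the email addresses are cut off above; pass truncate=False to see the full values:

df.show(truncate=False)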
