Create a dataframe from the StructType/printSchema results of two PySpark dataframes



I have two Spark dataframes and I want to compare their data types, but side by side, by putting the schemas of both into a single dataframe:

df1.printSchema()

returns

root
|-- orgID: string (nullable = true)
|-- deptID: string (nullable = true)
|-- systemID: string (nullable = true)
|-- eventId: string (nullable = true)
|-- eventType: string (nullable = true)
|-- autoID: string (nullable = true)
|-- personID: string (nullable = true)
|-- employeeFirst: string (nullable = true)
|-- employeeMiddle: string (nullable = true)
|-- employeeLast: string (nullable = true)
|-- employeeDOB: string (nullable = true)
df2.printSchema()

returns

root
|-- orgID: integer (nullable = true)
|-- deptID: string (nullable = true)
|-- systemID: string (nullable = true)
|-- eventId: integer (nullable = true)
|-- eventType: string (nullable = true)
|-- autoID: string (nullable = true)
|-- personID: integer (nullable = true)
|-- employeeFirst: string (nullable = true)
|-- employeeMiddle: string (nullable = true)
|-- employeeLast: string (nullable = true)
|-- employeeDOB: timestamp (nullable = false)

I want to put the two together and then create another column that compares df1type and df2type ['True','False']:

+-------------+----------+---------+
|       column|   df1type|  df2type|
+-------------+----------+---------+
|        orgID|    string|  integer|
|       deptID|    string|   string|
...
|  employeeDOB|    string|timestamp|
+-------------+----------+---------+
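One way to build that table (a minimal sketch, assuming df1 and df2 are flat, non-nested dataframes in the same SparkSession; the names spark, d1, d2 and schema_compare are just illustrative) is to read the type of each field from df.schema.fields, where dataType.typeName() gives the same names printSchema prints, and assemble the rows in Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# map column name -> type name for each dataframe
# (dataType.typeName() yields 'string', 'integer', 'timestamp', ... as printSchema does)
d1 = {f.name: f.dataType.typeName() for f in df1.schema.fields}
d2 = {f.name: f.dataType.typeName() for f in df2.schema.fields}

# one row per column appearing in either dataframe, plus a 'True'/'False' match flag
rows = [(c, d1.get(c), d2.get(c), str(d1.get(c) == d2.get(c)))
        for c in sorted(set(d1) | set(d2))]

schema_compare = spark.createDataFrame(rows, ["column", "df1type", "df2type", "match"])
schema_compare.show(truncate=False)

sorted keeps the output deterministic, and any column that exists in only one of the dataframes shows up with a null on the other side.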

So far I can tell from:

df1.schema == df2.schema

that the two dataframes are not equal; the above returns False.

I tried turning each printSchema into a table and then merging them, but getting the printSchema() output into a table is proving challenging.

I need to find out which of the common columns have different StructTypes. Is there another way to do this?
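If all that's needed is the list of common columns whose types differ, a quick check over df.dtypes (which returns (column, type) pairs, with short type names such as 'int' rather than 'integer') avoids building a dataframe at all; again just a sketch under the same flat-column assumption:

# columns present in both dataframes
common = set(df1.columns) & set(df2.columns)

d1, d2 = dict(df1.dtypes), dict(df2.dtypes)

# common columns whose type differs between df1 and df2
mismatched = {c: (d1[c], d2[c]) for c in common if d1[c] != d2[c]}
print(mismatched)
# e.g. {'orgID': ('string', 'int'), 'employeeDOB': ('string', 'timestamp'), ...}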

I don't know if this is what you're after, but try this (in PySpark the method is called exceptAll rather than except):

df2.exceptAll(df1)
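Note that exceptAll (and subtract) compare row contents rather than column types, so on their own they won't tell you which common columns have different data types; the dtypes/schema comparisons sketched above target the schema itself.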
