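Spark DataFrame: adding rows for missing values

I have a DataFrame in the following format. I want to add empty rows for the missing time slots of each customer.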

+-------------+----------+------+----+----+
| Customer_ID | TimeSlot |  A1  | A2 | An |
+-------------+----------+------+----+----+
| c1          |        1 | 10.0 |  2 |  3 |
| c1          |        2 | 11   |  2 |  4 |
| c1          |        4 | 12   |  3 |  5 |
| c2          |        2 | 13   |  2 |  7 |
| c2          |        3 | 11   |  2 |  2 |
+-------------+----------+------+----+----+

The resulting table should look like this:

+-------------+----------+------+------+------+
| Customer_ID | TimeSlot |  A1  |  A2  |  An  |
+-------------+----------+------+------+------+
| c1          |        1 | 10.0 | 2    | 3    |
| c1          |        2 | 11   | 2    | 4    |
| c1          |        3 | null | null | null |
| c1          |        4 | 12   | 3    | 5    |
| c2          |        1 | null | null | null |
| c2          |        2 | 13   | 2    | 7    |
| c2          |        3 | 11   | 2    | 2    |
| c2          |        4 | null | null | null |
+-------------+----------+------+------+------+

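I have 1 million customers and 360 time slots (only 4 are shown in the example above). The approach I came up with is to create a DataFrame with 2 columns (Customer_ID, TimeSlot) containing 1M x 360 rows, and then left outer join it with the original DataFrame.

Is there a better way?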

You can express this as a SQL query:

select c.customerid, t.timeslot,
       df.A1, df.A2, df.An
from (select distinct customerid from df) c cross join
     (select distinct timeslot from df) t left join
     df
     on df.customerid = c.customerid and df.timeslot = t.timeslot;

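Notes:

You should probably put the result into another DataFrame.
You may already have tables of the available customers and/or time slots. If so, use those instead of the subqueries.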
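As a rough way to check the logic of a query of this shape (cross join the distinct customers with the distinct time slots, then left join the data back), here is a minimal sketch that runs it on the sample rows from the question. It uses Python's built-in sqlite3 module purely as a stand-in for Spark SQL, which is an assumption made for easy testing; the table name df and the lowercase column names follow the query, not the question's headers.

```python
import sqlite3

# Sample rows from the question: (customerid, timeslot, A1, A2, An).
rows = [
    ("c1", 1, 10.0, 2, 3),
    ("c1", 2, 11.0, 2, 4),
    ("c1", 4, 12.0, 3, 5),
    ("c2", 2, 13.0, 2, 7),
    ("c2", 3, 11.0, 2, 2),
]

# In-memory SQLite database standing in for the Spark table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table df (customerid text, timeslot integer, "
    "A1 real, A2 integer, An integer)"
)
conn.executemany("insert into df values (?, ?, ?, ?, ?)", rows)

# Every (customer, time slot) pair, with the data columns left as NULL
# where the original table has no matching row.
query = """
select c.customerid, t.timeslot, df.A1, df.A2, df.An
from (select distinct customerid from df) c
cross join (select distinct timeslot from df) t
left join df
  on df.customerid = c.customerid and df.timeslot = t.timeslot
order by c.customerid, t.timeslot
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

Note that the customer ID and time slot are taken from the cross-joined side (c and t), because the columns coming from df are NULL for exactly the rows being added.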

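I think you can use Gordon Linoff's answer, but you could add the following things, since, as you said, there are millions of customers and you are joining across all of them.

Use a tally table for the time slots, as it may give better performance. For more on how tally tables can be used, see the following link: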

http://www.sqlservercentral.com/articles/t-sql/62867/

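I also think you should use a partition or row number function to split up the customerid column and select customers based on some partition value. For example, just select a range of row number values and then cross join it with the tally table. That may improve performance.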
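The tally-table idea can be sketched as follows: instead of deriving the time slots from the data, keep a small table that simply lists 1..N and cross join the customers with that. This is only an illustration under stated assumptions: it again uses Python's sqlite3 as a stand-in for Spark SQL, the table name tally and the constant N_SLOTS are hypothetical, and the rows are a trimmed variant of the question's sample in which slot 3 is missing for every customer.

```python
import sqlite3

N_SLOTS = 4  # demo value (assumption); the question mentions 360 slots

conn = sqlite3.connect(":memory:")
conn.execute("create table df (customerid text, timeslot integer, A1 real)")
# Trimmed, hypothetical variant of the sample data: no customer has slot 3,
# so "select distinct timeslot from df" would never produce it.
conn.executemany(
    "insert into df values (?, ?, ?)",
    [("c1", 1, 10.0), ("c1", 2, 11.0), ("c1", 4, 12.0), ("c2", 2, 13.0)],
)

# Tally table: one row per time slot number, 1..N_SLOTS.
conn.execute("create table tally (n integer primary key)")
conn.executemany(
    "insert into tally values (?)", [(n,) for n in range(1, N_SLOTS + 1)]
)

# Cross join customers with the tally table, then left join the data back.
query = """
select c.customerid, t.n as timeslot, df.A1
from (select distinct customerid from df) c
cross join tally t
left join df
  on df.customerid = c.customerid and df.timeslot = t.n
order by c.customerid, t.n
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

Besides avoiding a distinct scan over the large table, a tally table also covers time slots that no customer has at all. In Spark itself, a comparable number source could presumably be built with something like spark.range(1, 361) rather than a stored table.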
