Apache Spark: mapping to a DataFrame with custom objects as keys

I'm having trouble creating a DataFrame from an RDD.

First, I use Spark to create the data I'm working with (by running a simulation on the workers), and in return I get Report objects.

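These Report objects consist of two HashMaps. The keys are almost identical between the two maps and are of a custom type; the values are Integer/Double. Note that I currently need these keys and maps to add and update values efficiently during the simulation, so changing this to a "flat" object would probably cost a lot of efficiency.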

public class Key implements Serializable, Comparable<Key> {
    private final States states;
    private final String event;
    private final double age;
    ...
}

States is

public class States implements Serializable, Comparable<States> {
    private String stateOne;
    private String stateTwo;
    ...
}

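The states used to be enums, but it turned out that DataFrame doesn't like that. (The Strings are still set from the enums to make sure the values are correct.)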
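For readers wondering what "set from the enums" could look like, here is a minimal sketch. The enum `HealthState`, its values, and the class `StateStrings` are hypothetical; the question does not show the real state values or the actual constructor of States.

```java
// Hypothetical enum: the real state values are not shown in the question.
enum HealthState { HEALTHY, SICK }

// Hypothetical stand-in for States: stores plain Strings,
// but only accepts enum constants as input.
final class StateStrings {
    private final String stateOne;
    private final String stateTwo;

    StateStrings(HealthState one, HealthState two) {
        // name() returns the exact identifier of the enum constant.
        this.stateOne = one.name();
        this.stateTwo = two.name();
    }

    public String getStateOne() { return stateOne; }
    public String getStateTwo() { return stateTwo; }

    // Converts the stored String back to the enum when type safety is needed.
    HealthState stateOneAsEnum() { return HealthState.valueOf(stateOne); }
}
```

Because the Strings can only come from enum constants, a typo in a state name is still caught at compile time, while the stored fields stay plain Strings.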

The problem is that I want to convert these maps into DataFrames so that I can manipulate and filter the data with SQL and the like.

I am able to create the DataFrames by writing a bean-like class

public class Event implements Serializable {
    private String stateOne;
    private String stateTwo;
    private String event;
    private Double age;
    private Integer value;
    ...
}

with getters and setters, but is there a way to create my DataFrame using just a Tuple2 (or something similar), one that would still give me a nice database-like structure?

I have tried using a Tuple2 like this

JavaRDD<Report> reports = dataSet.map(new SimulationFunction(REPLICATIONS_PER_WORKER)).cache();
JavaRDD<Tuple2<Key, Integer>> events = reports.flatMap(new FlatMapFunction<Report, Tuple2<Key, Integer>>() {
    @Override
    public Iterable<Tuple2<Key, Integer>> call(Report t) throws Exception {
        List<Tuple2<Key, Integer>> list = new ArrayList<>(t.getEvents().size());
        for(Entry<Key, Integer> entry : t.getEvents().entrySet()) {
            list.add(new Tuple2<>(entry.getKey(), entry.getValue()));
        }
        return list;
    }
});
DataFrame schemaEvents = sqlContext.createDataFrame(events, ????);

but I don't know what to put in place of the question marks.

I hope I have made myself clear enough and that you can shed some light on this. Thanks in advance!

As zero323 said, what I wanted to do is not possible. From now on I'll just stick with beans.
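For anyone taking the same route, the flattening step from the nested Key to the flat Event bean might look roughly like the sketch below. It is plain Java with no Spark dependency. The `States` and `Key` classes here are hypothetical, trimmed-down stand-ins (the real ones have private fields and elided members), and `EventFlattener` is a name made up for illustration.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down stand-ins for the question's States and Key.
final class States {
    final String stateOne;
    final String stateTwo;
    States(String stateOne, String stateTwo) {
        this.stateOne = stateOne;
        this.stateTwo = stateTwo;
    }
}

final class Key {
    final States states;
    final String event;
    final double age;
    Key(States states, String event, double age) {
        this.states = states;
        this.event = event;
        this.age = age;
    }
}

// Flat bean: no-arg constructor plus a getter and setter per field.
class Event implements Serializable {
    private String stateOne;
    private String stateTwo;
    private String event;
    private Double age;
    private Integer value;

    public Event() { }

    public String getStateOne() { return stateOne; }
    public void setStateOne(String stateOne) { this.stateOne = stateOne; }
    public String getStateTwo() { return stateTwo; }
    public void setStateTwo(String stateTwo) { this.stateTwo = stateTwo; }
    public String getEvent() { return event; }
    public void setEvent(String event) { this.event = event; }
    public Double getAge() { return age; }
    public void setAge(Double age) { this.age = age; }
    public Integer getValue() { return value; }
    public void setValue(Integer value) { this.value = value; }
}

final class EventFlattener {
    // Copies every (Key, Integer) entry into one flat Event bean.
    static List<Event> flatten(Map<Key, Integer> events) {
        List<Event> rows = new ArrayList<>(events.size());
        for (Map.Entry<Key, Integer> entry : events.entrySet()) {
            Key key = entry.getKey();
            Event row = new Event();
            row.setStateOne(key.states.stateOne);
            row.setStateTwo(key.states.stateTwo);
            row.setEvent(key.event);
            row.setAge(key.age);
            row.setValue(entry.getValue());
            rows.add(row);
        }
        return rows;
    }
}
```

Inside the flatMap from the question, the body of call could then return something like `EventFlattener.flatten(t.getEvents())`, giving a `JavaRDD<Event>` that, assuming the Spark 1.x Java API, can be passed to `sqlContext.createDataFrame(events, Event.class)`. The maps keep their nested keys during the simulation; the flat beans only exist for the DataFrame step.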
