Does Apache Spark caching work on derived DataFrames?

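I'm doing some work with Apache Spark, but I'm not sure whether the DataFrame "frame3" will use the cached data from "frame1" or not. Code that conceptually describes the scenario is below: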

frame1 = spark.read.csv("hdfs:....")
frame1.cache()
frame2 = frame1.select("name", "price").filter("price > 20")
frame2.show()  # Data is being cached, so this action takes longer
frame2.show()  # Data has been cached, so this action takes a short amount of time
frame3 = frame2.select("name", "price").filter("price > 30")
frame3.show()  # Does this action use the cached data from frame1 or not, since frame2 was built from frame1?

Does anyone have any ideas?

Thanks, Aurora

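In the scenario above, the DataFrame frame3 will re-execute frame2's transformations. However, while doing so it will use the cached version of frame1 rather than reading the data from the CSV again.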

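Spark uses lazy evaluation so that it can optimize better. As a result, no transformation is executed until an action is performed. This works very well when multiple transformations are applied to a single DataFrame.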

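However, when a single transformed DataFrame is referenced in several other places, it is better to cache that DataFrame. That said, in the example above I don't see frame1 being referenced anywhere else, so caching it serves no purpose (unless this is just an example for the sake of understanding).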

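Note: I updated the answer based on the comments, since I had missed some important information. A DataFrame is not actually cached until an action is performed on it.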
