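Spark caching difference between 2.0.2 and 2.1.1
I'm a bit confused by Spark's caching behavior. I want to compute a dependent dataset (b), cache it, and then unpersist the source dataset (a). Here is my code: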
val spark = SparkSession.builder().appName("test").master("local[4]").getOrCreate()
import spark.implicits._
val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
a.createTempView("a")
a.cache
println(s"Is a cached: ${spark.catalog.isCached("a")}")
val b = a.filter(x => x._2 < 3)
b.createTempView("b")
// calling action
b.cache.first
println(s"Is b cached: ${spark.catalog.isCached("b")}")
spark.catalog.uncacheTable("a")
println(s"Is b cached after a was unpersisted: ${spark.catalog.isCached("b")}")
With Spark 2.0.2 it works as expected:
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: true
But with 2.1.1:
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: false
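How can I achieve the same behavior in 2.1.1?
Thanks.
Yes, the Spark folks say this is "by design" behavior - https://issues.apache.org/jira/browse/SPARK-21478. It is still unclear, though, how to deal with large cached datasets that are no longer needed. Any ideas? – yanik1984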
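One possible workaround, offered as a sketch rather than a confirmed fix: rebuild b from its RDD so that its logical plan no longer contains a's plan, cache the rebuilt dataset, and only then uncache a. This assumes the cascade is triggered by a's plan appearing inside b's plan; I have not verified it on every 2.1.x or later release. The object name DetachCacheSketch and the app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object DetachCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("detach-cache-sketch")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
    a.createOrReplaceTempView("a")
    a.cache()

    // Going through .rdd and back gives b a plan that does not embed a's plan
    // (assumption: this is what keeps b out of the cascading uncache).
    val b = spark.createDataset(a.filter(x => x._2 < 3).rdd)
    b.createOrReplaceTempView("b")
    b.cache()
    // count() touches every partition; first() may materialize only one.
    b.count()

    spark.catalog.uncacheTable("a")
    println(s"Is b cached after a was unpersisted: ${spark.catalog.isCached("b")}")

    spark.stop()
  }
}
```

The round trip through the RDD costs an extra serialization step, so it only makes sense when a is large enough that freeing it matters. Note also that if cached partitions of b are later evicted, Spark would have to recompute them from a's original source.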