如何计时Spark程序执行速度

我想让我的Spark程序执行速度有时间，但由于懒惰，这是相当困难的。让我们考虑到这里本（意义）代码：如何计时Spark程序执行速度

var graph = GraphLoader.edgeListFile(context, args(0)) 
val graph_degs = graph.outerJoinVertices(graph.degrees).triplets.cache 

/* I'd need to start the timer here */ 
val t1 = System.currentTimeMillis 
val edges = graph_degs.flatMap(trip => { /* do something*/ }) 
         .union(graph_degs) 

val count = edges.count 
val t2 = System.currentTimeMillis 
/* I'd need to stop the timer here */ 

println("It took " + t2-t1 + " to count " + count)

的事情是，转换是懒洋洋地所以没有什么被前val count = edges.count行评估。但根据我的观点，t1获得一个值，尽管上面的代码没有值......在定时器启动后，即使在代码中的位置，在t1以上的代码也被评估。这是一个问题...

在Spark Web UI中，我找不到任何有趣的内容，因为我需要花费特定代码行后的时间。你认为有一个简单的解决方案，看看什么时候一组转化得到真正的评估？

来源

2017-04-20 Matt

的可能的复制[如何分析在Scala中的方法？]（http://stackoverflow.com/questions/9160001/how-to-profile-methods-in-scala） –

看起来不像重复，因为这篇文章是特定于Apache Spark，它提供了特定的测量工具，并呈现特定的性能分析挑战 - 就像这里所描述的那样：评估是懒惰的，并且存在重新评估的代码块可能不代表测量的操作。 –

由于连续转换（同一任务内 - 的含义，它们不是由混洗分离并进行与同动作的一部分）作为一个单一的“步骤”被执行，火花确实不单独测量它们。并从驱动程序代码 - 你不能。

什么你可以做的是测量应用您功能，每条记录的持续时间，并使用累加器来总括起来，如：

// create accumulator 
val durationAccumulator = sc.longAccumulator("flatMapDuration") 

// "wrap" your "doSomething" operation with time measurement, and add to accumulator 
val edges = rdd.flatMap(trip => { 
    val t1 = System.currentTimeMillis 
    val result = doSomething(trip) 
    val t2 = System.currentTimeMillis 
    durationAccumulator.add(t2 - t1) 
    result 
}) 

// perform the action that would trigger evaluation 
val count = edges.count 

// now you can read the accumulated value 
println("It took " + durationAccumulator.value + " to flatMap " + count)

您可以重复这一过程，任何单独的转换。

免责声明：

当然，这不包括的时间花在星火洗牌周围的东西和做的实际计数 - 对，的确，星火UI是你最好的资源。
请注意，累加器对重试等事情很敏感 - 重试任务会更新累加器两次。

风格注：可以使代码的可重用性通过创建measure功能“包装”周围的任何功能和更新给定的累加器：

// write this once: 
def measure[T, R](action: T => R, acc: LongAccumulator): T => R = input => { 
    val t1 = System.currentTimeMillis 
    val result = action(input) 
    val t2 = System.currentTimeMillis 
    acc.add(t2 - t1) 
    result 
} 

// use it with any transformation: 
rdd.flatMap(measure(doSomething, durationAccumulator))

来源

2017-04-20 17:16:14

是的，这是一个好主意！但不是那么准确，因为正如你所说的那样，“这不包括Spark花时间洗牌并进行实际计数的时间”。无论如何，这很有趣。 – Matt

Spark Web UI记录每一个动作，甚至报告该动作每个阶段的时间 - 它都在那里！你需要浏览阶段标签，而不是作业。我发现只有编译并提交代码才可以使用它。它在REPL中没用，你有没有使用它？

来源

2017-04-20 16:49:46

如何计时Spark程序执行速度

回答

相关问题