How to list partition-pruned inputs for a Hive table?

I am using Spark SQL to query data in Hive. The data is partitioned, and Spark SQL correctly prunes partitions when querying.

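However, for a given query I need to list either the source tables together with their partition filters, or the specific input files, so that I can determine which part of the data the computation will run on. (.inputFiles would be the obvious choice, but it does not reflect pruning.)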

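The closest I have been able to get is calling df.queryExecution.executedPlan.collectLeaves(). The result contains the relevant plan nodes as HiveTableScanExec instances. However, this class is private[hive] to the org.apache.spark.sql.hive package. I think the relevant fields are relation and partitionPruningPred.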
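As a rough illustration of that starting point, the sketch below only prints the leaf nodes of the physical plan. It assumes a spark-shell session with Hive support, and the table name h1 is a placeholder for any existing Hive table. Only public members are used, so it shows which leaves are Hive scans but cannot reach relation or partitionPruningPred.

```scala
// Assumption: running inside spark-shell, so `spark` is already defined.
// "h1" is a placeholder table name; substitute any Hive table you can read.
val df = spark.table("h1")

// Leaf nodes of the physical plan are the scans that actually read data.
val leaves = df.queryExecution.executedPlan.collectLeaves()

// nodeName and toString are public on every plan node, so this compiles
// outside the org.apache.spark.sql.hive package.
leaves.foreach(leaf => println(s"${leaf.nodeName}: $leaf"))
```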

Is there any way to achieve this?

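Update: I was able to get the relevant information thanks to Jacek's suggestion, by calling getHiveQlPartitions on the returned relation and passing partitionPruningPred as the argument: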

scan.findHiveTables(execPlan).flatMap(e => e.relation.getHiveQlPartitions(e.partitionPruningPred))

This contains all the data I need, including the paths of all input files, with partition pruning correctly applied.


2017-09-14 binarek

Well, you are asking for low-level details of query execution, and things are bumpy down there. You have been warned :)

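As you noted in your comment, all the execution information is in this private[hive] HiveTableScanExec.

One way to get some insight into the HiveTableScanExec physical operator (that is, a Hive table at execution time) is to create a sort of backdoor in the org.apache.spark.sql.hive package that is not private[hive].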

package org.apache.spark.sql.hive 

import org.apache.spark.sql.hive.execution.HiveTableScanExec 
object scan { 
    def findHiveTables(execPlan: org.apache.spark.sql.execution.SparkPlan) = execPlan.collect { case hiveTables: HiveTableScanExec => hiveTables } 
}

Change the code to meet your needs.
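For instance, one possible variation (a sketch only, not part of the original backdoor) returns the partition-pruning predicates of each Hive scan as strings, so that no package-private type leaks out of the helper. The object name scanPreds and the method name pruningPredicates are made up for illustration; partitionPruningPred is the field mentioned in the question.

```scala
package org.apache.spark.sql.hive

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.hive.execution.HiveTableScanExec

// Hypothetical helper; the names scanPreds and pruningPredicates are illustrative.
object scanPreds {
  // One inner Seq per Hive table scan found in the plan, holding that scan's
  // partition-pruning predicates rendered as SQL-like strings.
  def pruningPredicates(execPlan: SparkPlan): Seq[Seq[String]] =
    execPlan.collect {
      case h: HiveTableScanExec => h.partitionPruningPred.map(_.sql)
    }
}
```

A scan with no partition filter yields an empty inner sequence, which is what one would expect for an unpartitioned table.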

To load scan.findHiveTables, I usually use :paste -raw in spark-shell when sneaking into such "uncharted territory".

You could then simply do the following:

scala> spark.version 
res0: String = 2.4.0-SNAPSHOT 

// Create a Hive table 
import org.apache.spark.sql.types.StructType 
spark.catalog.createTable(
    tableName = "h1", 
    source = "hive", // <-- that makes for a Hive table 
    schema = new StructType().add($"id".long), 
    options = Map.empty[String, String]) 

// select * from h1 
val q = spark.table("h1") 
val execPlan = q.queryExecution.executedPlan 
scala> println(execPlan.numberedTreeString) 
00 HiveTableScan [id#22L], HiveTableRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#22L] 

// Use the above code and :paste -raw in spark-shell 

import org.apache.spark.sql.hive.scan 
scala> scan.findHiveTables(execPlan).size 
res11: Int = 1

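The relation field is the Hive table after it has been resolved by the ResolveRelations and FindDataSourceTable logical rules, which the Spark analyzer uses to resolve data sources and Hive tables.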

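You can get pretty much all the information Spark uses from the Hive metastore through the ExternalCatalog interface, which is available as spark.sharedState.externalCatalog. That gives you nearly all the metadata Spark uses to plan queries over Hive tables.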
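A minimal sketch of that route, assuming a spark-shell session with Hive support. The table events and its partition column dt are hypothetical and are created here only so the example is self-contained. Note that listPartitions returns every partition registered in the metastore, so no query-specific pruning is applied.

```scala
// Assumption: spark-shell with Hive support; "events" and "dt" are made-up names.
spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT) PARTITIONED BY (dt STRING)")
spark.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2017-09-14')")

val catalog = spark.sharedState.externalCatalog

// All partitions known to the metastore for default.events (unpruned).
val parts = catalog.listPartitions("default", "events")

parts.foreach { p =>
  // spec maps partition column names to values; the storage location is optional.
  val location = p.storage.locationUri.map(_.toString).getOrElse("<no location>")
  println(s"${p.spec} -> $location")
}
```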


2018-01-17 12:04:04

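Thanks! I was able to get the relevant information by calling 'getHiveQlPartitions' on the returned 'relation' and passing 'partitionPruningPred' as the argument: 'scan.findHiveTables(execPlan).flatMap(e => e.relation.getHiveQlPartitions(e.partitionPruningPred))' This contains all the data I need, including the paths of all input files, with partition pruning correctly applied. Unfortunately, this requires low-level package-private access, and the standard 'inputFiles' does not do this by itself. I assume that is for performance reasons? – binarek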
