2016-11-20 80 views

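I have several Spark DataFrames (call them Table a, Table b, etc.). I want to add a column to table a based on the result of a query against one of the other tables, but which table is queried changes from row to row, depending on the value of one of table a's fields. So the query has to be parameterized. Below is an example to illustrate the problem: adding a column to a Spark DataFrame based on a parameterized SQL query that depends on the values of some of the DataFrame's fields.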

Every table has an OID column and a TableName column holding the name of the current table, plus other columns.

This is the fixed query to run for each row of Tab A to fill the new column:

    Select $ColumnName from $TableName where OID=$oids 

    Tab A
    | oids | TableName | ColumnName  | other fields | New column: ValueOidDb     |
    ==============================================================================
    | 2    | Book      | Title       | x            | query result: harry potter |
    | 8    | Book      | Isbn        | y            | query result: 556          |
    | 1    | Author    | Name        | z            | query result: Tolkien      |
    | 4    | Category  | Description | b            | query result: Commedy      |


    Tab Book
    | OID | TableName | Title        | Isbn | other fields |
    ========================================================
    | 2   | Book      | harry potter | 123  | x            |
    | 8   | Book      | hobbit       | 556  | y            |
    | 21  | Book      | etc          | 8942 | z            |
    | 5   | Book      | etc2         | 984  | b            |

    Tab Author
    | OID | TableName | Name         | nationality | other fields |
    ===============================================================
    | 5   | Author    | J.Rowling    | eng         | x            |
    | 2   | Author    | Geor. Martin | us          | y            |
    | 1   | Author    | Tolkien      | eng         | z            |
    | 13  | Author    | Dan Brown    | us          | b            |


    Tab Category
    | OID | TableName | Description |
    =================================
    | 12  | Category  | Fantasy     |
    | 4   | Category  | Commedy     |
    | 9   | Category  | Thriller    |
    | 7   | Category  | Action      |

I tried this UDF:

    // Attempted UDF: issues one SQL query per input row.
    def setValueOid = (oid: Int, tableName: String, tableColumn: String) => {
      try {
        sqlContext.sql(s"SELECT $tableColumn FROM $tableName WHERE OID = $oid").first().toString()
      } catch {
        case _: java.lang.NullPointerException => "error"
      }
    }
    sqlContext.udf.register("setValueOid", setValueOid)

    val FinalRtxf = sqlContext.sql("SELECT all the column of TAB A ,"
      + " setValueOid(oid, Table,AttributeDatabaseColumn) as  ValueOidDb"
      + " FROM TAB A")

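I wrapped the query in a try/catch because otherwise it throws a NullPointerException, but that doesn't help: the UDF then always returns "error". If I call the function directly, outside of a SQL query, passing some parameters by hand, it works perfectly: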

    val result = setValueOid(8, "BOOK", "ISBN")
    result: String = [0977326403 ]

I read here that it isn't possible to run a query inside a UDF: "Trying to execute a spark sql query from a UDF".

So how can I solve my problem? I don't know how to write a parameterized join. I tried this:

    %sql
    Select all attributes TAB A,
    FROM TAB A as a
    join (Select $AttributeDatabaseColumn ,TableName from $Table where OID=$oid) as b
    on a.Table=b.TableName

but it gave me this exception:

    org.apache.spark.sql.AnalysisException: cannot recognize input near '$' 'AttributeDatabaseColumn' ',' in select clause; line 3 pos 1
        at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
        at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
        at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)

Answer

One option:

  • Transform each of Book, Author, and Category into the form:

    root 
    |-- oid: integer (nullable = false) 
    |-- tableName: string (nullable = true) 
    |-- properties: map (nullable = true) 
    | |-- key: string 
    | |-- value: string (valueContainsNull = true) 
    

    For example, for the first record in Book:

    // The second column is the table name, matching the schema above.
    val book = Seq((2, "Book",
        Map("title" -> "harry potter", "Isbn" -> "123", "other field" -> "x")
    )).toDF("oid", "tableName", "properties")

    +---+---------+---------------------------------------------------------+
    |oid|tableName|properties                                               |
    +---+---------+---------------------------------------------------------+
    |2  |Book     |Map(title -> harry potter, Isbn -> 123, other field -> x)|
    +---+---------+---------------------------------------------------------+
    
  • Union Book, Author, and Category into a single properties DataFrame:

    val properties = book.union(author).union(category) 
    
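  • Join with the base table:

    // The usingColumns overload of join takes column names (Seq[String]), not Column objects.
    val comb = properties.join(table, Seq("oid", "tableName"))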
    
  • Add the new column with case when ..., based on tableName and the properties field.
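
The steps above can also be written generically, so that no map has to be typed by hand for each table. The following is only a minimal sketch, assuming Spark 2.x or later and that every lookup table has `OID` and `TableName` columns as in the question. The names `toProperties`, `addLookupColumn`, `srcOid`, and `srcTable` are hypothetical and introduced here for illustration, and the sample rows are copied from the question's tables. Instead of a hand-written `case when` chain, the sketch reads the value out of the map using Tab A's `ColumnName` as the key, which is a swapped-in simplification of the last step.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, lit, map}

object ParameterizedLookup {

  // Hypothetical helper: keeps OID and TableName as join keys and packs every
  // other column into a map<string,string> keyed by the column name.
  def toProperties(df: DataFrame): DataFrame = {
    val keyValuePairs = df.columns
      .filterNot(Set("OID", "TableName"))
      .flatMap(c => Seq(lit(c), col(c).cast("string")))
    df.select(
      col("OID").as("srcOid"),
      col("TableName").as("srcTable"),
      map(keyValuePairs: _*).as("properties"))
  }

  // Hypothetical helper: unions the property views of all lookup tables, joins
  // them to Tab A on (oids, TableName), and picks the value named by ColumnName.
  def addLookupColumn(tabA: DataFrame, tables: Seq[DataFrame]): DataFrame = {
    val properties = tables.map(toProperties).reduce(_ union _)
    tabA
      .join(properties,
        col("oids") === col("srcOid") && col("TableName") === col("srcTable"),
        "left")
      // Map keys are case-sensitive: ColumnName must match the column's spelling.
      .withColumn("ValueOidDb", expr("properties[ColumnName]"))
      .drop("srcOid", "srcTable", "properties")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("lookup-sketch").getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question's tables (subset of columns).
    val book = Seq((2, "Book", "harry potter", "123"), (8, "Book", "hobbit", "556"))
      .toDF("OID", "TableName", "Title", "Isbn")
    val author = Seq((1, "Author", "Tolkien", "eng"), (2, "Author", "Geor. Martin", "us"))
      .toDF("OID", "TableName", "Name", "nationality")
    val tabA = Seq((2, "Book", "Title"), (8, "Book", "Isbn"), (1, "Author", "Name"))
      .toDF("oids", "TableName", "ColumnName")

    addLookupColumn(tabA, Seq(book, author)).show(false)
    spark.stop()
  }
}
```

Because all values are cast to string, `ValueOidDb` is always a string column, and rows of Tab A with no matching record get null. The whole lookup runs as one join rather than one query per row, which is what the UDF attempt could not do.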

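I'm new to Spark. How can I transform each DataFrame (Book, Author, etc.) into this form? In this application the DataFrames happen to be Book, Author, and so on, but my program will run in different applications, and the DataFrames may change over time (only Table A will always be there). I'd like a generic method rather than one tied to this specific example, because I don't know a priori what Book, Author, etc. will be. Is this possible? Thanks – Thanas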

Also, those tables (Book, Author, etc.) have thousands of rows, so I can't build this mapping by hand; that would be crazy – Thanas