2017-02-24 68 views
3

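How to insert data from arrays returned by XPath into a Hive table

I have a Hive query that uses XPath to return a set of arrays from XML. I want to insert the elements of these arrays into a Hive table.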

The XML content in the hivexml table is:

<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag> 

This query returns a set of arrays:

select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

The output of the above query (a set of arrays) is:

["1","2","3","4","5"] [".net","html","css","php","c"] ["244006","602809","434937","1009113","236386"] ["3624959","3673183","3644670","3624936","3624961"] ["3607476","3673182","3644669","3607050","3607013"]

I want to insert these values into a Hive table in this format:

1 .net 244006  3624959 3607476 
2 html 602809  3673183 3673182 
3 css  434937  3644670 3644669 
4 php  1009113 3624936 3607050 
5 c  236386  3624961 3607013 

If I run an insert using the select query above:

insert into newhivexml select xpath(str,'/tag/row/@Id'), xpath(str,'/tag/row/@TagName'), xpath(str,'/tag/row/@Count'), xpath(str,'/tag/row/@ExcerptPostId'), xpath(str,'/tag/row/@WikiPostId') from hivexml;

then I get an error:

NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToInteger with (array). Possible choices: FUNC(bigint) FUNC(boolean) FUNC(decimal(38,18)) FUNC(double) FUNC(float) FUNC(smallint) FUNC(string) FUNC(struct) FUNC(timestamp) FUNC(tinyint) FUNC(void)

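I think we can't insert the arrays directly like this, and I am missing something here. Can anyone tell me how to do this, that is, how to insert the values from these arrays into the table?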

+0

Just to make sure: the XML is only the value of a column in the table, not the whole data set, right?

Answers

2

xpath_...(str,concat('/tag/row[',pe.pos+1,']/@...'))

create table hivexml (str string); 

insert into hivexml values ('<tag><row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" /><row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" /><row Id="3" TagName="javascript" Count="1274350" ExcerptPostId="3624960" WikiPostId="3607052" /><row Id="4" TagName="css" Count="434937" ExcerptPostId="3644670" WikiPostId="3644669" /><row Id="5" TagName="php" Count="1009113" ExcerptPostId="3624936" WikiPostId="3607050" /><row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" /></tag>'); 

select xpath_int (str,concat('/tag/row[',pe.pos+1,']/@Id'   )) as Id 
     ,xpath_string (str,concat('/tag/row[',pe.pos+1,']/@TagName'  )) as TagName 
     ,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@Count'  )) as Count 
     ,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@ExcerptPostId')) as ExcerptPostId 
     ,xpath_int (str,concat('/tag/row[',pe.pos+1,']/@WikiPostId' )) as WikiPostId 

from hivexml 
     lateral view posexplode (xpath(str,'/tag/row/@Id')) pe 
; 

+----+------------+---------+---------------+------------+ 
| id | tagname | count | excerptpostid | wikipostid | 
+----+------------+---------+---------------+------------+ 
| 1 | .net  | 244006 |  3624959 | 3607476 | 
| 2 | html  | 602809 |  3673183 | 3673182 | 
| 3 | javascript | 1274350 |  3624960 | 3607052 | 
| 4 | css  | 434937 |  3644670 | 3644669 | 
| 5 | php  | 1009113 |  3624936 | 3607050 | 
| 8 | c   | 236386 |  3624961 | 3607013 | 
+----+------------+---------+---------------+------------+ 
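The query above works by exploding the array of Id values only to obtain a position, then asking XPath for each attribute of the row at that position. As a rough illustration of that logic outside Hive, here is a minimal Python sketch using the standard library. The function name `rows_by_position` and the trimmed three-row sample are assumptions made for the example; this is not how Hive implements `posexplode` or `xpath_int`.

```python
import xml.etree.ElementTree as ET

# Trimmed sample of the XML stored in hivexml.str (three of the six rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)


def rows_by_position(xml_text):
    """Hypothetical helper: build one tuple per <row>, addressed by position."""
    root = ET.fromstring(xml_text)  # root is the <tag> element
    # Counterpart of xpath(str,'/tag/row/@Id'): one entry per <row>.
    ids = [row.get("Id") for row in root.findall("row")]
    result = []
    # Counterpart of posexplode: pos is 0-based, XPath positions are 1-based,
    # hence the pos + 1 in the predicate.
    for pos, _ in enumerate(ids):
        row = root.find(f"row[{pos + 1}]")
        result.append((
            int(row.get("Id")),
            row.get("TagName"),
            int(row.get("Count")),
            int(row.get("ExcerptPostId")),
            int(row.get("WikiPostId")),
        ))
    return result


for record in rows_by_position(XML):
    print(record)
```

Note that the position, not the Id value, selects the row: the third row has Id 8 but is still addressed as `row[3]`.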
+0

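Thanks, it worked! One small glitch, though: we can't put line breaks in the query. It shows the error "The syntax of the command is incorrect." If I put everything on one line, it works!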

0

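The problem is that the xpath function returns all matching results for each request in its own array, without joining them. If it suits you, you can use Pig instead; this batch approach simplifies the process by breaking it down into individual steps: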

REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

A = LOAD '/tmp/text.xml' using org.apache.pig.piggybank.storage.XMLLoader('tag') as (x:chararray); 

B = FOREACH A GENERATE XPathAll(x, 'row/@Id',false,false).$0, 
    XPathAll(x, 'row/@TagName',false,false).$0, 
    XPathAll(x, 'row/@Count',false,false).$0, 
    XPathAll(x, 'row/@ExcerptPostId',false,false).$0, 
    XPathAll(x, 'row/@WikiPostId',false,false).$0; 

DUMP B; 

(1,.net,244006,3624959,3607476) 
(2,html,602809,3673183,3673182) 
(3,javascript,1274350,3624960,3607052) 
(4,css,434937,3644670,3644669) 
(5,php,1009113,3624936,3607050) 
(8,c,236386,3624961,3607013) 

STORE B INTO 'YourTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
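To make the point about unjoined arrays concrete: each xpath call yields one array per attribute, and turning them into rows means pairing up the elements that share a position. A minimal Python sketch of that idea follows; `arrays_to_rows` is a hypothetical helper name, and the values are the first three rows of the sample XML.

```python
# Parallel arrays, one per attribute, as separate xpath() calls would return them
# (values for the first three rows of the sample XML).
ids = ["1", "2", "3"]
tag_names = [".net", "html", "javascript"]
counts = ["244006", "602809", "1274350"]


def arrays_to_rows(*arrays):
    """Hypothetical helper: pair up elements that share the same position."""
    # If an attribute were missing from some row, its array would be shorter
    # and positions would no longer line up, so refuse to guess.
    if len({len(a) for a in arrays}) > 1:
        raise ValueError("arrays have different lengths; positions do not line up")
    return list(zip(*arrays))


for row in arrays_to_rows(ids, tag_names, counts):
    print(row)
```

The length check reflects a real limitation of positional pairing: it is only safe when every row carries every attribute.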
1

xpath(str,concat('/tag/row[',pe.pos+1,']/@*'))

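This is a very clean way to extract all the values of an element.
What surprised me is that the attributes do not come back in the order in which they appear in the XML, but in alphabetical order of their names:
@Count, @ExcerptPostId, @Id, @TagName, @WikiPostId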

Unfortunately, I can't consider it a legitimate solution unless I know that the alphabetical attribute order is guaranteed.

select xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values 

from hivexml 
     lateral view posexplode (xpath(str,'/tag/row/@Id')) pe 
; 

-

["244006","3624959","1",".net","3607476"] 
["602809","3673183","2","html","3673182"] 
["1274350","3624960","3","javascript","3607052"] 
["434937","3644670","4","css","3644669"] 
["1009113","3624936","5","php","3607050"] 
["236386","3624961","8","c","3607013"] 

select row_values[2] as Id 
     ,row_values[3] as TagName 
     ,row_values[0] as Count  
     ,row_values[1] as ExcerptPostId 
     ,row_values[4] as WikiPostId 

from (select xpath (str,concat('/tag/row[',pe.pos+1,']/@*')) as row_values 

     from hivexml 
       lateral view posexplode (xpath(str,'/tag/row/@Id')) pe 
     ) x 
; 

+----+------------+---------+---------------+------------+ 
| id | tagname | count | excerptpostid | wikipostid | 
+----+------------+---------+---------------+------------+ 
| 1 | .net  | 244006 |  3624959 | 3607476 | 
| 2 | html  | 602809 |  3673183 | 3673182 | 
| 3 | javascript | 1274350 |  3624960 | 3607052 | 
| 4 | css  | 434937 |  3644670 | 3644669 | 
| 5 | php  | 1009113 |  3624936 | 3607050 | 
| 8 | c   | 236386 |  3624961 | 3607013 | 
+----+------------+---------+---------------+------------+ 
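The index numbers used in the query (2 for Id, 3 for TagName, and so on) follow from sorting the attribute names. The short Python sketch below derives that mapping rather than hard-coding it. It rests on the same assumption the answer flags as unconfirmed, namely that the values always arrive in alphabetical order of the attribute names; `alphabetical_index` is a hypothetical helper name.

```python
# Attribute names in the order they appear in the XML.
ATTRIBUTES = ["Id", "TagName", "Count", "ExcerptPostId", "WikiPostId"]


def alphabetical_index(names):
    """Hypothetical helper: map each name to its position after sorting.

    ASSUMPTION: the values come back sorted by attribute name, as observed
    in the output above. Nothing here guarantees that order.
    """
    return {name: i for i, name in enumerate(sorted(names))}


index = alphabetical_index(ATTRIBUTES)

# First array from the output above.
row_values = ["244006", "3624959", "1", ".net", "3607476"]
record = {name: row_values[index[name]] for name in ATTRIBUTES}

print(index)
print(record)
```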
+1

You are a true Hive master. I never imagined something like this could be done in Hive in a single query. +1 for each solution – Alex

1

split + str_to_map

select vals["Id"]    as Id 
     ,vals["TagName"]   as TagName 
     ,vals["Count"]   as Count  
     ,vals["ExcerptPostId"] as ExcerptPostId 
     ,vals["WikiPostId"]  as WikiPostId 

from (select str_to_map(e.val,' ','=') as vals 

     from hivexml 
       lateral view posexplode(split(translate(str,'"',''),'/?><row')) e 

     where e.pos <> 0 
     ) x 
; 

+----+------------+---------+---------------+------------+ 
| id | tagname | count | excerptpostid | wikipostid | 
+----+------------+---------+---------------+------------+ 
| 1 | .net  | 244006 |  3624959 | 3607476 | 
| 2 | html  | 602809 |  3673183 | 3673182 | 
| 3 | javascript | 1274350 |  3624960 | 3607052 | 
| 4 | css  | 434937 |  3644670 | 3644669 | 
| 5 | php  | 1009113 |  3624936 | 3607050 | 
| 8 | c   | 236386 |  3624961 | 3607013 | 
+----+------------+---------+---------------+------------+ 
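This variant treats the XML as plain text: strip the double quotes, split on the boundary between rows, then turn each `key=value` run into a map. A minimal Python sketch of the same three steps is shown below. The `str_to_map` function here is a simplified stand-in written for the example, not Hive's implementation, and the sample is trimmed to three rows.

```python
import re

# Trimmed sample of the XML stored in hivexml.str (three of the six rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)


def str_to_map(text, pair_delim=" ", kv_delim="="):
    """Simplified stand-in for str_to_map: split into pairs, then key/value."""
    result = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(kv_delim)
        # Tokens without '=' (empty strings, the trailing '/></tag>') get no value.
        result[key] = value if sep else None
    return result


def parse_rows(xml_text):
    unquoted = xml_text.replace('"', "")        # translate(str,'"','')
    pieces = re.split(r"/?><row", unquoted)     # split(..., '/?><row')
    return [str_to_map(p) for p in pieces[1:]]  # where e.pos <> 0


for vals in parse_rows(XML):
    print(vals["Id"], vals["TagName"], vals["Count"],
          vals["ExcerptPostId"], vals["WikiPostId"])
```

Because nothing is parsed as XML, the approach depends on attribute values containing no spaces, equals signs, or double quotes, which holds for this data.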
1

If the data is an XML document

The XML SerDe can be obtained from https://github.com/01org/graphbuilder/blob/master/src/com/intel/hadoop/graphbuilder/preprocess/inputformat/XMLInputFormat.java

add jar /home/cloudera/hivexmlserde-1.0.5.3.jar; 

create external table hivexml_ext 
(
    Id    string 
    ,TagName   string 
    ,Count   string 
    ,ExcerptPostId string 
    ,WikiPostId  string 
) 
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
with serdeproperties 
(
    "column.xpath.Id"   = "/row/@Id" 
    ,"column.xpath.TagName"  = "/row/@TagName" 
    ,"column.xpath.Count"   = "/row/@Count" 
    ,"column.xpath.ExcerptPostId" = "/row/@ExcerptPostId" 
    ,"column.xpath.WikiPostId" = "/row/@WikiPostId" 
) 
stored as 
inputformat  'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
location  '/user/hive/warehouse/hivexml' 
tblproperties 
(
    "xmlinput.start" = "<row" 
    ,"xmlinput.end" = "/>" 
) 
; 

select * from hivexml_ext as x 
; 

+------+------------+---------+-----------------+--------------+ 
| x.id | x.tagname | x.count | x.excerptpostid | x.wikipostid | 
+------+------------+---------+-----------------+--------------+ 
| 1 | .net  | 244006 |   3624959 |  3607476 | 
| 2 | html  | 602809 |   3673183 |  3673182 | 
| 3 | javascript | 1274350 |   3624960 |  3607052 | 
| 4 | css  | 434937 |   3644670 |  3644669 | 
| 5 | php  | 1009113 |   3624936 |  3607050 | 
| 8 | c   | 236386 |   3624961 |  3607013 | 
+------+------------+---------+-----------------+--------------+ 
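Conceptually, the table definition does two things: the `xmlinput.start` and `xmlinput.end` properties cut the file into one record per `<row ... />`, and each `column.xpath.*` property then pulls one attribute out of that record. The Python sketch below imitates those two steps on a trimmed sample. It is only an illustration of the idea; `split_records` and `read_table` are hypothetical names, and the real SerDe's behaviour may differ in details.

```python
import xml.etree.ElementTree as ET

# Trimmed sample of the file content (three of the six rows).
XML = (
    '<tag>'
    '<row Id="1" TagName=".net" Count="244006" ExcerptPostId="3624959" WikiPostId="3607476" />'
    '<row Id="2" TagName="html" Count="602809" ExcerptPostId="3673183" WikiPostId="3673182" />'
    '<row Id="8" TagName="c" Count="236386" ExcerptPostId="3624961" WikiPostId="3607013" />'
    '</tag>'
)

COLUMNS = ["Id", "TagName", "Count", "ExcerptPostId", "WikiPostId"]


def split_records(text, start="<row", end="/>"):
    """Cut the text into records delimited by the start and end markers."""
    records = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        finish = text.find(end, begin)
        if finish == -1:
            break
        finish += len(end)
        records.append(text[begin:finish])
        pos = finish
    return records


def read_table(text):
    """Apply one attribute lookup per column, like the /row/@Name XPaths."""
    return [
        tuple(ET.fromstring(record).get(column) for column in COLUMNS)
        for record in split_records(text)
    ]


for row in read_table(XML):
    print(row)
```

All columns come back as strings, matching the `string` column types declared in the table.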
+0

I don't have Java on my computer. Will the above code run in PowerShell if I copy it as is? I'm worried about the first line, which adds the jar file.

+0

After you download the jar, the 'add jar' command should be executed in Hive. Put the jar wherever you like and change the path accordingly.

+0

Should the jar file be on my local machine or on Azure? I have put it on my local machine, but it says the file does not exist.
