2017-05-25

Loading nested XML data into Hive using a SerDe

I am trying to load nested XML data into Hive. The sample data is as follows:

<CustomerOrders> 
    <Customers> 
    <CustID>ALFKI</CustID> 
    <Orders> 
     <OrderID>10643</OrderID> 
     <CustomerID>ALFKI</CustomerID> 
     <OrderDate>1997-08-25</OrderDate> 
    </Orders> 
    <Orders> 
     <OrderID>10692</OrderID> 
     <CustomerID>ALFKI</CustomerID> 
     <OrderDate>1997-10-03</OrderDate> 
    </Orders> 
    <CompanyName>Alfreds Futterkiste</CompanyName> 
    </Customers> 
    <Customers> 
    <CustID>ANATR</CustID> 
    <Orders> 
     <OrderID>10308</OrderID> 
     <CustomerID>ANATR</CustomerID> 
     <OrderDate>1996-09-18</OrderDate> 
    </Orders> 
    <CompanyName>Ana Trujillo Emparedados y helados</CompanyName> 
    </Customers> 
</CustomerOrders> 

Below is the command I am using:

CREATE TABLE CUSTOMERORDERS(
      CustID STRING, 
      Orders ARRAY<STRUCT<OrderID:STRING,CustomerID:STRING,OrderDate:STRING>>, 
      CompanyName STRING) 
      ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
      WITH SERDEPROPERTIES (
      "column.xpath.CustID"="/Customers/CustID/text()", 
      "column.xpath.Orders"="/Customers/Orders", 
      "column.xpath.OrderID"="/Customers/Orders/OrderID", 
      "column.xpath.CustomerID"="/Customers/Orders/CustomerID", 
      "column.xpath.OrderDate"="/Customers/Orders/OrderDate", 
      "column.xpath.CompanyName"="/Customers/CompanyName/text()") 
      STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
      TBLPROPERTIES ("xmlinput.start"="<Customers>","xmlinput.end"= "</Customers>"); 

The output I am getting is:

hive> select * from customerorders; 
OK 
ALFKI [{"orderid":null,"customerid":null,"orderdate":null},{"orderid":null,"customerid":null,"orderdate":null}]  Alfreds Futterkiste 
ANATR [{"orderid":null,"customerid":null,"orderdate":null}] Ana Trujillo Emparedados y helados 
Time taken: 0.039 seconds, Fetched: 2 row(s) 

I am getting null values for OrderID, CustomerID, and OrderDate. Can anyone help me resolve this?

Thanks


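I think I should not configure 'OrderID', 'CustomerID', and 'OrderDate' in 'SERDEPROPERTIES', because they are not table columns, so I removed them. I also tried '/text()' for 'Orders'. In that case I get 'NULL':

hive> select * from customerorders;
OK
ALFKI NULL Alfreds Futterkiste
ANATR NULL Ana Trujillo Emparedados y helados
Time taken: 0.037 seconds, Fetched: 2 row(s)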

Answer

create external table customerorders 
(
    custid  string 
    ,orders  array<struct<Orders:struct<OrderID:string,CustomerID:string,OrderDate:string>>> 
    ,companyname string 
) 
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
with serdeproperties 
(
    "column.xpath.CustID"  = "/Customers/CustID/text()" 
    ,"column.xpath.Orders"  = "/Customers/Orders" 
    ,"column.xpath.CompanyName" = "/Customers/CompanyName/text()" 
) 

stored as 
inputformat  'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
tblproperties 
(
    "xmlinput.start" = "<Customers>" 
    ,"xmlinput.end"  = "</Customers>" 
); 

-

select * from customerorders 
; 

-

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+ 
| custid |                   orders                   |   companyname    | 
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+ 
| ALFKI | [{"orders":{"orderid":"10643","customerid":"ALFKI","orderdate":"1997-08-25"}},{"orders":{"orderid":"10692","customerid":"ALFKI","orderdate":"1997-10-03"}}] | Alfreds Futterkiste    | 
| ANATR | [{"orders":{"orderid":"10308","customerid":"ANATR","orderdate":"1996-09-18"}}]                    | Ana Trujillo Emparedados y helados | 
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+   
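
In the output above, each array element is wrapped in an `orders` struct, so the individual fields sit one level deeper than in the original table definition. If one row per order is wanted, a `LATERAL VIEW explode` query along these lines may help. This is only a sketch against the `customerorders` table defined above; the aliases `c`, `t`, and `o` are names chosen here for illustration.

```sql
-- Flatten the orders array: one output row per order.
-- c, t and o are illustrative aliases, not part of the original answer.
select
    c.custid
    ,o.orders.orderid   as orderid    -- fields live inside the inner "orders" struct
    ,o.orders.orderdate as orderdate
    ,c.companyname
from customerorders c
lateral view explode(c.orders) t as o
;
```

With the sample data this should produce three rows: two for ALFKI and one for ANATR.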

Thank you very much for the solution.