2017-10-07 79 views

How do I parse nested XML inside a text file using Spark RDDs?

I have data like this:

1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232 

We can easily parse ordinary XML files with Scala's XML support, or with the Databricks XML format, but how do I parse XML that is embedded in text?


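// String.split takes a regex, so the caret must be escaped; the fields are strings, so compare with "100"
val top5duration = data.map(line => line.split("\\^")).filter(fields => fields(2) == "100").map(fields => fields(4))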
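A minimal sketch of the splitting step on its own, without Spark, assuming the sample record from the question. An unescaped "^" is a start-of-input anchor in a regular expression, so the line is never split on carets; quoting the delimiter avoids that. The object name and the `stripBackticks` helper are hypothetical names used only for illustration.

```scala
import java.util.regex.Pattern

object SplitCaretSketch {
  // Sample record copied from the question.
  val sample: String =
    "1234^12^999^`<row><ab key=\"someKey\" value=\"someValue\"/><ab key=\"someKey1\" value=\"someValue1\"/></row>`^23232"

  // Pattern.quote turns "^" into a literal delimiter;
  // the limit -1 keeps trailing empty fields instead of dropping them.
  def fields(line: String): Array[String] =
    line.split(Pattern.quote("^"), -1)

  // Hypothetical helper: drop the backticks that wrap the XML field in the sample.
  def stripBackticks(field: String): String =
    field.stripPrefix("`").stripSuffix("`")

  def main(args: Array[String]): Unit = {
    val parts = fields(sample)
    println(parts.length)              // number of fields in the sample record
    println(stripBackticks(parts(3)))  // the embedded XML, ready for an XML parser
  }
}
```

In the sample record the XML sits at index 3; in data with a different layout the index would have to be adjusted.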



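Avik, could you post some sample data? I don't think parsing out the XML should be very tricky once we have something to look at. As of now, I think we're all playing "the blind men and the elephant".. –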


Sure: ^ 200^2017-06-05 22:35:21.543^^ –



Question: how do I handle nested XML elements? How would I access them?


For example, suppose I want every combination of title (string type) and author (WrappedArray). This can be achieved with explode:


schema : 

|-- title: string (nullable = true) 
|-- author: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- initial: array (nullable = true) 
| | | |-- element: string (containsNull = true) 
| | |-- lastName: string (nullable = true) 

|    title|    author| 
|Proper Motions of...|[[WrappedArray(J,...| 
|Catalogue of 2055...|[[WrappedArray(J,...| 
|    null|    null| 
|Katalog von 3356 ...|[[WrappedArray(J)...| 
|Astrographic Cata...|[[WrappedArray(P)...| 
|Astrographic Cata...|[[WrappedArray(P)...| 
|Results of observ...|[[WrappedArray(H,...| 
|  AGK3 Catalogue|[[WrappedArray(W)...| 
|Perth 70: A Catal...|[[WrappedArray(E)...| 

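import org.apache.spark.sql.functions;
// Explode the author array so that each (title, author) pair gets its own row.
DataFrame exploded = src.select(src.col("title"), functions.explode(src.col("author")).as("auth"));
// Flatten the struct fields; the column list is inferred from the result schema that follows.
exploded = exploded.select(exploded.col("auth.initial"), exploded.col("title"), exploded.col("auth.lastName"));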



|-- initial: array (nullable = true) 
| |-- element: string (containsNull = true) 
|-- title: string (nullable = true) 
|-- lastName: string (nullable = true) 

|initial|    title|  lastName| 
| [J, H]|Proper Motions of...|  Spencer| 
| [J]|Proper Motions of...|  Jackson| 
| [J, H]|Catalogue of 2055...|  Spencer| 
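To make the effect of explode concrete, here is a rough sketch of the same reshaping with plain Scala collections instead of a DataFrame. The case classes are hypothetical stand-ins for the schema shown above, and the sample values are taken from the result rows listed above; this is an illustration of the idea, not the Spark implementation.

```scala
object ExplodeSketch {
  // Hypothetical case classes mirroring the nested schema (title plus an array of authors).
  final case class Author(initial: Seq[String], lastName: String)
  final case class Record(title: String, author: Seq[Author])
  // Flattened shape: one row per (title, author) pair.
  final case class Flat(initial: Seq[String], title: String, lastName: String)

  // Emits one Flat row for every author of every record.
  // Records with an empty author list produce no rows.
  def explode(records: Seq[Record]): Seq[Flat] =
    for {
      r <- records
      a <- r.author
    } yield Flat(a.initial, r.title, a.lastName)

  def main(args: Array[String]): Unit = {
    val src = Seq(
      Record("Proper Motions of...", Seq(Author(Seq("J", "H"), "Spencer"), Author(Seq("J"), "Jackson"))),
      Record("Catalogue of 2055...", Seq(Author(Seq("J", "H"), "Spencer")))
    )
    explode(src).foreach(println)
  }
}
```

Note that this sketch does not handle a null author list (such as the null row in the table above); a real dataset would need that case guarded.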


<?xml version='1.0' ?> 
<!DOCTYPE datasets SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/nasa/dataset_053.dtd"> 
<dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9"> 
    <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees 
of 20843 Stars for 1900</title> 
    <altname type="ADC">1005</altname> 
    <altname type="CDS">I/5</altname> 
    <altname type="brief">Proper Motions in Cape Zone Catalogue -40/-52</altname> 
    <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees 
of 20843 Stars for 1900</title> 
    <name>His Majesty's Stationery Office, London</name> 
    <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html"> 
    <keyword xlink:href="Positional_data.html">Positional data</keyword> 
    <keyword xlink:href="Proper_motions.html">Proper motions</keyword> 
    <para>This catalog, listing the proper motions of 20,843 stars 
    from the Cape Astrographic Zones, was compiled from three series of 
    photographic plates. The plates were taken at the Royal Observatory, 
    Cape of Good Hope, in the following years: 1892-1896, 1897-1910, 
    1923-1928. Data given include centennial proper motion, photographic 
    and visual magnitude, Harvard spectral type, Cape Photographic 
    Durchmusterung (CPD) identification, epoch, right ascension and 
    declination for 1900.</para> 
    <tableLink xlink:href="czc.dat"> 
    <title>The catalogue</title> 
    <definition>Number 5</definition> 
    <definition>Catalogue Identification Number</definition> 
    <definition>Visual Magnitude</definition> 
    <definition>Right Ascension for 1900 hours</definition> 
    <definition>Right Ascension for 1900 minutes</definition> 
    <definition>Right Ascension seconds in 0.01sec 1900</definition> 
    <definition>Declination Sign</definition> 
    <definition>Declination for 1900 degrees</definition> 
    <definition>Declination for 1900 arcminutes</definition> 
    <definition>Declination for 1900 arcseconds</definition> 
    <definition>Epoch -1900</definition> 
    <definition>Cape Photographic 
             Durchmusterung Zone</definition> 
    <definition>Cape Photographic Durchmusterung Number</definition> 
    <definition>Photographic Magnitude</definition> 
    <definition>HD Spectral Type</definition> 
    <definition>Proper Motion in RA 
     <para>the relation is pmRA = 15 * pmRAs * cos(DE) 
    if pmRAs is expressed in s/yr and pmRA in arcsec/yr</para> 
    <definition>Proper Motion in RA</definition> 
    <definition>Proper Motion in Dec</definition> 
    <lastName>Julie Anne Watko</lastName> 
<dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9"> 
    <title>Catalogue of 20554 Faint Stars in the Cape Astrographic Zone -40 to -52 Degrees 
for the Equinox of 1900.0</title> 
    <altname type="ADC">1006</altname> 
    <altname type="CDS">I/6</altname> 
    <altname type="brief">Cape 20554 Faint Stars, -40 to -52, 1900.0</altname> 
    <title>Catalogue of 20554 Faint Stars in the Cape Astrographic Zone -40 to -52 Degrees 
for the Equinox of 1900.0</title> 
    <name>His Majesty's Stationery Office, London</name> 
    <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html"> 
    <keyword xlink:href="Positional_data.html">Positional data</keyword> 
    <keyword xlink:href="Proper_motions.html">Proper motions</keyword> 
    <para>This catalog contains positions, precessions, proper motions, and 
    photographic magnitudes for 20,554 stars. These were derived from 
    photographs taken at the Royal Observatory, Cape of Good Hope between 1923 
    and 1928. It covers the astrographic zones -40 degrees to -52 degrees of 
    declination. The positions are given for epoch 1900 (1900.0). It includes 
    spectral types for many of the stars listed. It extends the earlier 
    catalogs derived from the same plates to fainter magnitudes. The 
    computer-readable version consists of a single data table.</para> 
    <para>The stated probable error for the star positions is 0.024 seconds of time 
    (R.A.) and 0.25 seconds of arc (dec.) for stars with one determination, 
    0.017 seconds of time, and 0.18 seconds of arc for two determinations, and 
    0.014/0.15 for stars with three determinations.</para> 
    <para>The precession and secular variations were derived from Newcomb's constants.</para> 
    <para>The authors quote probable errors of the proper motions in both coordinates 
    of 0.008 seconds of arc for stars with one determination, 0.0055 seconds for 
    stars with two determinations, and 0.0044 for stars with three.</para> 
    <para>The photographic magnitudes were derived from the measured diameters on the 
    photographic plates and from the magnitudes given in the Cape Photographic 
    <para>The spectral classification of the cataloged stars was done with the 
    assistance of Annie Jump Cannon of the Harvard College Observatory.</para> 
    <para>The user should consult the source reference for more details of the 
    measurements and reductions. See also the notes in this document for 
    additional information on the interpretation of the entries.</para> 
    <tableLink xlink:href="faint.dat"> 
    <definition>Cape Number</definition> 
     <para>A = Astrographic Star 
    F = Faint Proper Motion Star 
    N = Other Note</para> 
    <definition>Cape Phot. Durchmusterung (CPD) Zone 
     <para>All CPD Zones are negative. - signs are not included in data. 
     "0" in column 8 signifies Astrographic Plate instead of CPD.</para> 
    <definition>CPD Number or Astrographic Plate 
     <para>See also note on CPDZone. 
     Astrographic plate listed "is the more southerly on which the 
     star occurs." Thus, y-coordinate is positive wherever possible.</para> 
    <definition>[1234] Remarks 
     <para>A number from 1-4 appears in this byte for double stars where 
    the same CPD number applies to more than one star.</para> 
    <definition>Photographic Magnitude 
     <para>The Photographic Magnitude is "determined from the CPD Magnitude 
     and the diameter on the Cape Astrographic Plates by means of the 
     data given in the volume on the Magnitudes of Stars in the Cape 
     Zone Catalogue." 
    A null value (99.9) signifies a variable star.</para> 
    <definition>Mean Right Ascension hours 1900</definition> 
    <definition>Mean Right Ascension minutes 1900</definition> 
    <definition>Mean Right Ascension seconds 1900</definition> 
    <definition>Mean Declination degrees 1900</definition> 
    <definition>Mean Declination arcminutes 1900</definition> 
    <definition>Mean Declination arcseconds 1900</definition> 
    <definition>Number of Observations</definition> 
    <definition>Epoch +1900</definition> 
    <definition>Proper Motion in RA seconds of time</definition> 
    <definition>Proper Motion in RA arcseconds</definition> 
    <definition>Proper Motion in Dec arcseconds</definition> 
    <definition>HD Spectral Type</definition> 
    <lastName>Julie Anne Watko</lastName> 
<dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9"> 
    <title>Proper Motions of 1160 Late-Type Stars</title> 
    <altname type="ADC">1014</altname> 
    <altname type="CDS">I/14</altname> 
    <altname type="brief">Proper Motions of 1160 Late-Type Stars</altname> 
    <title>Proper Motions of 1160 Late-Type Stars</title> 
     <lastName>Fogh Olsen</lastName> 
    <name>Astron. Astrophys. Suppl. Ser.</name> 
    <holding role="similar">II/38 : Stars observed photoelectrically by Dickow et al. 
    <xlink:simple href="II/38"/> 
    </holding>Fogh Olsen H.J. 1970, Astron. Astrophys. Suppl. Ser., 2, 69. 
    Fogh Olsen H.J. 1970, Astron. Astrophys., Suppl. Ser., 1, 189.</related> 
    <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html"> 
    <keyword xlink:href="Proper_motions.html">Proper motions</keyword> 
    <para>Improved proper motions for the 1160 stars contained in the photometric 
    catalog by Dickow et al. (1970) are presented. Most of the proper motions 
    are from the GC, transferred to the system of FK4. For stars not included 
    in the GC, preliminary AGK or SAO proper motions are given. Fogh Olsen 
    (Astron. Astrophys. Suppl. Ser., 1, 189, 1970) describes the method of 
    improvement. The mean errors of the centennial proper motions increase with 
    increasing magnitude. In Right Ascension, these range from 0.0043/cos(dec) 
    for very bright stars to 0.096/cos(dec) for the faintest stars. In Dec- 
    lination, the range is from 0.065 to 1.14.</para> 
    <tableLink xlink:href="pmlate.dat"> 
    <title>Proper motion data</title> 
     <para>Henry Draper or Bonner Durchmusterung number</para> 
    <definition>Centennial Proper Motion RA</definition> 
    <definition>Centennial Proper Motion Dec</definition> 
    <definition>Radial Velocity</definition> 
    <lastName>Julie Anne Watko</lastName> 
<dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9"> 
    <title>Katalog von 3356 Schwachen Sternen fuer das Aequinoktium 1950 
+89 degrees</title> 
    <altname type="ADC">1016</altname> 
    <altname type="CDS">I/16</altname> 
    <altname type="brief">Catalog of 3356 Faint Stars, 1950</altname> 
    <title>Katalog von 3356 Schwachen Sternen fuer das Aequinoktium 1950 
+89 degrees</title> 
    <name>Verlag der Sternwarte, Hamburg-Bergedorf</name> 
    <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html"> 
    <keyword xlink:href="Fundamental_catalog.html">Fundamental catalog</keyword> 
    <keyword xlink:href="Positional_data.html">Positional data</keyword> 
    <keyword xlink:href="Proper_motions.html">Proper motions</keyword> 
    <para>This catalog of 3356 faint stars was derived from meridian circle 
    observations at the Bergedorf and Heidelberg Observatories. The 
    positions are given for the equinox 1950 on the FK3 system. The stars 
    are mainly between 8.0 and 10.0 visual magnitude. A few are brighter 
    than 8.0 mag. The lower limit in brightness resulted from the visibility 
    of the stars.</para> 
    <para>All stars were observed at both the Heidelberg and Bergedorf 
    Observatories. Normally, at each observatory, two observations were 
    obtained with the clamp east and two with the clamp west. The mean 
    errors are comparable for the two observatories with no significant 
    systematic difference in the positions between them. The mean errors of 
    the resulting positions should be approximated 0.011s/cos(dec) in right 
    ascension and).023" in declination.</para> 
    <para>The proper motions were derived from a comparison with the catalog 
    positions with the positions in the AGK2 and AGK2A with a 19 year 
    baseline and from a comparison of new positions with those in Kuestner 
    1900 with about a fifty year baseline.</para> 
    <para>The magnitudes were taken from the AGK2. Most spectral types were 
    determined by A. N. Vyssotsky. A few are from the Bergedorfer 
    <tableLink xlink:href="catalog.dat"> 
    <title>The catalog</title> 
    <definition>Catalog number</definition> 
    <definition>BD zone</definition> 
    <definition>BD number</definition> 
    <definition>Photographic magnitude</definition> 
    <definition>Spectral class</definition> 
    <definition>Right Ascension hours (1950)</definition> 
    <definition>Right Ascension minutes (1950)</definition> 
    <definition>Right Ascension seconds (1950)</definition> 
    <definition>First order precession in RA per century</definition> 
    <definition>Second order precession in RA per century</definition> 
    <definition>Proper motion in RA from AGK2 positions</definition> 
    <definition>Proper motion in RA from Kuestner positions</definition> 
    <definition>Sign of declination (1950)</definition> 
    <definition>Declination degrees (1950)</definition> 
    <definition>Declination minutes (1950)</definition> 
    <definition>Declination seconds (1950)</definition> 
    <definition>First order precession in dec per century</definition> 
    <definition>Second order precession in dec per century</definition> 
    <definition>Proper motion in DE from AGK2 positions</definition> 
    <definition>Proper motion in DE from Kuestner positions</definition> 
    <definition>Epoch of observation - 1900.0</definition> 
    <definition>Note for star in printed catalog 
     <para>1 = ma (blend?) 
    3 = pr (preceding) 
    4 = seq (following) 
    5 = bor (northern) 
    6 = au (southern) 
    * = other note in printed volume (All notes in the printed volume have not 
     been indicated in this version.) 
    the printed volume sometimes has additional information on the systems with 
    numerical remarks.</para> 
    <lastName>Nancy Grace Roman</lastName> 

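I don't have a pure XML file. I have a text file with XML data embedded in it. Also, I need to extract data based on the attributes (key and value) of each 'ab' tag. Could you tell me how to extract these values? I need to store and process them. –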


If you also have the XML in RDD[String] format, you can convert it to a DataFrame with the Databricks utility class:


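Yes, as I mentioned in the question, that can be done easily. But my XML data is embedded in text data in a text file. I have already extracted the XML data into another RDD, but I cannot extract the key and value attribute values from all the 'ab' tags (nested tags). –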



<!DOCTYPE data [
    <!ELEMENT data O O (field+)>
    <!ELEMENT field O O (#PCDATA|markup)>
    <!ELEMENT markup O O (row)>
    <!ELEMENT row - - (ab+)>
    <!ELEMENT ab - - (#PCDATA)>
    <!ENTITY start-field "<field>">
    <!SHORTREF in-data "^" start-field>
    <!USEMAP in-data data>
    <!ENTITY start-markup "<markup>">
    <!ENTITY end-markup "</markup>">
    <!SHORTREF in-field "`" start-markup>
    <!USEMAP in-field field>
    <!SHORTREF in-markup "`" end-markup>
    <!USEMAP in-markup markup>
]>
1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232


     <ab key="someKey" value="someValue"/> 
     <ab key="someKey1" value="someValue1"/> 




<!DOCTYPE data [
    <!-- ... same declarations as above ... -->
    <!ENTITY datafile SYSTEM "datafile.csv">
]>



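Nice approach! But does this require modifying the data file itself? That is not an option for me, so I need to access the XML data without modifying the data file. –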


@avik aggarwal I've updated my answer – imhotap


Sure, I'll give it a try. –



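  1. Read the data as a text file; the fields are delimited by ^.
  2. Filter out bad records that do not conform to the schema.
  3. Define a schema and match the data against it.
  4. Split the string.
  5. You will now have the data in a tuple as shown below, and we will parse the XML in the middle.

    (1234,12,999,"<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>", 23232)
  6. Use xml.attribute("key"), as it will return all the keys.

  7. If you need the value someValue and are not interested in someValue1, loop over this node sequence and apply a contains("key") filter to eliminate the other keys. I used the key Duration, which is present in my data.
  8. Apply the XPath \ "@value" to the result of the previous step to get the value.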
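Steps 5 to 8 can be sketched without Spark roughly as follows, assuming the XML field of the question's sample record with the backticks already removed and the scala-xml module on the classpath. The object and method names are hypothetical, and the sketch matches the key by exact equality rather than with the substring `contains` check described above.

```scala
//> using dep org.scala-lang.modules::scala-xml:2.1.0
// (The line above is only read by scala-cli; other build tools treat it as a comment.)
import scala.xml.{Elem, XML}

object AbAttributeSketch {
  // XML field from the question's sample record, backticks already removed.
  val xmlField: String =
    """<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>"""

  // All (key, value) attribute pairs of the <ab> children, in document order.
  def pairs(xml: String): List[(String, String)] = {
    val row: Elem = XML.loadString(xml)
    (row \ "ab").toList.map(ab => ((ab \ "@key").text, (ab \ "@value").text))
  }

  // Value of the first <ab> whose key attribute equals the wanted key, if any.
  def valueFor(xml: String, wanted: String): Option[String] =
    pairs(xml).collectFirst { case (k, v) if k == wanted => v }

  def main(args: Array[String]): Unit = {
    pairs(xmlField).foreach(println)
    println(valueFor(xmlField, "someKey"))
  }
}
```

Inside a Spark job the same functions could be called from a `map` over the RDD of XML strings, since they only depend on the string passed in.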

Similar question in Cloudera

import scala.xml.XML

// Case class whose fields match the layout of the input data
case class stb(server_unique_id: Int, request_type: Int, event_id: Int, stb_timestamp: String, stb_xml: String, device_id: String, secondary_timestamp: String)

// Read the data from the path supplied on the command line
val data = spark.read.textFile(args(0)).rdd

// Keep only lines that contain the ^ delimiter and split into exactly 7 fields
val clean_Data = data.filter(line => line.trim.contains("^"))
  .map(line => line.split("\\^"))
  .filter(fields => fields.length == 7)

// Match the schema, then keep records with event id 100 whose XML mentions Duration
val tup_Map = clean_Data.map(f => stb(f(0).toInt, f(1).toInt, f(2).toInt, f(3), f(4), f(5), f(6)))
  .filter(rec => rec.event_id == 100 && rec.stb_xml.contains("Duration"))

// The XML is in name-value format, so every element carries the same attributes (n, v).
// xmlnv holds the self-closing nv elements (8 different items such as duration and channel).
val xml_Map = tup_Map.map { line =>
  val xmld = XML.loadString(line.stb_xml)
  val xmlnv = xmld \\ "nv"

  var duration = 0
  for (i <- 0 until xmlnv.length if xmlnv(i).attributes.toString.contains("Duration")) duration = (xmlnv(i) \ "@v").text.toInt

  var channelNum = 0
  for (i <- 0 until xmlnv.length if xmlnv(i).attributes.toString.contains("ChannelNumber")) channelNum = (xmlnv(i) \ "@v").text.toInt

  var channelType = ""
  for (i <- 0 until xmlnv.length if xmlnv(i).attributes.toString.contains("ChannelType")) channelType = (xmlnv(i) \ "@v").text

  (duration, channelNum, channelType, line.device_id)
}

// Persist xml_Map for further operations
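As a possible alternative to the three index-based loops above, the name-value elements could be collected into a Map once and then looked up by name. This is only a sketch under assumptions: the payload below is invented (root element name and attribute values are hypothetical), and only the `nv` element with its `n` and `v` attributes comes from the code above. It also assumes the scala-xml module is on the classpath.

```scala
//> using dep org.scala-lang.modules::scala-xml:2.1.0
// (The line above is only read by scala-cli; other build tools treat it as a comment.)
import scala.util.Try
import scala.xml.XML

object NameValueSketch {
  // Hypothetical payload in the name-value shape described above;
  // the root element name and the values are invented for illustration.
  val sampleXml: String =
    """<event><nv n="Duration" v="120"/><nv n="ChannelNumber" v="7"/><nv n="ChannelType" v="HD"/></event>"""

  // Collect every <nv> element into a Map from its n attribute to its v attribute.
  def nameValues(xml: String): Map[String, String] =
    (XML.loadString(xml) \\ "nv").toList
      .map(nv => (nv \ "@n").text -> (nv \ "@v").text)
      .toMap

  // Numeric lookup that yields None when the name is absent or the value is not an integer.
  def intValue(values: Map[String, String], name: String): Option[Int] =
    values.get(name).flatMap(v => Try(v.trim.toInt).toOption)

  def main(args: Array[String]): Unit = {
    val values = nameValues(sampleXml)
    println(intValue(values, "Duration"))
    println(values.get("ChannelType"))
  }
}
```

Returning an Option instead of defaulting to 0 makes it possible to tell a missing Duration apart from a real duration of zero; whether that distinction matters depends on the downstream processing.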


This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/low-quality-posts/18326409) – Jacobm001