2012-04-17 98 views

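Scala: parsing concatenated XML documents

My question is almost the same as this previous StackOverflow question, but I'm asking it again because I don't like the accepted answer.

I have a file of concatenated XML documents: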

<?xml version="1.0" encoding="UTF-8"?> 
<someData>...</someData> 
<?xml version="1.0" encoding="UTF-8"?> 
<someData>...</someData> 
... 
<?xml version="1.0" encoding="UTF-8"?> 
<someData>...</someData> 

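I'd like to parse out each one.

As far as I can tell, I can't use scala.xml.XML, since it depends on the one-document-per-file/string model.

Is there a subclass of Parser I can use to parse XML documents out of an input source? Then I could just do something like many1 xmldoc or similar.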

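This question is a duplicate unless you explain _why_ you don't like the other answer. Stating that there is no parser of the type you propose is, IMO, not enough for a complete question/answer. – 2012-04-17 19:01:09

@RexKerr: Fair point. I find the accepted answer unacceptable because "break on '<?xml'" strikes me as [parsing XML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), as does tag counting (because of the existence of '<![CDATA['). – rampion 2012-04-17 20:59:41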

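Answers

OK, I came up with an answer I'm happier with.

Basically, I try to parse the XML using a SAXParser, just as scala.xml.XML.load does, but watch for SAXParseExceptions indicating that the parser encountered a <?xml in the wrong place.

Then I grab whatever root element has already been parsed, rewind the input just far enough, and restart the parse from there.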

// An input stream that can recover from a SAXParseException
object ConcatenatedXML {
    // A reader that can be rolled back to the location of an exception
    class Relocator(val re : java.io.Reader) extends java.io.Reader {
        var marked = 0
        var firstLine : Int = 1
        var lineStarts : IndexedSeq[Int] = Vector(0)
        override def read(arr : Array[Char], off : Int, len : Int) = {
            // forget everything but the start of the last line in the
            // previously marked area
            val pos = lineStarts(lineStarts.length - 1) - marked
            firstLine += lineStarts.length - 1

            // read the next chunk of data into the given array
            re.mark(len)
            marked = re.read(arr, off, len)

            // find the line starts for the lines in the array
            lineStarts = pos +: (for (i <- 0 until marked if arr(i+off) == '\n') yield (i+1))

            marked
        }
        override def close() : Unit = re.close()
        override def markSupported = false
        // rewind to the marked chunk, then skip ahead to (line, col) plus off
        def relocate(line : Int, col : Int, off : Int) : Unit = {
            re.reset()
            val skip = lineStarts(line - firstLine) + col + off
            re.skip(skip)
            marked = 0
            firstLine = 1
            lineStarts = Vector(0)
        }
    }

    def parse(str : String) : List[scala.xml.Node] = parse(new java.io.StringReader(str))
    def parse(re : java.io.Reader) : List[scala.xml.Node] = parse(new Relocator(re))

    // parse all the concatenated XML docs out of a file
    def parse(src : Relocator) : List[scala.xml.Node] = {
        val parser = javax.xml.parsers.SAXParserFactory.newInstance.newSAXParser
        val adapter = new scala.xml.parsing.NoBindingFactoryAdapter

        adapter.scopeStack.push(scala.xml.TopScope)
        try {

            // parse this, assuming it's the last XML doc in the string
            parser.parse(new org.xml.sax.InputSource(src), adapter)
            adapter.scopeStack.pop
            adapter.rootElem.asInstanceOf[scala.xml.Node] :: Nil

        } catch {
            case (e : org.xml.sax.SAXParseException) => {
                // rethrow unless this is a stray <?xml seen after exactly one
                // complete root element, i.e. the start of another xmldoc
                if (e.getMessage != """The processing instruction target matching "[xX][mM][lL]" is not allowed."""
                    || adapter.hStack.length != 1 || adapter.hStack(0) == null) {
                    throw(e)
                }

                // tell the adapter we reached the end of a document
                adapter.endDocument

                // grab the current root node
                adapter.scopeStack.pop
                val node = adapter.rootElem.asInstanceOf[scala.xml.Node]

                // reset to the start of this doc
                src.relocate(e.getLineNumber, e.getColumnNumber, -6)

                // and parse the next doc
                node :: parse(src)
            }
        }
    }
}

println(ConcatenatedXML.parse(new java.io.BufferedReader(
    new java.io.FileReader("temp.xml") 
))) 
println(ConcatenatedXML.parse(
    """|<?xml version="1.0" encoding="UTF-8"?> 
    |<firstDoc><inner><innerer><innermost></innermost></innerer></inner></firstDoc> 
    |<?xml version="1.0" encoding="UTF-8"?> 
    |<secondDoc></secondDoc> 
    |<?xml version="1.0" encoding="UTF-8"?> 
    |<thirdDoc>...</thirdDoc> 
    |<?xml version="1.0" encoding="UTF-8"?> 
    |<lastDoc>...</lastDoc>""".stripMargin 
)) 
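try {
    ConcatenatedXML.parse(
        """|<?xml version="1.0" encoding="UTF-8"?>
           |<firstDoc>
           |<?xml version="1.0" encoding="UTF-8"?>
           |</firstDoc>""".stripMargin
    )
    throw(new Exception("That should have failed"))
} catch {
    // catch only the parser's error; a catch-all would also swallow the
    // "That should have failed" exception above and hide a broken test
    case _ : org.xml.sax.SAXParseException => println("catches really incomplete docs")
}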
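
The rewinding that Relocator.relocate performs rests on the mark/reset/skip methods of java.io.Reader. Here is a minimal sketch of that mechanism in isolation, not part of the answer's code: the input string, the 16-character buffer and the name errorAt are made up for illustration, and a plain indexOf stands in for the position the SAX parser would report.

```scala
import java.io.{BufferedReader, StringReader}

// two tiny "documents" back to back (illustrative input only)
val in = new BufferedReader(new StringReader("<a/><?xml version=\"1.0\"?><b/>"))
val buf = new Array[Char](16)

in.mark(buf.length)                    // remember the position before reading a chunk
val n = in.read(buf, 0, buf.length)
val chunk = new String(buf, 0, n)

// stand-in for the location a SAXParseException would report
val errorAt = chunk.indexOf("<?xml")

in.reset()                             // go back to the marked position
in.skip(errorAt.toLong)                // move forward to the start of the next document

// everything from the stray declaration onwards is still available
val rest = Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toChar).mkString
```

The mark only stays valid while no more than the read-ahead limit passed to mark has been consumed, which is presumably why Relocator calls re.mark(len) before every read.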

If safety is your concern, you can wrap your chunks in unique tags:

def mkTag = "block"+util.Random.alphanumeric.take(20).mkString
val reader = io.Source.fromFile("my.xml")
// a line holding only an XML declaration marks the start of the next document
def isDecl(s: String) = s.trim.startsWith("<?xml") && s.trim.endsWith("?>")
def mkChunk(it: Iterator[String], chunks: Vector[String] = Vector.empty): Vector[String] = {
    val (chunk,extra) = it.span(s => !isDecl(s))
    // consume the chunk fully before touching extra; empty chunks are skipped
    val body = chunk.mkString("\n")
    val tag = mkTag
    val tagged = if (body.trim.isEmpty) chunks else chunks :+ ("<"+tag+">"+body+"</"+tag+">")
    if (!extra.hasNext) tagged
    else mkChunk(extra.drop(1), tagged) // drop the declaration line itself
}
val chunks = mkChunk(reader.getLines())
reader.close()
val answers = xml.XML.loadString("<everything>"+chunks.mkString+"</everything>")
// Now take apart the resulting parse

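Since you've supplied unique enclosing tags, you may get a parse error if someone has embedded literal XML tags somewhere, but you won't accidentally end up with the wrong number of parsed documents.

(Warning: code typed but not checked at all. It gives the idea, not exactly correct behavior.)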
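
One possible way to take the combined parse apart is sketched below. It assumes the scala-xml module is on the classpath; the wrapper names blockA and blockB and the sample contents are hypothetical stand-ins for the randomly generated tags.

```scala
import scala.xml.{Elem, XML}

// stand-in for the wrapped input; real wrapper tags would be random
val wrapped = XML.loadString(
  "<everything>" +
    "<blockA><someData>1</someData></blockA>" +
    "<blockB><someData>2</someData></blockB>" +
  "</everything>")

// each element child of <everything> is one wrapper tag, and the
// element inside it is the root of one original document
val docs = wrapped.child.toList
  .collect { case block: Elem => block }
  .flatMap(_.child.collect { case root: Elem => root })
```

Collecting only Elem nodes skips the whitespace text nodes that sit between documents in real input.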