2010-01-02 102 views
0

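Parsing an XML file when tags are missing

I'm trying to parse an XML file. The text inside the tags is parsed successfully (or so it seems), but I also want to output the text that is not enclosed in any of those tags, and the program below simply ignores it.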

# XMLParser supersedes the older XMLTreeBuilder name used when this was posted.
from xml.etree.ElementTree import XMLParser


class HtmlLatex:                    # The target object of the parser
    out = ''
    var = ''

    def start(self, tag, attrib):   # Called for each opening tag.
        pass

    def end(self, tag):             # Called for each closing tag.
        if tag == 'i':
            self.out += self.var
        elif tag == 'sub':
            self.out += '_{' + self.var + '}'
        elif tag == 'sup':
            self.out += '^{' + self.var + '}'
        else:
            self.out += self.var

    def data(self, data):           # Called for each chunk of text.
        self.var = data

    def close(self):                # Called when all data has been parsed.
        print(self.out)


if __name__ == '__main__':
    target = HtmlLatex()
    parser = XMLParser(target=target)

    with open('input.txt') as f1:
        text = f1.read()

    print(text)

    parser.feed(text)
    parser.close()

Part of the input I want to parse: <p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>+(2<i>l</i><sub>2</sub>+<i>l</i><sub>1</sub>) <i>m</i><sup>2</sup>+(<i>l</i><sub>2</sub><sup>2</sup>+2<i>l</i><sub>1</sub> <i>l</i><sub>2</sub>+<i>l</i><sub>1</sub><sup>2</sup>) <i>m</i>) /(<i>m</i><sup>3</sup>+(3<i>l</i><sub>2</sub>+2<i>l</i><sub>1</sub>)) }.</p>

+1

That looks like no XML I've ever seen. Are you sure you don't want an _html_ parser? – James 2010-01-02 15:08:25

+0

It is generated here: http://wims.unice.fr/wims/en_tool~linear~linsolver.en.html When you get the solution, if you look at the page source, you will see something like this. – 2010-01-02 15:28:46

+1

Just edited out the LaTeX tag. ??? – 2010-01-02 17:03:53

Answers

2

Here is a pyparsing version. I hope the comments are explanatory enough.

# adjacent string literals inside parentheses are concatenated into one string
src = ("""<p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>+(2<i>l</i><sub>2</sub>+<i>l</i><sub>1</sub>) """
       """<i>m</i><sup>2</sup>+(<i>l</i><sub>2</sub><sup>2</sup>+2<i>l</i><sub>1</sub> <i>l</i><sub>2</sub>+"""
       """<i>l</i><sub>1</sub><sup>2</sup>) <i>m</i>) /(<i>m</i><sup>3</sup>+(3<i>l</i><sub>2</sub>+"""
       """2<i>l</i><sub>1</sub>)) }.</p>""")

from pyparsing import makeHTMLTags, anyOpenTag, anyCloseTag, Suppress, replaceWith

# set up tag matching for <sub> and <sup> tags
SUB, endSUB = makeHTMLTags("sub")
SUP, endSUP = makeHTMLTags("sup")

# all other tags will be suppressed from the output
ANY, endANY = map(Suppress, (anyOpenTag, anyCloseTag))

# replace the <sub>/<sup> tags with the LaTeX subscript/superscript delimiters
SUB.setParseAction(replaceWith("_{"))
SUP.setParseAction(replaceWith("^{"))
endSUB.setParseAction(replaceWith("}"))
endSUP.setParseAction(replaceWith("}"))

# the specific tags come first so that ANY/endANY only match the remaining tags
transformer = (SUB | endSUB | SUP | endSUP | ANY | endANY)

# now use the transformer to apply these transforms to the input string
print(transformer.transformString(src))

This gives:

p_{0} = (m^{3}+(2l_{2}+l_{1}) m^{2}+(l_{2}^{2}+2l_{1} l_{2}+l_{1}^{2}) m) /(m^{3}+(3l_{2}+2l_{1})) }. 
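
For comparison, the parser-target approach from the question can also be made to keep the text between tags. In the question's program, `data()` overwrites `self.var` on every call, so text that follows a closing tag is replaced by the next chunk before any `end()` call writes it out. The following is only a minimal sketch of one way around that, not the original poster's final code: the names `LatexTarget`, `WRAP`, and `html_to_latex` are made up for illustration, and it assumes the input is well-formed XML.

```python
from xml.etree.ElementTree import XMLParser


class LatexTarget:
    """Parser target that keeps every text chunk, inside or between tags."""

    # Hypothetical mapping: tag name -> (text emitted on open, text emitted on close)
    WRAP = {"sub": ("_{", "}"), "sup": ("^{", "}")}

    def __init__(self):
        self.parts = []

    def start(self, tag, attrib):
        # Emit the opening delimiter; tags not in WRAP contribute nothing.
        self.parts.append(self.WRAP.get(tag, ("", ""))[0])

    def end(self, tag):
        # Emit the closing delimiter; tags not in WRAP contribute nothing.
        self.parts.append(self.WRAP.get(tag, ("", ""))[1])

    def data(self, data):
        # Append instead of overwrite, so text between tags is not lost.
        self.parts.append(data)

    def close(self):
        return "".join(self.parts)


def html_to_latex(markup):
    parser = XMLParser(target=LatexTarget())
    parser.feed(markup)
    # XMLParser.close() returns whatever the target's close() returns.
    return parser.close()


print(html_to_latex("<p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>)</p>"))
# prints: p_{0} = (m^{3})
```

Because the delimiters are emitted in `start()` and `end()` rather than assembled from a single saved string, the order of the output follows the order of the input, which is what lets the ` = (` between `</sub>` and `<i>` survive.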
3

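Take a look at BeautifulSoup, a Python library for parsing, navigating, and manipulating HTML and XML. It has a convenient interface and may well solve your problem...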
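
As a rough illustration of what that could look like, here is a small sketch. It assumes the `bs4` package (Beautiful Soup 4) is installed and uses Python's built-in `html.parser` backend; the helper name `html_to_latex` is hypothetical and not part of the library.

```python
from bs4 import BeautifulSoup


def html_to_latex(markup):
    # Assumption: bs4 is available; "html.parser" is the stdlib backend.
    soup = BeautifulSoup(markup, "html.parser")
    for name, prefix in (("sub", "_{"), ("sup", "^{")):
        # find_all() returns a list, so replacing tags while looping is safe.
        for tag in soup.find_all(name):
            tag.replace_with(prefix + tag.get_text() + "}")
    # get_text() joins all remaining strings, dropping the other tags.
    return soup.get_text()


print(html_to_latex("<p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>)</p>"))
```

Each `<sub>` and `<sup>` element is swapped for a plain string carrying the LaTeX delimiters, and every other tag is simply dropped when the text is extracted.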

+0

Thanks for the suggestion. I'll take a look at it. – 2010-01-02 16:07:50