Python: Extracting text from a tag in an XML tree

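I'm currently parsing a Wikipedia dump, trying to extract some useful information. The dump is XML, and I only want to extract the text/content of each page. Now I'm wondering how to find all the text inside a tag that is itself nested inside another tag. I searched for similar questions, but only found ones that deal with a single tag. Here is an example of what I want to achieve: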

<revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
        <username>Foobar</username>
        <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
</revision>

<example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
        <username>Foobar</username>
        <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
</example_tag>

How can I extract the text inside the text tag, but only when it is contained in the revision tree?

Source

2017-03-17 J. Williams

You can use the xml.etree.ElementTree package with an XPath query:

import xml.etree.ElementTree as ET 

root = ET.fromstring(the_xml_string) 
for content in root.findall('.//revision/othertag'): 
    # ... process content, for instance 
    print(content.text)

(where the_xml_string is a string containing the XML code).
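To try the query against the sample from the question, here is a minimal runnable sketch. The `<page>` wrapper element and the text inside `<example_tag>` are assumptions made up for illustration, so that the string is a single well-formed document and the two `<text>` elements can be told apart; `othertag` has been replaced with `text`.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: <page> is a made-up root so that both sample
# elements live in one well-formed XML document.
the_xml_string = """
<page>
  <revision>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
  </revision>
  <example_tag>
    <text>Some other text.</text>
  </example_tag>
</page>
"""

root = ET.fromstring(the_xml_string)

# './/revision/text' matches <text> elements that are direct children
# of a <revision> element located anywhere below the root.
revision_texts = [el.text for el in root.findall('.//revision/text')]
print(revision_texts)  # ['A bunch of [[text]] here.']
```

Note that `.//revision` only searches below the element you call `findall` on. If `<revision>` is itself the root of the parsed string, the query returns nothing, and `root.findall('text')` would be the equivalent lookup.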

Or, to get a list of the elements' text with a list comprehension:

import xml.etree.ElementTree as ET 

texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]

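So .text holds the inner text. Note that you will have to replace othertag with the actual tag (for example text). If that tag can be nested arbitrarily deep inside the revision tag, you should use .//revision//othertag as the XPath query.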
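The difference between the two queries can be seen on the `<id>` element, which sits inside `<contributor>` rather than directly under `<revision>`. This is a small sketch under the same assumption of a made-up `<page>` root; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper root <page>; the inner elements come from the question.
sample = """
<page>
  <revision>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <text>A bunch of [[text]] here.</text>
  </revision>
</page>
"""

root = ET.fromstring(sample)

# A single slash only matches direct children of <revision>.
direct_ids = [el.text for el in root.findall('.//revision/id')]

# A double slash matches descendants at any depth below <revision>.
nested_ids = [el.text for el in root.findall('.//revision//id')]

print(direct_ids)  # []
print(nested_ids)  # ['65536']
```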

Source

2017-03-17 10:48:49
