2015-04-07
2

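Python list search, element comparison and elimination

I want to get all the elements that have no partner. This is a list of XML tags, in top-to-bottom order, with the angle brackets removed. I want to find pairs (for example, the opening tag note and the closing tag /note), remove them from the list, and be left with the tags that have no matching partner.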

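How do I iterate over the list, compare each tag with all the other tags, and say, for example: aha, I found another 'note' tag that starts with a forward slash?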

Thanks.

Are there any other, better ideas for finding unmatched tags?

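PS: I would like to preserve the order of the list and, if possible, use equality when a tag is compared with another tag in the list. Using the 'in' operator will not work, because if the tag name is a single letter (such as 'a'), the search returns every element that contains an a rather than an exact match for 'a'.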

tags = ['note', 'to', 'bbb', 'bbb', 'firstname', '/firstname', 'lastname', '/lastname', 'from', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', '/from', '/to', 'elephant', 'll', 'from', '/from', 'a1', 'img', 'a2', 'from', 'from', '/from', '/from', '/a2', '/img', '/a1', 'heading', '/heading', 'body', '/body', '/note'] 

Answers

0

You can build a set of all the closing tags, then use that set to filter the tag list.

>>> closing = set([t for t in tags if t.startswith("/")]) 
>>> [t for t in tags if "/" + t not in closing and t not in closing] 
['bbb', 'bbb', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', 'elephant', 'll'] 

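Note, however, that this does not really respect tag "pairs"; it only checks whether a "closing" variant of the same tag appears anywhere in the list. For example, given tags = ["a", "a", "/a"] or tags = ["a", "/a", "a"], it removes both instances of a from the list.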
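If each closing tag should cancel exactly one opening tag, one possible variant is to count the tags instead of using a plain set. The sketch below is only an illustration of that idea, not part of the original answer; the function name unpaired_by_count is made up for this example. It assumes that pairing by count is enough, so it ignores nesting and does not care whether the closing tag comes before or after its opening tag. All lookups compare whole strings, so a tag named a never matches a1 or ll.

```python
from collections import Counter

def unpaired_by_count(tags):
    """Cancel one opening tag per closing tag of the same name, keeping order."""
    opens = Counter(t for t in tags if not t.startswith("/"))
    closes = Counter(t[1:] for t in tags if t.startswith("/"))
    # Counter intersection keeps min(opens[name], closes[name]) for each name,
    # which is the number of complete pairs available for that name.
    pairs = opens & closes
    open_left = pairs.copy()   # opening tags still to be cancelled
    close_left = pairs.copy()  # closing tags still to be cancelled
    leftovers = []
    for t in tags:
        is_close = t.startswith("/")
        name = t[1:] if is_close else t
        budget = close_left if is_close else open_left
        if budget[name] > 0:
            budget[name] -= 1      # this tag is one half of a pair
        else:
            leftovers.append(t)    # no partner left for this tag
    return leftovers

print(unpaired_by_count(["a", "a", "/a"]))  # ['a']
print(unpaired_by_count(["a", "/a", "a"]))  # ['a']
```

With this variant the two example lists above each keep one a, because only one of the two opening tags has a closing partner.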

+0

Thanks. This does the trick for finding opening and closing tags. How do I find the paired tags? – user1552294

+0

@user1552294 Not sure what you are asking. Please show some sample input and output for this. –

0

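The first part of this program extracts all the tags from the text. Notice that this is really the problem of finding unmatched brackets. It can be solved by treating the list as a stack and working out which tags are faulty while iterating over it.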

import re 

def clean_attr(attr):
    # Drop any attributes: '<book id="bk101">' becomes '<book>'.
    attr_list = re.split(r'\s+', attr)
    if len(attr_list) == 1:
        return attr
    return attr_list[0] + '>'

line=""" 
<?xml version="1.0"?> 
<catalog> 
    <book id="bk101"> 
     <author>Gambardella, Matthew</author> 
     <title>XML Developer's Guide</title> 
     <genre>Computer</genre> 
     <price>44.95</price> 
     <publish_date>2000-10-01</publish_date> 
     <description>An in-depth look at creating applications 
     with XML.</description> 
    </book> 
    <book id="bk102"> 
     <author>Ralls, Kim</author> 
     <title>Midnight Rain</title> 
     <genre>Fantasy</genre> 
     <price>5.95</price> 
     <publish_date>2000-12-16</publish_date> 
     <description>A former architect battles corporate zombies, 
     an evil sorceress, and her own childhood to become queen 
     of the world.</description> 
    </book> 
    <book id="bk103"> 
     <author>Corets, Eva</author> 
     <title>Maeve Ascendant</title> 
     <genre>Fantasy</genre> 
     <price>5.95</price> 
     <publish_date>2000-11-17</publish_date> 
     <description>After the collapse of a nanotechnology 
     society in England, the young survivors lay the 
     foundation for a new society.</description> 
    </book> 
    <book id="bk104"> 
     <author>Corets, Eva</author> 
     <title>Oberon's Legacy</title> 
     <genre>Fantasy</genre> 
     <price>5.95</price> 
     <publish_date>2001-03-10</publish_date> 
     <description>In post-apocalypse England, the mysterious 
     agent known only as Oberon helps to create a new life 
     for the inhabitants of London. Sequel to Maeve 
     Ascendant.</description> 
    </book> 
    <book id="bk105"> 
     <author>Corets, Eva</author> 
     <title>The Sundered Grail</title> 
     <genre>Fantasy</genre> 
     <price>5.95</price> 
     <publish_date>2001-09-10</publish_date> 
     <description>The two daughters of Maeve, half-sisters, 
     battle one another for control of England. Sequel to 
     Oberon's Legacy.</description> 
    </book> 
    <book id="bk106"> 
     <author>Randall, Cynthia</author> 
     <title>Lover Birds</title> 
     <genre>Romance</genre> 
     <price>4.95</price> 
     <publish_date>2000-09-02</publish_date> 
     <description>When Carla meets Paul at an ornithology 
     conference, tempers fly as feathers get ruffled.</description> 
    </book> 
    <book id="bk107"> 
     <author>Thurman, Paula</author> 
     <title>Splish Splash</title> 
     <genre>Romance</genre> 
     <price>4.95</price> 
     <publish_date>2000-11-02</publish_date> 
     <description>A deep sea diver finds true love twenty 
     thousand leagues beneath the sea.</description> 
    </book> 
    <book id="bk108"> 
     <author>Knorr, Stefan</author> 
     <title>Creepy Crawlies</title> 
     <genre>Horror</genre> 
     <price>4.95</price> 
     <publish_date>2000-12-06</publish_date> 
     <description>An anthology of horror stories about roaches, 
     centipedes, scorpions and other insects.</description> 
    </book> 
    <book id="bk109"> 
     <author>Kress, Peter</author> 
     <title>Paradox Lost</title> 
     <genre>Science Fiction</genre> 
     <price>6.95</price> 
     <publish_date>2000-11-02</publish_date> 
     <description>After an inadvertant trip through a Heisenberg 
     Uncertainty Device, James Salway discovers the problems 
     of being quantum.</description> 
    </book> 
    <book id="bk110"> 
     <author>O'Brien, Tim</author> 
     <title>Microsoft .NET: The Programming Bible</title> 
     <genre>Computer</genre> 
     <price>36.95</price> 
     <publish_date>2000-12-09</publish_date> 
     <description>Microsoft's .NET initiative is explored in 
     detail in this deep programmer's reference.</description> 
    </book> 
     <author>O'Brien, Tim</author> 
     <title>MSXML3: A Comprehensive Guide</title> 
     <genre>Computer</genre> 
     <price>36.95</price> 
     <publish_date>2000-12-01</publish_date> 
     <description>The Microsoft MSXML3 parser is covered in 
     detail, with attention to XML DOM interfaces, XSLT processing, 
     SAX and more.</description> 
    </book> 
    <book id="bk112"> 
     <author>Galos, Mike</author> 
     <title>Visual Studio 7: A Comprehensive Guide</title> 
     <genre>Computer</genre> 
     <price>49.95</price> 
     <publish_date>2001-04-16</publish_date> 
     <description>Microsoft Visual Studio 7 is explored in depth, 
     looking at how Visual Basic, Visual C++, C#, and ASP+ are 
     integrated into a comprehensive development 
     environment. 
    </book> 
</catalog> 

""" 
# Opening tags (possibly with attributes), closing tags, and both in document order.
attr_open = re.findall(r'<[\w+\s=\"]+>', line)
attr_closed = re.findall(r'<\/\w+>', line)
all_attrs = re.findall(r'<[\w+\s=\"]+>|<\/\w+>', line)

all_attrs_cleaned = [clean_attr(attr) for attr in all_attrs]

list_as_stack = []  # opening tags still waiting for their closing tag
not_closed = []     # closing tags that did not match the tag on top of the stack

for an_attr in all_attrs_cleaned:
    if not an_attr.startswith('</'):
        list_as_stack.append(an_attr)
    elif (list_as_stack and
          re.search(r'\w+', list_as_stack[-1]).group(0) == re.search(r'\w+', an_attr).group(0)):
        # The closing tag matches the most recently opened tag: a proper pair.
        list_as_stack.pop()
    else:
        # Mismatch with the top of the stack, or nothing is open at all.
        not_closed.append(an_attr)

print(list_as_stack)
print(not_closed)

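In the program above, the first list shows which tags were left unclosed, and the second list shows which closing tags have no matching opening tag.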
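The same stack idea can be applied directly to the flat list from the question, without any regular expressions. The following is a minimal sketch under a few assumptions of its own: the helper name find_unmatched is hypothetical, a closing tag is paired with the nearest still-open tag of the same name (not only the top of the stack), and any tags opened in between are treated as unmatched. Tags are compared with ==, so single-letter names such as l are matched exactly.

```python
def find_unmatched(tags):
    """Return the tags that have no partner, in their original order."""
    stack = []      # (index, name) for opening tags that are not closed yet
    unmatched = []  # indices of tags found to have no partner
    for i, tag in enumerate(tags):
        if not tag.startswith("/"):
            stack.append((i, tag))
            continue
        name = tag[1:]
        # Walk down the stack looking for the nearest opener with this name.
        for pos in range(len(stack) - 1, -1, -1):
            if stack[pos][1] == name:
                # Everything opened after that opener was never closed.
                unmatched.extend(idx for idx, _ in stack[pos + 1:])
                del stack[pos:]
                break
        else:
            unmatched.append(i)  # closing tag without any opener
    # Whatever is still open at the end has no closing tag.
    unmatched.extend(idx for idx, _ in stack)
    return [tags[i] for i in sorted(unmatched)]

tags = ['note', 'to', 'bbb', 'bbb', 'firstname', '/firstname', 'lastname',
        '/lastname', 'from', 'hello', 'hello', 'hello', 'hello', 'hello', 'l',
        '/from', '/to', 'elephant', 'll', 'from', '/from', 'a1', 'img', 'a2',
        'from', 'from', '/from', '/from', '/a2', '/img', '/a1', 'heading',
        '/heading', 'body', '/body', '/note']

print(find_unmatched(tags))
# ['bbb', 'bbb', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', 'elephant', 'll']
```

Searching down the stack keeps one missing closing tag from cascading into every enclosing tag being reported. Whether that is the right policy depends on how strictly the nesting should be checked.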