Python BeautifulSoup: parsing specific text


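I'm parsing an HTML file and want to find the part of the file that says "smaller reporting company", and determine whether or not it has an "X" or a checkbox next to it. The checkbox is usually done with the Wingdings font or an ASCII code. In the HTML below, you'll see it has a Wingdings þ next to it.

I have no problem displaying the results of the regex search for the text, but I'm having trouble taking the next step and finding the checkbox.

I'll be using this to parse many different HTML files that won't all follow the same format, but most of them will use a table and ASCII text like this sample.

Here is the HTML code: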

<HTML> 
<HEAD><TITLE></TITLE></HEAD> 
<BODY> 
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of &#147;large accelerated filer,&#148; &#147;accelerated filer&#148; and &#147;smaller reporting company&#148;. (Check one): 
</DIV> 

<DIV align="center"> 
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%"> 
<!-- Begin Table Head --> 
<TR valign="bottom"> 
    <TD width="22%">&nbsp;</TD> 
    <TD width="3%">&nbsp;</TD> 
    <TD width="22%">&nbsp;</TD> 
    <TD width="3%">&nbsp;</TD> 
    <TD width="22%">&nbsp;</TD> 
    <TD width="3%">&nbsp;</TD> 
    <TD width="22%">&nbsp;</TD> 
</TR> 
<TR></TR> 
<!-- End Table Head --> 
<!-- Begin Table Body --> 
<TR valign="bottom"> 
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT></FONT> 
    </TD> 
    <TD>&nbsp;</TD> 
    <TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT></FONT> 
    </TD> 
    <TD>&nbsp;</TD> 
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT> </FONT> 
    <FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT> 
    </TD> 
    <TD>&nbsp;</TD> 
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">&#254;</FONT></FONT></TD> 
</TR> 
<!-- End Table Body --> 
</TABLE> 
</DIV></BODY></HTML>

Here is my Python code:

import os, sys, string, re 
from BeautifulSoup import BeautifulSoup 

rawDataFile = "testfile1.html" 
f = open(rawDataFile) 
soup = BeautifulSoup(f) 
f.close() 

search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany')) 
print search

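Question: How can I set this up so that a second search depends on the first? That is, when I find "smaller reporting company", how can I search the next few lines to see whether there is an ASCII code? I've been reading the Soup documentation. I tried find and findNext, but I haven't been able to get them to work.

Source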

2012-01-08 Josh

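I bet you meant to correct "with or without an 'X' or checkbox next to it" to "with or without an 'X' ON the checkbox next to it", and you didn't. That confused me and got in the way of my understanding your question. Don't you care about being understood? – eyquem 2012-01-08 23:35:27

"In the HTML below, you'll see it has a Wingdings þ next to it." Where? – eyquem 2012-01-08 23:44:19

What do you call 'ascii code'? Is it the o and the þ? – eyquem 2012-01-08 23:45:34

If you know that the position of the Wingdings character won't change, you can use .next.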

>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany')) 
>>> nodes[-1].next.next # last item in list is the only good one... kinda crap 
u'&#254;'

Or you can go up to the parent and find from there:

>>> nodes[-1].parent.find('font',style="font-family: Wingdings").next 
u'&#254;'

Or you can do it the other way around:

>>> soup.findAll(text='&#254;')[0].previous.previous 
u' Smaller reporting company '

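This assumes you know which Wingdings characters you are looking for.

The last strategy has the added benefit of filtering out the other junk that the regex is catching, which I assume you don't really want. You can loop through the results knowing that you are only working on the right list, so you can go easy on the if checks.

Source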
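The "other way around" strategy can be sketched for the whole table at once. This is only an illustration, not the code from this answer: it assumes the newer bs4 package (the thread uses BeautifulSoup 3), the standard-library "html.parser" backend, and a hypothetical helper name checkbox_states. The set of glyphs treated as "checked" is also an assumption about how Wingdings renders those codes.

```python
from bs4 import BeautifulSoup

# Assumption: in Wingdings, "\xfe" (&#254;) and "\xfd" (&#253;) draw a ticked
# or crossed box, and "x" draws a boxed x; "o" (&#111;) draws an empty box.
CHECKED = {"\xfe", "\xfd", "x"}

SAMPLE = """
<TABLE><TR>
<TD><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">&#254;</FONT></FONT></TD>
</TR></TABLE>
"""


def checkbox_states(html):
    """Map the text just before each Wingdings glyph to True (checked) or False."""
    soup = BeautifulSoup(html, "html.parser")
    states = {}
    # Start from the Wingdings <font> tags rather than from a regex on the label.
    is_wingdings = lambda style: style and "wingdings" in style.lower()
    for font in soup.find_all("font", style=is_wingdings):
        glyph = font.get_text().strip()
        # Walk backwards to the nearest non-blank text node: the label.
        label = font.find_previous(string=lambda s: s.strip())
        if label is not None:
            states[label.strip()] = glyph in CHECKED
    return states


if __name__ == "__main__":
    for label, checked in checkbox_states(SAMPLE).items():
        print(label, "->", "checked" if checked else "not checked")
```

Because the search starts at the glyph, the "(Do not check if a smaller reporting company)" note never shows up as a label, which is the filtering benefit described above.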

2012-01-08 23:30:11

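You could try walking the structure and checking the values in the inner tags, or checking the values in the outer tags. I don't remember how to do it, since I ended up using lxml, but I think bsoup can do it.

If you can't get bsoup to do it, check out lxml. Depending on what you're doing, it may be faster. It also has hooks for using bsoup together with lxml.

Source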
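As a rough illustration of "walking the structure", here is a sketch that uses only the standard library's html.parser instead of bsoup or lxml (a swapped-in technique, not something this answer used). The class name CheckboxWalker and the set of "checked" glyphs are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumption: these Wingdings character codes draw a ticked or crossed box.
CHECKED_GLYPHS = {"\xfe", "\xfd"}

SAMPLE = """
<TD><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT></FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">&#254;</FONT></FONT></TD>
"""


class CheckboxWalker(HTMLParser):
    """Remember the last visible text and pair it with the next Wingdings glyph."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.font_stack = []    # one bool per open <font>: does it select Wingdings?
        self.last_label = None  # most recent non-blank text outside Wingdings
        self.states = {}        # label -> True if its glyph counts as checked

    def handle_starttag(self, tag, attrs):
        if tag == "font":  # tag names arrive lower-cased
            style = dict(attrs).get("style") or ""
            self.font_stack.append("wingdings" in style.lower())

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.font_stack and self.font_stack[-1]:
            # Inner tag: this text is the glyph belonging to the last label.
            if self.last_label is not None:
                self.states[self.last_label] = text in CHECKED_GLYPHS
        else:
            # Outer tag: plain text, remember it as a candidate label.
            self.last_label = text


walker = CheckboxWalker()
walker.feed(SAMPLE)
walker.close()
print(walker.states)
```

The outer font tag supplies the label and the inner Wingdings font tag supplies the value, so the check on the inner tag always depends on what was seen in the outer one.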

2012-01-08 22:05:24 Demolishun

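lxml has a forgiving HTML parser. You don't need bsoup (which has now been deprecated by its author), and you should avoid using regexes to parse HTML.

Here's a first rough cut at what you're looking for: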

guff = """\ 
<HTML> 
<HEAD><TITLE></TITLE></HEAD> 
[snip] 
</DIV></BODY></HTML> 
""" 
from lxml.html import fromstring 
doc = fromstring(guff) 
for td_el in doc.iter('td'): 
    font_els = list(td_el.iter('font')) 
    if not font_els: continue 
    print 
    for el in font_els: 
        print (el.text, el.attrib)

This produces:

(' Large accelerated filer ', {'style': 'white-space: nowrap'}) 
('o', {'style': 'font-family: Wingdings'}) 

('Accelerated filer ', {'style': 'white-space: nowrap'}) 
('o', {'style': 'font-family: Wingdings'}) 

(' Non-accelerated filer ', {'style': 'white-space: nowrap'}) 
('o', {'style': 'font-family: Wingdings'}) 
('(Do not check if a smaller reporting company)', {'style': 'white-space: nowrap'})

(' Smaller reporting company ', {'style': 'white-space: nowrap'}) 
(u'\xfe', {'style': 'font-family: Wingdings'})

Source

2012-01-09 01:23:11

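The problem with lxml is its libxml dependency, which isn't always available; Jython, for example, can't use it at all. The beauty of BS is that it's just a single file of pure Python. – 2012-01-09 09:10:05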
