Beautiful Soup 4: parsing a directory of HTML documents

I am working with this code:

from bs4 import BeautifulSoup
import glob
import os
import re

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            results = [item for item in soup.findAll("ix:nonfraction") if re.match("^[^:]:AuditFeesExpenses", item['name'])]
            print(results)
            #print(file, end="| ")
            #print(item['name'], end="| ")
            #print(item.get_text())

trade_spider()

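I want to parse multiple HTML documents from a directory on my computer with BS4. My goal is to find the tags that start with "ix:NonFraction ..." and carry a name attribute in which 'AuditFeesExpenses' can be preceded by different prefixes, for example name="aurep:AuditFeesExpenses", name="bus:AuditFeesExpenses" and so on (which is why I am using a regular expression). When BS4 finds such a tag, I want to extract the text from it with soup.get_text(Value).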

Any idea what I have missed?

UPDATE: an example tag is:

<td style=" width:12.50%; text-align:right; " class="ta_60"> 
<ix:nonFraction contextRef="ThirdPartyAgentsHypercube_FY_31_12_2012_Set1" 
name="ns19:AuditFeesExpenses" unitRef="GBP" decimals="0" 
format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org 
/2008/inlineXBRL">3,600</ix:nonFraction></td>

Usually this tag appears on a single line; I inserted a few line breaks for clarity!

My final code looks like this:

from bs4 import BeautifulSoup
import glob
import os
import re

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if re.match(".*AuditFeesExpenses", item['name']):
                    print(file, end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())

trade_spider()

and it gives me output like this:

Prod224_0010_00079350_20140331.html| uk-aurep:AuditFeesExpenses| 2,000

2016-05-10 Florian Schramm

The findAll() function takes name as its first parameter. When you call

`soup.findAll('ix:NonFraction', name=re.compile("^[^:]:AuditFeesExpenses"))`,

you are actually calling it with both name="ix:NonFraction" and name=re.compile("^[^:]:AuditFeesExpenses"). Of course, name can only be set to one of these two inputs, so the call raises an error.
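To make the clash concrete, here is a minimal sketch. The sample HTML string is shortened from the tag in the question, and the regular expression is my own assumption about what the prefixes look like. Passing name twice raises a TypeError, whereas putting the HTML name attribute into the attrs dictionary lets find_all filter on it directly.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the example tag from the question (illustrative only).
html = (
    '<td><ix:nonFraction name="ns19:AuditFeesExpenses" unitRef="GBP">'
    '3,600</ix:nonFraction></td>'
)
soup = BeautifulSoup(html, "html.parser")

# "name" is given positionally and again as a keyword, so Python rejects the call.
error = None
try:
    soup.find_all("ix:nonfraction", name=re.compile("AuditFeesExpenses"))
except TypeError as exc:
    error = str(exc)

# The HTML attribute called "name" goes into attrs instead.
# Assumed pattern: any non-empty prefix, a colon, then AuditFeesExpenses.
pattern = re.compile(r"^[^:]+:AuditFeesExpenses$")
matches = soup.find_all("ix:nonfraction", attrs={"name": pattern})
texts = [tag.get_text() for tag in matches]
print(error)
print(texts)
```

With attrs the tag-name filter and the attribute filter happen in one call, so no second filtering step is needed.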

The error message mentions find_all() rather than findAll(). From the docs, we see that findAll is the old name of the find_all method. The find_all method should be used.

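The confusion probably comes from the attribute name. It is important to distinguish between the BeautifulSoup attribute name and the HTML attribute name. To demonstrate, I assume a tag of the following form: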

<body>
    <ix:NonFraction name="AuditFeesExpenses">stuff</ix:NonFraction>
</body>

We can find all <ix:NonFraction> tags with soup.find_all("ix:nonfraction"). This gives a list containing the following result:

[<ix:nonfraction name="AuditFeesExpenses">stuff</ix:nonfraction>]

Iterating through this one-item list, we see two different name attributes. First, we access the BeautifulSoup name attribute as an attribute of the object:

for item in soup.find_all("ix:nonfraction"): 
    print(item.name) 

Out: ix:nonfraction

To see the HTML name attribute, access name as a dictionary key:

for item in soup.find_all("ix:nonfraction"): 
    print(item['name']) 

Out: AuditFeesExpenses

Combining these two searches narrows the results:

results = [item for item in soup.find_all("ix:nonfraction") if re.match(r"^[^:]+:AuditFeesExpenses", item['name'])]

Out: [<ix:nonfraction name="ns19:AuditFeesExpenses">3,600</ix:nonfraction>]

Or, if we want the text of each match:

results = [item.get_text() for item in soup.find_all("ix:nonfraction") if re.match(r"^[^:]+:AuditFeesExpenses", item['name'])]

Out: ['3,600']
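One detail worth checking in the pattern: [^:] on its own matches exactly one character, so a multi-character prefix such as ns19 needs [^:]+ before the colon. The sketch below uses only the re module; the prefixes aurep, bus and ns19 come from the question, while the OtherExpenses name is a made-up counterexample.

```python
import re

# Prefixes taken from the question; "ns19:OtherExpenses" is a hypothetical non-match.
names = [
    "ns19:AuditFeesExpenses",
    "aurep:AuditFeesExpenses",
    "bus:AuditFeesExpenses",
    "ns19:OtherExpenses",
]

# Exactly one non-colon character before the colon.
one_char_prefix = re.compile(r"^[^:]:AuditFeesExpenses")
# One or more non-colon characters before the colon.
any_prefix = re.compile(r"^[^:]+:AuditFeesExpenses$")

matched_one = [n for n in names if one_char_prefix.match(n)]
matched_any = [n for n in names if any_prefix.match(n)]
print(matched_one)
print(matched_any)
```

The first list comes out empty because none of the sample prefixes is a single character; the second keeps the three AuditFeesExpenses names.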

Suggested code:

from bs4 import BeautifulSoup
import glob
import os
import re

def trade_spider():
    os.chdir(r"C:\Independent Auditors Report")
    for file in glob.glob('*.html'):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            # html.parser lowercases tag names, so search for the lowercase form
            for item in soup.find_all("ix:nonfraction"):
                # keep only tags whose HTML name attribute is <prefix>:AuditFeesExpenses
                if re.match(r"^[^:]+:AuditFeesExpenses", item['name']):
                    print(file, end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())

trade_spider()
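If the values are needed for further processing rather than just printed, the same loop can return them. This is only a sketch under my own assumptions: the function name collect_audit_fees and the constant AUDIT_FEES are hypothetical, it uses pathlib instead of os.chdir so the working directory is left alone, and it filters on the HTML name attribute through attrs.

```python
import re
from pathlib import Path
from bs4 import BeautifulSoup

# Assumed pattern: any non-empty prefix, a colon, then AuditFeesExpenses.
AUDIT_FEES = re.compile(r"^[^:]+:AuditFeesExpenses$")

def collect_audit_fees(directory):
    """Return (file name, name attribute, tag text) for every matching tag."""
    rows = []
    for path in sorted(Path(directory).glob("*.html")):
        soup = BeautifulSoup(path.read_text(encoding="utf8"), "html.parser")
        # attrs filters on the HTML "name" attribute, not on the tag name
        for tag in soup.find_all("ix:nonfraction", attrs={"name": AUDIT_FEES}):
            rows.append((path.name, tag["name"], tag.get_text(strip=True)))
    return rows
```

Calling collect_audit_fees(r"C:\Independent Auditors Report") would then give one tuple per AuditFeesExpenses tag, which can be printed or written to a file as needed.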

2016-05-10 17:59:54 SNygard

I have updated my question so you can see what I want to do with my code –

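Updated the answer with example code. I think the problem comes from the two different 'name' attributes. The final solution will probably need two steps: get all 'NonFraction' tags, then filter them to get all 'AuditFeesExpenses' names. – SNygard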

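This works almost perfectly, but Python now prints every NonFraction tag name in the document (~100-200 per document). Is there a way to filter only for "AuditFeesExpenses" and at the same time tell Python to collect the text between the tags, >3,600<? If I can solve this, the code will work perfectly! –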
