Ignoring content with Beautiful Soup

https://en.wikipedia.org/wiki/America

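I need to grab the content of the h2, h3, and p tags. However, I want to ignore the following headings and the content under them:

"See also"
"Notes"
"References"
Ignore all tables/URLs

How would I do this in Beautiful Soup? My current code is below: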

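from bs4 import BeautifulSoup

# directory_of_raw_documents is assumed to be defined elsewhere.
def open_document():
    for i in range(1, 1 + 1):
        with open(directory_of_raw_documents + str(i), "r") as document:
            html = document.read()
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find('div', id='bodyContent')
        results = ""
        for item in body.find_all(['h2', 'h3', 'p']):
            # Collect the text of each heading and paragraph, one per line.
            results += item.get_text() + "\n"
        # Strip the "[edit]" link text that appears in the headings.
        results = results.replace("[edit]", "")
        print(results)

open_document()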

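My desired output would not contain anything from tables, or from the See also, Notes, or References sections. I would prefer not to use the Wikipedia module. I am on Python 2.7.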

Source

2016-11-04 Jorge

soup.find(something)

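searches the whole document for something. If you want to ignore some of the content, you need to narrow the scope first. In your case you can use:

soup.find(id='bodyContent')  # this narrows the scope to the main content

Then you can use find_all: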

soup.find(id = 'bodyContent').find_all(name=['h2','h3','p'], href=False)
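
The find_all call above narrows the search to the main content, but on its own it does not drop the "See also", "Notes", or "References" sections or the text inside tables. A minimal sketch of one way to do that follows. It assumes that each unwanted section starts at an h2 heading and runs until the next h2, and that the "[edit]" text can be removed with a plain string replace as in the question. The names extract_text and SKIP_SECTIONS are hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical set of section headings whose content should be dropped.
SKIP_SECTIONS = {"See also", "Notes", "References"}


def extract_text(html):
    """Return the h2/h3/p text of the main content, minus skipped sections and tables."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", id="bodyContent")

    lines = []
    skipping = False
    for tag in body.find_all(["h2", "h3", "p"]):
        # Ignore anything nested inside a table.
        if tag.find_parent("table") is not None:
            continue
        text = tag.get_text().replace("[edit]", "").strip()
        # Assumption: an h2 heading opens a new section, so it decides
        # whether everything up to the next h2 is skipped.
        if tag.name == "h2":
            skipping = text in SKIP_SECTIONS
        if not skipping and text:
            lines.append(text)
    return "\n".join(lines)
```

Calling extract_text(html) on the string read from each saved page would replace the inner loop in the question's open_document. If a page marks these sections with h3 headings instead, the h2 check would need to be adjusted.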

Source

2016-11-17 05:05:24
