Printing the doctype declaration

from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import re 

html = urlopen("http://www.bbc.co.uk/iplayer/live/bbcone?area=london") 
bsObj = BeautifulSoup(html, "html.parser") 
version = bsObj.find(string = re.compile('DOCTYPE html')) 

if version in bsObj: 
    print("Yes") 
else: 
    print("No")

I know that "http://www.bbc.co.uk/iplayer/live/bbcone?area=london" has an HTML5 doctype declaration (<!DOCTYPE html>), but when I run this script the output is "No". What am I doing wrong?

Source

2017-04-05 Jason

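<!DOCTYPE html> is not an HTML tag but a declaration, and apparently find() does not work on the full text of those. See http://stackoverflow.com/questions/2499358/get-document-doctype-with-beautifulsoup for some ideas. – kindall

@kindall - This question looks like it should be a dupe, though I hesitate, since you didn't ;-) ... – mgilson

I don't think this question is strictly a dupe, since it asks what is wrong with .find() rather than how to get the doctype. – kindall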

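The doctype is an instruction to the browser rather than an HTML tag, so a tag-based find or find_all will not locate it.

Besides that, your regex does not work because the string value Beautiful Soup stores for the doctype is only html, not DOCTYPE html.

You can use the link that user kindall mentioned, or do it this way: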
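To make the point about the string value concrete, here is a minimal sketch run against an inline page instead of the live URL. SAMPLE_HTML is a hypothetical stand-in for the downloaded markup, and the variable names are only for illustration.

```python
import re
from bs4 import BeautifulSoup, Doctype

# Hypothetical stand-in for the downloaded page (assumption, not the BBC markup).
SAMPLE_HTML = "<!DOCTYPE html><html><head><title>t</title></head><body><p>hi</p></body></html>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The pattern from the question: no string in the tree contains "DOCTYPE html",
# so find() returns None.
version = soup.find(string=re.compile("DOCTYPE html"))
print(repr(version))

# The declaration is kept as a Doctype node among the top-level children,
# and its text is just the part after "DOCTYPE ".
doctype = next((node for node in soup.contents if isinstance(node, Doctype)), None)
print(repr(doctype))
```

Because find() returns None here, the "Yes" branch in the question can never be reached.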

import requests
from bs4 import BeautifulSoup, Doctype

html = requests.get("http://www.bbc.co.uk/iplayer/live/bbcone?area=london")
soup = BeautifulSoup(html.content, "html.parser")

# Collect every string node whose text is exactly "html"
version = soup.find_all(string="html")

# Keep the first match that is a Doctype node (raises StopIteration if none)
DOCTYPE = next(item for item in version if isinstance(item, Doctype))

print(DOCTYPE)

which prints:

html
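If the goal is the Yes/No check from the question, the same idea could be wrapped in a small helper. This is only a sketch: get_doctype and is_html5 are hypothetical names, and the two sample strings stand in for real pages.

```python
from bs4 import BeautifulSoup, Doctype

def get_doctype(markup):
    """Return the doctype text (e.g. "html"), or None if the page has none."""
    soup = BeautifulSoup(markup, "html.parser")
    for node in soup.contents:
        if isinstance(node, Doctype):
            return str(node)
    return None

def is_html5(markup):
    """True when the doctype text is exactly "html", ignoring case."""
    doctype = get_doctype(markup)
    return doctype is not None and doctype.strip().lower() == "html"

# Hypothetical sample pages (assumptions for illustration only).
with_doctype = "<!DOCTYPE html><html><body></body></html>"
without_doctype = "<html><body></body></html>"

print("Yes" if is_html5(with_doctype) else "No")
print("Yes" if is_html5(without_doctype) else "No")
```

Returning None for a missing doctype avoids the StopIteration that a bare next() would raise on a page without a declaration.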

Source

2017-04-06 08:19:53 Zroq
