Get computed text style without rendering in Python

Given an HTML input:

html='''This is <b>Bold</b> or <strong>Also Bold</strong> or even <font style="text-weight: bold">Style Bold</font>'''

I want to extract only the bold text.

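Note that this example is simplified. In my real use case I have millions of documents to process, with much more structure and many more HTML tags that I don't care about.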

result=["Bold","Also Bold","Style Bold"]

The main problem is that there are several ways to set the font weight (HTML tags or stylesheets).

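I don't know whether there is a Python package that can render only the tags I care about and let me inspect the result, or whether the only option is to write a parser myself.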

Source

2017-04-03 Uri Goren

I doubt there is a dedicated library for selecting bold text. However, it is simple with an HTML parser such as BeautifulSoup:

from bs4 import BeautifulSoup

html = """This is <b>Bold</b> or <strong>Also Bold</strong> or even <font style="text-weight: bold">Style Bold</font>"""

soup = BeautifulSoup(html, "html.parser")

# select <b>, <strong>, and any tag whose style attribute contains "bold"
bold = soup.select("b, strong, [style*=bold]")

# > bold = [<b>Bold</b>, <strong>Also Bold</strong>, <font style="text-weight: bold">Style Bold</font>]

bold_textonly = [tag.text for tag in bold]  # extract text from tags

# > bold_textonly = ['Bold', 'Also Bold', 'Style Bold']

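The [style*=bold] selector matches any tag whose style attribute contains the substring "bold", so it covers both font-weight: bold and font-weight: bolder. If you only want <font> tags, the selector would be font[style*=bold].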
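To make the difference between the two selectors concrete, here is a small sketch. The sample markup, including the <span>, is made up for illustration and is not part of the question's input.

```python
from bs4 import BeautifulSoup

# Sample markup for illustration only; the <span> is not in the question.
html = ('<font style="font-weight: bold">Font Bold</font> '
        '<span style="font-weight: bolder">Span Bolder</span>')
soup = BeautifulSoup(html, "html.parser")

# Any tag whose style attribute contains the substring "bold".
any_tag = [tag.get_text() for tag in soup.select("[style*=bold]")]

# Only <font> tags whose style attribute contains the substring "bold".
font_only = [tag.get_text() for tag in soup.select("font[style*=bold]")]

print(any_tag)
print(font_only)
```

The first list should contain the text of both elements, the second only the text of the <font> element.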

Working example at repl.it

There are a few other ways to produce bold text, such as font-weight: 700 or certain specific fonts. Those are easy to add as well.
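As a rough sketch of how numeric weights could be handled, the filter function below reads the font-weight declaration from the inline style. The name has_bold_weight, the regular expression, and the threshold of 700 are assumptions made for this example. It only looks at font-weight declarations, so it complements the substring selector rather than replacing it.

```python
import re
from bs4 import BeautifulSoup

# Matches a font-weight declaration inside an inline style attribute.
FONT_WEIGHT_RE = re.compile(r"font-weight\s*:\s*([a-z0-9]+)", re.IGNORECASE)


def has_bold_weight(tag):
    """Return True if the tag's inline style declares a bold font weight."""
    match = FONT_WEIGHT_RE.search(tag.get("style", ""))
    if not match:
        return False
    value = match.group(1).lower()
    if value in ("bold", "bolder"):
        return True
    # Assumption: numeric weights of 700 and above count as bold.
    return value.isdigit() and int(value) >= 700


# Sample markup for illustration only.
html = ('<span style="font-weight: 700">Heavy</span> '
        '<span style="font-weight: 400">Normal</span> '
        '<span style="font-weight:bold">Keyword</span>')
soup = BeautifulSoup(html, "html.parser")

bold_texts = [tag.get_text() for tag in soup.find_all(has_bold_weight)]
print(bold_texts)
```

Whether 600 (semi-bold) should also count is a judgment call; adjust the threshold to match your data.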

Of course, this only works for inline styles. Selecting text that is made bold by an external stylesheet would be more challenging...
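For simple cases where the rules live in a <style> block of the same document, one possible approach is to pull out the rules that set a bold weight and feed their selectors back into soup.select. This is a deliberately naive sketch, not a CSS engine: the regular expressions and the function name bold_from_style_blocks are assumptions for this example, and it ignores the cascade, specificity, inheritance, media queries, CSS comments, and linked external files.

```python
import re
from bs4 import BeautifulSoup

# Naive "selector { declarations }" matcher; this is not a real CSS parser.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
BOLD_RE = re.compile(r"font-weight\s*:\s*(bold|bolder|[7-9]00)\b", re.IGNORECASE)


def bold_from_style_blocks(soup):
    """Collect the text of elements matched by bold rules in <style> blocks."""
    texts, seen = [], set()
    for style in soup.find_all("style"):
        css = style.string or ""
        for selector, declarations in RULE_RE.findall(css):
            if not BOLD_RE.search(declarations):
                continue
            for tag in soup.select(selector.strip()):
                if id(tag) not in seen:  # avoid reporting an element twice
                    seen.add(id(tag))
                    texts.append(tag.get_text())
    return texts


# Sample document for illustration only.
html = '''<style>
.hl { font-weight: bold; }
p.note { color: red; }
</style>
<p>Plain <span class="hl">Highlighted</span> text</p>
<p class="note">Just red</p>'''
soup = BeautifulSoup(html, "html.parser")

stylesheet_bold = bold_from_style_blocks(soup)
print(stylesheet_bold)
```

Anything beyond this kind of flat rule list would likely need a dedicated CSS parser or an actual rendering engine.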

Source

2017-04-03 19:52:31 helb

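I don't think there is a general, reliable solution that covers every possible use case (font styles can also be set via CSS), but you can get close by finding all b and strong elements, as well as font elements with "bold" inside the style attribute.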

Working example using the BeautifulSoup library (using a searching function):

from bs4 import BeautifulSoup 


html = '''This is <b>Bold</b> or <strong>Also Bold</strong> or even <font style="text-weight: bold">Style Bold</font>''' 
soup = BeautifulSoup(html, "html.parser") 


def bold_only(tag): 
    is_b = tag.name == 'b' 
    is_strong = tag.name == 'strong' 
    is_bold_font = tag.name == 'font' and 'style' in tag.attrs and 'bold' in tag['style'] 

    return is_b or is_strong or is_bold_font 

print([bold.get_text() for bold in soup.find_all(bold_only)])

Prints:

['Bold', 'Also Bold', 'Style Bold']

Source

2017-04-03 19:53:25 alecxe
