How do I get clean output from html2text in Python?

I have the following Python program:

import urllib.request as urllib2
import html2text

html = urllib2.urlopen("http://www.stern.de/")
page_source = html.read()

h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True

text = h.handle(str(page_source))

print(text)

The output is:

\n \n\n 

    * \n Anmelden 
\n\n 

    * \n 

Sie haben noch keinen Account? 

\n Kostenlos neu registrieren 

\n \n 

\n

How can I filter out the "\n"?

I tried it this way, for example, but it does not work:

wordList = text.split()

for word in wordList:
    if word != "\n":
        print(word)

This is the output after splitting:

\n\n 
* 
\n 
Anmelden 
\n\n 
* 
\n 
Sie 
haben 
noch 
keinen 
Account? 
\n 
Kostenlos 
neu 
registrieren 
\n 
\n 
\n

So my check did not work. How can I check for the \n newline character?

2015-08-28 Eternal_Sunshine

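That is the newline character. If you print it, it "disappears" (it breaks the line properly instead of showing up as '\n'). Do you really want all the text run together?

I want to separate each word into an array. If I don't filter it out, \n is recognized as a word.

'text.split()' already splits on whitespace, and a real newline counts as whitespace – jonrsharpe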
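
To illustrate the comment above, here is a minimal sketch of how `split()` treats a real newline compared with the two-character sequence backslash + n. The sample strings are made up for illustration and are not taken from the real page.

```python
# Hypothetical sample strings, not taken from the real page.
real_newline = "Anmelden\nKostenlos neu registrieren"          # contains an actual newline
literal_backslash_n = "Anmelden\\nKostenlos neu registrieren"  # contains backslash + 'n'

# split() with no arguments splits on any whitespace, so a real newline vanishes.
words_real = real_newline.split()

# A literal backslash followed by 'n' is not whitespace, so it stays glued to the words.
words_literal = literal_backslash_n.split()

print(words_real)
print(words_literal)
```

If the second case matches what you see, the text probably contains escaped newlines rather than real ones, and comparing each word against "\n" will never match.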

-2

Have you tried replace?

text.replace('\n', '')

2015-08-28 15:59:57 Oberix

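That doesn't work. I already tried it.

OK, I solved the problem: I debugged it and found that in debug mode the \n actually shows up as \\n, that is, a literal backslash followed by n.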

text = text.replace('\\n', '')
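
A likely explanation, stated here as an assumption because the thread does not spell it out: in Python 3, `html.read()` returns `bytes`, and calling `str()` on a bytes object returns its repr, in which every newline is written as backslash + n. The sketch below uses a hypothetical bytes sample in place of the downloaded page, and UTF-8 is an assumed encoding.

```python
# Hypothetical stand-in for html.read(); in Python 3 that call returns bytes.
page_source = b"<ul>\n<li>\nAnmelden</li>\n</ul>\n"

# str() on bytes returns the repr, so each newline becomes backslash + 'n'.
as_repr = str(page_source)

# Decoding yields real text; UTF-8 is an assumption about the page encoding.
as_text = page_source.decode("utf-8")

print("\\n" in as_repr)  # is a literal backslash-n present?
print("\n" in as_repr)   # is a real newline present?
print(as_text.split())   # real newlines are treated as whitespace
```

Passing the decoded string to `h.handle(...)` instead of `str(page_source)` should avoid the escaped newlines in the first place, so the later `replace('\\n', '')` may no longer be needed.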

2015-08-28 16:25:40
