2010-06-09

Why does HTML::TreeBuilder show mojibake / weird characters in its output? I am running into this with the script below.

use strict;
use warnings;
use WWW::Curl::Easy; 
use HTML::TreeBuilder; 
my $cookie_file ='/tmp/pcook'; 
my $curl = WWW::Curl::Easy->new;
my $response_body; 
my $charset = 'utf-8'; 
$DocOffline::charset = undef; 
$curl->setopt (CURLOPT_URL, 'http://www.breitbart.com/article.php?id=D9G7CR5O0&show_article=1'); 
$curl->setopt (CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.9 (KHTML, like Gecko) Chrome/6.0.400.0 Safari/533.9'); 
$curl->setopt (CURLOPT_HEADER, 0); 
$curl->setopt (CURLOPT_FOLLOWLOCATION, 1); 
$curl->setopt (CURLOPT_AUTOREFERER, 1); 
$curl->setopt (CURLOPT_SSL_VERIFYPEER, 0); 
$curl->setopt (CURLOPT_COOKIEFILE, $cookie_file); 
$curl->setopt (CURLOPT_COOKIEJAR, $cookie_file); 
$curl->setopt (CURLOPT_HEADERFUNCTION, \&headerCallback); 
open (my $fileb, ">", \$response_body); 
$curl->setopt(CURLOPT_WRITEDATA,$fileb); 
my $retcode = $curl->perform; 
if ($retcode == 0) { 
    my $dom_tree = HTML::TreeBuilder->new(); 
    $dom_tree->ignore_elements(qw(script style)); 
    $dom_tree->utf8_mode(1); 
    $dom_tree->parse($response_body); 
    $dom_tree->eof(); 
    print $dom_tree->as_HTML('<>&', ' ', {}); 
} 
sub headerCallback {
    my ($data, $pointer) = @_;
    # Test the match itself; $1 is only meaningful after a successful match.
    if ($data =~ m/Content-Type:\s*.*;\s*charset=(.*)/i) {
        $charset = $1;
        $charset =~ s/[^a-zA-Z0-9_\-]//g;
    }
    return length($data);
}
+2

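You are printing the DOM tree, but your terminal may not support UTF-8. Try writing it to a file and then reading it with a browser that displays the original page correctly. – MvanGeest 2010-06-09 14:18:16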

+0

I tried printing it as CGI to the browser, and the result is the same. – Vjy 2010-06-09 16:20:04

Answer

2

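You went a whole day without an answer because your code is messy in both form and content, and you did not even reduce your program to a minimal test case. MvanGeest also made a misdiagnosis in the comment attached to the question.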

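The problem is that whoever wrote Breitbart's CMS is incompetent: they insert the NCR &#151; (which refers to U+0097, a non-printing control character, and is not even valid in HTML) where they should simply have inserted the character U+2014 EM DASH; after all, the document encoding is declared as UTF-8. (One can clearly see that the encoding they had in mind was Windows-1252, where code point 151 (decimal) is assigned to the em dash.)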
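
The mismatch between the two readings of 151 can be checked directly with Encode. This is only a minimal sketch; the helper name `codepoint_of_byte` is made up for illustration and is not part of the original code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report which code point a single byte value
# decodes to under the given encoding.
sub codepoint_of_byte {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, chr($byte));
    return sprintf('U+%04X', ord($char));
}

print codepoint_of_byte('Windows-1252', 151), "\n";   # U+2014 (EM DASH)
print codepoint_of_byte('ISO-8859-1',   151), "\n";   # U+0097 (C1 control character)
```

Read as Windows-1252, 151 is the em dash the author wanted; read as a Unicode code point, which is how an NCR is interpreted, it lands on a control character.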

You can work around their shortcomings with an explicit decode/encode step.

use Encode qw(encode decode); 
⋮ 
my $string_representation = $dom_tree->as_HTML('<>&', ' ', {}); 
my $octets = encode('UTF-8', decode('Windows-1252', $string_representation));
⋮ 
# send the correct Content-Type header in your CGI program before printing the HTTP body 
print $octets;
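
To see what that decode/encode step does in isolation, here is a small sketch on a made-up three-byte input rather than the real page. The wrapper name `cp1252_to_utf8` and the sample string are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical wrapper around the decode/encode step shown above:
# interpret the input as Windows-1252 bytes, then emit UTF-8 octets.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('Windows-1252', $bytes));
}

# Made-up sample: 0x97 is the em dash byte in Windows-1252.
my $sample = "a\x97b";
my $octets = cp1252_to_utf8($sample);

# Hex dump of the result, one byte per pair.
print join(' ', map { sprintf('%02X', ord($_)) } split //, $octets), "\n";   # 61 E2 80 94 62
```

The bytes E2 80 94 in the middle are the UTF-8 encoding of U+2014 EM DASH, which is what a browser expects when the page is served as UTF-8.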