2011-04-08

Perl MIME::Parser and encodings in nested bodies (message/rfc822)

Arghhh, this is not easy. I'm trying to parse some mail with Perl. Let's take an example:

From: [email protected] 
Content-Type: multipart/mixed; 
     boundary="----_=_NextPart_001_01CBE273.65A0E7AA" 
To: [email protected] 

This is a multi-part message in MIME format. 

------_=_NextPart_001_01CBE273.65A0E7AA 
Content-Type: multipart/alternative; 
     boundary="----_=_NextPart_002_01CBE273.65A0E7AA" 


------_=_NextPart_002_01CBE273.65A0E7AA 
Content-Type: text/plain; 
     charset="UTF-8" 
Content-Transfer-Encoding: base64 

[base64-content] 
------_=_NextPart_002_01CBE273.65A0E7AA 
Content-Type: text/html; 
     charset="UTF-8" 
Content-Transfer-Encoding: base64 

[base64-content] 
------_=_NextPart_002_01CBE273.65A0E7AA-- 
------_=_NextPart_001_01CBE273.65A0E7AA 
Content-Type: message/rfc822 
Content-Transfer-Encoding: 7bit 

X-MimeOLE: Produced By Microsoft Exchange V6.5 
Content-class: urn:content-classes:message 
MIME-Version: 1.0 
Content-Type: multipart/mixed; 
     boundary="----_=_NextPart_003_01CBE272.13692C80" 
From: [email protected] 
To: [email protected] 

This is a multi-part message in MIME format. 

------_=_NextPart_003_01CBE272.13692C80 
Content-Type: multipart/alternative; 
     boundary="----_=_NextPart_004_01CBE272.13692C80" 


------_=_NextPart_004_01CBE272.13692C80 
Content-Type: text/plain; 
     charset="iso-8859-1" 
Content-Transfer-Encoding: quoted-printable 

=20 

Viele Gr=FC=DFe 

------_=_NextPart_004_01CBE272.13692C80 
Content-Type: text/html; 
     charset="iso-8859-1" 
Content-Transfer-Encoding: quoted-printable 

<html>...</html> 
------_=_NextPart_004_01CBE272.13692C80-- 
------_=_NextPart_003_01CBE272.13692C80 
Content-Type: application/x-zip-compressed; 
     name="abc.zip" 
Content-Transfer-Encoding: base64 
Content-Disposition: attachment; 
     filename="abc.zip" 

[base64-content] 

------_=_NextPart_003_01CBE272.13692C80-- 
------_=_NextPart_001_01CBE273.65A0E7AA-- 

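This mail was sent from Outlook with another mail attached. As you can see, it is a fairly complex message with many different content types (text/plain, text/html, message/rfc822, application/xyz)... and the message/rfc822 part is the problem. I wrote a script in Perl 5.8 (Debian Squeeze) that parses this message with MIME::Parser.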

use strict;
use warnings;
use MIME::Parser;

# Parse the message from STDIN and keep the decoded bodies in memory.
my $parser = MIME::Parser->new;
$parser->output_to_core(1);
my $top_entity = $parser->parse(\*STDIN);

my $plain_body = "";
my $html_body  = "";

# Walk all parts depth-first, including those of the nested message.
foreach my $part ($top_entity->parts_DFS) {
    my $content_type = $part->effective_type;
    my $body         = $part->bodyhandle;
    next unless $body;
    if ($content_type eq 'text/plain') {
        $plain_body .= "\n" if $plain_body ne '';
        $plain_body .= $body->as_string;   # bytes in the part's own charset
    } elsif ($content_type eq 'text/html') {
        $html_body .= "\n" if $html_body ne '';
        $html_body .= $body->as_string;
    }
}
# parsing of attachment comes later
print $plain_body;

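The first message part (base64 content) contains German umlauts, and they are displayed correctly on STDOUT. The nested message/rfc822 part is parsed automatically by MIME::Parser and merged with the top-level body into a single entity. As you can see, the nested message also contains German umlauts, encoded as quoted-printable, but these are not displayed correctly on STDOUT. If I convert the output to UTF-8 before printing, the quoted-printable umlauts are displayed correctly, but the base64-encoded ones are not. I have been trying for hours to extract the message/rfc822 part separately and re-encode it, but nothing helps. Can anyone help?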

Regards

Answer


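Assuming your console displays UTF-8, this makes sense. It correctly displays the parts that were already UTF-8 once decoded, but of course the latin1 characters are not displayed correctly.
Later on, you convert everything to UTF-8, but that makes no sense for data that is already UTF-8. So then only the former latin1 umlauts are displayed correctly.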
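
To make the effect concrete, here is a small sketch that uses only core modules. The sample string is taken from the nested part of the question; the variable names and the UTF-8 sample are assumptions made for illustration. It shows that a blanket latin1-to-UTF-8 conversion repairs the iso-8859-1 part but double-encodes a part that was already UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use MIME::QuotedPrint qw(decode_qp);

# Nested part: quoted-printable body declared as charset="iso-8859-1".
my $latin1_bytes = decode_qp("Viele Gr=FC=DFe");

# Outer part: the same text as it would look after transfer decoding
# when the part is declared as charset="UTF-8" (hypothetical sample).
my $utf8_bytes = encode('UTF-8', "Viele Gr\x{FC}\x{DF}e");

# Blanket conversion: treat every body as latin1 and emit UTF-8.
my $fixed  = encode('UTF-8', decode('iso-8859-1', $latin1_bytes));
my $broken = encode('UTF-8', decode('iso-8859-1', $utf8_bytes));

printf "latin1 part after conversion matches UTF-8 text: %s\n",
    ($fixed eq $utf8_bytes ? 'yes' : 'no');
printf "UTF-8 part after conversion matches UTF-8 text:  %s\n",
    ($broken eq $utf8_bytes ? 'yes' : 'no');
printf "byte lengths: original %d, double-encoded %d\n",
    length($utf8_bytes), length($broken);
```

The transfer encoding (base64 or quoted-printable) only describes how the bytes were wrapped for transport. Which characters those bytes stand for is decided by the charset parameter alone.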

You cannot get this right without looking at the "charset" parameter in each part's Content-Type and acting accordingly.
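
One way this could look with MIME::Parser is sketched below. It is not a definitive implementation: the helper name `decoded_plain_text`, the fallback to iso-8859-1 for unknown charsets, and the one-part sample message are assumptions made for illustration. The idea is to read the charset from each part's header, decode that part's bytes into Perl characters, and encode to UTF-8 only once, on output.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use MIME::Parser;

# Hypothetical helper: collect all text/plain parts as Perl character
# strings, decoding each part according to its own declared charset.
sub decoded_plain_text {
    my ($entity) = @_;
    my @chunks;
    foreach my $part ($entity->parts_DFS) {
        next unless $part->effective_type eq 'text/plain';
        my $body = $part->bodyhandle or next;

        # Charset declared on this part; US-ASCII is the MIME default.
        my $charset = $part->head->mime_attr('content-type.charset')
            || 'us-ascii';

        # Assumption: treat unknown charset names as iso-8859-1.
        my $enc = find_encoding($charset) || find_encoding('iso-8859-1');

        push @chunks, $enc->decode($body->as_string);
    }
    return join "\n", @chunks;
}

# Minimal sample modelled on the nested part from the question.
my $message = join "\n",
    'MIME-Version: 1.0',
    'Content-Type: text/plain; charset="iso-8859-1"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    'Viele Gr=FC=DFe',
    '';

my $parser = MIME::Parser->new;
$parser->output_to_core(1);
my $entity = $parser->parse_data($message);
my $text   = decoded_plain_text($entity);

# Encode exactly once, at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
print $text;
```

For the script in the question, the same helper could be applied to the entity returned by `$parser->parse(\*STDIN)`. The text/html parts can be handled the same way.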


OK, thanks. I understand what the problem is. I'm now using a PHP script, which I'm quite happy with. – rabudde 2011-05-16 04:41:34