2013-03-22

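PHP: extract unique emails from a huge HTML file into an array

I'm trying to get all the unique emails from an HTML page into an array. The file is huge, and there is no real pattern to where the emails appear.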

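Below is a sample HTML file named GetEmails.html. The actual file will contain CSS and a lot more code to sift through. In this example, note how inconsistently the emails are delimited: not all of them are separated by spaces; some are separated by commas, semicolons, and so on.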

<html> 
<body> 
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong> 
</p> 
<p><u>There will be pages and pages and pages of text to sift thru so get the emails into an array.</u></p> 
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong> and repeat This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong></p> 
<p>&nbsp;</p> 
</body> 
</html> 

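I thought about using explode with a space as the delimiter, but that may not work and may take up too many resources. I'm wondering whether PHP has a simple function that would help me get all the emails into an array. Here is what I tried.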

<?php

$lines = file('GetEmails.html');
$TheEmails = array();

foreach ($lines as $line_num => $line) {

    // Check whether the line contains an email.
    if (preg_match('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/si', $line)) {

        // Split the tag-stripped line into items on spaces.
        $items = explode(" ", strip_tags($line));

        // Keep only the items that contain an @ sign.
        $fl_array = preg_grep("/@/", $items);

        // Add the trimmed matches to the running list of emails.
        $TheEmails = array_merge($TheEmails, array_map('trim', $fl_array));
    }
}

// Keep only the unique emails.
$UniqueEmails = array_unique($TheEmails);

?>

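The code above works, but with the huge file I'll be using, I'm afraid it uses resources unnecessarily. It also doesn't handle emails separated by commas, such as ed@ed.com,mike@mike.com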

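Any ideas on the best way to do this? At the very least, it would be very helpful to learn the best approach, even if I can only get the emails that are separated by spaces.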

Hope this makes sense. Thanks a lot!

'preg_match_all'? – Tchoupi 2013-03-22 03:15:35

It's not a duplicate, because I don't believe that question handles emails that have characters right next to them, such as a comma, < or >. – 2013-03-22 03:29:32

Actually, I was wrong. The code at that link works. Should I delete this post or give credit to that one? – 2013-03-22 03:33:44
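
For readers following the thread, here is a minimal sketch of the `preg_match_all` approach suggested in the comments. It is not the code from the linked post, which isn't reproduced here. The helper name `extract_unique_emails` and the sample addresses are made up for illustration, and the pattern is the same pragmatic shape as the regex in the question rather than a full email validator.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the original post.
function extract_unique_emails($html)
{
    // Pragmatic pattern in the same shape as the question's regex (not full RFC 5322).
    $pattern = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i';

    // Collect every match in one pass; $matches[0] holds the full matches.
    preg_match_all($pattern, $html, $matches);

    // Lowercase so that differently cased duplicates collapse together.
    $emails = array_map('strtolower', $matches[0]);

    // array_unique() keeps the original keys, so reindex with array_values().
    return array_values(array_unique($emails));
}

// Made-up sample input standing in for the contents of GetEmails.html.
$sample = '<p>Mail ed@example.com,mike@example.org; Email:<strong>Ed@Example.com</strong></p>';
$unique = extract_unique_emails($sample);
print_r($unique);
```

Because the regex matches the addresses directly, commas, semicolons, and adjacent tags need no special handling, and no `explode` or `strip_tags` step is required. For a real file, the contents could be passed in with `file_get_contents('GetEmails.html')`, or the helper could be called per line if memory is a concern. Note that `{2,4}` will miss top-level domains longer than four letters.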
