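I am trying to get all the unique emails from an HTML page into an array. The file is huge, and there is no real pattern to how the emails appear in it.
PHP: extract unique emails from a huge HTML file and put them into an array
Below is a sample HTML file named GetEmails.html. The actual file will contain CSS and much more code to sift through. In this example, note the varied patterns around the emails: not all of them are separated by spaces; some are separated by commas, semicolons, and so on.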
<html>
<body>
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong>
</p>
<p><u>There will be pages and pages and pages of text to sift thru so get the emails into an array.</u></p>
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong> and repeat This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong></p>
<p> </p>
</body>
</html>
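I thought about using explode with spaces, but that might not work and might use too many resources. I am just wondering whether PHP has a simple function that would help me get all the emails into an array. Here is what I tried.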
<?php
$lines = file('GetEmails.html');
$TheEmails = array();
foreach ($lines as $line_num => $line) {
    // Check whether the line contains an email.
    if (preg_match('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/si', $line)) {
        // Split the tag-stripped line on spaces.
        $words = explode(" ", strip_tags($line));
        // Keep only the items that contain an @ sign.
        $fl_array = preg_grep("/@/", $words);
        // Add each of those items to the list of emails.
        foreach ($fl_array as $email) {
            $TheEmails[] = trim($email);
        }
    }
}
// Keep only the unique emails.
$UniqueEmails = array_unique($TheEmails);
?>
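While the code above works, I am afraid it uses resources unnecessarily on the huge file I will be running it on. It also does not account for emails separated by commas, such as ed@ed.com,mike@mike.com
Any ideas on the best way to do this? At the very least, it would be very helpful to learn the best way to do this, even if I can only get the emails that are separated by spaces, etc.
Hope this makes sense. Thank you very much!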
'preg_match_all'? – Tchoupi 2013-03-22 03:15:35
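It is not a duplicate, because I do not believe that question handles emails that have characters right next to them, such as a comma or < or >, etc. – 2013-03-22 03:29:32
Actually, I was wrong. The code at that link works. Should I delete this post or credit that post? – 2013-03-22 03:33:44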
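For reference, the `preg_match_all` suggestion from the comments could look roughly like the sketch below. It is not the code from the linked post. The helper names `extractEmails` and `extractUniqueEmailsFromFile` are hypothetical, and the pattern is a simplified assumption rather than a full email validator. Matching the addresses directly, instead of splitting on spaces, means commas, semicolons, and surrounding tags do not matter. Reading the file with `fgets` keeps only one line in memory at a time, which may help with the resource concern.

```php
<?php
// Hypothetical helper: returns every email-like substring found in $text.
function extractEmails(string $text): array
{
    // Simplified pattern (an assumption, not a full RFC 5322 matcher).
    $pattern = '/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i';
    preg_match_all($pattern, $text, $matches);
    // $matches[0] holds the full-pattern matches.
    return $matches[0];
}

// Hypothetical helper: streams the file line by line so the whole
// document never has to be loaded into memory at once.
function extractUniqueEmailsFromFile(string $path): array
{
    $seen = array();
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return array();
    }
    while (($line = fgets($handle)) !== false) {
        foreach (extractEmails($line) as $email) {
            // Lowercased array keys de-duplicate case-insensitively.
            $seen[strtolower($email)] = true;
        }
    }
    fclose($handle);
    return array_keys($seen);
}
```

With these helpers, `$UniqueEmails = extractUniqueEmailsFromFile('GetEmails.html');` would fill the array from the sample file. Using array keys for de-duplication avoids building a large list of repeats before calling `array_unique`.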