2011-09-24

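Extracting information using regular expressions or another, more efficient method

I need to extract data from the plain text that is left after stripping the HTML tags from a web page. I strip the tags because the page consists of tabular data, but with tables nested in tables nested in tables, and so on (very ugly HTML). After cleaning the code (with HTML Tidy) and stripping the tags, the site returns something like this: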

Visitor ID :   123456789   HostName: 127.0.01     IP :  127.0.0.1  First Visit -> Entry Page :   First Visit Entry Page Title Example First Visit -> Referrer: http://somepage.com First Visit : 302 Day(s)    Last Visit :   09/23/2011   ISP: Initech   Country:  Some country Country:  Some  country Browser: Chrome Screen Res: Unknow 4 Billion colors (32 bit)   Javascript: Enabled  Page Views: 1  File Downloaded: 0  Daily Visits: 1 Visit Length: 0 minutes 0 seconds Entry Page: Entry page title Exit Page: Exit page title Referring URL: No 

(As you can see, a long and random mess.)

I would like to turn it into this:

Visitor ID: 123456789 
HostName: 127.0.01 
IP: 127.0.0.1 
First Visit: 302 Day(s) 
First Visit -> Entry Page: First Visit Entry Page Title Example 
First Visit -> Referrer: http://somepage.com 
Last Visit: 09/23/2011 
ISP: Initech 
Country: Some country 
Country: Some country 
Browser: Chrome 
Screen Res: Unknow 4 Billion colors (32 bit) 
Javascript: Enabled 
Page Views: 1 
File Downloaded: 0 
Daily Visits: 1 
Visit Length: 1 minute(s) 26 second 
Entry Page: Entry page title 
Exit Page: Exit page title 
Referring URL: No 

I am currently using regular expressions to remove the extra whitespace and to try to sort out the data. So far it almost works, using this:

$patterns  = array("/HostName\s*:/", 
         "/IP\s*:/", 
         "/First\s+Visit\s+->\s+Entry\s+Page\s*:/", 
         "/First\s+Visit\s+->\s+Referrer\s*:/", 
         "/First\s+Visit\s*:/", 
         "/\bLast\s+Visit\s*:/", 
         "/\bISP\s*:/", 
         "/\bCountry\s*:/", 
         "/\bBrowser\s*:/", 
         "/\bScreen\s*Res\s*:/", 
         "/\bJavascript\s*:/", 
         "/\bPage\s+Views\s*:/", 
         "/\bFile\s+Downloaded\s*:/", 
         "/\bDaily\s+Visits\s*:/", 
         "/\bVisit\s+Length\s*:/", 
         "/\bEntry\s+Page\s*:/", 
         "/\bExit\s+Page\s*:/", 
         "/\bReferring\s+URL\s*:/", 
         "/\bFrom\s+Campaign\s*:/" ); 

$replacements = array("\nHostName:", 
         "\nIP:", 
         "\nFirst Visit -> Entry Page:", 
         "\nFirst Visit -> Referrer:", 
         "\nFirst Visit:", 
         "\nLast Visit:", 
         "\nISP:", 
         "\nCountry:", 
         "\nBrowser:", 
         "\nScreen Res:", 
         "\nJavascript:", 
         "\nPage Views:", 
         "\nFile Downloaded:", 
         "\nDaily Visits:", 
         "\nVisit Length:", 
         "\nEntry Page:", 
         "\nExit Page:", 
         "\nReferring URL:", 
         "\nFrom Campaign:" ); 
ksort($patterns); 
ksort($replacements); 

$fixed_text  = preg_replace ($patterns, $replacements, $ugly_mess); 

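However, this does not work as expected. Note that some of the field names are similar to each other, so the regular expressions misfire and produce something like this: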

Visitor ID: 123456789 
HostName: 127.0.0.1 
IP: 127.0.0.1 
Last Visit: 302 Day(s) 
First Visit: 10 June 2010 
First Visit -> 
Entry Page: First Visit Entry Page Title Example 
First Visit -> Referrer: http://somepage 
.com 
ISP: Initech 
Country: Some Country 
Country: Some Country 
Browser: Chrome 
Screen Res: Unknow 4 Billion colors (32 bit) 
Javascript: Enabled 
Page Views: 1 
File Downloaded: 0 
Daily Visits: 1 
Visit Length: 1 minute(s) 26 second 
Entry Page: Entry page title 
Exit Page: Exit page title 
Referring URL: No 
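
The break after `First Visit ->` can be reproduced in isolation. As a minimal JavaScript sketch (the two-rule list and the names `rules` and `applySequentially` are illustrative and not part of the code above), applying the patterns one after another, as `preg_replace` does when given arrays, lets the later `Entry Page` pattern match inside text that the earlier `First Visit -> Entry Page` rule has already rewritten:

```javascript
// Two rules standing in for the full $patterns/$replacements arrays (assumed subset).
const rules = [
  [/First\s+Visit\s+->\s+Entry\s+Page\s*:/g, "\nFirst Visit -> Entry Page:"],
  [/\bEntry\s+Page\s*:/g, "\nEntry Page:"],
];

// Apply each rule to the output of the previous one, in array order.
function applySequentially(text) {
  return rules.reduce((acc, [re, rep]) => acc.replace(re, rep), text);
}

const sample = "IP :  127.0.0.1  First Visit -> Entry Page :   Some title";
const result = applySequentially(sample);

// JSON.stringify makes the inserted "\n" characters visible.
console.log(JSON.stringify(result));
```

The second rule fires on the `Entry Page:` that the first rule just produced, which would account for the separate `First Visit ->` and `Entry Page:` lines in the output above.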

I may be going about this the wrong way, which is why I am asking for suggestions or for fixes to the current code. Any ideas?

Answers


Rather than replacing patterns, what if you used matching instead? I am using JavaScript here, but you can easily change it back to PHP.

var pattern = "^(?:"; 
    pattern += "(?:Visitor\\s*ID\\s*:\\s*(\\d+)\\s*)"; 
    pattern += "|(?:HostName\\s*:\\s*([^ ]+)\\s*)"; 
    pattern += "|(?:IP\\s*:\\s*([^ ]+)\\s*)"; 
    pattern += "|(?:First\\s*Visit\\s*->\\s*Entry Page\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*->))"; 
    pattern += "|(?:First\\s*Visit\\s*->\\s*Referrer\\s*:\\s*(.+?)\\s*(?=First\\s*Visit\\s*:))"; 
    pattern += "|(?:First\\s*Visit\\s*:\\s*(\\d+)\\s*Day\\(s\\)\\s*)"; 
    pattern += "|(?:Last\\s*Visit\\s*:\\s*(\\d+/\\d+/\\d+)\\s*)"; 
    pattern += "|(?:ISP\\s*:\\s*(.+?)\\s*(?=Country\\s*:))"; 
    pattern += "|(?:Country\\s*:\\s*(.+?)\\s*(?=(?:Country|Browser)\\s*:))"; 
    pattern += "|(?:Browser\\s*:\\s*(.+?)\\s*(?=Screen\\s*Res\\s*:))"; 
    pattern += "|(?:Screen\\s*Res\\s*:\\s*(.+?)\\s*(?=Javascript\\s*:))"; 
    pattern += "|(?:Javascript\\s*:\\s*(.+?)\\s*(?=Page\\s*Views\\s*:))"; 
    pattern += "|(?:Page\\s*Views\\s*:\\s*(\\d+)\\s*)"; 
    pattern += "|(?:File\\s*Downloaded\\s*:\\s*(\\d+)\\s*)"; 
    pattern += "|(?:Daily\\s*Visits\\s*:\\s*(\\d+)\\s*)"; 
    pattern += "|(?:Visit\\s*Length\\s*:\\s*((?:\\d+ (?:hours|minutes|seconds)\\s*)+))"; 
    pattern += ")+"; 
    var regex = new RegExp(pattern); 

    // Collapse runs of whitespace to a single space before matching.
    var content = readData().replace(/\s+/g, " "); 
    var match = content.match(regex); 
    echo("Visitor Id: " + match[1]); 
    echo("Hostname: " + match[2]); 
    echo("IP: " + match[3]); 
    // continue on...
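
One caveat when running the pattern above in JavaScript rather than PHP: JavaScript resets the capture groups inside a repeated group on every iteration, so after a `(?:...)+` match only the groups from the last iteration are still set. A sketch of the same matching idea that sidesteps this is to scan for the field labels in a single left-to-right pass and take the text between two labels as the value. The names `labels`, `escapeRe`, `labelRe` and `extractFields` are made up for this example, and the label list is copied from the question's patterns plus `Visitor ID`:

```javascript
// Field labels taken from the question (plus "Visitor ID").
const labels = [
  "Visitor ID", "HostName", "IP",
  "First Visit -> Entry Page", "First Visit -> Referrer", "First Visit",
  "Last Visit", "ISP", "Country", "Browser", "Screen Res", "Javascript",
  "Page Views", "File Downloaded", "Daily Visits", "Visit Length",
  "Entry Page", "Exit Page", "Referring URL", "From Campaign",
];

// Escape regex metacharacters, then allow any whitespace run between words.
const escapeRe = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const labelRe = new RegExp(
  "\\b(" + labels.map((l) => escapeRe(l).replace(/ /g, "\\s+")).join("|") + ")\\s*:",
  "g"
);

// One scan over the text: a label that has been matched is consumed, so
// "Entry Page" cannot match again inside "First Visit -> Entry Page".
function extractFields(text) {
  const hits = [...text.matchAll(labelRe)];
  return hits.map((hit, i) => {
    const start = hit.index + hit[0].length;
    const end = i + 1 < hits.length ? hits[i + 1].index : text.length;
    const label = hit[1].replace(/\s+/g, " ");
    const value = text.slice(start, end).trim().replace(/\s+/g, " ");
    return [label, value];
  });
}

const sample =
  "Visitor ID :   123456789   HostName: 127.0.01     IP :  127.0.0.1  " +
  "First Visit -> Entry Page :   First Visit Entry Page Title Example " +
  "First Visit -> Referrer: http://somepage.com First Visit : 302 Day(s)    " +
  "Last Visit :   09/23/2011   ISP: Initech   " +
  "Entry Page: Entry page title Exit Page: Exit page title";

const fields = extractFields(sample);
for (const [label, value] of fields) {
  console.log(label + ": " + value);
}
```

Returning an array of pairs rather than an object keeps repeated labels such as the two `Country` entries. A value that itself contains a label followed by a colon would be split at that point, so this only fits data where that cannot happen.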