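Extracting information using regular expressions or another, more efficient method

I need to extract data from the plain text that is left after stripping the HTML markup from a web page. The tags are stripped because the page consists of tabular data, but with tables nested inside tables nested inside tables, and so on (very ugly HTML). After cleaning the code (with HTML Tidy) and stripping the tags, the site returns something like this: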
Visitor ID : 123456789 HostName: 127.0.01 IP : 127.0.0.1 First Visit -> Entry Page : First Visit Entry Page Title Example First Visit -> Referrer: http://somepage.com First Visit : 302 Day(s) Last Visit : 09/23/2011 ISP: Initech Country: Some country Country: Some country Browser: Chrome Screen Res: Unknow 4 Billion colors (32 bit) Javascript: Enabled Page Views: 1 File Downloaded: 0 Daily Visits: 1 Visit Length: 0 minutes 0 seconds Entry Page: Entry page title Exit Page: Exit page title Referring URL: No
(As you can see, it is one long, disorganized mess.)
I would like to turn it into this:
Visitor ID: 123456789
HostName: 127.0.01
IP: 127.0.0.1
First Visit: 302 Day(s)
First Visit -> Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage.com
Last Visit: 09/23/2011
ISP: Initech
Country: Some country
Country: Some country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit)
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No
I am currently using regular expressions to remove the extra whitespace and to try to sort out the data. So far it almost works, using this:
$patterns = array("/HostName\s*:/",
"/IP\s*:/",
"/First\s+Visit\s+->\s+Entry\s+Page\s*:/",
"/First\s+Visit\s+->\s+Referrer\s*:/",
"/First\s+Visit\s*:/",
"/\bLast\s+Visit\s*:/",
"/\bISP\s*:/",
"/\bCountry\s*:/",
"/\bBrowser\s*:/",
"/\bScreen\s*Res\s*:/",
"/\bJavascript\s*:/",
"/\bPage\s+Views\s*:/",
"/\bFile\s+Downloaded\s*:/",
"/\bDaily\s+Visits\s*:/",
"/\bVisit\s+Length\s*:/",
"/\bEntry\s+Page\s*:/",
"/\bExit\s+Page\s*:/",
"/\bReferring\s+URL\s*:/",
"/\bFrom\s+Campaign\s*:/" );
$replacements = array("\nHostName:",
"\nIP:",
"\nFirst Visit -> Entry Page:",
"\nFirst Visit -> Referrer:",
"\nFirst Visit:",
"\nLast Visit:",
"\nISP:",
"\nCountry:",
"\nBrowser:",
"\nScreen Res:",
"\nJavascript:",
"\nPage Views:",
"\nFile Downloaded:",
"\nDaily Visits:",
"\nVisit Length:",
"\nEntry Page:",
"\nExit Page:",
"\nReferring URL:",
"\nFrom Campaign:" );
ksort($patterns);
ksort($replacements);
$fixed_text = preg_replace ($patterns, $replacements, $ugly_mess);
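However, this does not work as expected. Note that some of the field names are similar to each other, and the regular expressions trip over them, producing something like this: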
Visitor ID: 123456789
HostName: 127.0.0.1
IP: 127.0.0.1
Last Visit: 302 Day(s)
First Visit: 10 June 2010
First Visit ->
Entry Page: First Visit Entry Page Title Example
First Visit -> Referrer: http://somepage
.com
ISP: Initech
Country: Some Country
Country: Some Country
Browser: Chrome
Screen Res: Unknow 4 Billion colors (32 bit)
Javascript: Enabled
Page Views: 1
File Downloaded: 0
Daily Visits: 1
Visit Length: 1 minute(s) 26 second
Entry Page: Entry page title
Exit Page: Exit page title
Referring URL: No
I may be going about this the wrong way, which is why I am asking for suggestions or for a fix to the current code. Any ideas?
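One direction that might be worth trying, sketched here under assumptions rather than offered as a confirmed fix: when `preg_replace` is given arrays, each pattern is applied in turn to the result of the previous replacement, so `/\bEntry\s+Page\s*:/` can match again inside an already rewritten `First Visit -> Entry Page:` (and the `ksort` calls do nothing on arrays that are already indexed 0..n). Combining all labels into one alternation and scanning the text once avoids that re-matching. The function name `split_fields` and the `$labels` list (including the added `Visitor ID`) are hypothetical, chosen for illustration.

```php
<?php
// Known field labels, as they appear in the sample text (assumed list).
$labels = array(
    'Visitor ID', 'HostName', 'IP',
    'First Visit -> Entry Page', 'First Visit -> Referrer', 'First Visit',
    'Last Visit', 'ISP', 'Country', 'Browser', 'Screen Res', 'Javascript',
    'Page Views', 'File Downloaded', 'Daily Visits', 'Visit Length',
    'Entry Page', 'Exit Page', 'Referring URL', 'From Campaign',
);

// Hypothetical helper: put each "Label: value" pair on its own line.
function split_fields(string $text, array $labels): string
{
    // Quote each word of a label and allow any whitespace run between words.
    $alternatives = array_map(function ($label) {
        $words = array_map(function ($word) {
            return preg_quote($word, '/');
        }, explode(' ', $label));
        return implode('\s+', $words);
    }, $labels);

    // One combined pattern, so the text is scanned a single time and a
    // label that was already consumed cannot be matched a second time.
    $pattern = '/\s*\b(' . implode('|', $alternatives) . ')\s*:\s*/';

    $result = preg_replace_callback($pattern, function ($m) {
        // Normalize whitespace inside the matched label.
        $label = preg_replace('/\s+/', ' ', $m[1]);
        return "\n" . $label . ': ';
    }, $text);

    return trim($result);
}

// Usage with a shortened sample:
$sample = 'Visitor ID : 123456789 First Visit -> Entry Page : Some Title First Visit : 302 Day(s)';
echo split_fields($sample, $labels), "\n";
```

Because every alternative must be followed by a colon, `First Visit` cannot match where the text continues with `->`, and the leftmost match always covers the full `First Visit -> Entry Page` label before `Entry Page` is considered. A value that itself contains a label followed by a colon would still be split, so this only holds if the values never look like field names.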