2012-08-12 64 views
0

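How to extract specific data from an HTML source file

I want to programmatically extract some text data from an HTML page. I am using the VB.NET 2008 WebBrowser control to capture this data. The HTML of the web page is as follows: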

<div id="main-div"> <div id="top_header"> <div style="height: 85px;"> <div style="float: left; width: 400px; height: 80px;"><img src="cc/elance.png" width="318" height="80">1</div> <div style="float: left;"> <div class="trebuchet-23"><img src="images/logo.png"></div> <div style="height: 20px; padding-left: 315px;"> <div class="arial-12-left">Sitemap</div> <div class="arial-12-left">-</div> <div class="arial-12-left">Location</div> </div> </div> </div> </div> <div id="menu-div"> <div id="menu-left"></div> <div id="menu-middle"> <div id="menu-inside"> <div id="menu-ov"><a href="/">Home</a></div> <div class="menu-line"></div> <div id="menu-bt"><a href="abut.php">About Us</a></div> <div class="menu-line"></div> <div id="menu-bt"><a href="dflogin.html">Member Login </a></div> <div class="menu-line"></div> <div id="menu-bt"><a href="co.php">Contractors</a></div> <div class="menu-line"></div> <div id="menu-bt"><a href="cont.php">Contact</a></div> </div> <div id="search-bg"> </div> </div> <div id="menu-right"></div> </div> <div style="margin-bottom: 20px;"><img src="images/banner.jpg" width="890" height="336"></div> <div id="inside-div"> <div id="about-div"> <div id="about-left"></div> <div id="about-middle"> <div class="heading-div"> <div class="white-20">Welcom To</div> <div class="red-20"><img src="cc/elance.png" width="398" height="81"></div> </div> <div class="myriad-22"></div> <div style="height: 220px;"> <div style="float: left; margin-right: 10px;"> <font color="#ffffff" size="+2">Assignment Report</font><br><font color="#ffffff">Assignment No.1</font><br><font color="#ffffff">Total Posts 341</font><br><hr> 
***<font color="#ffffff">Pin Code = HF5O6</font><br><font color="#ffffff">TITLE = Xbox 360 with 20GB HDD + 2 wireless controllers + 8 Games + Wireless Headset + Guitar in Eastry</font><br><font color="#ffffff">DATE = 09/08/2012</font><br><font color="#ffffff">Tracking Key = 85265E712050-15152226115354753</font><br>*** 

<form name="form1" method="post" action="/dflogin.php"><input name="txtId" value="E712050-15" type="hidden"><input name="txtassId" value="1" type="hidden"><input name="txtPsw" value="HH29" type="hidden"><input name="txtLog" value="0" type="hidden"><hr><font color="#ffffff">*Please copy tracking key exact as it is. We track your report through this key.</font><br><h6 align="right"><input name="btnSub" value="Next" style="background-color: rgb(0, 153, 0); color: rgb(255, 255, 255);" type="SUBMIT"></h6></form> </div> <div style="float: left;"> <div class="Tre-13-gray" style="width: 280px;"></div> <div class="bt-read2"></div> </div> </div> </div> <div id="about-right"></div> </div> </div> <div id="footer-div"> <div id="footer-left"></div> <div id="footer-middle"> <!--<div style="float:left; padding-top:35px; margin-right:15px;"> <div class="white-20">Contact</div> <div class="red-20">us</div> </div>--> <div style="float: left; padding-top: 15px;"> <div style="height: 30px;"> <!--<div class="Tre-13-red2" style="width:120px;">Mailing address:</div> <div class="Tre-13-gray2" style="width:400px;">[email protected]</div>--> </div> <!-- <div class="Tre-13-gray3">General Pricing and Service information:<br /> </div> <div class="Tre-13-red2" style="width:255px;">General Operations Director/ Sales:</div> <div class="Tre-13-gray2" style="width:300px;">eoprojects.com</div>--> </div> <div style="float: left; padding-left: 312px;"> <div style="height: 69px; width: 90px;"> <div style="float: left; padding-top: 20px; padding-right: 15px;"><img src="images/icon_f.png" width="34" height="35"></div> <div style="float: left; padding-top: 20px;"><img src="images/icon_t.png" width="34" height="35"></div> </div> <div class="arial-11">© 2010 All Copyrights Reserved</div> </div> </div> <div id="footer-right">1</div> </div> </div> 

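The lines marked with asterisks (***) are the ones I want to extract from the code above.

Can anyone please tell me what code I should write to extract them?

Thanks in advance.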


[whathaveyoutried.com](http://whathaveyoutried.com/) – mintobit 2012-08-12 01:11:08

Answers

0

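You need to parse the HTML. There is a free .NET library called HtmlAgilityPack that is designed for exactly this. Something like the following should work (this assumes you have a variable rawHtml that holds your HTML code):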

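' Requires a reference to HtmlAgilityPack (Imports HtmlAgilityPack).
Dim parsedHtml As New HtmlDocument()
' LoadHtml parses a string of markup; Load expects a file path or stream.
parsedHtml.LoadHtml(rawHtml)

' Descendants takes an element name and returns every matching node.
Dim fontNodes As IEnumerable(Of HtmlNode) = parsedHtml.DocumentNode.Descendants("font")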
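
To narrow the font elements down to the four starred lines in the question, one option is to keep only the nodes whose text starts with a known label. The following is a minimal sketch, not a definitive implementation: the module name, the ExtractFields helper and the label list are assumptions made for illustration, and HtmlAgilityPack must be referenced by the project. The type is written as HtmlAgilityPack.HtmlDocument because a Windows Forms project also has System.Windows.Forms.HtmlDocument in scope.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module FontFieldExtractor
    ' Hypothetical helper: returns the text of every <font> element whose
    ' text starts with one of the labels seen in the question's page.
    Function ExtractFields(ByVal rawHtml As String) As List(Of String)
        ' Assumed label list, copied from the starred lines in the question.
        Dim labels() As String = {"Pin Code =", "TITLE =", "DATE =", "Tracking Key ="}
        Dim results As New List(Of String)

        ' Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(rawHtml)

        For Each node As HtmlNode In doc.DocumentNode.Descendants("font")
            Dim nodeText As String = node.InnerText.Trim()
            For Each label As String In labels
                If nodeText.StartsWith(label, StringComparison.OrdinalIgnoreCase) Then
                    results.Add(nodeText)
                    Exit For
                End If
            Next
        Next

        Return results
    End Function
End Module
```

With the WebBrowser control, the markup could be passed in as ExtractFields(WebBrowser1.DocumentText) once the page has finished loading, and the returned lines written to a file or split on " = " to separate label from value.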

I used this code... {{Dim MSDNpage1 As String = WebBrowser1.Document.Body.InnerText : My.Computer.FileSystem.WriteAllText("D:\Assignment.txt", MSDNpage1, True)}} ...but it extracts the text of the entire page. – 2012-08-12 13:16:25


I am new to VB.NET, so I don't know how to use the Agility Pack. – 2012-08-12 13:28:30


Is there any other way to extract specific text from the HTML source? – 2012-08-12 13:34:18

0

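The sample code below (VB6, using the Microsoft Internet Transfer Control) extracts an IP address from a website by searching the page source for marker strings.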

Option Explicit
' Add the Microsoft Internet Transfer Control (Inet1) to the form.

Dim pubIPA As String, pos1 As Long, pos2 As Long, str As String

Private Sub Form_Load()
    ' Download the page source as a string.
    str = Inet1.OpenURL("http://api.externalip.net/ip/", icString)
    ' Find the marker, then the opening and closing quotes after it.
    pos1 = InStr(str, "var ip =")
    pos1 = InStr(pos1 + 1, str, "'", vbTextCompare) + 1
    pos2 = InStr(pos1 + 1, str, "'", vbTextCompare)
    ' The value is the text between the two quotes.
    pubIPA = Mid$(str, pos1, pos2 - pos1)
    MsgBox pubIPA, vbInformation
    Unload Me
End Sub
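
The same marker-search idea can be applied to the page in the question from VB.NET without any extra library. The sketch below is one possible approach under stated assumptions, not the answerer's method: it swaps InStr and Mid$ for a regular expression, and the module name, the ExtractValue helper and the shortened sample string are hypothetical. It assumes each value sits inside a single font element in the form "label = value".

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MarkerSearchExample
    ' Hypothetical helper: returns the value that follows "label =" inside a
    ' <font> element, or Nothing when the label is not found.
    Function ExtractValue(ByVal html As String, ByVal label As String) As String
        Dim pattern As String = "<font[^>]*>\s*" & Regex.Escape(label) & "\s*=\s*(.*?)\s*</font>"
        Dim m As Match = Regex.Match(html, pattern, _
                                     RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        If m.Success Then
            Return m.Groups(1).Value
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Shortened sample copied from the question's page.
        Dim html As String = "<font color=""#ffffff"">Pin Code = HF5O6</font><br>" & _
                             "<font color=""#ffffff"">DATE = 09/08/2012</font><br>"
        Console.WriteLine(ExtractValue(html, "Pin Code")) ' prints HF5O6
        Console.WriteLine(ExtractValue(html, "DATE"))     ' prints 09/08/2012
    End Sub
End Module
```

In the WebBrowser scenario the html argument could be WebBrowser1.DocumentText. Regular expressions are brittle against markup changes, so this only suits pages whose layout is known and stable; a real HTML parser is the safer choice otherwise.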