2016-07-28 158 views
3

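This is my code for splitting URLs, but it has a problem. Every link comes out with a doubled segment, for example www.utem.edu.my/portal/portal. The segment /portal/portal always appears twice in every link. Any suggestions on how to extract the links from a web page? How do I split a URL?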

public String crawlURL(String strUrl) { 
    String results = ""; // For return 
    String protocol = "http://"; 

    // Assigns the input to the inURL variable and checks to add http 
    String inURL = strUrl; 
    if (!inURL.toLowerCase().contains("http://".toLowerCase()) && 
      !inURL.toLowerCase().contains("https://".toLowerCase())) { 
     inURL = protocol + inURL; 
    } 

    // Pulls URL contents from the web 
    String contectURL = pullURL(inURL); 
    if (contectURL == "") { // If it fails, then try with https 
     protocol = "https://"; 
     inURL = protocol + inURL.split("http://")[1]; 
     contectURL = pullURL(inURL); 
    } 

    // Declares some variables to be used inside loop 
    String aTagAttr = ""; 
    String href = ""; 
    String msg = ""; 

    // Finds A tag and stores its href value into output var 
    String bodyTag = contectURL.split("<body")[1]; // Find 1st <body> 
    String[] aTags = bodyTag.split(">"); // Splits on every tag 

    //To show link different from one another 
    int index = 0; 

    for (String s: aTags) { 
    // Process only if it is A tag and contains href 
    if (s.toLowerCase().contains("<a") && s.toLowerCase().contains("href")) { 

     aTagAttr = s.split("href")[1]; // Split on href 

     // Split on space if it contains it 
     if (aTagAttr.toLowerCase().contains("\\s")) 
      aTagAttr = aTagAttr.split("\\s")[2]; 

     // Splits on the link and deals with " or ' quotes 
     href = aTagAttr.split(((aTagAttr.toLowerCase().contains("\""))? "\"" : "\'"))[1]; 

     if (!results.toLowerCase().contains(href)) 
      //results += "~~~ " + href + "\r\n"; 

     /* 
     * Last touches to URl before display 
     *  Adds http(s):// if not exist 
     *  Adds base url if not exist 
     */ 

     if(results.toLowerCase().indexOf("about") != -1) { 
      //Contains 'about' 
     } 
     if (!href.toLowerCase().contains("http://") && !href.toLowerCase().contains("https://")) { 

      // http:// + baseURL + href 
      if (!href.toLowerCase().contains(inURL.split("://")[1])) 
       href = protocol + inURL.split("://")[1] + href; 
      else 
       href = protocol + href; 
     } 

     System.out.println(href);//debug 
+0

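You have `if (!results.toLowerCase().contains(href)) //results += "~~~ " + href + "\r\n";`. This causes a bug: because the statement after the `if` is commented out, the `if` now applies to a different part of the code instead of simply doing nothing. – martijnn2008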
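To illustrate the point in this comment, here is a minimal sketch of how a brace-less `if` attaches to the next statement once its intended body is commented out. The class and method names (`DanglingIf`, `withCommentedBody`, `withBraces`) are made up for the illustration and are not part of the question's code. In the question itself, the next statement happens to be the `if (results...indexOf("about") != -1)` check, and since `results` is never appended to, the duplicate check can never filter anything.

```java
class DanglingIf {
    // Same shape as the question: the line after the if is commented out,
    // so the if guards the NEXT real statement instead.
    static String withCommentedBody(boolean condition) {
        StringBuilder log = new StringBuilder();
        if (condition)
            // log.append("guarded;");   // commented out, like results += ...
        log.append("next;");             // this line is now the body of the if
        log.append("always;");
        return log.toString();
    }

    // With braces, commenting out the body leaves an empty block
    // and the following statements are unaffected.
    static String withBraces(boolean condition) {
        StringBuilder log = new StringBuilder();
        if (condition) {
            // log.append("guarded;");
        }
        log.append("next;");
        log.append("always;");
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(withCommentedBody(false)); // prints "always;"
        System.out.println(withBraces(false));        // prints "next;always;"
    }
}
```

Always putting braces around `if` bodies avoids this class of mistake.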

Answers

4

Consider using the URL class...

Use it as suggested by the documentation :)

import java.net.URL;

public class ParseURL {
    public static void main(String[] args) throws Exception {

        // Parse the URL once; each getter below returns one of its components
        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                + "/index.html?name=networking#DOWNLOADING");

        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("authority = " + aURL.getAuthority());
        System.out.println("host = " + aURL.getHost());
        System.out.println("port = " + aURL.getPort());
        System.out.println("path = " + aURL.getPath());
        System.out.println("query = " + aURL.getQuery());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("ref = " + aURL.getRef());
    }
}

Output:

protocol = http

authority = example.com:80

host = example.com

port = 80

After that you can take the elements you need and build a new String/URL from them :)
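As a rough sketch of that last step, the parts returned by the getters can be passed back into the four-argument `URL` constructor, for example to retry the same address over https as the question's code tries to do with string splitting. The class name `RebuildUrl`, the helper `withProtocol`, and the sample address are assumptions made for this example. Note that on recent JDKs the `URL` constructors are deprecated in favor of going through `URI`, although they still work.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RebuildUrl {
    // Hypothetical helper: returns the same URL with a different protocol.
    static String withProtocol(String spec, String protocol) {
        try {
            URL original = new URL(spec);
            // getFile() is the path plus the query string;
            // getPort() is -1 when the original URL did not name a port.
            URL rebuilt = new URL(protocol, original.getHost(),
                    original.getPort(), original.getFile());
            return rebuilt.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                withProtocol("http://www.utem.edu.my/portal/index.php?id=1", "https"));
    }
}
```

This avoids calls such as `inURL.split("http://")[1]`, which throw when the input does not contain the expected prefix.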

+0

Thanks. :) Any suggestions about this line: href = protocol + inURL.split("://")[1] + href; I think this is the part that doubles the links. Please help me – Jenna
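One possible explanation for the doubling, assuming the page was fetched from www.utem.edu.my/portal and its links are root-relative (such as /portal/index.php): that line glues the whole page address, path included, in front of an href that already starts with /portal. A minimal sketch of an alternative is to let `java.net.URI.resolve` combine the two. The class name `LinkResolver`, the helper `resolve`, and the sample hrefs are assumptions for the illustration.

```java
import java.net.URI;

class LinkResolver {
    // Hypothetical helper: resolves an href found on a page against that page's URL.
    // Absolute hrefs are returned as they are; relative ones are merged with the base.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.utem.edu.my/portal";

        // Plain concatenation, as in the question, repeats the /portal segment.
        System.out.println(page + "/portal/index.php");

        // resolve() replaces the base path when the href starts with "/".
        System.out.println(resolve(page, "/portal/index.php"));

        // Absolute links pass through unchanged.
        System.out.println(resolve(page, "https://example.com/about"));
    }
}
```

`URI.create` throws `IllegalArgumentException` for hrefs containing characters that are illegal in a URI (for example unescaped spaces), so real pages may need some cleanup or a try/catch around the call.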