2014-10-04 176 views

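I want to filter all the website links out of a Google search. When I search for something, I want to get all the website links that Google shows us. How do I extract the search results (links) from a Google search in Java?

First I want to read the complete HTML content. After that I want to filter out all the important URLs. For example: if I enter the phrase "buy shoes" into Google, I want to get links such as "www.amazon.in/Shoes".

When I run my program, I only get a few URLs, and only Google-based ones such as "google.de/intl/de/options/".

PS: I checked the page source for the same query ("buy+shoes") and noticed that Chrome delivers more content than Firefox. My feeling is that I only get a few website results because Java connects the way the Firefox browser does, doesn't it? How can I get all the links that Google shows?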

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.net.URL; 
import java.net.URLConnection; 
public class findEveryUrl { 
public static void main(String[] args) throws IOException 
{ 

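    // "#q=" is a URL fragment and is never sent to the server, so put the query in the query string 
    String gInput = "https://www.google.de/search?q="; 
    // setKeyWord asks you to enter the keyword into the console; encode it so spaces are safe in a URL 
    String fullUrl = gInput + java.net.URLEncoder.encode(setKeyWord(), "UTF-8"); 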
    //fullUrl is used for the InputStream and "www." is the string, which is used for splitting 
    findAllSubs(fullUrl, "www."); 
    //System.out.println("given url: " + fullUrl); 
} 



/** 
* Fetches the page at urlString, splits every line on splitphrase and prints the pieces. 
* @param urlString has to be the full Url. 
* @param splitphrase is the String which is used for splitting. 
*/ 
static void findAllSubs(String urlString, String splitphrase) 
{ 
    try 
    { 
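     URL  url  = new URL(urlString); 
     URLConnection yc = url.openConnection(); 
     // Google may reject Java's default User-Agent; this browser-like value is an assumption 
     yc.setRequestProperty("User-Agent", "Mozilla/5.0"); 
     // try-with-resources closes the reader even if reading fails 
     try (BufferedReader in = new BufferedReader(new InputStreamReader(
       yc.getInputStream(), "UTF-8"))) { 
      String inputLine; 
      // read one line per iteration; the old extra readLine() skipped lines and appended "null" 
      while ((inputLine = in.readLine()) != null){ 
       // split() takes a regex, so quote the phrase to keep the "." in "www." literal 
       String[] array = inputLine.split(java.util.regex.Pattern.quote(splitphrase)); 
       arrayToConsol(array); 
      } 
     } 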
    }catch (IOException e) { 
     e.printStackTrace(); 
    } 

} 



/** 
* setKeyWord() asks you for the search keyword for the google query 
* @return returns the keyword, which you wrote into the console 
*/ 
public static String setKeyWord(){ 
    BufferedReader console = new BufferedReader(new InputStreamReader(System.in)); 
    System.out.print("Enter a KeyWord: "); 
    String keyWord = null; 
    try { 
     keyWord = console.readLine(); 
    } catch (IOException e) { 
     // should not happen when reading from the console 
     e.printStackTrace(); 
    } 

    return keyWord; 
} 

public static void arrayToConsol(String[] array){ 
    for (String item : array) { 
     System.out.println(item); 
    } 
} 

public static void searchQueryToConsole(String url) throws IOException{ 
    URL googleSearch = new URL(url); 
    URLConnection yc = googleSearch.openConnection(); 
    // try-with-resources closes the reader even if reading throws 
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
      yc.getInputStream()))) { 
     String inputLine; 
     while ((inputLine = in.readLine()) != null) 
      System.out.println(inputLine); 
    } 
} 
} 

Answer

Here is a simple and easy solution.

http://www.programcreek.com/2012/05/call-google-search-api-in-java-program/
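One detail worth checking before calling any search endpoint from Java: in a URL like the question's "https://www.google.de/#q=...", everything after "#" is a fragment, which HTTP clients do not send to the server, so the request only asks for "/". The sketch below shows this with java.net.URI and builds a query-string URL instead. The class name SearchUrlSketch and the helper buildSearchUrl are made up for illustration, and the "/search?q=" form is assumed from what the browser's address bar shows.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

public class SearchUrlSketch {
    // Hypothetical helper: places the URL-encoded keyword in the query string.
    static String buildSearchUrl(String keyword) {
        try {
            return "https://www.google.de/search?q=" + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI fromQuestion = URI.create("https://www.google.de/#q=buy+shoes");
        System.out.println(fromQuestion.getPath());     // the only part requested: /
        System.out.println(fromQuestion.getQuery());    // null, no query string at all
        System.out.println(fromQuestion.getFragment()); // q=buy+shoes, stays on the client
        System.out.println(buildSearchUrl("buy shoes"));
    }
}
```

With the keyword in the query string, the server at least receives the search terms. Whether it answers an automated client with the full result page is a separate matter.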

But if you want to parse other pages, JSoup is a great library that uses CSS selectors to find elements.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); 
Elements newsHeadlines = doc.select("#mp-itn b a"); 
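If adding JSoup is not an option, the link filtering can be roughed out with the standard library alone. This is only a sketch and swaps the CSS selector for a regular expression, which is far less robust than a real HTML parser. It assumes that result links appear either as absolute URLs or wrapped as "/url?q=<target>&...", which may not match what Google currently serves. The names ResultLinkSketch, externalLinks and isGoogleOrInvalid are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResultLinkSketch {
    // Matches double-quoted href attributes only; a rough stand-in for an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Collects absolute links whose host does not look like a Google host.
    static List<String> externalLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).replace("&amp;", "&");
            // Assumption: results may be wrapped as /url?q=<target>&<other parameters>
            if (href.startsWith("/url?q=")) {
                String target = href.substring("/url?q=".length());
                int amp = target.indexOf('&');
                if (amp >= 0) {
                    target = target.substring(0, amp);
                }
                href = decode(target);
            }
            if (href.startsWith("http") && !isGoogleOrInvalid(href)) {
                links.add(href);
            }
        }
        return links;
    }

    // Crude host check: treats any host containing "google." as Google's own.
    static boolean isGoogleOrInvalid(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null || host.contains("google.");
        } catch (IllegalArgumentException e) {
            return true; // not parsable as a URI, so skip it
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"/url?q=http://www.amazon.in/Shoes&amp;sa=U\">Shoes</a>"
                + "<a href=\"https://www.google.de/intl/de/options/\">More</a>";
        System.out.println(externalLinks(html)); // prints [http://www.amazon.in/Shoes]
    }
}
```

The sample HTML in main is invented to mirror the two kinds of links mentioned in the question. On a real page the same filtering is easier to express with a JSoup selector.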

Thanks, Daredesm, for your quick reply =) – 2014-10-05 20:10:08