2017-12-02 471 views

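Jsoup: get a div element by class

I'm new to parsing with Jsoup, and I want to get the list of all companies on this page. One approach is to inspect the page and find the div tags relevant to what I need. However, when I call: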

Document doc = Jsoup.connect("https://angel.co/companies?company_types[]=Startup").get(); 
System.out.println(doc.html()); 

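First, I can't even find those div tags (the ones that should hold the list of companies) in the HTML printed to my console. Second, even if I did find them, how would I select a div element by its class name: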

div class=" dc59 frw44 _a _jm" 

Pardon the jargon; I don't know how to go about this.

Answers

1

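The data is not embedded in the page; it is retrieved through subsequent API calls: a POST to https://angel.co/company_filters/search_data, which returns a list of ids and a token (hexdigest), followed by a GET to https://angel.co/companies/startups, which returns the company data for those ids.

This is repeated for every page, so each page needs a new token and a new list of ids. You can watch the process in the Network tab of the Chrome developer console.

The first request (POST) returns plain JSON, whereas the second request (GET) returns the HTML data inside a property of a JSON object.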
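On the second part of the question: once you hold the HTML as a string (here, the value taken from the JSON property), selecting a div by its class names is a CSS-selector call in Jsoup. A minimal sketch, assuming made-up stand-in markup rather than the real angel.co response; `ClassSelectorSketch` and `textsOfDivsWithClasses` are hypothetical names used only for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class ClassSelectorSketch {

    // Returns the text of every <div> that carries all of the given classes.
    static List<String> textsOfDivsWithClasses(String html, String... classes) {
        Document doc = Jsoup.parse(html);
        // Build a selector such as "div.dc59.frw44": one ".name" per class.
        StringBuilder selector = new StringBuilder("div");
        for (String c : classes) {
            selector.append('.').append(c);
        }
        Elements divs = doc.select(selector.toString());
        List<String> texts = new ArrayList<>();
        for (Element div : divs) {
            texts.add(div.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Stand-in HTML for illustration; not the real markup of the site.
        String html = "<div class=\" dc59 frw44 _a _jm\">Acme</div>"
                + "<div class=\"dc59\">Other</div>";
        System.out.println(textsOfDivsWithClasses(html, "dc59", "frw44"));
    }
}
```

Class names such as dc59 may well be generated and could change between deployments, so an attribute selector might turn out to be more stable than a class selector.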

The following extracts the company filter:

// Needs: java.io.IOException, org.jsoup.Jsoup, com.google.gson.Gson
private static CompanyFilter getCompanyFilter(final String filter, final int page) throws IOException {

    // POST the filter form; the JSON reply carries the ids and the hexdigest for this page.
    String response = Jsoup.connect("https://angel.co/company_filters/search_data")
            .header("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
            .header("X-Requested-With", "XMLHttpRequest")
            // Field name only: Jsoup inserts the '=' between key and value itself.
            .data("filter_data[company_types][]", filter)
            .data("sort", "signal")
            .data("page", String.valueOf(page))
            .userAgent("Mozilla")
            .ignoreContentType(true) // the reply is JSON, not HTML
            .post().body().text();

    return new Gson().fromJson(response, CompanyFilter.class);
}

Then the following extracts the companies:

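// Needs: java.io.*, java.net.URL, java.net.URLConnection, java.nio.charset.StandardCharsets,
// java.util.*, org.jsoup.Jsoup, org.jsoup.nodes.Element, org.jsoup.select.Elements, com.google.gson.Gson
private static List<Company> getCompanies(final CompanyFilter companyFilter) throws IOException {

    List<Company> companies = new ArrayList<>();

    URLConnection urlConn = new URL("https://angel.co/companies/startups?" + companyFilter.buildRequest()).openConnection();
    urlConn.setRequestProperty("User-Agent", "Mozilla");
    urlConn.connect();

    // The reply is a JSON object whose "html" property holds the markup; Gson unescapes it.
    HtmlContainer htmlObj;
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(urlConn.getInputStream(), StandardCharsets.UTF_8))) {
        htmlObj = new Gson().fromJson(reader, HtmlContainer.class);
    }

    Element doc = Jsoup.parse(htmlObj.getHtml());
    Elements data = doc.select("div[data-_tn]");

    if (data.isEmpty()) {
        System.out.println("no data");
        return companies;
    }

    // Start at index 2: the first two matches are skipped.
    for (int i = 2; i < data.size(); i++) {
        Element link = data.get(i).select("a").first();
        Element pitch = data.get(i).select("div.pitch").first();
        if (link == null || pitch == null) {
            continue; // skip entries that lack the expected link or pitch
        }
        companies.add(new Company(link.attr("title"), link.attr("href"), pitch.text()));
    }
    return companies;
}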

The main function:

public static void main(String[] args) throws IOException {

    List<Company> companies = new ArrayList<>();

    // Fetch up to 10 pages; each page needs its own filter (ids + digest).
    for (int page = 1; page <= 10; page++) {

        System.out.println("get page n°" + page);
        CompanyFilter companyFilter = getCompanyFilter("Startup", page);
        System.out.println("digest  : " + companyFilter.getDigest());
        System.out.println("count  : " + companyFilter.getTotalCount());
        System.out.println("array size : " + companyFilter.getIds().size());
        System.out.println("page  : " + companyFilter.getpage());

        List<Company> pageCompanies = getCompanies(companyFilter);
        if (pageCompanies.isEmpty()) {
            break; // stop at the first page that yields no companies
        }
        companies.addAll(pageCompanies);
        System.out.println("size  : " + companies.size());
    }
}
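The stop condition is the part of such a loop that is easiest to get wrong, so here is a small sketch of the same page-by-page pattern with the network calls replaced by a function argument. `PagingSketch`, `collectPages` and the fake page source are assumptions made for illustration and are not part of the code above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PagingSketch {

    // Collects items page by page (1-based) until a page comes back empty
    // or maxPages pages have been fetched.
    static <T> List<T> collectPages(IntFunction<List<T>> fetchPage, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<T> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break; // this page had no data, so stop paging
            }
            all.addAll(items);
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake source: two pages with data, then an empty one.
        IntFunction<List<String>> fake = page ->
                page <= 2 ? List.of("company-" + page + "a", "company-" + page + "b")
                          : List.<String>of();
        System.out.println(PagingSketch.<String>collectPages(fake, 10));
    }
}
```

The check is made on the items of the current page rather than on the accumulated list, so the loop ends as soon as one page is empty instead of running to the page limit.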

Company, CompanyFilter & HtmlContainer are the model classes:

// Needs: java.util.List, com.google.gson.annotations.SerializedName
class CompanyFilter {

    @SerializedName("ids")
    private List<Integer> mIds;

    @SerializedName("hexdigest")
    private String mDigest;

    @SerializedName("total")
    private String mTotalCount;

    @SerializedName("page")
    private int mPage;

    @SerializedName("sort")
    private String mSort;

    @SerializedName("new")
    private boolean mNew;

    public List<Integer> getIds() {
        return mIds;
    }

    public String getDigest() {
        return mDigest;
    }

    public String getTotalCount() {
        return mTotalCount;
    }

    public int getpage() {
        return mPage;
    }

    // Builds the query string for the GET request from the fields of this filter.
    // Not private, so that getCompanies can call it from outside this class.
    String buildRequest() {
        StringBuilder out = new StringBuilder();
        out.append("total=").append(mTotalCount).append('&');
        out.append("sort=").append(mSort).append('&');
        out.append("page=").append(mPage).append('&');
        out.append("new=").append(mNew).append('&');
        for (Integer id : mIds) {
            out.append("ids[]=").append(id).append('&');
        }
        out.append("hexdigest=").append(mDigest).append('&');
        return out.toString();
    }
}

private static class Company {

    private final String mLink;
    private final String mName;
    private final String mDescription;

    public Company(String name, String link, String description) {
        mLink = link;
        mName = name;
        mDescription = description;
    }

    public String getLink() {
        return mLink;
    }

    public String getName() {
        return mName;
    }

    public String getDescription() {
        return mDescription;
    }
}

private static class HtmlContainer {

    @SerializedName("html")
    private String mHtml;

    public String getHtml() {
        return mHtml;
    }
}
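buildRequest concatenates the values as they are. If a value ever contained a character such as & or a space, the query string would break; one possible variant, sketched here with the standard library only, URL-encodes each part. `QuerySketch`, `enc` and `buildQuery` are hypothetical names, the sample values are made up, and whether the server accepts the encoded form of ids[] is an assumption that has not been verified:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.StringJoiner;

public class QuerySketch {

    // URL-encodes a single key or value as UTF-8 (this overload needs Java 10+).
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Same parameters as buildRequest, joined with '&' and without a trailing '&'.
    static String buildQuery(String total, String sort, int page, boolean isNew,
                             List<Integer> ids, String hexdigest) {
        StringJoiner query = new StringJoiner("&");
        query.add("total=" + enc(total));
        query.add("sort=" + enc(sort));
        query.add("page=" + page);
        query.add("new=" + isNew);
        for (Integer id : ids) {
            query.add(enc("ids[]") + "=" + id); // "ids[]" becomes "ids%5B%5D"
        }
        query.add("hexdigest=" + enc(hexdigest));
        return query.toString();
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the shape of the result.
        System.out.println(buildQuery("2", "signal", 1, true, List.of(10, 20), "abc"));
    }
}
```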

The full code is also available here

+0

This is exactly what I was looking for. You're a lifesaver. Thank you <3 – vidhi

+1

One more question. The line: Company(data.get(i).select("a").first().attr("title").replace("\\\"", ""), doesn't work if the name has two words in it. For example, for 'Abd Motors' you only get 'Abd'. @Bertrand – vidhi

+0

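@vidhi OK, I've updated the answer. I actually noticed that GET https://angel.co/companies/startups returns JSON data. I had been trying to parse it with Jsoup, but that is a bit tricky: when 'ignoreContentType' is specified, Jsoup does not correctly handle the escaped quotes in the JSON. So I now parse the object with Gson and extract the HTML from it, which avoids all those ugly 'replace("\\\"", "")' calls –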