2017-11-25 240 views
3

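Waiting for AJAX in Xamarin with HtmlAgilityPack

I have a question that seems to have been asked before, but mine is slightly different. I am trying to scrape data from this website, but the content appears to be loaded with AJAX, because my application cannot find the ids and classes I am looking for in the HTML.

You can reproduce this by comparing Inspect Element with View Source: View Source shows far less markup than Inspect Element does.

I thought I could track down the file containing this HTML by pressing F12, opening the Network tab, and selecting XHR, but I could not find it.

My question is: how can I retrieve this data, or find out which file is used to load it?

My code sample (it cannot find Timetable_toolbar_elementSelect_popup0):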

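// Requires: using System; using System.Net; using System.Threading.Tasks; using HtmlAgilityPack;
private async Task GetHtmlDocument(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    //request.Credentials = new LoginCredentials().Credentials;

    try
    {
        using (WebResponse myResponse = await request.GetResponseAsync())
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.OptionFixNestedTags = true;
            htmlDoc.Load(myResponse.GetResponseStream());
            // This comes back null: the element is not in the HTML the server sends.
            var test = htmlDoc.GetElementbyId("Timetable_toolbar_elementSelect_popup0");
        }
    }
    catch (Exception e)
    {
        // Log the error instead of silently swallowing it.
        Console.WriteLine(e.Message);
    }
}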

What exactly are you trying to scrape? I visited the site and did not see any Timetable_toolbar_elementSelect_popup0. – derloopkat


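@derloopkat Sorry. If you click "Lesrooster" and then "Klassen" in the menu, you will land on the right page. Apparently you also have to click the dropdown under "Klas" first before the container with that id appears. – user3478148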


I have not had a chance to look into the comments yet, Kent... I will do so when I pick my project back up. – user3478148

Answers

0

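Solution that calls the AJAX method using WebRequest.

So I got bored and figured it out. What is missing below is how to identify a Klas (class) by its id. The example below fetches the Klas '1GLD'. We need the cookies so that the request knows which school we are fetching the Klas from. Also note that the code below returns only JSON, not HTML, because what we are calling is the AJAX method.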

CookieContainer cookies = new CookieContainer(); 
try 
{ 
    string webAddr = "https://roosters.windesheim.nl/"; 
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr); 
    httpWebRequest.ContentType = "application/json; charset=utf-8"; 
    httpWebRequest.Method = "POST"; 
    httpWebRequest.CookieContainer = cookies;   
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate; 
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest"); 

    // The response body is not needed here. This request only exists so that the server
    // sets its session cookies, which the shared CookieContainer stores automatically.
    using (httpWebRequest.GetResponse())
    {
    }
} 
catch (WebException ex) 
{ 
    Console.WriteLine(ex.Message); 
} 

// According to my web debugger the cookie lasts until the 10th of December, so a new cookie is only needed after that.
// I noticed that the site appends a Unix timestamp to the URL, so we simply add one to each request.
long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100;

//we are now ready to call the ajax method and get the JSON. 
try 
{ 
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString(); 
    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr); 
    httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8"; 
    httpWebRequest.Method = "POST"; 
    httpWebRequest.CookieContainer = cookies; 
    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate; 
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest"); 

    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream()))
    {
        // Despite being sent to an AJAX endpoint, the request body is form-encoded, not JSON.
        string formData = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2";

        // The command below returns a JSON data structure containing all the classes and their ids.
        //string otherFormData = "ajaxCommand=getPageConfig&type=1&filter=-2";

        streamWriter.Write(formData);
        streamWriter.Flush();
    }


    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse(); 
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream())) 
    { 
     var responseText = streamReader.ReadToEnd(); 
     //THE RESULTS GETS PRINTED HERE. 
     Console.Write(responseText); 
    } 
} 
catch (WebException ex) 
{ 
    Console.WriteLine(ex.Message); 
} 
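
The request body above is a hand-written form-encoded string. As a minimal sketch, the same body could be assembled from named fields so that the class id and date are easy to vary. The class `TimetableRequest` and both of its methods are hypothetical names introduced here for illustration; the field names and fixed values are copied from the answer's request, and their meaning on the server side is assumed rather than documented.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class TimetableRequest
{
    // Hypothetical helper (not part of the original answer): joins key/value
    // pairs into an application/x-www-form-urlencoded body.
    public static string BuildBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    // Builds the body for the getWeeklyTimetable command.
    // The fixed values (elementType, formatId, departmentId, filterId) are
    // simply copied from the request shown in the answer.
    public static string WeeklyTimetableBody(int elementId, DateTime date)
    {
        return BuildBody(new[]
        {
            new KeyValuePair<string, string>("ajaxCommand", "getWeeklyTimetable"),
            new KeyValuePair<string, string>("elementType", "1"),
            new KeyValuePair<string, string>("elementId", elementId.ToString(CultureInfo.InvariantCulture)),
            new KeyValuePair<string, string>("date", date.ToString("yyyyMMdd", CultureInfo.InvariantCulture)),
            new KeyValuePair<string, string>("formatId", "7"),
            new KeyValuePair<string, string>("departmentId", "0"),
            new KeyValuePair<string, string>("filterId", "-2"),
        });
    }
}
```

Calling TimetableRequest.WeeklyTimetableBody(13090, new DateTime(2017, 11, 26)) produces the same string that is written to the request stream above, so it can be passed to streamWriter.Write unchanged.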

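Alternative solution with the Selenium Firefox driver.

This is easier to do, but it also takes more time to run. It gives you HTML to work with instead, just as you asked. Not all of the sleeps are necessary, but I found that the ones in the final foreach loop are required.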

public static void Main(string[] args) 
{ 
    HtmlDocument doc = new HtmlDocument(); 
    // According to my web debugger the cookie lasts until the 10th of December, so a new cookie is only needed after that.
    // I noticed that the site appends a Unix timestamp to the URL, so we simply add one to each request.
    long unixTimeStamp = new DateTimeOffset(DateTime.Now).ToUnixTimeMilliseconds() - 100; 
    string webAddr = "https://roosters.windesheim.nl/WebUntis/Timetable.do?request.preventCache="+unixTimeStamp.ToString(); 
    var ffOptions = new FirefoxOptions(); 
    ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe"; 
    ffOptions.LogLevel = FirefoxDriverLogLevel.Default; 
    ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true }; 
    var service = FirefoxDriverService.CreateDefaultService(); 

    var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120)); 


    driver.Navigate().GoToUrl(webAddr); 


    driver.FindElement(By.XPath("//input[@id='school']")).SendKeys("Windesheim"+Keys.Enter); 
    Thread.Sleep(2000); 
    driver.FindElement(By.XPath("//span[@id='dijit_PopupMenuBarItem_0_text' and text() ='Lesrooster']")).Click(); 

    driver.FindElement(By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']")).Click(); 
    Thread.Sleep(2000); 

    driver.FindElement(By.XPath("//div[@id='widget_Timetable_toolbar_elementSelect']//input[@class='dijitReset dijitInputField dijitArrowButtonInner']")).Click(); 

    // Collect all the options for Klas from the rendered page.
    doc.LoadHtml(driver.PageSource);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@id='Timetable_toolbar_elementSelect_popup']/div[@item]");
    List<string> options = new List<string>();
    // SelectNodes returns null when nothing matches, so guard before iterating.
    if (nodes != null)
    {
        foreach (HtmlNode n in nodes)
        {
            options.Add(n.InnerText);
        }
    }

    foreach(string s in options) 
    { 
     driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).Clear(); 
     driver.FindElement(By.XPath("//input[@id='Timetable_toolbar_elementSelect']")).SendKeys(s); 
     Thread.Sleep(2000); 
     driver.FindElement(By.XPath("//body")).SendKeys(Keys.Enter); 
     Thread.Sleep(2000); 
     doc.LoadHtml(driver.PageSource); 
     //Console.WriteLine(driver.Url); //Now we can see the id of the current Klase 
    } 

    Console.WriteLine(doc.DocumentNode.InnerHtml); 

    Console.ReadKey(); 
} 
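
Since not all of the sleeps are strictly needed, some of the fixed Thread.Sleep calls could be replaced with an explicit wait that returns as soon as the element shows up. This is only a sketch: `WaitHelpers` and `WaitForElement` are hypothetical names, and it assumes the Selenium.Support package (which provides WebDriverWait) is referenced. Waiting for presence does not guarantee that the element is already clickable, so the sleeps in the final loop may still be needed.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI; // WebDriverWait lives in the Selenium.Support package

public static class WaitHelpers
{
    // Hypothetical helper: polls the page until at least one element matches
    // the locator, then returns the first match. WebDriverWait throws a
    // WebDriverTimeoutException if nothing matches within the timeout.
    public static IWebElement WaitForElement(IWebDriver driver, By by, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);
        return wait.Until(d =>
        {
            var matches = d.FindElements(by);
            // Returning null tells WebDriverWait to keep polling.
            return matches.Count > 0 ? matches[0] : null;
        });
    }
}
```

For example, the "Klassen" click could become WaitHelpers.WaitForElement(driver, By.XPath("//td[@id='dijit_MenuItem_0_text' and text() ='Klassen']"), TimeSpan.FromSeconds(10)).Click(); without a preceding sleep.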

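Final update

Using the Selenium solution, I was able to get the ids of all the classes. I have included the file here so that you can use it together with your AJAX web requests.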

1

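I was going to leave this as a comment, but it is too long and comments format too poorly. So, here we go.

First, the website updates itself dynamically using JavaScript, which is invoked through an ajaxCommand.

If you can open a session and store the cookies containing the SESSIONID and the now "encrypted" school name, then you can call the AJAX command like this: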

https://roosters.windesheim.nl/ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13090&date=20171126&formatId=7&departmentId=0&filterId=-2 

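However, this does require you to know what elementType and elementId are.

In this case, elementId is the value for the Klas 1GLD, and formatId (7) is the value for the Roosterformaat "Beknopt". You will have to figure out what the remaining variables do. More importantly, if you do manage to send a valid AJAX command to the server, you will not get HTML back in response; you will receive the data as JSON.

The easiest way to do what you want is to collect all the classes in a separate file and use it as a reference point. The same goes for the other options.

Then use a headless browser such as phantomjs.org or Selenium. That way you can find and click the classes you want to scrape, load the HTML into an HtmlAgilityPack.HtmlDocument, and do whatever you need to do. Selenium/PhantomJS will keep track of your cookies. This approach is slower, but much easier.

Edit: storing cookies from a WebRequest, the simple way.

I am not too keen on this approach, but the OP asked. If anyone has a better way, please edit.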

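// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument(); // declared here so it is still in scope after the try block
CookieContainer cookies = new CookieContainer();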
try 
{ 
    string webAddr = "https://roosters.windesheim.nl/WebUntis/"; 

    var httpWebRequest = (HttpWebRequest)WebRequest.Create(webAddr); 
    httpWebRequest.ContentType = "application/x-www-form-urlencoded; charset=utf-8"; // the body written below is form-encoded, not JSON
    httpWebRequest.Method = "POST"; 
    httpWebRequest.CookieContainer = cookies; 

    httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate; 
    httpWebRequest.Headers.Add("X-Requested-With", "XMLHttpRequest"); 
    using (var streamWriter = new StreamWriter(httpWebRequest.GetRequestStream())) 
    { 
     string json = "ajaxCommand=getWeeklyTimetable&elementType=1&elementId=13092&date=20171126&formatId=7&departmentId=0&filterId=-2"; 

     streamWriter.Write(json); 
     streamWriter.Flush(); 
    } 


    var httpResponse = (HttpWebResponse)httpWebRequest.GetResponse(); 
    using (var streamReader = new StreamReader(httpResponse.GetResponseStream())) 
    { 
     cookies.Add(httpWebRequest.CookieContainer.GetCookies(httpWebRequest.RequestUri)); 
     //cookies.Add(httpResponse.Cookies); 
     var responseText = streamReader.ReadToEnd(); 
     doc.LoadHtml(responseText); 
     foreach(Cookie c in httpResponse.Cookies) 
     { 
      Console.WriteLine(c.ToString()); 
     } 
    } 
} 
catch (WebException ex) 
{ 
    Console.WriteLine(ex.Message); 
} 
Console.WriteLine(doc.DocumentNode.InnerHtml);

Console.ReadKey();

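Regarding the last paragraph of your comment: if you use Selenium, there is no point in loading the document with HtmlAgilityPack. Selenium supports XPath, CSS, and id selectors. HtmlAgilityPack is just a library for parsing HTML; it also supports XPath, but it has no browser running in the background. – derloopkat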


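Thanks. This seems a lot more complicated than I had hoped. One question: you wrote "If you can open a session and store the cookies containing the SESSIONID and the now "encrypted" school name", but I do not know how to do that. Could you point me in the right direction? I will look into Selenium/PhantomJS. – user3478148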