C#: Downloading a website into a string using C# WebClient or HttpWebRequest

I am trying to download the contents of a website. However, for one particular web page, the returned string contains jumbled data full of unreadable characters.

Here is the code I was originally using.

HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url); 
req.Method = "GET"; 
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))"; 
string source; 
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream())) 
{ 
    source = reader.ReadToEnd(); 
} 
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
doc.LoadHtml(source);

I also tried an alternative implementation with WebClient, but got the same result:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
using (WebClient client = new WebClient()) 
using (var read = client.OpenRead(url)) 
{ 
    doc.Load(read, true); 
}

From searching, I gather this might be an encoding issue, so I tried both of the solutions I found, but I still cannot get this to work.

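The offending page that I cannot seem to download is the United States article on the English version of Wikipedia (en.wikipedia.org/wiki/United_States). I have tried a number of other Wikipedia articles and have not seen this issue with them.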

2011-09-22 EnISeeK

The response is gzip-encoded. Try the following to decode the stream:

UPDATE

Based on the comment by BrokenGlass, setting the following properties should solve your problem (it worked for me):

req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate"; 
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
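
For the WebClient variant from the question, the same setting can probably be applied by overriding GetWebRequest, which I believe is the approach taken in the answer BrokenGlass links in the comments. A minimal sketch, not tested against Wikipedia; the class name GzipWebClient is made up for illustration:

```csharp
using System;
using System.Net;

// Hypothetical helper (not from the thread): a WebClient whose HTTP requests
// advertise gzip/deflate and are decompressed transparently by the framework.
public class GzipWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests expose AutomaticDecompression; leave others untouched.
        var http = request as HttpWebRequest;
        if (http != null)
        {
            http.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }
}
```

It would then be used like a plain WebClient, e.g. `using (var client = new GzipWebClient()) { string html = client.DownloadString(url); }`.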

Old/manual solution:

string source; 
var response = req.GetResponse(); 

var stream = response.GetResponseStream(); 
try 
{ 
    if (response.Headers.AllKeys.Contains("Content-Encoding") 
     && response.Headers["Content-Encoding"].Contains("gzip")) 
    { 
     stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress); 
    } 

    using (StreamReader reader = new StreamReader(stream)) 
    { 
     source = reader.ReadToEnd(); 
    } 
} 
finally 
{ 
    if (stream != null) 
     stream.Dispose(); 
}
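
To see why the original code produced garbage, the effect can be reproduced without any network access. The sketch below compresses a small string the way a server sending Content-Encoding: gzip would, then reads it back once without and once with a GZipStream. The names GzipDemo, Compress and ReadBody are illustrative only, and the body is assumed to be UTF-8:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    // Simulates a server that sends its body with "Content-Encoding: gzip".
    public static byte[] Compress(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    // Reads a body as UTF-8 text, decompressing first if the header says gzip.
    public static string ReadBody(Stream body, string contentEncoding)
    {
        if (contentEncoding != null &&
            contentEncoding.IndexOf("gzip", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            body = new GZipStream(body, CompressionMode.Decompress);
        }
        using (var reader = new StreamReader(body, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string page = "<html>hello</html>";
        byte[] wire = Compress(page);

        // Reading the compressed bytes directly is what the question's code did.
        string garbled = ReadBody(new MemoryStream(wire), null);
        Console.WriteLine(garbled == page);   // prints: False

        string decoded = ReadBody(new MemoryStream(wire), "gzip");
        Console.WriteLine(decoded);           // prints: <html>hello</html>
    }
}
```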

2011-09-22 16:38:49 Peter

You shouldn't have to do this manually; it's already built in, see this answer: http://stackoverflow.com/questions/2973208/automatically-decompress-gzip-response-via-webclient-downloaddata – BrokenGlass

@BrokenGlass Thanks for the tip. I was already wondering why I had never run into problems with gzip encoding before. – Peter

Thanks, this worked for me! – EnISeeK

Using the loader built into HtmlAgilityPack worked for me:

HtmlWeb web = new HtmlWeb(); 
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States"); 
string html = doc.DocumentNode.OuterHtml; // I don't see no jumbled data here

Edit:

Using a standard WebClient with your user agent will result in an HTTP 403 (Forbidden). Using this instead worked for me:

using (WebClient wc = new WebClient()) 
{ 
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4"); 
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States"); 
    HtmlDocument doc = new HtmlDocument(); 
    doc.LoadHtml(html); 
}

Also see this SO thread: WebClient forbids opening wikipedia page?

2011-09-22 16:24:44 BrokenGlass

I tried the first approach you suggested and got the following error: 'gzip' is not a supported encoding name. Parameter name: name at System.Globalization.EncodingTable.internalGetCodePageFromName(String name) at System.Globalization.EncodingTable.GetCodePageFromName(String name) – EnISeeK

@Nick: Works fine for me - make sure you have the latest version of HtmlAgilityPack - I got mine from NuGet – BrokenGlass

After getting HtmlAgilityPack from NuGet, I still get the same error. The version NuGet installed is 1.4.0.0. – EnISeeK

This is how I usually grab a page into a string (it's VB, but should be easy to translate):

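' Declarations added so the snippet compiles on its own
Dim req As Net.WebRequest = Net.WebRequest.Create("http://www.cnn.com")
Dim resp As Net.HttpWebResponse = CType(req.GetResponse(), Net.HttpWebResponse)
Dim sr As New IO.StreamReader(resp.GetResponseStream())
Dim lcResults As String = sr.ReadToEnd()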

and I have not run into your problem.

2011-09-22 16:25:14
