Extract all image URLs from HTML except those that are commented out

I'm using this regular expression to get the URLs of all images in an HTML file:

(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

Is there a way to modify this regular expression so that it excludes any img tags that are commented out with HTML comments ("<!-- -->")?

Asked 2012-02-24 by Andrey

Why not use a proper HTML parser? – 2012-02-24 18:01:19

[The pony he comes...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – 2012-02-24 18:02:34

@Pekka: Because I can't guarantee that the HTML is 100% "correct". The application receives it from non-IT people, so [crappy] malformed HTML is quite likely. – Andrey 2012-02-24 18:06:21

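If your regular expression already works for extracting images (which is a miracle in itself), consider using a second regular expression to strip HTML comments, like this: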

<!--.*?-->

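Replace every match with an empty string, and any images inside comments will no longer be seen by your other regular expression. Alternatively, if you are using PHP (you didn't tag a programming language), you can use the strip_tags function with "<img>" as the "allowed tags" parameter. This removes HTML comments as well as other tags that might interfere with your regular expression.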
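The two-step approach above can be sketched in C# roughly as follows. This is only an illustration under some assumptions: the class name ImageUrlExtractor, the method name Extract, and the sample HTML are made up for the example, and .NET is assumed because the question's regex uses .NET-style syntax (a named group and a variable-length lookbehind). RegexOptions.Singleline is added so that "." also matches line breaks; without it, a comment that spans several lines would not be stripped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for this example only.
public static class ImageUrlExtractor
{
    // Matches one HTML comment. Singleline lets '.' match '\n',
    // so comments that span several lines are matched as well.
    private static readonly Regex HtmlComment =
        new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // The URL regex from the question, unchanged.
    private static readonly Regex ImageUrl =
        new Regex(@"(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])");

    public static List<string> Extract(string html)
    {
        // Step 1: remove every comment, so commented-out <img> tags disappear.
        string withoutComments = HtmlComment.Replace(html, string.Empty);

        // Step 2: run the original regex on what is left.
        var urls = new List<string>();
        foreach (Match match in ImageUrl.Matches(withoutComments))
        {
            urls.Add(match.Groups["Url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up sample input: two live images and two commented-out ones.
        string html =
            "<p><img src=\"a.png\"></p>\n" +
            "<!-- <img src=\"hidden.png\">\n" +
            "     <img src='also-hidden.png'> -->\n" +
            "<img src='b.jpg'>";

        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

With the sample input, only a.png and b.jpg should be printed. Keeping the comment stripping as a separate pass leaves the original URL regex untouched, which is simpler than trying to encode "not inside a comment" in the lookbehind.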

Answered 2012-02-24 18:05:31

This might actually work, thanks! Let me try... – Andrey 2012-02-24 18:08:35

Yes, the regular expression already works for extracting the image URLs. – Andrey 2012-02-24 18:11:13

It is actually quite simple with the HTML Agility Pack as well, which has a number of settings that help repair bad HTML if needed. For example:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
doc.OptionAutoCloseOnEnd = true; 
doc.OptionCheckSyntax = false; 
doc.OptionFixNestedTags = true; 
// etc, just set them before calling Load or LoadHtml

http://htmlagilitypack.codeplex.com/

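// Requires: using System.Linq; using HtmlAgilityPack;
string textToExtractSrcFrom = "... your text here ...";

doc.LoadHtml(textToExtractSrcFrom);

// SelectNodes returns null when nothing matches, so fall back to an empty collection.
// Commented-out <img> tags are parsed as comment nodes, so this XPath does not match them.
var nodes = doc.DocumentNode.SelectNodes("//img[@src]") ?? new HtmlNodeCollection(doc.DocumentNode);
foreach (var node in nodes)
{
    string src = node.Attributes["src"].Value;
}

// or, with LINQ
var links = nodes.Select(node => node.Attributes["src"].Value);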

Answered 2012-02-24 22:10:10 by jessehouwing
