2015-04-01 46 views

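How do I query a text file for distinct instances of a pattern?

I am creating files (see this) that contain markup for any number of characters (people/voices), like this: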

<span class="sam" title="This is Sam speaking"> 
<span class="higbie" title="This is Calvin Higbie speaking"> 
<span class="ballou" title="This is Mr. Ballou speaking"> 

For some context, here is a snippet from one of the files:

<p><span class="others" title="This is 'an elderly pilgrim' speaking">"Jack, do you see that range of mountains over yonder that bounds the Jordan valley? The mountains of Moab, Jack! Think of it, my 
    boy--the actual mountains of Moab--renowned in Scripture history! 
    We are actually standing face to face with those illustrious crags 
    and peaks--and for all we know" [dropping his voice impressively], 
    "our eyes may be resting at this very moment upon the spot WHERE 
    LIES THE MYSTERIOUS GRAVE OF MOSES! Think of it, Jack!"</span></p> 

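When the document is finished, I want to generate a clean list of these markup patterns. IOW, I want to examine every piece of HTML that follows the pattern, but return only one instance for each distinct character/speaker. I don't want 400 of these: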

<span class="sam" title="This is Sam speaking"> 

...(just one).

In pseudo-SQL terminology, I want something like this:

SELECT DISTINCT SOMETHING FROM FILE WHERE SLIDING_WINDOW_OF_TEXT STARTSWITH("<span class=\"") AND SLIDING_WINDOW_OF_TEXT ENDSWITH(" speaking\">") 

I don't know whether a regular expression is the best way to attack this, or whether there is something like "LinqToText", or something else...

Answers


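It is not that difficult. You can use LINQ to get the Distinct() values. Add the references and using System.Linq; / using System.Xml.Linq; (the Cast, Select, and Distinct extension methods used here come from System.Linq). Here is a working example (tested in VS2012):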

var MyRegex = new Regex(@"(?i)<span class=([""']).+?\1 title=([""']).+?\2>", RegexOptions.CultureInvariant | RegexOptions.Compiled); 
var str = @"<p><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""></p>"; 
var distinct_values = MyRegex.Matches(str).Cast<Match>().Select(p => p.Value).Distinct().ToList(); 

This returns 3 matches (not 8).
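
The snippet above works on an in-memory string. As a minimal sketch of how the same idea could be pointed at a file and narrowed to titles ending in "speaking" (as in the question's pseudo-SQL), something like the following might work. The class name `SpeakerTags`, the method names, and the named groups `cls` and `title` are all hypothetical and not part of the original answer; the pattern is also slightly stricter than the one above (it refuses to span across `<` or `>`).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
static class SpeakerTags
{
    // q1/q2 capture the opening quote so the closing quote must match it.
    // cls is the class attribute value; title must end in "speaking".
    static readonly Regex TagRegex = new Regex(
        @"<span\s+class=(?<q1>[""'])(?<cls>[^""'<>]+)\k<q1>\s+title=(?<q2>[""'])(?<title>[^<>]+?speaking)\k<q2>\s*>",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns one "class: title" string per distinct speaker found in html.
    public static List<string> DistinctSpeakers(string html)
    {
        return TagRegex.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["cls"].Value + ": " + m.Groups["title"].Value)
            .Distinct()
            .ToList();
    }

    // Convenience wrapper: reads the whole file into memory first.
    public static List<string> DistinctSpeakersInFile(string path)
    {
        return DistinctSpeakers(File.ReadAllText(path));
    }
}
```

Calling `SpeakerTags.DistinctSpeakersInFile(path)` with the path to one of your HTML files should then give one entry per speaker. Because the comparison is done on the extracted class and title rather than on the raw tag text, two tags that differ only in quote style are treated as the same speaker.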

NO-LINQ SOLUTION

If you cannot use LINQ (for example, on Mono), you can use the following code, which relies on List<string> from System.Collections.Generic:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Matches <span class="..." title="...">; \1 and \2 force the closing quote to match the opening one.
        var MyRegex = new Regex(@"(?i)<span class=([""']).+?\1 title=([""']).+?\2>", RegexOptions.CultureInvariant | RegexOptions.Compiled);
        var str = @"<p><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""></p>";
        // LINQ equivalent: MyRegex.Matches(str).Cast<Match>().Select(p => p.Value).Distinct().ToList();
        var new_arr = new List<string>();
        var matches = MyRegex.Matches(str);
        // Keep each matched tag only the first time it is seen.
        for (int i = 0; i < matches.Count; i++)
        {
            if (!new_arr.Contains(matches[i].Value))
                new_arr.Add(matches[i].Value);
        }

        // Print one distinct tag per line.
        Console.WriteLine(string.Join("\n", new_arr));
    }
}

Output:

<span class="others" title="This is 'an elderly pilgrim' speaking">                         
<span class="higbie" title="This is Calvin Higbie speaking">                           
<span class="ballou" title="This is Mr. Ballou speaking"> 
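
One design note on the loop above: `List<string>.Contains` scans the whole list on every call, which is fine for a handful of speakers but grows quadratically with the number of distinct values. A minimal LINQ-free sketch that tracks seen values in a `HashSet<string>` instead is shown below, assuming `HashSet<T>` is available on your target runtime. The class name `DistinctMatches` and the method `Find` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
static class DistinctMatches
{
    // Returns each distinct match value once, in order of first appearance.
    public static List<string> Find(Regex regex, string input)
    {
        var seen = new HashSet<string>();
        var ordered = new List<string>();
        foreach (Match m in regex.Matches(input))
        {
            // HashSet<T>.Add returns false when the value is already present.
            if (seen.Add(m.Value))
                ordered.Add(m.Value);
        }
        return ordered;
    }
}
```

It could be called as `DistinctMatches.Find(MyRegex, str)` with the regex and string from the example above.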
+1

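"Not that difficult" when you have regex and LINQ down like a champ and know how to use them, but the average person's head would explode into tiny fragments and fluffy stuff looking at that code. – 2015-04-03 21:43:06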

+0

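I get: "'System.Text.RegularExpressions.MatchCollection' does not contain a definition for 'Cast' and no extension method 'Cast' accepting a first argument of type 'System.Text.RegularExpressions.MatchCollection' could be found". Right-clicking "Cast" does not offer a "Resolve" context menu item... – 2015-04-06 14:45:23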

+1

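Please add 'System.Linq', 'System.Xml.Linq', and 'System.Text.RegularExpressions' to the list of 'using' directives. Also, you may need to add a reference to the project (right-click the "References" node in the project, click **Add Reference**, go to the 'Framework' tab, and check whether 'System.Xml.Linq' is selected). – 2015-04-06 14:48:22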
