2014-09-10

Regular expression to match multi-line blocks of text?

I have a text file containing more than 200 records in the following format:

@INPROCEEDINGS{Rajan-Sullivan03, 
    author = {Hridesh Rajan and Kevin J. Sullivan}, 
    title = {{{Eos}: Instance-Level Aspects for Integrated System Design}}, 
    booktitle = {ESEC/FSE 2003}, 
    year = {2003}, 
    pages = {297--306}, 
    month = sep, 
    isbn = {1-58113-743-5}, 
    location = {Helsinki, FN}, 
    owner = {Administrator}, 
    timestamp = {2009.03.08} 
} 

@INPROCEEDINGS{ras-mor-models-06, 
    author = {Awais Rashid and Ana Moreira}, 
    title = {Domain Models Are {NOT} Aspect Free}, 
    booktitle = {MoDELS}, 
    year = {2006}, 
    editor = {Oscar Nierstrasz and Jon Whittle and David Harel and Gianna Reggio}, 
    volume = {4199}, 
    series = {Lecture Notes in Computer Science}, 
    pages = {155--169}, 
    publisher = {Springer}, 
    bibdate = {2006-12-07}, 
    bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/models/models2006.html#RashidM06}, 
    isbn = {3-540-45772-0}, 
    owner = {aljasser}, 
    timestamp = {2008.09.16}, 
    url = {http://dx.doi.org/10.1007/11880240_12} 
} 

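Basically, a record starts with @ and ends with }, so what I tried was to match from @ to }\n}. That did not work: it only matches the first record and not the other one, because the last record has no newline after it.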

  string pattern = @"(^@)([\s\S]*)(}$\n}(\n))"; 

When I tried to fix it by making the trailing newline optional, it matched everything as a single match:

string pattern = @"(^@)([\s\S]*)(}$\n}(\n*))"; 

I kept trying until I arrived at the following pattern, but it does not work. Could you fix it, or suggest a more efficient one, with a brief explanation of what it does?

Here is my code:

string pattern = @"(^@)([\s\S]*)(}$\n}(\n))"; 
Regex regex = new Regex(pattern, RegexOptions.Multiline); 
var matches = regex.Matches(bibFileContent).Cast<Match>().Select(m => m.Value).ToList(); 
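
Be specific about what "does not work" means. Give an example of the output you want. – tnw 2014-09-10 15:12:44

It only matches the first record – ykh 2014-09-10 15:15:01

Isn't this simpler? string pattern = @"@([^;]*)}"; This is a good place to play with regex: http://www.regexr.com/ – 2014-09-10 15:15:47

Answers

This looks like a candidate for balancing groups.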

# @"(?m)^[^\S\r\n]*@[^{}]+(?:\{(?>[^{}]+|\{(?<Depth>)|\}(?<-Depth>))*(?(Depth)(?!))\})" 

(?m) 
^ [^\S\r\n]* 
@ [^{}]+ 
(?: 
     \{       # Match opening { 
     (?>       # Then either match (possessively): 
      [^{}]+      # Anything (but only if we're not at the start of { or }) 
     |        # or 
      \{       # { (and increase the braces counter) 
      (?<Depth>) 
     |        # or 
      \}       # } (and decrease the braces counter). 
      (?<-Depth>) 
    )*       # Repeat as needed. 
     (?(Depth)      # Assert that the braces counter is at zero. 
      (?!)       # Fail if it isn't 
    ) 
     \}       # Then match a closing }. 
) 

Code sample

Regex FghRx = new Regex(@"(?m)^[^\S\r\n]*@[^{}]+(?:\{(?>[^{}]+|\{(?<Depth>)|\}(?<-Depth>))*(?(Depth)(?!))\})"); 
string FghData = 
@" 
@INPROCEEDINGS{Rajan-Sullivan03, 
author = {Hridesh Rajan and Kevin J. Sullivan}, 
    title = {{{Eos}: Instance-Level Aspects for Integrated System Design}}, 
    booktitle = {ESEC/FSE 2003}, 
    year = {2003}, 
    pages = {297--306}, 
    month = sep, 
    isbn = {1-58113-743-5}, 
    location = {Helsinki, FN}, 
    owner = {Administrator}, 
    timestamp = {2009.03.08} 
} 

@INPROCEEDINGS{ras-mor-models-06, 
    author = {Awais Rashid and Ana Moreira}, 
    title = {Domain Models Are {NOT} Aspect Free}, 
    booktitle = {MoDELS}, 
    year = {2006}, 
    editor = {Oscar Nierstrasz and Jon Whittle and David Harel and Gianna Reggio}, 
    volume = {4199}, 
    series = {Lecture Notes in Computer Science}, 
    pages = {155--169}, 
    publisher = {Springer}, 
    bibdate = {2006-12-07}, 
    bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/models/models2006.html#RashidM06}, 
    isbn = {3-540-45772-0}, 
    owner = {aljasser}, 
    timestamp = {2008.09.16}, 
    url = {http://dx.doi.org/10.1007/11880240_12} 
} 
"; 

Match FghMatch = FghRx.Match(FghData); 
while (FghMatch.Success) 
{ 
    Console.WriteLine("New Record\n------------------------"); 
    Console.WriteLine("{0}", FghMatch.Groups[0].Value); 
    FghMatch = FghMatch.NextMatch(); 
    Console.WriteLine(""); 
} 

Output

New Record 
------------------------ 
@INPROCEEDINGS{Rajan-Sullivan03, 
author = {Hridesh Rajan and Kevin J. Sullivan}, 
    title = {{{Eos}: Instance-Level Aspects for Integrated System Design}}, 
    booktitle = {ESEC/FSE 2003}, 
    year = {2003}, 
    pages = {297--306}, 
    month = sep, 
    isbn = {1-58113-743-5}, 
    location = {Helsinki, FN}, 
    owner = {Administrator}, 
    timestamp = {2009.03.08} 
} 

New Record 
------------------------ 
@INPROCEEDINGS{ras-mor-models-06, 
    author = {Awais Rashid and Ana Moreira}, 
    title = {Domain Models Are {NOT} Aspect Free}, 
    booktitle = {MoDELS}, 
    year = {2006}, 
    editor = {Oscar Nierstrasz and Jon Whittle and David Harel and Gianna Reggio}, 
    volume = {4199}, 
    series = {Lecture Notes in Computer Science}, 
    pages = {155--169}, 
    publisher = {Springer}, 
    bibdate = {2006-12-07}, 
    bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/models/models2006.html#RashidM06}, 
    isbn = {3-540-45772-0}, 
    owner = {aljasser}, 
    timestamp = {2008.09.16}, 
    url = {http://dx.doi.org/10.1007/11880240_12} 
} 
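
It works perfectly. thxxxx – ykh 2014-09-10 16:29:03

You're welcome. – sln 2014-09-10 16:33:22

I think the problem is that your input does not end with \n, so your second record is never matched completely. You should use an alternation with $.

This captures the record in group 1: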

@(.*?)^}(?:[\r\n]+|$) 

Note that you must use the m and s modifiers (Multiline and Singleline).

Use this code:

string pattern = @"@(.*?)^}(?:[\r\n]+|$)"; 
Regex regex = new Regex(pattern, RegexOptions.Multiline | RegexOptions.Singleline); 
MatchCollection mc = regex.Matches(bibFileContent); 
List<string> results = new List<string>(); 
// Collect group 1 of every match, not the groups of the first match only. 
foreach (Match m in mc) 
{ 
    results.Add(m.Groups[1].Value); 
} 
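
Your regex works in the demo, but it does not work in my code when I try it. I am already using the Multiline option – ykh 2014-09-10 15:34:15

You have to use both the Multiline and Singleline options – 2014-09-10 15:36:14

I already tried RegexOptions options = RegexOptions.Multiline | RegexOptions.Singleline; but it still gives me zero matches – ykh 2014-09-10 15:43:03

If you use the Matches method, you need a pattern like this one, which handles balanced braces: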

string pattern = @"@[A-Z]+{(?>[^{}]+|(?<open>{)|(?<-open>}))*(?(open)(?!))}"; 
Regex regex = new Regex(pattern); 

Or, to make sure that all the results are well formed (from the brackets point of view):

string pattern = @"\G[^{}]*(@[A-Z]+{(?>[^{}]+|(?<open>{)|(?<-open>}))*(?(open)(?!))})"; 

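Both patterns use a named capture as a counter. When an opening brace is encountered the counter is incremented, and when a closing brace is encountered it is decremented. (?(open)(?!)) is a conditional test that makes the pattern fail if the counter is not empty.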
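As a minimal sketch of the counter idea, the first pattern above can be run over a short two-record input with nested braces. The class name, the method name, and the sample string are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BalancedBraceDemo
{
    // Hypothetical sample: two records, both with nested braces in the title.
    public const string Sample =
        "@INPROCEEDINGS{key1,\n    title = {{{Eos}: Instance-Level Aspects}}\n}\n\n" +
        "@ARTICLE{key2,\n    title = {Domain Models Are {NOT} Aspect Free}\n}";

    public static List<string> FindRecords(string input)
    {
        // (?<open>{) pushes onto the "open" stack, (?<-open>}) pops from it,
        // and (?(open)(?!)) fails the match if anything is left on the stack.
        string pattern = @"@[A-Z]+{(?>[^{}]+|(?<open>{)|(?<-open>}))*(?(open)(?!))}";
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        foreach (string record in FindRecords(Sample))
        {
            Console.WriteLine("New Record");
            Console.WriteLine(record);
        }
    }
}
```

Because the closing brace of a record is only accepted once the stack is empty, the inner braces of the title fields do not end a record early.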


If the chunks do not contain the @ character, it is handier to use the Regex.Split(input, pattern) method:

string[] result = Regex.Split(input, @"[^}]*(?=@)"); 

If the chunks can contain the @ character, you can make it more robust with a more descriptive lookahead:

string[] result = Regex.Split(input, @"[^}]*(?=@[A-Z]+{)"); 

string[] result = Regex.Split(input, @"\s*(?=@[A-Z]+{)"); 
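A small sketch of the split approach, assuming the last pattern with its lookahead written as (?=@[A-Z]+{). The class name, the method name, and the sample input are hypothetical. Regex.Split can return empty strings when a match sits at the very start of the input, so this sketch also drops blank chunks; that filtering step is an addition, not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical sample: the first record contains an '@' inside a field.
    public const string Sample =
        "@ARTICLE{a,\n  url = {mailto:x@y.z}\n}\n\n@BOOK{b,\n  year = {2006}\n}\n";

    public static string[] SplitRecords(string input)
    {
        // Split on the whitespace in front of "@TYPE{"; the lookahead keeps
        // the record header in the following chunk.
        return Regex.Split(input, @"\s*(?=@[A-Z]+{)")
                    .Select(chunk => chunk.Trim())
                    .Where(chunk => chunk.Length > 0)   // drop empty chunks
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string record in SplitRecords(Sample))
        {
            Console.WriteLine("New Record");
            Console.WriteLine(record);
        }
    }
}
```

The lowercase text after the '@' in the url field does not satisfy [A-Z]+{, so the input is not split there.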
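
It works fine on the sample above, but when I tried it on my file I got more results, because it seems some records have "@" in them; I got 283/243 – ykh 2014-09-10 15:37:04

@user733659: did you try the last pattern? – 2014-09-10 15:41:21

You can use a simple regex like this: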

(@[^@]+) 

The idea is that a match starts with @ and cannot contain another @. By the way, if you only want to match the pattern rather than capture it, just remove the capturing group:

@[^@]+
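This pattern relies on no record containing a second @. As a hedged illustration of that limit (the class name, the method name, and the sample text are invented for this sketch), a record with an @ inside a field is cut into two matches:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveAtDemo
{
    // Hypothetical sample: two records, the first with an '@' inside a field.
    public const string Sample =
        "@MISC{a,\n  note = {contact x@y.z}\n}\n\n@MISC{b,\n  year = {2006}\n}";

    public static int CountMatches(string input)
    {
        // Each match runs from one '@' up to, but not including, the next '@'.
        return Regex.Matches(input, @"@[^@]+").Count;
    }

    public static void Main()
    {
        // Two records in the input, but the embedded '@' yields an extra match.
        Console.WriteLine(CountMatches(Sample));
    }
}
```

If the real file has such fields, a pattern that anchors on the record header or counts braces is likely the safer choice.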