2011-12-29 136 views
3

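Split a string on spaces in C#

I want to split a string on spaces, except where the text is enclosed in double quotes ("text") or single quotes ('text').

I am currently doing this with the following function: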

public static string[] ParseKeywordExpression(string keywordExpressionValue, bool isUniqueKeywordReq) 
{ 
    // Check for null before trimming; calling Trim() on a null string throws.
    if (keywordExpressionValue == null) 
     return new string[0]; 
    keywordExpressionValue = keywordExpressionValue.Trim(); 
    if (keywordExpressionValue.Length == 0) 
     return new string[0]; 
    int idx = keywordExpressionValue.Trim().IndexOf(" "); 
    if (idx == -1) 
     return new string[] { keywordExpressionValue }; 
    //idx = idx + 1; 
    int count = keywordExpressionValue.Length; 
    ArrayList extractedList = new ArrayList(); 
    while (count > 0) 
    { 
     if (keywordExpressionValue[0] == '"') 
     { 
      int temp = keywordExpressionValue.IndexOf(BACKSLASH, 1, keywordExpressionValue.Length - 1); 
      while (keywordExpressionValue[temp - 1] == '\\') 
      { 
       temp = keywordExpressionValue.IndexOf(BACKSLASH, temp + 1, keywordExpressionValue.Length - temp - 1); 
      } 
      idx = temp + 1; 
     } 
     if (keywordExpressionValue[0] == '\'') 
     { 
      int temp = keywordExpressionValue.IndexOf(BACKSHASH_QUOTE, 1, keywordExpressionValue.Length - 1); 
      while (keywordExpressionValue[temp - 1] == '\\') 
      { 
       temp = keywordExpressionValue.IndexOf(BACKSHASH_QUOTE, temp + 1, keywordExpressionValue.Length - temp - 1); 
      } 
      idx = temp + 1; 
     } 
     string s = keywordExpressionValue.Substring(0, idx); 
     int left = count - idx; 
     keywordExpressionValue = keywordExpressionValue.Substring(idx, left).Trim(); 
     if (isUniqueKeywordReq)      
     { 
      if (!extractedList.Contains(s.Trim('"'))) 
      { 
       extractedList.Add(s.Trim('"')); 
      } 
     } 
     else 
     { 
      extractedList.Add(s.Trim('"')); 
     } 
     count = keywordExpressionValue.Length; 
     idx = keywordExpressionValue.IndexOf(SPACE); 
     if (idx == -1) 
     { 
      string add = keywordExpressionValue.Trim('"', ' '); 
      if (add.Length > 0) 
      { 
       if (isUniqueKeywordReq) 
       { 
        if (!extractedList.Contains(add)) 
        { 
         extractedList.Add(add); 
        } 
       } 
       else 
       { 
        extractedList.Add(add); 
       } 
      }     
      break; 
     } 
    } 
    return (string[])extractedList.ToArray(typeof(string)); 
} 

Is there any other way to do this, or can this function be optimized?

For example, I want to split the string

%ABC% %aasdf% aalasdjjfas "c:\Document and Setting\Program Files\abc.exe"

into

%ABC%
%aasdf%
aalasdjjfas
"c:\Document and Setting\Program Files\abc.exe"

+0

So find a CSV regex and adapt it to use '\s' instead of a comma? – 2011-12-29 13:37:25

+0

@BradChristie I have edited my question to show the output I want. I don't think a CSV regex will help. – Ankesh 2011-12-29 13:47:19

Answers

6

The simplest way to do this is a regular expression that handles both single and double quotes:

("((\\")|([^"]))*")|('((\\')|([^']))*')|(\S+)

var regex = new Regex(@"(""((\\"")|([^""]))*"")|('((\\')|([^']))*')|(\S+)"); 
var matches = regex.Matches(inputstring); 
foreach (Match match in matches) { 
    extractedList.Add(match.Value); 
} 

So basically four or five lines of code are enough.

The expression, explained:

Main structure: 
("((\\")|([^"]))*") Double-quoted token 
|      , or 
('((\\')|([^']))*') single-quoted token 
|      , or 
(\S+)     any group of non-space characters 

Double-quoted token: 
(      Group starts 
    "     Initial double-quote 
    (     Inner group starts 
     (\\")   Either a backslash followed by a double-quote 
     |    , or 
     ([^"])   any non-double-quote character 
    )*     The inner group repeats any number of times (or zero) 
    "     Ending double-quote 
) 

Single-quoted token: 
(      Group starts 
    '     Initial single-quote 
    (     Inner group starts 
     (\\')   Either a backslash followed by a single-quote 
     |    , or 
     ([^'])   any non-single-quote character 
    )*     The inner group repeats any number of times (or zero) 
    '     Ending single-quote 
) 

Non-space characters: 
(      Group starts 
    \S     Non-white-space character 
    +     , repeated at least once 
)      Group ends 
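
As a rough illustration of how this pattern behaves on the sample input from the question, here is a minimal sketch. The class name `QuotedTokenizer`, the method name `Tokenize` and the `unique` flag (loosely mirroring `isUniqueKeywordReq` from the question) are assumptions made for this example and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class QuotedTokenizer
{
    // The same pattern as above, written as a C# verbatim string
    // (a double quote is escaped by doubling it).
    private static readonly Regex TokenPattern =
        new Regex(@"(""((\\"")|([^""]))*"")|('((\\')|([^']))*')|(\S+)");

    // Returns the tokens in input order. Quoted tokens keep their
    // surrounding quotes. When unique is true, repeated tokens are skipped.
    public static List<string> Tokenize(string input, bool unique)
    {
        var tokens = new List<string>();
        if (string.IsNullOrWhiteSpace(input))
            return tokens;

        var seen = new HashSet<string>();
        foreach (Match match in TokenPattern.Matches(input))
        {
            // HashSet.Add returns false when the value is already present.
            if (!unique || seen.Add(match.Value))
                tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

Calling `QuotedTokenizer.Tokenize(input, false)` on the question's example string should yield the three unquoted tokens followed by the quoted path as a single token, quotes included. If the quotes are not wanted, they could be stripped afterwards, for example with `Trim('"', '\'')`.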
+0

It works for double quotes but not for single quotes, e.g. %ABC% %aasdf% aalasdjjfas "C:\Document and Setting\Program Files\abc.exe" 'C:\Document and Setting\Program Files\abc.exe' – Ankesh 2011-12-29 14:01:26

+0

I have updated my answer to also handle single quotes. – 2011-12-29 14:31:13

+0

Your regex works great... :). Thanks :) – Ankesh 2011-12-30 06:18:29

2

If you don't like regular expressions, this method should be able to split out quoted strings while ignoring consecutive spaces:

public IEnumerable<string> SplitString(string input) 
{ 
    var isInDoubleQuote = false; 
    var isInSingleQuote = false; 
    var sb = new StringBuilder(); 
    foreach (var c in input) 
    { 
     if (!isInDoubleQuote && c == '"') 
     { 
      isInDoubleQuote = true; 
      sb.Append(c); 
     } 
     else if (isInDoubleQuote) 
     { 
      sb.Append(c); 
      if (c != '"') 
       continue; 
      if (sb.Length > 2) 
       yield return sb.ToString(); 
      sb = sb.Clear(); 
      isInDoubleQuote = false; 
     } 
     else if (!isInSingleQuote && c == '\'') 
     { 
      isInSingleQuote = true; 
      sb.Append(c); 
     } 
     else if (isInSingleQuote) 
     { 
      sb.Append(c); 
      if (c != '\'') 
       continue; 
      if (sb.Length > 2) 
       yield return sb.ToString(); 
      sb = sb.Clear(); 
      isInSingleQuote = false; 
     } 
     else if (c == ' ') 
     { 
      if (sb.Length == 0) 
       continue; 
      yield return sb.ToString(); 
      sb.Clear(); 
     } 
     else 
      sb.Append(c); 
    } 
    if (sb.Length > 0) 
     yield return sb.ToString(); 
} 

Edit: changed the return type to IEnumerable, using yield and StringBuilder.

+0

This generates a lot of GC'able temporary strings, doesn't it? – 2011-12-29 17:02:43

+1

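If you aren't going to touch the results more than once and will just `foreach` through them, then changing the return type to `IEnumerable<string>` and replacing the `output.Add` calls with `yield return curentString;` is a good idea. This is also a case for using `StringBuilder` rather than lots of concatenation. – 2011-12-29 17:06:19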

+0

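I completely agree, @JonHanna. `yield return` is an underused feature of C#. The `StringBuilder` point is valid, but since this will probably only be used to parse command-line argument sequences, the performance hit isn't very large. Still, that is no excuse for sloppy code. – 2011-12-29 18:17:44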

2

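I escape the single and double quotes in the pattern by using their hex values, \x27 and \x22. That makes the pattern's C# literal text easier to read and work with.

Also, IgnorePatternWhitespace is used because it allows the pattern to be commented for better readability; it does not affect how the regex is processed.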

string data = @"'single' %ABC% %aasdf% aalasdjjfas ""c:\Document and Setting\Program Files\abc.exe"""; 

string pattern = @"(?xm)  # Tell the regex compiler we are commenting (x = IgnorePatternWhitespace) 
          # and tell the compiler this is multiline (m), 
          # In Multiline the ^ matches each start of line and $ matches each EOL 
          # -Pattern Start- 
^(       # Start at the beginning of the line always 
(?![\r\n]|$)    # Stop the match if EOL or EOF found. 
(?([\x27\x22])    # Regex If to check for single/double quotes 
     (?:[\x27\x22])   # \\x27\\x22 are single/double quotes 
     (?<Token>[^\x27\x22]+) # Match this in the quotes and place in Named match Token 
     (?:[\x27\x22]) 

    |       # or (else) part of If when Not within quotes 

    (?<Token>[^\s\r\n]+) # Not within quotes, but put it in the Token match group 
)       # End of Pattern OR 

(?:\s?)      # Either a space or EOL/EOF 
)+       # 1 or more tokens of data. 
"; 

Console.WriteLine(string.Join(" | ", 

Regex.Match(data, pattern) 
     .Groups["Token"] 
     .Captures 
     .OfType<Capture>() 
     .Select(cp => cp.Value) 
       ) 
       ); 
/* Output 
single | %ABC% | %aasdf% | aalasdjjfas | c:\Document and Setting\Program Files\abc.exe 
*/ 

The above is based on the following two blog posts I wrote:

+1

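I'm glad you found your answer. I have great faith in regular expressions; if people take the time to learn them, they are a powerful tool that can be used everywhere, regardless of language (C#/Java/PHP). :-) – OmegaMan 2011-12-30 08:34:50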