2012-07-18 95 views
1

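I have to write (or use an existing) CSV parsing library that can parse CSV files with an unknown delimiter.

The problem is that files are uploaded in different formats, for example with different delimiters: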

File1: 
field1; field2; field3; field4 
field1; field2; field3; field4 

File2: 
feld1, field2, field3, field4 
feld1, field2, field3, field4 

File3: 
"field1", "field2", "field3", "field4" 
"field1", "field2", "field3", "field4" 

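What is the best way to programmatically find out which symbol is the actual column delimiter?

I am thinking of writing my own method based on statistical analysis of the symbols, but maybe there is an existing solution?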
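The statistical idea mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing library: `DelimiterGuesser`, `Guess` and the `Candidates` list are hypothetical names, and the heuristic (a delimiter appears the same non-zero number of times on every non-blank line) ignores delimiters that occur inside quoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DelimiterGuesser
{
    // Hypothetical candidate list (an assumption); extend it for the files you expect.
    static readonly char[] Candidates = { ',', ';', '\t', '|' };

    // Returns the candidate that occurs the same non-zero number of times on
    // every non-blank line, preferring the candidate with the most occurrences.
    // Returns null when no candidate is consistent across all lines.
    public static char? Guess(IEnumerable<string> lines)
    {
        List<string> rows = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        if (rows.Count == 0) return null;

        char? best = null;
        int bestCount = 0;
        foreach (char c in Candidates)
        {
            int first = rows[0].Count(ch => ch == c);
            bool consistent = first > 0 && rows.All(r => r.Count(ch => ch == c) == first);
            if (consistent && first > bestCount)
            {
                best = c;
                bestCount = first;
            }
        }
        return best;
    }

    static void Main()
    {
        var file1 = new[] { "field1; field2; field3; field4", "field1; field2; field3; field4" };
        var file2 = new[] { "feld1, field2, field3, field4", "feld1, field2, field3, field4" };

        Console.WriteLine("File1 delimiter: '{0}'", Guess(file1));
        Console.WriteLine("File2 delimiter: '{0}'", Guess(file2));
    }
}
```

A fixed candidate list keeps the sketch simple. Counting every non-alphanumeric character instead would remove the list, but it makes false positives more likely (for example the quote character in File3, which also appears a constant number of times per line).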

Answers

1

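I would go with regular expressions (hoping not to get as many downvotes as last time ;) ). I took advantage of backreferences, which basically let you reuse a previously captured group. You can even have different separators in the same file, as long as each line uses the same separator throughout (not sure whether that is useful).

So, this is how I built the regex: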

string csvItem = @"[""']?\w+[""']?"; 
string separator = @"\s*[,\.;-]\s*"; 
string pattern = string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", 
    csvItem, separator); 

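csvItem is an item (column) in the CSV. It can contain lowercase or uppercase letters, digits and underscores, and can optionally be surrounded by " or '.

separator separates the items. It consists of one of these characters: , . ; - surrounded by zero or more whitespace characters.

The pattern says that a valid line consists of at least two csvItems separated by a separator. Note the backreference \k<sep>: it forces every further separator on the line to be identical to the first one captured.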
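To make the effect of the backreference concrete, here is a minimal sketch that wraps the same pattern in a small helper. `BackreferenceDemo` and `DetectSeparator` are names made up for this illustration; the regex pieces are the ones shown above. A line with one consistent separator should match, while a line mixing `;` and `,` should not.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackreferenceDemo
{
    // Same building blocks as in the answer.
    const string CsvItem = @"[""']?\w+[""']?";
    const string Separator = @"\s*[,\.;-]\s*";

    static readonly Regex Line = new Regex(
        string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", CsvItem, Separator));

    // Returns the separator captured for the line (including its surrounding
    // whitespace), or null when the line does not match the pattern.
    public static string DetectSeparator(string line)
    {
        Match m = Line.Match(line);
        return m.Success ? m.Groups["sep"].Value : null;
    }

    static void Main()
    {
        // Consistent separator: the backreference can repeat it.
        Console.WriteLine(DetectSeparator("a; b; c") ?? "(no match)");
        // Mixed separators: \k<sep> requires "; " again, so the line is rejected.
        Console.WriteLine(DetectSeparator("a; b, c") ?? "(no match)");
    }
}
```

Because the captured separator includes its whitespace, a line such as `a; b;c` would also be rejected. Whether that strictness is desirable depends on how regular the uploaded files are.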

This is the content of the test file:

field1; field2; field3; field4 
field1; field2; field3; field4 

feld1, field2, field3, field4 
feld1, field2, field3, field4 

"field1", "field2", "field3", "field4" 
"field1", "field2", "field3", "field4" 

And here is a sample console project:

using System; 
using System.Collections.Generic; 
using System.Linq; 
using System.Text; 
using System.IO; 
using System.Text.RegularExpressions; 

namespace csvParser { 
    class Program { 
     static void Main(string[ ] args) { 
      var lines = File.ReadAllLines(@"e:\prova.csv"); 

      for (int i = 0; i < lines.Length; i++) { 
       string csvItem = @"[""']?\w+[""']?"; 
       string separator = @"\s*[,\.;-]\s*"; 
       string pattern = string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", csvItem, separator); 

       var rex = new Regex(pattern, RegexOptions.Singleline); 
       var match = rex.Match(lines[ i ]); 

       if (!match.Success) { // Regex.Match never returns null; a failed match has Success == false
        Console.WriteLine("No match on line {0}", i); 
        continue; 
       } 
       else { 
        string sep = match.Groups[ "sep" ].Value; 

        Console.WriteLine("--- Line #{0} ---------------", i); 
        Console.WriteLine("Line is '{0}'", lines[ i ]); 
        Console.WriteLine("Separator is '{0}'", sep); 

        Console.WriteLine("Items are:"); 
        foreach (string item in lines[ i ].Split(sep)) 
         Console.WriteLine("\t'{0}'", item); 

        Console.WriteLine(); 
       } 
      } 

      Console.ReadKey(); 
     } 
    } 

    public static partial class Extension { 
     public static string[ ] Split(this string str, string sep) { 
      return str.Split(new string[ ] { sep }, StringSplitOptions.RemoveEmptyEntries); 
     } 
    } 
} 

And finally the output:

--- Line #0 --------------- 
Line is 'field1; field2; field3; field4' 
Separator is '; ' 
Items are: 
     'field1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #1 --------------- 
Line is 'field1; field2; field3; field4' 
Separator is '; ' 
Items are: 
     'field1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #2 --------------- 
Line is '' 
Separator is '' 
Items are: 

--- Line #3 --------------- 
Line is 'feld1, field2, field3, field4' 
Separator is ', ' 
Items are: 
     'feld1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #4 --------------- 
Line is 'feld1, field2, field3, field4' 
Separator is ', ' 
Items are: 
     'feld1' 
     'field2' 
     'field3' 
     'field4' 

--- Line #5 --------------- 
Line is '' 
Separator is '' 
Items are: 

--- Line #6 --------------- 
Line is '"field1", "field2", "field3", "field4"' 
Separator is ', ' 
Items are: 
     '"field1"' 
     '"field2"' 
     '"field3"' 
     '"field4"' 

--- Line #7 --------------- 
Line is '"field1", "field2", "field3", "field4"' 
Separator is ', ' 
Items are: 
     '"field1"' 
     '"field2"' 
     '"field3"' 
     '"field4"' 

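Unfortunately, the empty lines show up as matches in the output above. That output comes from a `match == null` check, which never fires because `Regex.Match` never returns null. Testing `match.Success` instead sends the empty lines to the "No match" branch.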

+0

Thanks, this is a f*cking awesome approach! – Ruslan 2012-07-18 15:59:14

+0

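However, your approach requires a predefined list of possible delimiters. I would like a method that works out the most probable delimiter for a given file. – Ruslan 2012-07-18 16:07:23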

+1

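@Ruslan: Well, I think that is hard to do. You should at least know what kind of separators you are looking for, or which characters they contain. When a CSV is formatted with double spaces and spaces, – BlackBear 2012-07-18 16:35:07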