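I'd use regular expressions (hopefully I won't get as many downvotes as last time ;) ). I made use of backreferences, which basically let you reuse previously captured groups. You can also have different separators in the same file, as long as each line sticks to a single separator (not sure whether that's useful).
So, this is how I built the regex: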
string csvItem = @"[""']?\w+[""']?";
string separator = @"\s*[,\.;-]\s*";
string pattern = string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$",
    csvItem, separator);
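csvItem is an item (column) in the CSV. It can contain lowercase or uppercase letters, digits and underscores, and can optionally be surrounded by " or '.
separator separates the items. It consists of one of these characters: , . ; - surrounded by zero or more whitespace characters.
pattern says that a valid line is made of at least two csvItems separated by a separator. Note the backreference -> \k<sep>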
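A minimal sketch of what the backreference does, with one caveat worth checking. Because the first group is repeated with `+`, the engine can split a single item across two repetitions, so a line that mixes separators may still match. The class name `BackreferenceSketch` and the `Strict` variant (the same pattern without the `+`) are assumptions made for illustration, not the pattern used in the rest of this answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackreferenceSketch {
    const string CsvItem = @"[""']?\w+[""']?";
    const string Separator = @"\s*[,\.;-]\s*";

    // Pattern exactly as built in the answer.
    public static readonly Regex Original = new Regex(
        string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", CsvItem, Separator));

    // Hypothetical stricter variant: the first pair is not repeated,
    // so every later separator has to equal the captured one.
    public static readonly Regex Strict = new Regex(
        string.Format(@"^{0}(?<sep>{1}){0}(\k<sep>{0})*$", CsvItem, Separator));

    static void Main() {
        // One line with a single separator, one line that mixes ';' and ','.
        foreach (string line in new[] { "ab;cd;ef", "ab;cd,ef" }) {
            Console.WriteLine("'{0}': original={1}, strict={2}", line,
                Original.IsMatch(line), Strict.IsMatch(line));
        }
    }
}
```

Both patterns accept `ab;cd;ef`. The mixed line `ab;cd,ef` is rejected by the stricter variant but accepted by the original, which matches it as `ab;c` followed by `d,ef`.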
This is the content of the test file:
field1; field2; field3; field4
field1; field2; field3; field4

feld1, field2, field3, field4
feld1, field2, field3, field4

"field1", "field2", "field3", "field4"
"field1", "field2", "field3", "field4"
And a sample console project:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;

namespace csvParser {
    class Program {
        static void Main(string[] args) {
            var lines = File.ReadAllLines(@"e:\prova.csv");

            // The pattern is the same for every line, so build the regex once.
            string csvItem = @"[""']?\w+[""']?";
            string separator = @"\s*[,\.;-]\s*";
            string pattern = string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", csvItem, separator);
            var rex = new Regex(pattern, RegexOptions.Singleline);

            for (int i = 0; i < lines.Length; i++) {
                var match = rex.Match(lines[i]);
                // Regex.Match never returns null, so test Success instead
                // (the run that produced the output below tested match == null).
                if (!match.Success) {
                    Console.WriteLine("No match on line {0}", i);
                    continue;
                }
                else {
                    // The named group holds the separator, including surrounding whitespace.
                    string sep = match.Groups["sep"].Value;
                    Console.WriteLine("--- Line #{0} ---------------", i);
                    Console.WriteLine("Line is '{0}'", lines[i]);
                    Console.WriteLine("Separator is '{0}'", sep);
                    Console.WriteLine("Items are:");
                    foreach (string item in lines[i].Split(sep))
                        Console.WriteLine("\t'{0}'", item);
                    Console.WriteLine();
                }
            }
            Console.ReadKey();
        }
    }

    public static partial class Extension {
        // Splits on a whole string rather than on single characters.
        public static string[] Split(this string str, string sep) {
            return str.Split(new string[] { sep }, StringSplitOptions.RemoveEmptyEntries);
        }
    }
}
And finally the output:
--- Line #0 ---------------
Line is 'field1; field2; field3; field4'
Separator is '; '
Items are:
'field1'
'field2'
'field3'
'field4'
--- Line #1 ---------------
Line is 'field1; field2; field3; field4'
Separator is '; '
Items are:
'field1'
'field2'
'field3'
'field4'
--- Line #2 ---------------
Line is ''
Separator is ''
Items are:
--- Line #3 ---------------
Line is 'feld1, field2, field3, field4'
Separator is ', '
Items are:
'feld1'
'field2'
'field3'
'field4'
--- Line #4 ---------------
Line is 'feld1, field2, field3, field4'
Separator is ', '
Items are:
'feld1'
'field2'
'field3'
'field4'
--- Line #5 ---------------
Line is ''
Separator is ''
Items are:
--- Line #6 ---------------
Line is '"field1", "field2", "field3", "field4"'
Separator is ', '
Items are:
'"field1"'
'"field2"'
'"field3"'
'"field4"'
--- Line #7 ---------------
Line is '"field1", "field2", "field3", "field4"'
Separator is ', '
Items are:
'"field1"'
'"field2"'
'"field3"'
'"field4"'
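Unfortunately, empty lines get reported as matches too (lines #2 and #5 in the output above). The regex itself cannot match an empty string, so the culprit is the match check rather than the pattern. Trying to fix it :)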
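The empty-line entries can be reproduced in isolation. This is a small sketch (the class name `EmptyLineSketch` is made up for illustration) showing that `Regex.Match` returns a non-null `Match` whose `Success` is false when nothing matched, so a `match == null` test lets empty lines through.

```csharp
using System;
using System.Text.RegularExpressions;

static class EmptyLineSketch {
    // Same pattern as in the answer, written out in full.
    public static readonly Regex Rex = new Regex(
        @"^([""']?\w+[""']?(?<sep>\s*[,\.;-]\s*)[""']?\w+[""']?)+(\k<sep>[""']?\w+[""']?)*$",
        RegexOptions.Singleline);

    static void Main() {
        Match match = Rex.Match("");                            // an empty line from the file
        Console.WriteLine(match == null);                       // False: a Match object is always returned
        Console.WriteLine(match.Success);                       // False: this is the property to test
        Console.WriteLine("'{0}'", match.Groups["sep"].Value);  // '' since nothing was captured
    }
}
```

Checking `match.Success` before reading `match.Groups["sep"]` sends empty lines to the "No match" branch.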
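Thanks, this is a f*cking awesome approach! – Ruslan 2012-07-18 15:59:14
However, your approach needs a predefined list of possible separators. I'd like a method that works out the most probable separator for a given file. – Ruslan 2012-07-18 16:07:23
@Ruslan: Well, I think that's hard to do. You should at least know what kind of separators you're looking for, or which characters they are made of. When the csv is formatted with double spaces and spaces, – BlackBear 2012-07-18 16:35:07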