2013-05-08 85 views
2

Regex pattern correction across languages

I found this regex pattern at http://gskinner.com/RegExr/:

,(?=(?:[^"]*"[^"]*")*(?![^"]*")) 

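It matches CSV-delimited values (more specifically, the separating commas, which can then be split on), and it works very well with my test data on that site. While testing, you can see what I assume is a JavaScript implementation in the site's bottom panel.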

However, when I try to implement this in C#/.NET, the matching does not work properly. My implementation:

Regex r = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))", RegexOptions.ECMAScript); 
//get data... 
foreach (string match in r.Split(sr.ReadLine())) 
{ 
    //lblDev.Text = lblDev.Text + match + "<br><br><br><p>column:</p><br>"; 
    dtF.Columns.Add(match); 
} 

//more of the same to get rows 

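For some rows of data the result is exactly what the site above produces, but for others the first 6 or so fields either fail to split or are missing from the split array altogether.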

Can anyone tell me why the pattern does not seem to behave the same way?

My test data:

CategoryName,SubCategoryName,SupplierName,SupplierCode,ProductTitle,Product Company ,ProductCode,Product_Index,ProductDescription,Product BestSeller,ProductDimensions,ProductExpressDays,ProductBrandName,ProductAdditionalText ,ProductPrintArea,ProductPictureRef,ProductThumnailRef,ProductQuantityBreak1 (QB1),ProductQuantityBreak2 (QB2),ProductQuantityBreak3 (QB3),ProductQuantityBreak4 (QB4),ProductPlainPrice1,ProductPlainPrice2,ProductPlainPrice3,ProductPlainPrice4,ProductColourPrice1,ProductColourPrice2,ProductColourPrice3,ProductColourPrice4,ProductExtraColour1,ProductExtraColour2,ProductExtraColour3,ProductExtraColour4,SellingPrice1,SellingPrice2,SellingPrice3,SellingPrice4,ProductCarriageCost1,ProductCarriageCost2,ProductCarriageCost3,ProductCarriageCost4,BLACK,BLUE,WHITE,SILVER,GOLD,RED,YELLOW,GREEN,ProductOtherColors,ProductOrigination,ProductOrganizationCost,ProductCatalogEntry,ProductPageNumber,ProductPersonalisationType1 (PM1),ProductPrintPosition,ProductCartonQuantity,ProductCartonWeight,ProductPricingExpering,NewProduct,ProductSpecialOffer,ProductSpecialOfferEnd,ProductIsActive,ProductRepeatOrigination,ProductCartonDimession,ProductSpecialOffer1,ProductIsExpress,ProductIsEco,ProductIsBiodegradable,ProductIsRecycled,ProductIsSustainable,ProductIsNatural 
Audio,Speakers and Headphones,The Prime Time Company,CM5064:In-ear headphones,Silly Buds,,10058,372,"Small, trendy ear buds with excellent sound quality and printing area actually on each ear- piece. Plastic storage box, with room for cables be wrapped around can also be printed.",FALSE,70 x 70 x 20mm,,,,10mm dia,10058.jpg,10058.jpg,100,250,500,1000,2.19,2.13,2.06,1.99,0.1,0.1,0.05,0.05,0.1,0.1,0.05,0.05,3.81,3.71,3.42,3.17,0,0,0,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,,30,,TRUE,24,Screen Printed,Earpiece,200,11,,TRUE,,,TRUE,15,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE 
Audio,Speakers and Headphones,The Prime Time Company,CM5058:Headstart,Head Start,,10060,372,"Lightweight, slimline, foldable and patented headphones ideal for the gym or exercise. These 
headphones uniquely hang from the ears giving security, comfort and an excellent sound quality. There is also a secret cable winding facility.",FALSE,130 x 85 x 45mm,,,,30mm dia,10060.jpg,10060.jpg,100,250,500,1000,5.6,5.43,5.26,5.09,0.1,0.1,0.05,0.05,0.1,0.1,0.05,0.05,9.47,8.96,8.24,7.97,0,0,0,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,,30,,TRUE,24,Screen Printed,print plate on ear (s),100,11,,TRUE,,,TRUE,15,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE 
+2

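Do you mean the first six rows or the first six columns? If rows, you need to look at `sr.ReadLine()` and the loop around it to make sure you are reading the data correctly. Also, I notice that your test data contains a line break in the middle of the product description column of the second data row. That line break will affect your results. – 2013-05-08 10:36:06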

Answers

3

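Use the right tool for the job. Regular expressions are not well suited to parsing CSV, which can contain an unlimited number of nested quotes.

Use this instead:

A Fast CSV Reader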

http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader

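We use it in production code. It works well and gives you a sense of how complex the parsing really is. For more on that complexity, look at the 800+ unit tests included in the solution.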
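For orientation, reading a file with that library might look roughly like the sketch below. It assumes the `LumenWorks.Framework.IO.Csv` assembly from the linked article is referenced, and the `FastCsvExample` class and `path` parameter are hypothetical names used only for illustration. Check the member names against the version you actually install.

```csharp
using System;
using System.IO;
using LumenWorks.Framework.IO.Csv; // assumption: assembly from the linked CodeProject article

public static class FastCsvExample // hypothetical wrapper, not part of the library
{
    public static void Dump(string path)
    {
        // The second argument tells the reader that the first record holds the column headers.
        using (var csv = new CsvReader(new StreamReader(path), true))
        {
            string[] headers = csv.GetFieldHeaders();

            // Each call advances to the next record; quoting is handled by the parser, not by the caller.
            while (csv.ReadNextRecord())
            {
                for (int i = 0; i < csv.FieldCount; i++)
                {
                    Console.WriteLine("{0} = {1}", headers[i], csv[i]);
                }
            }
        }
    }
}
```

The point of the design is that the caller never calls `ReadLine()` directly, so a quoted field that spans two physical lines is the parser's problem rather than yours.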

0

Your C# regex works fine for me in LinqPad, but your data contains a line break in the last row of data. So you cannot simply use `sr.ReadLine()` to read the data.
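If you want to keep the regex, one possible workaround is to assemble each logical record before splitting it. The sketch below is an illustration under assumptions, not a full CSV parser: `CsvRecords`, `ReadRecords` and `SplitRecord` are hypothetical names, and it relies on the fact that a physical line containing an odd number of double quotes must end inside a quoted field. On such a truncated line, every comma before the unmatched quote has an odd number of quotes ahead of it, so the lookahead rejects it, which would explain why the leading fields do not split.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvRecords // hypothetical helper, for illustration only
{
    // Same pattern as in the question: a comma followed by an even number of quotes.
    private static readonly Regex Separator =
        new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

    // Yields one logical record at a time, joining physical lines
    // until the number of double quotes in the record is even.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string record = line;
            while (record.Count(c => c == '"') % 2 != 0)
            {
                string next = reader.ReadLine();
                if (next == null) break; // unterminated quote at end of input
                record += "\n" + next;   // restore the line break that ReadLine() removed
            }
            yield return record;
        }
    }

    // Splits a complete record on the separating commas only.
    public static string[] SplitRecord(string record)
    {
        return Separator.Split(record);
    }
}
```

In the question's code this would replace the bare `sr.ReadLine()` call, for example `foreach (string record in CsvRecords.ReadRecords(sr))`. Note that the fields keep their surrounding quotes, so quoted values still need to be unwrapped afterwards.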