2012-08-17 57 views
-2

Removing quotes inside fields in a CSV file

Say we have a comma-separated (CSV) file like this:

"name of movie","starring","director","release year" 
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012" 
"the dark knight","christian bale, heath ledger","christopher nolan","2008" 
"The "day" when earth stood still","Michael Rennie,the 'strong' man","robert wise","1951" 
"the 'gladiator'","russel "the awesome" crowe","ridley scott","2000" 

As you can see above, lines 4 & 5 have quotes inside quotes. The output should look like this:

"name of movie","starring","director","release year" 
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012" 
"the dark knight","christian bale, heath ledger","christopher nolan","2008" 
"The day when earth stood still","Michael Rennie,the strong man","robert wise","1951" 
"the gladiator","russel the awesome crowe","ridley scott","2000" 

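How can I get rid of quotes (single or double) that appear inside the fields of a CSV file like this? Note that commas inside a single field are fine, because the parser sees that they are within quotes and treats the whole thing as one field. This is only a preprocessing step to tidy up the CSV file so that it can be fed to several parsers and converted into whatever format we need. Bash, awk, and python would all work. Please no perl, I'm sick of that language :D Thanks in advance!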

+0

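It's not clear to me how removing the first and last quote would help. The requirement is to have double quotes around every field in the csv file. If we don't have quotes around each field, field values that contain commas can't be parsed. – crazyim5 2012-08-17 17:53:17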

+0

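My thinking was that a CSV reader would not be able to parse the file because of the unescaped double quotes. I figured you would have to parse it yourself, hence my suggestion. Removing the first and last quotes is unnecessary too, though, since they would get stripped anyway. I thought you were already using the csv module... I guess not. – 2012-08-17 18:18:03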

+0

I don't understand why my question got a -1 :/ – crazyim5 2012-08-17 18:18:47

Answers

3

How about

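import csv

def remove_quotes(s):
    # Drop every single and double quote character from a field.
    return ''.join(c for c in s if c not in ('"', "'"))

# Python 3: open in text mode with newline="" (Python 2 used "rb"/"wb").
with open("fixquote.csv", newline="") as infile, open("fixed.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    # QUOTE_ALL wraps every output field in double quotes.
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow([remove_quotes(elem) for elem in line])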

which produces

~/coding$ cat fixed.csv 
"name of movie","starring","director","release year" 
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012" 
"the dark knight","christian bale, heath ledger","christopher nolan","2008" 
"The day when earth stood still","Michael Rennie,the strong man","robert wise","1951" 
"the gladiator","russel the awesome crowe","ridley scott","2000" 

By the way, you might want to check the spelling of some of those names..
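
For anyone wondering why the reader does not choke on the stray quotes: as far as I can tell, `csv.reader` is lenient by default and only complains when `strict=True` is passed. Here is a minimal sketch, assuming Python 3; the helper `parse` and the sample strings are made up for illustration and are not part of the answer above. The last sample shows a limitation: once an inner quote has been seen, a comma later in that same field is treated as a delimiter, so such a field would be split.

```python
import csv
import io

def remove_quotes(s):
    # Same effect as the generator version: drop ' and " characters.
    return s.replace('"', '').replace("'", '')

def parse(text, strict=False):
    # Hypothetical helper: parse CSV text held in memory instead of a file.
    return list(csv.reader(io.StringIO(text, newline=""), strict=strict))

sample = '"The "day" when earth stood still","robert wise"\n'

# Lenient (default) mode keeps going after the unescaped inner quote.
raw = parse(sample)[0]
print(raw)
print([remove_quotes(field) for field in raw])

# Strict mode raises csv.Error on the same input.
try:
    parse(sample, strict=True)
except csv.Error as exc:
    print("strict mode rejects the line:", exc)

# Caveat: an inner quote followed later by a comma splits the field.
tricky = '"a "b", c","d"\n'
print(parse(tricky)[0])
```

So the approach is fine for data like the sample in the question, but it is worth checking the field count per row if inner quotes and commas can occur in the same field.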

+0

`if c not in ('"', "'")` could also be written as `if c not in """"'"""` :) – 2012-08-17 18:05:10

+1

Ah, I would have expected the reader to choke on those unescaped double quotes. – 2012-08-17 18:09:37

+0

@TimPietzcker: Heavy! Well, in that case `'"'"'"` would work too, thanks to string concatenation. ^^ – DSM 2012-08-17 18:16:25

0

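Split the values into an array. Loop over the array and remove all quotes from each value except its first and last character. Hope that helps.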
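
A rough sketch of that idea in Python, under the assumption that every line has the shape `"f1","f2",...` and that the three-character sequence `","` only ever appears between fields, never inside one. The function name `clean_line` is hypothetical.

```python
def clean_line(line):
    """Strip stray quotes from one CSV line whose fields are all double-quoted."""
    body = line.strip()
    # Drop the outermost pair of quotes, if present.
    if len(body) >= 2 and body[0] == '"' and body[-1] == '"':
        body = body[1:-1]
    # Assumption: '","' marks a field boundary and never occurs inside a field.
    fields = body.split('","')
    # Remove every remaining single or double quote inside each field.
    cleaned = [f.replace('"', '').replace("'", '') for f in fields]
    # Put the surrounding double quotes back around every field.
    return '"' + '","'.join(cleaned) + '"'

print(clean_line('"the \'gladiator\'","russel "the awesome" crowe","ridley scott","2000" '))
```

Because it splits on the full `","` separator rather than on bare commas, fields such as `"christian bale, anna hathaway"` stay in one piece.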

0

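With awk, you can do this (it re-quotes at every comma, so it only suits files with no commas inside fields):

awk -v Q='"' '{ gsub(/["\047]/, "") ; gsub(",", Q "," Q) ; print Q $0 Q }'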
+0

Thanks for the solution. I think the python solution DSM posted is great. – crazyim5 2012-08-17 18:28:08