2011-02-13 108 views
1

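How do I split a comma-separated string in Python, except for commas inside quotes?

I want to split a comma-separated string in Python. The tricky part is that some fields in the data contain a comma themselves, and those fields are enclosed in quotes (either " or '). The resulting split strings should also have the quotes around those fields removed. In addition, some fields can be empty.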

Example:

hey,hello,,"hello,world",'hey,world' 

needs to be split into 5 parts, as shown below:

['hey', 'hello', '', 'hello,world', 'hey,world'] 

Any ideas, suggestions, or help on how to solve the above problem in Python would be much appreciated.

Thank you, Vish

+0

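It would be very helpful if you specified what you want to happen in some cases that your simple example doesn't cover: (1) `'abcd'efgh` (2) `'abcd'"efgh"` (3) `abcd"efgh"` - do you want each of these to produce one field (WITH QUOTES UNSTRIPPED), an exception, or something else? – 2011-02-14 20:40:28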

+0

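Also, suppose your input file was produced by querying a customer database and contains a perfectly plausible address line such as 'Dunromin', O'Drien Road. How would that be quoted or escaped in the input file? – 2011-02-14 23:57:08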

Answers

4

(Edit: The original answer had trouble with empty fields at the edges because of the way re.findall works, so I refactored it a bit and added tests.)

import re 

def parse_fields(text): 
    r""" 
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\'')) 
    ['hey', 'hello', '', 'hello,world', 'hey,world'] 
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\',')) 
    ['hey', 'hello', '', 'hello,world', 'hey,world', ''] 
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\',')) 
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', ''] 
    >>> list(parse_fields('')) 
    [''] 
    >>> list(parse_fields(',')) 
    ['', ''] 
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string')) 
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string'] 
    >>> list(parse_fields('testing,"unterminated quotes')) 
    ['testing', '"unterminated quotes'] 
    """ 
    pos = 0 
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""") 
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)

        yield result

        if not separator:
            break

        pos = m.end(0)

if __name__ == "__main__": 
    import doctest 
    doctest.testmod() 

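(['"]?) matches an optional single or double quote.

(.*?) matches the string itself. This is a non-greedy match, so it matches only as much as necessary instead of eating the whole string. It is assigned to result, and it is what actually gets yielded.

\1 is a backreference that matches the same single or double quote we matched earlier (if any).

(,|$) matches either the comma that separates entries or the end of the line. It is assigned to separator.

If separator is falsy (e.g. empty), there is no separator, so we are at the end of the string and we are done. Otherwise, we update the start position to where the regex match finished (m.end(0)) and continue looping.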
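To make the three groups concrete, here is a minimal sketch that runs the same pattern over a short sample and records what each group captures at every step. The sample text and the `steps` list are only for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer above.
exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

# Illustrative input: one double-quoted, one single-quoted, one bare field.
text = "\"hello,world\",'hey,world',plain"

pos = 0
steps = []
while True:
    m = exp.search(text, pos)
    # group(1): opening quote or '', group(2): the field, group(3): ',' or ''
    steps.append((m.group(1), m.group(2), m.group(3)))
    if not m.group(3):
        break
    pos = m.end(0)

for quote, field, separator in steps:
    print(repr(quote), repr(field), repr(separator))
```

The last tuple has an empty separator, which is the condition that ends the loop.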

9

Sounds like you want the csv module.
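For reference, a small sketch of what the standard csv module does with the example line from the question. `csv.reader` accepts a single `quotechar`, so, as far as this sketch shows, only one of the two quoting styles is honoured per pass; the variable names are illustrative.

```python
import csv

line = "hey,hello,,\"hello,world\",'hey,world'"

# Default dialect: only the double quote is treated as a quote character.
double_only = next(csv.reader([line]))

# Same line, this time telling the reader that the single quote is the quote character.
single_only = next(csv.reader([line], quotechar="'"))

print(double_only)
print(single_only)
```

In both runs one of the quoted fields is split at its inner comma and keeps its quote characters, so neither call alone yields the five fields the question asks for.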

+1

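-1 Sounds like 10 people (at the time of writing) who didn't read the fine print: TWO quote characters, e.g. `"hello,world",'hey,world'` - the csv module won't do that. – 2011-02-13 07:01:13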

+1

@John: We may disagree on a lot of things, but I have a feeling we agree that the voting system here sometimes has its, er, weaknesses... – 2011-02-13 09:35:45

2

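The csv module won't handle the case where both " and ' act as quote characters at the same time. Absent a module that provides that kind of dialect, you have to get into the parsing business. To avoid depending on a third-party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the pattern that matched.

The following code, when run as a script, passes all the tests shown, with Python 2.7 and 2.2: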

import re 

# lexical token symbols 
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5) 

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED), 
    (r"'[^']*'", SQUOTED), 
    (r",", COMMA), 
    (r"$", NEWLINE), # matches end of string OR \n just before end of string 
    (r"[^,\n]+", UNQUOTED), # order in the above list is important 
    ) 
_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')', 
    ).match 
_toktype = [None] + [i[1] for i in _pattern_tuples] 
# need dummy at start because re.MatchObject.lastindex counts from 1 

def csv_split(text): 
    """Split a csv string into a list of fields. 
    Fields may be quoted with " or ' or be unquoted. 
    An unquoted string can contain both a " and a ', provided neither is at 
    the start of the string. 
    A trailing \n will be ignored if present. 
    """ 
    fields = [] 
    pos = 0 
    want_field = True 
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields 

if __name__ == "__main__": 
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
        )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
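
The lastindex trick used above can be seen in isolation in the following minimal sketch, written for Python 3. It only tokenizes and does not build fields; the names `TOKEN_NAMES`, `token_re` and `tokenize` are hypothetical and chosen for this illustration, and the end-of-line pattern is left out for brevity.

```python
import re

# Hypothetical token names; index 0 is a dummy because lastindex counts from 1.
TOKEN_NAMES = [None, "DQUOTED", "SQUOTED", "COMMA", "UNQUOTED"]

# One capturing group per token type, in the same order as TOKEN_NAMES[1:].
token_re = re.compile(r"""("[^"]*")|('[^']*')|(,)|([^,\n]+)""")


def tokenize(text):
    """Return a list of (token_name, matched_text) pairs."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = token_re.match(text, pos)
        if m is None:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        # lastindex is the number of the group that matched,
        # which maps directly to the token type.
        tokens.append((TOKEN_NAMES[m.lastindex], m.group(0)))
        pos = m.end(0)
    return tokens


tokens = tokenize("a,\"b,c\",'d'")
for name, value in tokens:
    print(name, repr(value))
```

Because the groups are top-level alternatives, exactly one of them takes part in each match, so `m.lastindex` identifies the token type without testing every group in turn.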
2

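I cobbled together something like this. It is probably quite redundant, but it does the job for me. You will have to adapt it a bit to your specification: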

import re

def csv_splitter(line):
    # Offsets at which to cut: the start, every comma outside double quotes, the end.
    splitthese = [0]
    splitted = []
    splitpos = True  # True while we are outside a double-quoted section
    for nr, i in enumerate(line):
        if i == "\"":
            splitpos = not splitpos
        if i == "," and splitpos:
            splitthese.append(nr)
    splitthese.append(len(line) + 1)
    for i in range(len(splitthese) - 1):
        # Drop the leading comma of each slice and every double quote.
        splitted.append(re.sub("^,|\"", "", line[splitthese[i]:splitthese[i + 1]]))
    return splitted
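
One possible adaptation to the question's requirement (both " and ' as quote characters) is sketched below. It is a fresh character-by-character scanner rather than a patch of the function above; the name `split_quoted` and its `quotes` parameter are hypothetical. It assumes quote characters are never meant to be kept: they are dropped wherever they appear, and an unterminated quote swallows the rest of the line.

```python
def split_quoted(line, quotes="\"'"):
    """Split on commas that are not inside a section quoted with any char in `quotes`."""
    fields = []
    current = []        # characters of the field being built
    open_quote = None   # the quote character we are currently inside, if any
    for ch in line:
        if open_quote:
            if ch == open_quote:
                open_quote = None      # closing quote: leave the quoted section
            else:
                current.append(ch)     # commas and the other quote char are kept
        elif ch in quotes:
            open_quote = ch            # opening quote: remember which one
        elif ch == ",":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))    # last field (possibly empty)
    return fields


print(split_quoted("hey,hello,,\"hello,world\",'hey,world'"))
```

Tracking which quote character opened the section is what lets a field such as `'"starts with "...'` keep its inner double quotes.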