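Strange behavior when parsing a tab-separated file in Python

I am parsing a tab-separated file in which the first element of each line is a Twitter hashtag and the second element is the tweet text.

My input file looks like this: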

#trumpisanabuser of young black men . calling for the execution of the innocent !url " 
#centralparkfiv of young black men . calling for the execution of the innocent !url " 
#trumppence16 " 
#trumppence16 " 
#america2that @user "

My code filters out duplicate content, such as retweets, by checking whether the second tab-separated element has already been seen.

import sys
import csv

tweetfile = sys.argv[1]
tweetset = set()  # tweet texts already seen
taglines = []     # kept "hashtag<TAB>tweet" lines (not defined in the original snippet)
with open(tweetfile, "rt") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print("hashtag: " + str(row[0]) + "\t" + "tweet: " + str(row[1]))
        row[1] = row[1].replace("\\ n", "").rstrip()
        if row[1] in tweetset:
            continue  # skip duplicates such as retweets
        # Drop placeholders and punctuation to see whether any real text is left
        temp = row[1].replace("!url", "")
        temp = temp.replace("@user", "")
        temp = "".join([c if c.isalnum() else "" for c in temp])
        if temp:
            taglines.append(row[0] + "\t" + row[1])
        tweetset.add(row[1])

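However, the parsing is strange. When I print each parsed item, the output looks like the following. Can anyone explain why the parsing breaks and causes this line to be printed (hashtag: #trumppence16 tweet:, then a newline, then #trumppence16)?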

hashtag: #centralparkfive tweet: of young black men . calling for the execution of the innocent !url " 
hashtag: #trumppence16 tweet: 
#trumppence16 
hashtag: #america2that tweet: @user "

Source

2017-01-03 pandagrammer

You must have unclosed quotes in your file. – e4c5

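Your tweet lines contain " characters. CSV lets a column be quoted by wrapping its value in " characters, and a quoted value may include newlines. Everything from an opening " to the next closing " is treated as a single column value, which is why two of your input lines were merged into one row. Passing quoting=csv.QUOTE_NONE to the reader turns this quote handling off: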

reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
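To see the difference without the original file, here is a minimal sketch. The SAMPLE string and the parse helper are made up for illustration and only mimic the question's input, with real tab characters assumed between the two columns.

```python
import csv
import io

# Hypothetical sample mimicking the question's input: two columns separated
# by a tab, with a stray double quote in each tweet.
SAMPLE = (
    '#trumppence16\t"\n'
    '#trumppence16\t"\n'
    '#america2that\t@user "\n'
)


def parse(text, **reader_kwargs):
    """Return every row csv.reader yields for tab-separated text."""
    stream = io.StringIO(text, newline="")
    return list(csv.reader(stream, delimiter="\t", **reader_kwargs))


# Default dialect: a field that starts with " is a quoted field, so it runs
# across the line break until the next ", swallowing the second line.
default_rows = parse(SAMPLE)

# QUOTE_NONE: the " character is ordinary data, so each line is one row.
raw_rows = parse(SAMPLE, quoting=csv.QUOTE_NONE)

print(len(default_rows))  # 2 rows: the first two lines were merged
print(len(raw_rows))      # 3 rows: one per input line
```

Note that the third line parses the same way in both modes: its " does not come first in the field, so the default dialect leaves it alone.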

Source

2017-01-03 07:59:31

Oh my goodness, this solved it: setting the quoting option to csv.QUOTE_NONE disables quote handling. Thank you!!!!! – pandagrammer
