2017-05-31 35 views

How do I stem each row in a CSV file?

I have a CSV file with two columns that contain sentences. For example, Test.csv:

Col[1] 
---------------------- 
This trip was amazing. 

Col[2] 
-------------------- 
The cats are playing. 

So I ran some NLP processing on it:

import codecs
import csv
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

with codecs.open('test.csv', 'r', encoding='utf-8', errors='ignore') as myfile:
    data = csv.reader(myfile, delimiter=',')
    next(data)  # skip the header row
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    for row in data:
        word_tokens1 = word_tokenize(row[1].lower())
        word_tokens2 = word_tokenize(row[2].lower())
        # keep only tokens made up entirely of letters
        remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]", " ", w)]
        remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]", " ", w)]
        # drop stop words
        list1 = [w for w in remo1 if w not in stops]
        list2 = [w for w in remo2 if w not in stops]
        for w in list1:
            l = stemmer.stem(w)
            print(l)
        for w in list2:
            l2 = stemmer.stem(w)
            print(l2)

My problem is with the stemming step. When I print the result, I get:

trip 
amazi 
cat 
play 

It prints every word on its own line. How can I get a sentence back for each column after stemming, like this:

Col[1]: 
------------------- 
trip amazi 

Col[2]: 
------------------- 
cat play 

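Can you show a sample of the file? I'm wondering why you are using the csv package. As far as I can tell, you only care about rows. In a CSV file, columns are separated by commas and rows are separated by newlines. – MAZDAK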


It shows up in a different color, sorry. I wrote it as code. –


So each line looks like "This trip was amazing, The cats are playing"? – MAZDAK

Answer


Here is a modified version of your code that produces the desired output. The most important change is to replace

for w in list1:
    l = stemmer.stem(w)
    print(l)
for w in list2:
    l2 = stemmer.stem(w)
    print(l2)

with

stemmed_first = ""
c = 0
for w in list1:
    if c < len(list1)-1:
        stemmed_first += stemmer.stem(w) + " "
    else:
        stemmed_first += stemmer.stem(w)
    c += 1

and to do the same for list2. I also made a few other small changes to your code:

import csv
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

# Python 3: open CSV files in text mode with newline=''
with open('test.csv', newline='', encoding='utf-8') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')

    for row in spamreader:
        if len(row) >= 2:
            # lowercase first so capitalized stop words ("This", "The") match the stop list
            word_tokens1 = nltk.tokenize.word_tokenize(row[0].lower())
            word_tokens2 = nltk.tokenize.word_tokenize(row[1].lower())
            # keep only tokens made up entirely of letters
            remo1 = [w for w in word_tokens1 if w in re.sub("[^a-zA-Z]", " ", w)]
            remo2 = [w for w in word_tokens2 if w in re.sub("[^a-zA-Z]", " ", w)]
            list1 = [w for w in remo1 if w not in stops]
            list2 = [w for w in remo2 if w not in stops]

            # build the first sentence, with spaces between stems only
            stemmed_first = ""
            c = 0

            for w in list1:
                if c < len(list1)-1:
                    stemmed_first += stemmer.stem(w) + " "
                else:
                    stemmed_first += stemmer.stem(w)
                c += 1

            # build the second sentence the same way
            stemmed_second = ""
            c = 0

            for w in list2:
                if c < len(list2)-1:
                    stemmed_second += stemmer.stem(w) + " "
                else:
                    stemmed_second += stemmer.stem(w)
                c += 1

            print(stemmed_first)
            print(stemmed_second)
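
As a side note, the counter-based loops that put spaces between the stems can probably be collapsed into a single `" ".join(...)` call. The following is only a minimal sketch of that idea, not a drop-in replacement: `stem_sentence`, `stem_rows`, `toy_stem` and `toy_stops` are hypothetical names, a simple regex stands in for `word_tokenize`, and a toy suffix-stripper stands in for the Porter stemmer so the snippet runs without NLTK or its data files.

```python
import csv
import io
import re


def stem_sentence(text, stem, stops):
    """Lowercase, keep alphabetic tokens, drop stop words, stem, and rejoin."""
    # Assumption: a simple regex stands in for nltk's word_tokenize here.
    tokens = re.findall(r"[a-z]+", text.lower())
    # str.join places a single space between stems and none at the end.
    return " ".join(stem(w) for w in tokens if w not in stops)


def stem_rows(lines, stem, stops):
    """Yield one list of stemmed sentences per CSV row, one entry per column."""
    for row in csv.reader(lines):
        yield [stem_sentence(cell, stem, stops) for cell in row]


def toy_stem(word):
    # Hypothetical stand-in stemmer: strips a trailing "ing" or "s".
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# In-memory sample row mirroring the two columns from the question.
sample = io.StringIO("This trip was amazing.,The cats are playing.\n")
toy_stops = {"this", "was", "the", "are"}  # tiny illustrative stop list

result = list(stem_rows(sample, toy_stem, toy_stops))
for first, second in result:
    print(first)
    print(second)
```

With NLTK installed, passing `PorterStemmer().stem` as `stem` and `set(stopwords.words("english"))` as `stops` should give the real stems. Note that the regex tokenizer splits contractions differently from `word_tokenize`, so swap it back in if that matters for your data.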