Encoding issue with NLTK

I'm scraping a very "right-wing" website for research on hate and racism detection, so the content of my test sample may be offensive.

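I'm trying to remove stopwords and punctuation in Python using NLTK, but I've run into an encoding problem. I'm using Python 2.7, and the data comes from a file that I filled with articles from the website I crawled: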

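import nltk

# Build the English stopword set once
stop_words = set(nltk.corpus.stopwords.words("english"))
# data: dict of articles loaded from the crawl file
for key, value in data.iteritems():
    print type(value), value
    # Lowercase and tokenize the article text
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break  # inspect only the first article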

The output looks like this (I added ... to shorten the sample):

<type 'str'> A Negress Bernie ... they’re not going to take it anymore. 

['a', 'negress', 'bernie', ... , 'they\u2019re', 'not', 'going', 'to', 'take', 'it', 'anymore', '.']

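I don't understand why this "\u2019" is there; it shouldn't be. Can someone tell me how to get rid of it? I tried encoding with UTF-8, but I still have the same problem.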


2016-11-30 mel

'\u2019' is the Unicode character [RIGHT SINGLE QUOTATION MARK](http://unicode.org/cldr/utility/character.jsp?a=2019). If you don't have many different problem characters, you can simply [fix your strings](http://stackoverflow.com/questions/24358361/removing-u2018-and-u2019-character). – alexis
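
The simple fix alexis describes could be sketched as below. This is only an illustration: it is written for Python 3 rather than the Python 2.7 used in the question, and the names `QUOTE_MAP` and `normalize_quotes` are hypothetical helpers, not part of NLTK.

```python
import unicodedata

# Hypothetical mapping (an assumption for this sketch):
# typographic quotes -> plain ASCII quotes.
QUOTE_MAP = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
}


def normalize_quotes(text):
    """Replace curly quotes with ASCII quotes (illustrative helper)."""
    return text.translate(QUOTE_MAP)


sample = "they\u2019re not going to take it anymore."
# Look up the official Unicode name of the character in question.
print(unicodedata.name("\u2019"))
# Print the sample with the curly apostrophe replaced by an ASCII one.
print(normalize_quotes(sample))
```

Mapping the quote to an ASCII apostrophe keeps "they're" readable as a contraction, which simply dropping the character would not.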

import nltk

stop_words = set(nltk.corpus.stopwords.words("english"))
for key, value in data.iteritems():
    print type(value), value
    # Drop non-ASCII characters by encoding with the 'ignore' error handler
    value = value.encode('ascii', 'ignore')
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break


2016-11-30 17:03:13

Thanks :) I switched 'ignore' to 'replace'; otherwise I would end up with 'theyre'. Then I can remove the '?' with string.punctuation. – mel
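
The difference between the two error handlers, and the follow-up punctuation filter mel mentions, might look roughly like this. It is a Python 3 sketch under stated assumptions: `simple_tokenize` is a hypothetical regex stand-in for `nltk.word_tokenize`, used only so the example runs without NLTK data, and its tokenization will not match NLTK's in general.

```python
import re
import string

text = "they\u2019re not going to take it anymore."

# 'ignore' silently drops the unencodable character, fusing the word.
ignored = text.encode("ascii", "ignore").decode("ascii")

# 'replace' substitutes '?', so the apostrophe's position is preserved.
replaced = text.encode("ascii", "replace").decode("ascii")


def simple_tokenize(s):
    """Hypothetical stand-in for nltk.word_tokenize (assumption)."""
    # Runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", s)


# Drop tokens that are a single ASCII punctuation mark, including '?'.
punct = set(string.punctuation)
tokens = [t for t in simple_tokenize(replaced.lower()) if t not in punct]

print(ignored)
print(replaced)
print(tokens)
```

With 'replace', the placeholder becomes its own token and can be filtered out, leaving the two halves of the contraction as separate tokens instead of one fused word.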

I like the topic of your task, keep going –

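This is not good advice. Before you even process the text, you should have determined the encoding of the site you are scraping and set your scraper to that encoding. If everything is UTF-8, comparing strings in Python 3 would make more sense and cause you less pain. – alvas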
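
What alvas suggests, decoding once with a known encoding at the point where the bytes arrive, could be sketched in Python 3 as follows. The helper `decode_page` and the assumption that the site serves UTF-8 are illustrative only; the actual encoding should be taken from the site itself.

```python
# Bytes as they might arrive from the network; assumed here to be UTF-8.
raw = "they\u2019re not going to take it anymore.".encode("utf-8")


def decode_page(raw_bytes, encoding="utf-8"):
    """Decode scraped bytes once, at the boundary (illustrative helper)."""
    return raw_bytes.decode(encoding)


article = decode_page(raw)

# Decoding with the wrong codec does not fail here, it yields mojibake,
# which is why the encoding has to be known rather than guessed.
wrong = raw.decode("cp1252")

print(type(article).__name__)
print("\u2019" in article)
print(wrong == article)
```

Once the text is a proper `str`, the apostrophe is an ordinary character that can be compared, replaced, or tokenized without any further encoding steps.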
