How do I format tweets collected through the Twitter API using Python?

I collected some tweets through the Twitter API. Then I used split(' ') in Python to count the words. However, some of the words come out like this:

correct! 
correct. 
,correct 
blah" 
...

So how can I format the tweets so that they contain no punctuation? Or maybe I should try another way to split the tweets? Thanks.

Source

2013-05-12 zfz

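Any simple solution along the lines of import re plus punctuation will undoubtedly mangle emoticons and other special character sequences. If you care about that, you should consider using a tokenizer built for tweets. – Jared 2013-05-12 09:15:51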

You can use re.split to split on multiple characters...

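import re
from string import punctuation

# Example input; substitute the text of a tweet fetched from the API.
your_tweet = 'correct! correct. ,correct blah"'
# Character class matching any ASCII punctuation mark or whitespace character.
puncrx = re.compile(r'[{}\s]'.format(re.escape(punctuation)))
# Adjacent separators produce empty strings, so drop them.
print([word for word in puncrx.split(your_tweet) if word])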

Alternatively, just find the words, defined as runs of certain consecutive characters:

print(re.findall(r'[\w#@]+', your_tweet))

For example:

print(re.findall(r'[\w@#]+', 'talking about #python with @someone is so much fun! Is there a  140 char limit? So not cool!'))
# ['talking', 'about', '#python', 'with', '@someone', 'is', 'so', 'much', 'fun', 'Is', 'there', 'a', '140', 'char', 'limit', 'So', 'not', 'cool']

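I originally had a smiley in this example, but of course smileys end up being filtered out by this approach, so that is something to be wary of.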
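If the smileys matter, one option is to put an emoticon alternative ahead of the word pattern. This is only a rough sketch: the emoticon set below is a small hand-picked assumption, and the names TOKEN_RX and tokenize are made up for illustration. It is not a substitute for a proper tweet tokenizer.

```python
import re

# Assumption: a tiny, hand-picked set of ASCII emoticons; real tweets use many more.
TOKEN_RX = re.compile(r"""
    [:;=][-']?[()DPp]     # simple emoticons such as :) ;-) :D =P
  | [#@]?\w+              # words, optionally prefixed as #hashtag or @mention
""", re.VERBOSE)

def tokenize(tweet):
    """Return the words, hashtags, mentions and simple emoticons in a tweet."""
    return TOKEN_RX.findall(tweet)

tokens = tokenize("talking about #python with @someone is so much fun :) So cool!")
print(tokens)
```

Because the emoticon alternative is tried first, ":)" survives as its own token while a lone "!" is still dropped. The pattern can misfire on text such as "note:Done", so treat it as a starting point.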

Source

2013-05-12 09:10:45

Try removing the punctuation from the string before splitting.

import string
s = "Some nice sentence. This has punctuation!"
# Python 3: the third argument of str.maketrans lists the characters to delete.
out = s.translate(str.maketrans("", "", string.punctuation))

Then split out.
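Since the goal in the question is counting words, the strip-then-split idea could be wrapped up roughly like this. It is a minimal sketch: count_words is a hypothetical helper name, and lowercasing the text is my own assumption about how the words should be grouped.

```python
import string
from collections import Counter

def count_words(tweet):
    """Count words after deleting ASCII punctuation (hypothetical helper)."""
    # Delete every ASCII punctuation character from the tweet.
    cleaned = tweet.translate(str.maketrans("", "", string.punctuation))
    # Assumption: case does not matter, so lowercase before counting.
    # split() with no argument splits on any run of whitespace.
    return Counter(cleaned.lower().split())

counts = count_words('correct! correct. ,correct blah"')
print(counts)
```

Note that string.punctuation includes "#" and "@", so hashtags and mentions lose their prefix with this approach.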

Source

2013-05-12 09:09:59 Steve

I would suggest using this code to strip special symbols from the text before splitting:

tweet_object["text"] = re.sub(u'[@#$.,:\u2026]', '', tweet_object["text"])

You need to import re to use the sub function:

import re

Source

2013-05-12 09:42:59 rvnikita
