Python: counting word tokens in a sentence

I am trying to count the number of words in a string. However, I first have to remove some punctuation. For example, given

line = "i want you , to know , my name . "

running

en = line.translate(string.maketrans('', ''), '!,.?')

produces

en = "i want you  to know  my name  "
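The translate call above is the Python 2 form. As a rough sketch for Python 3, assuming the goal is only to delete those four punctuation characters, `str.maketrans` accepts the characters to delete as its third argument (the name `strip_punct` is made up for illustration):

```python
line = "i want you , to know , my name . "

# Build a table that maps '!', ',', '.', '?' to None so translate() drops them.
strip_punct = str.maketrans('', '', '!,.?')
en = line.translate(strip_punct)

# repr() makes the leftover double and trailing spaces visible.
print(repr(en))
```

Note that the spaces that surrounded each punctuation mark are still there, which matters for the counting below.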

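After this, I want to count the number of words in the line. But when I do len(en), I get 30 instead of 7.

Using split on en to tokenize it and then taking the length does not work in all cases. For example, consider this string: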

"i ccc bcc the a of the abc ccc dd on aaa , 28 abc 19 ."

After removing the punctuation, the line becomes:

"i ccc bcc the a of the abc ccc dd on aaa  28 abc 19 "

But len(en), taken after the split, returns 17 instead of 15.

Could you please help? Thanks

Source

2011-11-07 Duke

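The problem with en.split(' ') is that your string contains extra spaces, and each of them produces an empty string in the result. You can fix this by calling en.split() with no argument instead.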
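A small sketch of the difference, using the second string from the question with its punctuation already removed (so a double space after "aaa" and one trailing space are assumed):

```python
en = "i ccc bcc the a of the abc ccc dd on aaa  28 abc 19 "

# split(' ') cuts at every single space, so doubled or trailing spaces yield '' entries.
with_empties = en.split(' ')
# split() with no argument cuts on runs of whitespace and ignores leading/trailing ones.
words = en.split()

print(len(with_empties))  # 17
print(len(words))         # 15
```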

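Alternatively, you could take a different approach using a regular expression (with this there is no need to remove the punctuation first):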

import re
print(len(re.findall(r'\w+', line)))

See it working online: ideone
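For illustration, here is the same idea applied to both strings from the question in Python 3. The variable name `other` and the "don't stop" input are made up for this sketch:

```python
import re

line = "i want you , to know , my name . "
other = "i ccc bcc the a of the abc ccc dd on aaa , 28 abc 19 ."

# \w+ matches runs of letters, digits and underscores; punctuation and spaces are skipped.
print(len(re.findall(r'\w+', line)))   # 7
print(len(re.findall(r'\w+', other)))  # 15

# Caveat: an apostrophe is not a word character, so "don't" becomes two tokens.
tokens = re.findall(r'\w+', "don't stop")
print(tokens)  # ['don', 't', 'stop']
```

Whether splitting on apostrophes is acceptable depends on what you want to count as a word.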

Source

2011-11-07 00:58:07

Perfect. Thanks! – Duke

@Adinoyi Please remember to accept the best answer using the green checkmark.... – Dougal

Thanks Dougal. Done. – Duke

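The len function returns the length of its argument. In this case that is the length of the string, which is 30 characters. To count words, you need to split the string on whitespace and then count the number of items returned.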
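A minimal sketch of that distinction, assuming en is the first string from the question after its punctuation has been removed:

```python
en = "i want you  to know  my name  "

print(len(en))          # 30: number of characters, spaces included
print(len(en.split()))  # 7: number of whitespace-separated words
```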

Source

2011-11-07 00:51:53 slugonamission

Have a look at the introductory example in the documentation for collections.Counter. It shows how to find the individual words in a text.
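A sketch along those lines, applied to the string from the question. Tokenizing with `\w+` is an assumption here, chosen because it needs no punctuation clean-up first:

```python
from collections import Counter
import re

line = "i ccc bcc the a of the abc ccc dd on aaa , 28 abc 19 ."

# Tokenize first, then let Counter tally how often each token occurs.
counts = Counter(re.findall(r'\w+', line))

print(sum(counts.values()))  # total number of tokens: 15
print(len(counts))           # number of distinct tokens: 12
print(counts['the'])         # 2
```

Counter is more than is needed for a plain total, but it is handy if per-word frequencies are wanted as well.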

Source

2011-11-07 01:17:15

Instead of using the regular expression \w+, it is faster to count the words using \b, like this:

import re 
_re_word_boundaries = re.compile(r'\b') 

def num_words(line): 
    return len(_re_word_boundaries.findall(line)) >> 1

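Note that we have to halve the count, because \b matches at both the start and the end of each word. Unfortunately, unlike egrep, Python does not support matching only at the start or only at the end of a word.

If you have very long lines and care about memory, using an iterator may be a better solution: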

def num_words(line): 
    return sum(1 for word in _re_word_boundaries.finditer(line)) >> 1
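As a quick cross-check in Python 3, the sketch below compares the boundary count with the `\w+` count on the two strings from the question. It uses `// 2` in place of `>> 1` (same result for non-negative integers), and the list name `samples` is made up for illustration:

```python
import re

_re_word_boundaries = re.compile(r'\b')

def num_words(line):
    # Every word contributes two boundaries (start and end), so halve the total.
    return sum(1 for _ in _re_word_boundaries.finditer(line)) // 2

samples = [
    "i want you , to know , my name . ",
    "i ccc bcc the a of the abc ccc dd on aaa , 28 abc 19 .",
]
for s in samples:
    # Both approaches should agree on the number of words.
    print(num_words(s), len(re.findall(r'\w+', s)))
```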

Source

2011-11-07 10:23:15 Cito

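def main():
    # Ask the user for a message and report how many words it contains.
    print "this program tells you how many words are in your sentence."
    message = raw_input("Enter message: ")

    # split() with no argument splits on runs of whitespace,
    # so extra spaces do not produce empty strings.
    wrdcount = 0
    for i in message.split():
        wrdcount = wrdcount + 1  # was len(i)/len(i), which is always 1
    print wrdcount


main()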

Source

2012-12-18 19:24:42

You can use NLTK:

import nltk 
en = "i ccc bcc the a of the abc ccc dd on aaa 28 abc 19 " 
print(len(nltk.word_tokenize(en)))

Output:

15

Source

2015-08-11 23:56:32
