2015-10-06 59 views
0

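Count occurrences of elements from a list in a string?

I'm trying to count how many times spoken contractions appear in some speeches I've collected. One particular speech looks like this: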

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

So, in this case, I'd like to count four (4) contractions. I have a list of contractions, and here are the first few terms:

contractions = {"ain't": "am not; are not; is not; has not; have not", 
"aren't": "are not; am not", 
"can't": "cannot",...} 

My code looks like this, to begin with:

count = 0 
for word in speech: 
    if word in contractions: 
        count = count + 1
print count 

I'm not getting anywhere with this, however, because the code iterates over every letter instead of over whole words.

+5

for word in speech.split(' '): – Monkpit

+0

I don't get what the values in your dictionary are for. By the way, you have a dict, not a list –

+0

I've added a lot to my answer that should give you some extras. – colidyre

Answers

5

Use str.split() to split the string on whitespace:

for word in speech.split(): 

This splits on arbitrary whitespace: spaces, tabs, newlines and a few more exotic whitespace characters, in any number of consecutive occurrences.

You may want to lowercase your words with str.lower() (otherwise Ain't won't be found, for example) and strip punctuation:

from string import punctuation 

count = 0 
for word in speech.lower().split(): 
    word = word.strip(punctuation) 
    if word in contractions: 
        count += 1

I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
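Put together, the approach can be sketched as a self-contained script. The three-entry `contractions` dict and the function name `count_contractions` are assumptions made for illustration; the real dict in the question is much longer.

```python
from string import punctuation

# Illustrative subset only; the values are not used for counting.
contractions = {
    "i've": "I have",
    "we're": "we are",
    "you've": "you have",
}

speech = (
    "I've changed the path of the economy, and I've increased jobs in our own "
    "home state. We're headed in the right direction - you've all been a great help."
)


def count_contractions(text, known):
    """Count whitespace-separated words that are keys of `known`."""
    count = 0
    for word in text.lower().split():
        # strip() only trims the ends, so the inner apostrophe survives
        word = word.strip(punctuation)
        if word in known:
            count += 1
    return count


print(count_contractions(speech, contractions))
```

For the speech in the question this counts 4, because "economy," and "state." lose their trailing punctuation while the apostrophe inside "i've" is left alone.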

1

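You are iterating over a string, so the items are characters. To get the words out of a string you can use a naive method such as str.split(), which builds a list of strings that you can then iterate over (the words, split on the argument of str.split(); by default it splits on whitespace). There is also re.split(), which is more powerful, but I don't think you need a regular expression to split the text here.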

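What you have to do, at the very least, is either lowercase the string with str.lower() or put every possible spelling (including capitalized ones) into the dict. I strongly recommend the first alternative; the latter is not really practical. Removing punctuation is also a must. This is still naive, though. If you need a more sophisticated approach, you have to split the text with a word tokenizer. NLTK is a good starting point; see the nltk tokenizer. But I strongly believe this is not your main problem, nor does it really affect solving your question. :)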

speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help.""" 
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter. 
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ... 

# with re you can define advanced regexes, but maybe 
# from string import punctuation (suggestion from Martijn Pieters answer 
# is still enough for you) 
import re 

def abbreviation_counter(input_text, abbreviation_dict):
    count = 0
    # What you want is a list of words; str.split() does this job for you.
    # Splitting on " " is explicit here; with no argument, split() breaks on
    # any run of whitespace. If you really need better methods (see the
    # answer text above), use a word tokenizer tool or write your own.
    for word in input_text.split(" "):
        # Also clean the word (remove ',', ';', ...) afterwards. The advantage of
        # using re over `from string import punctuation` is that you have more
        # control over what you remove: you can easily add or drop any
        # punctuation mark. That can be very handy. It can also be overkill;
        # if so, just stick to Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1

    return count

print(abbreviation_counter(speech, contractions))
# 2  (yeah, it worked - I've included I've in your list :)
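If trimming punctuation by hand gets unwieldy but a full tokenizer feels like too much, a regular expression can also pull the words out directly. This is only a rough sketch under my own assumptions: the pattern, the name `tally_contractions` and the shortened sample text are made up for illustration, and only the straight ASCII apostrophe is handled.

```python
import re
from collections import Counter

# Letters, optionally followed by apostrophe-plus-letters groups ("i've", "can't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tally_contractions(text, known):
    """Return a Counter mapping each known contraction to its number of occurrences."""
    words = WORD_RE.findall(text.lower())
    return Counter(word for word in words if word in known)


contractions = {"i've": ["i have"], "we're": ["we are"], "you've": ["you have"]}
speech = "I've changed the path, and I've increased jobs. We're headed on - you've helped."

tally = tally_contractions(speech, contractions)
print(tally)                # per-contraction counts
print(sum(tally.values()))  # total number of contractions
```

Keeping a per-contraction tally rather than a single number makes it easy to see which contractions a speaker favours, and the total is still one `sum()` away.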

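It's a little frustrating to answer at the same time as Martijn Pieters ;), but I hope I've still produced some value for you. That's why I edited my answer to add some hints for future work.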

+0

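Thanks for the input, but I've already moved on from this problem. Your solution did work, though! I just didn't want to go back and reformat my entire 'contractions' dictionary :) – blacksite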

+0

Yes, it was only a suggestion. If it helped in any way, I'd be happy to get some credit for my work. :) – colidyre

+0

I've got you :) – blacksite

0

A for loop in Python iterates over all the elements of an iterable. In the case of a string, the elements are characters.
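A quick way to see this for yourself (a small illustration; the short sample string and variable names are made up):

```python
speech = "I've changed"

# Iterating a string yields one-character strings, never whole words.
chars = [item for item in speech]
print(chars[:4])

# Splitting first yields the words.
words = speech.split()
print(words)
```

Since no single character is a key in the contractions dict, the loop in the question never increments count.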

You need to split the string into a list (or tuple) of strings containing the words. You can use .split(delimiter) for that.

Your problem is quite common, so Python has a shortcut: speech.split() splits on any number of spaces/tabs/newlines, so you get only your words in the list.

So your code should look like this:

count = 0 
for word in speech.split(): 
    if word in contractions: 
        count = count + 1
print(count) 

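speech.split(" ") works too, but it splits only on spaces, not on tabs or newlines, and if there are double spaces you will get empty elements in your result list.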
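The difference is easy to see on a string with messy whitespace (the sample text below is made up for illustration):

```python
# Note the double space, the tab and the newline.
text = "We're  headed\tin the\nright direction"

default_split = text.split()     # any run of whitespace is one separator
space_split = text.split(" ")    # every single space is a separator

print(default_split)
print(space_split)
```

With split(" ") the double space produces an empty string, and "headed\tin" and "the\nright" stay glued together, so none of those pieces would ever match a key in contractions.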
