How to count the number of words between 2 predefined words?

Given the following text:

<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>

How can I count the exact number of words between <replace-add> and </replace-add> in the text?

Source

2017-10-13 Tim

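So do you mean all whitespace-separated strings that occur between those tags? For clarity, could you include the expected sample output? Also, try formatting the code by indenting it with four spaces. Can we assume the tags will always appear exactly like this, or can they have attributes? –

For "that i dont know you know cause" the output would be 7. Also note that the text will contain other tags as well, but the example above should be enough. – Tim

Without using any libraries: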

def get_tag_indexes(text, tag, start_tag):
    # Collect the position of every occurrence of tag in text.
    # For a start tag, record the index just past the tag, so that
    # text[start:end] covers only the enclosed words.
    tag_indexes = []
    start_index = -1

    while True:
        start_index = text.find(tag, start_index + 1)

        if start_index != -1:
            if start_tag:
                tag_indexes.append(start_index + len(tag))
            else:
                tag_indexes.append(start_index)
        else:
            return tag_indexes

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

tag_starts = get_tag_indexes(text, "<replace-add>", True)
tag_ends = get_tag_indexes(text, "</replace-add>", False)

# Pair each opening position with the matching closing position.
for start, end in zip(tag_starts, tag_ends):
    words = text[start:end].split()
    print("{} words - {}".format(len(words), words))

This gives you:

7 words - ['that', 'i', 'dont', 'know', 'you', 'know', 'cause'] 
1 words - ['us'] 
1 words - ['from'] 
2 words - ['clear', 'dire']

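This uses a function that returns a list of the positions of a given tag in the text. Those positions can then be used to extract the text between a pair of tags.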
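The same idea can be folded into a single helper that needs no imports. This is only a minimal sketch and not part of the original answer: the name `count_words_between` and its parameters are made up for illustration, and it assumes the tags are not nested and carry no attributes.

```python
def count_words_between(text, open_tag, close_tag):
    """Return the word count of each open_tag...close_tag span, in order."""
    counts = []
    position = 0
    while True:
        start = text.find(open_tag, position)
        if start == -1:
            break
        start += len(open_tag)  # skip past the opening tag itself
        end = text.find(close_tag, start)
        if end == -1:
            break  # unmatched opening tag: ignore the remainder
        counts.append(len(text[start:end].split()))
        position = end + len(close_tag)
    return counts


sample = "<replace-add>that i dont know you know cause</replace-add> i could help <replace-add>us</replace-add>"
print(count_words_between(sample, "<replace-add>", "</replace-add>"))  # [7, 1]
```

Searching for the closing tag only after each opening tag keeps the pairs aligned even if a stray closing tag appears earlier in the text.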

As an alternative, this could also be done using BeautifulSoup:

from bs4 import BeautifulSoup

text = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""
soup = BeautifulSoup(text, "lxml")

# Each <replace-add> element yields its enclosed text, split on whitespace.
for block in soup.find_all('replace-add'):
    words = block.text.split()
    print("{} words - {}".format(len(words), words))

Source

2017-10-13 09:34:40

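Hey Martin, I am not supposed to import any libraries. – Tim

@Tim None at all?! Are you allowed to use standard library stuff? Is this a requirement of an assignment or something? –

I mean they can be used, like importing os, difflib, etc., but it is better to stay away from them unless they are essential, and no, this is not part of an assignment. – Tim

Depending on how trustworthy the source is, you can do one of two things. Given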

source = """<replace-add>that i dont know you know cause</replace-add> i could help you with <replace-del>that oh</replace-del> <replace-add>us</replace-add> thanks so i just set up a ride <replace-del>for</replace-del> <replace-add>from</replace-add> my daughter <replace-del>tenah dyer</replace-del> <replace-add>clear dire</replace-add>"""

you can use a regular expression, like this:

import re 

from itertools import chain 

word_pattern = re.compile(r"(?<=<replace-add>).*?(?=</replace-add>)") 
re_words = list(chain.from_iterable(map(str.split, word_pattern.findall(source))))

This only works if the tags in the source match exactly, with no attributes and so on.
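If the opening tag may carry attributes, the pattern can be loosened a little. This is only a sketch under assumptions not stated in the original answer: the `class` attribute in the sample is invented for illustration, tags are assumed not to be nested, and attribute values are assumed not to contain `>`.

```python
import re

# Match <replace-add>, optionally followed by whitespace and attributes,
# then lazily capture everything up to the closing tag.
# re.DOTALL lets the captured text span line breaks.
tag_pattern = re.compile(r"<replace-add(?:\s[^>]*)?>(.*?)</replace-add>", re.DOTALL)

sample = '<replace-add class="x">clear dire</replace-add> and <replace-add>us</replace-add>'
words = [word for chunk in tag_pattern.findall(sample) for word in chunk.split()]
print(len(words), words)  # 3 ['clear', 'dire', 'us']
```

Even so, a regular expression remains fragile on markup that is not well formed, which is where a real parser becomes the safer choice.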

Another option from the standard library is HTML parsing:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def get_words(self, html):
        self.read_words = False
        self.words = []
        self.feed(html)
        return self.words

    def handle_starttag(self, tag, attrs):
        if tag == "replace-add":
            self.read_words = True

    def handle_data(self, data):
        if self.read_words:
            self.words.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "replace-add":
            self.read_words = False


parser = MyParser()
html_words = parser.get_words(source)

This approach will be more reliable, and may be somewhat more efficient, since it uses a tool dedicated entirely to this task.

Now, doing

print(re_words) 
print(html_words)

we get

['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire'] 
['that', 'i', 'dont', 'know', 'you', 'know', 'cause', 'us', 'from', 'clear', 'dire']

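(Of course, the len of this list is the number of words.)

If you strictly need only the count, you can just keep a running total and add the length of data.split() for each piece of data encountered.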
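The running-total variant might look roughly like the following. It is a sketch rather than the answer's own code: the class name `WordCounter`, the method `count_words`, and the attributes `inside` and `total` are illustrative names only.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Count words inside <replace-add> elements without storing them."""

    def count_words(self, html):
        self.reset()  # clear parser state so the instance can be reused
        self.inside = False
        self.total = 0
        self.feed(html)
        self.close()  # flush any text still buffered by the parser
        return self.total

    def handle_starttag(self, tag, attrs):
        if tag == "replace-add":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "replace-add":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.total += len(data.split())


sample = "<replace-add>clear dire</replace-add> my daughter <replace-add>us</replace-add>"
print(WordCounter().count_words(sample))  # 3
```

Only an integer is kept, so memory use stays constant however many words the tagged spans contain.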

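If you really cannot do any imports, you will either have to make some sacrifices or implement your own regular expression engine / HTML parser. If this is a homework requirement, then you really should show some prior effort when posting the question.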

Source

2017-10-13 09:47:07
