Need to split #tags into text

-1

I need to split #tags into meaningful words automatically.

Sample input:

iloveusa
mycrushlike
mydadhero

Sample output:

i love usa
my crush like
my dad hero

Is there any utility or open API that I can use to achieve this?

Source

2016-07-27 scientist.rahul

Possible duplicate of [Split words on boundary](http://stackoverflow.com/questions/39781936/split-words-on-boundary) – tripleee

Check out the Word Segmentation Task from Norvig's work.

from __future__ import division
from collections import Counter
import re, nltk

# Requires the Brown corpus: run nltk.download('brown') once beforehand.
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter): 
    "Make a probability distribution, given evidence from a Counter." 
    N = sum(counter.values()) 
    return lambda x: counter[x]/N 

P = pdist(COUNTS) 

def Pwords(words): 
    "Probability of words, assuming each word is independent of others." 
    return product(P(w) for w in words) 

def product(nums): 
    "Multiply the numbers together. (Like `sum`, but with multiplication.)" 
    result = 1 
    for x in nums: 
        result *= x
    return result 

def splits(text, start=0, L=20): 
    "Return a list of all (first, rest) pairs; start <= len(first) <= L." 
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L)+1)]

def segment(text): 
    "Return a list of words that is the most probable segmentation of text." 
    if not text:
        return []
    else:
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('iloveusa'))     # ['i', 'love', 'us', 'a']
print(segment('mycrushlike'))  # ['my', 'crush', 'like']
print(segment('mydadhero'))    # ['my', 'dad', 'hero']
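
The recursive segment above re-solves the same suffixes many times and gives any unseen word a probability of 0, which turns the whole product into 0. Here is a minimal sketch of one way around both problems, assuming memoization, log probabilities, and a simple length-based penalty for unknown words. TOY_COUNTS, log_p and segment_memo are hypothetical names, and the counts are invented for illustration, not taken from the Brown corpus.

```python
import math
from collections import Counter
from functools import lru_cache

# Toy counts, invented for illustration only (not taken from any corpus).
TOY_COUNTS = Counter({
    'i': 50, 'love': 20, 'usa': 5, 'us': 15, 'a': 60,
    'my': 40, 'crush': 3, 'like': 25, 'dad': 8, 'hero': 4,
})
TOTAL = sum(TOY_COUNTS.values())

def log_p(word):
    "Log probability of a word; unseen words are penalised by their length."
    count = TOY_COUNTS[word]
    if count:
        return math.log(count / TOTAL)
    # Assumed penalty: the longer an unknown string, the less likely it is.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment_memo(text, max_len=20):
    "Return (log probability, tuple of words) for the best split of text."
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, min(len(text), max_len) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment_memo(rest, max_len)
        candidate = (log_p(first) + rest_score, (first,) + rest_words)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best

for tag in ('iloveusa', 'mycrushlike', 'mydadhero'):
    print(tag, '->', ' '.join(segment_memo(tag)[1]))
```

Summing logs instead of multiplying probabilities avoids underflow on long inputs, and the cache means each suffix is segmented only once. With real corpus counts the results will differ from this toy table.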

For a better solution than this, you can use bigrams/trigrams.

More examples at: Word Segmentation Task

Source
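
As a rough idea of what the bigram variant could look like, here is a sketch that scores each word given the previous one and falls back to the unigram estimate when the pair has not been seen. All names (UNIGRAMS, BIGRAMS, log_p_bigram, segment2) and all counts are hypothetical and invented for illustration. The fallback is deliberately crude (no discounting), so the scores are not a properly normalised distribution.

```python
import math
from functools import lru_cache

# Toy unigram and bigram counts, invented for illustration only.
UNIGRAMS = {'<s>': 100, 'i': 50, 'love': 20, 'usa': 5, 'us': 15, 'a': 60}
BIGRAMS = {('<s>', 'i'): 30, ('i', 'love'): 12, ('love', 'usa'): 3,
           ('love', 'us'): 2, ('us', 'a'): 1}
TOTAL = sum(UNIGRAMS.values())

def log_p_unigram(word):
    "Log P(word); unseen words are penalised by their length (assumed rule)."
    count = UNIGRAMS.get(word, 0)
    if count:
        return math.log(count / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

def log_p_bigram(word, prev):
    "Log P(word | prev), falling back to the unigram estimate for unseen pairs."
    pair = BIGRAMS.get((prev, word), 0)
    if pair and UNIGRAMS.get(prev, 0):
        return math.log(pair / UNIGRAMS[prev])
    return log_p_unigram(word)

@lru_cache(maxsize=None)
def segment2(text, prev='<s>', max_len=20):
    "Return (log probability, tuple of words) for the best split after prev."
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, min(len(text), max_len) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment2(rest, first, max_len)
        candidate = (log_p_bigram(first, prev) + rest_score,
                     (first,) + rest_words)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best

print(' '.join(segment2('iloveusa')[1]))
```

The cache key now includes the previous word, because the best split of a suffix can depend on what came before it. In practice the two tables would be filled from a corpus, for example by counting adjacent word pairs.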

2016-07-27 22:50:04 RAVI
