SVM - building features out of word count and context

This Python code creates features that reflect whether a given keyword is present in a particular tweet.

#get feature list stored in a file (for reuse) 
featureList = getFeatureList('data/sampleTweetFeatureList.txt') 

#start extract_features 
def extract_features(tweet): 
    tweet_words = set(tweet) 
    features = {} 
    for word in featureList: 
        features['contains(%s)' % word] = (word in tweet_words) 
    return features 
#end

And the output would look like this:

{ 
    'contains(arm)': True,    #notice this 
    'contains(articles)': False, 
    'contains(attended)': False, 
    'contains(australian)': False, 
    'contains(awfully)': False, 
    'contains(bloodwork)': True,  #notice this 
    'contains(bombs)': False, 
    'contains(cici)': False, 
    ..... 
    'contains(head)': False, 
    'contains(heard)': False, 
    'contains(hey)': False, 
    'contains(hurts)': True,   #notice this 
    ..... 
    'contains(irish)': False, 
    'contains(jokes)': False, 
    ..... 
    'contains(women)': False 
}

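Now, how do I go about building the feature vector if the feature set also includes (in addition to keyword presence, as shown above):

The word count of the given tweet
The context of special keywords such as 'earthquake'. For example, in 'Japan earthquake now', the words to the left and right of 'earthquake' are 'Japan' and 'now'.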

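EDIT: What I am trying to figure out is how to capture this information (word count and context) so as to get the vectors the SVM algorithm needs to work. Until now, what I have is a vector in a |featureList|-dimensional space. How do I extend it to include word count and context?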


2015-04-12 dharm0us

Use split() to get a list of words and len() to find the number of items in that list, which gives you the number of words in a sentence:

word_count = len(tweet.split())

When you need to store multiple values, such as your context, you can use a tuple, something like this:

features['contains(%s)' % word] = (word in tweet_words, previous_word, next_word)

So that the map looks like this:

{ 
    'contains(arm)': (True, 'broken', 'was'), 
    'contains(articles)': (False, '', ''), 
    ... 
}

And it can be enumerated like this:

for feature in features: 
    # each value is a (present, previous_word, next_word) tuple 
    present, previous_word, next_word = features[feature] 
    if present: 
        print(previous_word) 
        print(next_word)
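
The tuple example above leaves previous_word and next_word undefined. Here is a minimal sketch of one way to obtain them, assuming the tweet is split on whitespace and only the first occurrence of each keyword matters. The helper name neighbours and both assumptions are illustrative, not part of the original question:

```python
def neighbours(words, keyword):
    """Return (present, previous_word, next_word) for the first occurrence of keyword.

    Hypothetical helper: words is a list of tokens, e.g. tweet.split().
    A missing neighbour is returned as an empty string.
    """
    if keyword not in words:
        return (False, '', '')
    idx = words.index(keyword)  # position of the first occurrence
    previous_word = words[idx - 1] if idx > 0 else ''
    next_word = words[idx + 1] if idx < len(words) - 1 else ''
    return (True, previous_word, next_word)


words = 'japan earthquake now'.split()
print(neighbours(words, 'earthquake'))  # keyword in the middle of the tweet
print(neighbours(words, 'japan'))       # keyword at the start: no previous word
print(neighbours(words, 'tsunami'))     # keyword not in the tweet
```

The returned tuple has the same shape as the values stored in the features map, so it could be assigned directly to features['contains(%s)' % word].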

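Your original solution has one problem: you do not take repeated words into account. Using set() means the elements are unique. Use a list [] if you want to keep duplicates, or a map if you want fast lookups.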
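
To make the point about duplicates concrete, here is a small sketch using collections.Counter from the standard library, which keeps the number of occurrences of each word while still giving fast lookups. The sample tweet is made up for illustration:

```python
from collections import Counter

tweet_words = 'my arm hurts and my head hurts'.split()

unique_words = set(tweet_words)     # duplicates are lost here
word_counts = Counter(tweet_words)  # maps each word to its number of occurrences

print(len(tweet_words))      # total number of words, duplicates included
print(len(unique_words))     # number of distinct words only
print(word_counts['hurts'])  # occurrences of a single word
print(word_counts['leg'])    # a word that is absent counts as 0
```

A Counter could replace the boolean in the features map if the number of occurrences matters more than mere presence.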

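Using a class would let you manipulate the data more easily. Instead of a list of words, we can get more flexibility by mapping each word to a list of its contexts. Why not store the position of the word in the tweet there as well?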

class Tweet(object): 
    def __init__(self, tweet): 
        self.text = tweet 
        self.words = tweet.split() 
        self.word_count = len(self.words) 

        # dictionary comprehension - preliminary map of 'word' to empty list 
        self.contexts = {word: [] for word in self.words} 

        for idx, word in enumerate(self.words): 
            # one (idx, previous word, next word) tuple per occurrence of the word 
            self.contexts[word].append((
                idx,                                                        # idx 
                self.words[idx - 1] if idx > 0 else '',                     # previous word 
                self.words[idx + 1] if idx < self.word_count - 1 else ''))  # next word

You can then rewrite your function this way, although duplicates are still not handled:

def extract_features(tweet_str): 
    tweet = Tweet(tweet_str) 
    features = {} 
    for word in featureList: 
        features['contains(%s)' % word] = (word in tweet.words) 
    return features

There are so many things you can now do with it:

# enumerate each word, its locations and contexts: 
# (tweet is a Tweet instance, e.g. tweet = Tweet(tweet_str)) 
for word in tweet.contexts: 
    for location, previous_word, next_word in tweet.contexts[word]: 
        print(word, location, previous_word, next_word) 

# get the word count: 
print(tweet.word_count) 

# how many apples? 
print(sum(1 for word in tweet.words if word.lower().startswith('apple'))) 

# how many apples, again? 
print(len(tweet.contexts.get('apples', []))) # of course you will miss 'apple' and 'Apples', etc 

# did he use the word 'where'? 
print('where' in tweet.words) # note: 'Where' will not match because of capital W


2015-04-13 00:08:00 Joe

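What I want to know is how to capture this information (word count and context) so as to get the vectors the SVM algorithm needs to work. Until now, what I have is a vector in a |featureList|-dimensional space. How do I extend it to include word count and context? – dharm0us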

Sorry, I don't know SVM at all. – Joe
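
For what it is worth, one possible way to extend the |featureList|-dimensional vector is sketched below. It is not taken from either poster: the function build_vector and the parameters context_keywords and context_vocab are hypothetical names, and the encoding (one extra dimension for the word count, plus a one-hot block for the left and right neighbours of each special keyword) is only one of several reasonable choices.

```python
def build_vector(tweet, feature_list, context_keywords, context_vocab):
    """Flatten presence, word count and context into one list of numbers.

    All parameter names are hypothetical:
      feature_list     - keywords whose presence is encoded as 0/1
      context_keywords - special keywords (e.g. 'earthquake') whose neighbours matter
      context_vocab    - words that may appear as a left or right neighbour
    """
    words = tweet.lower().split()
    present = set(words)

    # 1) one 0/1 dimension per keyword, as in the question
    vector = [1 if w in present else 0 for w in feature_list]

    # 2) one extra dimension for the word count
    vector.append(len(words))

    # 3) per special keyword: one-hot left neighbours, then one-hot right neighbours
    for keyword in context_keywords:
        left, right = set(), set()
        for idx, w in enumerate(words):
            if w == keyword:
                if idx > 0:
                    left.add(words[idx - 1])
                if idx < len(words) - 1:
                    right.add(words[idx + 1])
        vector.extend(1 if v in left else 0 for v in context_vocab)
        vector.extend(1 if v in right else 0 for v in context_vocab)
    return vector


feature_list = ['arm', 'earthquake', 'hurts']
context_keywords = ['earthquake']
context_vocab = ['japan', 'now', 'big']

vec = build_vector('Japan earthquake now', feature_list, context_keywords, context_vocab)
print(vec)
```

Every tweet yields a vector of the same length, len(feature_list) + 1 + 2 * len(context_keywords) * len(context_vocab), which is the fixed-length numeric input an SVM implementation generally expects. The word count lives on a different scale from the 0/1 dimensions, so rescaling it before training may be worth trying.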
