Finding document frequency with Python

Hey everyone, I know this has been asked a few times already, but I'm having a hard time finding document frequency with Python. I'm trying to compute TF-IDF and then the cosine scores between the documents and a query, but I'm stuck on finding the document frequency. Here is what I have so far:

# imports 
import re 
import os 
import operator 
import glob 
import sys 
import math 
from collections import Counter 

# number of command line arguments checker 
if len(sys.argv) != 3: 
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt' 
    sys.exit(1) 

# read in the directory of the files 
path = sys.argv[1] 

#Read in the query 
y = sys.argv[2] 
querystart = re.findall(r'\w+', open(y).read().lower()) 
query = [Z for Z in querystart] 
Query_vec = Counter(query) 
print Query_vec 

#counts total number of documents in the directory 
doccounter = len(glob.glob1(path,"*.txt")) 

if os.path.exists(path) and os.path.isfile(y): 
    word_TF = [] 
    word_IDF = {} 
    TFvec = [] 
    IDFvec = [] 

    #this is my attempt at finding IDF 
    for filename in glob.glob(os.path.join(path, '*.txt')): 

        words_IDF = re.findall(r'\w+', open(filename).read().lower()) 

        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()] 

        word_IDF = doc_IDF 

        # pseudocode!! 
        """ 
        for key in word_idf: 
            if key in word_idf: 
                word_idf[key] += 1 
            else: 
                word_idf[key] = 1 

        print word_IDF 
        """ 

    # goes to that directory and reads in the files there 
    for filename in glob.glob(os.path.join(path, '*.txt')): 

        words_TF = re.findall(r'\w+', open(filename).read().lower()) 

        # keeps only alphabetic words of length 3 or greater 
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()] 

        # this assigns a count to each term; this is my TF for each vector 
        TFvec = Counter(doc_TF) 

        # weighting the TF with a log function 
        for key in TFvec: 
            TFvec[key] = 1 + math.log10(TFvec[key]) 

    # placed here so I don't get a command line full of text 
    print TFvec 

#Error checker 
else: 
    print "That path does not exist" 

I'm using Python 2, and so far I don't really have any idea how to count the number of documents a term appears in. I can find the total number of documents, but I'm really struggling to find how many documents each term occurs in. My plan was just to build one large dictionary holding every term from all the documents, which could be pulled from later whenever a query needs those terms. Thank you for any help you can give me.

Is there a reason you're trying to implement this yourself rather than using a library? http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html –

I read about that one, but I have to log the tf and idf values, and I figured it would be easier if I implemented it myself. Also, I'll be reading in a directory with about 100 text files, so again I thought it would be easier than using scikit – Sean

Also, I'll have to do the cosine for tf-idf later on. Does scikit have that functionality too? – Sean
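
To the last comment: scikit-learn does cover that step; TfidfVectorizer builds the TF-IDF matrix, and sklearn.metrics.pairwise.cosine_similarity computes the cosine scores. A minimal sketch, with illustrative docs and query_text values that are not from the question:

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity 

# illustrative inputs; in the question these would come from the 
# directory of .txt files and from query.txt 
docs = ["the cat sat on the mat", "the dog barked", "cats and dogs play"] 
query_text = "cat on a mat" 

vectorizer = TfidfVectorizer() 
doc_matrix = vectorizer.fit_transform(docs)     # one TF-IDF row per document 
query_vec = vectorizer.transform([query_text])  # reuses the documents' vocabulary 

print cosine_similarity(query_vec, doc_matrix)  # one cosine score per document 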

Answer

The DF of a term x is the number of documents in which x appears. To find it, you need to iterate over all the documents first. Only then can you compute the IDF from the DF.
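
For example, with 10 documents in total, a term that appears in 2 of them has DF = 2 and IDF = log(10/2) ≈ 1.61 (using the natural log, as the code below does).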

You can use a dictionary to count the DF:

  1. Iterate over all the documents. 
  2. For each document, retrieve the set of its words (without repetitions). 
  3. Increase the DF count of each word from stage 2. Thus, you increase the count by exactly 1, regardless of how many times the word appeared in that document.

The Python code could look something like this:

from collections import defaultdict 
import math 

DF = defaultdict(int) 
for filename in glob.glob(os.path.join(path, '*.txt')): 
    words = re.findall(r'\w+', open(filename).read().lower()) 
    for word in set(words): 
        if len(word) >= 3 and word.isalpha(): 
            DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part. 

# Now you can compute IDF. 
IDF = dict() 
for word in DF: 
    IDF[word] = math.log(doccounter/float(DF[word])) # Don't forget that python2 uses integer division. 
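
To get from here to the cosine score mentioned in the question, one option is to combine each document's log-weighted TF Counter with the IDF values above and score it against the query's Counter. A minimal sketch, assuming doc_tfs is a list of (filename, TFvec) pairs collected in the question's TF loop; the cosine helper is illustrative, and the query is left with its raw counts:

# Minimal sketch: turn each document's TF Counter into a TF-IDF dict and 
# score it against the query vector. `doc_tfs` is an assumed list of 
# (filename, TFvec) pairs; `IDF` and `Query_vec` come from the code above. 
def cosine(vec1, vec2): 
    dot = sum(vec1[t] * vec2[t] for t in vec1 if t in vec2) 
    norm1 = math.sqrt(sum(v * v for v in vec1.values())) 
    norm2 = math.sqrt(sum(v * v for v in vec2.values())) 
    if norm1 == 0 or norm2 == 0: 
        return 0.0 
    return dot / (norm1 * norm2) 

for filename, TFvec in doc_tfs: 
    tfidf = dict((t, TFvec[t] * IDF.get(t, 0.0)) for t in TFvec) 
    print filename, cosine(tfidf, Query_vec) 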

P.S. It's good to learn to implement things by hand, but if you get stuck, I suggest you take a look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
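
For instance, a minimal sketch of NLTK's TextCollection helper; the texts list here is illustrative:

# Minimal sketch of NLTK's corpus-level statistics; `texts` is an 
# illustrative list of token lists, one per document. 
from nltk.text import TextCollection 

texts = [['the', 'cat', 'sat'], ['the', 'dog', 'barked']] 
collection = TextCollection(texts) 
print collection.idf('cat')               # IDF of 'cat' across the collection 
print collection.tf_idf('cat', texts[0])  # TF-IDF of 'cat' in the first document 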

Thank you so much. Someone recommended defaultdict to me yesterday, but I didn't know how to use it. – Sean