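Finding document frequency using Python
Hey everyone, I know this has been asked a few times, but I'm having trouble finding document frequency with Python. I'm trying to compute TF-IDF and then the cosine score between each document and a query, but I'm stuck on finding the document frequency. This is what I have so far: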
#imports
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path,"*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF
        #pseudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] =+1
            else:
                word_idf[key] = 1
        print word_IDF
        """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]
        #this assigns values to each term; this is my TF for each vector
        TFvec = Counter(doc_TF)
        #weighting the TF with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])
    #placed here so I don't get a command line full of text
    print TFvec
#Error checker
else:
    print "That path does not exist"
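I'm using Python 2, and so far I really have no idea how to count how many documents a term appears in. I can find the total number of documents, but I'm really struggling to find the number of documents each term appears in. I just want to build one large dictionary that holds all the terms from all the documents, so they can be pulled out later when the query needs them. Thanks for any help you can give me.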
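One way to get that "large dictionary" is to count each term at most once per document, which is what document frequency means. The following is only a minimal sketch under a few assumptions: the helper names `tokenize`, `document_frequency` and `inverse_document_frequency` are made up for illustration, the toy `texts` list stands in for the contents of the `*.txt` files, and the IDF weighting `log10(N / df)` is one common variant rather than the only one. It keeps the same token filter as the code above (alphabetic, length 3 or more) and should run on both Python 2 and Python 3.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Same filter as the question's code: alphabetic tokens of length >= 3.
    return [w for w in re.findall(r'\w+', text.lower())
            if len(w) >= 3 and w.isalpha()]


def document_frequency(docs):
    # docs: iterable of token lists, one list per document.
    df = Counter()
    for tokens in docs:
        # set() collapses repeats, so each document adds at most 1 per term.
        df.update(set(tokens))
    return df


def inverse_document_frequency(df, n_docs):
    # Assumed weighting: idf = log10(N / df). Other variants exist.
    return {term: math.log10(float(n_docs) / count)
            for term, count in df.items()}


# Toy corpus standing in for the text read from each file (hypothetical data).
texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing",
]
docs = [tokenize(t) for t in texts]
df = document_frequency(docs)
idf = inverse_document_frequency(df, len(docs))
```

With real files, `docs` could be built inside the existing `glob.glob(...)` loop by appending each document's filtered token list. Keeping `df` and `idf` as separate dictionaries also makes it easy to record both values.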
Is there a reason you're trying to implement this yourself rather than using a library: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
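I read about it, but I have to record the tf and idf values, and I thought it would be easier if I implemented it myself. Also, I'll be reading in a directory of about 100 text files, so again I thought that would be easier than using scikit – Sean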
Also, I'll have to compute the cosine for the tf-idf later on. Does scikit have that functionality too? – Sean
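On the cosine step: scikit-learn does provide `sklearn.metrics.pairwise.cosine_similarity`, but for hand-rolled vectors stored as dictionaries it can also be sketched in a few lines of plain Python. The names `tfidf_vector` and `cosine`, and the weights in the example, are hypothetical and only there to illustrate the idea. Terms missing from the IDF dictionary are simply dropped here, which is an assumption about how unseen query terms should be handled.

```python
import math


def tfidf_vector(tf, idf):
    # Multiply each TF weight by the term's IDF; terms with no IDF are dropped.
    return {t: w * idf[t] for t, w in tf.items() if t in idf}


def cosine(u, v):
    # u, v: sparse vectors stored as {term: weight} dictionaries.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        # An empty or all-zero vector has no direction; treat the score as 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical weights, chosen only to keep the arithmetic easy to follow.
idf = {"cat": 0.5, "dog": 1.0, "mat": 1.0}
doc_tf = {"cat": 2.0, "mat": 1.0}
query_tf = {"cat": 1.0, "dog": 1.0}

doc_vec = tfidf_vector(doc_tf, idf)
query_vec = tfidf_vector(query_tf, idf)
score = cosine(doc_vec, query_vec)
```

Ranking the documents then comes down to computing `score` for each document's vector against the query vector and sorting by it.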