2016-08-24

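My code runs, but my function always returns 0.0. The code reads .txt files and builds a matrix in which each .txt file is one row, and each word of that file gets its own column in that row. The distance is then computed with a "bag of words" approach.

I compare two rows at a time. I want to count how often each word in the union of the two rows occurs in each row. However, although the code runs, I get the wrong result (0.0).

I thought the matrix I pass to the function might be wrong, but the matrix looks fine.

Strangely, if I create two lists by hand: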

a = ["a", "b", "c", "d"], 
b = ["b", "c", "d", "e"] 

it works, but when I change them to:

a = ["word 1", "word 2", "word 3", "word 4"], 
b = ["word 2","word 3","word 4","word 5",] 

the result is 0.0 again. I'm confused!

My code:

def bow_distance(a, b): 

    p = 0 

    if len(a) > len(b): 
     max_words = len(a) 
    else: 
     max_words = len(b) 

    list_words_ab = list(set(a) | set(b)) 

    len_bow_matrix = len(list_words_ab) 
    bow_matrix = numpy.zeros(shape = (3, len_bow_matrix), dtype = str) 

    while p < len_bow_matrix: 
     bow_matrix[0, p] = str(list_words_ab[p]) 
     p = p+1 

    p = 0 

    while p < len_bow_matrix: 
     bow_matrix[1, p] = a.count(bow_matrix[0, p]) 
     bow_matrix[2, p] = b.count(bow_matrix[0, p]) 
     p = p+1 

    p = 0 
    overlap = 0 

    while p < len_bow_matrix: 
     abs_difference = abs(float(bow_matrix[1, p]) - float(bow_matrix[2, p])) 
     overlap = overlap + abs_difference 
     p = p+1 

    return (overlap/2)/max_num_parts 


    # Calculate the distances 

i = 1 
j = 1 

while i < num_of_txt + 1: 

    print(i) 
    newfile = open("TXT_distance_" + str(i)+".txt", "w") 

    while j < num_of_txt + 1: 
     newfile.write(str(bow_distance(text_word_matrix[i-1], text_word_matrix[j-1])) + " ") 
     j = j+1 

    newfile.close() 
    j = 1 
    i = i+1 

Answers

At first glance, I see two mistakes here:

a = ["a", "b", "c", "d"], <----- comma here 
b = ["b", "c", "d", "e"] 
it works, but when I change to: 

a = ["word 1", "word 2", "word 3", "word 4"], <----- and here 
b = ["word 2","word 3","word 4","word 5",] <----- and here inside the list 
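
The effect of the marked commas can be checked directly. The snippet below is a minimal sketch that reuses the lists from the question; the name `c` is introduced only for illustration. In Python, a comma after the closing bracket turns the assignment into a one-element tuple, while a comma before the closing bracket changes nothing.

```python
# Comma AFTER the closing bracket: a becomes a tuple that holds one list.
a = ["a", "b", "c", "d"],
# No trailing comma: b is a plain list.
b = ["b", "c", "d", "e"]

print(type(a).__name__, len(a))  # tuple 1
print(type(b).__name__, len(b))  # list 4

# Comma INSIDE the brackets (hypothetical name c): still a four-item list.
c = ["word 2", "word 3", "word 4", "word 5",]
print(len(c))  # 4

# set(a) fails because the tuple's only element is an unhashable list.
try:
    set(a)
except TypeError as exc:
    print("TypeError:", exc)
```

If `a` really were defined with the trailing comma, `set(a)` inside `bow_distance` would raise a `TypeError` rather than return 0.0, so it is worth confirming what the real inputs look like.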

There is also an extra comma after "word 5" that needs to be removed. – Harrison

True, thank you. – turkus

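The comma after "word 5" doesn't matter, because a list may end with a trailing comma. However, the comma *after* the list definition (where `a` is defined) makes `a` a tuple with a single value (the list itself), and that may throw off your logic. – Riaz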
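
Separately from the commas, one more thing may be worth checking; it is not raised in the thread above. The question's code creates `bow_matrix` with `dtype = str`, and NumPy interprets a bare `str` dtype as a Unicode string of length one. The sketch below assumes NumPy is installed and uses hypothetical array names (`truncating`, `preserving`). It suggests why single-letter words would survive while "word 1" would be cut down to "w", so that every later `count` returns 0.

```python
import numpy

words = ["word 1", "word 2", "word 3"]

# dtype=str with no explicit length yields one-character strings.
truncating = numpy.zeros(shape=(1, len(words)), dtype=str)
truncating[0, :] = words
print(truncating.dtype)               # one-character Unicode dtype (<U1 on most machines)
print(truncating[0].tolist())         # ['w', 'w', 'w']
print(words.count(truncating[0, 0]))  # 0, because "w" is not in the list

# dtype=object stores the Python strings unchanged.
preserving = numpy.empty(shape=(1, len(words)), dtype=object)
preserving[0, :] = words
print(words.count(preserving[0, 0]))  # 1
```

If this is indeed the cause, keeping the words in an ordinary Python list and storing only the counts in a numeric array would avoid the truncation.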