2017-07-07 60 views
2

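Efficient way to run a function over the columns of a pandas DataFrame?

I want to run a function over the columns of a pandas DataFrame. The corpus is a pd.DataFrame: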

import pandas as pd 
import numpy as np 
from scipy.spatial.distance import cosine 

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]],index=["stark","groß","schwach","klein", "dick"],columns=["d1", "d2", "d3","d4","d5","d6"]) 

I also have a query, which is a pandas Series.

query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"]) 

Now I want to run the cosine function on the query and each column of the corpus.

for column in corpus: 
    # scipy's cosine() returns a distance, so 1 - distance is the similarity
    print("Similarity of document", column, "and query:\n", 1 - cosine(query, corpus[column])) 

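Is there a better way to run the cosine function over the columns? Perhaps some method that takes the columns and runs the function on each of them. I would like to avoid the for loop.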

+0

The cosine function is simply imported from scipy.spatial.distance: scipy.spatial.distance.cosine(u, v), where u and v are arrays. (It computes the cosine distance between two 1-D arrays.) – BenVes

+0

Thank you, you are right. I have edited my question. :) – BenVes

Answers

2

You can use scipy.spatial.distance.cdist with its 'cosine' metric for a vectorized solution, like so -

from scipy.spatial.distance import cdist 

out = 1-cdist(query.values[None], corpus.values.T, 'cosine') 

Sample run -

In [192]: corpus 
Out[192]: 
     d1 d2 d3 d4 d5 d6 
stark  3 1 1 1 1 60 
groß  2 2 0 2 0 20 
schwach 0 2 1 1 0 0 
klein  0 0 2 1 0 1 
dick  0 0 0 0 1 0 

In [193]: query 
Out[193]: 
stark  1 
groß  1 
schwach 0 
klein  0 
dick  0 
dtype: int64 

In [194]: from scipy.spatial.distance import cosine 

In [195]: for column in corpus: 
    ...:  print(1-cosine(query, corpus[column])) 
    ...:  
0.980580675691 
0.707106781187 
0.288675134595 
0.801783725737 
0.5 
0.89431540856 

In [196]: 1-cdist(query.values[None], corpus.values.T, 'cosine') 
Out[196]: array([[ 0.98058, 0.70711, 0.28868, 0.80178, 0.5 , 0.89432]]) 
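
Note that cdist returns an unlabeled 2-D array of shape (1, number of columns). The following is a minimal sketch, assuming the corpus and query from the question, of attaching the column labels again; the name `sims` is only illustrative and not part of the original answer.

```python
import pandas as pd
from scipy.spatial.distance import cdist

corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60],
     [2, 2, 0, 2, 0, 20],
     [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]],
    index=["stark", "groß", "schwach", "klein", "dick"],
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=corpus.index)

# cdist expects two 2-D arrays with one vector per row:
# query.values[None] has shape (1, 5), corpus.values.T has shape (6, 5).
out = 1 - cdist(query.values[None], corpus.values.T, 'cosine')

# out has shape (1, 6); take its single row and reattach the column labels.
sims = pd.Series(out[0], index=corpus.columns)
print(sims)
```

The result then has the same labeled form as the output of the loop or of an apply-based approach.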

Runtime test -

In [225]: corpus = pd.DataFrame(np.random.rand(100,10000)) 

In [226]: query = pd.Series(np.random.rand(100)) 

# @C.Square's apply based soln 
In [227]: %timeit corpus.apply(lambda x:1-cosine(query, x), axis=0) 
1 loop, best of 3: 352 ms per loop 

# Proposed in this post using cdist() 
In [228]: %timeit 1-cdist(query.values[None], corpus.values.T, 'cosine') 
100 loops, best of 3: 3.2 ms per loop 
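
The %timeit lines above only work inside IPython. As a rough sketch of the same comparison in a plain script, one could use the standard timeit module; the smaller array size, the seed, and the helper names `with_apply` and `with_cdist` are assumptions made here, and the absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, cosine

rng = np.random.default_rng(0)
corpus = pd.DataFrame(rng.random((100, 1000)))
query = pd.Series(rng.random(100))

def with_apply():
    # One Python-level call to cosine() per column
    return corpus.apply(lambda x: 1 - cosine(query, x), axis=0)

def with_cdist():
    # A single vectorized call over all columns
    return 1 - cdist(query.values[None], corpus.values.T, 'cosine')

# Both approaches should produce the same similarities
same = np.allclose(with_apply().values, with_cdist()[0])
print("results agree:", same)

t_apply = timeit.timeit(with_apply, number=3)
t_cdist = timeit.timeit(with_cdist, number=3)
print("apply: %.4f s, cdist: %.4f s (3 runs each)" % (t_apply, t_cdist))
```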
0

Using apply is a neat, readable, and fast way to do this kind of work:

import pandas as pd 
from scipy.spatial.distance import cosine 

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]], index=["stark","groß","schwach","klein", "dick"], columns=["d1", "d2", "d3","d4","d5","d6"]) 
query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"]) 

corpus.apply(lambda x:1-cosine(query, x), # Apply your function 
      axis=0)      # For each column 

# d1 0.980581 
# d2 0.707107 
# d3 0.288675 
# d4 0.801784 
# d5 0.500000 
# d6 0.894315 
# dtype: float64 
1

You can also use the definition of cosine similarity and implement it yourself
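
For reference, the definition that the snippets in this answer implement can be written as follows. The symbols are notation introduced here, not taken from the original answer: q is the query vector and d_j is the j-th column of the corpus.

```latex
\[
\operatorname{sim}(q, d_j)
  = \frac{q \cdot d_j}{\lVert q \rVert_2 \, \lVert d_j \rVert_2}
  = \frac{\sum_i q_i \, d_{ij}}{\sqrt{\sum_i q_i^2} \, \sqrt{\sum_i d_{ij}^2}}
\]
```

scipy.spatial.distance.cosine returns the distance 1 - sim(q, d_j), which is why the other answers compute 1 - cosine(...).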

pandas

corpus.T.dot(query)/(corpus ** 2).sum() ** .5/(query ** 2).sum() ** .5 

d1 0.980581 
d2 0.707107 
d3 0.288675 
d4 0.801784 
d5 0.500000 
d6 0.894315 
dtype: float64 

numpy

c = corpus.values 
q = query.values 

r = c.T.dot(q)/(c ** 2).sum(0) ** .5/(q ** 2).sum() ** .5 

pd.Series(r, corpus.columns) 

d1 0.980581 
d2 0.707107 
d3 0.288675 
d4 0.801784 
d5 0.500000 
d6 0.894315 
dtype: float64 

With @Divakar's suggestion
np.einsum

c = corpus.values 
q = query.values 

r = (
     np.einsum('ji,j->i', c, q)/
     np.einsum('ij,ij->j', c, c) ** .5/
     np.einsum('i,i', q, q) ** .5 
    ) 

pd.Series(r, corpus.columns) 

d1 0.980581 
d2 0.707107 
d3 0.288675 
d4 0.801784 
d5 0.500000 
d6 0.894315 
dtype: float64 
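
One caveat with the raw NumPy variants: .values drops the index labels, so they rely on query and corpus listing their rows in the same order. The following is a minimal sketch, assuming the data from the question, of a helper that aligns the query to the corpus index first and then applies the same formula; the name `column_cosine_similarity` is hypothetical.

```python
import numpy as np
import pandas as pd

def column_cosine_similarity(corpus, query):
    """Cosine similarity between `query` and every column of `corpus`."""
    # Align the query to the corpus row labels before dropping to NumPy
    q = query.reindex(corpus.index).to_numpy(dtype=float)
    c = corpus.to_numpy(dtype=float)
    numerator = c.T.dot(q)                                     # one dot product per column
    denominator = np.linalg.norm(c, axis=0) * np.linalg.norm(q)
    return pd.Series(numerator / denominator, index=corpus.columns)

corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60],
     [2, 2, 0, 2, 0, 20],
     [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]],
    index=["stark", "groß", "schwach", "klein", "dick"],
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=corpus.index)

sims = column_cosine_similarity(corpus, query)
# Same labels in reverse order: the result should not change
sims_reversed = column_cosine_similarity(corpus, query.iloc[::-1])
print(sims)
```

A column consisting only of zeros has norm 0, so its similarity is undefined and comes out as NaN in this sketch.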
+1

I see there is an 'einsum' for '(c ** 2).sum(0)' as well, another one! – Divakar
