Reading a matrix and getting the row and column names in Python

I want to read a matrix file that looks something like this:

sample sample1 sample2 sample3 
sample1 1 0.7 0.8 
sample2 0.7 1 0.8 
sample3 0.8 0.8 1

I want to get all pairs with a value >= 0.8. For example: sample1,sample3 0.8, sample2,sample3 0.8, and so on, in a large file.

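When I use csv.reader, each row becomes a list, and keeping track of the row and column names makes the program fragile. I would like to know an elegant way of doing this, for example with numpy or pandas.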

Desired output:

sample1,sample3 0.8 
sample2,sample3 0.8

The 1s can be ignored, because the value between a sample and itself is always 1.

Source

2015-11-04 gthm

Both pandas and numpy have csv readers. There are plenty of SO questions about these. – hpaulj

pandas' read_table can handle a regular expression in the sep argument.

In [19]: !head file.txt 
sample sample1 sample2 sample3 
sample1 1 0.7 0.8 
sample2 0.7 1 0.8 
sample3 0.8 0.8 1 

In [20]: df = pd.read_table('file.txt', sep=r'\s+') 

In [21]: df 
Out[21]: 
    sample sample1 sample2 sample3 
0 sample1  1.0  0.7  0.8 
1 sample2  0.7  1.0  0.8 
2 sample3  0.8  0.8  1.0

From there, you can filter for all values >= 0.8.

In [23]: df[df >= 0.8] 
Out[23]: 
    sample sample1 sample2 sample3 
0 sample1  1.0  NaN  0.8 
1 sample2  NaN  1.0  0.8 
2 sample3  0.8  0.8  1.0
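The session above keeps the sample names as an ordinary text column. On Python 3, comparing that text column with a number is likely to raise a TypeError, so here is a minimal sketch of the same idea that reads the first column as the index instead. The inline text is a stand-in for file.txt, and the use of index_col=0 is my assumption, not part of the answer above.

```python
import io

import pandas as pd

# Stand-in for file.txt so the sketch is self-contained.
text = """sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
"""

# index_col=0 turns the first column into row labels,
# so the body of the frame holds numbers only.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0)

# Cells below the threshold become NaN; the others keep their value.
filtered = df[df >= 0.8]
print(filtered)
```

With the names in the index, every remaining cell can be addressed by a (row name, column name) pair, which is what the question asks for.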

Source

2015-11-04 21:47:04

My question is how to pull out those pairs. The desired output is the row and column names. – gthm

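If you want to use pandas, the following answer will help. I assume you will figure out for yourself how to read your matrix file into pandas. I also assume that your columns and rows are labelled correctly. What you end up with after reading the data is a DataFrame that looks much like the matrix at the top of your question. I assume that all the row names are the DataFrame index. My starting point is that you have already read the data into a variable named df.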

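pandas filters all the rows of a column in a single vectorized step, so we do the row-wise work inside pandas and loop only over the columns.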

pairs = {} 
for col in df.columns: 
    pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)].index.tolist() 
    # If row names are not an index, but a different column named 'names' run the following line, instead of the line above 
    # pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)]['names'].tolist()

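Alternatively, you can use apply() to do this, since it also iterates over all the columns. (Perhaps in 0.17 it releases the GIL for faster results; I don't know, because I haven't tried it.)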

pairs will now contain the column names as keys and, as values, lists of the row names whose correlation is at least 0.8 but less than 1.
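To make the result concrete, here is a small runnable sketch of the loop above applied to the matrix from the question. The DataFrame is built by hand here (an assumption, standing in for whatever you read from your file), with the sample names as both index and columns.

```python
import pandas as pd

names = ["sample1", "sample2", "sample3"]
df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=names,
    columns=names,
)

pairs = {}
for col in df.columns:
    # Boolean mask over the rows of this column: keep 0.8 <= value < 1.
    keep = (df[col] >= 0.8) & (df[col] < 1)
    pairs[col] = df[keep].index.tolist()

print(pairs)
# {'sample1': ['sample3'], 'sample2': ['sample3'], 'sample3': ['sample1', 'sample2']}
```

Because the matrix is symmetric, each pair shows up twice, once under each of its two names.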

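If you also want to extract the correlation values from the DataFrame, replace .tolist() with .to_dict(). .to_dict() produces a dictionary in which the index is the key and the value is the value: {index -> value}. So in the end your pairs will look like {column -> {index -> value}}. It is also guaranteed to be free of NaN. Note that .to_dict() only gives what you want if your index contains the row names; otherwise it returns the default index, which is just numbers.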
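A sketch of the {column -> {index -> value}} variant follows. Note one assumption on my part: rather than swapping the method name on the index, it selects the filtered column as a Series with df.loc and calls .to_dict() on that, since the values live in the column and not in the index. The hand-built df again stands in for your own data.

```python
import pandas as pd

names = ["sample1", "sample2", "sample3"]
df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=names,
    columns=names,
)

pairs = {}
for col in df.columns:
    keep = (df[col] >= 0.8) & (df[col] < 1)
    # df.loc[keep, col] is a Series indexed by row name,
    # so to_dict() yields {row name -> value}.
    pairs[col] = df.loc[keep, col].to_dict()

print(pairs)
```

No NaN can appear in the result, because only the rows that pass the mask are selected before the conversion.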

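P.S. If your file is large, I would suggest reading it in chunks. In that case the code above is repeated for each chunk, so it belongs inside the loop that iterates over the chunks. Be careful, though, to append the new data from each subsequent chunk to pairs. The following links are for your reference:

You may also want to read reference 1 for the other types of I/O that pandas supports.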
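The chunked approach described above could look roughly like this. It is only a sketch: the inline text stands in for a large file, chunksize=2 is chosen just to force more than one chunk, and the defaultdict is my choice for appending results across chunks.

```python
import io
from collections import defaultdict

import pandas as pd

# Stand-in for a large file on disk.
text = """sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
"""

pairs = defaultdict(list)

# With chunksize set, read_csv returns an iterator of DataFrames,
# each holding at most that many rows.
reader = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0, chunksize=2)
for chunk in reader:
    for col in chunk.columns:
        keep = (chunk[col] >= 0.8) & (chunk[col] < 1)
        # extend() appends to what earlier chunks already found.
        pairs[col].extend(chunk[keep].index.tolist())

pairs = dict(pairs)
print(pairs)
```

Each chunk contains a slice of the rows but all of the columns, so the per-column loop works unchanged inside the chunk loop.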

Source

2015-11-05 02:40:06 Kartik

You can pick out the upper-triangle values (those above the diagonal) with np.triu:

In [11]: df 
Out[11]: 
     sample1 sample2 sample3 
sample 
sample1  1.0  0.7  0.8 
sample2  0.7  1.0  0.8 
sample3  0.8  0.8  1.0 

In [12]: np.triu(df, 1) 
Out[12]: 
array([[ 0. , 0.7, 0.8], 
     [ 0. , 0. , 0.8], 
     [ 0. , 0. , 0. ]]) 

In [13]: np.triu(df, 1) >= 0.8 
Out[13]: 
array([[False, False, True], 
     [False, False, True], 
     [False, False, False]], dtype=bool)

Then, to extract the index/columns where this is True, I think you have to use np.where*:

In [14]: np.where(np.triu(df, 1) >= 0.8) 
Out[14]: (array([0, 1]), array([2, 2]))

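This gives you an array of the row (index) positions first, and then one of the column positions (this is the least efficient part of this numpy version):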

In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8) 

In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)] 
Out[17]: 
[('sample1', 'sample3', 0.80000000000000004), 
('sample2', 'sample3', 0.80000000000000004)]

As required.
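Put together as one script, the numpy steps above might look like the sketch below. The hand-built df is an assumption standing in for the frame read from the file, and the final loop prints in the format the question asked for. The label lookup is done with array indexing on df.index and df.columns instead of one element at a time.

```python
import numpy as np
import pandas as pd

names = ["sample1", "sample2", "sample3"]
df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=names,
    columns=names,
)

values = df.to_numpy()

# k=1 zeroes the diagonal and everything below it, so self-pairs and
# mirrored duplicates can never pass the threshold.
rows, cols = np.where(np.triu(values, k=1) >= 0.8)

# Map integer positions back to labels in one indexing step each.
triples = list(zip(df.index[rows], df.columns[cols], values[rows, cols].tolist()))

for row_name, col_name, value in triples:
    print(f"{row_name},{col_name} {value}")
```

One caveat with this trick: np.triu fills the masked cells with 0, so it only works as a filter when the threshold is above 0.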

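*I may be forgetting an easier way to get this last piece. (Edit: the pandas code below does it, but I think there may be another way too.)

You can use the same trick in pandas, but with stack to get the index/columns themselves: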

In [21]: (np.triu(df, 1) >= 0.8) * df 
Out[21]: 
     sample1 sample2 sample3 
sample 
sample1  0  0  0.8 
sample2  0  0  0.8 
sample3  0  0  0.0 

In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack() 

In [23]: res 
Out[23]: 
sample 
sample1 sample1 0.0 
     sample2 0.0 
     sample3 0.8 
sample2 sample1 0.0 
     sample2 0.0 
     sample3 0.8 
sample3 sample1 0.0 
     sample2 0.0 
     sample3 0.0 
dtype: float64 

In [24]: res[res!=0] 
Out[24]: 
sample 
sample1 sample3 0.8 
sample2 sample3 0.8 
dtype: float64
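The stack approach above uses 0 as a marker for "masked out", which would also drop genuine zeros. A variant that avoids this, sketched here under the assumption that df has the sample names as both index and columns, masks with NaN instead and drops the NaN entries explicitly.

```python
import numpy as np
import pandas as pd

names = ["sample1", "sample2", "sample3"]
df = pd.DataFrame(
    [[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]],
    index=names,
    columns=names,
)

# Boolean mask that is True strictly above the diagonal.
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

# where() replaces the masked-out cells with NaN; stack() turns the frame
# into a Series keyed by (row name, column name); dropna() removes the NaN
# entries regardless of how stack() treats them.
stacked = df.where(upper).stack().dropna()

res = stacked[stacked >= 0.8]
print(res)
```

The index of res is a MultiIndex, so res.index.tolist() gives the name pairs directly.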

Source

2015-11-09 15:27:57

To read it in, you need the skipinitialspace and index_col arguments:

a=pd.read_csv('yourfile.txt',sep=' ',skipinitialspace=True,index_col=0)

To get the values as pairs:

[[x,y,round(a[x][y],3)] for x in a.index for y in a.columns if x!=y and a[x][y]>=0.8][:2]

which gives:

[['sample1', 'sample3', 0.8], 
['sample2', 'sample3', 0.8]]
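The comprehension above visits every ordered pair and then keeps the first two results with [:2], which only fits this three-sample example. A more general sketch, assuming the matrix is symmetric and the row and column names are identical, is to iterate over unordered pairs with itertools.combinations. The inline text stands in for yourfile.txt.

```python
import io
from itertools import combinations

import pandas as pd

# Stand-in for yourfile.txt.
text = """sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
"""

a = pd.read_csv(io.StringIO(text), sep=" ", skipinitialspace=True, index_col=0)

# combinations() yields each unordered pair of names once and never pairs
# a name with itself, so no x != y test or slicing is needed.
pairs = [
    [x, y, float(a.loc[x, y])]
    for x, y in combinations(a.index, 2)
    if a.loc[x, y] >= 0.8
]

for x, y, value in pairs:
    print(f"{x},{y} {value}")
```

a.loc[x, y] looks up row x and column y by label, which reads more directly than chained a[x][y] indexing.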

Source

2015-11-10 14:22:01 atomh33ls

Use scipy.sparse.coo_matrix, since it works with the "(row, col) data" format.

from scipy.sparse import coo_matrix 
import numpy as np 

M = np.matrix([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]]) 
S = coo_matrix(M)

Here, S.row and S.col are arrays of the row and column indices, and S.data is the array of the values at those indices. So you can filter with

idx = S.data >= 0.8

and, for instance, create a new matrix containing only those elements:

S2 = coo_matrix((S.data[idx], (S.row[idx], S.col[idx]))) 
print(S2)

The output is

(0, 0) 1.0 
(0, 2) 0.8 
(1, 1) 1.0 
(1, 2) 0.8 
(2, 0) 0.8 
(2, 1) 0.8 
(2, 2) 1.0

Note that (0, 1), whose value is 0.7, does not appear.
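The output above still contains the diagonal and both copies of each pair, and it reports integer positions instead of names. A possible way to finish the job is sketched below. The names list is an assumption (it would come from the file's header row), and scipy.sparse.triu is used to keep only the entries above the diagonal.

```python
import numpy as np
from scipy.sparse import coo_matrix, triu

# Assumed to come from the header row of the file.
names = ["sample1", "sample2", "sample3"]
M = np.array([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])

# k=1 keeps only the entries strictly above the diagonal.
S = triu(coo_matrix(M), k=1, format="coo")

keep = S.data >= 0.8

# Translate the integer row/column positions back into sample names.
pairs = [
    (names[i], names[j], float(v))
    for i, j, v in zip(S.row[keep], S.col[keep], S.data[keep])
]
print(pairs)
```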

Source

2015-11-16 07:25:49 JARS
