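Fast sampling of a large number of rows from a large DataFrame in Python

I have a very large DataFrame (about 1.1M rows) that I am trying to sample.

I have a list of indices (about 70,000 of them) that I want to select from the whole DataFrame.

This is what I have tried so far, but all of these methods take too much time:

Method 1 - using pandas: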

sample = pandas.read_csv("data.csv", index_col = 0).reset_index() 
sample = sample[sample['Id'].isin(sample_index_array)]

Method 2:

I tried writing all of the sampled rows to another CSV file.

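f = open("data.csv", 'r')

out = open("sampled_date.csv", 'w')
out.write(f.readline())  # copy the header row

total = 0  # number of data rows read so far
while 1:
    line = f.readline().strip()

    if line == '':
        break
    total += 1
    arr = line.split(",")

    # keep the row if its first field (the Id) is one of the sampled indices
    if (int(arr[0]) in sample_index_array):
        out.write(line + "\n")

f.close()
out.close()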

Can anyone suggest a better method, or tell me how I can modify this to make it faster?

Thanks

Source

2016-09-24 user324

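If I understand you correctly, you can convert your indices into a pandas Index object and then feed that object to the DataFrame to slice it directly. – pylang

It seems you could benefit from the simple selection methods. We don't have your data, so here is an example that selects a subset using a pandas Index object and the .iloc selection method.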

import pandas as pd 
import numpy as np 

# Large Sample DataFrame 
df = pd.DataFrame(np.random.randint(0,100,size=(1000000, 4)), columns=list('ABCD')) 
df.info() 

# Results 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 1000000 entries, 0 to 999999 
Data columns (total 4 columns): 
A 1000000 non-null int32 
B 1000000 non-null int32 
C 1000000 non-null int32 
D 1000000 non-null int32 
dtypes: int32(4) 
memory usage: 15.3 MB 


# Convert a sample list of indices to an `Index` object 
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776] 
idxs = pd.Index(indices) 
subset = df.iloc[idxs, :] 
subset 

# Output 
A B C D 
1  9 33 62 17 
2  44 73 85 11 
3  56 83 85 79 
10  5 72 3 82 
20  72 22 61 2 
30  75 15 51 11 
67  82 12 18 5 
78  95 9 86 81 
900 23 51 3 5 
2176 30 89 67 26 
78776 54 88 56 17

In your case, try this:

df = pd.read_csv("data.csv", index_col = 0).reset_index() 
idx = pd.Index(sample_index_array)    # assuming a list 
sample = df.iloc[idx, :]
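
One caveat worth checking: .iloc selects by row position, while the question filters on the values of the 'Id' column. The two only agree if the Ids happen to be 0, 1, 2, ... in file order. If sample_index_array holds Id values instead, a label-based lookup may be what is wanted. Below is a minimal sketch under that assumption; the small DataFrame and its values are made up to stand in for data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: 'Id' values that are NOT row positions.
df = pd.DataFrame({
    "Id": np.arange(1000, 1010),
    "A": np.arange(10) * 2,
})
sample_index_array = [1001, 1004, 1009]  # Id values, not positions

# Make 'Id' the index, then select by label with .loc.
by_id = df.set_index("Id")
sample = by_id.loc[sample_index_array].reset_index()
print(sample)
```

Note that .loc with a list raises a KeyError if any requested label is missing, so for Ids that may be absent, the isin filter from the question is the more forgiving choice.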

The .iat and .at methods are even faster, but they require scalar indices.

Source
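
To illustrate what "scalar" means here, this is a small sketch with a made-up three-row DataFrame: .at takes one row label and one column label, and .iat takes one row position and one column position. Each returns a single value, so neither can pull out 70,000 rows in one call.

```python
import pandas as pd

# Hypothetical toy data, only to show the calling convention.
df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]}, index=["x", "y", "z"])

value_by_label = df.at["y", "A"]     # label-based scalar access
value_by_position = df.iat[2, 1]     # position-based scalar access
print(value_by_label, value_by_position)
```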

2016-09-24 15:06:56 pylang

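Thanks! This should work! Out of curiosity, is there a way to slice out these rows while reading the data? – user324

If you are asking about reading in an already filtered subset, you can use 'skiprows' in read_csv (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), but I don't think there is a 'use_rows' option. I would post an issue on GitHub to request this feature. – pylang

OK. I'll try skiprows. Thanks! – user324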
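
As a rough sketch of the skiprows idea from the comments: read_csv accepts a callable for skiprows, which is called with each line number of the file and returns True for lines to skip. Inverting a set of wanted positions therefore acts like a "use rows" filter. The CSV text and the names keep_positions and skip are made up for illustration, and this selects by row position, not by Id value.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv.
csv_text = "Id,A\n100,a\n101,b\n102,c\n103,d\n104,e\n"

keep_positions = {1, 3}  # 0-based positions of the data rows to keep

def skip(line_number):
    # line_number counts lines in the file; line 0 is the header, so keep it.
    return line_number != 0 and (line_number - 1) not in keep_positions

subset = pd.read_csv(io.StringIO(csv_text), skiprows=skip)
print(subset)
```

Using a set for keep_positions keeps each membership test cheap, which matters when the callable runs once per line of a 1.1M-row file. To filter on Id values rather than positions, reading with the chunksize parameter and applying isin to each chunk is another option.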
