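pandas date_range slows a function down dramatically

I have a sample data set and want to draw many samples from it, for example 1000 sample blocks, each made up of 500 data points from the original sample data. I have written this small function in Python: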

import timeit 
import pandas as pd 
import numpy as np 
sample_data = np.random.randn(10000, 15) 
index = pd.date_range("20000101", periods=10000, freq='B') 
sample_data_df = pd.DataFrame(sample_data, index=index) 
def f(n, sample_data_df, f):
    # multiply (1 + value) within each resampling period f (e.g. 'BM'), then subtract 1
    s = (1+sample_data_df).resample(f, axis=0)
    r = s.prod()-1
    # draw n rows with replacement
    out = r.sample(n, replace=True)
    # out_index = pd.date_range(start=sample_data_df.index[0],
    #        periods=len(out.index),
    #        freq=f)
    # out.index = out_index
    return out


start_time = timeit.default_timer() 
N = 1000 
a = [f(500, sample_data_df, 'BM') for i in range(N)] 
elapsed = timeit.default_timer() - start_time 
print(elapsed)

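Running this code takes 35.8964748383 seconds. However, I would like to attach an index to each block, so I uncomment the index lines in the function, i.e.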

def f(n, sample_data_df, f):
    s = (1+sample_data_df).resample(f, axis=0)
    r = s.prod()-1
    out = r.sample(n, replace=True)
    # build a fresh date index with one entry per sampled row
    out_index = pd.date_range(start=sample_data_df.index[0],
           periods=len(out.index),
           freq=f)
    out.index = out_index
    return out

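Now the function takes 72.2418179512 seconds. That's crazy. If I need such an index on every output, how can I speed this up? I know that I could generate the index once and attach it to each output. However, I want to use the function in other contexts as well, so I would really appreciate a solution where the index is assigned inside the function.

Also, apart from the index, are there other places where speed could be gained? Even without the index, 35.8964748383 seconds is a long time.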


2017-10-13 math

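Do you need to resample inside the function? – DJK

@djk47463 Yes, the function is actually a method of a class whose purpose is to resample. What I am thinking of is writing a decorator to add the index. Would that be pythonic? Do you know why indexing is so slow in pandas? To a beginner like me, indexing sounds like a fairly cheap thing to do. Are date types not handled efficiently in pandas? – math

Resampling and date ranges with a frequency greater than one day are a known performance issue in pandas; see the related issue, help is welcome! https://github.com/pandas-dev/pandas/issues/16463 – chrisb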

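Edit:

Added timing for creating the new date index
Added a caching function to create the new index

The problem is not so much the speed of the resampling or of the index creation. If we look at the timings: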

%timeit (1+sample_data_df).resample('BM', axis=0).prod()-1 
21.7 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
%timeit pd.date_range(start="20000101", periods=500, freq='BM') 
21.4 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

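22 ms doesn't seem bad to me, considering that we are resampling and reducing 150,000 elements.

Your problem comes from doing this 1000 times, which is unnecessary in your case (since you are doing the same thing every time). If you want to keep the resampling inside the function, what you can do is cache the result of the resampling. Unfortunately, the standard way of caching function results (lru_cache) requires hashable arguments, so it cannot handle mutable objects (such as DataFrames, lists, ...). So my solution is to wrap the resampling in a function that creates a hash and calls the actual function with that hash as an argument: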
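To see how these per-call costs add up, here is a rough back-of-the-envelope calculation. It only multiplies the two timings quoted above by the 1000 calls from the question; the cost of `sample` is ignored, and the timings come from a different machine than the asker's, so the totals show the order of magnitude rather than reproducing the 35.9 s and 72.2 s measurements.

```python
# Per-call timings quoted above, in seconds
resample_s = 21.7e-3    # (1 + df).resample('BM').prod() - 1
date_range_s = 21.4e-3  # pd.date_range(..., periods=500, freq='BM')
n_calls = 1000          # length of the list comprehension in the question

# Total time if every call repeats the same work
without_index = n_calls * resample_s
with_index = n_calls * (resample_s + date_range_s)

print(f"resampling only:         {without_index:.1f} s")
print(f"resampling + date_range: {with_index:.1f} s")
```

Building the index costs about as much as the resampling itself, which is consistent with the runtime roughly doubling once the index lines are uncommented.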

from functools import lru_cache
class Sampler():
    def __init__(self, df):
        self.df = df

    def get_resampled_sample(self, n, freq):
        resampled = self._wraper_resample_prod(freq)
        return resampled.sample(n, replace=True)

    def _wraper_resample_prod(self, freq):
        # a DataFrame is not hashable, so hash its raw bytes instead
        hash_df = hash(self.df.values.tobytes())
        return self._resample_prod(hash_df, freq)

    @lru_cache(maxsize=1)
    def _resample_prod(self, hash_df, freq):
        # hash_df is used only as part of the cache key
        return (self.df+1).resample(freq, axis=0).prod()-1

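Now the result of the resampling stays cached for as long as the hash of the df values does not change. This means we can sample much faster.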

%timeit [sampler.get_resampled_sample(500, 'BM') for i in range(1000)] 
881 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
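The caching behaviour can be checked in isolation with a small stand-in that needs no pandas. This is only a sketch of the same pattern: `CachedProduct`, its `calls` counter and the list-based data are hypothetical names made up for the illustration, not part of the `Sampler` class above.

```python
from functools import lru_cache


class CachedProduct:
    """Toy stand-in for Sampler: caches an expensive reduction."""

    def __init__(self, values):
        self.values = list(values)
        self.calls = 0  # counts how often the expensive step really runs

    def product(self):
        # Lists are unhashable, so hash an immutable snapshot of the data
        # and pass that hash to the cached method as part of the cache key.
        key = hash(tuple(self.values))
        return self._product(key)

    @lru_cache(maxsize=1)
    def _product(self, key):
        self.calls += 1
        result = 1.0
        for v in self.values:
            result *= 1.0 + v
        return result - 1.0


cp = CachedProduct([0.5, 1.0])
first = [cp.product() for _ in range(1000)]  # computed once, then cache hits
calls_before_change = cp.calls               # 1

cp.values.append(1.0)                        # data changes, so the hash changes
second = cp.product()                        # recomputed with the new data
calls_after_change = cp.calls                # 2
```

Note that the hash itself is recomputed on every call, so this only pays off when hashing is much cheaper than the cached computation. Also, `lru_cache` on a method keeps a reference to `self` in the cache.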

You can do the same with the index, but here you don't need a custom hash, because all arguments to pd.date_range are immutable objects.

class Sampler():
    def __init__(self, df):
        self.df = df

    def update_df(self, df):
        self.df = df

    def get_resampled_sample(self, n, freq):
        resampled = self._wraper_resample_prod(freq)
        df = resampled.sample(n, replace=True)
        # attach a (cached) date index starting at the first original date
        df.index = self._create_date_range(self.df.index[0], n, freq)
        return df

    def _wraper_resample_prod(self, freq):
        # a DataFrame is not hashable, so hash its raw bytes instead
        hash_df = hash(self.df.values.tobytes())
        return self._resample_prod(hash_df, freq)

    @lru_cache(maxsize=1)
    def _resample_prod(self, hash_df, freq):
        return (self.df+1).resample(freq, axis=0).prod()-1

    @lru_cache(maxsize=1)
    def _create_date_range(self, start, periods, freq):
        # all arguments are hashable, so they can serve as the cache key directly
        return pd.date_range(start=start, periods=periods, freq=freq)

Timing:

%timeit [sampler.get_resampled_sample(500, 'BM') for i in range(1000)] 
1.11 s ± 43.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


2017-10-17 17:15:39

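Thank you very much for your answer. I agree that this will effectively speed things up. I have never used caching before, so many thanks for pointing me in the right direction. Although you have provided an overall optimization, I am still struggling to understand why indexing versus not indexing makes such a large difference (in time). I will keep the bounty open to see whether someone else can provide another answer. – math

@math I added a caching function to create the new date index. The same argument holds here. Creating the index is not that slow (about 20 ms on my computer), but doing it 1000 times in a list comprehension adds up. –

Thank you very much for the addition. Would it be cleaner to have a decorator for the indexing (without caching)? The plain function would not attach an index, while the decorated function would? – math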
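For what it's worth, the decorator idea could look roughly like the sketch below. Everything here is hypothetical and only illustrates the shape: `with_index`, `make_index` and `sample_plain` are made-up names, and a `SimpleNamespace` with an `index` attribute stands in for a DataFrame so the example runs without pandas. The index builder is cached here, although the decorator works the same way without caching.

```python
from functools import lru_cache, wraps
from types import SimpleNamespace


def with_index(make_index):
    """Decorator factory: replace the result's .index with make_index(length)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            out = func(*args, **kwargs)
            out.index = make_index(len(out.index))
            return out
        return wrapper
    return decorator


@lru_cache(maxsize=None)
def make_index(periods):
    # Hypothetical stand-in. With pandas this could instead return
    # pd.date_range(start=..., periods=periods, freq=...).
    return tuple(range(periods))


def sample_plain(n):
    # Stand-in for the sampling function: returns an object with an index.
    return SimpleNamespace(values=[0.0] * n, index=list(range(100, 100 + n)))


# The plain function keeps its original index, the decorated one gets a new one.
sample_indexed = with_index(make_index)(sample_plain)

plain = sample_plain(3)
indexed = sample_indexed(3)
indexed_again = sample_indexed(3)  # the index comes from the cache this time
```

Whether this is cleaner than assigning the index inside the method is a matter of taste. It does keep the undecorated function usable in contexts where no index is wanted.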