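I use pandas to read very large CSV files, which are also gzipped. I unzip them to CSV files of roughly 30-50 GB each. I read the files in chunks and process/manipulate them. Finally, I append the relevant data to HDF5 files, which I compress.
It works fine, but it is slow, because I have to process one file per day and I have several years' worth of data (600 TB of uncompressed CSV).
Would buying more memory (say 64 GB/128 GB) be a good way to avoid chunking the 30-50 GB files and to speed up the process? Or would that just make pandas slow and unwieldy? Am I right in saying that switching to C++ would speed things up, although I would still be stuck with the read step and would still have to process the data in chunks? Finally, does anyone have any thoughts on the best way to approach this?
By the way, once the job is done I will not have to go back over the data, so I just want it to run in a reasonable time. Writing something to parallelize the process might be good, but I have limited experience in that area and it would take me a while to build, so I would rather not unless it is the only option.
Update. I think it will be easier to see the code. In any case, I do not believe the code itself is particularly slow; I think the technique/approach may be.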
import os

import pandas as pd

# Column names for the raw file, in file order. 'Auc Vol' was split across two
# lines in the original listing. 'Premium' appeared twice; the second one is
# renamed to 'Premium2' here because pandas does not allow duplicates in names.
NAMES = ['RIC','Date','Time','GMTOffset','Type','ExCntrbID','LOC','Price','Volume','MarketVWAP',
         'BuyerID','BidPrice','BidSize','NoBuyers','SellerID','AskPrice','AskSize','NoSellers',
         'Qualifiers','SeqNo','ExchTime','BlockTrd','FloorTrd','PERatio','Yield','NewPrice','NewVol',
         'NewSeqNo','BidYld','AskYld','ISMABidYld','ISMAAskYld','Duration','ModDurtn','BPV','AccInt',
         'Convexity','BenchSpd','SwpSpd','AsstSwpSpd','SwapPoint','BasePrice','UpLimPrice','LoLimPrice',
         'TheoPrice','StockPrice','ConvParity','Premium','BidImpVol','AskImpVol','ImpVol','PrimAct',
         'SecAct','GenVal1','GenVal2','GenVal3','GenVal4','GenVal5','Crack','Top','FreightPr','1MnPft',
         '3MnPft','PrYrPft','1YrPft','3YrPft','5YrPft','10YrPft','Repurch','Offer','Kest','CapGain',
         'Actual','Prior','Revised','Forecast','FrcstHigh','FrcstLow','NoFrcts','TrdQteDate','QuoteTime',
         'BidTic','TickDir','DivCode','AdjClose','PrcTTEFlag','IrgTTEFlag','PrcSubMktId','IrgSubMktId',
         'FinStatus','DivExDate','DivPayDate','DivAmt','Open','High','Low','Last','OpenYld','HighYld',
         'LowYld','ShortPrice','ShortVol','ShortTrdVol','ShortTurnnover','ShortWeighting','ShortLimit',
         'AccVolume','Turnover','ImputedCls','ChangeType','OldValue','NewValue','Volatility','Strike',
         'Premium2','AucPrice','Auc Vol','MidPrice','FinEvalPrice','ProvEvalPrice','AdvancingIssues',
         'DecliningIssues','UnchangedIssues','TotalIssues','AdvancingVolume','DecliningVolume',
         'UnchangedVolume','TotalVolume','NewHighs','NewLows','TotalMoves','PercentageChange',
         'AdvancingMoves','DecliningMoves','UnchangedMoves','StrongMarket','WeakMarket','ChangedMarket',
         'MarketVolatility','OriginalDate','LoanAskVolume','LoanAskAmountTradingPrice',
         'PercentageShortVolumeTradedVolume','PercentageShortPriceTradedPrice','ForecastNAV',
         'PreviousDaysNAV','FinalNAV','30DayATMIVCall','60DayATMIVCall','90DayATMIVCall',
         '30DayATMIVPut','60DayATMIVPut','90DayATMIVPut','BackgroundReference','DataSource','BidSpread',
         'AskSpread','ContractPhysicalUnits','Miniumumquantity','NumberPhysicals',
         'ClosingReferencePrice','ImbalanceQuantity','FarClearingPrice','NearClearingPrice',
         'OptionAdjustedSpread','ZSpread','ConvexityPremium','ConvexityRatio','PercentageDailyReturn',
         'InterpolatedCDSBasis','InterpolatedCDSSpread','ClosesttoMaturityCDSBasis','SettlementDate',
         'EquityPrice','Parity','CreditSpread','Delta','InputVolatility','ImpliedVolatility','FairPrice',
         'BondFloor','Edge','YTW','YTB','SimpleMargin','DiscountMargin','12MonthsEPS',
         'UpperTradingLimit','LowerTradingLimit','AmountOutstanding','IssuePrice','GSpread','MiscValue',
         'MiscValueDescription']

# The only columns the original code kept after drop(), plus Date and Time,
# which are needed to build the timestamp. Everything else is never loaded.
KEEP = ['RIC', 'Date', 'Time', 'Type', 'Price', 'Volume',
        'BidPrice', 'BidSize', 'AskPrice', 'AskSize']
NUMERIC = ['Price', 'Volume', 'BidPrice', 'BidSize', 'AskPrice', 'AskSize']


def txttohdf(path, contract):
    # create an hdf5 file with high compression; append() creates each table
    # on first write, so no empty placeholder tables are needed
    with pd.HDFStore(os.path.join(path, contract + '.h5'), complevel=9, complib='blosc') as hdf:
        # walk through directories
        for subdir, _dirs, files in os.walk(path):
            for name in files:
                # create filename from directory and file
                filename = os.path.join(subdir, name)
                if not filename.endswith('.gz'):
                    continue
                # read the gzipped csv in chunks, loading only the needed columns;
                # fixed float dtypes keep every chunk compatible with the stored table
                reader = pd.read_csv(filename, compression='gzip', header=0, names=NAMES,
                                     usecols=KEEP, dtype={c: 'float64' for c in NUMERIC},
                                     chunksize=10000)
                for chunk in reader:
                    # keep only the requested contract before doing any other work
                    dfRic = chunk[chunk['RIC'] == contract].copy()
                    if dfRic.empty:
                        continue
                    # parse date and time for the whole column at once
                    # instead of calling strptime once per row
                    dfRic['datetime'] = pd.to_datetime(dfRic['Date'] + ':' + dfRic['Time'],
                                                       format='%d-%b-%Y:%H:%M:%S.%f')
                    # trades: drop empty prints and make sure all prices are valid
                    dft = dfRic[dfRic['Type'] == 'Trade'].dropna(subset=['Volume'])
                    dft = dft[dft['Price'] > 0][['datetime', 'Price', 'Volume']]
                    # quotes: clean up bid and ask
                    dfq = dfRic[dfRic['Type'] == 'Quote'].dropna(how='all', subset=['BidSize', 'AskSize'])
                    dfq = dfq[(dfq['BidSize'] > 0) | (dfq['AskSize'] > 0)]
                    dfq = dfq[['datetime', 'BidPrice', 'BidSize', 'AskPrice', 'AskSize']].ffill()
                    # add to hdf; the store is closed when the with block exits
                    hdf.append('trade', dft, format='table', data_columns=True)
                    hdf.append('quote', dfq, format='table', data_columns=True)
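One part of the listing that is likely to be costly is the timestamp construction, since calling strptime through apply(axis=1) runs one Python call per row. The following is a minimal sketch of the same parsing done on whole columns, assuming the 'Date' and 'Time' strings look like the made-up sample below; the helper names parse_rowwise and parse_vectorized are illustrative only and are not part of the question's code.

```python
import datetime

import pandas as pd

# Tiny made-up sample in the same 'Date' / 'Time' layout as the question.
sample = pd.DataFrame({
    "Date": ["02-Jan-2015", "02-Jan-2015", "05-Jan-2015"],
    "Time": ["09:30:00.123456", "09:30:01.000001", "16:00:00.500000"],
})

FMT = "%d-%b-%Y:%H:%M:%S.%f"


def parse_rowwise(df):
    """Row-by-row parsing, as in the question: one strptime call per row."""
    return df.apply(
        lambda row: datetime.datetime.strptime(row["Date"] + ":" + row["Time"], FMT),
        axis=1,
    )


def parse_vectorized(df):
    """Whole-column parsing: join the strings once, then parse them in one call."""
    return pd.to_datetime(df["Date"] + ":" + df["Time"], format=FMT)


slow = parse_rowwise(sample)
fast = parse_vectorized(sample)
```

Both functions give the same timestamps; timing them on one real chunk would show whether this step matters for the overall run.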
Can you explain what is slow, and why it is slow? Without more details it is hard to guess what would help speed up the process. –
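You should try profiling and measuring your program's performance to determine which parts are slowest and whether memory or CPU is the limiting factor. That will help narrow down which specific changes would help you. You can then also post the slowest parts of the source code as a question on http://codereview.stackexchange.com/ and ask for suggestions on improving their performance. – gfv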
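One way to act on the profiling suggestion, as a minimal sketch using only the standard library: run a single call under cProfile and look at the functions with the largest cumulative time. The helper profile_call and the stand-in function busy_work are hypothetical names introduced here for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def busy_work(n):
    # Hypothetical stand-in for the real per-file conversion.
    return sum(i * i for i in range(n))


result, report = profile_call(busy_work, 100_000)
print(report)
```

For the question's code, something like profile_call(txttohdf, path, contract) on a directory holding a single day's file should show whether the time goes into decompression, CSV parsing, the timestamp conversion, or the HDF5 writes.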
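I would try reading the compressed CSV in chunks rather than decompressing it first - that way you should have less IO (usually one of the slowest parts). Besides that, having more memory should let you use larger chunks, or even work without chunks at all if your memory is approx. twice the size of the resulting DF. Parallel processing on the same server/computer (if you meant DASK) would just add overhead. If you need real power, look at Apache PySpark SQL, but that would mean a bigger investment in a Hadoop cluster - just my 2 cents... – MaxU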
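A minimal sketch of the first suggestion, assuming pandas is allowed to decompress the gzip file itself and to load only the columns that are needed. The function filter_gzip_csv, the sample file, and its column layout are made up for illustration; the chunk size is a parameter to tune against available memory.

```python
import gzip
import os
import tempfile

import pandas as pd


def filter_gzip_csv(filename, contract, usecols, chunksize=100_000):
    """Stream a gzipped CSV in chunks and keep only the rows for one RIC."""
    kept = []
    reader = pd.read_csv(filename, compression="gzip",
                         usecols=usecols, chunksize=chunksize)
    for chunk in reader:
        kept.append(chunk[chunk["RIC"] == contract])
    return pd.concat(kept, ignore_index=True)


# Build a tiny made-up gzipped file so the sketch runs end to end.
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, "sample.csv.gz")
with gzip.open(sample_path, "wt") as fh:
    fh.write("RIC,Type,Price,Volume,Unused\n")
    fh.write("AAA,Trade,10.5,100,x\n")
    fh.write("BBB,Trade,20.0,200,y\n")
    fh.write("AAA,Trade,10.6,50,z\n")

result = filter_gzip_csv(sample_path, "AAA",
                         usecols=["RIC", "Type", "Price", "Volume"], chunksize=2)
print(result)
```

Skipping the unused columns at read time means they are never converted into DataFrame columns, which reduces both memory use and parsing work per chunk.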