2017-03-02 83 views

Pandas MemoryError with read_sql_query

I am trying to read 2.7 million rows into a pandas DataFrame but I run into a memory problem (I guess). The strange part is that when I monitor RAM usage on the server, Python uses at most 1.5 GB of the 8 GB that is free (total RAM on the server is 16 GB). With the same setup it can easily read up to 1 million rows.

What is going wrong here? Since it is not using all of the free memory, and it runs fine for smaller row counts, is the memory somehow being capped?

Below is the code and some information about the setup:
Anaconda 1.4.3 with Python 2.7 (32-bit)
Windows Server with a Xeon processor and 16 GB of RAM
SQL Server on the same machine, limited to 4 GB of RAM.

The code:

def ingest_sql(connection, nrows, alldata, refresh=False):
    """Ingests the SQL query related to the data_flag.

    :param connection: open database connection
    :param nrows: number of rows to read when alldata is not 'True'
    :param alldata: 'True' to read all rows, anything else for the top nrows
    :param refresh: unused here
    :return: data frame the query result is read into
    """
    print 'alldata:', alldata

    # Build the SELECT clause once; add TOP n only when a subset is wanted.
    if alldata == 'True':
        print "Reading All Data"
        top_clause = ''
    else:
        print 'Alldata is False'
        print "Reading only " + str(nrows) + " rows"
        top_clause = 'TOP ' + str(nrows) + ' '

    query = ('SELECT ' + top_clause +
             'te.evtdescr, te.Ref_Badge_ID, te.Ref_Reader_ID, '
             'tr.SITE_ID AS SiteID, tb.id AS badgeid, te.event_time_utc, '
             'te.empid, te.cardnum, te.eventid, tp.ID AS personid, '
             'tp.NAME, tb.BADGENO '
             'FROM TBL_EVENTS_HISTORY te '
             'INNER JOIN TBL_Badges tb ON te.Ref_Badge_ID = tb.ID '
             'INNER JOIN TBL_PERSONS tp ON tb.PERSONID = tp.ID '
             'INNER JOIN TBL_READERS tr ON te.Ref_Reader_ID = tr.ID '
             'WHERE empid > 0 AND eventid < 2 '
             'AND Ref_Badge_ID IS NOT NULL '
             'AND Ref_Reader_ID IS NOT NULL '
             'ORDER BY event_time_utc')
    print query

    df = pd.read_sql_query(query, connection)
    return df
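Twelve columns over 2.7 million rows add up quickly, especially string columns such as evtdescr and NAME, which pandas stores as Python objects. A rough way to gauge a frame's footprint (using small made-up sample data, not the real tables) is `memory_usage`:

```python
import pandas as pd

# Synthetic frame mimicking one numeric and one string column;
# deep=True also counts the Python string objects, which usually
# dominate the total for text-heavy frames.
df = pd.DataFrame({
    "empid": range(100000),
    "evtdescr": ["Access granted"] * 100000,
})
per_column = df.memory_usage(deep=True)
print(per_column)
print("total: %.1f MB" % (per_column.sum() / 1024.0 ** 2))
```

Scaling such an estimate up by 27x gives a feel for why a few GB of address space may not be enough.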

Here is the error:

global start_time 
" the MASTER GLUE FUNCTION 
pandas imported 
all external packages imporated 
WIC: future imported 
banana phone 
DRIVER={SQL Server};SERVER=10.180.10.67;DATABASE=SAFEANALYTICS;UID=safeapp;PWD=safeapp 
winter is coming imported 
TBL_READERS 
ID 
Starting: 
full_run: True 
date_flag is False 
alldata: True 
Reading All Data 
Select te.evtdescr, te.Ref_Badge_ID, te.Ref_Reader_ID, tr.SITE_ID AS SiteID, tb.id AS badgeid, te.event_time_utc, te.empid, te.cardnum, te.eventid, tp.ID AS personid, tp.NAME, tb.BADGENO FROM TBL_EVENTS_HISTORY te INNER JOIN TBL_Badges tb ON te.Ref_Badge_ID = tb.ID INNER JOIN TBL_PERSONS tp ON tb.PERSONID = tp.ID INNER JOIN TBL_READERS tr ON te.Ref_Reader_ID = tr.ID WHERE empid>0 AND eventid<2 AND Ref_Badge_ID IS NOT NULL and Ref_Reader_ID IS NOT NULL ORDER BY event_time_utc 
Traceback (most recent call last): 
    File "C:\Transfer\Project\VARYS_DRS_02232017\Calculate_Risk.py", line 141, in <module> 
    make_risk_tables(dev=args.dev,nrows_0=args.nrows_0,nrows=args.nrows,dataflag=args.data_flag,all_data=True) 
    File "C:\Transfer\Project\VARYS_DRS_02232017\Calculate_Risk.py", line 35, in make_risk_tables 
    WINterIsComing_with_devid.WinVarys(nrows=nrows_0,data_flag=dataflag,refresh=dev,alldata=all_data) 
    File "C:\Transfer\Project\VARYS_DRS_02232017\WINterIsComing_with_devid.py", line 151, in WinVarys 
    df = read_columns_into_df(data_flag, df) 
    File "C:\Transfer\Project\VARYS_DRS_02232017\WINterIsComing_with_devid.py", line 112, in read_columns_into_df 
    df=df.drop_duplicates() 
    File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\util\decorators.py", line 91, in wrapper 
    return func(*args, **kwargs) 
    File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3138, in drop_duplicates 
    duplicated = self.duplicated(subset, keep=keep) 
    File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\util\decorators.py", line 91, in wrapper 
    return func(*args, **kwargs) 
    File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3188, in duplicated 
    labels, shape = map(list, zip(*map(f, vals))) 
    File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3177, in f 
    _SIZE_HINT_LIMIT)) 
    File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\algorithms.py", line 313, in factorize 
    labels = table.get_labels(vals, uniques, 0, na_sentinel, True) 
    File "pandas\src\hashtable_class_helper.pxi", line 839, in pandas.hashtable.PyObjectHashTable.get_labels (pandas\hashtable.c:15395) 
MemoryError 
[Finished in 58.3s] 
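Note that a 32-bit process on Windows is limited to roughly 2 GB of user address space (regardless of how much RAM is free), and `drop_duplicates` builds an extra hash table on top of the frame itself, which is where the traceback ends. The pointer size reveals which interpreter is actually running:

```python
import struct
import sys

# Pointer size in bits: 32 for a 32-bit interpreter, 64 for a 64-bit one.
bits = struct.calcsize("P") * 8
print("Running %d-bit Python %s" % (bits, sys.version.split()[0]))
```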
Try 64-bit Python –

I use Cython in other parts of the project, which requires 32-bit :( – PyRaider

Cython does not require 32-bit Python –

Answer


As Paul suggested, upgrading Python 2.7 from 32-bit to 64-bit worked. I am not entirely sure why it works, but compiling the Cython code with Microsoft Visual C++ Compiler for Python against 64-bit Python proved difficult, so I had to drop the Cython code.
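If staying on 32-bit were unavoidable, another common workaround is to pass `chunksize` to `read_sql_query`, which returns an iterator of DataFrames instead of one large frame, so the whole result never has to be materialized at once. A minimal sketch, using an in-memory SQLite table standing in for the SQL Server source (table and column names here are invented for the demo):

```python
import sqlite3

import pandas as pd

# 1000 rows but only 100 distinct (empid, eventid) pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (empid INTEGER, eventid INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i % 50, i // 500) for i in range(1000)])

pieces = []
for chunk in pd.read_sql_query("SELECT empid, eventid FROM events",
                               conn, chunksize=200):
    # Dropping duplicates per chunk keeps the accumulated pieces small.
    pieces.append(chunk.drop_duplicates())
df = pd.concat(pieces, ignore_index=True).drop_duplicates()
print(len(df))  # 100
```

The per-chunk de-duplication only helps when duplicates are frequent; the final `drop_duplicates` still needs the surviving rows to fit in memory.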