Pandas MemoryError with read_sql_query

I want to read 2.7 million rows into a pandas DataFrame, but I run into a memory problem (I guess). The strange part is that when I monitor RAM usage on the server, Python uses at most 1.5 GB of the 8 GB that is free (total RAM on the server is 16 GB). With the same setup it can easily read up to 1 million rows.
What is the problem here? Since it is not using all the free memory, and it runs fine for smaller row counts, is something capping the memory it can use?
Below is the code and some information about the setup:
Anaconda 1.4.3 with Python 2.7 (32-bit)
Windows Server with a Xeon processor and 16 GB RAM
SQL Server on the same machine, limited to 4 GB RAM
Code:
def ingest_sql(connection, nrows, alldata, refresh=False):
    """Ingests the SQL query related to the data_flag.

    :param connection: open database connection
    :param nrows: number of rows to read when alldata is not 'True'
    :param alldata: 'True' reads every row, anything else reads nrows rows
    :param refresh: unused here, kept for interface compatibility
    :return: data frame holding the query result
    """
    select_cols = ('te.evtdescr, te.Ref_Badge_ID, te.Ref_Reader_ID, '
                   'tr.SITE_ID AS SiteID, tb.id AS badgeid, te.event_time_utc, '
                   'te.empid, te.cardnum, te.eventid, tp.ID AS personid, '
                   'tp.NAME, tb.BADGENO')
    from_where = ('FROM TBL_EVENTS_HISTORY te '
                  'INNER JOIN TBL_Badges tb ON te.Ref_Badge_ID = tb.ID '
                  'INNER JOIN TBL_PERSONS tp ON tb.PERSONID = tp.ID '
                  'INNER JOIN TBL_READERS tr ON te.Ref_Reader_ID = tr.ID '
                  'WHERE empid > 0 AND eventid < 2 AND '
                  'Ref_Badge_ID IS NOT NULL AND '
                  'Ref_Reader_ID IS NOT NULL ORDER BY event_time_utc')
    print 'alldata:', alldata
    if alldata == 'True':
        print "Reading All Data"
        query = 'SELECT ' + select_cols + ' ' + from_where
    else:
        print 'Alldata is False'
        print "Reading only " + str(nrows) + " rows"
        query = 'SELECT TOP ' + str(nrows) + ' ' + select_cols + ' ' + from_where
    print query
    df = pd.read_sql_query(query, connection)
    return df
Here is the error:
global start_time
" the MASTER GLUE FUNCTION
pandas imported
all external packages imporated
WIC: future imported
banana phone
DRIVER={SQL Server};SERVER=10.180.10.67;DATABASE=SAFEANALYTICS;UID=safeapp;PWD=safeapp
winter is coming imported
TBL_READERS
ID
Starting:
full_run: True
date_flag is False
alldata: True
Reading All Data
Select te.evtdescr, te.Ref_Badge_ID, te.Ref_Reader_ID, tr.SITE_ID AS SiteID, tb.id AS badgeid, te.event_time_utc, te.empid, te.cardnum, te.eventid, tp.ID AS personid, tp.NAME, tb.BADGENO FROM TBL_EVENTS_HISTORY te INNER JOIN TBL_Badges tb ON te.Ref_Badge_ID = tb.ID INNER JOIN TBL_PERSONS tp ON tb.PERSONID = tp.ID INNER JOIN TBL_READERS tr ON te.Ref_Reader_ID = tr.ID WHERE empid>0 AND eventid<2 AND Ref_Badge_ID IS NOT NULL and Ref_Reader_ID IS NOT NULL ORDER BY event_time_utc
Traceback (most recent call last):
File "C:\Transfer\Project\VARYS_DRS_02232017\Calculate_Risk.py", line 141, in <module>
make_risk_tables(dev=args.dev,nrows_0=args.nrows_0,nrows=args.nrows,dataflag=args.data_flag,all_data=True)
File "C:\Transfer\Project\VARYS_DRS_02232017\Calculate_Risk.py", line 35, in make_risk_tables
WINterIsComing_with_devid.WinVarys(nrows=nrows_0,data_flag=dataflag,refresh=dev,alldata=all_data)
File "C:\Transfer\Project\VARYS_DRS_02232017\WINterIsComing_with_devid.py", line 151, in WinVarys
df = read_columns_into_df(data_flag, df)
File "C:\Transfer\Project\VARYS_DRS_02232017\WINterIsComing_with_devid.py", line 112, in read_columns_into_df
df=df.drop_duplicates()
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\util\decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3138, in drop_duplicates
duplicated = self.duplicated(subset, keep=keep)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\util\decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3188, in duplicated
labels, shape = map(list, zip(*map(f, vals)))
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3177, in f
_SIZE_HINT_LIMIT))
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\algorithms.py", line 313, in factorize
labels = table.get_labels(vals, uniques, 0, na_sentinel, True)
File "pandas\src\hashtable_class_helper.pxi", line 839, in pandas.hashtable.PyObjectHashTable.get_labels (pandas\hashtable.c:15395)
MemoryError
[Finished in 58.3s]
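Note that the traceback shows the MemoryError happening inside drop_duplicates(), after the full result set is already in memory, and a 32-bit Python process on Windows is capped at roughly 2 GB of address space regardless of free RAM, which matches the ~1.5 GB ceiling observed above. If switching to 64-bit Python is not an option, one workaround is to stream the result set with read_sql_query's chunksize parameter and deduplicate incrementally, so the peak footprint stays closer to one chunk plus the deduplicated result. A minimal sketch (function name and chunk size are illustrative, not from the original project):

```python
import pandas as pd

def ingest_sql_chunked(connection, query, chunksize=100000):
    """Read a large result set in chunks, deduplicating as we go.

    With chunksize set, read_sql_query returns an iterator of
    DataFrames instead of one huge frame, avoiding a single large
    allocation followed by a full-table drop_duplicates().
    """
    result = None
    for chunk in pd.read_sql_query(query, connection, chunksize=chunksize):
        chunk = chunk.drop_duplicates()
        if result is None:
            result = chunk
        else:
            # Re-deduplicate after each concat so duplicates that span
            # chunk boundaries are also removed.
            result = pd.concat([result, chunk], ignore_index=True).drop_duplicates()
    return result
```

This trades some CPU time (repeated deduplication) for a smaller peak memory footprint; it only helps if the deduplicated result itself fits in the 32-bit address space.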
Try 64-bit Anaconda –
I use Cython in other parts of the project, and that needs 32-bit :( – PyRaider
Cython does not require 32-bit Python –
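Following up on the comments above: a quick way to confirm which build of the interpreter is actually running is to check its pointer size (a generic snippet, not specific to this project):

```python
import struct
import sys

# A 32-bit CPython build reports 32 here even on a 64-bit OS; such a
# process is limited to a few GB of address space on Windows, no matter
# how much physical RAM is free.
bits = struct.calcsize('P') * 8
print('Running %d-bit Python %s' % (bits, sys.version.split()[0]))
```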