
Memory error when doing machine learning with pandas in Python

I am trying to do machine-learning training/testing by sampling 100,000 rows from a larger DataFrame. It works as expected with random samples of 30,000–60,000 rows, but when I increase the sample to 100,000+ it gives me a memory error.

# coding=utf-8 
import pandas as pd 
from pandas import DataFrame, Series 
import numpy as np 
import nltk 
import re 
import random 
from random import randint 
import csv 
import dask.dataframe as dd 
import sys 
reload(sys) 
sys.setdefaultencoding('utf-8') 

from sklearn.linear_model import LogisticRegression 
from sklearn.feature_extraction import DictVectorizer 
from sklearn.preprocessing import Imputer 

lr = LogisticRegression() 
dv = DictVectorizer() 
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0) 

# Get csv file into data frame 
data = pd.read_csv("file.csv", header=0, encoding="utf-8") 
df = DataFrame(data) 

# Random sampling a smaller dataframe for debugging 
rows = random.sample(df.index, 100000) 
df = df.ix[rows] # Warning!!!! overwriting original df 

# Assign X and y variables 
X = df.raw_name.values 
y = df.ethnicity2.values 

# Feature extraction functions 
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1:  # reject names with only one character
            return last_name
        else:
            return '?'
    except:  # missing/malformed names fall back to the placeholder
        return '?'

# Transform format of X variables, and spit out a numpy array for all features 
my_dict = [{'last-name': feature_full_last_name(i)} for i in X] 

all_dict = my_dict 

newX = dv.fit_transform(all_dict).toarray() 

# Separate the training and testing data sets 
half_cut = int(len(df)/2.0)*-1 
X_train = newX[:half_cut] 
X_test = newX[half_cut:] 
y_train = y[:half_cut] 
y_test = y[half_cut:] 

# Fitting X and y into model, using training data 
lr.fit(X_train, y_train) 

# Making predictions using trained data 
y_train_predictions = lr.predict(X_train) 
y_test_predictions = lr.predict(X_test) 

print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0]) 
print (y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0]) 

Error message:

Traceback (most recent call last): 
    File "C:\Users\Dropbox\Python_Exercises\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\MachineLearning\FamSearch_LogReg_GOOD8.py", line 93, in <module> 
    newX = dv.fit_transform(all_dict).toarray() 
    File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\compressed.py", line 942, in toarray 
    return self.tocoo(copy=False).toarray(order=order, out=out) 
    File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\coo.py", line 274, in toarray 
    B = self._process_toarray_args(order, out) 
    File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\base.py", line 793, in _process_toarray_args 
    return np.zeros(self.shape, dtype=self.dtype, order=order) 
MemoryError 

How much memory do you have? –


I have 16.0 GB of RAM. My Python is 2.7.6 [MSC v.1500 64 bit (AMD64)] on win32 – KubiK888


How many columns are in the data? What are the dtypes? How large is the original data read from the csv, compared to the 100k-row sample? That original DataFrame still exists in memory, so you may want to delete it before doing the analysis. In fact, `data` itself is probably still around as well; delete it. – Alexander
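A minimal sketch of that cleanup, reusing the variable names from the question's code (the `del` statements and `gc.collect()` call are illustrative additions, not from the original post):

import gc

rows = random.sample(df.index, 100000)
sample = df.ix[rows]  # take the 100k-row sample first
del data, df          # drop both references to the full dataset
gc.collect()          # prompt Python to release the freed memory
df = sample           # continue working with the sample only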

Answer


This looks wrong:

newX = dv.fit_transform(all_dict).toarray() 

because almost all estimators in scikit-learn support sparse datasets, but you are trying to make your sparse dataset dense. Of course it will consume a huge amount of memory. You need to avoid the todense() and toarray() methods in your code.
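A minimal sketch of the sparse version, reusing the `dv`, `lr`, `all_dict`, and `y` objects from the question (the half-and-half split mirrors the original code; scipy's CSR matrices support row slicing, so nothing here is ever densified):

# keep the DictVectorizer output sparse -- do NOT call .toarray()
newX = dv.fit_transform(all_dict)           # scipy.sparse CSR matrix

half = newX.shape[0] // 2                   # same 50/50 split as the question
X_train, X_test = newX[:half], newX[half:]  # row slicing works on CSR matrices
lr.fit(X_train, y[:half])                   # LogisticRegression accepts sparse input
print lr.score(X_test, y[half:])            # accuracy on the held-out half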


I don't know how else to prepare the data in the format the ML classifier needs for training; please advise. – KubiK888


@KubiK888, just remove '.toarray()' from that line; I think everything else should still work. –
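In other words, the suggested one-line change is:

newX = dv.fit_transform(all_dict)  # was: dv.fit_transform(all_dict).toarray()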


That did it, thanks – KubiK888