
TLDR

I have been trying to fit a simple neural network to MNIST. It works on a small debugging setup, but as soon as I move it to a subset of MNIST it trains super fast and the gradient approaches 0 very quickly, yet then it outputs the same value for every input and the final cost is quite high. I have been trying to deliberately overfit to make sure it is actually working, but it will not do so on MNIST, which suggests a deep problem in the setup. I have checked my backpropagation implementation with gradient checking and it seems to match, so I don't know where the bug is, or what to try next!

Many thanks for any help you can offer, I have been struggling with this!

Explanation

I have been trying to build a neural network in numpy, based on these explanations: http://ufldl.stanford.edu/wiki/index.php/Neural_Networks http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm

Backpropagation seems to match gradient checking:

Backpropagation: [ 0.01168585, 0.06629858, -0.00112408, -0.00642625, -0.01339408, 
    -0.07580145, 0.00285868, 0.01628148, 0.00365659, 0.0208475 , 
    0.11194151, 0.16696139, 0.10999967, 0.13873069, 0.13049299, 
    -0.09012582, -0.1344335 , -0.08857648, -0.11168955, -0.10506167] 
Gradient Checking: [-0.01168585 -0.06629858 0.00112408 0.00642625 0.01339408 
    0.07580145 -0.00285868 -0.01628148 -0.00365659 -0.0208475 
    -0.11194151 -0.16696139 -0.10999967 -0.13873069 -0.13049299 
    0.09012582 0.1344335 0.08857648 0.11168955 0.10506167] 
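
For reference, rather than comparing the two vectors by eye, their agreement can be summarised as a single relative-error number. This is just a minimal sketch, assuming grad and numgrad are the unrolled vectors returned by backpropIter and gradCheck in the code dump below:

    import numpy as np

    def relative_error(grad, numgrad):
        # Small values (around 1e-7 or less) suggest the analytic gradient matches
        # the numerical one; a consistent sign flip or constant scale factor between
        # the two shows up here as a large value.
        return np.linalg.norm(grad - numgrad) / (np.linalg.norm(grad) + np.linalg.norm(numgrad))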

When I train on this simple debugging setup:

a is a neural net w/ 2 inputs -> 5 hidden -> 2 outputs, and learning rate 0.5 
a.gradDesc(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]])) 
ie. x1 = [0.1, 0.9] and y1 = [0,1] 

I get these lovely training curves: [plots of error vs. iterations and gradient vs. iterations]

Admittedly, this is obviously a simplified and very easy function to fit. However, as soon as I move it over to MNIST, with this setup:

# Number of input, hidden and ouput nodes 
    # Input = 28 x 28 pixels 
    input_nodes=784 
    # Arbitrary number of hidden nodes, experiment to improve 
    hidden_nodes=200 
    # Output = one of the digits [0,1,2,3,4,5,6,7,8,9] 
    output_nodes=10 

    # Learning rate 
    learning_rate=0.4 

    # Regularisation parameter 
    lambd=0.0 

With this setup and the code below, over 100 iterations it seems to train at first and then just "flat lines" quite quickly, without reaching a very good model:

Initial ===== Cost (unregularised): 2.09203670985 /// Cost (regularised):  2.09203670985 Mean Gradient: 0.0321241229793 
Iteration 100 Cost (unregularised): 0.980999805477 /// Cost (regularised): 0.980999805477 Mean Gradient: -5.29639499854e-09 
TRAINED IN 26.45932364463806 

This then gives a really poor test accuracy and predicts the same output, even when tested with all inputs set to 0.1 or all 0s; I just get the same result (although exactly which digit it outputs varies with the initial random weights):

Test accuracy: 8.92 
Targets 2 2 1 7 2 2 0 2 3 
Hypothesis 5 5 5 5 5 5 5 5 5 
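
One quick way to see whether this is output saturation or the network collapsing to a constant is to look at the spread of the activations directly. A rough diagnostic sketch, assuming the sigmoid helper and a trained nn instance from the code dump below (it mirrors the forward pass used in backpropIter, without the extra +1 that predict adds):

    import numpy as np

    def inspect_activations(net, X):
        # Forward pass (as in backpropIter) to check for saturated or constant units
        a2 = sigmoid(np.dot(X, net.ih.T))
        h = sigmoid(np.dot(a2, net.ho.T))
        print("hidden  min/mean/max:", a2.min(), a2.mean(), a2.max())
        print("output  min/mean/max:", h.min(), h.mean(), h.max())
        print("output std across examples:", h.std(axis=0))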

And the training curves for MNIST: [plots of error vs. iterations and gradient vs. iterations]

Code dump:

# Import dependencies 
import numpy as np 
import time 
import csv 
import matplotlib.pyplot 
import random 
import math 

# Read in training data 
with open('MNIST/mnist_train_100.csv') as file: 
    train_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()]) 


# In[197]: 

# Plot a sample of training data to visualise 
displayData(train_data[:,1:], 25) 


# In[198]: 

# Read in test data 
with open('MNIST/mnist_test.csv') as file: 
    test_data=np.array([list(map(int,line.strip().split(','))) for line in file.readlines()]) 

# Main neural network class 
class neuralNetwork: 
    # Define the architecture 
    def __init__(self, i, h, o, lr, lda): 
     # Number of nodes in each layer 
     self.i=i 
     self.h=h 
     self.o=o 
     # Learning rate 
     self.lr=lr 
     # Lambda for regularisation 
     self.lda=lda 

     # Randomly initialise the parameters, input-> hidden and hidden-> output 
     self.ih=np.random.normal(0.0,pow(self.h,-0.5),(self.h,self.i)) 
     self.ho=np.random.normal(0.0,pow(self.o,-0.5),(self.o,self.h)) 

    def predict(self, X): 
     # GET HYPOTHESIS ESTIMATES/ OUTPUTS 
     # Add bias node x(0)=1 for all training examples, X is now m x n+1 
     # Then compute activation to hidden node 
     z2=np.dot(X,self.ih.T) + 1 
     #print(a1.shape) 
     a2=sigmoid(z2) 
     #print(ha) 
     # Add bias node h(0)=1 for all training examples, H is now m x h+1 
     # Then compute activation to output node 
     z3=np.dot(a2,self.ho.T) + 1 
     h=sigmoid(z3) 
     outputs=np.argmax(h.T,axis=0) 

     return outputs 

    def backprop (self, X, y): 
     try: 
      m = X.shape[0] 
     except: 
      m=1 

     # GET HYPOTHESIS ESTIMATES/ OUTPUTS 
     # Add bias node x(0)=1 for all training examples, X is now m x n+1 
     # Then compute activation to hidden node 
     z2=np.dot(X,self.ih.T) 
     #print(a1.shape) 
     a2=sigmoid(z2) 
     #print(ha) 
     # Add bias node h(0)=1 for all training examples, H is now m x h+1 
     # Then compute activation to output node 
     z3=np.dot(a2,self.ho.T) 
     h=sigmoid(z3) 

     # Compute error/ cost for this setup (unregularised and regularise) 
     costReg=self.costFunc(h,y) 
     costUn=self.costFuncReg(h,y) 

     # Output error term 
     d3=-(y-h)*sigmoidGradient(z3) 

     # Hidden error term 
     d2=np.dot(d3,self.ho)*sigmoidGradient(z2) 

     # Partial derivatives for weights 
     D2=np.dot(d3.T,a2) 
     D1=np.dot(d2.T,X) 

     # Partial derivatives of theta with regularisation 
     T2Grad=(D2/m)+(self.lda/m)*(self.ho) 
     T1Grad=(D1/m)+(self.lda/m)*(self.ih) 

     # Update weights 
     # Hidden layer (weights 1) 
     self.ih-=self.lr*(((D1)/m) + (self.lda/m)*self.ih) 
     # Output layer (weights 2) 
     self.ho-=self.lr*(((D2)/m) + (self.lda/m)*self.ho) 

     # Unroll gradients to one long vector 
     grad=np.concatenate(((T1Grad).ravel(),(T2Grad).ravel())) 

     return costReg, costUn, grad 

    def backpropIter (self, X, y): 
     try: 
      m = X.shape[0] 
     except: 
      m=1 

     # GET HYPOTHESIS ESTIMATES/ OUTPUTS 
     # Add bias node x(0)=1 for all training examples, X is now m x n+1 
     # Then compute activation to hidden node 
     z2=np.dot(X,self.ih.T) 
     #print(a1.shape) 
     a2=sigmoid(z2) 
     #print(ha) 
     # Add bias node h(0)=1 for all training examples, H is now m x h+1 
     # Then compute activation to output node 
     z3=np.dot(a2,self.ho.T) 
     h=sigmoid(z3) 

     # Compute error/ cost for this setup (unregularised and regularise) 
     costUn=self.costFunc(h,y) 
     costReg=self.costFuncReg(h,y) 

     gradW1=np.zeros(self.ih.shape) 
     gradW2=np.zeros(self.ho.shape) 
     for i in range(m): 
      delta3 = -(y[i,:]-h[i,:])*sigmoidGradient(z3[i,:]) 
      delta2 = np.dot(self.ho.T,delta3)*sigmoidGradient(z2[i,:]) 

      gradW2= gradW2 + np.outer(delta3,a2[i,:]) 
      gradW1 = gradW1 + np.outer(delta2,X[i,:]) 

     # Update weights 
     # Hidden layer (weights 1) 
     #self.ih-=self.lr*(((gradW1)/m) + (self.lda/m)*self.ih) 
     # Output layer (weights 2) 
     #self.ho-=self.lr*(((gradW2)/m) + (self.lda/m)*self.ho) 

     # Unroll gradients to one long vector 
     grad=np.concatenate(((gradW1).ravel(),(gradW2).ravel())) 

     return costUn, costReg, grad 

    def gradDesc(self, X, y): 
     # Backpropagate to get updates 
     cost,costreg,grad=self.backpropIter(X,y) 

     # Unroll parameters 
     deltaW1=np.reshape(grad[0:self.h*self.i],(self.h,self.i)) 
     deltaW2=np.reshape(grad[self.h*self.i:],(self.o,self.h)) 

     # m = no. training examples 
     m=X.shape[0] 
     #print (self.ih) 
     self.ih -= self.lr * ((deltaW1))#/m) + (self.lda * self.ih)) 
     self.ho -= self.lr * ((deltaW2))#/m) + (self.lda * self.ho)) 
     #print(deltaW1) 
     #print(self.ih) 
     return cost,costreg,grad 


    # Gradient checking to compute the gradient numerically to debug backpropagation 
    def gradCheck(self, X, y): 
     # Unroll theta 
     theta=np.concatenate(((self.ih).ravel(),(self.ho).ravel())) 
     # perturb will add and subtract epsilon, numgrad will store answers 
     perturb=np.zeros(len(theta)) 
     numgrad=np.zeros(len(theta)) 
     # epsilon, e is a small number 
     e = 0.00001 
     # Loop over all theta 
     for i in range(len(theta)): 
      # Perturb is zeros with one index being e 
      perturb[i]=e 
      loss1=self.costFuncGradientCheck(theta-perturb, X, y) 
      loss2=self.costFuncGradientCheck(theta+perturb, X, y) 
      # Compute numerical gradient and update vectors 
      numgrad[i]=(loss1-loss2)/(2*e) 
      perturb[i]=0 
     return numgrad 

    def costFuncGradientCheck(self,theta,X,y): 
     T1=np.reshape(theta[0:self.h*self.i],(self.h,self.i)) 
     T2=np.reshape(theta[self.h*self.i:],(self.o,self.h)) 
     m=X.shape[0] 
     # GET HYPOTHESIS ESTIMATES/ OUTPUTS 
     # Compute activation to hidden node 
     z2=np.dot(X,T1.T) 
     a2=sigmoid(z2) 
     # Compute activation to output node 
     z3=np.dot(a2,T2.T) 
     h=sigmoid(z3) 

     cost=self.costFunc(h, y) 
     return cost #+ ((self.lda/2)*(np.sum(pow(T1,2)) + np.sum(pow(T2,2)))) 

    def costFunc(self, h, y): 
     m=h.shape[0] 
     return np.sum(pow((h-y),2))/m 

    def costFuncReg(self, h, y): 
     cost=self.costFunc(h, y) 
     return cost #+ ((self.lda/2)*(np.sum(pow(self.ih,2)) + np.sum(pow(self.ho,2)))) 

# Helper functions to compute sigmoid and gradient for an input number or matrix 
def sigmoid(Z): 
    return np.divide(1,np.add(1,np.exp(-Z))) 
def sigmoidGradient(Z): 
    return sigmoid(Z)*(1-sigmoid(Z)) 

# Pre-processing helper functions 
# Normalise data to 0.1-1 as 0 inputs kills the weights and changes 
def scaleDataVec(data): 
    return (np.asfarray(data[1:])/255.0 * 0.99) + 0.1 

def scaleData(data): 
    return (np.asfarray(data[:,1:])/255.0 * 0.99) + 0.1 

# DISPLAY DATA 
# plot_data will be what to plot, num_ex must be a square number of how many examples to plot, random examples will then be plotted 
def displayData(plot_data, num_ex, rand=1): 
    if rand==0: 
     data=plot_data 
    else: 
     rand_indexes=random.sample(range(plot_data.shape[0]),num_ex) 
     data=plot_data[rand_indexes,:] 
    # Useful variables, m= no. train ex, n= no. features 
    m=data.shape[0] 
    n=data.shape[1] 
    # Shape for one example 
    example_width=math.ceil(math.sqrt(n)) 
    example_height=math.ceil(n/example_width) 
    # No. of items to display 
    display_rows=math.floor(math.sqrt(m)) 
    display_cols=math.ceil(m/display_rows) 
    # Padding between images 
    pad=1 
    # Setup blank display 
    display_array = -np.ones((pad + display_rows * (example_height + pad), (pad + display_cols * (example_width + pad)))) 
    curr_ex=0 
    for i in range(1,display_rows+1): 
     for j in range(1,display_cols+1): 
      if curr_ex>m: 
       break 
      # Max value of this patch 
      max_val=max(abs(data[curr_ex, :])) 
      display_array[pad + (j-1) * (example_height + pad) : j*(example_height+1), pad + (i-1) * (example_width + pad) :       i*(example_width+1)] = data[curr_ex, :].reshape(example_height, example_width)/max_val 
      curr_ex+=1 

    matplotlib.pyplot.imshow(display_array, cmap='Greys', interpolation='None') 


# In[312]: 

a=neuralNetwork(2,5,2,0.5,0.0) 
print(a.backpropIter(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]]))) 
print(a.gradCheck(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]]))) 
D=[] 
C=[] 
for i in range(100): 
    c,b,d=a.gradDesc(np.array([[0.1,0.9],[0.2,0.8]]),np.array([[0,1],[0,1]])) 
    C.append(c) 
    D.append(np.mean(d)) 
    #print(c) 

print(a.predict(np.array([[0.1,0.9]]))) 
# Debugging plot 
matplotlib.pyplot.figure() 
matplotlib.pyplot.plot(C) 
matplotlib.pyplot.ylabel("Error") 
matplotlib.pyplot.xlabel("Iterations") 
matplotlib.pyplot.figure() 
matplotlib.pyplot.plot(D) 
matplotlib.pyplot.ylabel("Gradient") 
matplotlib.pyplot.xlabel("Iterations") 
#print(J) 


# In[313]: 

# Class instance 

# Number of input, hidden and ouput nodes 
# Input = 28 x 28 pixels 
input_nodes=784 
# Arbitrary number of hidden nodes, experiment to improve 
hidden_nodes=200 
# Output = one of the digits [0,1,2,3,4,5,6,7,8,9] 
output_nodes=10 

# Learning rate 
learning_rate=0.4 

# Regularisation parameter 
lambd=0.0 

# Create instance of Nnet class 
nn=neuralNetwork(input_nodes,hidden_nodes,output_nodes,learning_rate,lambd) 


# In[314]: 

time1=time.time() 
# Scale inputs 
inputs=scaleData(train_data) 
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target 
targets=(np.identity(output_nodes)*0.98)[train_data[:,0],:]+0.01 
J=[] 
JR=[] 
Grad=[] 
iterations=100 
for i in range(iterations): 
    j,jr,grad=nn.gradDesc(inputs, targets) 
    grad=np.mean(grad) 
    if i == 0: 
     print("Initial ===== Cost (unregularised): ", j, "\t///", "Cost (regularised): ",jr," Mean Gradient: ",grad) 
    print("\r", end="") 
    print("Iteration ", i+1, "\tCost (unregularised): ", j, "\t///", "Cost (regularised): ", jr," Mean Gradient: ",grad,end="") 
    J.append(j) 
    JR.append(jr) 
    Grad.append(grad) 
time2 = time.time() 
print ("\nTRAINED IN ",time2-time1) 


# In[315]: 

# Debugging plot 
matplotlib.pyplot.figure() 
matplotlib.pyplot.plot(J) 
matplotlib.pyplot.plot(JR) 
matplotlib.pyplot.ylabel("Error") 
matplotlib.pyplot.xlabel("Iterations") 
matplotlib.pyplot.figure() 
matplotlib.pyplot.plot(Grad) 
matplotlib.pyplot.ylabel("Gradient") 
matplotlib.pyplot.xlabel("Iterations") 
#print(J) 


# In[316]: 

# Scale inputs 
inputs=scaleData(test_data) 
# 0.01-0.99 range as the sigmoid function can't reach 0 or 1, 0.01 for all except 0.99 for target 
targets=test_data[:,0] 
h=nn.predict(inputs) 
score=[] 
targ=[] 
hyp=[] 
for i,line in enumerate(targets): 
    if line == h[i]: 
     score.append(1) 
    else: 
     score.append(0) 
    hyp.append(h[i]) 
    targ.append(line) 
print("Test accuracy: ", sum(score)/len(score)*100) 
indexes=random.sample(range(len(hyp)),9) 
print("Targets ",end="") 
for j in indexes: 
    print (targ[j]," ",end="") 
print("\nHypothesis ",end="") 
for j in indexes: 
    print (hyp[j]," ",end="") 
displayData(test_data[indexes, 1:], 9, rand=0) 


# In[277]: 

nn.predict(0.9*np.ones((784,))) 

Edit 1

Different learning rates were suggested, but unfortunately they all came out with similar results. Here are the plots for 30 iterations, using the MNIST 100 set:

[plots of error vs. iterations and gradient vs. iterations for the different learning rates]

Specifically, here are the figures they start and end up with:

Initial ===== Cost (unregularised): 4.07208963507 /// Cost (regularised): 4.07208963507 Mean Gradient: 0.0540251381858 
Iteration 50 Cost (unregularised): 0.613310215166 /// Cost (regularised): 0.613310215166 Mean Gradient: -0.000133981500849 

Initial ===== Cost (unregularised): 5.67535252616 /// Cost (regularised): 5.67535252616 Mean Gradient: 0.0644797515914 
Iteration 50 Cost (unregularised): 0.381080434935 /// Cost (regularised): 0.381080434935 Mean Gradient: 0.000427866902699 

Initial ===== Cost (unregularised): 3.54658422176 /// Cost (regularised): 3.54658422176 Mean Gradient: 0.0672211732868 
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: 2.34515341943e-20 

Initial ===== Cost (unregularised): 4.05269658215 /// Cost (regularised): 4.05269658215 Mean Gradient: 0.0469666696193 
Iteration 50 Cost (unregularised): 0.980999999999 /// Cost (regularised): 0.980999999999 Mean Gradient: -1.0582706063e-14 

Initial ===== Cost (unregularised): 2.40881492228 /// Cost (regularised): 2.40881492228 Mean Gradient: 0.0516056901574 
Iteration 50 Cost (unregularised): 1.74539997258 /// Cost (regularised): 1.74539997258 Mean Gradient: 1.01955789614e-09 

Initial ===== Cost (unregularised): 2.58498876008 /// Cost (regularised): 2.58498876008 Mean Gradient: 0.0388768685257 
Iteration 3 Cost (unregularised): 1.72520399313 /// Cost (regularised): 1.72520399313 Mean Gradient: 0.0134040908157 
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: -4.49319474346e-43 

Initial ===== Cost (unregularised): 4.40141352357 /// Cost (regularised): 4.40141352357 Mean Gradient: 0.0689167742968 
Iteration 50 Cost (unregularised): 0.981 /// Cost (regularised): 0.981 Mean Gradient: -1.01563966458e-22 

A learning rate of 0.01, which is fairly low, gave the best result, but exploring learning rates in this region I only came out with 30-40% accuracy. That is a big improvement on the 8% or even 0% I had seen before, but it is really not what it should be achieving!
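
For completeness, the sort of sweep described above can be scripted in a few lines. This is just a sketch reusing the neuralNetwork class and the scaled inputs/targets from the code dump; the learning-rate values are illustrative:

    # Hypothetical learning-rate sweep over the MNIST-100 subset
    for lr in [0.01, 0.05, 0.1, 0.2, 0.4]:
        net = neuralNetwork(784, 200, 10, lr, 0.0)
        for i in range(30):
            j, jr, grad = net.gradDesc(inputs, targets)
        print("lr =", lr, "-> final cost:", j)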

Edit 2

I have now finished and added a backpropagation function optimised to work with matrices rather than the iterative formula, so now I can run it for large numbers of epochs/iterations without it being painfully slow. The "backprop" function of the class matches gradient checking (in fact it is 1/2 the size, but I think that is a problem with the gradient checking, so we will leave that be, since it should not matter proportionally and I have tried adding in divisions to fix it). With large numbers of epochs I get a much better accuracy, but there still seems to be a problem, because when I previously programmed a slightly different style of simple 3-layer neural network, as part of a book, on the same dataset csvs, I got much better training results. Below are some plots and data for large numbers of epochs.
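
First, though, on the 1/2-size discrepancy just mentioned: with the squared-error cost in costFunc (sum of (h − y)² divided by m), the derivative with respect to h carries a factor of 2, while the delta terms in backprop use (h − y) without it, which would explain the analytic gradient coming out at half the numerical one. A tiny scalar check (purely illustrative, not part of the original code):

    h, y, eps = 0.7, 1.0, 1e-6

    def cost(hh):
        # per-example squared error, same form as costFunc (no 1/2 factor)
        return (hh - y) ** 2

    numeric = (cost(h + eps) - cost(h - eps)) / (2 * eps)
    print(numeric)        # ~ -0.6, i.e. 2*(h - y)
    print(2 * (h - y))    # analytic derivative; the backprop delta uses only (h - y)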

[training curve plots: error vs. iterations and gradient vs. iterations for the large-epoch run]

Looks nice, but we still have a really poor test-set accuracy, and this is for 2,500 runs through the dataset, which should be getting a good result with far fewer!

Test accuracy: 61.150000000000006 
    Targets 6 9 8 2 2 2 4 3 8 
    Hypothesis 6 9 8 4 7 1 4 3 8 
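
To see whether the remaining errors cluster on particular digits, a per-class breakdown of the test predictions can help. A rough sketch, reusing h (predictions) and targets (true labels) from the test cell in the code dump:

    import numpy as np

    # Hypothetical per-digit accuracy breakdown on the test set
    for digit in range(10):
        mask = (targets == digit)
        if mask.sum() > 0:
            print("digit", digit, "accuracy:", np.mean(h[mask] == digit) * 100, "%")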

Edit 3: What dataset?

http://makeyourownneuralnetwork.blogspot.co.uk/2015/03/the-mnist-dataset-of-handwitten-digits.html?m=1

Using train.csv and test.csv to try with more data was no better, just slower, so I have been using the subsets mnist_train_100 and mnist_test_10 while I debug.

Edit 4

It seems to learn something after a very large number of epochs (e.g. 14,000): since the whole dataset is used in the backprop function (not backpropIter), each loop is effectively one epoch, and with a crazy number of epochs on the subset of 100 training and 10 test samples, the test accuracy is quite good. However, with such a small sample this could easily just be down to chance, and even then it is only 70% on the small dataset, which is not what you would be aiming for. But it does show that it seems to be learning; I am trying parameters very extensively to rule that out.


Try using a smaller learning rate or a higher regularisation parameter – BlackBear


Thanks for the suggestion, I've updated the question to show the plots for different learning rates! Unfortunately it didn't help much. – olliejday


Without trying to work through the whole of someone else's code, this looks like you are trying to map values onto themselves (or something similar), and lowering the learning rate just slows down the inevitability of it becoming quite good at predicting 'x == x'. Is there anywhere you might be accidentally feeding the output in as an input feature? – roganjosh

Answer


Solved

I solved my neural network. A brief description follows in case it helps anyone else. Thanks to everyone who helped with suggestions. Basically, I had implemented it with a fully matrix approach, i.e. backpropagation uses all examples at once each time. I later tried implementing it as a vector approach, i.e. backpropagation on each example in turn. That is when I realised that the matrix approach does not update the parameters after each example, so running this way is not the same as running each example in turn; effectively, the whole training set is backpropagated as if it were one example. Hence, my matrix implementation does work, but only after many iterations, and so it ends up taking longer than the per-example approach! I have asked a new question to learn more about this specific part, but there we go: it just needed either a lot of iterations with the matrix approach, or a more incremental example-by-example approach.
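
In code terms, the difference described above looks roughly like this (a sketch only, reusing the nn, inputs and targets names from the question; n_epochs is a placeholder):

    n_epochs = 100

    # Full-batch: the whole training set produces a single parameter update per pass
    for epoch in range(n_epochs):
        nn.gradDesc(inputs, targets)

    # Per-example (stochastic) style: one update per training example, many updates per pass
    for epoch in range(n_epochs):
        for k in range(inputs.shape[0]):
            nn.gradDesc(inputs[k:k+1, :], targets[k:k+1, :])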