I have used the framework provided by Daniel Nouri on his eponymous website. Here is the code I used. It looks fine; the only changes I made were to set output_nonlinearity=lasagne.nonlinearities.softmax and regression to False. Otherwise it is pretty straightforward.
from lasagne import layers
import theano
from lasagne.updates import sgd, nesterov_momentum
from nolearn.lasagne import NeuralNet
from sklearn.metrics import classification_report
import lasagne
import cv2
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_mldata
import sys

# cast helper from Daniel Nouri's tutorial; the float32() calls in the
# NeuralNet definition below raise a NameError without it
def float32(k):
    return np.cast['float32'](k)

# MNIST: 70,000 28x28 grayscale digits, labels 0-9
mnist = fetch_mldata('MNIST original')
X = np.asarray(mnist.data, dtype='float32')
y = np.asarray(mnist.target, dtype='int32')
(trainX, testX, trainY, testY) = train_test_split(X, y, test_size=0.3, random_state=42)
# reshape the flat 784-pixel rows to (batch, channel, height, width) for Conv2DLayer
trainX = trainX.reshape(-1, 1, 28, 28)
testX = testX.reshape(-1, 1, 28, 28)
clf = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('pool1', layers.MaxPool2DLayer),
        ('dropout1', layers.DropoutLayer),
        ('conv2', layers.Conv2DLayer),
        ('pool2', layers.MaxPool2DLayer),
        ('dropout2', layers.DropoutLayer),
        ('hidden4', layers.DenseLayer),
        ('dropout4', layers.DropoutLayer),
        ('hidden5', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],
    input_shape=(None, 1, 28, 28),
    conv1_num_filters=20, conv1_filter_size=(3, 3), pool1_pool_size=(2, 2),
    dropout1_p=0.1,
    conv2_num_filters=50, conv2_filter_size=(3, 3), pool2_pool_size=(2, 2),
    dropout2_p=0.2,
    hidden4_num_units=500,
    dropout4_p=0.5,
    hidden5_num_units=500,
    output_num_units=10,  # one unit per digit class
    output_nonlinearity=lasagne.nonlinearities.softmax,
    update=nesterov_momentum,
    update_learning_rate=theano.shared(float32(0.03)),
    update_momentum=theano.shared(float32(0.9)),
    regression=False,
    max_epochs=3000,
    verbose=1,
)
clf.fit(trainX, trainY)
However, on running it I get this output, with nan losses:
input               (None, 1, 28, 28)       produces     784 outputs
conv1               (None, 20, 26, 26)      produces   13520 outputs
pool1               (None, 20, 13, 13)      produces    3380 outputs
dropout1            (None, 20, 13, 13)      produces    3380 outputs
conv2               (None, 50, 11, 11)      produces    6050 outputs
pool2               (None, 50, 6, 6)        produces    1800 outputs
dropout2            (None, 50, 6, 6)        produces    1800 outputs
hidden4             (None, 500)             produces     500 outputs
dropout4            (None, 500)             produces     500 outputs
hidden5             (None, 500)             produces     500 outputs
output              (None, 10)              produces      10 outputs
epoch    train loss    valid loss    train/val    valid acc  dur
-------  ------------  ------------  -----------  -----------  ------
  1           nan           nan          nan      0.09923  16.18s
  2           nan           nan          nan      0.09923  16.45s
Thanks in advance.
                        
I'm very late to the game, but hopefully someone finds this answer useful!
In my experience, there could be a number of things going wrong here. I'll write out my steps for debugging this kind of problem in nolearn/lasagne:
1. Using Theano's fast_compile optimizer can lead to underflow issues, which result in the nan output (this was the ultimate problem in my case). A sketch of switching optimizers follows this list.
2. When the output starts with nan values, or if nan values start appearing soon after training starts, the learning rate may be too high. If it is 0.01, try and make it 0.001.
3. The input or output values may be too close to one another, and you may want to try scaling them. A standard approach is to scale the input by subtracting the mean and dividing by the standard deviation (sketched below).
4. Make sure you are using regression=True when using nolearn with a regression problem.
5. Try using a linear output instead of softmax. Other nonlinearities sometimes also help, but in my experience not often (see the one-line change below).
6. If all this fails, try and isolate whether the issue is with your network or with your data. If you feed in random values within the expected range and still get nan output, it's probably not specific to the dataset you are training on (a quick sanity check of this kind is sketched below).

Hope that helps!
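To make point 1 concrete, here is a minimal sketch of forcing Theano onto a different optimizer. Some config options are only read at first compilation, so setting the THEANO_FLAGS environment variable before Python starts is the safer route; treat the exact values as assumptions to check against your Theano version:

import theano

# safer: set in the environment before launching, e.g.
#   THEANO_FLAGS='optimizer=fast_run' python train.py
# at runtime, this must happen before the first function is compiled:
theano.config.optimizer = 'fast_run'  # rather than 'fast_compile'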
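For points 2 and 3, a minimal sketch against the question's own variables (the raw MNIST pixels here are in [0, 255], so standardizing changes the scale dramatically; the concrete learning rate is just an assumption to experiment with):

# standardize with statistics from the training set only
mean = trainX.mean()
std = trainX.std()
trainX = (trainX - mean) / std
testX = (testX - mean) / std

# and, if nan still shows up early, rebuild the net with a smaller rate:
# update_learning_rate=theano.shared(float32(0.003))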
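Point 5 is a one-line change in the NeuralNet definition (most useful together with regression=True; lasagne.nonlinearities.linear is the identity function):

output_nonlinearity=lasagne.nonlinearities.linear,  # instead of softmax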
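And for point 6, a quick way to separate network problems from data problems: train for a couple of epochs on random inputs that span the same range and shape as the real data. randX and randY are hypothetical stand-ins, and I'm assuming max_epochs can be reassigned on the existing object, since nolearn stores constructor arguments as attributes:

# random pixels in the same range/shape as the unscaled MNIST input
randX = np.random.uniform(0, 255, size=trainX.shape).astype('float32')
randY = np.random.randint(0, 10, size=trainY.shape).astype('int32')

clf.max_epochs = 2   # a couple of epochs is enough to see nan appear
clf.fit(randX, randY)
# nan here too? then the problem is the network/config, not the dataset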