Training networks

Strada makes it easy to implement your own training mechanisms for neural networks in Julia. This is convenient if you want to plug in your own optimization procedures, for example when working on reinforcement learning. You can tap into the training mechanism on two levels: by manually loading data into the network, calling its forward and backward methods and reading the gradient blobs out, or by using a slightly higher-level interface that should be familiar if you have used a numerical optimization package before. We describe the latter approach here.

As an example, let us consider how to train a convolutional network that can recognize MNIST digits. First, let us define the model:

using Strada

batchsize = 64

layers = [
    MemoryLayer("data"; shape=(batchsize, 1, 28, 28)),
    MemoryLayer("label"; shape=(batchsize, 1)),
    ConvLayer("conv1", ["data"]; kernel=(5,5), n_filter=20),
    PoolLayer("pool1", ["conv1"]; kernel=(2,2), stride=(2,2)),
    ConvLayer("conv2", ["pool1"]; kernel=(5,5), n_filter=50),
    PoolLayer("pool2", ["conv2"]; kernel=(2,2), stride=(2,2)),
    LinearLayer("ip1", ["pool2"]; n_filter=500),
    ActivationLayer("relu1", ["ip1"]; activation=ReLU),
    LinearLayer("ip2", ["relu1"]; n_filter=10),
    SoftmaxWithLoss("loss", ["ip2", "label"])
]

net = Net("LeNet", layers; log_level=3);

Creating objective functions and predictors

You can now create an objective function that will be optimized by calling

(objective, theta) = make_objective(net, Float32)

Here, Float32 is the floating-point type used by the network, theta is a flat vector containing the initial parameters, and objective is a Julia function with signature

function objective(data::Data{F,N}, theta::Vector{F}; grad::Vector{F}=zeros(F, 0))
    # If length(grad) != 0, store the gradient of the loss function in grad.
    # The caller needs to guarantee that length(grad) == length(theta)
    # In any case, return the loss of the network computed on the minibatch data
end

Data{F,N} is the datatype representing a minibatch (see its documentation in the API). Here, N is the number of data layers in the network, and Data{F,N} is an N-tuple whose components are the arrays fed into the corresponding data layers of the network. In the case of MNIST, N = 2, so Data{F,N} is of type Tuple{Array{Float32, 4}, Array{Float32, 2}}. The first array in the tuple holds the images and the second one the labels.
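
To make the calling convention concrete, here is a minimal sketch (not taken verbatim from the API) that evaluates the loss and gradient on a single minibatch. It assumes that minibatch is one such Data{Float32,2} tuple, for example one element of the MinibatchStream constructed below; the step size 0.01f0 is purely illustrative.

grad = zeros(Float32, length(theta))
loss = objective(minibatch, theta; grad=grad)  # loss on this minibatch; grad now holds dloss/dtheta
# a plain gradient step would then be, e.g.:
# theta -= 0.01f0 * grad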

Now we create a function that can compute a prediction on a new digit (once the network has been trained):

predictor = make_predictor(net, Float32, "ip2")

Here "ip2" is the name of the last layer before the softmax. The predictor has signature

function predictor(data::Data{F,N}, theta::Vector{F}; result::Matrix{Int}=zeros(Int, 0, 0))
    # Store the predicted label of the n-th example from minibatch data in result[n, 1]
end

The result is a matrix here, because we also support predicting sequences.
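
As a small sketch of how the predictor might be used (assuming, as above, that minibatch is a Data{Float32,2} tuple whose second component holds the labels, that every minibatch contains batchsize examples, and that predictions and labels use the same encoding):

result = zeros(Int, batchsize, 1)             # one row per example in the minibatch
predictor(minibatch, theta; result=result)    # result[n, 1] now holds the predicted label
accuracy = mean(result[:, 1] .== vec(minibatch[2]))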

Loading the data

We can now load the dataset. Let us assume we have a function load_mnist that returns arrays of shape (1, 28, 28, 50000) and (1, 50000) for the training set. Using the minibatch_stream constructor, this data can then be wrapped in a MinibatchStream, a collection of minibatch-sized Data{F,N} tuples that can be iterated over.

(Xtrain, ytrain) = load_mnist(directory; data_set=:train)
(Xtest, ytest) = load_mnist(directory; data_set=:test)

data = minibatch_stream(Xtrain, ytrain; batchsize=batchsize)
testset = minibatch_stream(Xtest, ytest; batchsize=batchsize)
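
Since a MinibatchStream is an iterable collection of Data{F,N} tuples, you can also loop over it directly. A small sketch, using the objective function defined above:

for minibatch in data
    # minibatch[1] holds the images, minibatch[2] the labels
    println(objective(minibatch, theta))      # loss of the (still untrained) network
end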

Training the network

For this example, we train the model using SGD:

sgd(objective, data, theta; predictor=predictor, testset=testset,
    lr_schedule=InvLR(0.01, 0.0001, 0.75, 0.9), epochs=5, verbose=true)
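
Once training has finished, the learned parameters can be evaluated on the test set. The following sketch assumes that sgd updates theta in place and that every test minibatch contains exactly batchsize examples:

result = zeros(Int, batchsize, 1)
accuracies = Float64[]
for minibatch in testset
    predictor(minibatch, theta; result=result)
    push!(accuracies, mean(result[:, 1] .== vec(minibatch[2])))
end
println("test accuracy: ", mean(accuracies))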