A diagram of a FFNN is shown in Figure 2. Neural networks are often ``hyped'' in the popular press as ``artificial brains'' of sorts; in this treatment, we restrict ourselves to the more modest function-approximation perspective.
Figure 2: General schematic of a feed-forward neural network
Each of the circles in the figure is called a unit (or neuron) and contains an input-output transfer function. Neural network researchers have chosen from a variety of transfer functions; among the more popular ones are the logistic-sigmoid and the tangent-sigmoid functions. In this project we use the latter, which, mathematically, is:

\[ f(n) = \frac{e^{n} - e^{-n}}{e^{n} + e^{-n}} \]
where both the input, n, and the output, f(n), are scalar-valued. The three column-arrangements of the units are called layers. From left to right in Figure 2, they are labeled the input, hidden, and output layers. Only the hidden and output units perform the nonlinear tansig mapping; the input units have an identity transfer function. In our exposition we focus on 3-layer networks because our system was one, but any number of layers can be used.
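As a small illustration of the transfer function, the sketch below assumes NumPy and uses the fact that the tangent-sigmoid is mathematically the hyperbolic tangent. The function names `tansig` and `tansig_explicit` are chosen here for illustration and are not taken from the project's code.

```python
import numpy as np

def tansig(n):
    """Tangent-sigmoid transfer function, computed with NumPy's tanh."""
    return np.tanh(n)

def tansig_explicit(n):
    """The same function written out with exponentials.

    Suitable for moderate |n| only; np.tanh is safer for large inputs.
    """
    return (np.exp(n) - np.exp(-n)) / (np.exp(n) + np.exp(-n))

# Works on scalars and, element by element, on arrays of summed inputs.
n = np.linspace(-3.0, 3.0, 7)
print(np.allclose(tansig(n), tansig_explicit(n)))  # True
```

The output is confined to the open interval (-1, 1), so any desired outputs presented to a tansig output unit need to lie in that range.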
Layers are connected to each other by a system of weights, which multiplicatively scale the values traversing the links. In the diagram, we observe that there are two sets of weights: one connecting the input layer to the hidden layer, and the other connecting the hidden layer to the output layer. The values from the weights converging on a given unit are added to form n, the unit's input.
If we think of the inputs as a vector p of dimension R, then the set of weighted sums that form the input to the hidden layer containing S units is:

\[ \mathbf{n}^1 = \mathbf{W}^1 \mathbf{p} + \mathbf{b}^1 \]

where $\mathbf{W}^1$ is an SxR matrix, the $i^{th}$ row of which forms the set of weights by which the $i^{th}$ hidden unit scales each element of p. The elements of the Sx1 vector $\mathbf{b}^1$ are called the bias values of each unit; they add a degree of freedom to the inputs by allowing them to be ``offset.'' (Please note that the superscripts do not denote exponents!)
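The output of the hidden layer will be a, an Sx1 vector determined by applying the transfer function f to each element of the summed inputs:

\[ \mathbf{a} = f(\mathbf{W}^1 \mathbf{p} + \mathbf{b}^1) \]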
The outputs of the hidden layer are then fed forward (the process which gives the networks their name) to the output layer, whose outputs are computed as were those of the hidden layer. Specifically, let $\mathbf{W}^2$ be an OxS matrix, the $i^{th}$ row of which forms the set of weights by which the $i^{th}$ output unit scales each element of a, and let $\mathbf{b}^2$ be the Ox1 vector of output unit biases. The output vector o (of dimension O) is thus found by:

\[ \mathbf{o} = f(\mathbf{W}^2 \mathbf{a} + \mathbf{b}^2) \]
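The forward computation described here can be sketched in a few lines. This is a minimal illustration rather than the project's implementation: it assumes NumPy, assumes each weight matrix stores one row per receiving unit (so the hidden-layer matrix has S rows and R columns), and uses illustrative names (`forward`, `W1`, `b1`, `W2`, `b2`) and arbitrary sizes.

```python
import numpy as np

def forward(p, W1, b1, W2, b2):
    """Forward pass of a 3-layer (input, hidden, output) tansig network.

    Assumed shapes: p (R,), W1 (S, R), b1 (S,), W2 (O, S), b2 (O,).
    Returns the hidden-layer output a and the network output o.
    """
    n1 = W1 @ p + b1   # summed inputs to the hidden units
    a = np.tanh(n1)    # hidden-layer output
    n2 = W2 @ a + b2   # summed inputs to the output units
    o = np.tanh(n2)    # network output
    return a, o

# Example with arbitrary sizes and small random parameters.
rng = np.random.default_rng(0)
R, S, O = 4, 3, 2
W1, b1 = rng.uniform(-0.5, 0.5, (S, R)), rng.uniform(-0.5, 0.5, S)
W2, b2 = rng.uniform(-0.5, 0.5, (O, S)), rng.uniform(-0.5, 0.5, O)

a, o = forward(rng.standard_normal(R), W1, b1, W2, b2)
print(a.shape, o.shape)  # (3,) (2,)
```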
The overall system essentially performs a nonlinear mapping from an input space to an output space with dimensions R and O respectively. The weights and biases are the adaptive parameters of the system; modifying them appropriately constitutes the network's learning to approximate a desired function with the same domain and range. We now discuss this procedure.
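Suppose we have a set of Q inputs and desired outputs:

\[ \{\mathbf{p}_1, \mathbf{t}_1\}, \{\mathbf{p}_2, \mathbf{t}_2\}, \ldots, \{\mathbf{p}_Q, \mathbf{t}_Q\} \]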
We want to choose the network parameters so as to minimize the mean-squared error (MSE) between the actual network output o and the desired output t:

\[ F = E\left[ (\mathbf{t} - \mathbf{o})^T (\mathbf{t} - \mathbf{o}) \right] \]
Note that since the output o is a function of the weights, the MSE is a nonlinear surface with the weights as parameters. Training the network strives to find the minimum value of this MSE.
The MSE F is approximated by iteratively presenting the inputs, finding o, and at each iteration, computing:

\[ \hat{F}(k) = (\mathbf{t}(k) - \mathbf{o}(k))^T (\mathbf{t}(k) - \mathbf{o}(k)) \]

where t(k), o(k) are the desired and actual network outputs at the $k^{th}$ iteration.
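To make the error measure concrete, the sketch below computes the squared error for a single presentation and its average over a set of presentations. It assumes NumPy; the helper names `squared_error` and `mean_squared_error` are illustrative and not from the original project.

```python
import numpy as np

def squared_error(t, o):
    """Squared error for one presentation: (t - o)^T (t - o)."""
    e = np.asarray(t, dtype=float) - np.asarray(o, dtype=float)
    return float(e @ e)

def mean_squared_error(targets, outputs):
    """Average of the per-presentation squared errors."""
    return float(np.mean([squared_error(t, o) for t, o in zip(targets, outputs)]))

# One presentation with a two-dimensional output.
print(squared_error([1.0, 0.0], [0.5, 0.5]))  # 0.5
```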
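The learning rule is the method used to update the weights after each training iteration. The method here is the so-called gradient-descent algorithm, which takes its ``steps'' on the error surface in the direction in weight space that most rapidly decreases the error. In equation form:

\[ w^m_{i,j}(k+1) = w^m_{i,j}(k) - \alpha \frac{\partial \hat{F}}{\partial w^m_{i,j}} \]

\[ b^m_i(k+1) = b^m_i(k) - \alpha \frac{\partial \hat{F}}{\partial b^m_i} \]

where $\hat{F}$ is the approximate MSE computed at iteration k, and $w^m_{i,j}$ and $b^m_i$ are the elements of the $m^{th}$ layer weight matrix and bias vector respectively. $\alpha$ is a parameter known as the learning rate; this controls the size of the weight adaptation increments. The challenge of computing the partial derivatives in the above equations is overcome with an algorithm known as backpropagation. Here, we simply restate the well-known procedure; for a full treatment, please see [5], which was the work that revived interest in neural networks in the 1980s. The notation for representing this method was obtained from [3]. Training the network, or calculating the updates, proceeds as follows for each training input p with corresponding desired output t (with M denoting the number of weight layers, so that M = 2 here, $\mathbf{a}^1$ is the hidden-layer output a, and $\mathbf{a}^M$ is the network output o):

\[ \mathbf{a}^0 = \mathbf{p} \]

\[ \mathbf{a}^{m+1} = f^{m+1}(\mathbf{W}^{m+1} \mathbf{a}^m + \mathbf{b}^{m+1}), \quad m = 0, 1, \ldots, M-1 \]

\[ \mathbf{s}^M = -2 \dot{\mathbf{F}}^M(\mathbf{n}^M) (\mathbf{t} - \mathbf{a}^M) \]

\[ \mathbf{s}^m = \dot{\mathbf{F}}^m(\mathbf{n}^m) (\mathbf{W}^{m+1})^T \mathbf{s}^{m+1}, \quad m = M-1, \ldots, 1 \]

\[ \mathbf{W}^m(k+1) = \mathbf{W}^m(k) - \alpha \mathbf{s}^m (\mathbf{a}^{m-1})^T \]

\[ \mathbf{b}^m(k+1) = \mathbf{b}^m(k) - \alpha \mathbf{s}^m \]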
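where the sensitivity matrix for each layer m, which has $S^m$ units, is:

\[ \dot{\mathbf{F}}^m(\mathbf{n}^m) = \mathrm{diag}\left( \dot{f}^m(n^m_1), \dot{f}^m(n^m_2), \ldots, \dot{f}^m(n^m_{S^m}) \right) \]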
i.e. for each layer, it is a diagonal matrix of the partial derivatives of each unit's output with respect to its summed input n. Note that the calculation must start with the last layer; successive calculations use the immediately previous results. For this reason the procedure is called backpropagation.
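The backward pass for the 3-layer case can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes NumPy, tansig units in both the hidden and output layers (so each unit's derivative is one minus its squared output), weight matrices with one row per receiving unit, and a squared-error measure for a single presentation. The names `sample_error` and `backprop_step` are hypothetical. Because the derivative matrix is diagonal, multiplying by it reduces to an element-wise product.

```python
import numpy as np

def sample_error(p, t, W1, b1, W2, b2):
    """Squared error (t - o)^T (t - o) for one input/target pair."""
    o = np.tanh(W2 @ np.tanh(W1 @ p + b1) + b2)
    return float((t - o) @ (t - o))

def backprop_step(p, t, W1, b1, W2, b2, alpha=0.1):
    """One steepest-descent update for a single training pair (p, t)."""
    # Forward pass.
    a = np.tanh(W1 @ p + b1)   # hidden-layer output
    o = np.tanh(W2 @ a + b2)   # network output

    # Backward pass, starting with the last layer.
    # d tanh(n)/dn = 1 - tanh(n)^2, applied element-wise.
    s2 = -2.0 * (1.0 - o ** 2) * (t - o)
    s1 = (1.0 - a ** 2) * (W2.T @ s2)

    # Gradient-descent updates with learning rate alpha.
    W2_new = W2 - alpha * np.outer(s2, a)
    b2_new = b2 - alpha * s2
    W1_new = W1 - alpha * np.outer(s1, p)
    b1_new = b1 - alpha * s1
    return W1_new, b1_new, W2_new, b2_new
```

A useful sanity check on any implementation of this kind is to compare the computed gradient with a finite-difference estimate of the error's slope for a few individual weights.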
The training continues until a stopping criterion has been satisfied (for example, a sufficiently low training MSE). One presentation of the whole training set is known as an epoch.
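A complete training loop, with epochs and an MSE stopping criterion, might look like the sketch below. Everything specific in it is an assumption made for illustration: the function name `train`, the hidden-layer size, the learning rate, the error goal, the epoch limit, the uniform random initialization, and the toy target function. It updates the weights after every presentation and assumes tansig hidden and output units.

```python
import numpy as np

def train(P, T, S=5, alpha=0.05, mse_goal=1e-3, max_epochs=500, seed=0):
    """Train a 3-layer tansig network by steepest-descent backpropagation.

    P: (Q, R) array of inputs; T: (Q, O) array of desired outputs.
    Returns the parameters and the training MSE recorded for each epoch.
    """
    rng = np.random.default_rng(seed)
    R, O = P.shape[1], T.shape[1]
    W1, b1 = rng.uniform(-0.5, 0.5, (S, R)), rng.uniform(-0.5, 0.5, S)
    W2, b2 = rng.uniform(-0.5, 0.5, (O, S)), rng.uniform(-0.5, 0.5, O)

    history = []
    for epoch in range(max_epochs):
        total = 0.0
        for p, t in zip(P, T):                 # one pass over the set = one epoch
            a = np.tanh(W1 @ p + b1)           # forward pass
            o = np.tanh(W2 @ a + b2)
            e = t - o
            total += float(e @ e)
            s2 = -2.0 * (1.0 - o ** 2) * e     # backward pass, last layer first
            s1 = (1.0 - a ** 2) * (W2.T @ s2)
            W2 -= alpha * np.outer(s2, a)      # updates after each presentation
            b2 -= alpha * s2
            W1 -= alpha * np.outer(s1, p)
            b1 -= alpha * s1
        # Average of the errors seen during this epoch (weights change within it).
        history.append(total / len(P))
        if history[-1] <= mse_goal:            # stopping criterion
            break
    return (W1, b1, W2, b2), history

# Toy problem: approximate t = 0.5 * sin(p) on [-1, 1].
P = np.linspace(-1.0, 1.0, 21).reshape(-1, 1)
T = 0.5 * np.sin(P)
params, history = train(P, T)
print(len(history), history[0], history[-1])
```

The loop stops either when the epoch's training MSE falls below the goal or when the epoch limit is reached, whichever comes first; a held-out set could be monitored in the same way to watch generalization.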
A few general comments before describing the specific architecture of our system: we note that a larger number of neurons allows the system to approximate functions of greater complexity. The caveat, however, is that while more units (and, consequently, more parameters) can fit the training data better, the system will likely behave more ``whimsically'' in between the training points as a result of its greater degrees of freedom. The upshot is that unseen (test) data will be fit poorly to the desired function. In neural net jargon, this is called poor generalization. On the other hand, with too few parameters relative to the training data distribution, even the training set will not be adequately matched.
The initial conditions of the network also have a strong influence on the training. The steepest-descent algorithm is relatively slow and has a tendency to get trapped in local minima of the error surface (where the gradient is zero). In our experiments, we implemented both the steepest-descent backprop described here and a more sophisticated method called conjugate-gradient backpropagation. For simplicity, we omit its description here; interested readers may refer to [3].