
Neural Network Principles

A diagram of a FFNN is shown in Figure 2. Neural networks are often ``hyped'' in the popular press as ``artificial brains'' of sorts; in this treatment, we restrict ourselves to the more modest function-approximation perspective.

Figure 2: General schematic of a feed-forward neural network

Each of the circles in the figure is called a unit (or neuron) and contains an input-output transfer function. Neural network researchers have chosen from a variety of transfer functions; among the more popular ones are the logistic-sigmoid and the tangent-sigmoid functions. In this project we use the latter, which, mathematically, is:

\[ f(n) = \frac{e^{n} - e^{-n}}{e^{n} + e^{-n}} \]

where both the input, n, and the output, f(n), are scalar-valued. The three column-arrangements of the units are called layers. From left to right in Figure 2, they are labeled the input, hidden, and output layers. Only the hidden and output units perform the nonlinear tansig mapping; the input units have an identity transfer function. In our exposition we focus on 3-layer networks because our system was one, but any number of layers can be used.
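As a minimal sketch, the tangent-sigmoid transfer function can be written in a few lines of Python. The function names `tansig` and `tansig_deriv` are our own illustrative choices; the derivative is included only because gradient-based training needs it, and it follows from the standard identity d tanh(n)/dn = 1 - tanh(n)^2.

```python
import math


def tansig(n):
    """Tangent-sigmoid transfer function: (e^n - e^-n) / (e^n + e^-n)."""
    # math.tanh computes exactly this ratio, without overflow for large |n|.
    return math.tanh(n)


def tansig_deriv(n):
    """Derivative of tansig with respect to its scalar input n."""
    return 1.0 - math.tanh(n) ** 2


# The output is squashed into the interval [-1, 1].
for n in (-2.0, 0.0, 2.0):
    print(n, tansig(n), tansig_deriv(n))
```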

Layers are connected to each other by a system of weights, which multiplicatively scale the values traversing the links. In the diagram, we observe that there are 2 sets of weights: one connecting the input to the hidden layer, and the other from the hidden to the output layer. The weighted values converging on a given unit are summed to form n, that unit's input.

If we think of the inputs as a vector p of dimension R, then the set of weighted sums that form the input to the hidden layer containing S units is:

\[ \mathbf{n}^1 = \mathbf{W}^1 \mathbf{p} + \mathbf{b}^1 \]

where $\mathbf{W}^1$ is an SxR matrix, the $i^{th}$ row of which forms the set of weights that the $i^{th}$ hidden unit scales each element of p by. The elements of the Sx1 vector $\mathbf{b}^1$ are called the bias values of each unit; they add a degree of freedom to the inputs by allowing them to be ``offset.'' (Please note that the superscripts do not denote exponents!)

The output of the hidden layer will be a, an Sx1 vector determined by:

\[ \mathbf{a} = f(\mathbf{W}^1 \mathbf{p} + \mathbf{b}^1) \]

The outputs of the hidden layer are then fed forward (the process which gives the networks their name) to the output layer, whose outputs are computed as were those of the hidden layer. Specifically, let $\mathbf{W}^2$ be an OxS matrix, the $i^{th}$ row of which forms the set of weights that the $i^{th}$ output unit scales each element of a by, and $\mathbf{b}^2$ be the Ox1 vector of output unit biases. The output vector o (of dimension O) is thus found by:

\[ \begin{aligned}
\mathbf{o} &= f(\mathbf{W}^2 \mathbf{a} + \mathbf{b}^2) \\
           &= f(\mathbf{W}^2 f(\mathbf{W}^1 \mathbf{p} + \mathbf{b}^1) + \mathbf{b}^2)
\end{aligned} \]
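The forward computation just described can be sketched in plain Python. The helper names `affine` and `forward`, and all numeric parameter values, are made up for illustration and are not taken from our system; matrices are stored as lists of rows, so the first-layer weights have S rows of R entries each.

```python
import math


def affine(W, x, b):
    """Return W x + b for a matrix W (list of rows), vector x and bias b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def forward(p, W1, b1, W2, b2):
    """Forward pass of a 3-layer network with tansig hidden and output units."""
    a = [math.tanh(n) for n in affine(W1, p, b1)]  # hidden outputs, S x 1
    o = [math.tanh(n) for n in affine(W2, a, b2)]  # network outputs, O x 1
    return a, o


# Illustrative (made-up) parameters: R = 2 inputs, S = 3 hidden units, O = 1 output.
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # S x R
b1 = [0.0, 0.1, -0.1]                        # S x 1
W2 = [[0.3, -0.1, 0.2]]                      # O x S
b2 = [0.05]                                  # O x 1

a, o = forward([1.0, 0.5], W1, b1, W2, b2)
print(len(a), len(o))
```

Note that the input units do no work here: the input vector is passed unchanged into the first weighted sum, which is what an identity transfer function amounts to.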

The overall system essentially performs a nonlinear mapping from an input space to an output space with dimensions R and O respectively. The weights and biases are the adaptive parameters of the system; modifying them appropriately constitutes the network's learning to approximate a desired function with the same domain and range. We now discuss this procedure.

Suppose we have a set of inputs and desired outputs:

\[ \{\mathbf{p}_1, \mathbf{t}_1\}, \{\mathbf{p}_2, \mathbf{t}_2\}, \ldots, \{\mathbf{p}_Q, \mathbf{t}_Q\} \]

We want to choose the network parameters so as to minimize the mean-squared error between the actual network output o and the desired output t:

\[ F = E\left[ (\mathbf{t} - \mathbf{o})^T (\mathbf{t} - \mathbf{o}) \right] \]

Note that since the output o is a function of the weights and biases, the MSE defines a nonlinear surface over the space of these parameters. Training the network strives to find the minimum value of this MSE.

The MSE F is approximated by iteratively presenting the inputs, finding o, and at each iteration, computing:

\[ \hat{F}(k) = (\mathbf{t}(k) - \mathbf{o}(k))^T (\mathbf{t}(k) - \mathbf{o}(k)) \]

where t(k), o(k) are the desired and actual network outputs at the $k^{th}$ iteration.
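As a small illustration, the single-pattern squared error and its average over a set of patterns could be computed as follows. The function names are our own, and averaging over a finite set is only a sample estimate of the expectation in the MSE.

```python
def squared_error(t, o):
    """Single-pattern estimate of the MSE: (t - o)^T (t - o)."""
    return sum((t_i - o_i) ** 2 for t_i, o_i in zip(t, o))


def mean_squared_error(targets, outputs):
    """Average of the single-pattern estimates over a set of patterns."""
    errors = [squared_error(t, o) for t, o in zip(targets, outputs)]
    return sum(errors) / len(errors)


print(squared_error([1.0, 0.0], [0.5, 0.5]))
print(mean_squared_error([[1.0], [0.0]], [[0.0], [0.0]]))
```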

The learning rule is the method used to update the weights after each training iteration. The method used here is the so-called gradient-descent algorithm, which takes its ``steps'' on the error surface in the direction in weight space that most rapidly decreases the error. In equation form:

\[ \begin{aligned}
w^m_{i,j}(k+1) &= w^m_{i,j}(k) - \alpha \frac{\partial \hat{F}}{\partial w^m_{i,j}} \\
b^m_i(k+1) &= b^m_i(k) - \alpha \frac{\partial \hat{F}}{\partial b^m_i}
\end{aligned} \]

where $w^m_{i,j}$ and $b^m_i$ are the elements of the $m^{th}$ layer weight matrix and bias vector respectively. $\alpha$ is a parameter known as the learning rate; this controls the size of the weight adaptation increments. The challenge of computing the partial derivatives in the above equations is overcome with an algorithm known as backpropagation. Here, we simply restate the well-known procedure; for a full treatment, please see [5], which was the work that revived interest in neural networks in the 1980s. The notation for representing this method was obtained from [3]. Training the network, or calculating the updates, proceeds as follows for each training input p with corresponding desired output t:

  1. Calculate the output of each layer using the known weights and transfer functions of the units. The network output is, obviously, the output of the last layer.
  2. For each layer, the unit sensitivities to changes in the summed inputs n are calculated as follows:

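    \[ \begin{aligned}
    \mathbf{s}^M &= -2 \dot{\mathbf{F}}^M(\mathbf{n}^M) (\mathbf{t} - \mathbf{o}) \\
    \mathbf{s}^m &= \dot{\mathbf{F}}^m(\mathbf{n}^m) (\mathbf{W}^{m+1})^T \mathbf{s}^{m+1}, \quad m = M-1, \ldots, 1
    \end{aligned} \]

    where $\mathbf{s}^m$ is the vector of sensitivities for layer $m$, $M$ denotes the output layer, and, with $S^m$ units in layer $m$, the derivative matrix for each layer is:

    \[ \dot{\mathbf{F}}^m(\mathbf{n}^m) = \mathrm{diag}\left( \dot{f}(n^m_1), \dot{f}(n^m_2), \ldots, \dot{f}(n^m_{S^m}) \right) \]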

    i.e. for each layer, it is a diagonal matrix of the partial derivatives of each unit with respect to its summed input n. Note that the calculation must start with the last layer; successive calculations use the immediately previous results. For this reason the procedure is called backpropagation.

  3. Calculate the weight updates using an equivalent version of the previous rule expressed as:

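    \[ \begin{aligned}
    \mathbf{W}^m(k+1) &= \mathbf{W}^m(k) - \alpha \mathbf{s}^m (\mathbf{a}^{m-1})^T \\
    \mathbf{b}^m(k+1) &= \mathbf{b}^m(k) - \alpha \mathbf{s}^m
    \end{aligned} \]

    where $\mathbf{a}^{m-1}$ is the output of the layer feeding layer $m$; for the first set of weights this is the input p itself.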

The training continues until a criterion has been satisfied (for example, a sufficiently low training MSE). One presentation of the whole training set is known as an epoch.
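The three steps above, together with the epoch loop and a stopping criterion, can be sketched for the 3-layer tansig case as follows. This is an illustrative sketch under our own assumptions rather than the code used in our experiments: the names `train_step` and `train`, the toy target 0.8 sin(p), the initial weight range, the learning rate and the error goal are all hypothetical choices.

```python
import math
import random


def train_step(p, t, W1, b1, W2, b2, alpha):
    """One steepest-descent backprop update for a single pair (p, t), in place.

    Returns the squared error measured before the update.
    """
    # Step 1: forward pass through the hidden and output layers.
    a = [math.tanh(sum(w * x for w, x in zip(row, p)) + b) for row, b in zip(W1, b1)]
    o = [math.tanh(sum(w * x for w, x in zip(row, a)) + b) for row, b in zip(W2, b2)]

    # Step 2: sensitivities, output layer first; d tanh(n)/dn = 1 - tanh(n)^2.
    s2 = [-2.0 * (1.0 - o_i ** 2) * (t_i - o_i) for t_i, o_i in zip(t, o)]
    s1 = [(1.0 - a_j ** 2) * sum(W2[i][j] * s2[i] for i in range(len(s2)))
          for j, a_j in enumerate(a)]

    # Step 3: W <- W - alpha * s * (feeding layer output)^T, b <- b - alpha * s.
    for i, s in enumerate(s2):
        for j, a_j in enumerate(a):
            W2[i][j] -= alpha * s * a_j
        b2[i] -= alpha * s
    for i, s in enumerate(s1):
        for j, p_j in enumerate(p):
            W1[i][j] -= alpha * s * p_j
        b1[i] -= alpha * s

    return sum((t_i - o_i) ** 2 for t_i, o_i in zip(t, o))


def train(data, S, alpha=0.05, max_epochs=500, goal=1e-3, seed=0):
    """Train until the per-epoch MSE drops below `goal` or epochs run out."""
    rng = random.Random(seed)  # fixed seed: the initial weights are reproducible
    R, O = len(data[0][0]), len(data[0][1])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(R)] for _ in range(S)]
    b1 = [rng.uniform(-0.5, 0.5) for _ in range(S)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(S)] for _ in range(O)]
    b2 = [rng.uniform(-0.5, 0.5) for _ in range(O)]
    history = []
    for _ in range(max_epochs):
        # One epoch: present every training pair once, updating after each.
        mse = sum(train_step(p, t, W1, b1, W2, b2, alpha) for p, t in data) / len(data)
        history.append(mse)
        if mse < goal:
            break
    return (W1, b1, W2, b2), history


# Toy data (an assumption for illustration): 13 samples of t = 0.8 sin(p), p in [-1.5, 1.5].
data = [([x / 4.0], [0.8 * math.sin(x / 4.0)]) for x in range(-6, 7)]
params, history = train(data, S=4)
print("epochs:", len(history), "first MSE:", history[0], "last MSE:", history[-1])
```

The sensitivities of the hidden layer are computed before any weight is changed, so that the backward pass uses the same weights as the forward pass.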

A few general comments before describing the specific architecture of our system: a larger number of neurons allows the system to approximate functions of greater complexity. The caveat, however, is that while more units (and, consequently, more parameters) can fit the training data better, the system will likely behave more ``whimsically'' in between the training points as a result of its greater degrees of freedom. The upshot is that unseen (test) data will be poorly fit by the learned function. In neural net jargon, this is called poor generalization. On the other hand, with too few parameters relative to the complexity of the training data, even the training set won't be adequately matched.

The initial conditions of the network also have a strong influence on the training. The steepest-descent algorithm is relatively slow and has a tendency to get trapped in local minima of the error surface (where the gradient is zero). In our experiments, we implemented both the steepest-descent backprop described here and a more sophisticated method called conjugate-gradient backpropagation. For simplicity, we omit its description here; interested readers are referred to [3].
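The dependence on initial conditions can be illustrated with a one-parameter toy problem. The quartic ``error surface'' below is a made-up example, not the error surface of our network, and the names `grad_descent`, `error` and `error_grad` are hypothetical; the point is only that steepest descent settles into whichever minimum lies downhill from its starting point.

```python
def grad_descent(grad, w0, alpha=0.01, steps=2000):
    """Plain steepest descent on a single parameter w."""
    w = w0
    for _ in range(steps):
        w -= alpha * grad(w)
    return w


def error(w):
    """Toy error surface with one global and one local minimum."""
    return w ** 4 - 3.0 * w ** 2 + w


def error_grad(w):
    """Derivative of the toy error surface."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0


# Same algorithm, same learning rate, two different starting points.
w_left = grad_descent(error_grad, -2.0)
w_right = grad_descent(error_grad, 2.0)
print("start -2.0:", w_left, error(w_left))
print("start +2.0:", w_right, error(w_right))
```

Started from the right, the descent stops in the shallower minimum even though a lower error exists on the left; the gradient is zero at both points, so the update rule has no way to tell them apart.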



Firas Hamze
Thu Jun 1 01:31:26 PDT 2000