Jonathan Recker
Abstract
Many tedious tasks can be made more efficient by automating the process of reading hand-written numerals. In such a system an optical scanner converts each hand-written numeral to a digital image, and computer software classifies the image as one of the digits zero through nine. This paper describes two neural network-based algorithms for performing this task. The first algorithm identifies numerals based on spatial features, while the second uses statistical properties of the scanned images. Simulations indicate that both algorithms perform with greater than 90% accuracy on an unfamiliar test set and that the statistical approach reduces computational complexity without sacrificing performance.
I. Introduction
By reducing the need for human interaction, numeral-recognition systems can speed up jobs such as reading income tax returns, sorting inventory, and routing mail. Several steps are necessary to achieve this. A recognition system must first capture digital snapshots of hand-written numerals. Before attempting to classify the numerals, some preprocessing of the snapshots might be necessary. If a string of adjacent numerals is captured at once, for example, the system has to decide on boundaries between digits. An algorithm must then classify each hand-written numeral as one of the ten decimal digits. This information can then be stored in a database, passed on to mail-sorting robots, or utilized in some other fashion.
Although a qualitative description of this process is straightforward, it cannot be easily reduced to a few simple mathematical rules. The difficulty results from the natural variations in human handwriting. A useful recognition system must be robust to alterations in size, shape, orientation, thickness, etc. Closed-form mathematical models tend to be inadequate for such a task because of the many possible representations of the same image. To be sufficiently robust such models require a very large number of rules.
Humans, on the other hand, can accurately classify all but the most severely distorted numerals. One natural idea, then, is to emulate in software the process that people use to recognize numerals. As a child is growing up he learns to read by seeing images (letters, numbers) and having a teacher (someone who already can read) tell him what the images represent. After sufficient training the child becomes able to read without a teacher. His education also generalizes so that he can even read unfamiliar handwriting. As an example consider figure 1. Although the numerals in each pair look significantly different, few readers will have difficulty identifying them.
Humans learn to classify fuzzy patterns like these, and neural networks are well-suited for mapping this human style of recognition into a procedure implementable in software. The rest of this paper describes this process in more detail. First I offer a brief overview of neural networks and explain why they are useful here. Then I describe the database of hand-written numerals that I used. Next I discuss the specific algorithms I implemented and how I tested them. Finally I summarize my simulation results and comment on their significance.
II. Neural Network Overview
A. Structure
Neural networks are a popular research subject, and technical papers
describing them abound. Rather than rederive the mathematical properties
of neural networks here, I offer a more conceptual overview of how these
structures work and focus primarily on why they are useful in numeral
recognition. Additional background is available from several sources. The
article by Widrow and Lehr [1] is highly recommended as an intuitive
introduction to neural networks and training methods. A more rigorous
mathematical derivation of backpropagation can also be found in chapter 4
of [2]. For now, let us consider a typical fully-connected neural network,
as illustrated in figure 2.
Each open circle represents a neuron, and the arrows indicate connections between neurons. The neurons in one column form a “layer.” The solid circles at the inputs are merely connection points and do not count as a layer of neurons. Each neuron accepts several inputs and produces one output which is sent to all of the neurons in the next layer. The inputs to a neuron are each weighted (multiplied by a scalar) and then summed. The sum is scaled by a nonlinear "sigmoid" function, and this output becomes an input for the neurons in the next layer.
The sigmoid function, denoted by sgm(x), must be chosen carefully to ensure that the neural network can be trained properly. The sigmoid should be at least piecewise differentiable, anti-symmetric about the origin, and should saturate at +1 for large positive x and –1 for large negative x. For this project I used the sigmoid function
|   | sgm(x) = tanh(x) | (1) |
which has derivative
|   | sgm'(x) = sech2(x) | (2) |
Using Euler's formula this can be reduced to
|   | sgm'(x) = 1 - sgm2(x) | (3) |
This is a smooth function satisfying the above criteria. The fact that sgm’(x) is a simple function of sgm(x) has computational advantages as well. The backpropagation algorithm used for training the network requires that both sgm(x) and sgm’(x) be calculated at each neuron. [1] Once we have sgm(x), we can use (3) to determine sgm’(x) with one extra addition and multiply. This is faster than calculating and squaring a second transcendental function as in (2).
B. Training
Neural networks are highly nonlinear, so the only feasible method of
setting the neuron weights is to train the network. Initially the weights
are set to small random values. Then an input pattern is fed into the
network, and the outputs from the neurons in the final layer are observed.
An error signal is calculated at each output neuron by subtracting the
observed output from the desired output. These errors are fed back into
the network and used to update the neuron weights in each layer. The
standard technique used to do this is the Rumelhart-McClelland
backpropagation algorithm. [3]
This process can be repeated with many input-output pairs. Eventually the neural network will learn to associate certain inputs with certain outputs. Additionally, the network behavior will generalize to some extent. That is, given an input that it has not seen before, the network will match it with a similar input that it has seen before and will produce a corresponding output. This is the characteristic of neural networks that makes them suitable for classifying patterns such as those in figure 1. A network is first trained by feeding in many different images of hand-written numerals and telling the network what the output should be (i.e. what digit the input represents.) After sufficient training the network will be able to classify new images based on their similarity to images used for training.
III. Training and Testing Data
A reliable numeral recognition system must be trained on a large number of samples. With a large training set the system will get to see many different versions of each digit, and the law of large numbers will work to our advantage. The effects of minor variations in the hand-written numerals will average out, while the key features of each digit will be reinforced. As an example, consider the two "eights" in figure 1. Notice that one of them leans a little to the left while the other leans a little to the right. If the network is trained with lots of eights, some leaning one way and some leaning the other, it will conclude that this is not an important feature in determining "eight-ness". All eights consist of two touching ovals, however, so the network will learn that this is a critical feature for all eights. If the training set is too small, the network will not be able to figure this out and therefore will not generalize well.
The database used in this project was originally assembled by Alpaydin and Kaynak at Bogazici University in Turkey. The database and some background information are available online. [4] It consists of one set of about 3800 samples from 30 people, and a second set of about 1800 samples from 13 different people. Each sample is a 32x32-pixel binary image scanned from preprinted forms filled out by these 43 people. Each digit zero through nine is represented an equal number of times.
Alpaydin and Kaynak also scaled and normalized the images using standard routines from the National Institute of Standards and Technology (NIST). This preprocessing is numeral-independent and serves only to center the numerals in the 32x32 grid and stretch them to have consistent width and height. This is illustrated in figure 3, where the one (originally a thin line) was widened to be about as wide as the eight.
IV. Algorithms
A. Spatial
The first of two algorithms tested for this project is what I call the
"spatial" approach, illustrated in figure 4.
The input to the neural network consists of all 1024 pixel amplitudes, and there is one output corresponding to each digit 0-9. The desired response is +1 at the output corresponding to the identified digit, and –1 at all the other outputs. The desired response for the input shown in figure 4, then, would be +1 at output 3 and –1 at all other outputs. Typically the outputs will not be exactly +/- 1, but we hope that they are close, say +/- 0.95 or better. Some experiments were run to determine what type of neural network topology worked best for this algorithm. Ultimately a 1024-100-10 network (i.e. 1024 inputs, 100 neurons in layer 1, 10 neurons in layer 2) was found to work best.
B. Statistical
The second algorithm I tried was inspired by a similar project undertaken
by Benjamin Defnet at Georgia Tech, [5] which was in turn based off of
work by Wang, Nagendraprasad, and Gupta. [6] The idea is to reduce the
dimensionality of the recognition problem by considering statistics of the
images rather than just spatial features. One method of doing this is to
compute the row, column, and diagonal histograms (i.e. count black pixels)
of each image. These statistics are then used as inputs to the neural
network rather than the individual pixel amplitudes. This is illustrated
in figure 5.
The structure of the neural network is similar to than in figure 4, except now the input vector is the (32 + 32 + 63 + 63) = 190 histogram values. The topology of the network is now 190-20-10, which is a significant reduction in complexity.
In Defnet's experiments he was only able to obtain about 60% accuracy recognizing unfamiliar inputs, and he attributed this to an undersized training set. The training set I used was much larger, and my images were preprocessed differently. This resulted in a significant improvement in accuracy.
V. Results
A. Metrics of interest
Three outcomes are possible when testing these algorithms. Given any input
the network can either make a correct identification, an incorrect
identification, or no identification. The first case is, of course,
desirable. A three is correctly identified as a three. The second outcome
occurs when the network makes a mistake - a three is identified as some
other digit. The third case occurs when the network is not sufficiently
confident in the output to make a guess. My algorithm makes no
identification if either all outputs are negative or more than one output
is positive. It will only venture a guess when exactly one output is
greater than zero. This approach is often better than guessing with a high
probability of error. In sorting mail, for example, it would be better to
send unreadable addresses to a person for clarification rather than to
route them to the wrong destination.
B. Classes of data
The data used for testing can be divided into three categories. The first
set consists of samples actually used in training. The network should
identify these digits with near 100% accuracy since it has seen them
before. If it does not, the amount of training was probably inadequate.
The second class of data is composed of additional samples from people who
contributed to the training set. These samples are likely to resemble
training samples very closely, but they were not themselves used for
training. Finally the system should be tested with samples from people not
used in training. Real systems must handle this type of data robustly, so
accuracy on unfamiliar data is the most important metric.
C. Simulation results
The graph in figure 6 summarizes the results with 750 training and 750
test images. As the number of training samples was increased from 250 to
500, the accuracy seemed to approach an asymptotic limit. Increasing this
number from 500 to 750 made little difference, so the results in figure 6
appear to represent the best that can be done with these algorithms.
The first two bars show that both algorithms performed with 100% accuracy on samples in the training set. This was expected, as mentioned above. The middle pair of bars indicates that both algorithms were correct greater than 90% of the time on other samples from people in the training set. The last two bars show similar behavior for samples from individuals not in the training set. Note also that misidentification occurred less that 5% of the time outside of the training set. The system was more likely to make no guess rather than to guess incorrectly. These results are quite a bit better than those in [5], most likely because of the large size of my training set and the different preprocessing of my data.
D. Implementation issues
The algorithms were first implemented and tested for correctness using
Matlab. The actual simulations, however, were done using C implementations
of each system. I rewrote my original Matlab code in C for several
reasons. Most importantly the compiled C code runs at least an order of
magnitude faster than corresponding Matlab code because of the iterative
nature of the backpropagation algorithm. A completely general
implementation of backpropagation, for example, involves three nested
loops (looping over all layers, all neurons in each layer, and all inputs
to each neuron) with up to three multiply-accumulates per weight. Because
the Matlab scripting language is not compiled, the overhead of these
nested loops is immense. The simulations were performed on a Pentium III –
500 MHz system which features good pipelining and branch prediction. It
was therefore possible to run many more simulations (with much less
aggravation) with a fast C implementation.
Also, I wanted to produce clean, reasonably fast C code which others can use as a springboard for future research in this area or neural networks in general. My source code is available from this web site. Many public-domain implementations of backpropagation routines exist, but I found the majority of them to be clunky, slow, and unreadable due to excessive abstraction. My backpropagation code supports any fully-connected network topology, and the core routine requires about 40 lines of code. This could be reduced further, but I retained some of the awkward indexing conventions I originally used in Matlab.
E. Complexity comparison
Perhaps the most important fact revealed by these results is that the
statistical algorithm provided comparable accuracy to the spatial
algorithm but with much lower complexity. A rough analysis is useful: A
forward pass through the neural network requires approximately one
multiply-accumulate per weight. The 1024-100-10 network used by the
spatial algorithm has about 1.1 million weights, while the 190-20-10
network for the statistical algorithm has about 40,000 weights. So for
nearly identical accuracy, the statistical recognition system runs about
25 times faster.
VI. Conclusion
Automatic recognition of hand-written numerals has many applications, but designing reliable systems is challenging because of the natural variations in human handwriting. One way to solve this problem is to use a neural network that learns to identify numerals much like a person learns to read. If the training set for such a network is sufficiently large and diverse, the network will generalize to recognize hand-written numerals from unfamiliar sources. Two algorithms were presented here, and both were shown to have accuracies of greater than 90%. These experiments also demonstrated that system complexity can be reduced significantly without degrading performance by considering image statistics rather than just spatial features.
References
| [1] | B. Widrow and M. Lehr, “30 years of adaptive neural networks: Perceptron, madaline, and backpropagation,” Proc. IEEE, vol.78, pp.1415-1442, Sept. 1990. |
| [2] | S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1999. |
| [3] | D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, vol.1, ch. 8, D. E. Rumelhart and J. L. McClelland, Eds., Cambridge, MA: M.I.T. Press, 1986. |
| [4] | ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/ |
| [5] | http://www.cc.gatech.edu/classes/AY2000/cs7495_fall/participants/jammin/fp/ |
| [6] | Wang, Nagendraprasad, and Gupta, "A neural net based "hybrid" approach to handwritten numeral recognition," Northeastern University and MIT. |