KLT and DCT Gender Classification of Face Images

Chad Netzer
Pavithra Srinivasan
EE 368A, Spring Quarter, 2001

Abstract: We compare the method of "eigenfaces" using the KLT, with methods that use the DCT for dimension reduction, for performing gender classification. We use several classification methods, namely neural net, linear discriminant, and closest match, and test their accuracy on both KLT and DCT coefficients.

We also use cross validation to determine expected accuracy, and variance of the classification accuracy, and we use an independent data set to test the trained classifiers.


Introduction

Automated gender classification is possible with the use of statistical classification algorithms that are trained with a set of known images. With a large number of known images, it may be possible to directly train the classification algorithms, with each grayscale image representing a point in the image space. The dimension of the space is equal to the number of pixels. However, it is often much better to reduce the number of dimensions, in an efficient way, allowing accurate training with fewer observations.

In applications using face images, a well established dimension reduction technique is to use the Karhunen-Loeve Transform on an ensemble of images, to generate an eigenspace composed of "eigenfaces" (see Kirby and Sirovich [1]). The idea is to project images into this eigenspace, and use the first few projected components for classification.

However, determining the KLT eigenspace for a large ensemble of images, or for images of high dimensionality, can be computationally expensive (Kirby and Sirovich [2]). An alternative for face recognition tasks is to use Discrete Cosine Transform techniques. The idea is to use a few of the most "informative" coefficients to represent the image space, and classify each image based on these coefficients. The idea has been applied successfully to image recognition [3], and we will test its abilities for gender classification.

Eigenfaces

Using eigenfaces is an effective means of dimension reduction because of the energy compaction properties of the Karhunen-Loeve Transform (KLT). Given an ensemble of training images, the KLT will determine the most informative eigenvectors (the "principal components"), allowing the least informative components to be ignored.
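As a rough illustration of the idea, the eigenfaces of an ensemble can be obtained from a singular value decomposition of the mean-subtracted images. This is only a minimal sketch in Python with NumPy, not the code used for this report; the function name `compute_eigenfaces` and the synthetic random "images" are assumptions made for the example.

```python
import numpy as np

def compute_eigenfaces(images, num_components):
    """Return (mean_face, eigenfaces) for an (n_images, n_pixels) array.

    Each returned eigenface is a unit-length row vector, and the rows are
    ordered by decreasing variance explained.
    """
    mean_face = images.mean(axis=0)
    centered = images - mean_face
    # The rows of vt are eigenvectors of the ensemble covariance matrix.
    # Using the SVD avoids forming the n_pixels x n_pixels covariance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:num_components]

# Synthetic stand-in for a face database: 20 "images" of 8x8 pixels.
rng = np.random.default_rng(0)
faces = rng.normal(size=(20, 64))
mean_face, eigenfaces = compute_eigenfaces(faces, num_components=5)
```

With real data, each row of `images` would be one face image flattened to a vector, and each row of `eigenfaces` could be reshaped back to the image size for display.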

Throughout this report, we make use of a database of 400 medical students (200 male, 200 female) provided by A. Diaco, J. DiCarlo, and J. Santos, in their Spring 2000 report on Gender classification. Here are 8 images from this database, four male and four female:

Eight representative images from medical student database

From this database of 400 faces, we computed the first five eigenfaces, shown below as images:

First five eigenfaces from medical student database (panels 1 to 5)

For comparison, here are some representative images collected from the FBI and other U.S. Government websites, representing the mug shots of wanted fugitives (101 male, 31 female):

Eight representative images from FBI fugitive database

And here are the first five eigenfaces computed from this database of FBI images:

First five eigenfaces from FBI fugitives database (panels 1 to 5)

This illustrates how different data sets can produce quite different eigenspaces. In the case of the FBI images, there are fewer female images to balance out the males in the eigenface representation. In addition, the quality of the images is generally poorer and more varied in size, requiring some additional preprocessing to prepare them for eigenspace representation.

Most informative DCT coefficients

In contrast to the KLT, which produces a basis that is dependent on the ensemble images, the Discrete Cosine Transform (DCT) basis is image independent. However, given an ensemble of images, we can choose the most informative DCT coefficients for that ensemble by determining which coefficients have the greatest variance.
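The selection rule can be sketched as follows. This is a minimal example under stated assumptions, not the report's code: it builds an orthonormal DCT-II matrix by hand, and the names `dct_matrix`, `dct2`, and `most_informative`, as well as the synthetic 8x8 images, are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct2(image):
    """Separable 2-D DCT of a square image."""
    c = dct_matrix(image.shape[0])
    return c @ image @ c.T

def most_informative(images, num_keep):
    """(row, col) positions of the num_keep highest-variance coefficients."""
    coeffs = np.stack([dct2(img) for img in images])
    variances = coeffs.var(axis=0)           # variance across the ensemble
    flat_order = np.argsort(variances, axis=None)[::-1][:num_keep]
    rows, cols = np.unravel_index(flat_order, variances.shape)
    return list(zip(rows.tolist(), cols.tolist()))

# Synthetic ensemble whose only varying DCT coefficient is at (1, 2).
rng = np.random.default_rng(0)
c = dct_matrix(8)
images = []
for amplitude in rng.normal(size=10):
    coef = np.zeros((8, 8))
    coef[1, 2] = amplitude
    images.append(c.T @ coef @ c)            # inverse 2-D DCT
top = most_informative(images, num_keep=1)
```

Here `top` identifies the single coefficient that carries all of the ensemble variance; on real face images the same ranking would be taken over all coefficients of the ensemble.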

Using standard 2-dimensional DCTs, here is a plot of the variances of each coefficient (using the entire ensemble of medical student images), on a log scale (minus the DC component, because the images all have subtracted means):

Here, we see that the general trend of the coefficient variances is to drop off for the higher frequency components, which is to be expected in natural images.

If we zoom in on the lowest 30 x 30 region of DCT coefficients, and plot on a linear scale, we see how quickly the variances drop off (here plotted as standard deviations):

This compaction of variance in the lower frequency components of a DCT is analogous to the energy compaction properties of the DCT that are so often exploited for use in data compression.

In order to reduce the dimensionality of the DCT coefficients for the actual gender recognition, we choose a small subset of all the DCT coefficients, and use them to train the classifying algorithms.

Here is a series of plots showing which DCT coefficients are "most informative", for varying numbers of retained coefficients (from 1 to 100):

15x15 section showing "most informative" DCT coefficients (panels: 10, 20, 30, 40, and 50 coefficients)

Here we see that the components of greatest variance are concentrated in the upper left corner, which corresponds to the lowest frequencies. The pattern is not strictly triangular, but instead has a somewhat more complicated shape. Our gender classification methods use images that are 128 pixels on each side, which means there are 128*128 = 16384 coefficients per image. We discard all but the most important one hundred coefficients or so, which represents a huge reduction in the dimensionality of the problem. Thus, the DCT's ability to concentrate the most informative components in a small region of the coefficient array is very useful.

When using the DCT coefficients for classification, one could calculate the "most informative" coefficients, which is an expensive operation (more expensive than the KLT, since the DCT of every image must be computed). However, one could also rely on the compaction of variance into the low frequencies, and use a precomputed template of coefficients (not unlike the zig-zag marching pattern used in JPEG), or compute the variances on a random subset of images, to approximate the "most informative" coefficients. Either of these techniques can result in considerable computational savings, of both memory and processor cycles. This makes DCT based methods a better choice for time or memory critical applications, at the possible expense of some accuracy.
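A precomputed template of the kind mentioned above might look like the following sketch, which lists coefficient positions in JPEG-style zig-zag order. The function name `zigzag_indices` is hypothetical, and whether such a template matches the variance-based selection closely enough is not something this report measured.

```python
def zigzag_indices(size, count):
    """First `count` (row, col) positions of a size x size block in
    JPEG-style zig-zag order, starting at the DC term (0, 0)."""
    order = []
    for s in range(2 * size - 1):            # s = row + col, one anti-diagonal
        diagonal = [(r, s - r) for r in range(size) if 0 <= s - r < size]
        # The traversal direction alternates on successive anti-diagonals.
        if s % 2 == 0:
            diagonal.reverse()
        order.extend(diagonal)
    return order[:count]

template = zigzag_indices(128, 100)
```

Since the images here have their means subtracted, one would probably skip the first entry (the DC term) when using such a template.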

Projection and coefficient rescaling

Once the eigenspace has been computed with the KLT, or the most informative coefficients have been selected with the DCT, a basis is formed for representing new face images. The KLT and DCT methods use different approaches for calculating coefficient values. However, in both cases, we ensure that the coefficient weights that are produced fall in the range from -1 to 1.

In the case of the KLT, an eigenspace is computed, and images are "projected" into that space. This means that the dot product is formed between the image vector and each of the eigenface vectors in the eigenspace. The resulting dot products are the coefficients that are used to train and test the classification algorithms. In order to ensure that the coefficients fall in the range from -1 to 1, the image vector is normalized to length one. Since the eigenface vectors also have unit length, each coefficient is then the cosine of the angle between the eigenface vector and the image vector.
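The projection step can be sketched in a few lines. This is an illustrative example only: the function name `project` is hypothetical, the orthonormal basis below is a random stand-in for real eigenfaces, and any mean subtraction is left to the caller.

```python
import numpy as np

def project(image_vector, eigenfaces):
    """Cosines between an image vector and unit-length eigenfaces.

    eigenfaces has shape (n_components, n_pixels), one eigenface per row.
    """
    unit_image = image_vector / np.linalg.norm(image_vector)
    return eigenfaces @ unit_image

# Hypothetical orthonormal basis standing in for real eigenfaces.
rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.normal(size=(64, 5)))
eigenfaces = basis.T                       # shape (5, 64), unit-length rows
weights = project(rng.normal(size=64), eigenfaces)
```

Because both vectors in each dot product have unit length, every entry of `weights` lies between -1 and 1.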

For the DCT coefficients, after the most informative components are found for a training set, it is a simple matter to compute the DCT of any test images, and discard all but the most informative components. However, these coefficients can have a fairly wide range. Z. Pan, R. Adams and H. Bolouri propose a method of normalization for these coefficients [3],[4], in order to improve the learning speed of neural net algorithms. The method is simply to estimate the upper and lower bounds of the coefficients from the training data, and to scale each coefficient separately to be within the range of -1 to 1. We employed this method in all our DCT classifications.
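A minimal sketch of this rescaling is given here. It assumes the bounds are simply the per-coefficient minimum and maximum over the training set, which is one possible reading of "estimate the upper and lower bounds"; the function names `fit_bounds` and `rescale` are hypothetical.

```python
import numpy as np

def fit_bounds(train_coeffs):
    """Per-coefficient lower and upper bounds from an (n_images, p) array."""
    return train_coeffs.min(axis=0), train_coeffs.max(axis=0)

def rescale(coeffs, lower, upper):
    """Map each coefficient linearly so that [lower, upper] becomes [-1, 1]."""
    # Guard against a zero range for coefficients that never vary.
    span = np.where(upper > lower, upper - lower, 1.0)
    return 2.0 * (coeffs - lower) / span - 1.0

rng = np.random.default_rng(2)
train = rng.normal(scale=50.0, size=(30, 4))   # synthetic DCT coefficients
lower, upper = fit_bounds(train)
scaled_train = rescale(train, lower, upper)
```

Test images are rescaled with the bounds learned from the training data, so their values can fall slightly outside the range; whether to clip them is a design choice this sketch leaves open.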


Classification

Automated gender classification is accomplished by training a classification algorithm with face images of known gender, then using the trained classifier to determine the gender of unknown images. The face images are reduced in dimensionality, using the KLT or DCT transforms and taking a subset of the coefficients.

In the classification project of Diaco, DiCarlo, and Santos (EE368A Spring 2000), they compared several simple classification algorithms, using the database of Stanford medical students. We will also use some of these methods to compare the results for KLT and DCT dimension reduction. Furthermore, we will use a neural net classification scheme, of a kind that has been used successfully in image recognition tasks.

The linear discriminant classifier is a very simple idea, in which the training phase is used to determine a vector of weights, w, that solves the linear equation:

Xw=y

where X is an (n by p) matrix of basis coefficients, and y is the vector of correct values, one for each training image. Each row of X is an "observation" with dimension p. For gender classification, each entry of y is either a 0 or 1 (representing a male or female).
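One way to realize this classifier is sketched here. Since n and p generally differ, the sketch solves Xw = y in the least-squares sense and labels a new observation by thresholding its output at 0.5; both of these choices, the function names, and the synthetic data are assumptions rather than details taken from this report.

```python
import numpy as np

def train_linear_discriminant(X, y):
    """Least-squares solution w of X w = y."""
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return w

def classify(X, w, threshold=0.5):
    """Label each row of X as 1 if its output reaches the threshold, else 0."""
    return (X @ w >= threshold).astype(int)

# Synthetic, well separated two-class data with p = 2 coefficients.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
centers = np.where(y[:, None] == 1, [1.0, 0.0], [0.0, 1.0])
X = centers + 0.01 * rng.normal(size=(40, 2))
w = train_linear_discriminant(X, y)
predicted = classify(X, w)
```

Training amounts to a single least-squares solve, which is why this classifier is so cheap compared with iterative methods.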

Another classifier we used was a hidden layer feed-forward neural network with error backpropagation. The network was constructed with Matlab's Neural Network toolbox. A number of different configurations were tested, wherein we varied the number of internal nodes, and the training functions, to see which performed the best. In particular, we experimented with a fixed number of nodes, and variable numbers corresponding to the number of input coefficients. For the training algorithms, we found that Levenberg-Marquardt ['trainlm'] and the Broyden, Fletcher, Goldfarb and Shanno ['trainbfg'] algorithm performed quite poorly for this task, while conjugate gradient and Resilient Backpropagation ['trainrp'] performed reasonably well.

Testing all the configurations was a time and compute intensive task, and in the end we settled for a conjugate gradient learning function (Matlab's 'trainscg' function), with a single hidden layer, and 70 internal hidden nodes.


Cross validation with Med student database

The concept of cross validation is straightforward. The idea is to use a face database, with known genders, to test the accuracy and variability of classification methods. In a classic 10-way cross validation, one sets aside 10% of the data as validation data, and uses the remaining 90% of the data for training. Accuracy results are collected using the validation data, as though it were actual test data (i.e., previously unknown to the classifier). This procedure is repeated 10 times, each time using a different 10% of the data for validation. A good introduction to the use of cross validation is provided by Mosteller and Tukey [5].
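The procedure can be sketched as follows. This is a minimal illustration, not the code used for the results in this report; the function names, the random shuffling of the data, and the toy threshold classifier are all assumptions.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Split a shuffled range of sample indices into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(X, y, train_fn, predict_fn, k=10):
    """Mean and sample standard deviation of validation accuracy."""
    folds = k_fold_indices(len(y), k)
    accuracies = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        predictions = predict_fn(model, X[val_idx])
        accuracies.append(np.mean(predictions == y[val_idx]))
    return float(np.mean(accuracies)), float(np.std(accuracies, ddof=1))

# Toy data: 200 "male" (0) and 200 "female" (1) samples, one perfect feature.
y = np.array([0] * 200 + [1] * 200)
X = y.reshape(-1, 1).astype(float)
mean_acc, std_acc = cross_validate(
    X, y,
    train_fn=lambda X_train, y_train: 0.5,                # fixed threshold
    predict_fn=lambda model, X_val: (X_val[:, 0] > model).astype(int),
)
```

Any classifier can be plugged in through `train_fn` and `predict_fn`, and repeating the call for different numbers of retained coefficients gives accuracy curves of the kind plotted in the following sections.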

An important question when using any classifier is how many KLT or DCT coefficients are sufficient for achieving good classification accuracy. This is where cross validation can help. By doing a cross validation for many different numbers of coefficients, we can get an idea of how many we should use.

Linear discriminant results from cross validation

In the classification project of Diaco, DiCarlo, and Santos, several classification methods were used to test the KLT method on the medical student database. One simple and effective method is a linear discriminant, which we also adopted and used to compare the KLT and DCT dimension reduction procedures.

Here is a plot of the accuracy and variability of the linear discriminant on the Med student database, using a 10-way cross validation procedure with both the KLT and DCT dimension reduction methods:

Linear discriminant accuracy/variability as a function of retained coefficients (two panels: KLT accuracy, DCT accuracy)

Both graphs show the means, and estimated standard deviations around the mean, of the 10-way cross-validated training and test data. The training data is represented with blue, and the test data with red. Also, each plot uses a logarithmic X axis, to emphasize the lower coefficients.

In both cases, it can be clearly seen that the classification accuracy of the training data increases with the number of coefficients used, until it reaches 100 percent. This is to be expected: as the number of parameters increases to match the number of training observations, the model can classify each training image exactly.

It is also evident that the accuracy of classification on the test/validation data matches the training data fairly closely at lower numbers of coefficients. However, as more coefficients are used, the test data classification accuracy reaches a peak near the 90% mark. In the case of KLT coefficients, the classification rate remains fairly stable with 10 or more coefficients. However, with DCT coefficients, after about 100 coefficients, the accuracy rate drops sharply. This is likely due to a phenomenon known as "overfitting", where the classifying algorithm fits the training data very well, but doesn't generalize well to test data, because of bias introduced by the training.

It is also important to note that these graphs (and the cross validation procedure) produce information about the variation of the classification accuracy. Both graphs plot an estimate of standard deviation around the means, and while the deviation around the training data classification is quite small, the standard deviation for the test data is much larger. Also, the KLT deviation is generally smaller than the DCT deviation. Without going into a more rigorous examination of the variation of the mean estimates, involving confidence intervals, we can say that accuracy estimates of this classification scheme will have a not insignificant margin of error.

There is one other important difference shown by these graphs. The linear discriminant requires far fewer KLT coefficients to achieve accuracy rates near 90%. Roughly 10 KLT coefficients are needed versus roughly 60 or 70 with the DCT method. This is a natural consequence of the fact that the KLT method produces a data-dependent basis, whereas the DCT uses a data-independent basis.

Closest Match results from cross validation

Another method used by Diaco, DiCarlo, and Santos is a "closest match" procedure. This is simply the closest Euclidean distance match of a test vector to the vectors in the training set. We also chose to use it as a classifier for both KLT and DCT coefficients, due to its simplicity and effectiveness.

Closest match accuracy/variability as a function of retained coefficients (two panels: KLT accuracy, DCT accuracy)

Of note here is that the training data is always classified correctly by this method, since the model essentially remembers each training vector and finds a perfect match.

For the KLT coefficients, the classifier performs similarly to the linear discriminant: it is about 90% accurate with only 10 coefficients, and remains roughly this accurate even with many more coefficients. Using DCT coefficients, the "closest match" method is much more accurate than the linear discriminant when using only a few coefficients (10-30), but does not reach the higher accuracy of the linear discriminant.
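The "closest match" rule described in this section amounts to a one-nearest-neighbor classifier, which can be sketched as follows. The function name `closest_match` and the tiny example vectors are hypothetical.

```python
import numpy as np

def closest_match(train_X, train_y, test_X):
    """Label each test vector with the label of its nearest training vector."""
    # Squared Euclidean distance between every test and training vector.
    diffs = test_X[:, None, :] - train_X[None, :, :]
    distances = np.sum(diffs ** 2, axis=2)
    return train_y[np.argmin(distances, axis=1)]

train_X = np.array([[0.0, 0.0], [1.0, 1.0]])
train_y = np.array([0, 1])
test_X = np.array([[0.1, -0.1], [0.9, 1.2]])
labels = closest_match(train_X, train_y, test_X)
```

Applying the function to the training vectors themselves returns the training labels (as long as no two training vectors coincide), which is why the training accuracy of this method is always 100%.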

Neural net results from cross validation

Neural net accuracy/variability as a function of retained coefficients (two panels: KLT accuracy, DCT accuracy)

The neural net performs about as well as the "closest match", or perhaps slightly better. It seems to show some of the same characteristics as the "closest match", notably that its expected accuracy has a fairly large amount of variability, and that it does reasonably well with a small number of DCT coefficients. It was hoped that the neural net method would achieve the highest accuracy overall, based on its prevalence in face recognition applications. That it didn't achieve that goal could be indicative of the data set, or of our lack of experience in properly tuning neural nets. In general, we are aware of no theory that dictates how many internal nodes or layers to use, other than brute force checking.


Generalizing trained classifiers

One important question is, how will these trained classification methods work on another, completely independent database? In order to answer that question, we employed a database of images downloaded from the FBI and other government law enforcement agencies, consisting mainly of fugitives or other wanted persons. The idea is that these images will have been taken from a much broader section of the population, and in general, the faces will have many differences from those of the Stanford medical students.

Here we used the full set of Stanford medical students to train the linear discriminant, "closest match", and neural net classifiers. Then, all the FBI test images were classified by the algorithms, and the percentage classified correctly is plotted. This is done, as before, for increasing numbers of retained coefficients, for both the KLT and DCT methods.

Classifiers trained with med student faces, and tested on FBI faces (two panels: KLT accuracy, DCT accuracy)

The results of these tests are striking. Where all the classifying methods, using KLT coefficients, were able to average close to 90% accuracy when trained and tested on the medical student database, here they are barely able to manage much better than 60% accuracy. The "closest match" classifier does the worst, while the linear discriminant and neural net seem to do about equally well. However, the neural net fluctuates wildly in performance, as a function of the number of coefficients used. This is consistent with the higher variance exhibited in the cross-validation procedure.

The situation when using DCT coefficients is somewhat more bizarre. Here, the neural net and "closest match" classifiers are far less accurate than the linear discriminant when the number of coefficients used is low (roughly under 300). However, they both improve as the number of coefficients is increased. The linear discriminant is most accurate, in this situation, with a low number of DCT coefficients. In fact, the more that are used, the worse it does. And overall, the linear discriminant with a few DCT coefficients is the most accurate classifier on this data set, easily beating out the KLT classifications.

In general, however, we see that the classification results are much less accurate when testing against the FBI database. It appears that the training produced by the medical student database does not necessarily generalize to other face data sets. This may be due to technical issues, such as noise in the images, or changes in shape and scale between the image sets. However, it may also be due to fundamental differences in the underlying facial statistics of each set of images.

Here are some images that the classifiers trained on the Stanford medical students could not correctly classify, using varying numbers of KLT and DCT coefficients.

Difficult to classify faces, using Stanford med students as training data

The linear discriminant, "closest match", and neural net classifiers were not able to correctly classify these faces even a single time, when trained on the medical student database. It is interesting to note that these images are not particularly noisy or poorly photographed (of which there are many in the FBI database). Rather than being an effect of noise, the poor classification seems to be a distinct result of the face properties themselves. In addition, these don't seem like particularly difficult classifications for a person to perform, which is perhaps indicative of the bias that can result when training the algorithms with data that is too homogeneous.

For comparison, here are some of the faces for which the classifications were most accurate:

Easy to classify faces, using Stanford med students as training data

In general, these results indicate that the training data can have an adverse effect on the ability to do proper classification in real world situations. The proper selection of a face database may greatly improve the classification results.


Conclusions

Using cross validation on the Stanford medical student database, we showed that the KLT can achieve roughly 90% accuracy on images that are similar to those in that database. These results can be achieved using only the first few KLT coefficients (as few as about 10), with either a linear discriminant, "closest match", or neural net classifier. The cross-validation also showed that there is some variation in the expected accuracy value, which makes choosing a particular classifying algorithm more difficult. The linear discriminant seems to work quite well, and with only a few coefficients it is quick to compute.

We also showed that DCT coefficients can be used successfully to classify gender, given appropriate training and test data. The DCT method generally requires more coefficients (between 60 and 70) to approach the accuracy of the KLT, its expected accuracy rate is more variable than that of the KLT methods, and it operates well over a narrower range of coefficient counts.

The linear discriminant classifier seems to work as well as a simple feedforward neural network with backpropagation and a conjugate gradient learning algorithm. Neural networks have been used for many image recognition tasks, and it seems reasonable to assume that better configurations than ours exist. However, the linear discriminant is much simpler and faster, and works as well as or better than this commonly used neural net configuration.

When choosing a training set, care must be taken to understand what data will be tested. Assumptions of accuracy, and statistical simulations to determine that accuracy, will not be valid for certain types of test data. In the case of face images and gender classification, one can easily find test images that are poorly classified by a training set that is too homogeneous. However, just what constitutes a good training set of face images can be difficult to ascertain, given the huge variability in human faces.

However, when performing gender classification (and perhaps other classification tasks) using training data that can reasonably be expected to be unrepresentative of the test data, it may be best to discard KLT methods for a simple DCT method. The linear discriminant with five to eight of the most informative coefficients may perform as well as, or better than, the KLT methods.

Or perhaps this technique can be used as a check of the test data. If the DCT linear discriminant is giving markedly different answers than the KLT linear discriminant, when only five to eight coefficients are being used, it may indicate a situation where the overall classification accuracy is suspect, due to differences in the training and test images.



Appendix (with code and work breakdown)
References
Last modified: Sat Jun 2 19:39:16 PDT 2001