Facing the Truth: Using Color to Improve Facial Feature Extraction

Maria Jabon mjabon@stanford.edu

Christopher Tsai tsaic@stanford.edu

Damien Cerbelaud damienc@stanford.edu

March 20, 2008

EE 362 - Applied Image and Vision Systems

Introduction


Face detection is an integral component of many real-world systems.  As artificial intelligence spreads through computing, programmers look for ways in which computers can view, interpret, and respond sensibly to human beings; since images constitute a primary means of communication between humans and machines, extracting faces, facial features, and emotions from images serves numerous functions in computer vision.  Furthermore, with electronic databases growing to enormous proportions, face recognition has spread across a multitude of fields, from medical records and employee databases to crime-fighting profiles.  Finally, face tracking in particular has piqued the interest of engineers and computer scientists interested in automating surveillance, motion sensing, advanced video capture and conferencing, and dynamic zoom.  In this particular problem, both locating the face and extracting its features robustly are paramount to success; facial feature extraction must prevail despite ambient noise, motion blur, spatial as well as temporal irregularities, and neighboring features across a wide variety of subjects.

OKAO Vision – face vision in Japanese – is a face recognition software package that Omron Corporation developed to both recognize and characterize faces in AVI video sequences.  By localizing a number of facial regions, extracting keypoints such as eyes and mouth, comparing the face against a labeled database, and evaluating the shape and size of the extracted facial features, OKAO estimates age, gender, and other characteristics. 


However, based on the Japanese demographic and the trends of test subjects used to measure OKAO’s performance, the segmentation algorithms used to detect edges and limn facial features do not perform equally well for all humans.  In particular, because Omron has optimized OKAO to successfully identify Asian and Caucasian subjects, the face detection software is not optimally trained for darker complexions[1].  Because the shade of lip color and the nearby skin tone can be deceptively similar for dark-skinned people, OKAO often fails to separate the lip cleanly from the face.  Meanwhile, black lines such as eyelashes or mascara are also less conspicuous on darker skin, so segmentation of grayscaled images becomes even more difficult, even for noteworthy points such as the eyes.  When evaluating the feature identification output from OKAO, for example, we notice several mislabeled frames:


The young woman in the left frame has lip color that is very similar to her cheek color, especially under the overhead lighting of the test room, while the adjacent male also possesses dark skin tones with minimal pinkness in his lips.  In both cases, OKAO mistakes other parts of the subjects’ faces for the mouth; the feature detection algorithm segments the woman’s chin instead of her lip and the man’s left nostril instead of his lip.  Meanwhile, OKAO also mislabels both eyes, likely due to both the failed registration of the mouth and the lack of contrast between black pupils and the surrounding dark skin.  The low lighting compounds the problem.  At worst, OKAO sometimes fails to capture any facial feature, resulting in an unlabeled frame.

Testing our methods through the black box that is OKAO facial feature detection, we strive to enhance the edges in input Red-Green-Blue (RGB) three-channel color images in order to facilitate more robust feature detection at OKAO’s output.  Since OKAO does not accept binary or grayscale images – only color channels – we must somehow enhance the color frames themselves before submitting them to OKAO for grayscaling and segmentation:


Unfortunately, OKAO performs grayscaling and pattern matching within its own software code [1], so we can only model its operation and tailor our color edge enhancement to make the internal grayscale image more amenable to the subsequent edge detection and pattern matching.  Because Caucasian faces are detected and segmented nearly perfectly, we experiment on enhancing primarily dark-skinned subjects, the people with whom OKAO struggles most.

Prior Work

Even before they incorporated all three color channels into edge detection, engineers designed face trackers with grayscale or luminance images.  Monochrome edge detection has plagued the minds of scientists for decades, and numerous ingenious techniques in color originally made their debut in binary or grayscale segmentation; a single colorless image can capture most of the edges visible in a color image.

Even when the original image is color, many methods simply convert to grayscale before beginning processing.  The luminance channel in the YIQ (luminance, in-phase, quadrature) and YCbCr (luminance, blue chrominance, red chrominance) color spaces provides enough high spatial frequency detail to locate the lines between a face and background.  Luminance, by virtue of its high spatial frequency content, contains most of the edge detail in a multispectral image.
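As a minimal sketch of the luminance computation referred to here, the Python below applies the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) that underlie the Y channel of YIQ and YCbCr.  The function names and the two sample colors are illustrative assumptions and are not taken from the project's MATLAB code.  The sample colors also show the flip side of working on luminance alone: two clearly different hues can land on almost the same Y value.

```python
def luminance(rgb):
    """BT.601 luma of one (R, G, B) pixel with components in 0..255."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def to_luminance(image):
    """Convert rows of (R, G, B) tuples into rows of luma values."""
    return [[luminance(px) for px in row] for row in image]


# Hypothetical sample colors: a red and a blue chosen to have nearly equal luma.
reddish = (200, 60, 60)
bluish = (60, 100, 221)

luma_gap = abs(luminance(reddish) - luminance(bluish))
print(f"luma of reddish: {luminance(reddish):.2f}")
print(f"luma of bluish:  {luminance(bluish):.2f}")
print(f"difference:      {luma_gap:.2f}")
```

The difference comes out well under one gray level, so an edge between these two colors would be nearly invisible in the luminance channel.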

Feature-based methods dominate the literature for face detection and recognition.  Algorithms like the Hough transform, the Radon transform, and the hit-or-miss transform isolate common alignments such as horizontal or vertical edges.  Thresholding and watershed segmentation reduce a complicated image into localized regions around object points by eliminating (flooding) surrounding regions in an intensity image with empirically determined threshold values.  Adaptive thresholding is also popular in face detection, since we can often use one threshold to separate the face from its background and then another, more stringent threshold on the segmented face candidates to ascertain the locations of eyes and mouth.  Gradient-based edge detection methods are also widespread for their simplicity and speed: the Sobel, Prewitt, and Roberts edge detectors approximate the directional derivatives in images by convolving the image with small 3x3 filters (or 2x2 filters for the Roberts detector), and the Canny detector builds on the same kind of gradient estimate; the points where a directional derivative is high indicate edge-like behavior[2]:

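$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$ and $S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$ represent the horizontal and vertical Sobel gradient filters (up to a sign convention).

$P_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$ and $P_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$ represent the horizontal and vertical Prewitt gradient filters.

$R_1 = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$ and $R_2 = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$ represent the two Roberts gradient filters, which take differences along the two image diagonals.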
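To make the gradient-filter idea concrete, here is a small sketch that applies the standard Sobel kernels to a grayscale image stored as a list of rows.  It is written in plain Python to stay dependency-free; the helper names and the tiny step-edge test image are assumptions made for illustration, and border pixels are simply left at zero.

```python
import math

# Standard Sobel kernels (sign conventions vary between texts).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(image, kernel):
    """Correlate a 2-D list with a 3x3 kernel; border pixels stay 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = float(sum(
                kernel[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3) for i in range(3)
            ))
    return out


def sobel_magnitude(image):
    """Gradient magnitude sqrt(gx^2 + gy^2) at every pixel."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]


# 5x6 test image: dark left half, bright right half (one vertical edge).
step = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
mag = sobel_magnitude(step)
print(mag[2])  # large values only in the two columns next to the edge
```

Thresholding `mag` would then mark the columns on either side of the step as edge pixels while flat regions stay at zero.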

Finally, features such as eyes and lips possess strong corner points, so Harris and Haralick developed robust corner detectors that are specialized to find places where the edges change direction abruptly[3].


The Harris and Haralick corner detectors represent one case of a popular family of template matching or pattern matching algorithms.  Even for gray images, pattern matching can be quite effective, because the large number of pixels in an image ensures that distinct features have well-defined arrangements.  The two endpoints of the eye and mouth, for example, almost always bear pixel arrangements similar to < or >[4].  One can also specify desired features in detail to prevent false features from falling into detection bins[5].  Eyes, for example, might require their own special template[6]:


Template-matching algorithms need not be so specific.  A myriad of computer scientists have generalized the definition of a feature or template to the looseness of basis functions, which, roughly speaking, approximate features to a fundamental order of magnitude.  For example, Bay, Tuytelaars, and Van Gool devised the Speeded-Up Robust Features (SURF) algorithm to detect and match features using Haar-like box filters:


Their method, which approximates the Hessian and its determinant with a more easily computable version, essentially describes a distribution of Haar wavelet responses in the neighborhood of interest points[7], using the Haar coefficients to distinguish one uniquely shaped feature from another:


The SURF method derives from another feature detection algorithm, Scale-Invariant Feature Transform (SIFT), which blurs the image with a succession of varying-variance Gaussians to selectively accentuate features of different scales[8].  A survey of the number of techniques that SIFT and SURF have spawned would exceed the scope of our project[9].

For face recognition in particular, adaptive techniques such as those developed for neural networks are especially powerful.  With access to a large image database, algorithms need to rely less on feature precision, since a small subset of general statistical descriptors such as eigencoefficients can suffice.  Thus, image processors interested in applying face segmentation to recognition often train their detectors with a series of similarly structured face images (with eyes and mouth positioned in similar locations) to expedite detection and identification.

The use of all three color channels, however, has been less prevalent, likely because computation time and processing complexity increase roughly threefold, depending on the method.  Strong directional derivatives still generally indicate the presence of edges, but compiling a gradient image from three channels is often more cumbersome.  Nevertheless, image processors perform the Sobel gradient (or Canny gradient) on each channel separately and use the multispectral norm as an indicator of edginess:
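The formula that accompanied this paragraph is not preserved, so the sketch below shows one common reading of a "multispectral norm": run the Sobel filters on R, G, and B separately and take the Euclidean norm of all six responses.  The function names and the two test colors are assumptions for illustration only.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_pair(channel, y, x):
    """Horizontal and vertical Sobel responses at interior pixel (y, x)."""
    gx = gy = 0
    for j in range(3):
        for i in range(3):
            v = channel[y + j - 1][x + i - 1]
            gx += SOBEL_X[j][i] * v
            gy += SOBEL_Y[j][i] * v
    return gx, gy


def color_gradient(image):
    """Euclidean norm of per-channel Sobel gradients; borders stay 0.

    `image` is a list of rows of (R, G, B) tuples.
    """
    h, w = len(image), len(image[0])
    channels = [[[px[c] for px in row] for row in image] for c in range(3)]
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = 0.0
            for ch in channels:
                gx, gy = sobel_pair(ch, y, x)
                total += gx * gx + gy * gy
            out[y][x] = math.sqrt(total)
    return out


# Hypothetical colors with nearly identical luminance but different hue.
left, right = (200, 60, 60), (60, 100, 221)
image = [[left] * 3 + [right] * 3 for _ in range(5)]
edges = color_gradient(image)
print([round(v) for v in edges[2]])
```

Because every channel contributes, the boundary between the two colors produces a strong response even though their luminance values are almost equal.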


We can also intertwine the three channels differently, as Gonzalez and Woods propose in their text [10]:
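The equation shown at this point is not preserved.  The sketch below assumes it was the vector-gradient formulation presented in Gonzalez and Woods' text, which combines the per-channel derivatives into the terms g_xx, g_yy, and g_xy and reports the direction and strength of maximum color change.  The function name and the sample derivative values are hypothetical.

```python
import math


def vector_gradient(dx, dy):
    """Direction and strength of maximum color change at one pixel.

    dx, dy: per-channel (R, G, B) partial derivatives along x and y,
    for example Sobel responses computed on each channel.
    """
    gxx = sum(v * v for v in dx)
    gyy = sum(v * v for v in dy)
    gxy = sum(a * b for a, b in zip(dx, dy))
    theta = 0.5 * math.atan2(2 * gxy, gxx - gyy)
    f_squared = 0.5 * ((gxx + gyy)
                       + (gxx - gyy) * math.cos(2 * theta)
                       + 2 * gxy * math.sin(2 * theta))
    # Clamp tiny negative values caused by floating-point rounding.
    return theta, math.sqrt(max(f_squared, 0.0))


# Change only in the red channel along x, then only in green along y.
theta_x, strength_x = vector_gradient((3, 0, 0), (0, 0, 0))
theta_y, strength_y = vector_gradient((0, 0, 0), (0, 4, 0))
print(theta_x, strength_x)
print(theta_y, strength_y)
```

Unlike the simple norm over channels, this formulation also yields an edge direction, at the cost of a few extra products per pixel.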

Computer scientists and psychologists have also standardized the use of skin color to identify candidate face regions [11], followed by template matching within the skin regions to eliminate false positives (such as hands, feet, and other body parts)[12][13]. 

For systems that convert RGB images to YCbCr, the luminance channel often fails to properly identify color edges between objects that share comparable luminance values:


The blue sheet of paper disappears because its luminance is the same as the table’s luminance despite their contrasting colors.  In light of this failure, Dikbas, Arici, and Altunbasak proposed principal component analysis (PCA) for quality edge detection robust to luminance similarities; they transformed the RGB axes into eigenspace axes, with the first eigenimage representing the maximal variance components in the image [14][15].  For instance, the following differently colored squares share similar luminance values, thwarting normal color gradient segmentation.  Edge detection on the first principal component, however, succeeds:


Notice that the color Laplacian – a variant of the color gradient method – performs particularly poorly on the skin tones in the lower-left corner of the grid.  A number of methods display this difficulty, as we illustrate with a test image.  In the following grid of skin tones, the luminance changes discontinuously after every 16 columns, while the chrominance jumps after every 16 rows:


Both the grayscaled image and the luminance channel omit edges in one direction because of their relative insensitivity to color changes.  The color gradient finds all of the discontinuities but returns some undesirable double edges where two color channels differ.  The first eigenimage is cleanest.
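A rough sketch of how a first eigenimage can be computed follows.  It uses ordinary PCA on the RGB pixel cloud with power iteration on the 3x3 covariance matrix, which is a simplification: Dikbas, Arici, and Altunbasak describe an approximate first principal component, and their exact procedure is not reproduced here.  All names and the two test colors are illustrative assumptions.

```python
import math


def first_principal_axis(pixels, iterations=100):
    """Mean color and dominant unit eigenvector of the RGB covariance."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    centered = [[p[c] - mean[c] for c in range(3)] for p in pixels]
    cov = [[sum(v[a] * v[b] for v in centered) / n for b in range(3)]
           for a in range(3)]
    # Power iteration; a start vector exactly orthogonal to the
    # principal axis would stall, which is acceptable for a sketch.
    vec = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:  # constant image: no direction of variance
            break
        vec = [x / norm for x in nxt]
    return mean, vec


def eigenimage(image, mean, axis):
    """Project every pixel onto the principal axis."""
    return [[sum((px[c] - mean[c]) * axis[c] for c in range(3))
             for px in row] for row in image]


# Hypothetical colors with nearly identical luminance but different hue.
left, right = (200, 60, 60), (60, 100, 221)
image = [[left] * 3 + [right] * 3 for _ in range(4)]
mean, axis = first_principal_axis([px for row in image for px in row])
pc1 = eigenimage(image, mean, axis)
separation = abs(pc1[0][0] - pc1[0][-1])
print(round(separation, 1))
```

The two colors end up far apart along the principal axis, so a standard grayscale edge detector run on `pc1` would see a clear step where the luminance channel sees almost none.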

Edge detection functions noticeably better on the first principal component, but, unfortunately for us, many prepackaged face detectors like OKAO successively perform grayscaling and edge detection; thus, we can influence the edge detection only through a color image, as the grayscaling occurs within the black box, out of our reach.  However, we can still use the principal component (as well as various color gradient metrics) to boost edges in all color channels, or even in the luminance channel.  This edge boosting enhances color contrast and makes detection more robust despite continued use of OKAO’s internal grayscaling function.

Methods

For all of the edge detection algorithms mentioned in the literature, we had to determine how best to incorporate edge-boosting features into color images.  Do we replace the luminance channel with the principal component?  Do we duplicate or triplicate the principal component images into the three color channels?  Can we boost only the locations in the color images where color edges appear?  Or do we selectively enhance all regions in the luminance channel by edges detected in the principal component?  We attempted a variety of these alternatives and observed the results.


I. Histogram Equalization

Histogram equalization is a logical first step to compensate for imbalanced illumination.  Segmentation of dark-toned subjects in particular suffers under low lighting, so compensating for this lighting is important and almost always helpful.  Equalizing a windowed region (the face segmented from its background) also improves detection because the color tones displayed on the human face generally occupy a limited portion of the color and luminance histograms, making equalization to the full dynamic range especially helpful for distinguishing two otherwise close shades of brown or tan:


The African-American face, for example, contains a plethora of reds but mainly low-intensity blue and green content.  The luminance, too, is limited to a range between 20 and 200, inviting dynamic range stretching.  First, we equalize the color (RGB) channels separately, producing the results:


Whereas the color histograms are now maximally expanded to fill the dynamic range, the luminance mass function still seems lopsided and limited, so we separately equalize the luminance channel in our next experiment:


When we equalize the luminance channel, OKAO performs significantly better in detecting the eyes and mouth on dark-skinned faces.  Because the luminance histogram of darker skin tones is initially quite compressed in the low intensity values, rarely eclipsing 200, equalization unsurprisingly yields significant improvement.
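For readers who want to see the equalization step spelled out, the sketch below equalizes a single 8-bit channel using its cumulative histogram.  It is a generic textbook version, not the project's MATLAB code; converting between RGB and a luminance/chrominance space before and after is omitted, and the sample patch values are made up.

```python
def equalize(channel, levels=256):
    """Histogram-equalize a 2-D list of integers in 0..levels-1."""
    flat = [v for row in channel for v in row]
    hist = [0] * levels
    for v in flat:
        hist[v] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in channel]

    # Map each level so the occupied range spreads over 0..levels-1.
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
           for c in cdf]
    return [[lut[v] for v in row] for row in channel]


# A dark, low-contrast patch (hypothetical luminance values).
dark_patch = [[20, 25, 30], [35, 40, 45]]
print(equalize(dark_patch))  # [[0, 51, 102], [153, 204, 255]]
```

Intensities that originally spanned only 25 gray levels are spread across the full range, which is the effect that helps separate close shades of brown.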

II. Eigenimage and Color Gradient Edge Boosting

We can still incorporate images with strong color edges into a full-color image by selectively boosting those edges in either each channel or the luminance channel.  Simple channel replacement significantly lowers the average brightness without more processing (see below), so we focus on enhancement rather than replacement:


The principal component image often contains the most sensible color edges, as shown below:


We boost the luminance channel (the most logical choice for high spatial frequency edge detail) by this principal component and renormalize the scale.  Instead of simple image addition, we weight the importance of the edges by a variable factor that we tune through repeated trial and error.  No matter our choice, the enhanced image will have values that exceed the original dynamic range, so our renormalization returns the intensity values to the [0, 255] range of our original color image.  For example, if we boost the luminance channel through straight addition (with a coefficient of unity), then our renormalization will essentially halve the impact of the original luminance channel and accentuate its edges by the edges in the principal component.
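The boosting and renormalization described here can be sketched as a weighted sum divided by the total weight.  That particular renormalization is an assumption chosen because it reproduces the halving behavior mentioned for a coefficient of unity; the original code may rescale differently.  Both inputs are assumed to be already scaled to [0, 255], and the names and sample values are hypothetical.

```python
def boost_luminance(luma, component, weight=1.0):
    """Blend an edge-bearing image into the luminance channel.

    Computes (luma + weight * component) / (1 + weight), which keeps
    the result inside [0, 255] when both inputs are in [0, 255].
    """
    return [[(l + weight * c) / (1.0 + weight)
             for l, c in zip(luma_row, comp_row)]
            for luma_row, comp_row in zip(luma, component)]


# Two neighboring pixels: modest luminance step, strong step in the
# principal component image.
luma = [[100, 200]]
component = [[0, 255]]

boosted = boost_luminance(luma, component, weight=1.0)
print(boosted)  # [[50.0, 227.5]]
```

With unit weight the original luminance contributes half of each output value, and the step between the two pixels grows from 100 to 177.5 gray levels.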


The resulting image preserves the visual quality of the original RGB image while slightly boosting the presence of points that would be highlighted as edges if OKAO performed edge detection on the principal component itself.  This allows us to influence the type of edges that will appear from OKAO’s pattern-matching algorithm despite our relative lack of knowledge concerning the black box’s grayscaling method.  We can also increase the probability of color edge detection if we boost by the edge image rather than the component image:


The color gradient, on the other hand, does not offer us this luxury, because the color gradient image itself is inherently an edge map:


Thus, unable to boost every pixel value, we boost only the edges.  Because over-boosting results in an image with less and less pointwise detail (in the regions not deemed color edges), we opt for lighter boosting in the luminance channel to preserve the most important pointwise qualities.



We see that boosting the luminance channel strengthens the lines between lip and cheek while also outlining the eyes.  However, because we did not inundate the luminance channel completely, other edges that the color gradient did not accentuate also remain, as we desire.

III. Specialized Grayscaling

We also tried several methods using grayscale input images to address the problem of localization.  We decided to compare three different methods: simple grayscaling, scaled grayscaling, and the first principal component image.  We chose simple grayscaling because it is likely what any feature detection software does with color images before applying the detection algorithm; scaled grayscaling because it is a simple, computationally efficient way to improve the results of simple grayscaling; and the first principal component (also scaled between 0 and 1) because, as we saw before, it can lead to much better results.

We tried these functions on a test image composed of a light red square over a darker red square. Colors were chosen such that the contrast is very low. The results we obtain for each method are displayed below.


As expected, the results given by the first principal component are the best, and we can also see that the simple grayscale performs very poorly in terms of highlighting the color differences.
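The difference between simple and scaled grayscaling can be illustrated on a miniature version of the two-red-squares test.  The sketch assumes that "scaled" means a min-max stretch of the grayscale values to [0, 1] and that simple grayscaling uses BT.601 luma weights; the specific red values are invented for the example.

```python
def simple_gray(image):
    """BT.601 luma of each (R, G, B) pixel, scaled to 0..1."""
    return [[(0.299 * r + 0.587 * g + 0.114 * b) / 255.0
             for r, g, b in row] for row in image]


def scaled_gray(image):
    """Simple grayscale followed by a min-max stretch to [0, 1]."""
    gray = simple_gray(image)
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image: nothing to stretch
        return gray
    return [[(v - lo) / (hi - lo) for v in row] for row in gray]


def contrast(gray):
    """Difference between the brightest and darkest pixel."""
    return max(max(row) for row in gray) - min(min(row) for row in gray)


# Hypothetical low-contrast test: light red next to slightly darker red.
light_red, dark_red = (150, 40, 40), (140, 40, 40)
image = [[dark_red, dark_red, light_red, light_red] for _ in range(2)]

print(round(contrast(simple_gray(image)), 4))
print(contrast(scaled_gray(image)))
```

The stretch turns a difference of roughly one percent of the range into the full range, but it can only amplify differences that survive grayscaling in the first place.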

We then used these three methods to generate videos and ran OKAO on them.  We noticed nearly no visual improvement for the scaled grayscale and for the principal component method compared to simple grayscale.  One possible reason is that the backgrounds of the sample videos carry a great deal of color variation, while the color variation within the face is comparatively low.  For instance, large areas of white or blue in one of the sample videos dominated the variance calculation, so the brown contrasts that carry the face edge information might not be enhanced by such a technique.

In order to improve the performance of our previous methods, we implemented an adaptive windowing algorithm that defines local processing windows in the image.  We designed our algorithm with two processing windows, one located on the eyes and one on the mouth.


The algorithm we designed to find the coordinates of each window is adaptive.  For each frame, the window coordinates are calculated from the previously estimated positions of the eyes and the mouth; if detection fails in a frame, the previous window is kept until a newly detected position updates the coordinates.  Below is a block diagram showing the adaptive windowing:


However, we could not modify OKAO so that it would adapt itself to the eye/mouth position.  To simulate the adaptive windowing, we generated a stream of eye and mouth positions from the original video, so that each frame receives windows generated from the corresponding previous frame in the original video.  This introduces some error, in particular when the original detection was absent or very poor; for most of the frames, however, our simulation worked very well.

Each window measures 2h × (h + e), where h is the distance between the two eyes (for the eye window) or between the two mouth corners (for the mouth window), and e is an offset corresponding to the vertical difference between the positions of the two eyes or the two mouth corners.
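The windowing logic can be sketched as follows.  Several details are assumptions: the window is taken to be 2h wide and h + e tall, centered on the midpoint between the two landmarks, and the first frame has no window because no earlier detection exists.  The function names and landmark coordinates are hypothetical.

```python
import math


def window_from_points(p1, p2):
    """Box of width 2h and height h + e centered between two landmarks.

    Returns (left, top, width, height) in pixel coordinates.
    """
    (x1, y1), (x2, y2) = p1, p2
    h = math.hypot(x2 - x1, y2 - y1)  # distance between the landmarks
    e = abs(y2 - y1)                  # vertical offset between them
    width, height = 2 * h, h + e
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (cx - width / 2, cy - height / 2, width, height)


def track_windows(detections):
    """Window to use for each frame, built from earlier detections.

    `detections` holds one landmark pair per frame, or None when the
    detector failed; after a failure the last known window is reused.
    """
    windows = []
    current = None
    for det in detections:
        windows.append(current)  # this frame uses the previous estimate
        if det is not None:
            current = window_from_points(*det)
    return windows


# Eye positions for three frames; detection fails on the second frame.
eyes = [((40, 50), (80, 50)), None, ((42, 53), (82, 50))]
print(track_windows(eyes))
```

Frames two and three both receive the window derived from frame one, because the failed detection in frame two leaves the stored window unchanged.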

Each of the windows now defines a portion of each frame, and, treating these windows as separate pictures, we applied to them each of the techniques we used previously.  We also added one method, which consists of multiplying the principal component image by a scaling factor and truncating the result so that every pixel value stays at or below 1, so that differences between dark pixels are emphasized.  As we considered mainly dark-skinned subjects, the face edge information was mainly contained in these low-luminance pixels.
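The scale-and-truncate step is read here as multiplying a principal component image in [0, 1] by a factor greater than one and clipping at 1; that reading, the factor of 3, and the sample values are assumptions.

```python
def scale_and_truncate(component, factor=3.0):
    """Multiply values in [0, 1] by `factor` and clip the result at 1."""
    return [[min(1.0, factor * v) for v in row] for row in component]


# Two dark pixels followed by two bright pixels (hypothetical values).
patch = [[0.10, 0.15, 0.60, 0.90]]
stretched = scale_and_truncate(patch)
print(stretched)
```

The gap between the two dark pixels triples, while the bright pixels saturate at 1 and lose their difference, which matches the goal of spending the available range on low-luminance detail.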

With adaptive windowing, the scaled grayscale results showed little improvement.  However, the principal component method led to a noticeable improvement in the quality of the detection of the eyes and mouth.  The best results were obtained with the scaled version of the principal component, which detected the mouth much more accurately than in the original video.

Evaluation and Results

In order to evaluate the performance of our algorithms, we used videos of three dark-skinned African American subjects and ran them through OKAO Vision, recording the confidence level for each frame.  Then, for each algorithm we selected, we processed each frame of each of the three videos and ran the resulting videos through OKAO, recording the new confidence levels for each frame.  Due to limited processing power, each video was kept to 6 seconds (300 frames).

We found, however, that the percent improvement in confidence averaged over all frames was not a good indicator of performance, as it was greatly influenced by outliers.  Given that our videos were kept to 6 seconds in length, only one or two frames with a spurious zero for confidence threw off the entire reading.  The percentage of frames with improved confidence, however, proved a robust evaluator, often matching our visual evaluation of the movie.  A summary of the percentage of frames with higher confidence for each method is shown below:



Our third evaluation technique was to visually examine each movie while OKAO was performing face tracking in order to note any improvements.  This step was necessary because at times OKAO gives high confidence ratings while tracking the wrong feature, for example when tracking the nose instead of the mouth.  We found, however, that our visual evaluations matched the percentage of frames with higher confidence ratings very well.
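The two numeric measures discussed in this section can be compared with a short sketch.  The confidence values below are invented solely to show the outlier effect; they are not OKAO output, and the function names are hypothetical.

```python
def mean_percent_improvement(before, after):
    """Average per-frame percent change in confidence.

    Frames whose baseline confidence is zero are skipped to avoid
    dividing by zero.
    """
    changes = [100.0 * (a - b) / b for b, a in zip(before, after) if b != 0]
    return sum(changes) / len(changes) if changes else 0.0


def percent_frames_improved(before, after):
    """Share of frames whose confidence went up after processing."""
    improved = sum(1 for b, a in zip(before, after) if a > b)
    return 100.0 * improved / len(before)


# Made-up confidences: three frames improve by 10 percent, and one
# processed frame reports a spurious zero.
original = [500, 500, 500, 500]
processed = [550, 550, 550, 0]

print(mean_percent_improvement(original, processed))  # -17.5
print(percent_frames_improved(original, processed))   # 75.0
```

A single zero-confidence frame drags the averaged improvement negative even though three of the four frames improved, whereas the frame-count measure still reflects the overall trend.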

Conclusions and Future Work

Overall, we found that the simple grayscaling techniques were the most effective across all subjects.  We noted some improvement with the principal component, most specifically with subject 50.  We also saw significant results when the luminance of the first maximal variance projection was scaled and truncated in subject 40.  Overall, however, it seemed that no one method worked exceptionally well across all subjects.

In future work we would like to extend the current project to include much more comprehensive test data, including more face-tracking systems, real time acquisition and processing of movies, more subjects, and longer video sequences.  We would also like to try our algorithms on subjects of other races, not just African Americans.  The current study was limited in scope due to lack of appropriate test data and computing power. 

We are also interested in applying more techniques based upon what we have learned in this project.  For example, since the system of selectively scaling certain areas of the image seemed promising, we would like to try selective color boosting for edges only in a certain range, centered around skin tones.  We would also like to apply pre-grayscaling techniques to other feature detection and extraction algorithms, and see how they improve. For example, it could be interesting to see how an algorithm like SURF that detects contrast differences between adjacent pixels would react to color contrast enhancing before grayscaling. 

The extension we are most interested in, however, is creating a system that could dynamically choose which algorithm to apply based upon the first few frames of the video.  Various techniques could be tried for choosing the algorithm; for example, an RGB histogram could be calculated from the first few frames of the video and the algorithm chosen based upon the color spread, or the average luminance could be calculated and the algorithm chosen based upon that.  In fact, many statistics could be computed about the frames, and we could use machine learning to determine the best algorithm given all the computed statistics.

The above system could even be extended to include the video capture device itself.  We could interface back to the device and change settings, such as exposure rate in order to obtain the optimal image for face tracking.  In that way we could truly create an Applied Vision and Image System, the fundamental concept of this course!


References

[1] Tsukiji, Shu.  Discussion about OKAO Face Tracking System.  Stanford Virtual Human Interaction Lab, March 3, 2008.

[2] Gonzalez, Rafael C., Richard E. Woods, and Steven L. Eddins.  Digital Image Processing Using MATLAB.  Upper Saddle  River, NJ: Pearson Prentice Hall, 2004.

[3] Harris, C.G., M. Stephens.  “A Combined Corner and Edge Detector.”  In 4th Alvey Vision Conference, pp. 147-151, 1988.

[4] Girod lecture, Feature Detection, slide no. 28

[5] Boehnen, Chris, Trina Russ.  “A Fast Multi-Modal Approach to Facial Feature Detection.”  Proceedings of the Seventh IEEE Workshop on Applications of Computer Vision, 2005.

[6] Feris, Rogerio F., Teofilo Emidio de Campos, Roberto Marcondes Cesar Junior.  “Detection and Tracking of Facial Features in Video Sequences.”  Lecture Notes in Artificial Intelligence, vol. 1793, pp. 197-206, April 2000.

[7] Bay, Herbert, Tinne Tuytelaars, Luc J. Van Gool.  “SURF: Speeded Up Robust Features.”  ECCV (1) 2006, pp. 404-417.

[8] Lowe, David G. "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110.

[9] Huang, Szu-Hao, Shang-Hong Lai.  “Detecting Faces from Color Video by Using Paired Wavelet Features.”  National Tsing Hua University.  Proceedings of the IEEE Computer Society on Computer Vision and Pattern Recognition, 2004.

[10] Gonzalez, Rafael C., Richard E. Woods.  Digital Image Processing.  Upper Saddle River, NJ: Pearson Prentice Hall, 2004.

[11] Campadelli, Paola, Raffaella Lanzarotti, Giuseppe Lipori.  “Face Detection in Color Images of Generic Scenes.”  IEEE Conference on Computational Intelligence for Homeland Security and Personal Safety.  Venice, Italy, 21-22 July 2004.

[12] Feris, Rogerio F., Teofilo Emidio de Campos, Roberto Marcondes Cesar Junior.  “Detection and Tracking of Facial Features in Video Sequences.”  Lecture Notes in Artificial Intelligence, vol. 1793, pp. 197-206, April 2000.

[13] Campadelli, Paola, Raffaella Lanzarotti, Giuseppe Lipori.  “Face Detection in Color Images of Generic Scenes.”  IEEE Conference on Computational Intelligence for Homeland Security and Personal Safety.  Venice, Italy, 21-22 July 2004.

[14] Dikbas, Salih, Tarik Arici, and Yucel Altunbasak.  “Chrominance Edge Preserving Grayscale Transformation with Approximate First Principal Component for Color Edge Detection.”  Georgia Institute of Technology.  ICIP 2007.

[15] L. Sirovich and M. Kirby, "Low-dimensional procedure for the characterization of human faces," Journal of the Optical Society of America A, 4(3), pp. 519-524, 1987.


Acknowledgements

We would like to thank Professor Wandell, Professor Farrell, Dr. Manu Parmar, Dr. Peter Catrysse, and Christopher Anderson for all their help throughout the course and in the completion of our project.  We would also like to thank Omron Corporation for the use of OKAO Vision, and Shuichiro Tsukiji in particular for the interface programs he provided.  We would also like to thank the Stanford Virtual Human Interaction Lab for the use of their videos.

Source Code

SourceCode.zip - all MATLAB files

SupplementaryData.zip - supplementary data including OKAO output for each video

Original18.avi - Subject 18 video

Original42.avi - Subject 42 video

Original50.avi - Subject 50 video

Division of Work

Overall we feel we all contributed equally in this project.  A general breakdown of the work is as follows, although we all helped out when needed.

Chris Tsai - responsible for the histogram equalization methods, eigenimage methods, and color gradient methods, much of the prior work research, writing, and finalizing slides

Damien Cerbelaud - responsible for the specialized greyscaling techniques, helping to decompress videos using Linux, and writing

Maria Jabon - responsible for working with OKAO, finding data, testing all results, processing and recreating movies, some prior work research, writing, and making the webpage