Motion compensated intermediate viewpoint image interpolation

Mai Vu and Michal Smulski

 

 

Problem Statement:

An array of cameras acquires a set of frames. The acquired scene could be static or dynamic. Each camera produces an independent sequence of frames. In order to combine all of the frames into a single sequence, as if acquired by a “virtual” camera, intermediate viewpoint frames must be generated. This is required since cameras are spaced relatively widely so that there is a jump between view from camera i and camera i+1.

A simple solution is to assume a fixed depth for all objects in the scene for all the cameras. Intermediate viewpoints then can be generated by texture mapping the image onto a flat surface and reprojecting at different viewpoint. This solution has been attempted and results are good for objects far away from the camera array or for the objects that are at the same distance from the cameras. However, in a scene with strong depth variations, aliasing occurs. That is, objects closer to camera translate at different rate then the ones farther away.  The image below has been obtained using the method described above. Note the ghosting around the folder’s corner.

 

Figure 1 Interpolated Image assuming fixed depth

 

In order to improve the quality of the interpolated images, uniform depth assumption must be removed. Knowing the actual depth, or equivalently disparity among images taken from different cameras at the same time t, would allow for better image interpolation. This task was the main objective of this project. Before proceeding to motion estimation algorithms, description of the camera array is given to help better appreciate the problem of motion estimation in this case.

 

Image Acquisition Hardware:

As part of Immersive Television Project, an array of camera was designed. The array consists of “bricks”. Each brick is a strip of four cameras that are set equidistantly from each other.

Figure 2 Single "Brick" from Camera Array

 

The sensors are CMOS imagers from Photobit (PB159DX). The imager is an APS system with 512 by 384 array of 7.9um square pixels. It outputs raw Bayer mosaic image. The images are demosaiced and unwarpped using custom software. Lenses used in the project are fixed-focus wide-angle miniature-lenses with 8.0 mm nominal focal length. The imagers are spaced as closely as possible, the signal traces on the PCB board being the limiting factor. The sensors are placed along a single line and spaced nominally at 40 mm.

 

Block Matching Algorithms:

Two motion estimation methods were investigated in this project. These are: Phase Correlation Method and Multiple-baseline Stereo Algorithm. Both methods belong to the block matching type of motion estimation algorithms. Phase Correlation is chosen since it estimates motion with little sensitivity to intensity offsets between images. The offset is caused by the intensity variation across the camera array. The other method is chosen because it was successfully implemented for a real-time depth extractor using multiple cameras (see [6]).

 

Phase Correlation Method

 

Phase Correlation is a motion estimation method based on assumption that the motion from one image to the other is pure translation and the intensity in an image is a scaled version of intensity in the other image by a constant factor. The algorithm is based on the time shifting property of the Fourier transform. According to this property, a shift in image plane corresponds to an exponential factor in Fourier domain.  Mathematically, let image f_1 and image f_2 be related by pure translation motion as:

Equation 1

Then in Fourier domain, the normalized cross-power spectrum at each spatial frequency component (u,v) retains only the information on the phase difference:

Equation 2

The phase correlation surface is obtained by applying inverse DFT to the normalized cross-power spectrum. In order to extract motion vector from the phase difference, inverse DTFT is computed to obtain:

Equation 3

 

The phase correlation surface is zero everywhere except at the location (-d_x,-d_y), which corresponds to the translation motion between the two images.

Phase Correlation method has a number of implementation issues. In real implementation, Fast Fourier Transform instead of DTFT is used. Since the FFT assumes periodic signal which is not true in real images (i.e. objects that moves out of boundary of an image does not reappear on the other side of the image). Because of this, a windowing function is applied before carrying out the forward Fourier transform to reduce boundary discontinuity effects [2] and suppress spurious peaks in case of non-integer displacement [1]. The windowing function also help to reduce the noise level in the phase correlation surface due to areas mismatch close to the edges of the image blocks. We choose to implement the following separable windowing function [2]

 

Equation 4

where s is the distance from the pixel to the closest block edge and N_s is the block size.

  Another issue is that multiple objects in the scene can move with different motion vectors. While phase correlation picks up all of these motions, inverse DFT of F is not a delta function but is the sum of a number of deltas. One of the challenges in implementing the algorithm is to decide which peak (delta) is the correct one for each object.

We apply Phase correlation in two steps. First we correlate the entire images to find the global shift vector. Consider the two images taken by the camera 0 and camera 1:

 

Figure 3 Camera 1 (left) and Camera 0 (right) Images

 

The inverse of their global phase difference is plotted below.

 

Figure 4 Inverse DFT of Phase Difference

                                                                                                                                                      

The largest peak corresponds to the shift of the largest set of pixels that is a projection of objects with the same depth. We call this the global motion vector. In this case, those are background pixels that show calibration target and the wall behind it.

 

               To verify correctness of the algorithm, a new image is generated using global motion vector using the following linear interpolation scheme that roughly places the “virtual camera” in the middle of camera 0 and camera 1:

Equation 5

 

 

Figure 5 Interpolated Image with Global Motion

 

               As expected the background from the two images has been superimposed correctly but objects closers to the cameras (note the corner of the folder) exhibit ghosting. In order to remove the ghosting, second step of phase correlation is carried out. Using the global motion vector, we obtain two intermediate images by shifting left and right image by half of the global motion vector toward each other. Phase correlation is then applied to blocks of these two intermediate images, followed by block matching on smaller sub-blocks to assign to each sub-block a motion vector among the candidates identified by phase correlation. We tested different block sizes and found that for these two images, the best block size for Phase Correlation was 32x32 pixels and the block matching was performed on 8x8 sub-blocks.

The second step of Phase Correlation produces local motion vectors, which are plotted below:

Figure 6 Local Motion Field

Again, applying equation 5 to each sub-block with its local motion vector, we interpolate the middle image. The result is shown below

 

Figure 7 Interpolated Image with Local Motion

 

 

Visibly, there are a number of mismatches in this second step of phase correlation. There are several reasons for these mismatches. Firstly, as aforementioned, pixels at different depth have different disparity, thus not every pixels in each sub-block used in block matching have the same motion vector. Secondly, there are occluded regions, i.e. regions that are covered by some objects from a particular angle in an image but are visible in the other image. Pixels in these regions have no real match, but block matching will assign a motion vector to it. Next, block matching is done using the minimum sum of square difference criteria based on image intensity. Due to intensity variation from camera to camera, this can generate false match even on candidate motion vectors generated by phase correlation. The mismatch effect is most obvious for the blocks at edges of the toy figure in the image, which is mismatched with other blocks in the background. Lastly, some errors can be caused by noise in phase correlation due to the boundary discontinuity windowing approximation. To improve this, phase correlation was performed on overlapping blocks, but the result did not show significant improvement, which implied that this is not the major reason for the mismatch.

 

Concluding, two-step phase correlation followed by block matching did not give satisfactory interpolated images. This led to the investigation of another motion estimation algorithm.

 

 

Multiple-baseline Stereo Method

A block matching technique uses correlation methods to find motion vectors. Input to the algorithm are two frames, search window size W, minimum d_min and maximum d_max search vectors. For each pixel f_0(i,j) in frame 0, a window of size W is defined around that pixel. Then, for each search vector d that is contained between d_min and d_max a window is defined in frame f_1(i+d(i),j+d(j)). For each vector d, a correlation function (sum of squared distances or SSD) is computed. The SSD mathematically describes how the match window in frame 0 differs from the match window in frame 1. The optimal motion vector d is found by minimizing SSD.

 

Equation 6

              

The following diagram graphically describes computation of SSD for pixel at location (i,j):

 

              

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Multiple-baseline Stereo [5] is the second method considered in this project. It is also a block matching technique since it exploits multiple images to construct correlation function and uses a priori information about cameras’ location and orientation to define search window and correlate motion vectors among different camera pairs. Kanade [5] described the method using simple camera geometry described below.

If a sequence of image frames has been taken from a number of cameras aligned on a straight line and pointing along the same direction, then motion vector d between each two cameras is related by:

Equation 7

where B (baseline) is distance between optical centers of the cameras, F is their focal length and Z is distance between the cameras and the projected point in space. In other words, given depth Z of a point P, one can compute apparent motion vector that results from taking a snapshot of point P with two cameras separated by distance B.

               Since, motion vector d is related to depth Z, one can transform the common variable from 1/Z to d, which is the disparity between two farthest cameras. Then disparity between camera 0 and camera k will be:

Equation 8

 Where R(k) is the ratio of baseline B (distance between farthest cameras)  and B(k) (distance between camera 0 and camera k) .

               The new correlation function is simply the sum of SSD of each two frames. In mathematical terms

 

Equation 9

Kanade [5] showed that using SSSD false matches could be avoided. Hence patterns could be correctly identified. Since search window for the camera array is large due to large baseline, false mismatches are a real concern. The method seemed to be a perfect choice for this project. With the assumption of the simple cameras’ geometry, the method is easily implemented by modifying standard block matching algorithm. However, the actual hardware had fixed relative camera position and orientation. Calibrating the cameras to obtain their extrinsic parameters had determined that the array has no such simple geometry. The results implied that the equation 7 could not be used.

               In order to apply Multiple-baseline Stereo Method to the images acquired through the camera array, a more general model of the array has to be developed. Using projective camera model new equation that relates image coordinates could be written as follows:

Equation 10

Z is the depth of point P from camera 0 and 1. (i, j, z) and (i’, j’, z) are the coordinates of the point P projected onto camera 0 and camera 1 respectively.  T is 3 by 1 translation vector and R is 3 by 3 rotation matrix.

R and T are sufficient to describe relationship between two cameras’ coordinate systems and therefore, with depth knowledge, relationship among motion vectors. However, equation 10 is still a simplification. This model assumes that R is close to identity matrix and the focal lengths of the cameras are identical. The assumption was made based on calibration data. It significantly simplifies matching motion vectors among camera pairs.

The algorithm was hardcoded for 3 cameras located on the same “brick”. It was implemented in two steps. Firstly, minimum and maximum disparity vectors were computed for cameras 0 and 1 and cameras 0 and 2. Then correspondence was computed between disparity vectors of disparity matrix of camera 0 and 1 to disparity of camera 0 and 2. Motion vectors were estimated at pixel accuracy for camera 0 to 2. Since the cameras were spaced evenly, motion vectors between camera 0 and 1 were at sub pixel accuracy. Second step was to actually implement and run block matching algorithm. Note that since disparity estimated were fractions, matching had to be done at sub pixel accuracy. Bilinear interpolation was used to obtain intensity values for non-integer locations. For speed efficiency matching was done in C while the first part was implemented in Matlab.

               The following three images were used to test the algorithm:

Figure 8 Images from Camera 0(right), 1(middle) and 2(left)

               Only pixels common to all three images were considered when running the match algorithm. To test the correctness of the system, image 2 was generated from image 0 using motion vectors:

Figure 9 image 2 generated from image 0

 

               The next plot shows motion vectors between image 2 and image 0.

MATLAB Handle Graphics

Figure 10 Motion vectors between camera 2 and 0

               The large peaks in the back represent movement of the folder while the smaller peaks represent background and the Tick motion. The motion data corresponds to expected motion vectors. As it can be seen from the back-projected image the motion vectors were “correct” in a sense that image 0 could be recovered from image 1. However, the motion vectors are “correct” to generate intermediate image with similar image quality.  This is because the motion field does not created a one to one correspondence between image intensities. Back-projecting image 0 to image 2 using the same motion field does not produce image 2.  This, we believe, is the first step in producing intermediate frames.

 

Conclusion

Two motion estimation algorithms have been investigated: two step phase correlation followed by block matching method and multi-baseline stereo matching algorithm. Intermediate viewpoint image is interpolated from the motion vector using linear interpolation. The second method using multiple cameras outperforms the first method in terms of aesthetic vision of the interpolated image. However, there is still a significant research required in order to produce an intermediate image from the motion field. One option that is considered is to actually build a partial 3-D model of the scene then project that model at the desired angle.

 

 

Project Milestones:

1. Random Dot Stereograms (Mai Vu),

2. Camera array calibration and image acquisition (Michal Smulski),

3. Phase Correlation Algorithm (Mai Vu),

4. Multiple-baseline Stereo Algorithm (Michal Smulski),

 

References:       

[1] B. Girod, “Motion-Compensating Prediction with Fractional-Pel Accuracy”, IEEE Tran. On Communicaiton, Vol 41, No 4, Apr 1993.

[2] D.V. Papadimitriou, T.J. Dennis, “Stereo disparity analysis using phase correlation”, Electronic Letters, Vol. 30, No. 18, Sep 1994.

[3] M.H. Ahmad Fadzil, T.J. Dennis, “A Hierarchical Motion Estimation for Interframe Coding”, IEE Colloquium on “Application of Motion Compensation”, Oct 1990.

[4] C. Cafforio, F. Rocca, S. Tubaro, “Motion Compensated Image Interpolation”, IEEE Tran. On Communication, Vol. 38, No. 2, Feb 1990.

[5] T. Kanade, M. Okutomi, T. Nakahara, “A Multiple-baseline Stereo Method”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.15, no.4, 1993.

[6] T. Kanade, “A Stereo Machine for Video-Rate Dense Depth Mapping and Its New Applications”, ARPA Image Understanding Workshop, Feb, 1996

[7] E. Trucco, A. Verri, “Introductory Techniques for 3-D Computer Vision”, Prentice Hall, 1998.

[8] A.M. Tekalp, “Digital Video Processing”, Prentice Hall, 1995