|
Environment matting, introduced by Zongker et al. [11] is a process which captures not only foreground objects and their traditional opacity, but also a description of how the object interacts with the light presented to it by the environment. The technically precise way of capturing such information would be to photograph the object separately as it would be lighted by each possible incoming ray. Since such a task would take large amount of time and memory, even while being highly accurate, is infeasible. Various tradeoffs of accuracy against speed of motion capture have been presented by Zongker et al [11], and then by Chuang et al [5]. In this project we explore another point on this tradoff curve.
The approach taken by environment matte capture techniques, generally uses artificially generated background images of various colors to estimate the some environment matte structure. Thus, for example, in Chuang et al's real-time capture algorithm, a wash of varying color is used as the background, and from the acquired image, the color of a pixel indicates which pixel of the background it refracts. To use this technique, the foreground object itself has to be colorless. In this project, instead of depending on the background color for the matting information, we depend on the spatial characteristics of the background pattern. Thus, for a particular spatial pattern observed in the image, we estimate where in the background it came from (using a separately captured image of the background only). This, we do using elastic image registration. Also, we estimate the contribution of the foreground object and of the background, to each pixel of the target image (the image from which we are extracting the environment matte). The information provided by these two stages is a warping function of the background image into the target image, a transparency map, and a foreground color map. This, then, is our environment matte.
From the above description, the particular tradeoffs achieved by this algorithm can be discussed. Since a single image of the object is used, this technique allows for real time capture. Also, since color is not being used explicitly to extract the environment matte, we can allow colored and partially transparent objects. On the disadvantages side, since we are using image registration, this technique is only useful on objects which are not "too harsh" with the background, i.e. which warp the background continuously, smoothly and invertibly. This greatly limits the applicatbility of this project. Still, we can think of some useful instances, like flowing water, air convection currents, atmospheric seeing in astronomical images, cars, etc.
In the remainder of the project report, we will discuss the two main techniques to be used for the above task, viz. elastic registration and estimating the foreground objects.
For environment matting purposes, we wish to find out, for each pixel in the target image F the pixel that contributes to it from the background image G. Such a warp of F onto G can be calculated using elastic registration techniques. Since we are registering on a pixel-by-pixel basis (local, as against global, registration), we need a local similarity measure, to tell us how close the warped version of F is to G. For our specific problem, we know that, due to the arbitrary transparency of the foreground objects allowed, the same feature in F and G need not have the same amplitude. Thus, similarity measures which depend upon the amplitude, like the L1-norm or the L2-norm [6], cannot be used. Instead, for measuring similarity, we use the cross-correlation between F and G in a neighborhood around the point.
where summation is taken waighted over a Gaussian neighborhood of the point x under consideration. EF and EG are the expected values of F and G over that neighborhood. We wish to find an estimate of h that would maximize the above cross correlation. We can do that by taking a partial derivative with respect to h and equating it to zero. If we try to estimate G (x + h) using the first Taylor approximation, namely
we see that the derivative will be independant of h and hence not very useful. We can use the second order Taylor approximation [2], but the results we got using such a method were unsatisfactory, possibly because of the difficulty of estimating the Hessian of G. Instead what we do is, add a term which increases with distance of the warp to the objective function, giving
The derivative of this function with respect to h, calculated using the first order Taylor approximation, equated to zero, gives the estimate
In the above estimation formula, we can choose B such that h will never go beyond some certain value k, by making sure that at a distance of k, the negative term in the definition of S dominated over the maximum possible correlation between F and G. This maximum possible correlation, though not known, can be estimated as the autocorrelation of F multiplied by the alpha at that point. (If there are no intervening objects, the alpha is always one).
The above gives us an estimate of the warping displacement h at each point in the image. We should note here, that ultimately, the cross-correlation between F and G is not going to be very accurate, because even in the neighborhood under consideration, the images could have local distortions with respect to one another. To counter this effect, we proceed iteratively, and at each iteration, after we have found as estimate of the warp for each pixel, we actually warp G by that estimate, so that the next iteration only has to consider the "remnant" warp.
To ensure that the algorithm accounts for coarses distortions before settling onto fine variations, we use a multiresolution approach. Thus, for each iteration of the above algorithm, we pre-convolve the images with a Gaussian kernel. The width of this kernel is large at first, and is exponentially decreased in further iterations. The width of the neighborhood window being used is also decreased by the same factor every iteration. Also, for proper registration, we have to allow for local cooperation in the estimated warp [2]. This is by convolving the warp with a Gaussian kernel after every iteration [4], the width of which also decreases exponentially. Together with the distance-based term in the objective function S, we can get a smooth warping function.
Each iteration of the above algorithm consists of performing some tasks at each pixel. The whole iteration (other than the warp), is linear and space invariant, and hence, can be implemented with convolutions and linear operations. This makes the implementation very fast and easy to achieve. This is how I implemented the algorithm.
For extra speed, the lower-resolution (larger Gaussian) steps of the algorithm can be done using a coarser sampling grid [2]. I have not implemented this.
Following images show an original image ('lena'), and a distorted version of the image. The registration algorithm then extracts the warp, and this warp can hence be applied to any new background image. We have applied it to the original lena image itself, to show the similarity between the given warp and the recovered warp. For a 512x512 image, this calculation takes about 20 minutes. All the gaussian sizes are reduced by 0.75 every time. The recovered warp looks highly satisfactory.
| Original 'lena image. (G) |
| Distorted version of 'lena' image, F. This effect
is supposed to have been caused by the phenomenon being
captured for environment matting. In this case, though, the
warp is artificial. |
| After the warp function has been estimated, it can be
applied to any background image we want. Here, we have applied
it to the original background itself. Observe the similarity
between this image generated using the estimated warp, and the
image F above, from which the warp has been
measured. The glitches near the borders are caused because
since the algorithm has been implemented by convolution, the
clipping of the kernel on the sides causes some failure.
|
The second problem to be considered is the separation of the presented target scene, into foreground and background and associated transparency map. Suppose, at a pixel, the color of the original background is cb, and that of the captured scene is c. This composite color c comes from more is the effect of the background color times a transparency value a, added into the foreground color cf. Namely,
From the images captured , we know the background color cb and the composed color c. Unfortunately, these are not enough to solve the above equation unambiguously. The above equation represents n constraints, where n is the number of color components in the image. But, there are n + 1 unknown quantities, the foreground color cf, and the transparency value a.
To get over this problem, we have to somehow estimate the transparency value at a pixel. To do so, we assume coherence between nearby transparency values, i.e. spatially nearby transparency values are nearby in value. Further, we assume that the foreground objects are uncorrelated with the background objects. Under these assumptions, we can estimate fraction of G the background present in F as the cross-correlation of G and F normalized by the auto-correlation of G. To get the transparency at a point, then, we perform the cross-correlation and auto-correlation over a Gaussian neighborhood around it. Now, if this neighborhood is too large, the resulting transparency values will be a smudge of nearby transparency values. If this neighborhood is small, the measurement will become increasingly noisy, as the amount of data being used reduces, and the assumption of uncorrelatedness of the foreground objects with the background becomes increasingly untenable.
We look at the value of cross-correlation starting from a large neighborhood downwards, and try to estimate the value at a nieghborhood of size one pixel from the trend that we get. Various techniques to evaluate such a series can be used. After trying techniques from very simple (just take a single value) to fairly complicated (histogram based), it is seen that an IIR filtering approach gives good results. Everytime a new estimate of a is generated using a particular size neighborhood, we generate a new value of aest by taking a convex combination of the old value and the newly generated value. The convex combination is weighted highly towards the old value.
This algorithm, just like the one earlier, can be almost entirely implemented using convolutions and linear operations. Thus, an efficient implementation can be coded.
The following series of pictures shows results of the above algorithm, along with some intermediate steps. The algorithm produced the output in about 10 minutes on the 512x512 images. The results look satisfactory.
| Original 'lena' image. |
| The image with few objects composited. There is a
soft-edged yellow ball, and a green stroke. The green stroke
is semi-transparent in some parts. |
| The estimate of transparency obtained by using a Gaussian
neighborhood of size 35. |
| The estimate of transparency obtained by using a Gaussian
neighborhood of size 4.67. |
| The estimate of transparency obtained by using a Gaussian
neighborhood of size 0.62. |
| The series of estimated transparency values using
successively smaller neighborhoods, for a point completely in
a foreground object. |
| ...for a point completely outside, but near a foreground
object. |
| ...for a semi-transparent point. The actual transparency
value was 0.26. |
| The estimated transparency matte. |
| The estimated foreground. |
Now, how can we put the above two algorithms together, to extract the environment matte from the aforementioned type of scenes? This is problematic because, unless the registration is done, the correlation required for estimating the transparency cannot be computed. On the other hand, unless the background and foreground are separated, we cannot register the background part with the original background. The solution, I believe, is to run both the algorithms together.
Since both algorithms are running multiresolution, from coarse towards fine, we can perform alternate iterations of both the algorithms. Notice that a particular matching iteration matches only upto a particular resolution, hence for the transparency estimation, we should pre-smudge the images to that resolution (this is already implemented). As estimates of transparency become available, the registration algorithm can, with increasingly greater accuracy, register only the background part, and also normalize the energies of both images to same values, so that the correlation maxima are more accurate. Also, parts where opaque foreground objects are present, should be registered only using the "elasticity" part of the algorithm, and not the "correlation maximizing part".