Motion estimation

Motion estimation is specific to the encoder. It is usually the most complicated part of the system and can absorb huge computing resources, so short-cuts have to be found. Dirac adopts a 3-stage approach. In the first stage, motion vectors are found to pixel accuracy for every block and each reference frame, using hierarchical motion estimation. In the second stage, these vectors are refined to sub-pixel accuracy. In the third stage, mode decision chooses which predictor to use and how to aggregate motion vectors by grouping together blocks with similar motion.

Motion estimation is most accurate when all three picture components (luma and both chroma) are involved, but this is more expensive computationally as well as more complicated. Dirac therefore uses only the luma (Y) component.

Hierarchical motion estimation

Hierarchical ME speeds things up by repeatedly downconverting both the current and the reference frame by a factor of two in both dimensions, and doing motion estimation on the smaller pictures. At each stage of the hierarchy, vectors from lower levels (smaller versions of the picture) are used as a guide for searching at higher levels. This dramatically reduces the size of searches for large motions. Dirac has four levels of downconversion. The block size remains constant (and the blocks still overlap at all resolutions), so at each level there are only a quarter as many blocks as at the next higher resolution. Each block therefore corresponds to 4 blocks at the next higher resolution layer, and provides a guide motion vector to each of them. At each resolution, block matching proceeds by searching in a small range around the guide vector for the best match using the RDO metric (which is described below).
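The pyramid structure described above can be sketched in a few lines of Python. This is only an illustration: the 2x2 averaging filter, the function names and the list-of-lists picture format are assumptions made for the example, not Dirac's actual downconversion filter or data structures.

```python
def downconvert(pic):
    """Halve a picture in both dimensions.

    Assumption: a plain 2x2 average with rounding; the real codec's
    downconversion filter is not specified in this text.
    """
    h, w = len(pic) // 2, len(pic[0]) // 2
    return [[(pic[2 * y][2 * x] + pic[2 * y][2 * x + 1] +
              pic[2 * y + 1][2 * x] + pic[2 * y + 1][2 * x + 1] + 2) // 4
             for x in range(w)]
            for y in range(h)]


def build_pyramid(pic, levels=4):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` downconversions."""
    pyramid = [pic]
    for _ in range(levels):
        pyramid.append(downconvert(pyramid[-1]))
    return pyramid


def guides_for_next_level(coarse_vectors):
    """Pass each block's vector to the 4 blocks it covers one level up.

    The vector is doubled because the next level has twice the
    horizontal and vertical resolution.
    """
    rows, cols = len(coarse_vectors), len(coarse_vectors[0])
    fine = [[None] * (2 * cols) for _ in range(2 * rows)]
    for by in range(rows):
        for bx in range(cols):
            vx, vy = coarse_vectors[by][bx]
            for dy in (0, 1):
                for dx in (0, 1):
                    fine[2 * by + dy][2 * bx + dx] = (2 * vx, 2 * vy)
    return fine
```

Because the block size is the same at every level, a grid of R x C blocks at one level becomes a grid of 2R x 2C blocks at the next, which is why each coarse vector is copied into four positions.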

Search strategies in hierarchical ME

The hierarchical approach dramatically reduces the computational effort involved in motion estimation for an equivalent search range. However, it risks missing small motions, and it might not make good decisions when a variety of motions occur close to each other.

To mitigate this, the codec also always uses the zero vector (0,0) as another guide vector - this allows it to track slow- as well as fast-moving objects. Finally, the motion vectors already found in neighbouring blocks can also be used as guide vectors, if they have not already been tried.

Since each layer has twice the horizontal and vertical resolution of the one below it, it would appear to make sense to search only in an area +/-1 pixel around the guide vectors. In fact, the search ranges are always larger than this, because such a narrow search could cause the motion estimator to get trapped in a local minimum.
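A minimal sketch of this search strategy follows, assuming a sum of absolute differences (SAD) as a stand-in for the RDO metric and a hypothetical default search range of +/-2 pixels; neither is taken from the Dirac encoder itself. The zero vector is always added to the list of guides, and candidates that have already been tried are skipped.

```python
def sad(cur, ref, x0, y0, vx, vy, size):
    """Sum of absolute differences (stand-in for the RDO metric)."""
    return sum(abs(cur[y0 + y][x0 + x] - ref[y0 + y + vy][x0 + x + vx])
               for y in range(size) for x in range(size))


def in_bounds(ref, x0, y0, vx, vy, size):
    """True if the displaced block lies wholly inside the reference."""
    return (0 <= y0 + vy and y0 + vy + size <= len(ref) and
            0 <= x0 + vx and x0 + vx + size <= len(ref[0]))


def guided_search(cur, ref, x0, y0, size, guides, search_range=2):
    """Search a small window around each guide vector.

    `guides` holds the vector inherited from the lower level and,
    optionally, vectors already found for neighbouring blocks.
    Returns (best_vector, best_cost).
    """
    candidates = [(0, 0)]                 # the zero vector is always a guide
    for g in guides:
        if g not in candidates:
            candidates.append(g)
    tried = set()
    best, best_cost = None, None
    for gx, gy in candidates:
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                v = (gx + dx, gy + dy)
                if v in tried or not in_bounds(ref, x0, y0, v[0], v[1], size):
                    continue
                tried.add(v)
                cost = sad(cur, ref, x0, y0, v[0], v[1], size)
                if best_cost is None or cost < best_cost:
                    best, best_cost = v, cost
    return best, best_cost
```

With only the zero guide, a motion of (5,3) lies outside a +/-2 window and cannot be found; supplying a nearby guide such as (4,2) brings it within reach without enlarging the window.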

Sub-pixel refinement and upconversion

Sub-pixel refinement also operates hierarchically. Once pixel-accurate motion vectors have been determined, each block will have an associated motion vector (V0,W0), where V0 and W0 are multiples of 8 because vectors are expressed in units of 1/8 pixel. 1/2-pel accurate vectors are found by choosing the best match out of (V0,W0) and its 8 neighbours: (V0+4,W0+4), (V0,W0+4), (V0-4,W0+4), (V0+4,W0), (V0-4,W0), (V0+4,W0-4), (V0,W0-4), (V0-4,W0-4). This in turn produces a new best vector (V1,W1), which provides a guide for 1/4-pel refinement, and so on. The process is illustrated in the figure below.
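The refinement loop can be sketched as follows. Vectors are in units of 1/8 pixel, and the step sizes 4, 2 and 1 correspond to 1/2-, 1/4- and 1/8-pel refinement. The `cost` argument is a hypothetical callable standing in for whatever matching metric the encoder uses.

```python
def refine_subpixel(v0, cost, steps=(4, 2, 1)):
    """Refine a pixel-accurate vector to sub-pixel accuracy.

    v0    -- (V0, W0), both multiples of 8 (units of 1/8 pixel)
    cost  -- callable mapping a vector to its matching cost (assumed)
    steps -- neighbour offsets for 1/2-, 1/4- and 1/8-pel stages
    """
    best = v0
    best_cost = cost(best)
    for step in steps:
        centre = best                     # best vector from the previous stage
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue              # the centre has already been costed
                cand = (centre[0] + dx, centre[1] + dy)
                c = cost(cand)
                if c < best_cost:
                    best, best_cost = cand, c
    return best
```

Each stage evaluates only 8 new candidates, so refining to 1/8-pel accuracy costs 24 extra block matches instead of an exhaustive search over every 1/8-pel position.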

Figure: sub-pixel motion-vector refinement

The sub-pixel matching process is complicated slightly since the reference is only upconverted by a factor of 2 in each dimension, not 8, and so 1/4 and 1/8 pel matching requires frame component values to be calculated on the fly by linear interpolation.
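As an illustration of the on-the-fly calculation, the sketch below fetches a sample at a 1/8-pel position from a reference that has been upconverted by 2. Samples of the upconverted picture are 4 units apart on the 1/8-pel grid, so the position is split into an integer index and a fraction out of 4. The rounding and the clamping at the picture edge are assumptions made for the example, and positions are taken to be non-negative.

```python
def sample_eighth_pel(up_ref, px, py):
    """Linearly interpolate a sample at a 1/8-pel position.

    up_ref -- reference upconverted by 2 in each dimension
    px, py -- non-negative position in units of 1/8 pixel
    """
    ix, fx = divmod(px, 4)                # half-pel index and weight (0..3)
    iy, fy = divmod(py, 4)
    x1 = min(ix + 1, len(up_ref[0]) - 1)  # clamp at the edge (assumption)
    y1 = min(iy + 1, len(up_ref) - 1)
    a, b = up_ref[iy][ix], up_ref[iy][x1]
    c, d = up_ref[y1][ix], up_ref[y1][x1]
    total = ((4 - fx) * (4 - fy) * a + fx * (4 - fy) * b +
             (4 - fx) * fy * c + fx * fy * d)
    return (total + 8) // 16              # weights sum to 16; round to nearest
```

When both fractions are zero the function returns the stored half-pel sample unchanged, so 1/2-pel matching needs no interpolation at all and only the 1/4- and 1/8-pel stages pay the extra cost.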
