Video Coding
Iain E. G. Richardson
Copyright © 2002 John Wiley & Sons, Ltd
ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic)
3.1 INTRODUCTION
Representing video material in a digital form requires a large number of bits. The volume of
data generated by digitising a video signal is too large for most storage and transmission
systems (despite the continual increase in storage capacity and transmission ‘bandwidth’).
This means that compression is essential for most digital video applications.
The ITU-R 601 standard (described in Chapter 2) describes a digital format for video that
is roughly equivalent to analogue television, in terms of spatial resolution and frame rate.
One channel of ITU-R 601 television, broadcast in uncompressed digital form, requires a
transmission bit rate of 216 Mbps. At this bit rate, a 4.7 Gbyte DVD could store just
87 seconds of uncompressed video.
Table 3.1 shows the uncompressed bit rates of several popular video formats. From this
table it can be seen that even QCIF at 15 frames per second (i.e. relatively low-quality video,
suitable for video telephony) requires 4.6 Mbps for transmission or storage. Table 3.2 lists
typical capacities of popular storage media and transmission networks.
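These figures follow directly from the sampling parameters of each format. As a quick worked example in Python (a sketch; the 4:2:0 sampling assumed for QCIF and the 13.5/6.75 MHz sampling rates of ITU-R 601 are the standard values):

    # QCIF: 176 x 144 pixels, 4:2:0 sampling (8 bits of luminance per pixel
    # plus two chrominance samples shared between 4 pixels = 12 bits/pixel).
    qcif_bits_per_frame = 176 * 144 * 12
    qcif_bit_rate = qcif_bits_per_frame * 15          # 15 frames per second
    print(qcif_bit_rate / 1e6)                        # ~4.6 Mbps

    # ITU-R 601: luminance sampled at 13.5 MHz plus two chrominance
    # components at 6.75 MHz each, 8 bits per sample.
    itu601_bit_rate = (13.5e6 + 2 * 6.75e6) * 8
    print(itu601_bit_rate / 1e6)                      # 216 Mbps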
There is a clear gap between the high bit rates demanded by uncompressed video and the
available capacity of current networks and storage media. The purpose of video compression
(video coding) is to fill this gap. A video compression system aims to reduce the amount of
data required to store or transmit video whilst maintaining an ‘acceptable’ level of video
quality. Most of the practical systems and standards for video compression are ‘lossy’, i.e.
the volume of data is reduced (compressed) at the expense of a loss of visual quality. The
quality loss depends on many factors, but in general, higher compression results in a greater
loss of quality.
The following statement (or something similar) has been made many times over the 20-year
history of image and video compression: ‘Video compression will become redundant very
soon, once transmission and storage capacities have increased to a sufficient level to cope
with uncompressed video.’ It is true that both storage and transmission capacities continue to
increase. However, an efficient and well-designed video compression system gives very
significant performance advantages for visual communications at both low and high
transmission bandwidths. At low bandwidths, compression enables applications that would
not otherwise be possible, such as basic-quality video telephony over a standard telephone
connection. At high bandwidths, compression can support a much higher visual quality. For
example, a 4.7 Gbyte DVD can store approximately 2 hours of uncompressed QCIF video (at
15 frames per second) or 2 hours of compressed ITU-R 601 video (at 30 frames per second).
Most users would prefer to see ‘television-quality’ video with smooth motion rather than
‘postage-stamp’ video with jerky motion.
Video compression and video CODECs will therefore remain a vital part of the emerging
multimedia industry for the foreseeable future, allowing designers to make the most
efficient use of available transmission or storage capacity. In this chapter we introduce the
basic components of an image or video compression system. We begin by defining the
concept of an image or video encoder (compressor) and decoder (decompressor). We then
describe the main functional blocks of an image encoder/decoder (CODEC) and a video
CODEC.
3.2 IMAGE AND VIDEO COMPRESSION
Information-carrying signals may be compressed, i.e. converted to a representation or form
that requires fewer bits than the original (uncompressed) signal. A device or program that
compresses a signal is an encoder and a device or program that decompresses a signal is a
decoder. An enCOder/DECoder pair is a CODEC.
Figure 3.1 shows a typical example of a CODEC as part of a communication system. The
original (uncompressed) information is encoded (compressed): this is source coding. The
source coded signal is then encoded further to add error protection (channel coding) prior to
transmission over a channel. At the receiver, a channel decoder detects and/or corrects
transmission errors and a source decoder decompresses the signal. The decompressed signal
may be identical to the original signal (lossless compression) or it may be distorted or
degraded in some way (lossy compression).
General-purpose compression CODECs are available that are designed to encode and
compress data containing statistical redundancy. An information-carrying signal usually
contains redundancy, which means that it may (in theory) be represented in a more compact
way. For example, characters within a text file occur with varying frequencies: in English,
the letters E, T and A occur more often than the letters Q, Z and X. This makes it possible to
compress a text file by representing frequently occurring characters with short codes and
infrequently occurring characters with longer codes (this principle is used in Huffman coding,
described in Chapter 8). Compression is achieved by reducing the statistical redundancy in
the text file. This type of general-purpose CODEC is known as an entropy CODEC.
Photographic images and sequences of video frames are not amenable to compression
using general-purpose CODECs. Their contents (pixel values) tend to be highly correlated,
i.e. neighbouring pixels have similar values, whereas an entropy encoder performs best with
data values that have a certain degree of independence (decorrelated data). Figure 3.2
illustrates the poor performance of a general-purpose entropy encoder with image data.
The original image (a) is compressed and decompressed using a ZIP program to produce
image (b). This is identical to the original (lossless compression), but the compressed file is
only 92% of the size of the original, i.e. there is very little compression. Image (c) is
obtained by compressing and decompressing the original using the JPEG compression
method. The compressed version is less than a quarter of the size of the original (over
4× compression) and the decompressed image looks almost identical to the original. (It is in
fact slightly ‘degraded’ due to the lossy compression process.)
In this example, the JPEG method achieved good compression performance by applying a
source model to the image before compression. The source model attempts to exploit the
properties of video or image data and to represent it in a form that can readily be compressed
by an entropy encoder. Figure 3.3 shows the basic design of an image or video CODEC
consisting of a source model and an entropy encoder/decoder.
Images and video signals have a number of properties that may be exploited by source
models. Neighbouring samples (pixels) within an image or a video frame tend to be highly
correlated and so there is significant spatial redundancy. Neighbouring regions within
successive video frames also tend to be highly correlated (temporal redundancy). As well
as these statistical properties (statistical redundancy), a source model may take advantage of
subjective redundancy, exploiting the sensitivity of the human visual system to various
characteristics of images and video. For example, the HVS is much more sensitive to low
frequencies than to high ones and so it is possible to compress an image by eliminating
certain high-frequency components. Image (c) in Figure 3.2 was compressed by discarding
certain subjectively redundant components of the information: the decoded image is not
identical to the original but the information loss is not obvious to the human viewer.
Examples of image and video source models include the following.

3.2.1 Differential Pulse Code Modulation (DPCM)

Each sample or pixel is predicted from one or more previously transmitted samples. The
simplest prediction is formed from the previous pixel (pixel A in Figure 3.4). A more
accurate prediction can be obtained using a weighted average of neighbouring pixels (for
example, A, B and C in Figure 3.4). The actual pixel value X is subtracted from the
prediction and the difference (the prediction error) is transmitted to the receiver. The
prediction error will typically be small due to spatial correlation, and compression can
be achieved by representing common, small prediction errors with short binary codes and
larger, less common errors with longer codes. Further compression may be achieved by
quantising the prediction error and reducing its precision: this is lossy compression as it
becomes impossible to exactly reproduce the original values at the decoder. DPCM may be
applied spatially (using adjacent pixels in the same frame) and/or temporally (using adjacent
pixels in a previous frame to form the prediction) and gives modest compression with low
complexity.
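A minimal sketch of spatial DPCM in Python may help to make this concrete. The prediction weights and the quantiser step size below are illustrative choices, not values taken from any standard; note that the encoder predicts from reconstructed (not original) pixels, for the reason discussed later in Section 3.4.1:

    import numpy as np

    def dpcm_encode(image, step=4):
        # Predict each pixel X from its neighbours A (left), B (above-left)
        # and C (above), quantise the prediction error (lossy), and keep a
        # reconstructed copy so encoder and decoder predictions match.
        h, w = image.shape
        recon = np.zeros((h, w))
        levels = np.zeros((h, w), dtype=int)
        for y in range(h):
            for x in range(w):
                A = recon[y, x - 1] if x > 0 else 128
                B = recon[y - 1, x - 1] if x > 0 and y > 0 else 128
                C = recon[y - 1, x] if y > 0 else 128
                prediction = (2 * A + B + C) / 4    # weighted average
                error = image[y, x] - prediction
                levels[y, x] = round(error / step)  # quantised error
                recon[y, x] = prediction + levels[y, x] * step
        return levels, recon

Small prediction errors dominate the levels array, so it compresses well with the variable-length codes described later in this chapter.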
3.2.2 Transform Coding

The transform coding stage converts (transforms) the image from the spatial domain into
another domain in order to make it more amenable to compression. The transform may be
applied to discrete blocks of the image (block transform) or to the entire image.

3.2.3 Motion-compensated Prediction

Using a similar principle to DPCM, the encoder forms a model of the current frame based on
the samples of a previously transmitted frame. The encoder attempts to ‘compensate’ for
motion in a video sequence by translating (moving) or warping the samples of the previously
transmitted ‘reference’ frame. The resulting motion-compensated predicted frame (the
model of the current frame) is subtracted from the current frame to produce a residual
‘error’ frame (Figure 3.6). Further coding usually follows motion-compensated prediction,
e.g. transform coding of the residual frame.

3.2.4 Model-based Coding
3.3 IMAGE CODEC

3.3.1 Transform Coding

Block transforms
Figure 3.8 Image CODEC: (a) encoder (transform, quantise, reorder, entropy encode);
(b) decoder (entropy decode, reorder, rescale, inverse transform)
Figure 3.9 (a) 16 x 16 block of pixels; (b) DCT coefficients
Figure 3.9 shows a 16 x 16 block of image samples (a) and the corresponding block of coefficients produced by the DCT (b). In
the original block, the energy is distributed across the 256 samples and the latter are clearly
closely interrelated (correlated). In the coefficient block, the energy is concentrated into a few
significant coefficients (at the top left). The coefficients are decorrelated: this means that the
smaller-valued coefficients may be discarded (for example by quantisation) without
significantly affecting the quality of the reconstructed image block at the decoder.
The 16 x 16 array of coefficients shown in Figure 3.9 represents spatial frequencies in the
original block. At the top left of the array are the low-frequency components, representing
the gradual changes of brightness (luminance) in the original block. At the bottom right of
the array are high-frequency components and these represent rapid changes in brightness.
These frequency components are analogous to the components produced by Fourier analysis
of a time-varying signal (and in fact the DCT is closely related to the discrete Fourier
transform) except that here the components are 2-D. The example shown in Figure 3.9 is
typical for a photographic image: most of the coefficients produced by the DCT are
insignificant and can be discarded. This makes the DCT a powerful tool for image and
video compression.
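The energy-compaction property is easy to demonstrate. The following Python sketch (using the DCT from the SciPy library) transforms a smooth, highly correlated block and counts the significant coefficients; the block contents and the significance threshold are arbitrary choices for illustration:

    import numpy as np
    from scipy.fftpack import dct

    def dct2(block):
        # Separable 2-D DCT: 1-D DCT along rows, then along columns.
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    # A smooth 16 x 16 block: a gentle luminance ramp (highly correlated).
    ramp = np.arange(16, dtype=float)
    block = 100 + 2 * ramp[None, :] + 3 * ramp[:, None]

    coeffs = dct2(block)
    significant = np.sum(np.abs(coeffs) > 1.0)   # arbitrary threshold
    print(significant, 'of 256 coefficients are significant')

The energy that was spread across all 256 samples is packed into a handful of low-frequency coefficients; the rest may be discarded with little visible effect.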
Image transforms
The DCT is usually applied to small, discrete blocks of an image, for reasons of practicality.
In contrast, an image transform may be applied to a complete video image (or to a large ‘tile’
within the image). The most popular transform of this type is the discrete wavelet transform.
A 2-D wavelet transform is applied to the original image in order to decompose it into a
series of filtered ‘sub-band’ images (Figure 3.10). Image (a) is processed in a series of stages
to produce the ‘wavelet decomposition’ image (b). This is made up of a series of components,
each containing a subset of the spatial frequencies in the image. At the top left is a low-pass
filtered version of the original and moving to the bottom right, each component contains
progressively higher-frequency information that adds the ‘detail’ of the image. It is clear that
the higher-frequency components are relatively ‘sparse’, i.e. many of the values (or
‘coefficients’) in these components are zero or insignificant. The wavelet transform is thus
an efficient way of decorrelating or concentrating the important information into a few
significant coefficients.
The wavelet transform is particularly effective for still image compression and has been
adopted as part of the JPEG-2000 standard and for still image ‘texture’ coding in the MPEG-4
standard. Wavelet-based compression is discussed further in Chapter 7.
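A single stage of the decomposition can be reproduced with the PyWavelets library, as in the sketch below (the Haar filter and the synthetic test image are arbitrary choices); applying dwt2 again to the low-pass band gives the multi-stage decomposition of Figure 3.10:

    import numpy as np
    import pywt

    # A smooth synthetic image stands in for a natural image here.
    t = np.linspace(0, 1, 256)
    image = np.outer(np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))

    # One 2-D DWT stage: a low-pass approximation plus horizontal,
    # vertical and diagonal detail (high-frequency) sub-bands.
    cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')

    # The detail sub-bands are sparse for smooth images:
    print(np.mean(np.abs(cD) < 1e-3))   # fraction of near-zero coefficients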
Another image transform that has received much attention is the so-called fractal
transform. A fractal transform coder attempts to represent an image as a set of scaled and
translated arbitrary ‘basis patterns’. Fractal-based coding has not, however, shown
sufficiently good performance to be included in any of the international standards for video
and image coding and so we will not discuss it in detail.
3.3.2 Quantisation
The block and image transforms described above do not themselves achieve any compression.
Instead, they represent the image in a different domain in which the image data is
separated into components of varying ‘importance’ to the appearance of the image. The
purpose of quantisation is to remove the components of the transformed data that are
unimportant to the visual appearance of the image and to retain the visually important
components. Once removed, the less important components cannot be replaced and so
quantisation is a lossy process.
It is possible to vary the ‘coarseness’ of the quantisation process (using a quantiser ‘scale
factor’ or ‘step size’). ‘Coarse’ quantisation will tend to discard most of the coefficients,
leaving only the most significant, whereas ‘fine’ quantisation will tend to leave more
coefficients in the quantised block. Coarse quantisation usually gives higher compression at
the expense of a greater loss in image quality. The quantiser scale factor or step size is often
the main parameter used to control image quality and compression in an image or video
CODEC. Figure 3.13 shows a small original image (left) and the effect of compression and
decompression with fine quantisation (middle) and coarse quantisation (right).
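The effect of the step size can be seen in a few lines of Python (the coefficient values below are invented for illustration):

    import numpy as np

    def quantise(coeffs, step):
        return np.round(coeffs / step).astype(int)  # lossy: precision lost

    def rescale(levels, step):
        return levels * step    # 'inverse quantisation' at the decoder

    coeffs = np.array([-311.0, 42.7, -9.8, 3.1, 0.6])
    for step in (2, 16):        # fine, then coarse quantisation
        levels = quantise(coeffs, step)
        print(step, levels, rescale(levels, step))
    # Coarse quantisation sets more coefficients to zero and introduces
    # larger reconstruction errors: higher compression, greater loss.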
3.3.3 Entropy Coding

A typical image block will contain a few significant non-zero coefficients and a large number
of zero coefficients after block transform coding and quantisation. The remaining non-zero
data can be efficiently compressed using a statistical compression method (‘entropy
coding’):
1. Reorder the quantised coefficients. The non-zero quantised coefficients of a typical image
block tend to be clustered around the ‘top-left corner’, i.e. around the low frequencies
(e.g. Figure 3.9). These non-zero values can be grouped together in sequence by
reordering the 64 coefficients, for example in a zigzag scanning order (Figure 3.14).
Scanning through in a zigzag sequence from the top-left (lowest frequency) to the
bottom-right (highest frequency) coefficients groups together the significant low-
frequency coefficients (steps 1 and 2 are illustrated in the sketch after this list).
2. Run-level coding. The reordered array of coefficients is represented as a series of (run,
level) pairs, where the first number (run) in each (run, level) pair represents the number
of preceding zeros and the second number represents a non-zero value (level). For
example, (5, 12) represents five zeros followed by 12.
3. Entropy coding. A statistical coding algorithm is applied to the (run, level) data. The
purpose of the entropy coding algorithm is to represent frequently occurring (run, level)
pairs with a short code and infrequently occurring (run, level) pairs with a longer code. In
this way, the run-level data may be compressed into a small number of bits.
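Steps 1 and 2 can be sketched in Python as follows (an 8 x 8 block is assumed, matching the 64 coefficients mentioned above; the scan-order construction is a compact way of generating the zigzag pattern of Figure 3.14):

    import numpy as np

    def zigzag_order(n=8):
        # Scan the anti-diagonals from top-left (low frequency) to
        # bottom-right (high frequency), alternating direction.
        return sorted(((y, x) for y in range(n) for x in range(n)),
                      key=lambda p: (p[0] + p[1],
                                     p[0] if (p[0] + p[1]) % 2 else p[1]))

    def run_level(block):
        # Represent the scanned coefficients as (run, level) pairs:
        # 'run' zeros followed by a non-zero 'level'.
        pairs, run = [], 0
        for y, x in zigzag_order(block.shape[0]):
            if block[y, x] == 0:
                run += 1
            else:
                pairs.append((run, int(block[y, x])))
                run = 0
        return pairs  # trailing zeros are signalled separately in practice

    quantised = np.zeros((8, 8), dtype=int)
    quantised[0, 0], quantised[0, 1], quantised[2, 1] = 26, -3, 12
    print(run_level(quantised))    # [(0, 26), (0, -3), (6, 12)]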
Huffman coding and arithmetic coding are widely used for entropy coding of image and
video data.
Huffman coding replaces each ‘symbol’ (e.g. a [run, level] pair) with a codeword containing
a variable number of bits. The codewords are allocated based on the statistical distribution of
the symbols. Short codewords are allocated to common symbols and longer codewords are
allocated to infrequent symbols. Each codeword is chosen to be ‘uniquely decodeable’, so
that a decoder can extract the series of variable-length codewords without ambiguity.
Huffman coding is well suited to practical implementation and is widely used in practice.
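A compact Python sketch of the standard Huffman construction is given below; the (run, level) symbol counts are invented for illustration:

    import heapq

    def huffman_code(counts):
        # Repeatedly merge the two least frequent subtrees; symbols in the
        # first subtree gain a '0' prefix, those in the second a '1'.
        heap = [(n, i, {sym: ''})
                for i, (sym, n) in enumerate(counts.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            n1, _, c1 = heapq.heappop(heap)
            n2, _, c2 = heapq.heappop(heap)
            merged = {s: '0' + w for s, w in c1.items()}
            merged.update({s: '1' + w for s, w in c2.items()})
            heapq.heappush(heap, (n1 + n2, next_id, merged))
            next_id += 1
        return heap[0][2]

    counts = {(0, 1): 54, (0, -1): 52, (1, 1): 20, (0, 2): 12, (2, 1): 8}
    print(huffman_code(counts))   # frequent symbols get shorter codewords

The resulting code is a prefix code, i.e. no codeword is the prefix of another, which is what makes the bit stream uniquely decodeable.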
Arithmetic coding maps a series of symbols to a fractional number (see Chapter 8) that is
then converted into a binary number and transmitted. Arithmetic coding has the potential for
higher compression than Huffman coding. Each symbol may be represented with a fractional
number of bits (rather than just an integral number of bits) and this means that the bits
allocated per symbol may be more accurately matched to the statistical distribution of the
coded data.
3.3.4 Decoding
The output of the entropy encoder is a sequence of binary codes representing the original image
in compressed form. In order to recreate the image it is necessary to decode this sequence
and the decoding process (shown in Figure 3.8) is almost the reverse of the encoding process.
An entropy decoder extracts run-level symbols from the bit sequence. These are converted
to a sequence of coefficients that are reordered into a block of quantised coefficients. The
decoding operations up to this point are the inverse of the equivalent encoding operations.
Each coefficient is multiplied by the integer scale factor (‘rescaled’). This is often described
as ‘inverse quantisation’, but in fact the loss of precision due to quantisation cannot be
reversed and so the rescaled coefficients are not identical to the original transform
coefficients.
The rescaled coefficients are transformed with an inverse transform to reconstruct a
decoded image. Because of the data loss during quantisation, this image will not be identical
to the original image: the amount of difference depends partly on the ‘coarseness’ of
quantisation.
3.4 VIDEO CODEC

A video CODEC can exploit the similarities between successive video frames (temporal
redundancy) as well as the spatial redundancy within each frame. Inter-frame compression
typically involves two steps:

1. Prediction: create a prediction of the current frame based on one or more previously
transmitted frames.

2. Compensation: subtract the prediction from the current frame to produce a ‘residual
frame’.
The residual frame is then processed using an ‘image CODEC’. The key to this approach is
the prediction function: if the prediction is accurate, the residual frame will contain little data
and will hence be compressed to a very small size by the image CODEC. In order to decode
the frame, the decoder must ‘reverse’ the compensation process, adding the prediction to the
decoded residual frame (reconstruction) (Figure 3.15). This is inter-frame coding: frames
are coded based on some relationship with other video frames, i.e. coding exploits the
interdependencies of video frames.
Figure 3.15 Inter-frame coding: the encoder and decoder each create a prediction from
previously coded frame(s)
3.4.1 Frame Differencing

The simplest predictor is just the previous transmitted frame. Figure 3.16 shows the residual
frame produced by subtracting the previous frame from the current frame in a video
sequence. Mid-grey areas of the residual frame contain zero data: light and dark areas
indicate positive and negative residual data respectively. It is clear that much of the residual
data is zero: hence, compression efficiency can be improved by compressing the residual
frame rather than the current frame.
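In Python, frame differencing is a single array subtraction; the mid-grey display convention mentioned above corresponds to adding an offset of 128 (a sketch, assuming 8-bit frames):

    import numpy as np

    def frame_difference(current, previous):
        # Zero wherever the scene is unchanged between frames.
        return current.astype(int) - previous.astype(int)

    def display_residual(diff):
        # Offset so that zero maps to mid-grey (128), as in Figure 3.16.
        return np.clip(diff + 128, 0, 255).astype(np.uint8)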
The decoder faces a potential problem that can be illustrated as follows. Table 3.4 shows
the sequence of operations required to encode and decode a series of video frames using
frame differencing. For the first frame the encoder and decoder use no prediction. The
problem starts with frame 2: the encoder uses the original frame 1 as a prediction and
encodes the resulting residual. However, the decoder only has the decoded frame 1 available
to form the prediction. Because the coding process is lossy, there is a difference between the
decoded and original frame 1 which leads to a small error in the prediction of frame 2 at
the decoder. This error will build up with each successive frame and the encoder and decoder
predictors will rapidly ‘drift’ apart, leading to a significant drop in decoded quality.
The solution to this problem is for the encoder to use a decoded frame to form the
prediction. Hence the encoder in the above example decodes (or reconstructs) frame 1 to
form a prediction for frame 2. The encoder and decoder use the same prediction and drift
should be reduced or removed. Figure 3.17 shows the complete encoder which now includes
a decoding ‘loop’ in order to reconstruct its prediction reference. The reconstructed (or
‘reference’) frame is stored in the encoder and in the decoder to form the prediction for the
next coded frame.
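The decoding loop can be sketched as follows. A simple quantise/rescale stands in for the full image CODEC here; the point is that the encoder’s prediction reference is the reconstructed frame, never the original, so the encoder and decoder references stay in step:

    import numpy as np

    def encode_sequence(frames, step=8):
        reference = np.zeros_like(frames[0], dtype=float)
        encoded = []
        for frame in frames:
            residual = frame - reference
            levels = np.round(residual / step)     # lossy residual coding
            encoded.append(levels)
            reference = reference + levels * step  # decode locally
        return encoded

    def decode_sequence(encoded, step=8):
        reference = np.zeros_like(encoded[0], dtype=float)
        decoded = []
        for levels in encoded:
            reference = reference + levels * step
            decoded.append(reference.copy())
        return decoded

Because both loops update the reference in exactly the same way, the drift described above cannot occur.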
3.4.2 Motion-compensated Prediction
Frame differencing gives better compression performance than intra-frame coding when
successive frames are very similar, but does not perform well when there is a significant
change between the previous and current frames. Such changes are usually due to movement
in the video scene and a significantly better prediction can be achieved by estimating this
movement and compensating for it.
Figure 3.18 shows a video CODEC that uses motion-compensated prediction. Two new
steps are required in the encoder:
1. Motion estimation: a region of the current frame (often a rectangular block of luminance
samples) is compared with neighbouring regions of the previous reconstructed frame.
The motion estimator attempts to find the ‘best match’, i.e. the neighbouring block in the
reference frame that gives the smallest residual block.
2. Motion compensation: the ‘matching’ region or block from the reference frame
(identified by the motion estimator) is subtracted from the current region or block.
The improved compression does not come without a price: motion estimation can be very
computationally intensive. The design of a motion estimation algorithm can have a dramatic
effect on the compression performance and computational complexity of a video CODEC.
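A full-search block-matching sketch in Python is shown below. The sum of absolute differences (SAD) is one common matching criterion; the 16 x 16 block size and the +/-7 pixel search window are typical but illustrative choices:

    import numpy as np

    def best_match(current, reference, bx, by, N=16, search=7):
        # Compare the N x N block at (bx, by) in the current frame with
        # every candidate position within +/- search pixels in the
        # reference frame; return the motion vector with the smallest SAD.
        block = current[by:by + N, bx:bx + N].astype(int)
        best_vector, best_sad = (0, 0), None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if (y < 0 or x < 0 or y + N > reference.shape[0]
                        or x + N > reference.shape[1]):
                    continue
                candidate = reference[y:y + N, x:x + N].astype(int)
                sad = int(np.abs(block - candidate).sum())
                if best_sad is None or sad < best_sad:
                    best_vector, best_sad = (dx, dy), sad
        return best_vector, best_sad

Even this small example evaluates up to 225 candidate positions per block, which illustrates why motion estimation dominates the computational cost of many video encoders.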
3.4.3 Transform, Quantisation and Entropy Encoding

A block or image transform is applied to the residual frame and the coefficients are quantised
and reordered. Run-level pairs are entropy coded as before (although the statistical
distribution and hence the coding tables are generally different for inter-coded data). If
motion-compensated prediction is used, motion vector information must be sent in addition
to the run-level data. The motion vectors are typically entropy encoded in a similar way to
run-level pairs, i.e. commonly occurring motion vectors are coded with shorter codes and
uncommon vectors are coded with longer codes.
3.4.4 Decoding
3.5 SUMMARY
Efficient coding of images and video sequences involves creating a model of the source data
that converts it into a form that can be compressed. Most image and video CODECs
developed over the last two decades have been based around a common set of ‘building
blocks’. For motion video compression, the first step is to create a motion-compensated
prediction of the frame to be compressed, based on one or more previously transmitted
frames. The difference between this model and the actual input frame is then coded using an
image CODEC. The data is transformed into another domain (e.g. the DCT or wavelet
domain), quantised, reordered and compressed using an entropy encoder. A decoder must
reverse these steps to reconstruct the frame: however, quantisation cannot be reversed and so
the decoded frame is an imperfect copy of the original.
An encoder and decoder must clearly use a compatible set of algorithms in order to
successfully exchange compressed image or video data. Of prime importance is the syntax or
structure of the compressed data. In the past 15 years there has been a significant worldwide
effort to develop standards for video and image compression. These standards generally
describe a syntax (and a decoding process) to support video or image communications for a
wide range of applications. Chapters 4 and 5 provide an overview of the main standards
bodies and the JPEG, MPEG and H.26x video and image coding standards.