
MPEG-2 VIDEO COMPRESSION

by P.N. Tudor
MPEG-2 is an extension of the MPEG-1 international standard for digital
compression of audio and video signals. MPEG-1 was designed to code
progressively scanned video at bit rates up to about 1.5 Mbit/s for applications
such as CD-i (compact disc interactive). MPEG-2 is directed at broadcast formats
at higher data rates; it provides extra algorithmic 'tools' for efficiently coding
interlaced video, supports a wide range of bit rates and provides for multichannel
surround sound coding. This tutorial paper introduces the principles used for
compressing video according to the MPEG-2 standard, outlines the general
structure of a video coder and decoder, and describes the subsets ('profiles') of the
toolkit and the sets of constraints on parameter values ('levels') defined to date.

1. INTRODUCTION

Recent progress in digital technology has made the widespread use of compressed digital
video signals practical. Standardisation has been very important in the development of
common compression methods to be used in the new services and products that are now
possible. This allows the new services to interoperate with each other and encourages the
investment needed in integrated circuits to make the technology cheap.

MPEG (Moving Picture Experts Group) was started in 1988 as a working group within
ISO/IEC with the aim of defining standards for digital compression of audio-visual signals.
MPEG's first project, MPEG-1, was published in 1993 as ISO/IEC 11172 [1]. It is a three-
part standard defining audio and video compression coding methods and a multiplexing
system for interleaving audio and video data so that they can be played back together.
MPEG-1 principally supports video coding up to about 1.5 Mbit/s giving quality similar to
VHS and stereo audio at 192 kbit/s. It is used in the CD-i and Video-CD systems for storing
video and audio on CD-ROM.

During 1990, MPEG recognised the need for a second, related standard for coding video for
broadcast formats at higher data rates. The MPEG-2 standard [2] is capable of coding
standard-definition television at bit rates of about 3-15 Mbit/s and high-definition
television at 15-30 Mbit/s. MPEG-2 extends the stereo audio capabilities of MPEG-1 to
multi-channel surround sound coding. MPEG-2 decoders will also decode MPEG-1
bitstreams.

Drafts of the audio, video and systems specifications were completed in November 1993 and
the ISO/IEC approval process was completed in November 1994. The final text was
published in 1995.

MPEG-2 aims to be a generic video coding system supporting a diverse range of
applications. Different algorithmic 'tools', developed for many applications, have been
integrated into the full standard. To implement all the features of the standard in all decoders
is unnecessarily complex and a waste of bandwidth, so a small number of subsets of the full
standard, known as profiles and levels, have been defined. A profile is a subset of
algorithmic tools and a level identifies a set of constraints on parameter values (such as
picture size and bit rate). A decoder which supports a particular profile and level is only
required to support the corresponding subset of the full standard and set of parameter
constraints.

This paper introduces the principles used in MPEG-2 video compression systems, outlines
the general structure of a coder and decoder, and describes the profiles and levels defined to
date.

2. VIDEO FUNDAMENTALS

Television services in Europe currently broadcast video at a frame rate of 25 Hz. Each frame
consists of two interlaced fields, giving a field rate of 50 Hz. The first field of each frame
contains only the odd numbered lines of the frame (numbering the top frame line as line 1).
The second field contains only the even numbered lines of the frame and is sampled in the
video camera 20 ms after the first field. It is important to note that one interlaced frame
contains fields from two instants in time. American television is similarly interlaced but with
a frame rate of just under 30 Hz.

In video systems other than television, non-interlaced video is commonplace (for example,
most computers output non-interlaced video). In non-interlaced video, all the lines of a frame
are sampled at the same instant in time. Non-interlaced video is also termed 'progressively
scanned' or 'sequentially scanned' video.

The red, green and blue (RGB) signals coming from a colour television camera can be
equivalently expressed as luminance (Y) and chrominance (UV) components. The
chrominance bandwidth may be reduced relative to the luminance without significantly
affecting the picture quality. For standard definition video, CCIR recommendation 601 [3]
defines how the component (YUV) video signals can be sampled and digitised to form
discrete pixels. The terms 4:2:2 and 4:2:0 are often used to describe the sampling structure
of the digital picture. 4:2:2 means the chrominance is horizontally subsampled by a factor of
two relative to the luminance; 4:2:0 means the chrominance is horizontally and vertically
subsampled by a factor of two relative to the luminance.

The active region of a digital television frame, sampled according to CCIR recommendation
601, is 720 pixels by 576 lines for a frame rate of 25 Hz. Using 8 bits for each Y, U or V
pixel, the uncompressed bit rates for 4:2:2 and 4:2:0 signals are therefore:

4:2:2: 720 x 576 x 25 x 8 + 360 x 576 x 25 x (8 + 8) = 166 Mbit/s
4:2:0: 720 x 576 x 25 x 8 + 360 x 288 x 25 x (8 + 8) = 124 Mbit/s
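
The arithmetic can be checked with a short Python sketch (purely illustrative; the variable names are ours, not part of any standard):

    # Illustrative check of the uncompressed bit rates above.
    WIDTH, HEIGHT, RATE, BITS = 720, 576, 25, 8

    def uncompressed_mbit_per_s(chroma_w, chroma_h):
        luma = WIDTH * HEIGHT * RATE * BITS
        chroma = chroma_w * chroma_h * RATE * (BITS + BITS)  # U plus V
        return (luma + chroma) / 1e6

    print(uncompressed_mbit_per_s(360, 576))  # 4:2:2 -> 165.888, i.e. ~166 Mbit/s
    print(uncompressed_mbit_per_s(360, 288))  # 4:2:0 -> 124.416, i.e. ~124 Mbit/s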

MPEG-2 is capable of compressing the bit rate of standard-definition 4:2:0 video down to
about 3-15 Mbit/s. At the lower bit rates in this range, the impairments introduced by the
MPEG-2 coding and decoding process become increasingly objectionable. For digital
terrestrial television broadcasting of standard-definition video, a bit rate of around 6 Mbit/s
is thought to be a good compromise between picture quality and transmission bandwidth
efficiency.

3. BIT RATE REDUCTION PRINCIPLES

A bit rate reduction system operates by removing redundant information from the signal at
the coder prior to transmission and re-inserting it at the decoder. A coder and decoder pair
are referred to as a 'codec'. In video signals, two distinct kinds of redundancy can be
identified.

Spatial and temporal redundancy: Pixel values are not independent, but are correlated
with their neighbours both within the same frame and across frames. So, to some extent, the
value of a pixel is predictable given the values of neighbouring pixels.

Psychovisual redundancy: The human eye has a limited response to fine spatial detail [4],
and is less sensitive to detail near object edges or around shot-changes. Consequently, the
bit rate reduction process can introduce controlled impairments into the decoded picture
without them being visible to a human observer.

Two key techniques employed in an MPEG codec are intra-frame Discrete Cosine
Transform (DCT) coding and motion-compensated inter-frame prediction. These techniques
have been successfully applied to video bit rate reduction prior to MPEG, notably for 625-
line video contribution standards at 34 Mbit/s [5] and video conference systems at bit rates
below 2 Mbit/s [6].

Intra-frame DCT coding

DCT [7]: A two-dimensional DCT is performed on small blocks (8 pixels by 8 lines) of each
component of the picture to produce blocks of DCT coefficients (Fig. 1). The magnitude of
each DCT coefficient indicates the contribution of a particular combination of horizontal and
vertical spatial frequencies to the original picture block. The coefficient corresponding to
zero horizontal and vertical frequency is called the DC coefficient.

Fig. 1 - The discrete cosine transform (DCT). Pixel value and DCT coefficient
magnitude are represented by dot size.

The DCT doesn't directly reduce the number of bits required to represent the block. In
fact, for an 8x8 block of 8-bit pixels, the DCT produces an 8x8 block of 11-bit coefficients
(the range of coefficient values is larger than the range of pixel values). The reduction in the
number of bits follows from the observation that, for typical blocks from natural images, the
distribution of coefficients is non-uniform. The transform tends to concentrate the energy
into the low-frequency coefficients and many of the other coefficients are near-zero. The bit
rate reduction is achieved by not transmitting the near-zero coefficients and by quantising
and coding the remaining coefficients as described below. The non-uniform coefficient
distribution is a result of the spatial redundancy present in the original image block.
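
To make the energy-compaction property concrete, the following Python sketch (assuming NumPy and SciPy are available) transforms a smooth synthetic 8x8 block, standing in for a typical natural-image block, and prints the coefficient magnitudes:

    # Sketch: 2-D DCT of a smooth synthetic 8x8 block (assumes NumPy/SciPy).
    import numpy as np
    from scipy.fft import dctn

    x = np.arange(8)
    block = 128 + 16 * np.add.outer(np.cos(x / 4.0), np.sin(x / 6.0))

    coeffs = dctn(block, norm='ortho')
    print(np.abs(coeffs).round(1))  # energy concentrated in the top-left
                                    # (low-frequency) corner; the rest near zero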

Quantisation: The function of the coder is to transmit the DCT block to the decoder, in a bit
rate efficient manner, so that it can perform the inverse transform to reconstruct the image. It
has been observed that the numerical precision of the DCT coefficients may be reduced
while still maintaining good image quality at the decoder. Quantisation is used to reduce the
number of possible values to be transmitted, reducing the required number of bits.

The degree of quantisation applied to each coefficient is weighted according to the visibility
of the resulting quantisation noise to a human observer. In practice, this results in the high-
frequency coefficients being more coarsely quantised than the low-frequency coefficients.
Note that the quantisation noise introduced by the coder is not reversible in the decoder,
making the coding and decoding process 'lossy'.
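
As an illustration of frequency-weighted quantisation, the sketch below applies coarser step sizes to higher-frequency coefficients. The weighting matrix is invented for this example; MPEG-2 defines its own default matrices (and allows them to be replaced via the bitstream):

    # Frequency-weighted quantisation (weights invented for this example).
    import numpy as np

    r, c = np.indices((8, 8))
    weights = 8 + 2 * (r + c)            # coarser steps at higher frequencies

    def quantise(coeffs, scale=1.0):
        return np.round(coeffs / (weights * scale)).astype(int)

    def dequantise(levels, scale=1.0):
        # The rounding error introduced by quantise() is lost for good:
        # this is the 'lossy' step of the codec.
        return levels * weights * scale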

Coding: The serialisation and coding of the quantised DCT coefficients exploits the likely
clustering of energy into the low-frequency coefficients and the frequent occurrence of zero-
value coefficients. The block is scanned in a diagonal zigzag pattern starting at the DC
coefficient to produce a list of quantised coefficient values, ordered according to the scan
pattern.

The list of values produced by scanning is entropy coded using a variable-length code
(VLC). Each VLC code word denotes a run of zeros followed by a non-zero coefficient of a
particular level. VLC coding recognises that short runs of zeros are more likely than long
ones and small coefficients are more likely than large ones. The VLC allocates code words
which have different lengths depending upon the probability with which they are expected to
occur. To enable the decoder to distinguish where one code ends and the next begins, the
VLC has the property that no complete code is a prefix of any other.

Fig. 1 shows the zigzag scanning process, using the scan pattern common to both MPEG-1
and MPEG-2. MPEG-2 has an additional 'alternate' scan pattern intended for scanning the
quantised coefficients resulting from interlaced source pictures.
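
The zigzag scan itself is easy to state in code. The sketch below generates the classic scan order by walking the anti-diagonals of the block (the MPEG-2 'alternate' scan is not shown):

    # Classic 8x8 zigzag scan order, common to MPEG-1 and MPEG-2.
    def zigzag_order(n=8):
        key = lambda rc: (rc[0] + rc[1],                           # anti-diagonal
                          rc[0] if (rc[0] + rc[1]) % 2 else rc[1]) # direction
        return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

    def scan(block):
        return [block[r][c] for r, c in zigzag_order(len(block))]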

To illustrate the variable-length coding process, consider the following example list of values
produced by scanning the quantised coefficients from a transformed block:

12, 6, 6, 0, 4, 3, 0, 0, 0...0

The first step is to group the values into runs of (zero or more) zeros followed by a non-zero
value. Additionally, the final run of zeros is replaced with an end of block (EOB) marker.
Using parentheses to show the groups, this gives:

(12), (6), (6), (0, 4), (3) EOB


The second step is to generate the variable length code words corresponding to each group (a
run of zeros followed by a non-zero value) and the EOB marker. Table 1 shows an extract of
the DCT coefficient VLC table common to both MPEG-1 and MPEG-2. MPEG-2 has an
additional 'intra' VLC optimised for coding intra blocks (see Section 4). Using the variable
length code from Table 1 and adding spaces and commas for readability, the final coded
representation of the example block is:
0000 0000 1101 00, 0010 0001 0, 0010 0001 0, 0000 0011 000, 0010 10, 10
Table 1: Extract from the MPEG-2 DCT coefficient VLC table.

Length of run of zeros   Value of non-zero coefficient   Variable-length codeword
0                        12                              0000 0000 1101 00
0                        6                               0010 0001 0
1                        4                               0000 0011 000
0                        3                               0010 10
EOB                      -                               10
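
Putting the grouping and coding steps together, the sketch below reproduces the worked example using only the Table 1 extract (a real coder uses the full VLC table and also handles escape codes, which are omitted here):

    # Run-length grouping and VLC lookup for the worked example above.
    VLC = {(0, 12): '0000 0000 1101 00',
           (0, 6):  '0010 0001 0',
           (1, 4):  '0000 0011 000',
           (0, 3):  '0010 10'}
    EOB = '10'

    def encode(values):
        words, run = [], 0
        for v in values:
            if v == 0:
                run += 1
            else:
                words.append(VLC[(run, v)])
                run = 0
        return words + [EOB]        # the final run of zeros becomes EOB

    print(', '.join(encode([12, 6, 6, 0, 4, 3, 0, 0, 0])))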
Motion-compensated inter-frame prediction

This technique exploits temporal redundancy by attempting to predict the frame to be coded
from a previous 'reference' frame. The prediction cannot be based on a source picture
because the prediction has to be repeatable in the decoder, where the source pictures are not
available (the decoded pictures are not identical to the source pictures because the bit rate
reduction process introduces small distortions into the decoded picture). Consequently, the
coder contains a local decoder which reconstructs pictures exactly as they would be in the
decoder, from which predictions can be formed.

The simplest inter-frame prediction of the block being coded is that which takes the co-sited
(i.e. the same spatial position) block from the reference picture. Naturally this makes a good
prediction for stationary regions of the image, but is poor in moving areas. A more
sophisticated method, known as motion-compensated inter-frame prediction, is to offset any
translational motion which has occurred between the block being coded and the reference
frame and to use a shifted block from the reference frame as the prediction.

One method of determining the motion that has occurred between the block being coded and
the reference frame is a 'block-matching' search in which a large number of trial offsets are
tested by the coder using the luminance component of the picture. The 'best' offset is selected
on the basis of minimum error between the block being coded and the prediction.
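
A minimal full-search block matcher might look like the following sketch, which tests every offset in a +/-7 pixel window using the sum of absolute differences (SAD) as the error measure. The block size, search range and error measure are design choices of the coder; the standard only defines how the resulting vectors are conveyed:

    # Full-search block matching over a +/-7 pixel window using SAD.
    import numpy as np

    def best_offset(current, reference, r0, c0, size=16, search=7):
        block = current[r0:r0 + size, c0:c0 + size].astype(int)
        best, best_sad = (0, 0), None
        for dr in range(-search, search + 1):
            for dc in range(-search, search + 1):
                r, c = r0 + dr, c0 + dc
                if r < 0 or c < 0 or r + size > reference.shape[0] \
                               or c + size > reference.shape[1]:
                    continue                  # candidate falls outside the frame
                sad = np.abs(block - reference[r:r + size, c:c + size]).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best = sad, (dr, dc)
        return best                           # the motion vector for this block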

The bit rate overhead of using motion-compensated prediction is the need to convey the
motion vectors required to predict each block to the decoder. For example, using MPEG-2 to
compress standard-definition video to 6 Mbit/s, the motion vector overhead could account
for about 2 Mbit/s during a picture making heavy use of motion-compensated prediction.

4. MPEG-2 DETAILS

Codec structure

In an MPEG-2 system, the DCT and motion-compensated interframe prediction are
combined, as shown in Fig. 2. The coder subtracts the motion-compensated prediction from
the source picture to form a 'prediction error' picture. The prediction error is transformed
with the DCT, the coefficients are quantised and these quantised values coded using a VLC.
The coded luminance and chrominance prediction error is combined with 'side information'
required by the decoder, such as motion vectors and synchronising information, and formed
into a bitstream for transmission. Fig. 3 shows an outline of the MPEG-2 video bitstream
structure.

Fig. 2 - (a) Motion-compensated DCT coder; (b) motion compensated DCT decoder.
Fig. 3 - Outline of MPEG-2 video bitstream structure (shown bottom up).

In the decoder, the quantised DCT coefficients are reconstructed and inverse transformed to
produce the prediction error. This is added to the motion-compensated prediction generated
from previously decoded pictures to produce the decoded output.

In an MPEG-2 codec, the motion-compensated predictor shown in Fig. 2 supports many
methods for generating a prediction. For example, the block may be 'forward predicted' from
a previous picture, 'backward predicted' from a future picture, or 'bidirectionally predicted'
by averaging a forward and backward prediction. The method used to predict the block may
change from one block to the next. Additionally, the two fields within a block may be
predicted separately with their own motion vector, or together using a common motion
vector. Another option is to make a zero-value prediction, such that the source image block
rather than the prediction error block is DCT coded. For each block to be coded, the coder
chooses between these prediction modes, trying to maximise the decoded picture quality
within the constraints of the bit rate. The choice of prediction mode is transmitted to the
decoder, with the prediction error, so that it may regenerate the correct prediction.

Picture types

In MPEG-2, three 'picture types' are defined. The picture type defines which prediction
modes may be used to code each block.

'Intra' pictures (I-pictures) are coded without reference to other pictures. Moderate
compression is achieved by reducing spatial redundancy, but not temporal redundancy. They
can be used periodically to provide access points in the bitstream where decoding can begin.

'Predictive' pictures (P-pictures) can use the previous I- or P-picture for motion
compensation and may be used as a reference for further prediction. Each block in a P-
picture can either be predicted or intra-coded. By reducing spatial and temporal redundancy,
P-pictures offer increased compression compared to I-pictures.

'Bidirectionally-predictive' pictures (B-pictures) can use the previous and next I- or P-
pictures for motion compensation, and offer the highest degree of compression. Each block
in a B-picture can be forward, backward or bidirectionally predicted or intra-coded. To
enable backward prediction from a future frame, the coder reorders the pictures from natural
'display' order to 'bitstream' order so that the B-picture is transmitted after the previous and
next pictures it references. This introduces a reordering delay dependent on the number of
consecutive B-pictures.

The different picture types typically occur in a repeating sequence, termed a 'Group of
Pictures' or GOP. A typical GOP in display order is:

B1 B2 I3 B4 B5 P6 B7 B8 P9 B10 B11 P12

The corresponding bitstream order is:

I3 B1 B2 P6 B4 B5 P9 B7 B8 P12 B10 B11

A regular GOP structure can be described with two parameters: N, which is the number of
pictures in the GOP, and M, which is the spacing of P-pictures. The GOP given here is
described as N=12 and M=3. MPEG-2 does not insist on a regular GOP structure. For
example, a P-picture following a shot-change may be badly predicted since the reference
picture for prediction is completely different from the picture being predicted. Thus, it may
be beneficial to code it as an I-picture instead.
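
The reordering from display order to bitstream order is mechanical, as the following sketch shows for the N=12, M=3 GOP above (for simplicity it treats the GOP in isolation, ignoring the dependence of B1 and B2 on the final reference picture of the previous GOP):

    # Display order for an N=12, M=3 GOP, and its bitstream reordering.
    def gop_display_order(n=12, m=3):
        types = ['I' if i == m - 1 else
                 'P' if i % m == m - 1 else 'B' for i in range(n)]
        return [t + str(i + 1) for i, t in enumerate(types)]

    def bitstream_order(display):
        out, held_b = [], []
        for pic in display:
            if pic.startswith('B'):
                held_b.append(pic)     # hold B-pictures until their future
            else:                      # reference (I or P) has been sent
                out.append(pic)
                out += held_b
                held_b = []
        return out + held_b

    print(' '.join(gop_display_order()))                   # B1 B2 I3 ... P12
    print(' '.join(bitstream_order(gop_display_order())))  # I3 B1 B2 P6 ...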

For a given decoded picture quality, coding using each picture type produces a different
number of bits. In a typical example sequence, a coded I-picture was three times larger than a
coded P-picture, which was itself 50% larger than a coded B-picture.

Buffer control

By removing much of the redundancy from the source images, the coder outputs a variable
bit rate. The bit rate depends on the complexity and predictability of the source picture and
the effectiveness of the motion-compensated prediction.

For many applications, the bitstream must be carried in a fixed bit rate channel. In these
cases, a buffer store is placed between the coder and the channel. The buffer is filled at a
variable rate by the coder, and emptied at a constant rate by the channel. To prevent the
buffer from under- or overflowing, a feedback mechanism acts to adjust the average coded
bit rate as a function of the buffer fullness. For example, the average coded bit rate may be
lowered by increasing the degree of quantisation applied to the DCT coefficients. This
reduces the number of bits generated by the variable-length coding, but increases distortion
in the decoded image. The decoder must also have a buffer between the channel and the
variable rate input to the decoding process. The size of the buffers in the coder and decoder
must be the same.

MPEG-2 defines the maximum decoder (and hence coder) buffer size, although the coder
may choose to use only part of this. The delay through the coder and decoder buffer is equal
to the buffer size divided by the channel bit rate. For example, an MPEG-2 coder operating
at 6 Mbit/s with a buffer size of 1.8 Mbits would have a total delay through the coder and
decoder buffers of around 300 ms. Reducing the buffer size will reduce the delay, but may
affect picture quality if the buffer becomes too small to accommodate the variation in bit rate
from the coder VLC.
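
The delay figure follows directly from the buffer size and channel rate, and the feedback idea can be caricatured with a simple proportional rule (the control law below is invented for illustration; practical rate controllers are considerably more sophisticated):

    # Buffer delay and a caricature of buffer-fullness feedback.
    BUFFER_BITS = 1.8e6                     # 1.8 Mbit buffer
    CHANNEL_RATE = 6e6                      # 6 Mbit/s channel

    print(BUFFER_BITS / CHANNEL_RATE)       # 0.3 s: the ~300 ms total delay

    def quantiser_scale(fullness, base=8.0):
        # Quantise more coarsely as the buffer fills, more finely as it
        # drains, pulling the average coded bit rate towards the channel rate.
        return base * (0.5 + fullness / BUFFER_BITS)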

Profiles and levels

MPEG-2 video is an extension of MPEG-1 video. MPEG-1 was targeted at coding
progressively scanned video at bit rates up to about 1.5 Mbit/s. MPEG-2 provides extra
algorithmic 'tools' for efficiently coding interlaced video and supports a wide range of bit
rates. MPEG-2 also provides tools for 'scalable' coding where useful video can be
reconstructed from pieces of the total bitstream. The total bitstream may be structured in
layers, starting with a base layer (that can be decoded by itself) and adding refinement layers
to reduce quantisation distortion or improve resolution.

A small number of subsets of the complete MPEG-2 tool kit have been defined, known as
profiles and levels. A profile is a subset of algorithmic tools and a level identifies a set of
constraints on parameter values (such as picture size or bit rate). The profiles and levels
defined to date fit together such that a higher profile or level is a superset of a lower one. A
decoder which supports a particular profile and level is only required to support the
corresponding subset of algorithmic tools and set of parameter constraints.

Details of non-scalable profiles: Two non-scalable profiles are defined by the MPEG-2
specification.

The simple profile uses no B-pictures, and hence no backward or interpolated prediction.
Consequently, no picture reordering is required (picture reordering would add about 120 ms
to the coding delay). With a small coder buffer, this profile is suitable for low-delay
applications such as video conferencing where the overall delay is around 100 ms. Coding is
performed on a 4:2:0 video signal.

The main profile adds support for B-pictures and is the most widely used profile. Using B-
pictures increases the picture quality, but adds about 120 ms to the coding delay to allow for
the picture reordering. Main profile decoders will also decode MPEG-1 video. Currently,
most MPEG-2 video decoder chip-sets support the main profile at main level.

Details of scalable profiles: The SNR profile adds support for enhancement layers of DCT
coefficient refinement, using the 'signal-to-noise ratio (SNR) scalability' tool. Fig. 4 shows an
example SNR-scalable coder and decoder.
Fig. 4 - (a) SNR-scalable video coder; (b) SNR-scalable video decoder.

The codec operates in a similar manner to the non-scalable codec shown in Fig. 2, with the
addition of an extra quantisation stage. The coder quantises the DCT coefficients to a given
accuracy, variable-length codes them and transmits them as the lower-level or 'base-layer'
bitstream. The quantisation error introduced by the first quantiser is itself quantised,
variable-length coded and transmitted as the upper-level or 'enhancement-layer' bitstream.
Side information required by the decoder, such as motion vectors, is transmitted only in the
base layer.
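
The two-stage quantisation can be sketched as follows; the step sizes are arbitrary example values, and the variable-length coding of each layer is omitted:

    # Two-stage (SNR-scalable) quantisation: the enhancement layer carries a
    # quantised version of the base layer's quantisation error.
    import numpy as np

    def snr_split(coeffs, q_base=16, q_enh=4):
        base = np.round(coeffs / q_base)
        error = coeffs - base * q_base        # what the base layer got wrong
        enh = np.round(error / q_enh)
        return base, enh

    def snr_combine(base, enh, q_base=16, q_enh=4):
        # Refinements are added after inverse quantisation of each layer.
        return base * q_base + enh * q_enh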

The base-layer bitstream can be decoded in the same way as the non-scalable case shown in
Fig. 2(b). To decode the combined base and enhancement layers, both layers must be
received, as shown in Fig. 4(b). The enhancement-layer coefficient refinements are added to
the base-layer coefficient values following inverse quantisation. The resulting coefficients
are then decoded in the same way as the non-scalable case.

The SNR profile is suggested for digital terrestrial television as a way of providing graceful
degradation.

The spatial profile adds support for enhancement layers carrying the coded image at
different resolutions, using the 'spatial scalability' tool. Fig. 5 shows an example spatial-
scalable coder and decoder.
