MPEG-unit-5
MPEG-2 is intended for higher bitrates and higher quality than MPEG-1 (for DVD, Blu-ray, DVB,
HDTV); this is achieved partially by allowing other types of colour subsampling: 4:2:2, where only
one dimension is subsampled, and 4:4:4, without any subsampling.
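As a toy illustration, the resulting chroma plane dimensions for the different schemes can be computed as follows (the 4:2:0 case, as used by MPEG-1, is included for comparison; the function name and layout are assumptions):

```python
# Illustrative sketch: chroma plane dimensions for the subsampling schemes
# mentioned above, assuming a luma plane of width x height pixels.

def chroma_plane_size(width, height, scheme):
    """Return (chroma_width, chroma_height) for one chroma plane (Cb or Cr)."""
    if scheme == "4:4:4":      # no subsampling
        return width, height
    if scheme == "4:2:2":      # only the horizontal dimension subsampled by 2
        return width // 2, height
    if scheme == "4:2:0":      # both dimensions subsampled by 2 (MPEG-1 style)
        return width // 2, height // 2
    raise ValueError(f"unknown scheme: {scheme}")

if __name__ == "__main__":
    for s in ("4:4:4", "4:2:2", "4:2:0"):
        print(s, chroma_plane_size(720, 576, s))
```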
MPEG-3 was originally intended for HDTV but has been incorporated into MPEG-2 as a
“profile”. Several profiles and levels have been defined, supporting different functionalities
(profiles) and parameter ranges (levels). As a consequence, there is not a single MPEG-2
“algorithm” but many different algorithms for the different profiles. This makes the situation
more difficult, since a decoder needs to support the profile with which a bitstream has been
generated at the encoder.
MPEG-2 (ISO 13818) is published in 7 parts: multiplexing, video, audio, conformance testing, a
technical report with software simulation, digital storage media command and control (DSM-CC, for VoD
applications, defining protocols for manipulation / playback of MPEG bitstreams), and a non-backward-
compatible audio compression scheme (AAC, as known from iTunes). The basic structure and core coding
system of MPEG-2 is similar to MPEG-1, and an MPEG-1 video with so-called constrained parameters can
be decoded by any MPEG-2 decoder (a flag in the MPEG-1 bitstream indicates this property: ≤ 768 × 576
pixels, frame rate at most 30 fps, bitrate at most 1.856 Mbit/s (constant bitrate)). However, there
are also important differences, which will be discussed in detail in the following:
• MPEG-2 is intended for higher quality and higher bitrates (half-pixel accuracy for MVs is mandatory, finer
coefficient quantisation, separate quantisation matrices for luma and chroma for 4:4:4 and 4:2:2 data).
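As a side note, the constrained-parameters property quoted above can be phrased as a simple check. The following is a minimal, hedged sketch; the function name and argument layout are assumptions, only the numeric limits come from the text:

```python
# Sketch of a check corresponding to the MPEG-1 "constrained parameters" flag
# described above (<= 768x576 pixels, <= 30 fps, <= 1.856 Mbit/s constant bitrate).

def is_constrained_parameters(width, height, fps, bitrate_bps):
    return (width <= 768 and height <= 576
            and fps <= 30
            and bitrate_bps <= 1_856_000)

print(is_constrained_parameters(352, 288, 25, 1_150_000))   # typical MPEG-1 SIF video -> True
```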
Interlaced video (as opposed to “progressive” video) is composed of fields. Each field consists only
of even or odd numbered lines (even and odd field), and within each field all pixel values correspond to the
same point in time; however, the two fields are displaced by a certain amount of time. This scheme
originates from early TV, where upon arrival of new data (a field), only half of the lines in the image
needed to be refreshed. This leads to new data entities with respect to compression as compared to
progressive video: interlaced data can be compressed either as “field pictures” (where each field is
compressed as an independent unit) or as “frame pictures” (where the corresponding lines of the two fields
are interleaved and compressed together).
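A toy illustration of these two data entities with numpy (the array shapes are assumptions):

```python
import numpy as np

# A "frame picture" keeps the interleaved lines of both fields; a "field picture"
# treats each field (even or odd lines) as an independent unit.

frame = np.arange(8 * 8).reshape(8, 8)   # toy 8x8 "frame" of pixel values

top_field    = frame[0::2, :]   # even-numbered lines (one point in time)
bottom_field = frame[1::2, :]   # odd-numbered lines (captured one field period later)

# Re-interleaving the two fields reconstructs the frame picture.
rebuilt = np.empty_like(frame)
rebuilt[0::2, :] = top_field
rebuilt[1::2, :] = bottom_field
assert np.array_equal(rebuilt, frame)
```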
The decision about how to encode pictorial data can be made adaptive to image content and rate-
distortion considerations, and both picture types can be mixed within an image sequence as shown in the
figure. This mixture is not arbitrary, i.e. if the first field of a frame is a P (B) picture, then the second
field should also be P (B). If the first field is I, the second should be I or P.
Due to the reduced vertical correlation present in field pictures (which is due to the reduced
resolution in this direction; a field is basically the result of a downsampling by 2 operation), an alternate
scan pattern is used after transform and quantisation, as shown in the figure.
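The actual alternate scan order is a fixed permutation tabulated in the standard (it visits vertical frequencies earlier than the classic zig-zag). The sketch below only illustrates the mechanism of applying a scan order to an 8 × 8 coefficient block, using the classic zig-zag order, generated programmatically, as the example:

```python
def zigzag_order(n=8):
    """Classic zig-zag scan order for an n x n block, generated programmatically."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def scan_block(block, order):
    """Flatten a coefficient block into a 1-D list following the given scan order."""
    return [block[r][c] for r, c in order]

# Example: scan a toy 4x4 block along the zig-zag order.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
print(scan_block(toy, zigzag_order(4)))
```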
Due to the introduction of field and frame pictures, the motion estimation and compensation process can
offer more flexibility and more modes to exploit temporal correlations. For field pictures, field-based
prediction can be done by computing a single motion vector per macroblock in each direction or by
computing two motion vectors in each direction (two MVs per field, using two fields as reference).
Dual prime prediction (see below) is different. For frame pictures, frame-based prediction can be
done, where one MV is computed per macroblock in each direction (the fields of the reference frame
may have been compressed individually as two field pictures or together as a frame picture).
Field-based prediction in frame pictures (as shown in the figure) computes two motion vectors per
macroblock for each prediction direction, one MV for each of the two fields. Again, the fields of the
reference frames can be compressed individually or together.
A specific case of prediction is “dual prime” prediction. For field pictures, data from two reference fields
of opposite parity is averaged to predict a field picture. A single MV is computed for each macroblock of
the field. This motion vector V points to the reference field of the same parity. The second MV, V1,
pointing to the field of the other parity, is derived from V under the assumption of a linear motion
trajectory (i.e. no acceleration). The prediction a is subsequently obtained by averaging b and c. This
mode is restricted to P pictures and may only be used if there are no B pictures between the predicted
and the reference fields.
In frame pictures, dual prime prediction is a special case of field-based prediction. A single MV V is
computed for each macroblock of the predicted frame picture. Using this vector, each field in the
macroblock is associated with a field of the same parity in the reference frame. Motion vectors V1 and
V2 are derived from V, again based on the linear motion trajectory assumption. Predictions for a and b
are based on averaging c, e and d, f, respectively.
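The linear-trajectory assumption behind deriving V1 (and V2) can be illustrated by a simplified, hedged sketch: the derived vector is the transmitted vector scaled by the ratio of the temporal distances to the two reference fields. The actual MPEG-2 derivation additionally adds a small transmitted differential vector and a vertical correction for the opposite parity, both omitted here:

```python
# Simplified sketch of the linear motion trajectory assumption used in dual prime
# prediction. Function name, argument layout and the omission of the differential
# vector / parity correction are assumptions for illustration only.

def derive_dual_prime_vector(v, dt_same_parity, dt_opposite_parity):
    """v: (vx, vy) vector to the same-parity reference field;
    dt_*: temporal distances (in field periods) to the two reference fields."""
    scale = dt_opposite_parity / dt_same_parity
    return (v[0] * scale, v[1] * scale)

# Example: reference fields two and one field periods away, V = (4, -2) in half-pel units.
print(derive_dual_prime_vector((4, -2), dt_same_parity=2, dt_opposite_parity=1))  # (2.0, -1.0)
```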
In frame pictures, it is possible to switch prediction modes on a macroblock basis. For example, the
encoder may prefer field-based prediction over frame-based prediction in areas with sudden and
irregular motion, since in that case the even and odd fields of the predicted and the reference frame
differ noticeably and a single frame-based prediction does not match well.
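How such a per-macroblock mode decision could be made is sketched below; the plain SAD criterion and the helper names are assumptions, not something mandated by the standard:

```python
import numpy as np

# Sketch of an encoder-side decision: compare the prediction error (sum of
# absolute differences) of frame-based and field-based prediction for one
# macroblock and pick the cheaper mode.

def sad(a, b):
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def choose_prediction_mode(macroblock, frame_pred, field_pred):
    """macroblock, frame_pred, field_pred: 16x16 arrays of pixel values."""
    cost_frame = sad(macroblock, frame_pred)
    cost_field = sad(macroblock, field_pred)
    return "field" if cost_field < cost_frame else "frame"
```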
The concept of scalability refers to the ability of a decoder to decode only a part of a scalable bitstream
and still generate a “useful” video. E.g., a lower-resolution video can be obtained without decoding the
entire bitstream first and then lowering the resolution by postprocessing, and without simulcasting two
distinct bitstreams corresponding to different resolutions. There are three types of scalability in video
coding:
1. Temporal scalability (different frame rates)
2. Spatial scalability or resolution scalability
3. Quality scalability (different levels of SNR / picture quality)
In MPEG-2, scalability is achieved by the concept of layered compression, i.e. by generating a base layer
and additional enhancement layers, which successively improve the quality of the base layer if decoded
together. In all schemes of this type, higher computational demand and reduced compression
efficiency are usually observed as the price of the increased functionality.
Temporal scalability is achieved quite easily by assigning a set of suitable B frames to the enhancement
layer. In order to keep the compression degradation as small as possible, correlations between the frames
in the base layer and those in the enhancement layer need to be exploited. While the first scheme
depicted is similar to the classical prediction in a GOP (here prediction is only from base layer to
enhancement layer), the second one applies a different prediction structure (including intra-
enhancement-layer prediction).
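The assignment of B frames to the enhancement layer can be illustrated with a toy GOP (the IBBP… pattern below is just an example, not taken from the text):

```python
# Temporal scalability sketch: B frames go to the enhancement layer, so a decoder
# that reads only the base layer plays the sequence at a reduced frame rate.

gop = ["I", "B", "B", "P", "B", "B", "P", "B", "B"]

base_layer        = [(i, t) for i, t in enumerate(gop) if t in ("I", "P")]
enhancement_layer = [(i, t) for i, t in enumerate(gop) if t == "B"]

print("base layer (1/3 frame rate):", base_layer)
print("enhancement layer:", enhancement_layer)
```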
Spatial scalability is accomplished in a similar manner as in hierarchical JPEG. Video frames are
represented as spatial pyramids; prediction for the enhancement-layer frames is done based on base-
layer data (interpolated from the lower resolution) and based on temporal prediction within the
enhancement-layer data itself.
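A minimal sketch of this pyramid-style prediction, assuming simple 2×2 averaging for downsampling and nearest-neighbour interpolation for upsampling (temporal prediction in the enhancement layer is omitted for brevity):

```python
import numpy as np

# Spatial scalability sketch: the base layer is a downsampled version of the frame;
# the enhancement layer codes the difference between the full-resolution frame and
# the interpolated base-layer reconstruction.

def downsample(frame):                 # 2x2 averaging, an assumption
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(frame):                   # nearest-neighbour interpolation, an assumption
    return np.repeat(np.repeat(frame, 2, axis=0), 2, axis=1)

frame = np.random.rand(16, 16)
base = downsample(frame)               # coded as the base layer
residual = frame - upsample(base)      # coded as the enhancement layer

reconstruction = upsample(base) + residual
assert np.allclose(reconstruction, frame)
```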
Finally, quality scalability is achieved based on differently quantised versions of the video (as opposed
to the lower resolutions used in resolution scalability). The reconstructed lower-quality version is used as
the prediction for the enhancement layer, in addition to motion-compensated data from the enhancement
layer itself. The scalability profiles in MPEG-2 (see below) were not very successful, since even when
using only a few layers, compression performance is drastically reduced while complexity increases
significantly.
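A minimal sketch of this layering with two quantiser step sizes (arbitrary example values, not taken from the standard) could look as follows:

```python
import numpy as np

# Quality (SNR) scalability sketch: the base layer carries coarsely quantised
# coefficients, the enhancement layer carries the requantised difference to the
# original coefficients.

coeffs = np.array([120.0, -34.0, 7.0, 3.0, -1.0])

q_base, q_enh = 16.0, 2.0
base = np.round(coeffs / q_base) * q_base            # coarse base-layer reconstruction
enh  = np.round((coeffs - base) / q_enh) * q_enh     # finer residual in the enhancement layer

print("base only  :", base)
print("base + enh :", base + enh)                    # closer to the original coefficients
```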
Since MPEG-2 is also intended for broadcast applications where transmission errors are to be expected
(e.g. DVB-T, DVB-S, ...), several techniques for coping with these errors have been incorporated into
the standard. In particular, “error concealment” strategies have been suggested for the decoder. In P
and B frames, a lost macroblock is substituted with the motion-compensated macroblock of the
previously decoded frame. As motion vector, the MV belonging to the macroblock above the lost
macroblock is used. If the macroblock above used for concealment is intra coded, there are two options:
• If an MV for the intra block has been included (the standard allows this for concealment
purposes), the concealment is done as described before.
• Otherwise, the concealment is done using the co-located macroblock in the previous
frame (i.e. assuming zero motion). For the macroblocks in the first line of a slice, concealment is done in
the same manner.
In I frames, error concealment is done in the same manner. The scheme can be improved by using the two
adjacent macroblocks; if enough MVs are retained, motion-compensated concealment can be done
to approximate the lost macroblock.
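A hedged sketch of the motion-compensated concealment step described above (the helper names, array layout and border clipping are assumptions):

```python
import numpy as np

# Concealment sketch: replace a lost macroblock with the macroblock of the
# previously decoded frame, displaced by the MV of the macroblock above it;
# fall back to zero motion if no such vector is available.

MB = 16  # macroblock size in pixels

def conceal_macroblock(prev_frame, mb_row, mb_col, mv_above=None):
    """Return a 16x16 replacement block for the lost macroblock at (mb_row, mb_col)."""
    dy, dx = mv_above if mv_above is not None else (0, 0)   # zero-motion fallback
    y = mb_row * MB + dy
    x = mb_col * MB + dx
    # Clip so the displaced block stays inside the reference frame.
    y = max(0, min(y, prev_frame.shape[0] - MB))
    x = max(0, min(x, prev_frame.shape[1] - MB))
    return prev_frame[y:y + MB, x:x + MB]
```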
Different functionalities are supported by the different profiles of the standard, which are basically different
algorithms. The different levels correspond to the different parameter ranges that are supported (e.g.
frame resolution, frame rate, etc.). An MPEG-2 decoder does not need to support all profiles and levels;
instead, a given decoder usually supports one specific profile and level combination (see the figure for
examples).
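How such a profile/level capability check could look is sketched below; the table layout and function are assumptions, and the Main Profile @ Main Level limits (720 × 576, 30 fps, 15 Mbit/s) are the commonly cited values, included only as an illustration:

```python
# Sketch of a decoder capability check: each supported profile@level combination
# maps to the parameter limits the decoder guarantees to handle.

SUPPORTED = {
    ("Main", "Main"): {"max_width": 720, "max_height": 576,
                       "max_fps": 30, "max_bitrate_bps": 15_000_000},
}

def can_decode(profile, level, width, height, fps, bitrate_bps):
    limits = SUPPORTED.get((profile, level))
    if limits is None:
        return False
    return (width <= limits["max_width"] and height <= limits["max_height"]
            and fps <= limits["max_fps"] and bitrate_bps <= limits["max_bitrate_bps"])

print(can_decode("Main", "Main", 720, 576, 25, 6_000_000))   # True
```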
MPEG-4
MPEG-4 (ISO/IEC 14496) currently consists of 27 (!) parts, out of which we will describe Part 2 (Visual)
and later also Part 3 (Audio). MPEG-4 Part 10 is commonly referred to as MPEG-4 AVC (Advanced Video
Coding) or H.264. MPEG-4 is targeted at applications in the multimedia area. Since interactivity is an
important aspect of many multimedia applications, providing this functionality is one of the main aims
of the standard. Due to the frame-oriented processing in the former standards, interactivity was very
limited. MPEG-4 uses the concept of (arbitrarily shaped) video objects instead of frames to enable object-
based interactivity.
MPEG-4 fuses ideas from the following (successful) fields:
• Digital TV
• Interactive graphics applications (e.g. computer games)
• Interactive multimedia applications (e.g. WWW)
The aim is to improve the applicability for users, service providers, and content creators:
• Content creators: due to object-based representation, parts of the video data can be kept and re-used.
Also copyright can be protected in a more focussed manner.
• Service providers: network load can be controlled better due to MPEG-4 QoS descriptors.
• Consumers: facilitating interactivity with multimedia data, over high-bandwidth networks or mobile
networks.
As a consequence, MPEG-4 does not only provide an improvement with respect to compression but
is mainly focused on improved functionality. This is achieved by standardisation in the following areas:
• Representation of “media objects”, which are units of visual or audio (or both) data and can be of
natural or synthetic origin.
• Composition of media objects into an audiovisual scene.
• Multiplexing and synchronisation of data corresponding to media objects for network transfer
according to the respective QoS requirements.
• Interaction between user and audiovisual scene (note that this requires an uplink which does more than
FFW or RWD).
MPEG-4 audio-visual scenes are composed of several media objects which are organised in a
hierarchical fashion (tree). The leaves of the tree are “primitive media objects” like still images, video
data, audio objects, etc. Additionally, special objects are defined, like text, graphics, synthetic heads and
faces with associated text to synthesise speech in a manner synchronised with the head animation (talking
head), synthetic sound, etc. Scene description is standardised (with features similar to computer graphics and
virtual reality – VRML) and supports the following functionalities: