INTERNATIONAL ORGANISATION FOR STANDARDISATION

ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC1/SC29/WG11

CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC1/SC29/WG11/N2195

MPEG 98

March 1998/Tokyo

Source: Requirements Group

Title: MPEG-4 Applications

Status: Approved

MPEG-4 Applications



1 Introduction

Tools are built to enable applications. This document therefore lists a number of applications that are enabled by the tools and methods currently standardized within MPEG-4. This preliminary summary certainly does not cover every application enabled by MPEG-4; rather, it gives an idea of what is or will be possible using MPEG-4 technology. The document is intended to describe and highlight possible future uses of MPEG-4 technology and to stimulate the creation of new application scenarios in which the MPEG-4 functionalities enable new systems or services.

Most of the applications listed in this document were used to collect the requirements for MPEG-4, which can be found in the MPEG-4 Requirements Document [1]. The Object Profiles listed in the MPEG-4 Profiles Document [2] were also written with most of these applications in mind. However, the requirements and profiles summarized in those documents are not based exclusively on the applications described here.

Please note that MPEG-4 version 1 may not support all applications listed in this document; the tools are listed irrespective of their availability in version 1 or 2.

2 Applications

2.1 Real Time Communications

2.1.1 Application Description

Real-time Communications systems are targeted toward applications which encompass two-way human interaction, or one-way applications that impose strict one-way delay constraints. A videophone system is a prime example of a two-way real-time system. An example of a one-way delay constrained system is a surveillance system, which because of its importance is described later as an application of its own.

One key feature of real-time systems is that, if both audio and video are present, they are synchronized so that the viewer is given the impression of lip synchronization. Interaction between the users of two-way systems requires that the overall end-to-end delay be relatively small and fairly constant. Usability studies have shown that the maximum tolerable one-way delay is approximately 400 ms; end-to-end delays of much less than 400 ms are desirable, however.

The underlying transport system for real-time communications applications is likely to encompass a broad cross section of technologies. A key attribute of the real-time communications application is the ability to operate successfully over a wide variety of media, including low and high mobility wireless channels, LAN transmission channels, and PSTN and ISDN transmission channels. Interworking between various media channels should be supported. Operation in a multipoint configuration is also envisioned.

It is expected that real-time communications systems will operate in a variety of different system configurations, including those where the complexity of the encoding/decoding process constitutes a major design constraint. Audio and/or visual quality may be traded off against delay and complexity such that a balance is found between the desire for high quality audio/video and the need to provide low delay operation at a reasonable complexity.

2.1.2 Application-Specific Requirements

The following MPEG-4 functionalities are essential for the Real-time Communications application:

• Improved Coding Efficiency

• Robustness in Error-Prone Environments

• Synchronization

• Virtual Channel Allocation Flexibility

• Low End-to-End Delay Mode

• User Controls

• Transmission Media Interworking

• Interworking with Other Audio/Visual Systems

• Low Bitrate Mode

• Low Complexity Decoder Mode

The following functionalities are desirable for the Real-time Communications application:

• Improved Temporal Random Access

• Content-Based Scalability

• Auxiliary Data Capability

• Multipoint Capability

2.1.3 Proposed MPEG-4 Tools

Visual: Intra coding mode (I-Mode); Inter prediction mode (P-Mode); AC/DC prediction; Slice resynchronisation; Reversible VLC tables; Data partitioning; Overlapped block motion compensation (OBMC); 4 motion vectors per macro block; Unrestricted motion vectors; H.263/MPEG-2 quantization tables

Audio: CELP; HVXC; HILN

Systems: Systems Decoder Model Tool; Object Descriptor Tool; Synchronization Tool; FlexMux Tool

Delivery: DMIF-Application Interface (DAI); DMIF-Network Interface (DNI)


2.2 Surveillance

2.2.1 Application Description

Many modern surveillance sensors produce output in the form of images or sequences of images (i.e. video). Audio sensors are also used. In many applications, these sensors are connected via a telecommunications system to one or more terminals that provide both monitoring and control. Unlike video conferencing, the surveillance application usually involves unidirectional communication of audiovisual data, with only control and configuration data on the reverse channel.

Surveillance imposes a different concept of quality from other applications, such as entertainment. Subjective degradation in images, video or sound is important only if it inhibits their use. In a perimeter surveillance system around a factory, this might occur if the degradation prevents a human operator from detecting an intruder. It might also occur if the degradation increases the fatigue suffered by the operator such that the operator’s ability to detect intruders is decreased.

Many of these sensors generate video that does not conform to the traditional digital video formats, in which each pixel comprises a luminance component and two chrominance components, each of which is represented with a precision of 8 bits. Many surveillance sensors, such as a range of commercially available infrared imaging systems, generate digital video that is represented with a precision of up to 12 bits. Often, this video contains only a luminance component.
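Displaying such video on a conventional 8-bit monitor involves mapping each high-precision sample to a display value. The following is a minimal sketch of one such mapping, assuming a simple linear brightness/contrast model. The function names and the two control parameters are illustrative assumptions and are not defined by MPEG-4.

```python
def map_to_display(sample, brightness=0.0, contrast=1.0, in_bits=12, out_bits=8):
    """Map one N-bit luminance sample to a display value.

    brightness and contrast are hypothetical operator controls:
    contrast scales the signal around mid-grey, brightness shifts it,
    both on a normalised 0..1 scale.
    """
    in_max = (1 << in_bits) - 1
    out_max = (1 << out_bits) - 1
    x = sample / in_max                          # normalise to 0..1
    y = (x - 0.5) * contrast + 0.5 + brightness  # apply operator controls
    y = min(1.0, max(0.0, y))                    # clip to the displayable range
    return round(y * out_max)


def map_frame(frame, **controls):
    """Apply the same mapping to every sample of a luminance-only frame."""
    return [[map_to_display(s, **controls) for s in row] for row in frame]


if __name__ == "__main__":
    frame = [[0, 1024, 2048, 4095]]
    print(map_frame(frame))                 # default mapping
    print(map_frame(frame, contrast=2.0))   # steeper mapping around mid-grey
```

Raising the contrast steepens the mapping around mid-grey at the cost of clipping very dark and very bright samples, which is why the full sensor precision has to survive coding if the operator is to adjust the mapping at the terminal.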

Often, a surveillance operator will use contrast and brightness controls to change the mapping from sensor sample values to displayed intensities on a regular basis.

Delay is often an important consideration in surveillance systems. It is often desired that action be taken immediately upon detection of particular events. It may also be necessary for a video camera to be controlled in real time by a remote operator. This can only be achieved if the coding, transmission and decoding delays are sufficiently small. It is important to recognize, however, that these delay constraints are unlikely to be as tight as those imposed by real-time conversational services such as video conferencing.

2.2.2 Application-Specific Requirements

The following functionalities are considered to be essential for the application scenario of Surveillance:

• Improved coding efficiency

• Robustness in error-prone environments

• Content-based Scalability

• Synchronization

• Virtual channel allocation flexibility

• User controls

• Transmission media interworking

• Interworking with other audio/visual systems

• Channel hopping

• Low bitrate mode

• Low complexity decoder mode

• Support for pixel precision of up to 12 bits (possibly luminance only)

• Improved temporal random access

• Multipoint capability

2.2.3 Proposed MPEG-4 Tools

Visual: Intra coding mode (I-Mode); Inter prediction mode (P-Mode); AC/DC prediction; Reversible VLC tables; Slice resynchronisation; Data partitioning; Binary shape coding; P-VOP based temporal scalability; H.263/MPEG-2 quantization tables; Overlapped block motion compensation (OBMC); 4 motion vectors per macro block; Unrestricted motion vectors; N-bit Video

Audio: CELP; HVXC; HILN

Systems: Systems Decoder Model Tool; Object Descriptor Tool; Synchronization Tool; FlexMux Tool

Delivery: DMIF-Application Interface (DAI); DMIF-Network Interface (DNI)


For Access Control, the first three Visual tools and the first three Audio tools are most important.

2.2.4 Prototypical Example - Access Control to Information Systems or Secured Buildings

The aim of such a surveillance application is to control and regulate the access of different users to secured information systems (e.g. servers, computer networks, private sites) or secured buildings (e.g. the central headquarters of banks or companies).

Nowadays, the most established way to secure access to information systems is to use a password or to identify the machine. Similarly, access to secured buildings is in most situations controlled by means of a simple badge reader. Modern multimodal techniques such as voice recognition and face identification can now be used to authenticate a claimed identity with improved accuracy.

The following functionalities are desirable for the application scenario of Access Control:

• Improved temporal random access

• Improved scalability for low bit rates (Audio)

• SNR scalability

Scalability appears to be a very important feature, since these kinds of applications may be very sensitive to network congestion. Indeed, such applications are expected to become very popular on the Internet in the form of Web banking or electronic commerce, and on networks such as the Internet highly scalable schemes will definitely be needed. Moreover, high scalability helps to ensure that, when the transmission network is congested, the signals are of the best possible quality.

In addition, it is foreseen that future developments of this type of application will benefit from lip features for tracking lip motion. Tracking lip motion will be the only way to prevent an intruder from fooling the authentication scheme by using the badge, a photo and a tape recording of an authorized user.

The system will include a camera that takes a frontal image of the user's face, and a microphone to record the user's voice. The application is based on verification, which means that the user claims a certain identity and the application determines whether or not the user is an impostor.


Figure: Access Control to secured Building

The procedure to complete an access trial is the following:

The user enters an identification code into the system through a keyboard or magnetic card.

One or several photos of the user are taken. These images are either transmitted to the information (or control) site, where facial features are extracted and compared to those obtained from the training of the claimed user, or the facial features are extracted on the local server or computer before being sent to the information site.

The user is also asked to pronounce a short sentence. As above, the speech signal is either transmitted to the information (or control) site, where voice features are extracted and compared to those obtained from the training of the claimed user, or the voice features are extracted on the local computer (or local station) before being sent to the information site.

Finally, the system grants or denies access based on the fusion of both verification algorithms, face recognition and speech recognition.
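The document does not specify how the two verification results are fused. As a minimal sketch, assuming each verifier returns a similarity score between 0 and 1 and that a weighted sum is an acceptable fusion rule, the final decision could look as follows. The class, the weight and the threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Verification:
    face_score: float   # similarity to the claimed user's face model, 0..1
    voice_score: float  # similarity to the claimed user's voice model, 0..1


def fuse(v, face_weight=0.5):
    """Weighted-sum fusion of the two scores (an assumed fusion rule)."""
    return face_weight * v.face_score + (1.0 - face_weight) * v.voice_score


def authorize(v, threshold=0.7, face_weight=0.5):
    """Grant access only if the fused score reaches the threshold."""
    return fuse(v, face_weight) >= threshold


if __name__ == "__main__":
    print(authorize(Verification(face_score=0.9, voice_score=0.8)))  # both agree
    print(authorize(Verification(face_score=0.9, voice_score=0.2)))  # voice mismatch
```

With equal weights, a strong face match cannot compensate for a poor voice match, which reflects the motivation for using two modalities.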

2.3 Mobile Multimedia

2.3.1 Application Description

Mobile computing means the use of a portable computer capable of wireless communication. That is, a portable computer is used not only for local, standalone data processing, but also for wireless communication while the user is on the move. In a typical mobile computing scenario, a mobile user communicates with a remote computer system using, e.g., a notebook or a Personal Digital Assistant (PDA) via wireless communication links.

Mobile multimedia applications face technical challenges that are significantly different from the problems typically encountered with desktop multimedia applications. This is because current mobile computing technologies are subject to inherent limitations such as limited computation capacity, narrow bandwidth, and unsatisfactory reliability of the underlying transmission media.

Besides high compression performance, adaptivity is also a very important requirement for mobile applications, for the following reasons:

• Diversity of mobile devices (e.g. PDA, subnotebooks, notebooks, or portable workstations) in regard to available resources and the diversity of wireless networks (e.g. HIPERLAN, GSM, UMTS, or satellite) in regard to network topology, protocols, bandwidth, reliability etc.

• The need to be able to trade off quality, performance and cost.

MPEG-4, designed as an adaptive representation scheme that also accommodates very low bitrate applications, is very appropriate for mobile multimedia applications. Concretely, MPEG-4 is useful because:

• High compression performance can be achieved.

• Flexible encoding and decoding complexity (e.g., different spatial resolutions, temporal resolutions, and quality levels) enables a very flexible trade-off between quality, performance and cost.

• Object-based coding functionalities allow for interaction with audiovisual objects and enable new interactive applications in a mobile environment.

• Face animation parameters can be used to reduce bandwidth consumption for real-time communication applications in a mobile environment, e.g. mobile conferencing.

2.3.2 Application-Specific Requirements

• Improved coding efficiency

• Robustness in error-prone environments

• Multiplexing of audio, video, and other information

• Face animation parameters

• Synchronization

• Transmission media interworking

• Interworking with other audiovisual systems

• User controls (e.g., sensitive regions, fast forward, pause, etc.)

• Content-based coding and interaction

• Coding flexibility

• Low bitrate mode

• Low complexity decoder mode

• Improved temporal random access

• Feedback, capability exchange

• Low power consumption

2.3.3 Proposed MPEG-4 Tools

Visual: Intra coding mode (I-Mode); Inter prediction mode (P-Mode); AC/DC prediction; Reversible VLC tables; Slice resynchronization; Data partitioning; Binary Shape Coding; P-VOP based temporal scalability; H.263/MPEG-2 quantization tables; Face Animation Parameters (FAP)

Audio: HVXC; HILN; AAC; TTS; SASL; MIDI

Systems: Systems Decoder Model Tool; BIFS Scene Description Tool; Object Descriptor Tool; Synchronization Tool; FlexMux Tool

Delivery: DMIF-Application Interface (DAI); DMIF-Network Interface (DNI)


2.4 Infotainment

2.4.1 Application Description

Because interaction with AV objects is considered the most important aspect of MPEG-4, infotainment applications, which combine entertainment and information, are well within its scope. Generally, the users of such systems have the means both to get information about a specific subject of interest and to configure the multimedia environment and amuse themselves within it. The interactivity aspect includes, e.g., requesting additional objects and changing the content of existing scene nodes.

A key feature of infotainment applications is the wide variety of features they necessarily combine. Typical infotainment applications will make heavy use of natural and synthetic audio and video, e.g. in the form of spoken text and music of all kinds with underlying visual animation. For this kind of application it will be necessary to guarantee a high quality of presentation throughout the session if the user is not to become bored. The quality aspect covers both high AV quality and constraints on end-to-end latency.

Typical scenarios for infotainment include the use of simple PCs or set-top boxes at home, or public terminals in shopping centers or visitor centers, with their particular assumptions concerning hardware, e.g. for user input and for data storage. In the latter applications, a touch screen on the one hand and a high percentage of locally available data (e.g. on CD-ROM) on the other are common.

MPEG-4 provides an ideal framework for infotainment applications:

• It will provide the means to combine a highly diverse set of multimedia types within a presentation scenario in a standardized way.

• The composition concepts, which will cover 2D as well as 3D, will be the basis for mixing all kinds of data types within a consistent object handling and user interaction paradigm.

• MPEG’s tradition is to achieve the highest possible quality with existing techniques, which suits the demanding nature of infotainment applications.

2.4.2 Application-Specific Requirements

The infotainment-specific requirements are listed below in typical order of importance:

• Multiplexing of audio, video, and synthetic contents

• User controls (e.g. sensitive regions, fast forward, pause, etc.)

• Content-based coding and interaction

• Synchronisation

• Improved temporal random access

• Face animation parameters

• Transmission media interworking

• Interworking with other audiovisual systems

2.4.3 Proposed MPEG-4 Tools

Visual: Intra coding mode (I-Mode); Inter prediction mode (P-Mode); AC/DC prediction; Slice resynchronization; Data partitioning; Binary Shape Coding; P-VOP based temporal scalability; H.263/MPEG-2 quantization tables; Face Animation Parameters

Audio: AAC; CELP; TTS; SAOL; SASL

Systems: Systems Decoder Model Tool; BIFS Scene Description Tool; Object Descriptor Tool; Synchronization Tool; FlexMux Tool

Delivery: DMIF-Application Interface (DAI); DMIF-Network Interface (DNI)


2.4.4 Prototypical Example - Virtual-City-Guide

The targeted application, an enhanced virtual city-guide, consists of three parts:


Figure: Infotainment Prototype Front-end

In the first part, the main square, the focus is both on information and entertainment. The background is a video object of the main square of a virtual city with its three major sights: a Roman pillar, a dungeon and an interactive video drome. When the user requests information about one of the sights, additional video objects will appear, talk about the history of the virtual city and mention other sights of interest.

The second part takes place inside of an interactive video drome. This is a synthetic environment, where the user can interact with the audio/visual objects. The objects are a collection of musicians, singers and dancers. The focus here is upon interactive composition and the user can add synthetic and natural objects (artists) and move them around.

The third part, a kind of maze, takes place inside the dungeon. It is a multi-user 3D game in which players around the world can play together on the same game field. The inclusion of this feature, which will probably be VRML based, will be a nice demonstration of MPEG-4 and VRML coexisting on one terminal.

Summarising the example, infotainment is achieved as follows: the information part is concentrated in the underlying city guide, while the entertainment aspects are concentrated in the 3D multi-user game. The video drome is a mixture, as it can provide both entertainment and information concerning, e.g., classical music or dancing.

2.5 DVD

2.5.1 Application Description

DVDs are characterized by their large storage capacity (4.7 Gbytes/layer) and relatively slow access (on the order of 100 ms). There are both read-only and read/write DVDs. Their main application areas are interactive movies, knowledge and travel guides, self-learning, games, Internet Karaoke, and interaction with other incoming bitstreams such as broadcasting or the Internet.

The interactive movie application enables the audience to interact with the content being reproduced. One example is story selection by the audience in the middle of reproduction. Another is a parental switch that suppresses scenes unsuitable for children. For these purposes, the MPEG-4 system composition should be able to accept inputs from users and change the object composition while bitstreams are being decoded. The self-learning application and games also require this interaction capability.

The Internet Karaoke application combines incoming music and text streams with image sequences stored on DVD. This application requires information linking and synchronization between the incoming streams and the streams stored on DVD. This should also be realized by the system composition.

An example of the DVD-RAM application is time-shifted reproduction of broadcast programs. Users can enjoy broadcast programs at any time by first storing the streams on DVDs. During reproduction of the stored streams, commercials may be skipped automatically, or the reproduction speed may differ from the original.

Another example is the daytime exploration of a database downloaded overnight. Newspaper data may be downloaded very early in the morning, or a newly published patent gazette may be downloaded at night when network traffic is low. Even a whole day of TV programs, except for live broadcasts, might be downloaded at night. For these applications, very high speed transmission lines and high-bandwidth DVD drives are required.

2.5.2 Application-Specific Requirements

The following functionalities are essential for the DVD application:

• Composition interactivity

• Objects synchronization

• Improved coding efficiency

• Improved temporal random access

• Content-based scalability

• Auxiliary data capability

• Multistream reproduction capability

• Gray shape

• Shape only object

• Low Decoding Delay Mode

• User Controls

• Interworking with Other Audio/Visual Systems

2.5.3 Proposed MPEG-4 Tools

Visual: Intra coding mode (I-Mode); Inter prediction mode (P-Mode); AC/DC prediction; Bi-directional prediction mode (B-Mode); H.263/MPEG-2 quantization tables; Overlapped block motion compensation (OBMC); 4 motion vectors per macro block; Unrestricted motion vectors; Static Sprites; Grey scale alpha shape

Audio: AAC

Systems: Systems Decoder Model Tool; Object Descriptor Tool; Synchronization Tool; FlexMux Tool; BIFS Scene Description Tool

Delivery: DMIF-Application Interface (DAI); DMIF-Network Interface (DNI)


2.6 Content based Storage and Retrieval

2.6.1 Application Description

In the following, the term ‘content-based’ will be used rather loosely to refer to systems that aim to provide access based on attributes associated with the video or audio content, where these attributes may be keywords (often a semantic description) as well as numerical attributes. Generally, the purpose of such libraries is to assist in the management of large collections of digital video assets. For example, many of these systems rely on temporal and/or spatial segmentation of an audiovisual stream. Although the segmentation process itself is not a subject of MPEG standardization, providing mechanisms to efficiently access such temporal or spatial segments is within the scope of MPEG-4. In addition to accessing these segments by more traditional methods (e.g., fast forward to a particular temporal segment), it is also desirable to be able to access (query or browse) these segments based on the textual and numeric attributes associated with them.

Many content based digital library applications will require the rapid comparison of the attributes associated with the assets stored in the database with a representative set of attributes defining a query. Often, in the case of browsing, the query is defined by one or more examples and their associated attributes. To make the query of very large collections feasible the attributes used for the indexing/retrieval must be accessible without decoding the entire audiovisual stream.
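The query-by-example idea can be sketched as follows, assuming the attributes are held in a side index that can be read without touching the coded audiovisual data. The index layout, the asset names and the distance measure (keyword overlap plus Euclidean distance on the numeric attributes) are all illustrative assumptions.

```python
import math

# Hypothetical side index: asset id -> (keywords, numeric attribute vector).
# Only this index is consulted; the coded streams are never decoded.
INDEX = {
    "clip-a": ({"harbour", "night"}, (0.1, 0.8)),
    "clip-b": ({"harbour", "day"}, (0.2, 0.7)),
    "clip-c": ({"forest", "day"}, (0.9, 0.1)),
}


def distance(query, entry):
    """Combine keyword dissimilarity with numeric attribute distance."""
    q_keywords, q_numeric = query
    e_keywords, e_numeric = entry
    union = q_keywords | e_keywords
    overlap = len(q_keywords & e_keywords) / len(union) if union else 1.0
    return (1.0 - overlap) + math.dist(q_numeric, e_numeric)


def query_by_example(index, example_id, k=2):
    """Return the k assets whose attributes are closest to the example's."""
    query = index[example_id]
    others = (asset for asset in index if asset != example_id)
    return sorted(others, key=lambda asset: distance(query, index[asset]))[:k]


if __name__ == "__main__":
    print(query_by_example(INDEX, "clip-a"))
```

Because the comparison touches only the index entries, its cost depends on the number of assets and the size of their attribute sets, not on the length of the stored streams.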

In this context, a fundamental component of such systems is the use of a "decision support representative" (DSR), which represents a large audiovisual asset in a very condensed form, allowing the user to decide on the appropriateness of an asset for his purposes. The exact nature of the DSR can be very application specific (e.g., icons, single representative frames, a small mosaic of frames, etc.); however, it is desirable to have support for the efficient storage of and access to DSRs. The DSR is assumed to take the form of one of the stream types defined by the compatibility requirements (e.g., MPEG, JPEG, etc.). It should be noted that the primary purpose of using MPEG-4 here is to support storage and access/browsing for the purpose of identifying the assets of interest. It is not intended to address the delivery of the assets, which may in fact be on film media.

2.6.2 Application-Specific Requirements

The following functionalities are essential for the application scenario of Content based Storage & Retrieval:

• Content based multimedia access tools

• Content based manipulation and bitstream editing

• Improved temporal random access

• Support of Decision Support Representative (DSR)

The following functionalities are desirable:

• Content based scalability

• User controls

2.6.3 Proposed MPEG-4 Tools

Visual: Intra coding mode (I-Mode); Inter prediction mode (P-Mode); AC/DC prediction; Binary shape coding; P-VOP based temporal scalability; Overlapped block motion compensation (OBMC); 4 motion vectors per macro block; Unrestricted motion vectors; H.263/MPEG-2 quantization tables; Frame-based spatial scalability

Audio: AAC

Systems: Systems Decoder Model Tool; Object Descriptor Tool; Synchronization Tool; FlexMux Tool; BIFS Scene Description Tool

Delivery: DMIF-Application Interface (DAI); DMIF-Network Interface (DNI)


2.7 Streaming Video on the Internet/Intranet

2.7.1 Application Description

Streaming Video on the Internet is an application which enables video transmission from a server to clients over the Internet. Unlike a file transfer, the video can be viewed as soon as data is received, without waiting for the entire file to download. Associated audio and text can be played back in correct synchronization with the video. A viewing tool at the client site can be installed, e.g., as plug-in software for a Web browser.

With currently available modems, the upper limit on the transmission bandwidth of the current Internet is still around 28.8 kbits/s or 38 kbits/s. ISDN users can use 56 kbits/s or 64 kbits/s. For the Intranet application, much higher bandwidth (e.g. up to 10 Mbits/s for Ethernet) is available. In addition, if many clients access the video data of a server at the same time, packet losses may occur and the practically available channel bandwidth will eventually decrease. Therefore, the Internet video server must cope with such heterogeneous channel conditions. It is not practical to produce several bit streams at different bit rates in real time, since the server would need many video and audio encoders operating at the same time.

Thus, an encoder producing bandwidth-scalable bitstreams is desired, while keeping the maximum coding efficiency around 20 kbits/s, since the majority of Internet users still use dial-up modems. Video quality should be judged by frame rate, picture size, picture quality, and noise shape. In the typical application, the figures that determine the quality are 5 frames/s with a 160x120 pel image size at 20 kbits/s, and 10 frames/s with a 160x120 pel (or 320x240 pel) image size at 40 kbits/s. Blocking noise is annoying, especially in very low bit rate applications. If additional bandwidth is available, picture quality should increase gradually. This bandwidth scalability should be included not only in the video encoder but also in the audio encoder. The encoder must be able to discard a part of the video bit stream with graceful degradation (i.e. without drift) to fit the channel bandwidth if the channel to the server is crowded by accesses from many clients.

As for the human interface, interactive operations on the video and audio such as Fast Forward, Fast Reverse and Pause, as well as Playback, are necessary. They can be invoked by clicking buttons in the viewing window of the Web browser. If there is text associated with the video and audio, it may be made clickable to link it to the video and audio. If the object shape of an image is embedded in the video, it can also be used as a clickable map.
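Fitting one scalable bitstream to a given channel can be sketched as follows, assuming the stream is organised as a base layer plus enhancement layers that each depend on the layers below. The layer names are hypothetical; the 20 and 40 kbits/s operating points follow the text, while the third layer is an illustrative addition.

```python
# Hypothetical layers of one scalable bitstream, lowest layer first.
# Each entry is (name, rate in kbit/s contributed by that layer).
LAYERS = [
    ("base", 20),           # cumulative 20 kbit/s, e.g. 5 frames/s at 160x120
    ("enhancement-1", 20),  # cumulative 40 kbit/s, e.g. 10 frames/s
    ("enhancement-2", 24),  # cumulative 64 kbit/s (illustrative only)
]


def select_layers(layers, available_kbps):
    """Keep the longest prefix of layers that fits the available bandwidth."""
    chosen, total = [], 0
    for name, rate in layers:
        if total + rate > available_kbps:
            break  # higher layers depend on this one, so stop here
        chosen.append(name)
        total += rate
    return chosen, total


if __name__ == "__main__":
    for channel in (28.8, 56, 64):
        print(channel, select_layers(LAYERS, channel))
```

Because the layers are dropped from the top down, a single encoded stream serves modem, ISDN and Intranet clients without a separate encoder per bit rate.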

2.7.2 Application-Specific Requirements

The following functionalities are essential for the Streaming Video on the Internet application:

• Improved Coding Efficiency Especially at Very Low Bit Rate

• Bandwidth Scalability

• Synchronization with Video and Audio

• Interactivity such as VCR like operation

• Quick Recovery after the Packet Loss

• Very Low Bit Rate Audio

• Real Time Decompression of Video and Audio by Personal Computers

The following functionalities are desirable for the Streaming Video on the Internet application:

• Clickable Map in the Video

• Global Motion Compensation

• Real Time Compression of Video and Audio

2.7.3 Proposed MPEG-4 Tools

Visual: Intra coding mode (I-Mode); Inter prediction mode (P-Mode); AC/DC prediction; Data partitioning; Binary shape coding; P-VOP based temporal scalability; H.263/MPEG-2 quantization tables; Bi-directional prediction mode (B-Mode) (?); Overlapped block motion compensation (OBMC); 4 motion vectors per macro block; Unrestricted motion vectors; Global motion compensation; Frame-based spatial scalability

Audio: AAC

Systems: Systems Decoder Model Tool; Object Descriptor Tool; Synchronization Tool; FlexMux Tool; Intermedia Format Tool; BIFS Scene Description Tool; AAVS Scene Description Tool

Delivery: DMIF-Application Interface (DAI); DMIF-Network Interface (DNI)


2.8 Broadcast

2.8.1 Application Description

The broadcasting application can be described in terms of three aspects of its architecture: the System Structure, the Communication Channel, and the Content.

System Structure

A broadcast system typically provides a complement of services over a fixed unidirectional communication channel. The overall goal of broadcasters is to provide more and better services over a given bandwidth. The use of MPEG-4 in certain broadcast application areas is expected to provide the means to achieve this goal. If broadcasters can significantly increase the number of available programs without negatively affecting the perceived quality of the overall service by the end-user, then a migration to MPEG-4 becomes realistic. More compression assuming an acceptable quality will justify increased cost/complexity in the decoder. In general, encoder complexity is not a big issue.

The overall system is constructed of a single (logical) origination point, a real-time, unidirectional communication channel and a large number of end-user receiver/decoder terminals. It is a one-to-many, or possibly a few-to-many, system. The asymmetry of the architecture leads to an emphasis on reducing complexity and cost at the receiving side, even if this implies increasing the complexity and cost at the transmitting side. The communication bandwidth is always considered a scarce commodity; therefore, compression should be maximized within the quality constraints of the particular application.

The system architecture must also provide for a means of conditional access. The stream syntax must allow for restricted viewing. The ability to gather statistics and perform billing and other administrative functions also needs to be available.

In most cases, end-to-end latency of content delivered across the broadcast communication channel path is not critical. A delay of seconds is acceptable in most applications. For brief events, however, the delay should not exceed the length of the event. For example, it should not be possible to place a wager on an event because of the delay in the arrival of that event. An element where latency can become an issue is channel (program) change time (see Communication Channel).

Communication Channel

The broadcast scenario is defined by the presence of one or more broadband, unidirectional channels providing multiple programs each of which may contain the full range of audio, video, and data content (multimedia). The broadcast environment has no requirement for a real-time backchannel in order to receive and decode the stream.

A receiver shall be able to access programming content (establish a link and session) without the use of a return path. Therefore, all pertinent information needed to understand the basic organization of a channel or channel multiplex and the data contained therein shall be available in the broadcast data on a periodic basis. This is required in order to support rapid tuning, acquisition, synchronization, parsing, decoding and presentation to the end-user or device. The acceptable delay from the time a user requests a change in programs (channel surfing) until an "acceptable" quality rendition of the requested material is presented is less than 300 to 500 msec. This value covers the total system latency, including: user interface for selection; tuning, channel coding acquisition and synchronization delays; MPEG-4 set-up, synchronization, and decoding delays; presentation delays.
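The channel change budget described above can be checked with simple arithmetic. The following is a minimal sketch in Python: the stage names follow the list above, but the millisecond values are hypothetical placeholders chosen only for illustration and are not figures from this document.

```python
# Hypothetical per-stage delays in milliseconds (illustrative assumptions only).
STAGE_DELAYS_MS = {
    "user_interface": 30,
    "tuning_and_channel_acquisition": 150,
    "mpeg4_setup_sync_decode": 200,
    "presentation": 40,
}

TARGET_MAX_MS = 500  # upper end of the acceptable range quoted in the text


def total_channel_change_ms(delays):
    """Sum all stage delays to obtain the end-to-end channel change time."""
    return sum(delays.values())


def meets_budget(delays, limit_ms=TARGET_MAX_MS):
    """Return True when the summed delay stays within the given limit."""
    return total_channel_change_ms(delays) <= limit_ms
```

With these assumed values the total stays under the 500 msec limit but exceeds the stricter 300 msec end of the range, which shows how little margin each stage has.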

The communication channel can be any of a wide range of existing and planned types of media. Decoding efficiency shall not be impeded by the use of any of the different communication channel coding structures available in the marketplace. Most of these are represented by the ITU Recommendations for satellite, cable, terrestrial, fiber, or the other emerging core methods. These Recommendations are each optimized for the media or market demands of an application, which leads to a requirement that the structure of the stream shall not be tied to a specific physical link, session, or transport data structure. The system and transport level multiplex of MPEG-4 could use an adaptation layer to map it onto various communication channel coding data structures to meet this requirement.

A broadcast service may comprise any number of programs multiplexed onto any number of communications links. The bit rates that can be expected for each link range from hundreds of bits per second up to hundreds of megabits per second and possibly beyond as technology permits. Typical audiovisual broadcast services today use links in the range of 500 kbits/s to 40 Mbits/s. Within each link a number of variable bit rate programs can be statistically multiplexed to allow for sharing of a fixed bandwidth link. This needs to be done in such a way that consistent quality for each program is maintained. MPEG-4 shall support a similar capability.
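The idea of sharing a fixed bandwidth link among variable bit rate programs can be illustrated with a very simple allocation rule. The sketch below, in Python, assumes that each program reports a single "complexity" number and that the link rate is split in proportion to it. Both the function name and the proportional rule are assumptions made for illustration; practical statistical multiplexers use more elaborate quality feedback.

```python
def allocate_bitrates(link_kbps, complexities):
    """Split a fixed link rate among programs in proportion to their complexity.

    `complexities` maps a program name to a non-negative complexity estimate.
    """
    total = sum(complexities.values())
    if total <= 0:
        # No complexity information available: share the link equally.
        share = link_kbps / len(complexities)
        return {name: share for name in complexities}
    return {name: link_kbps * c / total for name, c in complexities.items()}
```

For example, on a 40 Mbits/s link a program judged three times as demanding as another would receive three quarters of the capacity under this rule.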

Communication channels used for broadcast services tend to be susceptible to random noise and burst errors. Therefore, most ITU channel coding algorithms used for high quality audio/visual transports employ error correction techniques. Typical systems that employ these techniques operate at corrected bit error rates below 10^-6. Mobile receivers will encounter situations in which there can be a loss of signal for periods that can run into seconds or tens of seconds. When operating in such an environment the system must be capable of quick and graceful recovery from errors caused by poor signal quality or complete loss of signal due to the location of the mobile unit.

Content

The content of a broadcast service can vary depending on the type of service. The type of content can be:

I. Audio only - This can include a channel of high quality music ("CD quality") or news or condition reports on weather or traffic.

II. Audio/Visual - Two types of requirements characterize the A/V needs:

A. Real-time encoding will be required for: live events such as sports, cultural events, news or other events that do not permit off-line encoding, storage and broadcasting of the material, such as "turnarounds". A turnaround is defined as receiving an A/V signal and immediately encoding and re-broadcasting it. This obviates the need for cascaded coding of MPEG-4 with itself and other coding algorithms such as MPEG-2. This technique is used extensively in the broadcast industry today.

B. Off-line or non-real-time encoding can be used when circumstances permit. These include encoding for multiple and/or delayed broadcasts, such as commercials, daily entertainment programming and movies. Off-line encoding permits multiple passes at scenes and allows for the entire content to achieve the highest quality while managing aggregate bit rate for multiple channels and/or total storage capacity for the content (DVD).

III. Multimedia content - This implies text, still pictures, moving pictures, audio, and graphics. Current examples of multimedia broadcasting include TV station logos (watermarking), graphical overlays used in sporting events, and multi-window screen formats such as those used by Bloomberg Information Television® or the sports and stock market crawls used by CNN Headline News®.

IV. Datacasting - Datacasting service is characterized by sending information that one assumes many receivers need or wish to receive. This is achieved e.g. by multiplexing the data in a carousel structure. The repeat interval of such a carousel is based on the bit rate allocated to the service and the volume of data in the service. All forms of digital information can be multiplexed into any digital broadcasting environment. Not all data services require low latency. These can be accommodated in a large carousel that can conceivably contain vast amounts of data.

Some services, however, will require very low latency, such as download of small software or data snippets that run in set-top boxes (applets) or other end-user terminals. The length of the carousel directly affects the latency of operations by end-users and therefore must be small in those applications. Small carousels can reduce the amount of memory required in a terminal. Other forms of datacasting are also possible.
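The relation between carousel size, allocated bit rate and latency can be made concrete with a short calculation. The sketch below, in Python, assumes that each item is transmitted exactly once per carousel cycle and that a receiver tunes in at a uniformly random moment. Both assumptions, and the function names, are illustrative and not taken from this document.

```python
def carousel_cycle_seconds(volume_bytes, bitrate_bps):
    """Time needed to broadcast the whole carousel once."""
    return volume_bytes * 8 / bitrate_bps


def mean_wait_seconds(volume_bytes, bitrate_bps):
    """Average wait for the start of a given item.

    Assumes the item is sent once per cycle and the receiver tunes in
    at a uniformly random time within the cycle.
    """
    return carousel_cycle_seconds(volume_bytes, bitrate_bps) / 2
```

Under these assumptions, halving the data volume or doubling the allocated bit rate halves both the repeat interval and the average wait.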

The quality of the various types of content is application dependent and ranges from "recognizable" (task-oriented applications) to "no annoying artifacts" (entertainment) to "perfect" (most types of datacasting). Provisions for constant quality are required.

Services ranging from conventional audio/video only up to full "multimedia" will be common in the MPEG-4 time frame. The services can be heterogeneous in nature. For example, a user may wish to listen to music of his/her choice while browsing a carousel-based data (shopping) service. Quality of both audio and video may also vary according to either the type of service or the availability of receiver equipment for playback. This notion of scalability goes beyond quality. It also encompasses the ability of the receiver to utilize all or part of the broadcast service. For example, a small handheld display (PDA) may not be capable of allowing the user to view a multi-window screen. The user would, therefore, select the content that provides the greatest information on his/her current device and save the complete facility for later viewing (listening) on a more capable rendering device.

Typical video resolutions required are CCIR-601 but may migrate to High Definition resolution. Audio is currently stereo but may migrate toward a multichannel format.

Object-based Broadcast application

Consider a number of sports events such as soccer taking place concurrently at different locations throughout a country. The fans thus spend possibly a whole afternoon watching soccer on TV. TV stations want to exploit the popularity of soccer and send camera teams to all of the games. Scenes from all of the games are broadcast, and the fans at home can choose exactly which games they want to follow. They may even have the option of watching three or four games simultaneously, at different resolutions, and maybe they have the added functionality of automatic switching between games whenever a goal is scored in one of them. Also, the programs could be decodable and displayable at varying spatial resolution.

This scenario can become more complicated when different parts of the final composition are distributed via different transmission channels such as satellite, cable and terrestrial. At the receiver the different objects are combined into one scene, synchronized both spatially and temporally. Typical use of such a scheme could be locally distributed sound or text objects while the main part of the broadcast is received via satellite. Another example is the local insertion of different ‘virtual’ billboards.

This scenario will use MPEG-4 features. It uses distributed contribution, defined as a scenario where the logical sources of contribution are distributed and the different streams are collected in a server before transmission in a single multiplex. Another possibility is collecting the separate streams in the end user terminal.

Broadcast of programmes derived using Virtual Studio techniques [More information can be found at http://www.bbc.co.uk/rd/pubs/tech_inf/virtual]

The following is a special case of a broadcast application which places very high demands on the flexibility and quality of the MPEG-4 toolset, but shows what can potentially be achieved by the use of object-based coding.

Virtual production techniques are becoming increasingly used in TV production. These techniques are developments of the well-known chroma-key method which has been in use for many years. The actors perform in front of a coloured background (usually blue), and a key signal is derived from a chroma-key unit, indicating which parts of the image contain the actors. A mixer then overlays the actors on another ‘virtual’ background image, using the key signal to control a ‘soft’ switch between foreground and background. With traditional chroma-key, the studio camera cannot be moved, since the registration between actors and background would be lost. One of the new features of virtual production is that the camera can be moved, because its position and orientation are measured, and the background image is adjusted to keep the correct registration. This is usually achieved by rendering the background on a graphics computer, and updating the position of the virtual camera to match that of the studio camera. In situations where the camera is allowed to pan, tilt and zoom, but not translate, the background image can be stored as a 2D image, which is transformed to match the current camera angle. If the camera is allowed to translate, true 3D models of all virtual set elements are generally required. Virtual objects can also be inserted in front of the live action, by generating a key signal for each object that forces the mixer to switch to the virtual background signal within the object, regardless of the presence of the actor’s key signal. This can be achieved by giving every object (including the actors) a depth value, which determines the way in which objects are overlaid.
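The soft switch controlled by a key signal, and the use of a depth value to decide how objects are overlaid, can be sketched as follows. This is a minimal illustration in Python operating on one row of grey-level pixel values. The layer structure (a dict with `depth`, `pixels` and `key`) is a hypothetical representation chosen for the example and is not an MPEG-4 data structure.

```python
def soft_mix(foreground, background, key):
    """Blend one pixel: key=1.0 shows the foreground, key=0.0 the background."""
    return key * foreground + (1.0 - key) * background


def composite(layers, width):
    """Overlay layers from farthest to nearest.

    Each layer is a dict with 'depth' (larger means farther from the camera),
    'pixels' and 'key', the last two being lists of `width` floats.
    """
    out = [0.0] * width
    for layer in sorted(layers, key=lambda item: item["depth"], reverse=True):
        out = [
            soft_mix(p, o, k)
            for p, o, k in zip(layer["pixels"], out, layer["key"])
        ]
    return out
```

A virtual object placed in front of an actor is simply given a smaller depth value, so that it is mixed last and hides the actor wherever its own key is set.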

MPEG-4 offers the possibility of broadcasting programmes originated in this way, using object-based techniques. The final image composition then takes place in the decoder rather than in the studio mixer. This is likely to offer advantages both in coding efficiency and increased functionality (such as user interaction and stereoscopic viewing of the scene).

The types of objects that need to be represented include:

Virtual scene elements:

2D static images, which can be transformed to correspond to the current camera viewpoint (representing flat objects, or 2D views of a 3D background as long as the camera does not translate), including images composed largely of text.

3D objects with texture mapping

2D moving images (e.g. representing video walls)

3D animated people

Real scene elements:

2D images of actors, accompanied by a grey-level shape description

2D images of shadows (possibly a grey-level shape description and a black image, representing a semi-transparent black object, to have the effect of darkening parts of background objects). It may be convenient to describe an actor and his/her shadow as a single 2D video object and shape description.

Note that it must be possible to overlay scene elements in many ways whilst achieving a clean anti-aliased transition between their edges. Objects will generally have arbitrary shapes.

Good shadows and lighting effects are very important for conveying a realistic image. Current practice usually involves pre-rendering the texture maps on objects to include light and shadow effects. Using this approach, it is also possible to simulate lighting changes by cross-fading between texture maps representing different lighting conditions.

Another feature useful for adding realism is depth of field. Objects a long way away from the point at which the camera is notionally focused should appear defocused. This effect can be achieved by defocusing the texture maps and grey-level shape signals associated with these objects. This can lead to shape representations having very soft edges.

It is also important to ensure that the audio signal presents the same spatial ‘story’ as the pictures that go with it. As an actor who is talking moves within a space, the proportion of direct-to-reverberant voice-sound changes and indeed even the spectral content of the sound may change. This is normally accomplished manually but in the context of MPEG-4, where spatial information may be rendered at the consumer’s end of the chain, it is important to generate this feature automatically. This can be done using tools such as those provided in structured audio effects (SAFX).

Virtual actors are sometimes incorporated in virtual productions. These are 3D models of people, usually animated by motion data captured from a real actor. The face and body animation features of MPEG-4 could be used here, as long as they offer sufficient control over the appearance of the virtual actor to satisfy the programme producer. It might also be required on occasions to generate the voice of a virtual actor using text-to-speech (TTS) synthesis.

In addition to a traditional broadcast application (which presents a normal linear programme in real-time), the same types of objects could be used to present a programme in a non-linear manner under the control of the user. All the elements of a programme would be downloaded first, together with an adaptive scene description which would replay the objects in accordance with a particular set of rules, controlled by input from the user. This could allow the user to control many aspects of the programme, including complex actions that affect the storyline. It may be possible to produce a non-linear version of a conventional linear programme with only a modest amount of additional production effort. Such material could become a significant source of MPEG-4 software, and would provide significant added value compared to a conventional linear programme.

2.8.2 Application-Specific Requirements

• Temporal random access

• Unidirectional communication channel

• High video quality (higher temporal and spatial resolutions)

• Interlaced and progressive scanning modes

• Coding efficiency

• Object-based functionality

• Text and graphics overlays

• Low complexity decoder

• Temporal and spatial synchronization of A/V objects

For virtual-studio-type applications, the following requirements are additionally needed:

• Face & body animation parameters (if virtual actors used)

• Stereoscopic views (for optional 3D display)

• 3D composition

• Text to speech (TTS)

• Audio rendering

• User controls

2.8.3 Proposed MPEG-4 Tools

Visual | Audio | Systems | Delivery
Intra coding mode (I-Mode) | AAC | The Systems Decoder Model Tool | DMIF-Application Interface (DAI)
Inter prediction mode (P-Mode) | - | Object Descriptor Tool | DMIF-Network Interface (DNI)
AC/DC prediction | - | Synchronization Tool | -
Binary shape coding | - | FlexMux Tool | -
H.263/MPEG-2 quantization tables | - | BIFS Scene Description Tool | -
Bi-directional prediction mode (B-Mode) | - | - | -
Overlapped block motion compensation (OBMC) | - | - | -
4 motion vectors per macro block | - | - | -
Unrestricted motion vectors | - | - | -
Interlaced coding tools | - | - | -
Scalable texture compression | - | - | -
Global motion compensation | - | - | -
Quarter-pel motion compensation | - | - | -
Shape Adaptive DCT | - | - | -


For virtual-studio-type applications, the tools in the next table have to be added:

Visual Tools | Audio Tools | Systems Tools | Delivery Tools
Gray scale alpha shape coding | TTS | - | -
Dynamic Sprites | SASL | - | -
Face & body animation with downloadable models/textures | - | - | -
Global Motion Compensation | - | - | -


2.9 Digital Television Set-Top Box

2.9.1 Application Description

Digital Television (DTV) will change the nature of television broadcasting since digital data together with digital audio/video can be delivered to consumers. Digital data can enhance the consumers’ viewing experience by providing a more interactive environment. Some of the possibilities are:

• Linking TV programs and advertising to web pages, with "push" data tailored to users

• Access to Internet entertainment and information

• Simple Electronic Mail and Messaging on TV sets with cordless keyboards

• Secure and authenticated Electronic Commerce such as banking and shopping

• Interactive games and video-on-demand

• Any specialized interactive application that creative developers invent in an open, competitive development environment.

The digital data may contain the following components:

• Scene composition information and control

• Adaptive session with downloadable applets

By its very nature, MPEG-4 is perfectly suited to provide the functionality mentioned above. And since the typical platform on which these functions will be available and used is the set-top box, this implies that the set-top box environment should be considered explicitly as an MPEG-4 application.

Typically, a set-top box will receive MPEG-2 transport streams, carrying MPEG-2 video and audio, and possibly some data and/or applet-like programs in the user defined area. In the current situation, as sketched below in Figure 1 on the left, the elementary streams are obtained from the transport stream, and passed on to their appropriate decoders. The decoded audio and video will then be rendered. But, in one MPEG-4 scenario, the MPEG-2 transport stream may also carry MPEG-4 streams. In the conventional set-up, this information would simply be discarded by the demultiplexer. This makes the bitstream carrying the additional MPEG-4 data fully backwards compatible.

On the other hand, on MPEG-4 enabled set-top boxes, the MPEG-4 stream will be recognized by the demultiplexer as such, and passed on to the MPEG-4 decoder section. This is shown in the figure below, on the right. The system is MPEG-4 compliant, conforming to a set-top box profile and level that may yet be specified. The MPEG-2 video and audio will be rendered through the MPEG-4 rendering mechanism, the MPEG-4 compositor, together with the additionally received MPEG-4 objects.
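The backwards-compatible behaviour of the demultiplexer described above can be sketched as a simple routing step. The following Python fragment is an illustration only: the stream type labels are hypothetical strings invented for the example and do not correspond to actual MPEG-2 transport stream type codes.

```python
def route_streams(streams, supported_types):
    """Pass each elementary stream to a decoder; silently drop unsupported types.

    `streams` is a list of (stream_id, stream_type) pairs.
    Returns the ids that were routed and the ids that were discarded.
    """
    routed, discarded = [], []
    for stream_id, stream_type in streams:
        if stream_type in supported_types:
            routed.append(stream_id)
        else:
            discarded.append(stream_id)
    return routed, discarded


# Hypothetical capability sets for a conventional and an MPEG-4 enabled box.
LEGACY_BOX = {"mpeg2_video", "mpeg2_audio"}
MPEG4_BOX = LEGACY_BOX | {"mpeg4_objects"}
```

The same multiplex can therefore be delivered to both kinds of set-top box: the conventional one ignores what it does not recognize, while the MPEG-4 enabled one routes every stream.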

Figure: MPEG-4 enabled set-top box (graphic not displayed)

2.9.2 Application-specific requirements

In addition to the regular MPEG-2 video and audio decoding and rendering, the following lists the desirable functions of digital set-top boxes:

Multimedia streams control other multimedia: Control video, audio, and streams (synchronization, switching, etc.)

Presentation control: Control the display, audio and other presentable output

User interface control: Interface with user

Graphics composition and control: Control object placement, transparency effect, and its user interaction

Return channel management: Control return channel (method, data rate, protocols, etc.)

Conditional access management: Entitlement management and control

EPG display and control: Electronic program guide management and display

Profile management: Profile of the client for selective adaptation

Resource management: Resources available at the client (e.g. storage, digital interface, and other peripherals)

Most of the functions and requirements listed above can be realized with the MPEG-4 BIFS scene graph structure functionality and other MPEG-4 defined elements. This includes the stream and presentation control, and the graphics and audio rendering. However, some of the desired functionalities must be realized via an applet-like program, downloaded as a separate elementary stream and subsequently executed on the set-top box. Functions that can be realized this way include:

Return channel access (e.g. for home shopping applications)

User interface control (e.g. remote control)

Conditional access management (e.g. for channel subscriptions)

User profile management

2.9.2.1 Prototypical Example: Home-shopping

More than one channel already exists whose programming is devoted solely to home shopping. Products are shown and demonstrated, and the viewer is given the possibility to order the product. A typical scene layout of such a program consists of multiple objects, and is thus well suited to the MPEG-4 object-oriented approach. Examples of the objects in such a program scene include a still picture of the product (possibly seen from more than one viewing angle), the presenter (a talking head or a whole person), a background (again a still picture), product information (text), a channel logo, a clock, etc.

Ordering the product is typically done by telephone. However, if a backchannel is configured (which may well be a telephone line), ordering the product can be made more user-friendly, simply by interacting with the TV screen using the TV remote control. This will make shopping easier, which should be an appealing feature to home-shopping channel operators. On-line ordering could thus involve sending a small MPEG-4 AAVS application program, whose execution might comprise:

checking if a backchannel is configured,

taking product ordering information with special pop-up menus, using the TV remote control: number of items, item specifics (size, color, options, etc.), credit card information (note that this will involve security issues),

opening the backchannel (possibly by dialing a specific phone number),

sending the order information,

closing the backchannel.
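The sequence of steps above can be sketched as a small program. The Python fragment below is a hedged illustration only: the `Backchannel` class and the `place_order` function are hypothetical stand-ins, not part of any MPEG-4 or AAVS interface, and the collection of order details through pop-up menus is represented simply by the `order` argument.

```python
class Backchannel:
    """Hypothetical stand-in for a set-top box return channel (e.g. a phone line)."""

    def __init__(self, configured):
        self.configured = configured
        self.is_open = False
        self.sent = []

    def open(self):
        self.is_open = True

    def send(self, payload):
        if not self.is_open:
            raise RuntimeError("backchannel is not open")
        self.sent.append(payload)

    def close(self):
        self.is_open = False


def place_order(channel, order):
    """Run the ordering steps in sequence.

    Returns False without sending anything if no backchannel is configured.
    """
    if not channel.configured:   # check if a backchannel is configured
        return False
    channel.open()               # open the backchannel
    try:
        channel.send(order)      # send the order information
    finally:
        channel.close()          # always close the backchannel afterwards
    return True
```

Closing the channel in a `finally` clause reflects the expectation that a dial-up connection should not be left open if sending fails. The security handling that credit card information would require is outside the scope of this sketch.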

2.9.2.2 Prototypical Example: Conditional access

With the current analog TV set-top boxes it has been possible for quite some time now to subscribe to specific TV channels, or to pay to view a certain single movie ("pay-TV"). Ordering a movie might involve similar steps as buying a product on-line as described in the example above. It is for example also possible to prompt for a password, that will be verified via a backchannel or compared with the set-top box hardware. It may also be required to access a specially installed device, like a SmartCard, that is used for access verification and/or decryption.

2.9.2.3 Prototypical Example: Interactive game show

Game shows often have contestants competing in answering questions. With the current set-top boxes the experience is more or less passive. However, by sending an AAVS application program along with such a game show, interaction with the game could be realized, where the answers can be entered by the user, a score is kept, and it might even be possible to use a backchannel to send the final scores back to a server, and thus to compete with all other "home-participants".

2.9.2.4 Adaptive Content Application

The scenario is that the author of content provides both the content, cast to a scene graph with multiple media nodes, and an applet which responds to finite resources. The figure below illustrates the applet and its interaction with the infrastructure. The premise is that, in addition to the applet, a pipeline decodes the protocol to build a scene graph. The scene graph presents an interface through which the applet selects nodes which represent alternative content formats.

The content is scalable with respect to resources. Some formats, for example, require less computation bandwidth. The applet measures resource utilization and responds to changes in utilization. If it detects degradation, for example, it selects a format which is less compute intensive.

The motivation for the experiment is not to formalize the applet interfaces; rather the motivation is to exercise the interfaces to which the applet delegates. The interface which the applet requires is:

• Scene Graph: The applet can traverse the scene graph and select nodes which contain alternative content formats.

• Resource Execution: The applet measures execution bandwidth for both software execution and media hardware execution.

• Screen Allocation: The applet perhaps presents resource execution on the screen. It might also present adaptation options. Since the author creates both the scene graph and the applet, the scene graph perhaps contains a node through which the applet accesses the screen.
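The adaptation behaviour of such an applet can be sketched as a single decision function. In the Python fragment below, the utilization thresholds and the idea of stepping back up to a richer format when there is headroom are assumptions added for illustration. The text itself only states that a less compute-intensive format is selected when degradation is detected.

```python
def select_format(formats, current, utilization, high=0.9, low=0.5):
    """Choose a content format index from measured resource utilization (0.0 to 1.0).

    `formats` is ordered from most to least compute intensive.
    `high` and `low` are hypothetical thresholds.
    """
    if utilization > high and current < len(formats) - 1:
        return current + 1   # degradation detected: step down to a cheaper format
    if utilization < low and current > 0:
        return current - 1   # headroom available: step back up (assumption)
    return current
```

Using two thresholds rather than one avoids switching back and forth when utilization hovers around a single limit.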

2.9.2.5 Simple Program Guide Application

The scenario is that an applet presents a program guide through which a person selects programs. To present the program guide, the applet renders to the screen. The implication is that the applet shares the screen with a video pipeline. The interface which the applet invokes depends on whether a single scene graph contains the nodes through which the applet renders plus the nodes through which the video pipeline renders. (The scenario illustrates the complication which occurs if multiple clients attempt to share resources.)

The motivation for the experiment is not to formalize the applet interfaces; rather the motivation is to exercise the interfaces to which the applet delegates. The interface which the applet requires is:

• Resource Allocation: The applet presents the program guide on the screen. The primitives which render to the screen are not the concern; rather the concern is that the applet and the video pipeline share a resource, namely the screen. The applet must reserve resources. The infrastructure must alert the video pipeline when the applet acquires a portion of the screen. The infrastructure must alert the applet if the video pipeline acquires the screen.

• Interaction: The applet interacts with a person. It responds to input streams.

• Network: The applet selects channels and selects streams bound to the channels. The infrastructure must provide an interface for these functions.
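The resource allocation requirement above, in which the infrastructure alerts one client when another acquires part of the screen, can be sketched as follows. The `ScreenArbiter` class, its method names and the region labels are hypothetical and serve only to illustrate the notification pattern in Python.

```python
class ScreenArbiter:
    """Hypothetical infrastructure object tracking which client holds each region."""

    def __init__(self):
        self.owners = {}      # region name -> client name
        self.listeners = {}   # client name -> callback(region, new_owner)

    def register(self, client, callback):
        """Register a client and the callback used to alert it."""
        self.listeners[client] = callback

    def acquire(self, client, region):
        """Give `region` to `client` and alert every other registered client."""
        self.owners[region] = client
        for other, callback in self.listeners.items():
            if other != client:
                callback(region, client)
```

In this sketch acquisition always succeeds. A fuller design would also need a policy for refusing or queuing conflicting requests, which the text leaves open.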

2.9.2.6 Multimedia Services

A new generation of interactive multimedia services requires an extensive capability to interact with the multimedia content. The interaction may typically occur via the use of a web browser interface and allows manipulation, navigation or extraction of the needed information. The multimedia content may be stored as a database which is available either locally or remotely. Examples of services include mobile or internet multimedia under programmatic control. Such services may need to be offered in a platform independent manner so as to work on various types of devices such as computers (in home or office environment) and information appliances.

2.9.3 Proposed MPEG-4 Tools

The tools proposed for this application scenario are largely the same as those listed for the broadcast application.

Additionally, the AAVS tools are required.

2.10 ISDB

2.10.1 Application Description

ISDB concept

ISDB is a concept for constructing a complete digital broadcasting system which offers a great variety of services with high spectrum efficiency, flexibility and extendibility. ISDB provides not only existing basic broadcasting services such as SDTV and HDTV, but also new services, such as multimedia TV, the TV newspaper (multimedia information services) and two-way information services.

An integrated-services television is a terminal for receiving ISDB services. It enables viewers to make better use of television and offers services with multiple functions as well as new multimedia information services such as the TV newspaper. It is controlled by a CPU, thus enabling the viewer to enjoy programs through personal filters and an intelligent agent for broadcasting. Figure 1 shows an example of the program menu the user can see just after turning on the switch.

Fig. 1 Example of the program menu of a terminal for receiving ISDB services (graphic not displayed).

ISDB offers the following services:

• SDTV and HDTV programs

• Multiple TV programs

• Electronic Program Guides (EPG)

• News at any time

• Weather forecasts at any time

• TV newspaper

• Video at any time (VOD or Near VOD)

• Audio at any time (AOD or Near AOD)

• Multiple language subtitles

• Information linked with TV programs

• Interactive questionnaire survey

• Interactive home shopping

• Automatic recording by intelligent agent

• Automatic selection of programs by intelligent agent

Terrestrial ISDB

The terrestrial ISDB system (ISDB-T) provides ISDB services via terrestrial networks. ISDB-T has the flexibility to accommodate different service configurations and ensure flexible use of transmission capacity. The transmission scheme of ISDB-T is based on segmented OFDM (Orthogonal Frequency Division Multiplexing). In this scheme, a logical transmission channel is composed of a set of small OFDM band blocks called BST (Band Segmented Transmission)-Segments. Modulation and error correction can be set independently for each BST-Segment. The BST-OFDM scheme has the robustness and flexibility to adapt to various frequency situations and service applications.

Figure 2 shows an example of the ISDB-T system. ISDB-T provides hierarchical transmission capability by several combinations of carrier modulation within a given bandwidth. This enables mixed transmission for different receiving conditions, which means that audio and data broadcasts for mobile and portable receivers can be performed simultaneously with television broadcasts for home use. The three types of receivers are as follows:

An integrated receiver with a demodulator for all BST-segments and an HDTV resolution display

A light-weight mobile receiver with a demodulator for all BST-segments and a small SDTV resolution display

A portable or pocket-size receiver with a demodulator for one BST-segment for sound and data services.

When the channel uses all BST-segments, the signal is frequency-interleaved within the given bandwidth. However, the sound services can also be transmitted using one BST-segment at the center of all BST-segments. In this case, the interleaving range is divided into two parts: the center segment and the remaining segments. These sound services can then be decoded by a one-BST-segment receiver.

The ISDB-T system brings the ability of MPEG-4 tools into full play by fitting MPEG-4 Objects to the BST-segments.
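The hierarchical use of BST-segments described above can be illustrated with a small model. In the Python sketch below, the segment count, the receiver labels and the modulation names passed in are hypothetical parameters chosen for the example. The only properties taken from the text are that modulation can be set per segment and that a one-segment receiver demodulates only the center segment.

```python
def center_index(segment_count):
    """Index of the middle segment (segment_count is assumed to be odd)."""
    return segment_count // 2


def build_channel(segment_count, robust_mod, high_rate_mod):
    """Assign a robust modulation to the center segment and a high-rate one elsewhere."""
    center = center_index(segment_count)
    return [
        robust_mod if i == center else high_rate_mod
        for i in range(segment_count)
    ]


def decodable_segments(segment_count, receiver):
    """Return the segment indices that a given receiver type can demodulate."""
    if receiver == "one_segment":
        return [center_index(segment_count)]
    return list(range(segment_count))
```

An MPEG-4 object intended for portable reception, such as a sound object, would then be mapped to the center segment, while the remaining objects are carried in the other segments.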

Fig. 2 Examples of application and transmission images of the terrestrial ISDB system (graphic not displayed).

2.10.2 Application-specific requirements

• Multiplexing A/V objects and other information data

• Temporal and spatial synchronization of A/V objects which are distributed via different transmission channels

• Object-based coding flexibility by allowing all objects to be selectively coded

• Object-based spatial/temporal quality flexibility

• Object-based spatial/temporal scalability

• Object-based bitstream manipulation without the need for transcoding

• Compatibility with MPEG-2 standard

• Quick recovery after changing the program

• Prevention of illegal copying and altering of contents

• User interaction

• Downloading of A/V objects and other information data

• Multipoint operation

• 3D composition (BIFS)

• Interlaced and progressive scanning modes

• High video quality (SDTV and HDTV)

• Robustness to information error and loss

• Low delay decoding

• Tandem encoding

• Low complexity decoder

• Text and graphics overlay

• Coding of multiple concurrent data streams

2.10.3 Proposed MPEG-4 Tools

Visual | Audio | Systems | Delivery
Intra coding mode (I-Mode) | AAC | The systems decoder model tool | DMIF-Application Interface (DAI)
Inter prediction mode (P-Mode) | - | BIFS scene description tool | DMIF-Network Interface (DNI)
AC/DC prediction | - | Object Descriptor tool | -
H.263/MPEG-2 quantization tables | - | Synchronization tool | -
Bi-directional prediction mode (B-Mode) | - | FlexMux tool | -
Overlapped block motion compensation (OBMC) | - | Intermedia format tool | -
4 motion vectors per macro block | - | AAVS scene description tool | -
Unrestricted motion vectors | - | - | -
Static sprites | - | - | -
Interlaced coding tools | - | - | -
Gray scale alpha shape coding | - | - | -
Global motion compensation | - | - | -
Quarter-pel motion compensation | - | - | -
Interlaced shape coding | - | - | -


2.11 Studio and Television Post-production

2.11.1 Application Description

In a video post-production environment, source material goes through various processing stages before assuming its final form (i.e. a television program). Often these stages involve operations such as editing and multiple copying from one storage medium to another.

The transparency of digital recording operations and the automation provided by edit controllers and off-line edit systems increase the ease with which complex effects can be created. Examples of editing operations are cutting-and-pasting, cross-fading, captioning, color separation overlaying (also known as chroma-keying) and various digital video effects (DVE), to name but a few. The need for versions in different languages favors separation of program elements and requires separate storage of all intermediate versions. Regional variations of a program may have similar requirements.

Typically, a wide range of operations such as those described above will be required to produce the final product and caption it. This may not be performed in a single pass (for technical or production reasons) but may involve several passes thereby requiring the intermediate storage of results to an appropriate medium (typically tape). Depending on the complexity of the desired effects as many as 10-20 passes may be required and 4 or 5 versions retained.

A coding standard offering the possibility of content-based video manipulation is an attractive proposition to television production and post-production for the following reasons:

• Combining multiple sources of visual information to produce a single entity (i.e. program) often amounts to retrieving objects of interest from those sources and re-combining them accordingly. In the multimedia era, as sources become more diverse, this tendency is likely to increase.

• Lossy digital compression techniques which do not introduce visually perceptible distortions are being increasingly employed in studios, mainly through the proliferation of storage devices (e.g. recorders) that use bit-rate reduction to improve storage efficiency.

• The emergence of Virtual Reality techniques is anticipated to have a considerable impact on television and film production. Such techniques are object-driven and make extensive use of 3-D audiovisual models. Efficient storage and transport of 3-D object descriptions is essential where capacity is at a premium, e.g. in servers and networks for one-to-one service provision.

The combination of those facts leads naturally to the idea of content-driven manipulation of coded video in the studio. While it is desirable that each video object is retained as long as possible in its compressed form, many applications require transcoding and therefore the compression scheme used should be very efficient in this respect.

In the digital domain, straight copying of uncompressed data (dubbing) allows distortion-free reproduction of video data as many times as required. However, in a domain in which lossy compression is employed, the accumulation of coding errors limits the number of generations of encoding-decoding operations that can be applied in cascade before visual quality becomes unacceptable.

The situation becomes further complicated when additional processing is applied between generations. Simple effects, such as small spatio/temporal shifts and fades, can be used to simulate inter-generation processing. When such effects are combined with conventional compression algorithms at television transmission rates (e.g. MPEG-2 MP@ML at 4.5 Mbit/s), it can be demonstrated that the output video becomes unusable very quickly (i.e. after 3-4 generations).
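The accumulation of errors over generations can be illustrated with a toy model. The following is a minimal sketch only: a uniform scalar quantizer stands in for a lossy encode/decode pass, a constant-gain fade stands in for the inter-generation effect, and the test signal, step size and gain are all assumed values. It is not an MPEG-2 or MPEG-4 codec and does not reproduce the figures quoted above.

```python
import math


def quantize(samples, step):
    """Uniform scalar quantizer: a crude stand-in for one lossy encode/decode pass."""
    return [step * round(s / step) for s in samples]


def fade(samples, gain):
    """A simple inter-generation effect: scale every sample by a constant gain."""
    return [gain * s for s in samples]


def rms_error(a, b):
    """Root-mean-square difference between two equally long sample lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def simulate(generations, step=8.0, gain=0.97, process=True):
    """Return the RMS error after each generation.

    The error is measured against the same chain of fades applied to the
    source without any quantization (the ideal, lossless result).
    All parameter values are assumptions chosen for illustration.
    """
    original = [128 + 100 * math.sin(2 * math.pi * i / 64) for i in range(256)]
    current = list(original)
    total_gain = 1.0
    errors = []
    for _ in range(generations):
        if process:
            current = fade(current, gain)
            total_gain *= gain
        current = quantize(current, step)
        errors.append(rms_error(current, fade(original, total_gain)))
    return errors


if __name__ == "__main__":
    print("copy only :", [round(e, 2) for e in simulate(10, process=False)])
    print("with fades:", [round(e, 2) for e in simulate(10, process=True)])
```

In this model, re-quantizing an unprocessed signal adds no error after the first pass, because the quantizer maps its own output onto itself. Once a fade is applied between passes, every generation introduces a fresh rounding error on top of the earlier ones, so the result drifts away from the ideal faded signal.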

2.11.2 Application-Specific Requirements

The single most critical requirement of studio and post-production applications is the encoding of full bandwidth luminance and chrominance signals without resorting to down- and up-sampling at the two ends of the coding chain. This requirement implies that encoding of object attributes (shape, texture, motion) at full-bandwidth should be possible.

Further essential requirements are the following:

• the lossless encoding of object shapes

• a high density of I-VOPs in the bit-stream for improved temporal random access (e.g. I-B-I-B VOPs or I-only VOPs, the "still picture mode")
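The effect of I-VOP density on temporal random access can be estimated by counting how many VOPs must be decoded before a given VOP can be shown. The following is a minimal sketch under simplifying assumptions: patterns are given in display order, a P-VOP is predicted from the previous I- or P-VOP, and a B-VOP is predicted from the nearest I- or P-VOP on each side. The function names and example patterns are illustrative and are not taken from the standard.

```python
def previous_anchor(seq, i):
    """Index of the nearest I- or P-VOP before position i, or -1 if none."""
    j = i - 1
    while j >= 0 and seq[j] == "B":
        j -= 1
    return j


def next_anchor(seq, i):
    """Index of the nearest I- or P-VOP after position i, or -1 if none."""
    j = i + 1
    while j < len(seq) and seq[j] == "B":
        j += 1
    return j if j < len(seq) else -1


def vops_to_decode(seq, target):
    """Sorted indices of all VOPs that must be decoded to show seq[target]."""
    needed = set()
    pending = [target]
    while pending:
        i = pending.pop()
        if i < 0 or i in needed:
            continue
        needed.add(i)
        if seq[i] == "P":
            # A P-VOP depends on the anchor before it.
            pending.append(previous_anchor(seq, i))
        elif seq[i] == "B":
            # A B-VOP depends on the anchors on both sides.
            pending.append(previous_anchor(seq, i))
            pending.append(next_anchor(seq, i))
        # An I-VOP has no dependencies.
    return sorted(needed)


def worst_case(seq):
    """Largest number of VOPs that must be decoded to reach any single VOP."""
    return max(len(vops_to_decode(seq, t)) for t in range(len(seq)))


if __name__ == "__main__":
    for pattern in ("IIIIIIIIIIIII", "IBIBIBIBIBIBI", "IBBPBBPBBPBBI"):
        print(pattern, worst_case(pattern))
```

Under these assumptions the I-only and I-B-I-B patterns keep the worst case at one and three VOPs respectively, whereas the longer I-B-B-P group in the last example requires six.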

Desirable requirements are the following:

• Provision for common video recording functionalities (e.g. picture-in-shuttle) to allow recognizable playback of recorded material at viewing speeds other than normal. This leads to a requirement for scalability (e.g. spatial).

• Provision for error resilience to account for the characteristics of the recording channel (e.g. burst errors commonly occurring in tape recordings)

• Support for both Variable and Constant Bit-rate coding to account for potentially different characteristics of the recording channel (i.e. fixed or variable allocation of storage on tape)

2.11.3 Proposed MPEG-4 Tools

Visual:

• Intra coding mode (I-Mode)

• Inter prediction mode (P-Mode)

• AC/DC prediction

• H.263/MPEG-2 quantization tables

• Interlace coding tools

• Frame-based spatial scalability

• Greyscale alpha shape

• 2D-Meshes

• 4:2:2 chrominance sampling (until now no such tool is provided by MPEG-4)

Audio:

• AAC

Systems:

• The Systems Decoder Model Tool

• Object Descriptor Tool

• Synchronization Tool

• FlexMux Tool

• Intermedia Format Tool

• BIFS Scene Description Tool

Delivery:

• DMIF-Application Interface (DAI)

• DMIF-Network Interface (DNI)




2.12 TVML

2.12.1 Application Description

TVML is a script description language that can produce full TV programs in real time by using real-time computer graphic (CG) characters, a voice synthesizer, and multimedia computing techniques. A user can create his own TV program on a desktop workspace simply by writing a text-based script in a description language that we have designed for this purpose. In a TVML script, the contents of a program are represented by text-based commands such as "Tiltle#1" or "ZoomIn". A TVML Player interprets the script and generates a TV program from it in real time. Figure 1 shows an outline of the system.

Fig. 1 Outline of TVML system.

TVML covers the following items:

• Studio shot using CG characters with synthesized voice, CG studio set and CG camerawork

• Video clip playing

• Title using HTML-like text layout language

• Superimposing

• Audio clip playing as background music

• Narration by synthesized voice

TVML is well suited to using the following MPEG-4 tools:

• Object-based video coding tool to store video clips

• Object-based audio coding tool to store audio clips

• BIFS to describe the camera view position and lighting

• Face and Body Animation to generate CG characters

• Text to Speech (TTS) for talking CG characters

2.12.2 Application-Specific Requirements

Requirements for Systems part

• Multiplexing A/V objects and other information data

• 3D composition (BIFS)

• Support all VRML nodes (BIFS)

• Object-based bitstream editing and manipulation without the need for transcoding

• User interaction

• Synchronization between Face animation and TTS

Requirements for Visual and Audio parts

Support the following video formats:

• Luminance spatial resolutions: Sub-QCIF, QCIF, CIF, ITU-R BT.601 and 709

• Color spaces: Monochrome, Y/Cr/Cb and R/G/B with an alpha channel

• Chrominance spatial resolutions: 4:0:0, 4:2:0, 4:2:2 and 4:4:4

• Temporal resolutions: 60 fps maximum

• Pixel depth: up to 10 bits per component

• Scanning methods: Progressive and interlaced

Support the following object types:

• 2D natural video

• 2D static images

• 2D arbitrary shaped objects accompanied by binary or gray scale shape description

• 2D static sprites

• 3D objects with texture mapping

• 3D animated objects

• Parameters of camera work and lighting

• Natural audio/speech objects

• Synthetic audio/speech objects

• Text and graphics

• High video quality

• Compatibility with MPEG-2 Video standard

• Object-based coding flexibility by allowing all objects to be selectively coded

• Object-based spatial/temporal quality flexibility

• Low delay decoding

• Tandem encoding

• Object-based random access

• Low complexity decoder

• Text and graphics overlays

• Coding of multiple concurrent data streams

• Face and Body animation parameters

• Text to Speech

2.12.3 Proposed MPEG-4 Tools

Visual:

• Intra coding mode (I-Mode)

• Inter prediction mode (P-Mode)

• AC/DC prediction

• H.263/MPEG-2 quantization tables

• Static/Dynamic sprites

• Global motion compensation

• Interlace coding tools

• Grey scale alpha shape coding

• Interlaced shape coding

• 2D/3D meshes

• Face and Body animation parameters (FAP, BAP)

• 10 bit video

• 4:2:2 and 4:4:4 chrominance sampling (until now no such tool is provided by MPEG-4)

Audio:

• AAC

• TTS

• Structured Audio Tools

Systems:

• The systems decoder model tool

• BIFS scene description tool

• Object Descriptor tool

• Synchronization tool

• FlexMux tool

• Intermedia format tool

• AAVS scene description tool

Delivery:

• DMIF-Application Interface (DAI)

• DMIF-Network Interface (DNI)




2.13 Collaborative Scene Visualization

2.13.1 Application Description

Collaborative Scene Visualization supports a class of Computer Supported Cooperative Work (CSCW) applications where groups of people typically working simultaneously in distributed locations leverage visualization tools to accomplish a task by sharing a common visual information space.

A trend in this kind of application is the provision of Augmented Reality (AR). A particular feature of these applications is that they not only use dedicated audiovisual streams for interpersonal communication, as conventional tele-conferencing applications do, but also use an additional video stream to achieve AR effects. The objective of AR is to create an environment in which a user perceives both real and virtual/synthetic (computer-generated) objects in a seamless way.

From the viewpoint of communication, multiple audiovisual streams of natural and synthetic origin are transferred. When participants jointly furnish an office, for example, these are an audiovisual stream for conferencing, a video stream containing a shot of the empty office, and a 3D synthetic object stream for the furniture.

As in any distributed multimedia system in which bulk data (video, audio, high resolution images, animation sequences, etc.) is transferred, appropriate data coding methods are needed. To this end, MPEG-4 is very useful for the following reasons:

• It supports high performance data compression.

• A trade-off between quality and performance can be made by scaling encoder and decoder complexity, spatial resolution, temporal resolution, and quality.

• Content-based coding enables interactivity with objects. It is desired that real objects can be manipulated as conveniently as virtual objects.

• The composition concept of MPEG-4 is very appropriate for organizing a scene consisting of real and virtual objects to be transferred among dispersed participants.

• Stereoscopic views help a user perceive a scene.

• Face Animation parameters can be used to replace the audiovisual streams used for interpersonal communication to achieve bandwidth reduction. The saved bandwidth can be used to improve the quality of the video stream used for AR scenes.

2.13.2 Application-Specific Requirements

The following functionalities are essential for Collaborative Scene Visualization applications:

• Improved coding efficiency for multiple audiovisual streams

• Multiplexing of audio, video, and other information

• Support of Face Animation parameters

• Synchronization

• Low end-to-end delay mode

• Interworking with other audiovisual systems

• User controls (e.g., sensitive regions)

• Content-based coding and interaction

• Coding flexibility

• Stereoscopic views

• 3D Composition

2.13.3 Proposed MPEG-4 Tools

Visual Audio System Delivery
Intra coding mode (I-Mode) HVXC The Systems Decoder Model Tool DMIF-Application Interfa