INTERNATIONAL ORGANISATION FOR STANDARDISATION

ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC1/SC29/WG11

CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC1/SC29/WG11N2194

MPEG 98

March 1998/Tokyo

Title: MPEG-4 Requirements, version 7 (Tokyo revision)

Source: Requirements Group

Status: approved

MPEG-4 Requirements


1. Introduction

This document presents a set of requirements for the MPEG-4 standard. The requirements are specified for the framework, tools, algorithms, and conformance points that are anticipated to need standardization. While this document describes what MPEG-4 should be, another document entitled "MPEG-4 Technical Overview" describes when and how the technical solutions corresponding to these requirements will be developed.

The requirements in this document drive the MPEG-4 development process and will be used to determine evaluation criteria for proposals, to verify that the standard meets the specified requirements, and to define the conformance points for the standard.

Although the requirements listed in this document can be considered stable, they will undergo further change. Notably, requirements for MPEG-4 version 2 will continue to be added.

More information about MPEG-4 can be found at MPEG’s home page (case sensitive): . This web page contains links to a wealth of information about MPEG, including much about MPEG-4, many publicly available documents, several lists of ‘Frequently Asked Questions’ and links to other MPEG-4 web pages.

2. MPEG-4 Framework

As the MPEG-4 project description [1] states, a number of concurrent evolutions have created the need for new ways to represent, integrate and exchange pieces of audio-visual information:

• the deployment of diverse new two-way delivery systems such as fixed broadband and mobile narrowband;

• the progress of micro-electronic technology that is providing extremely powerful and programmable processors; and

• the change of the audio-visual information production and consumption paradigm, because of the increased role of synthetic information and higher degrees of interactivity.

The MPEG-4 project aims to establish universal, efficient coding of different forms of audio-visual data, called audio-visual objects. These objects can be of natural or synthetic origin.

A concise overview of what the MPEG-4 standard will offer:

• a new kind of interactivity, with dynamic objects rather than just static ones;

• the integration of natural and synthetic audio and visual material;

• the possibility to influence the way audiovisual material is presented (‘composited’);

• reusability of both tools and data;

• a coded representation that can take into account lower layers, while the application developer need not worry about those layers;

• the simultaneous use of material coming from different sources - and support of material going to different destinations;

• the integration of real time and non-real time (stored) information in a single presentation.

These goals will be reached by defining two basic elements:

1. A set of coding tools for audio-visual objects capable of providing support to different functionalities such as object-based interactivity and scalability, and error robustness, in addition to efficient compression.

2. A syntactic description of coded audio-visual objects, providing a formal method for describing the coded representation of these objects and the methods used to code them.

The coding tools will be defined in such a way that users will have the opportunity to assemble the standard MPEG-4 tools to satisfy specific user requirements, some configurations of which are expected to be standardized. The syntactic description will be used to convey to a decoder the choice of tools made by the encoder.

The MPEG-4 general requirements will be described in more detail below.

3. Definition of Terms

This section defines terms used within the context of this document.

Audio Visual Objects (AV Objects)

An AV object is a representation of a real or virtual object that can be manifested aurally and/or visually. AV objects are generally hierarchical, in that they may be defined as composites of other AV objects, which are called sub-objects. AV objects that are composites of sub-objects are called compound AV objects. AV objects that cannot be decomposed into sub-objects are called primitive AV objects.

Scalability

An object bitstream is scalable if at least one subset of the bitstream is sufficient for generating a useful presentation of the object.

Tool

A tool is a technique that enables one or more MPEG-4 functionalities.

Tools may, themselves, consist of tools.

Examples: motion compensation, sub-band filter, audiovisual synchronization

Algorithm

An algorithm is an organized collection of tools that fulfills one or more requirement(s).

Algorithms may, themselves, be composed of tools and/or algorithms.

Examples: Code Excited Linear Prediction, DCT image coding, Reed-Solomon Coding, Speech driven image coding

Conformance Points

Conformance points are specifications of a particular Systems or Combination Profile at a certain Level at which conformance may be tested. Conformance Points establish normative parts of the MPEG-4 Standard.

Object Profile

An Object Profile defines the syntax of the bitstream for a single Object that can represent a meaningful entity in the (Audio or Visual) scene.

Note that this corresponds to a list of tools. There are Audio Object Profiles and Visual Object Profiles.

Combination Profile

A Combination Profile defines which different Object Profiles can be combined in the (Audio or Visual) Scene.

Profiles (Object as well as Combination Profiles) only define syntax, not complexity bounds; the latter are a level issue. Note that a Combination Profile is more than a collection of tools: it is a collection of admissible Object Profiles, and hence a list of admissible Elementary Streams - perhaps with a restriction on how they can be combined.

Note that MPEG does not want to specify what are acceptable combinations of Audio and Visual Combination Profiles. We want to let the market decide this.

Level

A level is a specification of the constraints and performance criteria on an Audio or Visual Combination Profile, a Systems Profile or a DMIF Profile, and thus on the corresponding tools.

Note that in the normative parts of the standard (S, V and A), Combination Profiles can only exist at a certain level; there are no profiles without a level. A profile could, however, be defined at only one level, in which case the level need not be mentioned.

Profile Requirements

A 'set of profile requirements' is a specification of the requirements identified to address a cluster of Audio, Visual or Systems functionalities that satisfy the needs of one or more classes of applications.

Profiles form one of the dimensions used to specify MPEG-4 conformance points. There are Visual and Audio Object Profiles, Visual and Audio Combination Profiles and Systems Profiles.

Level Requirements

A 'set of level requirements' is a specification of the constraints and performance criteria within a specific Systems or Combination Profile that satisfy the needs of one or more classes of applications.

A level is one of the dimensions used to specify MPEG-4 conformance points. Levels are defined for Combination Profiles and one or more levels may be specified for each Combination Profile.

The definition of Object Profile Levels is currently not envisaged, but they may be useful as elements in the definition of Combination Profiles.

Delay Definitions

Call-setup delay

The time from when a user requests that a connection be established via a network to another user until the communications channel is made available to the user.

Initial delay

The time required once a communications channel has been established for the first usable information to be presented at a receiver. An example of initial delay is the time required by MPEG-4 video to transmit an initial I-frame. Initial delay includes the capture/encoding time, transmission time, and decoding/presentation time.

Algorithmic delay

The time required by the system to transport information from the input of the transmitter/encoder to the presentation of the information at the receiving end. Algorithmic delay includes the capture/encoding time, transmission time, and decoding/presentation time.

Control Response Delay

The codec's contribution to the time between when a control command is issued at the decoding unit and when the effect of the command is presented from the decoding unit.

4. General Requirements Specifications

This section specifies a set of general MPEG-4 Requirements. These requirements are applicable to the MPEG-4 standard and represent the extreme limits expected to be encountered by MPEG-4 implementations.

The requirements in this section are not application or profile specific. Therefore, some requirements may not be essential or applicable for some profiles. It may also be appropriate for some requirements to be relaxed from their extreme values for some profiles.

4.1 Requirements for Systems

4.1.1 Flexibility

Requirement

The following requirements are still under study in MPEG, and may be supported:

a) the downloading and execution of composition scripts;

b) the dynamic configuration of a collection of standardized tools by the decoder during an initial scripting phase;

c) a reconfiguration during a communication phase;

d) the flexible configuration of the demultiplexer.

4.1.2 Multiplexing of Audio, Visual and Other Information

Requirement

The MPEG-4 standard shall support the dynamic multiplexing of scalable objects, compositing information, as well as other data. The multiplexing, and particularly the demultiplexing, should be easy and cost-effective to implement.

Specification

a) The multiplex shall support at least 1024 elementary bitstreams in the same multiplex bitstream.

b) The multiplex shall supply means to support a dynamically changing number of object streams.

c) The multiplex shall support the extraction of objects from the bitstream, without requiring additional capabilities.

d) The multiplex shall support the mixing of objects from local and remote sources.

e) The multiplex shall interfere as little as possible with network, link and physical layers, and shall use as much as possible the functionality provided by these layers.
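
Illustration (informative). Items a) to c) can be pictured with a toy in-memory multiplex. This is only a sketch: the class name `ToyMultiplex`, the `(stream_id, payload)` packet layout and the fixed-length stream identifier are assumptions made for the example and are not part of the MPEG-4 Systems syntax.

```python
MAX_STREAMS = 1024  # item a): at least 1024 elementary bitstreams


class ToyMultiplex:
    """Hypothetical multiplex whose packets are (stream_id, payload) pairs."""

    def __init__(self):
        self.active = set()  # identifiers of the streams currently open
        self.packets = []    # interleaved packets, in arrival order

    def open_stream(self, stream_id):
        # item b): streams may be added at any time
        if not 0 <= stream_id < MAX_STREAMS:
            raise ValueError("stream id out of range")
        self.active.add(stream_id)

    def close_stream(self, stream_id):
        # item b): streams may also be removed at any time
        self.active.discard(stream_id)

    def send(self, stream_id, payload):
        if stream_id not in self.active:
            raise KeyError("stream is not open")
        self.packets.append((stream_id, payload))

    def extract(self, stream_id):
        # item c): extraction filters on the identifier only, no decoding
        return b"".join(p for sid, p in self.packets if sid == stream_id)


def id_bits(n_streams=MAX_STREAMS):
    """Bits needed for a fixed-length identifier covering n_streams streams."""
    return (n_streams - 1).bit_length()
```

Under these assumptions, a fixed-length identifier for 1024 streams occupies 10 bits.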

4.1.3 Composition of Audio and Visual Objects

Requirement

The MPEG-4 standard shall provide the means to composite, in time and in space, the audio and visual objects as well as other data, e.g. for presentation purposes. This includes support for multi-channel audio objects and multi-view video objects. It also includes combining natural and synthetic objects. The standard shall support the association of different objects (e.g. an audio and a video object).

The compositing function includes the synchronization of the audio and video objects, as well as other data, for presentation purposes. Furthermore, object time base information shall be provided to be able to recover the encoder timing.

The compositing function shall allow changing presentation characteristics for individual objects (e.g. change volume of a single audio object; change contrast of a single video object). Note that audio objects can also have a spatial localization.

Specification

a) MPEG-4 Systems shall support carrying compositing information in the bitstream.

b) The maximum number of video objects shall be bounded by the maximum number of objects supported by the multiplex.

c) The maximum number of audio objects is 256 (natural and synthetic).

d) The maximum allowable differential delay in presentation of any two video objects with the same temporal reference is 15 milliseconds.

e) The maximum allowable differential delay in presentation of any two synthetic objects with the same temporal reference is 15 milliseconds.

f) The maximum allowable differential delay in presentation of any two audio objects with the same temporal reference is TBD milliseconds.

Ed. Note: Find this figure!

g) The maximum allowable differential delay in presentation of any audio and any video object with the same temporal reference is audio preceding video by 20 milliseconds, or video preceding audio by 40 ms.

h) MPEG-4 shall provide means to synchronize 2D/3D surface models and texture in a 3D compositor with a maximum differential delay of 15 milliseconds.

i) Synchronization between TTS speech and Facial Animation: MPEG-4 shall provide the capability to synchronize the synthetic speech and facial animation.

Note: This could be realized by specifying the exact time to start speaking and animating. Or, this capability could be achieved by specifying the maximum differential delay of speech and visual decoders.

j) MPEG-4 shall provide the means to integrate audio in a 2D/3D compositor to obtain auralization effects, i.e. to virtually place an audio object in a controlled position. The minimum configuration to be supported is a 2-channel configuration.
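
Illustration (informative). The differential delay limits in items d), e) and g) can be expressed as simple checks on presentation times. The sketch below assumes that presentation times are available in milliseconds on a common time base; the function names are hypothetical. Item f) is not covered because its figure is still to be determined.

```python
# Tolerances in milliseconds, taken from items d), e) and g) above.
VIDEO_VIDEO_MS = 15
SYNTH_SYNTH_MS = 15
AUDIO_LEADS_VIDEO_MS = 20
VIDEO_LEADS_AUDIO_MS = 40


def same_kind_in_sync(t_a_ms, t_b_ms, tolerance_ms=VIDEO_VIDEO_MS):
    """Two objects of the same kind: the limit is symmetric."""
    return abs(t_a_ms - t_b_ms) <= tolerance_ms


def audio_video_in_sync(audio_ms, video_ms):
    """Audio versus video: the limit depends on which one comes first."""
    skew = video_ms - audio_ms  # positive when audio is presented first
    if skew >= 0:
        return skew <= AUDIO_LEADS_VIDEO_MS
    return -skew <= VIDEO_LEADS_AUDIO_MS
```

The asymmetric audio/video window means that audio presented 30 ms after the video passes the check, while audio presented 30 ms before the video does not.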

4.1.4 Downloading

Requirement

The MPEG-4 standard shall provide the means to download and store AV objects which shall remain persistent in a terminal cache under the control of the session until released.

Specification

a) The MPEG-4 standard shall provide the means to download audio and visual objects from either remote or local sources.

b) The MPEG-4 standard shall provide the means to store audio and visual objects to either remote or local sources.

c) MPEG-4 shall support progressive downloading of objects.

d) Downloading composition information or scenarios shall be supported.

e) Downloaded compositions shall have the means to instantiate by reference stored AV objects that may be installed in place within the terminal as a part of a decoder or application.

f) Downloading of MPEG-4 TTS programs and of the synthesis unit databases used by the TTS shall be supported.

Note: Downloaded objects may not require a temporal reference.

4.1.5 User Interaction

Requirement

The MPEG-4 standard shall provide the means for the user (at the decoder), or for the decoder itself, to define the compositing script as well as coding, decoding, and other parameters.

Specification

MPEG-4 shall support the possibility to:

a) Interact with the compositor at the receiver side (e.g. by specifying the spatial position, size or other attributes of the objects)

b) Interact with the decoding process (e.g. by specifying which objects should be decoded in a situation with limited processing power)

c) Interact with the encoding process (e.g. by specifying the spatial or temporal resolutions of the objects to be coded)

d) Interact with the 2-D or 3-D compositor at the transmitter side (e.g. by specifying which objects are transmitted or updating a shared database)

e) Support user controls (e.g. scan forward, reverse, pause, video block (inhibit outgoing video), dynamic A/V quality trade off)

f) Support the attachment of URLs (Uniform Resource Locators) to (audio)visual objects, so that clicking on the object results in accessing this resource.

g) Use a back channel for user interaction (this should be supported).

Note

A back channel will be needed to allow interaction with the sending side.

4.1.6 Media Interworking

Requirement

MPEG-4 shall support the ability to work in and interoperate between various media, both storage and delivery.

Specification

The standard shall support operation across various storage and transmission media, including magnetic disks, optical disks, chip cards, GSM, DECT, UMTS, PSTN, ISDN, ATM and the Internet, and it shall support operation across heterogeneous transmission media.

4.1.7 Compatibility

Requirement

The MPEG-4 standard shall allow backward compatibility to some audio, video, imaging and audio-visual standards.

Specification

The systems layer shall support MPEG-1, MPEG-2 and H.263 Video streams, and MPEG-1 and MPEG-2 audio streams.

Note

Direct interworking with MPEG-2 Systems is not supported. The MPEG-2 Transport Multiplex may be utilized however.

4.1.8 Robustness to Information Errors and Loss

Requirement

The MPEG-4 standard shall provide the tools to achieve error resilient object-based streams, either in terms of bit errors or cell loss in relevant environments such as mobile networks with severe error conditions, ATM networks or storage media. It shall be possible to provide different error protection for individual objects. It shall be possible to switch off error protection if there is no need for it.

Specification

• MPEG-4 Systems shall provide the ability to withstand Random Errors with a BER up to 10^-2.

• MPEG-4 Systems shall provide the ability to withstand Burst Errors with an average BER up to [TBD] and an average burst length of [TBD].

• MPEG-4 Systems shall provide, for errors which do not cause loss of the channel, a recovery time within one round trip delay (note: some additional processing time is acceptable).

• MPEG-4 Systems shall provide capability for Data Prioritization, Error Detection (corrupt data, insertion, deletion), and Error Concealment.

• MPEG-4 Systems shall provide the possibility to do reliable downloading to ensure the integrity of identified data, e.g. scene composition data.

Note

There is a linkage between latency and the length of the burst that Systems can cope with.
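
Illustration (informative). To give a feeling for the random error figure above, the sketch below computes the expected number of bit errors in a block of data and the probability that the block arrives error-free. It assumes independent, uniformly distributed bit errors, which is a simplification, and it does not model burst errors. The function names are hypothetical.

```python
def expected_bit_errors(n_bits, ber=1e-2):
    """Mean number of corrupted bits in a block of n_bits."""
    return n_bits * ber


def prob_error_free(n_bits, ber=1e-2):
    """Probability that none of the n_bits is corrupted."""
    return (1.0 - ber) ** n_bits
```

Under these assumptions, at the highest random error rate named above a block of 100 bits is received without any error only a little more than one time in three, which is why data prioritization, error detection and concealment are listed alongside the error rate.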

4.1.9 Object-based Bitstream Manipulation and Editing

Requirement

The MPEG-4 standard shall provide the means for editing (e.g. cutting and pasting) or manipulating (e.g. translating, rotating, scaling) objects in a sequence without the need for transcoding (either all or just those which are chosen). Combined editing of associated objects (e.g. visual and audio) shall be accomplished with minimal discontinuities.

Specification

a) All individually accessible objects shall be available as an individual data stream in the system multiplex.

b) MPEG-4 shall support the combining of objects from different bitstreams into a single bitstream. For example, it is possible to extract all of the data corresponding to an object from bitstream A, extract all of the data corresponding to an object from bitstream B, and to combine both sets of data into a single valid MPEG-4 bitstream.

c) MPEG-4 shall provide the ability to modify the composition of objects in a bitstream.

d) MPEG-4 shall support scaling, rotation, and translation of any video object in a bitstream about any axis in 2D or 3D space. It shall support changing the spatial localization of any audio object in the bitstream. It also shall support changing of the temporal relationships of objects.

e) MPEG-4 shall support the ability to modify properties of an object in a bitstream (e.g. texture).

4.1.10 Content Management & Protection and Identification

4.1.10.1 Identification of Intellectual Property

Requirement

1. The MPEG-4 standard shall provide the possibility to record, transmit and retrieve the identifiers of the copyrighted components that compose the MPEG-4 object, using existing identification systems, e.g. the International Standard Book Number (ISBN).

2. The MPEG-4 standard shall provide the capability to uniquely identify the registration authority. (e.g. a reference to the ISBN agency)

3. It shall be possible to identify a composed MPEG-4 object (i.e. a sum of other copyrighted objects) as a separate copyrighted object.

Note

• The ability to identify the registration authority is necessary in order to access information, specific to the particular component of content (object), in external databases.

• Multiple objects can have dependent identifiers, but if the authority is the same for each object, there is no need to repeat the authority ID over and over. An example would be the case in which a movie is identified by an International Standard Audio-visual Number (ISAN), containing several songs identified by International Standard Work Code for Tunes (ISWC-T).

4.1.10.2 Content Management & Protection

Requirement

The MPEG-4 Standard shall provide hooks to (non-normative) Content Management & Protection Systems (CMPS) to support appropriate on-line and off-line transactions among users, content providers and/or rights holders. Amongst the functions supported by the CMPS shall be:

• Conditional access to content based on criteria to be defined by the content provider.

• Verification of authenticity of source and content and integrity of content.

• Identification and, wherever possible, prevention of illegal copying.

• Audit trails.

4.1.11 Multipoint Operation

Requirement

MPEG-4 shall support sending audio-visual objects to multiple destinations and decoding objects from multiple sources with possibly different time bases.

Specification

MPEG-4 Systems shall support receiving objects from 8 different sources.

4.1.12 Object Content Information (OCI)

Requirement

MPEG-4 shall provide the possibility to associate content description information to the various audiovisual objects in the scene.

Specification

1. MPEG-4 shall support normative Object Content Information (OCI) data (syntax and semantics.) Room for private description information data shall also be provided.

2. The amount of normative Object Content Information used for each specific case should depend on the content provider’s needs and thus the OCI syntax should be flexible enough to accommodate very different needs, in an efficient way. The minimum amount of OCI possible to add should have an insignificant weight (null, if possible) in the bitstream budget. This means that no MPEG-4 application should be unnecessarily loaded with object content information, if it does not want to provide this type of data.

3. Taking into account the content-based nature of the MPEG-4 audio-visual representation, where a scene is composited by many objects, independently accessible and usable, it should be possible to associate Object Content Information to all levels of a scene hierarchy, down to each elementary object.

4. Object Content Information syntax and semantics should be as much as possible the same at all levels of a scene hierarchy, down to each elementary object.

5. Taking into account the MPEG-4 time schedule and the emerging MPEG-7 effort, Object Content Information (OCI) in MPEG-4 should be limited to textual descriptors and description classifiers.

Example

The following types of data are considered examples of Object Content Information (OCI) data: IPR data (following the specification in this document), data concerning content description and classification, e.g. movie, news, sports, music, children, game, etc. organized in one or more layers, data concerning parental rating, e.g. by ages, data concerning events, such as event name, event description, start time, duration, etc, data concerning language (for audio), content textual description.

Note

The provision of Object Content Information (OCI) is essential to allow the selection, retrieval, and access of services, scenes, events, etc. Although in principle OCI data can be regarded as MPEG-7 information, the fulfillment of this MPEG-4 requirement should not constrain, in any way, the MPEG-7 development.

4.1.13 Video-related Metadata

Requirement

MPEG-4 video shall provide a means to store dynamically changing metadata (e.g. acquisition parameters like camera parameters) on a VOP by VOP basis.

Specification

To be specified in the context of profiles.

Note

It is believed that this metadata requirement, which conceptually belongs to Systems, can be accommodated using OCI. If MPEG-4 Systems cannot fulfil this requirement, then it becomes a requirement on the video part of the standard.

4.1.14 Delay

Definition

See the Delay Definitions in section 3.

Requirement

The MPEG-4 multiplex shall support applications with various end-to-end delay requirements, including those requiring low delay.

Specification

a) Low delay requirements: the system component shall support a mode with the following characteristics:
- ‘call set-up’: maximum 10 sec.
- initial delay: maximum 100 ms
- algorithmic delay: maximum 50 ms

b) AV object streams with different delay constraints may occur in one multiplex stream.

Note

Call set-up in H.324 systems takes about 10 seconds.
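
Illustration (informative). The low delay figures in item a) can be checked against a measured or estimated delay budget. The sketch below is hypothetical: the dictionary keys and function names are assumptions, and the algorithmic delay is taken as the sum of the three contributions named in the Delay Definitions (capture/encoding, transmission, decoding/presentation).

```python
# Limits in milliseconds, taken from item a) above.
LOW_DELAY_LIMITS_MS = {
    "call_setup": 10_000,
    "initial": 100,
    "algorithmic": 50,
}


def algorithmic_delay_ms(encode_ms, transmit_ms, decode_ms):
    """Sum of the contributions listed in the algorithmic delay definition."""
    return encode_ms + transmit_ms + decode_ms


def violations(measured_ms):
    """Names of the low delay limits that the measured values exceed."""
    return sorted(
        name
        for name, limit in LOW_DELAY_LIMITS_MS.items()
        if measured_ms[name] > limit
    )
```

For example, a system with 20 ms encoding, 15 ms transmission and 10 ms decoding has a 45 ms algorithmic delay and stays within the 50 ms limit.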

4.1.15 Configuration Modes

Requirement

It shall be possible to send configuration information for audio-visual objects at the beginning of an interactive session, or to repeat it regularly in broadcast sessions, e.g. to enable random entry into the bitstream. It shall also be possible to transmit this information on request, e.g. when a new decoder joins a multipoint session.

4.1.16 Priority of AV Objects

Requirement

Means to identify the relative importance of parts of the coded AV information shall be provided.

Specification

a) At least 32 levels of priority shall be supported by the Systems syntax.

b) It shall be possible to set the maximum number of priority levels utilized in a particular session to a number lower than or equal to 32.

Note

Item b) highlights that it may be desirable to save bits by allowing fewer priority levels if they are not needed.
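
Illustration (informative). The bit saving mentioned in the note can be quantified if the priority is assumed to be carried as a fixed-length binary field. That coding is an assumption of this sketch, as is the function name; the actual Systems syntax is not defined here.

```python
MAX_PRIORITY_LEVELS = 32  # item a)


def priority_field_bits(levels_in_use):
    """Bits needed for a fixed-length field that distinguishes the levels."""
    if not 1 <= levels_in_use <= MAX_PRIORITY_LEVELS:
        raise ValueError("number of levels must be between 1 and 32")
    return (levels_in_use - 1).bit_length()
```

Under this assumption, 32 levels need 5 bits per priority value, whereas a session restricted to 4 levels needs only 2.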

4.1.17 Dynamic Resource Management

Ed. note: This requirement needs careful reviewing together with the Systems group.

Requirement

Support for dynamic management of network and encoder/decoder resources shall be provided. The session controller shall have access to information to choose statically (at session begin) or dynamically (e.g. in case of a temporary bandwidth reduction) the optimal configuration for the current application (memory, CPU, error resilience, bandwidth, latency...).

Note

One reason for this requirement is to support graceful degradation of the presented objects if not enough resources are available to present them in their full quality.

4.1.18 Reference to associated MPEG-7 data

MPEG-4 shall provide a mechanism to reference an MPEG-7 stream that is associated with a particular MPEG-4 object or a collection of objects.

4.1.19 Intermedia Format

The MPEG-4 Intermedia Format shall:

1. include synchronisation information of the stored content;

2. not require decoding for access to data (i.e., it should be content data agnostic);

3. support efficient (re)multiplexing of the stored data into TransMuxes (e.g. for transmission/streaming);

4. support the ability to identify, extract and add elementary streams efficiently;

5. be extensible (i.e., it should have room for private fields and ISO defined extensions in a way that makes it possible to locate and ignore these private fields);

6. support exchange / distribution of MPEG-4 content on storage media (tape, CDROM, DVD,...);

7. allow the ‘publishing’ of the stored content in multiple forms so as to scale to various constraints;

8. contain information to locate Random Access Points of all elementary streams efficiently;

9. allow access to the Random Access Points of all elementary streams efficiently;

10. support the efficient location of scene description and object descriptors;

11. support the efficient location of security, copyright and MPEG-4 Object Content Information;

12. support easy conversion of the content into a streamable format.
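
Illustration (informative). Items 2 and 5 ask that data can be accessed without decoding and that private fields can be located and ignored. One common way to obtain both properties is to prefix every field with its length and a type tag. The layout below (4-byte big-endian length, 4-byte tag, payload) and the tags `sync`, `strm` and `priv` are purely hypothetical and are not the MPEG-4 Intermedia Format.

```python
import struct

KNOWN_TYPES = {b"sync", b"strm"}  # hypothetical tags a reader understands
HEADER = struct.Struct(">I4s")    # total field length, then type tag


def make_field(tag, payload):
    """Serialize one field; the length covers header and payload."""
    return HEADER.pack(HEADER.size + len(payload), tag) + payload


def parse_fields(data):
    """Return the known fields; unknown (private) fields are skipped."""
    known, offset = [], 0
    while offset + HEADER.size <= len(data):
        length, tag = HEADER.unpack_from(data, offset)
        if length < HEADER.size or offset + length > len(data):
            raise ValueError("corrupt field length")
        if tag in KNOWN_TYPES:
            known.append((tag, data[offset + HEADER.size:offset + length]))
        offset += length  # the length alone is enough to skip a field
    return known
```

Because the reader advances by the declared length, it never needs to interpret the payload of a field it does not recognize.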

4.1.20 Adaptive Audiovisual Session Format

Requirement

MPEG-4 shall provide Application Program Interfaces to a number of elements in an MPEG-4 System and Content.

Specification

MPEG-4 shall provide APIs to:

a) …Application Program Execution

MPEG-4 shall provide means to download, store and execute an application program to control the behavior of the player and the content. The application program delivery and execution shall be secure, where "security" shall be further defined.

b) …the MPEG-4 Terminal Environment

MPEG-4 shall provide an application program interface to the terminal configuration, resources and capabilities.
MPEG-4 shall support means for an application program to:

• obtain information about terminal configuration and associated devices,

• obtain information about dynamic terminal resources,

• retrieve and store user preferences.

c) …the Scene Graph

MPEG-4 shall provide an application program interface to the BIFS scene graph and its components.

Means shall be provided for an application program to obtain information about the BIFS scene graph, and details about the BIFS nodes, such as their attributes.

Means shall be provided to modify BIFS node attributes.

d) …the Network using DMIF

MPEG-4 shall provide an application program interface to the network. MPEG-4 shall support means for an application program to:

• query information from DMIF,

• (re-)configure network access using DMIF,

• have access to a backchannel,

• have access to and control over external elementary stream sources.

e) …User Input Devices

MPEG-4 shall provide an application program interface to user input devices.

Specification. MPEG-4 shall have means to obtain user input from special hardware devices, such as a TV remote control, a computer keyboard, a game control plus joystick, and possibly a camera and speech recognition.

f) …Other Devices

MPEG-4 shall provide an application program interface to external devices. Interfaces shall be provided to devices such as a Smart Card, DVD player, recording devices, and a Credit Card reader.

g) …Elementary Stream Decoders

MPEG-4 shall provide an application program interface to the demux and the elementary stream decoders with the objective to perform resource management, and have access to specific decoder attributes (such as shape in video).

4.2 Requirements for Natural Video Objects

4.2.1 Object-based Representation

Requirement

The MPEG-4 standard shall provide a representation of the video scene understood as a composition of arbitrarily shaped video objects according to a script that describes their spatial and temporal relationship.

MPEG-4 shall allow the individual objects in a scene to be coded with different parameters, at different quality levels and with different coding algorithms.

MPEG-4 shall provide techniques with sufficient performance and the lowest complexity for representing AV objects of arbitrary shapes, with interior voids (i.e. an object with one or more holes in it), and regions of partial transparency.

Specification

MPEG-4 shall provide representation of arbitrarily shaped video objects. An arbitrarily shaped video object includes:

1. binary shape (i.e. without associated texture),

2. binary shape and associated texture

3. gray level shape, in up to 3 components (α1, α2, α3), including exact representation of the original shape, and associated texture. The gray level of the shape can be used to specify the transparency, the depth shape, the disparity shape of the object, or a secondary texture map. This shall allow depth keying.
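
Illustration (informative). When the gray level of the shape is used as transparency, compositing an object over a background can be sketched as a per-pixel blend. The 8-bit range, the linear blend and the function names below are assumptions made for this example; the requirement itself does not prescribe a blending rule.

```python
def composite_pixel(fg, bg, alpha):
    """Blend one sample; alpha 0 is fully transparent, 255 fully opaque."""
    # Integer arithmetic with rounding to the nearest value.
    return (alpha * fg + (255 - alpha) * bg + 127) // 255


def composite_row(fg_row, bg_row, alpha_row):
    """Blend a row of object samples over a row of background samples."""
    return [
        composite_pixel(f, b, a)
        for f, b, a in zip(fg_row, bg_row, alpha_row)
    ]
```

A binary shape is the special case in which every alpha value is either 0 or 255, so that each output sample is taken unchanged from the background or from the object.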

Note

The objects composing the scene will very often be associated with the scene content and thus with the meaningful objects in the scene (representation based on semantic criteria), but any other criteria may be used for the composition. Blue-screened material is widely used in the movie and TV industries.

A frame-based video is a special case of arbitrarily shaped video objects. The types of shape to be supported by a profile are specified in the Profile Requirements document.

4.2.2 Video Content

Requirement

MPEG-4 Video shall support all types of pixel-based video content. In the context of a profile, assumptions may be made about the nature of the content; this may have consequences on the type of tools used for this profile.

4.2.3 Object-based Bitstream Manipulation and Editing

Requirement

The MPEG-4 standard shall provide the means for editing a video object.

Specification

a) The coarsest granularity for accessing the object is 0.5 seconds in the object’s time base.

b) It shall be possible to decode the shape without decoding the associated texture.

c) It shall be possible to access the object at different levels of spatial and temporal resolution.

4.2.4 Object-based Random Access

Requirement

The MPEG-4 standard shall provide efficient methods to allow object-based random access, within a limited time and with fine resolution, to some or all objects of a scene. This access shall also include conventional random access (entering an object bitstream at an arbitrary point in time) and conventional special modes (such as fast forward), and shall include support for low bitrates. The system shall support object-based random access within any layer of a scalable representation.

Specification

MPEG-4 Video shall support Temporal Random Access by providing usable video (as defined under Quality below) within 0.5 seconds after entering a bitstream at any arbitrary point.

4.2.5 Object Quality and Fidelity

Requirement

Quality in MPEG-4 shall be as high as possible. This means that MPEG-4 shall provide a subjective video quality that is better than the quality achieved by the available or emerging standards in similar conditions.

Specification

a) For any given scene, MPEG-4 Video shall produce a quality equivalent to, or better than, that achievable with the best performing available standard for that bitrate, with all options enabled, for similar conditions.

b) It shall be possible to have good quality intra pictures.

c) In the context of profiles, more specific requirements regarding texture and shape quality and fidelity will be specified.

Example

Good quality intra frames can be used to transmit a background object that subsequently needs no further updating.

Note

In the context of a profile, certain task based quality requirements may be specified at particular bitrates.

4.2.6 Coding of Multiple Concurrent Data Streams

Requirement

MPEG-4 shall provide the ability to code multiple views/soundtracks of a scene efficiently and provide sufficient synchronization between the resulting elementary streams. For stereoscopic video applications, MPEG-4 shall include the ability to exploit redundancy in multiple viewing or hearing points of the same scene, permitting joint coding solutions that are compatible with normal audio and video as well as solutions without this compatibility constraint. Views may be completely independent.

Specification

a) MPEG-4 Video shall support joint coding of at least 4 views of a video scene.

b) For any stereoscopic video, MPEG-4 shall perform at least as well in exploiting redundancy as the MPEG-2 multiview profile.

4.2.7 Robustness to Information Errors and Loss

Requirement

The MPEG-4 standard shall provide the tools to achieve error resilient video streams over a variety of wireless and wired networks and storage media, with possibly severe error conditions (e.g. long error bursts). This includes support for low bitrate applications.

The error protection shall support that some objects receive different protection than others, and that some parts of an object bitstream receive different protection than other parts (e.g. headers or shape information receives better protection). Error resilience should consider concealment, fault tolerance, graceful degradation and graceful recovery, also in an object-based way. It shall be possible to switch off error protection if there is no need for it.

Specification

• MPEG-4 Video shall provide the ability to withstand Random Errors and produce usable video (as defined within the context of profiles) with a BER up to 10^-4.

• MPEG-4 Video shall provide the ability to withstand Burst Errors and produce usable video (as defined within the context of profiles) with an average BER up to 10^-3 and an average burst length of 10 ms.

• MPEG-4 Video shall provide for an Error Recovery Time of 1 frame-time after recovery of the systems layer. (This may require the encoder to be aware that an error has occurred.)

• MPEG-4 Video shall provide capability for Data Prioritization, Error Detection (corrupt data, insertion, deletion), and Error Concealment.
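
To put the error figures above into perspective, the sketch below converts a bit error rate into an expected number of errored bits per second, and a burst duration into a burst length in bits. The helper names are hypothetical, and the 24 kbit/s channel rate in the usage note is only an example value, not something this section prescribes.

```python
def expected_bit_errors_per_second(bitrate_bps, ber):
    """Average number of errored bits per second for a given bit error rate."""
    return bitrate_bps * ber


def burst_length_bits(bitrate_bps, burst_ms):
    """Number of bits transmitted during an error burst of burst_ms milliseconds."""
    return bitrate_bps * burst_ms / 1000.0
```

For instance, `expected_bit_errors_per_second(24000, 1e-4)` gives about 2.4 errored bits per second on a 24 kbit/s channel, and `burst_length_bits(24000, 10)` shows that a 10 ms burst spans 240 bits at that rate.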

4.2.8 Object-based Coding Flexibility

Requirement

MPEG-4 shall provide the tools and syntactic elements to support changing the coding content, texture quality (SNR, spatial and temporal), shape accuracy and complexity of objects with fine granularity.

Specification

a) MPEG-4 Video shall support content flexibility by allowing all objects to be selectively coded (code or don’t code) with a corresponding increase/decrease in the bitstream. The range is from no objects coded to all objects coded.

b) MPEG-4 Video shall support spatial/temporal quality flexibility by allowing all objects in a scene to be coded at specified spatial/temporal resolutions. The range is up to the capture resolutions.

c) MPEG-4 Video shall support SNR quality flexibility by allowing objects in a scene to be coded to specified SNR quality parameters.

d) MPEG-4 Video shall support shape accuracy flexibility by allowing objects in a scene to be coded to specified shape accuracy.

e) MPEG-4 Video shall support complexity flexibility by allowing objects in a scene to be coded with differing options (thus changing the complexity factor). The range is from the lower limit specified in Complexity to the case where all options are specified.

4.2.9 Object-based Scalability

Requirement

MPEG-4 shall provide the tools and syntactic elements to achieve scalability with a fine granularity in terms of content, texture - SNR, spatial and temporal - and shape - SNR and spatial. These types of scalability shall result in a very flexible content-based scaling of the video information.

Specification

MPEG-4 Video shall support spatial/temporal texture scalability by allowing objects in a scene to be coded with a base layer and up to 4 enhancement layers (spatial, temporal, and/or SNR).

In the context of profiles, more specific requirements regarding the number of texture and shape scalable layers will be specified.

4.2.10 Delay Modes

Requirement

MPEG-4 shall support various delay modes, including a low end-to-end (encode/decode) mode and low decoding delay mode. One of the objectives of such modes is supporting real-time communications.

Specification

a) MPEG-4 Video shall support a low delay mode which provides for a maximum initial delay of 0.5 seconds and a maximum algorithmic delay (encoding plus decoding time) of 150 ms at 24 kbit/s.

b) MPEG-4 Video shall support a decoding mode with a delay of 50 ms.

4.2.11 Formats

Requirement

MPEG-4 Video shall support a number of video formats.

Specification

MPEG-4 Video shall support:

a) the following Luminance Spatial Resolutions: SQSIF/SQCIF, QSIF/QCIF, SIF/CIF, 4*SIF/CIF, ITU-R BT.601 and ITU-R BT.709, as well as arbitrary sizes from 8x8 to 2048x2048.

b) the following Color Spaces: Monochrome, Y/Cr/Cb, R/G/B, combined with up to 3 alpha channels (the alpha channels having the same size as Y data).

c) the following Chrominance Spatial Resolutions: 4:0:0, 4:2:0, 4:2:2, and 4:4:4.

d) various Temporal Resolutions. The maximum temporal resolution is specified as the capture rate (which may be as high as 60 fps in frame-based video). The frame rate shall be continuously variable, on a frame-by-frame basis.

e) the following Pixel Depths: up to 12 bits per component on video data and alpha channels.

f) the following Scanning Methods: Progressive and Interlaced.

g) Variable aspect ratio, and colorimetry parameters as specified for MPEG-2.
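
The combinations of spatial resolution, chrominance format and alpha channels listed above determine how many samples a frame carries. The sketch below counts them under the usual reading of the chrominance notation (4:2:0 halves chrominance in both directions, 4:2:2 halves it horizontally only); that reading, the helper names and the assumption of even frame dimensions are choices made for this example.

```python
# Horizontal and vertical chrominance subsampling divisors per format.
CHROMA_DIVISORS = {
    "4:0:0": None,    # luminance only
    "4:2:0": (2, 2),
    "4:2:2": (2, 1),
    "4:4:4": (1, 1),
}


def samples_per_frame(width, height, chroma_format, alpha_channels=0):
    """Luminance + two chrominance planes + alpha planes (same size as Y)."""
    luma = width * height
    divisors = CHROMA_DIVISORS[chroma_format]
    if divisors is None:
        chroma = 0
    else:
        chroma = 2 * (width // divisors[0]) * (height // divisors[1])
    return luma + chroma + alpha_channels * luma


def raw_bitrate(width, height, chroma_format, fps,
                bits_per_sample=8, alpha_channels=0):
    """Uncompressed bitrate in bit/s for the given format."""
    per_frame = samples_per_frame(width, height, chroma_format, alpha_channels)
    return per_frame * bits_per_sample * fps
```

For a 176x144 (QCIF) 4:2:0 frame this yields 38016 samples, which at 8 bits per sample and 15 fps corresponds to 4561920 bit/s before compression.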

Note

Work for the high quality requirements mentioned above (e.g. support of 3 alpha channels, 4:4:4 sampling format) is currently under consideration. It will only be pursued if enough participants in MPEG are willing to carry out the necessary work.

4.2.12 Bitrate Modes

Requirement

MPEG-4 Video is optimized for the following Bitrate Ranges: < 64 kbit/s, 64 - 384 kbit/s, and 384 kbit/s - 4 Mbit/s.

Currently under discussion are bitrates up to 50 Mbit/s for ITU-R BT.601 and 150 Mbit/s for ITU-R BT.709.

The MPEG-4 standard shall support efficient video coding for constant bit rate (CBR) and variable bit rate (VBR) environments.

4.2.13 Complexity Modes

Requirement

The MPEG-4 standard shall support various complexity modes, including low complexity video encoders and decoders. Complexity scalable video conditions will be defined.

Specification

MPEG-4 Video shall support a Low Complexity Mode which allows real-time decoding of 4:2:0 QSIF/QCIF video at 15 fps on the equivalent of an Intel 75 MHz 486 with 4 Mbytes of memory.

4.2.14 Still images

Requirement

MPEG-4 Video shall support the efficient coding of still images.

Specification

a) Content: generic.

b) Format:

Luminance Spatial Resolution: sizes from 8x8 to 4096x4096

Color Spaces: Monochrome, Y,Cr, Cb and RGB

Alpha Channel: same resolution as Y with 8 bits per pixel

Pixel Depths: up to 16 bits per component

Chrominance Spatial Resolution: 4:0:0, 4:2:0, 4:2:2, 4:4:4

c) Bitrates: from 0.01 bit per pixel to visually transparent.

d) Scalability: spatial scalability with 8-11 layers and SNR scalability from lossy to visually transparent. Number of scalable layers up to 32 (including both spatial and SNR).
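
The bitrate and scalability figures above can be turned into concrete numbers with a little arithmetic. The sketch below computes the coded size implied by a bits-per-pixel target and lists the resolutions of a spatial pyramid. Halving the resolution at each layer (a dyadic pyramid) is an assumption made for this example; this section does not state how the spatial layers relate to each other.

```python
def coded_size_bits(width, height, bits_per_pixel):
    """Total coded size in bits for a given bits-per-pixel target."""
    return width * height * bits_per_pixel


def dyadic_layer_sizes(width, height, layers):
    """Resolutions from the coarsest layer up to full size, halving per layer."""
    return [(width >> k, height >> k) for k in range(layers - 1, -1, -1)]
```

At 0.01 bit per pixel a 4096x4096 image takes roughly 167772 bits, and under the dyadic assumption 8 spatial layers run from 32x32 up to 4096x4096.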

4.2.16 Tandem Coding

Requirement

MPEG-4 Video shall provide the capability of tandem connection with an identical codec while still meeting the basic video requirements.

Specification

The number of generations is to be decided.

4.3 Requirements for Synthetic Video Objects

4.3.1 Object Types

Requirement

MPEG-4 shall support the following object types and associated data with a minimal set of geometric primitives and their composition into specific 2D or 3D scenes at a certain level of detail: data for scene composition, behavior, text, 2D/3D objects and their attributes, static and animated texture, transformations, and parameter data.

Specification

a) The object data shall include the following appearance attributes: font style, texture, color, transparency, surface characteristics.

b) Scene data shall include viewing characteristics as applicable, notably lighting and viewpoint.

c) It shall be possible to download objects or components so that some objects may be added, removed or modified.

d) Geometry:

Both: indexed face and line sets with defining list of shared vertices.

2D: rectangle, circle, line, polygon, Bezier curve, 2D mesh with implicit structure.

3D: box, cone, cylinder, sphere, 3D mesh.

e) Material properties:

Both: transparency, color, normal, texture mapping, texture translation.

2D: filled or empty shape, border/line width, dotted border/line, shadow properties.

f) Surface appearance:

Both: material, image texture, video texture.

g) Text:

Simple text, formatted text, font styles

International language including direction of composition

Justification of text, direction of streaming

h) Animated streams:

Dynamic state information (position and attitude, FBA, 2D mesh)

i) Mixing of 2D and 3D objects

j) Face and Body objects

Note

For exact definitions of some of the above items, refer to VRML node definitions.

Behavior is also a composition issue.

4.3.2 2D/3D Mesh Compression

Requirement

MPEG-4 shall provide means for efficient compression and streaming or downloading of 2D/3D vertex positions, normals, texture coordinates and topology.

Examples

• 3D elevation grid

• Face and body objects in the form of 3D polygon meshes.

• A 2D Delaunay mesh that can be represented by vertex positions.

4.3.3 Definition & Animation Parameter Compression

Requirement

MPEG-4 shall provide syntax and compression for Face Animation Parameters (FAP) and Face Definition Parameters (FDP), as well as Body Animation Parameters (BAP) and Body Definition Parameters (BDP).

Specification

It shall be possible to compress FAPs to 2 kbit/s.

Example

A baseline face in a decoder shall be capable of immediately receiving FAPs from the bitstream, to produce facial animation: expressions, speech, etc. without downloading a specific face. If FDPs are received, they can be used to transform a generic face into a particular face determined by its shape and (optional) texture. Such tailoring of the bitstream for terminal capability must recognize the performance capabilities and limitations of the terminal in set-up.

Note

Specification of body animation is under investigation.

4.3.4 Texture Mapping

Requirements

MPEG-4 shall support texture mapping on 2D/3D mesh.

Specification

Each texture dimension shall be an integer power of 2 (e.g. 16x16, 1024x1024).
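
A terminal or authoring tool could check the power-of-2 constraint with a test such as the one sketched below. The function names are hypothetical, and checking width and height independently (so that non-square sizes such as 64x16 pass) is an assumption; the examples in this section only show square textures.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; a power of two has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def valid_texture_size(width, height):
    """Check each dimension independently against the power-of-2 rule."""
    return is_power_of_two(width) and is_power_of_two(height)
```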

Note

Real time texture mapping capabilities are expected once the texture is loaded into texture memory. The mesh onto which texture is mapped can be regular or have an arbitrary shape.

Examples

• Mapping of a face image on a face mesh.

• Mapping of an aerial image on a grid mesh.

• Alpha blending of a still or moving texture onto a video object.

4.3.5 Text Overlay

Requirements

• MPEG-4 shall provide capability for text overlay.

• MPEG-4 shall allow standalone text overlay, in the absence of natural audio and video.

• Text overlay can be independent of the underlying A/V representation. MPEG-4 shall also provide capability to compose overlay text into layered spatial hierarchies that can be arranged and synchronized with spatial and temporal events in associated audio and video.

• MPEG-4 shall provide capability to use bitmapped text.

• MPEG-4 shall provide capability to animate text at slow or real-time rates for ready interpretation and comprehension, controllable by user and/or provider.

• MPEG-4 shall provide capabilities for spatial-temporal location and manipulation of text overlay.

• MPEG-4 shall support international character sets, and text composition.

Note

The MPEG-4 text overlay standard shall be designed to accommodate its easy incorporation into systems utilizing other existing and developing standards, e.g. MPEG-1 and MPEG-2.

Examples

• News program similar to "PointCast Network"

• Program guides for broadcast television

• User-selected electronic ticker tape information

• Low-bandwidth news delivery (broadcast, Internet, etc.)

• Real-time "insertion" of advertisements (e.g. product background information, event announcements, local phone numbers, etc.)

• Hyperlinked text in video

4.3.6 Image and Graphics Overlay

Requirements

• MPEG-4 shall provide capability for image and graphics overlay.

• MPEG-4 shall allow standalone image and graphics overlay, in the absence of natural audio and video.

• MPEG-4 shall provide image and graphics overlay that can be independent of the underlying A/V representation. MPEG-4 shall also provide capability to compose overlay images and graphics into layered spatial hierarchies that can be arranged and synchronized with spatial and temporal events in associated audio and video.

• MPEG-4 shall provide capability to use coded images and graphics based on existing standards.

• MPEG-4 shall provide capability to animate overlaid images and graphics at slow or real-time rates for ready interpretation and comprehension, controllable by user and/or provider.

• MPEG-4 shall provide capabilities for spatial-temporal location and manipulation of image and graphics overlays.

Notes

The MPEG-4 image and graphics overlay standard shall be designed to accommodate its easy incorporation into systems utilizing other existing and developing standards, e.g. MPEG-1 and MPEG-2.

Examples

• Low-bandwidth news delivery (broadcast, Internet, etc.)

• Special effects for advertising

• Real-time "insertion" of advertisements (e.g. local company logos on network programming, etc.)

• Hyperlinked images and graphics in video

4.3.7 View-Dependent Texture Scalability

Requirement

MPEG-4 shall provide means to change the spatial resolution of the texture data non-uniformly by taking into account the viewing conditions (viewpoint, aimpoint, lighting, etc.) and the 3D mesh on which the texture is to be mapped.

Specification

Forward channel bandwidth shall be up to 1 Mbit/s.

Note

Use of back channel may be required to transmit viewing conditions.

Example

An aerial view is mapped on a 3D grid mesh. The most visible regions of this texture are transmitted with a high quality and the least visible with the lowest quality (or may not be transmitted at all).

4.3.8 Geometrical transformations

Requirement

MPEG-4 shall provide cost-effective means to cope with a large number of geometrical transformations without significant effect on the quality of the final rendered data. Geometric transformations shall support relative positioning, scaling, and orientation of objects in scene composition.

Specification

2D and 3D transformations are to be supported:

a) linear affine;

b) non-linear or perspective affine;

c) bi-linear transformations.
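
The three transformation families listed above can be sketched for the 2D case as follows. The function names and parameter layouts are illustrative assumptions, and the 3D variants are not shown.

```python
def affine(p, a, b, c, d, tx, ty):
    """Linear affine map: [x', y'] = [[a, b], [c, d]] [x, y] + [tx, ty]."""
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)


def perspective(p, h):
    """Perspective map with a 3x3 matrix h (nested lists) and homogeneous divide."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def bilinear(p, corners):
    """Map (u, v) in the unit square onto the quadrilateral p00, p10, p01, p11."""
    u, v = p
    p00, p10, p01, p11 = corners
    return tuple((1 - u) * (1 - v) * p00[i] + u * (1 - v) * p10[i]
                 + (1 - u) * v * p01[i] + u * v * p11[i]
                 for i in range(2))
```

Relative positioning, scaling and orientation of an object are all special cases of the affine map. The perspective and bilinear maps additionally allow a rectangle to be warped onto a general quadrilateral.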

4.3.9 Video Object Tracking

Requirement

MPEG-4 shall support efficient coding of mesh-based video object tracking information. This includes coding of mesh geometry (once for each video object or a temporal segment of a video object) and one motion vector for each node point at each frame.

Specification

Applicable to all video objects considered for coding. It shall be possible to code video tracking as side information at 4-5 kbit/s.

Note

The inclusion of tracking information with a video object is optional.

Examples

• Animated texture or graphics overlay on a moving natural or synthetic video object.

• Synthetic transfiguration and augmented reality.

4.4 Requirements for Natural Audio Objects

4.4.1 Object Based Representation

Requirement

The MPEG-4 standard shall provide a representation of an audio scene understood as a composition of audio objects according to a script that describes their temporal (and possibly also spatial) relationship. The objects composing the scene will very often be associated with the scene content and thus with the meaningful objects in the scene (representation based on semantic criteria), but any other criteria may be used for the composition. No limitations on the audio content to be coded exist.

MPEG-4 shall support that the individual objects in a scene can be coded with different parameters, at different quality levels and with different coding algorithms.

4.4.2 Audio Content

Requirement

MPEG-4 Audio shall support a number of types of audio content. In the context of a profile, assumptions may be made about the nature of the content; this may have consequences on the type of tools used for this profile.

Specification

The following types of content are supported, specified together with their bandwidths: high quality audio (> 15 kHz), intermediate quality audio (< 15 kHz), wideband speech (50 Hz-7 kHz), narrowband speech (50 Hz-3.6 kHz), intelligible speech (300 Hz-3.4 kHz).

4.4.3 Object Based Bitstream Editing and Manipulation

Requirement

The MPEG-4 standard shall provide the means for editing an audio object.

Specification

The coarsest granularity for accessing the object is 0.5 seconds in the object’s time base.

4.4.4 Object Based Scalability

Requirement

MPEG-4 shall provide the tools and syntactic elements to achieve scalability with a fine granularity in terms of content and quality (includes bitstream, SNR, and bandwidth). These types of scalability shall result in a very flexible content-based scaling of the audio information.

Specification

The content of the audio objects has to be in a layered representation. The quality scalability layers shall be present in a maximum content bitstream (at the maximum allowed bitrate of 64 kbit/s per channel or audio object). The bit rates for the individual enhancement layers are 1 kbit/s at bit rates below 16 kbit/s and 8 kbit/s at bit rates above 16 kbit/s.
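
One way to read the layer step sizes above is as a ladder of cumulative bitrates, sketched below. The 2 kbit/s starting point and the 64 kbit/s ceiling are borrowed from the fixed bitrate range given under Bitrate Modes (4.4.11); using them as base layer and maximum, and applying the 8 kbit/s step from exactly 16 kbit/s upwards, are assumptions made for this example.

```python
def layer_bitrates(base_kbps=2, max_kbps=64):
    """Cumulative bitrates (kbit/s) reached after each enhancement layer."""
    rates = [base_kbps]
    while True:
        # 1 kbit/s steps below 16 kbit/s, 8 kbit/s steps from there on.
        step = 1 if rates[-1] < 16 else 8
        nxt = rates[-1] + step
        if nxt > max_kbps:
            break
        rates.append(nxt)
    return rates
```

Under these assumptions the ladder runs 2, 3, ..., 16, 24, 32, ..., 64 kbit/s, i.e. a base layer plus 20 enhancement layers.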

4.4.5 Object-based Random Access and User Controls

Requirement

MPEG-4 shall support the ability of the user to interactively control playback of audio information recorded on digital storage media including the ability of the user to override certain default system settings, controlling the setting at the start of real-time applications and during the course of a real-time application. A number of these system settings are used to control the audio encoder and/or decoder. The interactive operation can be applied on the entire scene or on one or more objects.

Specification

a) Modes supported shall include: turning an audio object on or off; choosing levels of speech and audio quality; and playback controls such as random access, forward play (normal, fast forward, and slow forward), reverse play (normal, fast reverse, and slow reverse), and pause. A low delay mode for these functions shall be supported. In the fast playback modes the reproduction speed may be fixed (two times normal speed) or flexible (Jog-Shuttle related speed).

b) Usable audio shall be provided 0.5 seconds after entering the bitstream.

4.4.6 Time Scale Change

Requirement

MPEG-4 audio shall be able to replay audio objects at a different speed without changing the pitch and without annoying quality degradation.

Specification

It should be possible to change the speed of a decoder output signal up to +/- 50 percent in fine steps.

4.4.7 Pitch Change

Requirement

MPEG-4 audio shall be able to replay audio objects at a different pitch without changing the speed and without annoying quality degradation.

Specification

It should be possible to change the pitch of a decoder output signal up to +/- 30 percent in fine steps.
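
For readers who think of pitch in musical intervals, the percentage range above can be converted into semitones. The sketch below assumes equal temperament (12 semitones per octave); the helper name is hypothetical.

```python
import math


def pitch_change_in_semitones(percent):
    """Convert a relative pitch change in percent into equal-tempered semitones."""
    ratio = 1.0 + percent / 100.0
    if ratio <= 0:
        raise ValueError("pitch ratio must be positive")
    return 12.0 * math.log2(ratio)
```

A change of +30 percent corresponds to roughly 4.5 semitones up, and -30 percent to roughly 6.2 semitones down, so the range is not symmetric when expressed as a musical interval.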

4.4.8 Robustness to Information Errors and Loss

Requirement

The MPEG-4 standard shall provide the tools to achieve error resilient audio streams either in terms of bit errors or cell loss (e.g. varying channel bandwidth). This includes support for low bitrate applications.

The error protection may be provided in an object-based way, which means that some objects receive different protection than others. Error resilience should consider concealment, fault tolerance, graceful degradation and graceful recovery, also in an object-based way. It shall be possible to switch off error protection if there is no need for it.

Specification

• MPEG-4 Audio provides the ability to withstand Random Errors and produce usable audio (as defined under Quality below) with a BER up to 10^-4.

• MPEG-4 Audio provides the ability to withstand Burst Errors and produce usable audio (as defined under Quality below) with an average BER up to 10^-3 and an average burst length of 10 ms.

• MPEG-4 Audio provides for an Error Recovery Time of 80 ms.

• MPEG-4 Audio provides capability for Data Prioritization, Error Detection (corrupt data, insertion, deletion), and Error Concealment.

4.4.9 Delay Modes

Definition

a) Initial Delay is the audio codec’s contribution to the time between when the communication channel is established and when the audio material presentation is begun.

b) Algorithmic Delay is the audio codec's contribution to the time between when the data (excluding initial data whose delay is specified by the Initial Delay) is acquired at the encoding unit and when the data is presented from the decoding unit.

c) Control Response Delay is the audio codec's contribution to the time between when a control command is issued at the decoding unit and when the effect of the command is presented from the decoding unit.

Requirement

MPEG-4 Audio shall support modes with several delay characteristics. This shall include modes with low end-to-end delay, as well as modes with low decoding delay.

Specification

a) A mode with a maximum Initial Delay of 200 ms shall be supported.

b) A mode with a maximum Algorithmic Delay of 20 ms shall be supported.

c) A mode with a maximum Control Response Delay of 200 ms shall be supported.

Delay modes have to be specified for different profiles.

4.4.10 Complexity Modes

Requirement

MPEG-4 Audio shall provide various complexity modes, including low complexity audio encoders and decoders. Complexity scalability shall also be supported, i.e. the possibility to decode a bitstream with a low complexity decoder delivering reduced audio quality and with a decoder of higher complexity delivering the highest possible quality.

Specification

Profile dependent specification.

4.4.11 Bitrate Modes

Requirement

MPEG-4 Audio shall provide the capability for operation at fixed as well as at variable bitrates. In case of variable rate coding, the bitrate can be controlled either by the encoder or by external parameters.

Specification

Fixed bitrate modes: operating bitrates between 2 kbit/s and 64 kbit/s shall be supported. This range may be extended upwards.

Variable rate coding with average rates above 1 kbit/s shall be supported. The syntax shall carry information about the currently used bitrate.

Coding at extremely low bit rates (several hundred bit/s) shall be supported to provide intelligible speech at the highest possible quality.

4.4.12 Downmix

Requirement

MPEG-4 Audio shall provide the capability to reduce the number of channels to a configuration with a lower number of channels for presentation purposes (e.g. listening to multi-channel audio using stereophonic reproduction).

Specification

To be defined for specific profiles.
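
Since the downmix rules are left to the profiles, the following is only a sketch of what a 5.1 to stereo downmix might look like. The -3 dB (1/sqrt(2)) gains for the centre and surround channels and the discarded LFE channel are commonly used conventions assumed here; they are not taken from this document, and no output normalization is applied.

```python
import math


def downmix_5_1_to_stereo(l, r, c, lfe, ls, rs, gain=1 / math.sqrt(2)):
    """Fold one 5.1 sample frame (L, R, C, LFE, Ls, Rs) down to stereo.

    The LFE sample is accepted for completeness but discarded in this sketch.
    """
    left = l + gain * c + gain * ls
    right = r + gain * c + gain * rs
    return left, right
```

A real implementation would also need to guard against clipping, for example by scaling the result.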

4.4.13 Transcoding

Requirement

MPEG-4 Audio shall provide the capability to easily transcode to and from other previously standardized coding schemes.

Specification

To be specified in the context of profiles.

4.4.14 Tandem Coding

Requirement

MPEG-4 Audio shall provide the capability of tandem connection with an identical codec or with a specifically identified codec while still meeting the basic audio quality requirement.

Specification

To be specified in the context of profiles.

4.4.15 Audio Formats

Requirement

MPEG-4 Audio shall support a number of audio formats, as defined by the sampling frequency, amplitude resolution (dynamic range), quantizer characteristic, and the number of channels.

Specification

Supported sampling frequencies (in kHz): 8, 11.025, 12, 16, 22.05, 24, 32, 44.1, 48, 96

Amplitude resolution: up to 24 bit/sample

Number of channels: up to 8 audio channels per audio object, including support for monaural, stereo, 3/0 and 5.1 channel configurations.

4.4.16 Improved Coding Efficiency

Requirement

If no other functionalities are provided, the quality should exceed that of existing standards at given bit rates. If standards exist which provide sufficient quality, a set of them will be included in MPEG-4, i.e. the syntax shall provide the capability of addressing the corresponding algorithms.

High coding efficiency shall provide intelligible speech at several hundred bps. The syntax shall carry information to aid in improving the speech quality and providing speaker personalization. Synthesis tools shall have the greatest possible compatibility with the tools developed for the Text to Speech functionality defined under section 4.5 below.

Specification

Reference standard codecs are: FS1016, G.723.1, MPEG-1, MPEG-2.

4.5 Requirements for Synthetic Audio Objects

4.5.1 Low Bit Rate Speech

Requirement

MPEG-4 shall provide a low bit-rate speech coder.

Specification

Speech coding compression shall support intelligible speech at 2 kbit/s.

4.5.2 Synthetic Speech Data

Requirement

MPEG-4 shall support decoding synthetic speech segments represented by phonemes, spectral sequences or waveform sequences which are used as synthesis units to obtain synthetic speech.

Specification

4.5.3 Text to Speech

Requirement

MPEG-4 shall support an extended TTS functionality which converts written text (in some cases, text with auxiliary information such as F0 contour, phoneme duration, and/or amplitude of each phoneme) into synthetic speech.

Example

MPEG-4 Story Teller on Demand (STOD)

In the STOD application, users can select a story from a huge database of story libraries which are stored on hard disks or CDs. The STOD system reads the story aloud via MPEG-4 TTS with MPEG-4 facial animation and with appropriately selected scenes. The user can stop and resume speaking at any moment desired via the user interfaces of the local machine (e.g. mouse or keys). The user can also select the gender, age, volume, and speech rate of the electronic story teller.

4.5.4 Downloading TTS & Speech Data

4.5.5 Sound Synthesis

Specification

MPEG-4 shall provide functionality for normative delivery of synthetic sound content, including the delivery of synthesis algorithms created by content providers.

MPEG-4 shall provide bounded-algorithmic-complexity and limited-RAM-complexity synthesis methods for use in lower-capability terminals. Low-complexity synthesis shall still provide as much normative sound quality, functionality and flexibility as possible.

MPEG-4 intends to allow compatibility with existing music-synthesis standards insofar as this requirement does not unduly limit the technical sophistication of the MPEG-4 standard.

Detail

MPEG-4 shall support known methods of sound synthesis, including but not limited to wavetable synthesis, physical-modelling synthesis, sinusoidal (additive) synthesis, granular synthesis, FM synthesis, and non-parametric hybrids of these methods.

MPEG-4 shall support control of synthesis at arbitrary time granularity and in arbitrary manner; that is, with arbitrary and author-defined mapping from control parameters to sound realizations.

MPEG-4 shall support the download of techniques for algorithmic musical composition.

MPEG-4 shall support the description, parametric or algorithmic, of sound-effects methods and their application to synthetic and natural audio sources.

MPEG-4 shall support the synchronization, to 1 ms or finer accuracy, of natural and synthetic audio sources.

MPEG-4 shall support the integration of terminal input devices, such as MIDI keyboards and microphones, into the control and synthesis processes.

Examples

Music synthesis. Networked and broadcast distribution of new musical compositions. Sound effects for virtual reality applications and other virtual environments. Internet-based karaoke. Interactive music applications. Sound effects and interactive music for video games.

4.6 Requirements for Delivery Multimedia Integration Format

4.6.1 Connectivity

Requirement

- DMIF shall allow any DMIF end-system to be connected to any other while preserving individual network views.

- DMIF shall allow the management of Inter-working Units between two dissimilar networks.

Specification

DMIF shall support the possibility of simultaneously accessing more than one network at a DMIF end system.

4.6.2 Transparency

Requirement

DMIF shall make content location and access transparent to an application.

Specification

DMIF shall support both content retrieval from and recording at a local or remote storage location.

Note

This requirement applies to both the sending and the receiving side.

4.6.3 Application Service Enablement

Requirement

DMIF shall allow the delivery of elementary streams to individual services

Specification

DMIF shall be transparent to service names which shall be defined and communicated by DMIF applications

4.6.4 End-to-end QoS Management

Requirement

DMIF shall ensure that a QoS defined by the content authors and selected by an end-user is consistently applied and met across the DMIF end-systems and the heterogeneous networks involved in a session.

Specification

- DMIF shall allow network Session and Resource Management.

- DMIF shall accommodate intermediate networks that do not lend themselves to Session and Resource Management.

- If QoS requirements cannot be met, DMIF shall notify the application.

- DMIF shall allow notification of the available QoS to the application.

4.6.5 Network based Stream Processing and Management

Requirement

DMIF shall allow resources in the network to be shared by different Service Providers for maximum efficiency

Specification

- DMIF shall apply the QoS criteria to the Stream-Processing-and-Management Resources

- DMIF shall provide logging capability of the usage of Stream-Processing-and-Management resources in individual sessions

References

[1] MPEG-4 project description, ISO/IEC JTC1/SC29/WG11 N1177, Munich MPEG meeting, January 1996.

Annex A Synthetic & Natural Scene Definitions

The following definitions still need review; once reviewed, they can be moved to the main body of the document.

Scene

A complete composition of the instantiation(s) of specific types of AV objects in bounded space and time, often independent of specific and varied conditions under which the scene may be experienced (viewed or heard), and usually sufficiently self-contained to support a user in performing a task.

Behavior

The action or reaction of any AV object within a scene defined by a temporal model (e.g. physics, decision criteria, locomotion, parametric or frame-based animation, scenario, script, story board, lifelike autonomy, group dynamics), typically independent of the frame rate of presentation.

2D/3D Object

A structural abstraction collecting together the constituent 2D or 3D parts of an AV object whose elements are expected to share certain spatial characteristics or temporal behavior. A 2D/3D object may consist of elementary parts (e.g. polygons with attributes) or other objects (e.g. video object, audio object, 2D/3D object) arranged in space and time with respect to each other.

Surface

In general, a bounded 2-dimensional region, which may be planar or curved, defined by points, intersecting lines, space curves, parametric functions, or other spatial-temporal constraints.

Polygon

A closed planar surface defined by points, usually in a way that guarantees planarity, with attributes that support computation of the appearance and visibility of the surface. Polygons can be defined by ordered lists of points given as references to vertices shared within an object.

Vertex

A point defined by Cartesian coordinates in a specific 2D or 3D object space. A vertex may serve to locate an object (e.g. audio source, 2D/3D object) or to define specific shapes within an object.
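As an illustration of the Polygon and Vertex definitions, the following minimal sketch stores the vertices of an object once and defines each polygon as an ordered list of references (indices) into that shared list. The data layout and the is_planar helper are assumptions made for illustration, not a representation required by this document.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]

# Shared vertex list for one object: the corners of a unit square plus an apex.
vertices: List[Vertex] = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.5, 0.5, 1.0),
]

# Each polygon is an ordered list of indices into the shared vertex list.
polygons: List[List[int]] = [
    [0, 1, 2, 3],  # square base
    [0, 1, 4],     # one triangular side
]


def is_planar(poly: Sequence[int], verts: Sequence[Vertex],
              tol: float = 1e-9) -> bool:
    """Return True if every vertex lies in the plane of the first three.

    If the first three vertices are collinear the plane is undefined and
    this simple check passes trivially.
    """
    if len(poly) <= 3:
        return True  # three points always lie in a common plane
    p0, p1, p2 = (verts[i] for i in poly[:3])
    u = tuple(b - a for a, b in zip(p0, p1))
    v = tuple(b - a for a, b in zip(p0, p2))
    # Plane normal from the cross product of two edges.
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    for i in poly[3:]:
        w = tuple(b - a for a, b in zip(p0, verts[i]))
        if abs(sum(nc * wc for nc, wc in zip(n, w))) > tol:
            return False
    return True
```

Triangles are planar by construction, which is one common way of guaranteeing planarity.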

Analytic Text

2D/3D text modeled by space curves or functions that are inherently resolution-scalable.

Texture

A 2-dimensional array of regular radiometric samples in space and time (e.g. color, intensity, transparency) which describes the appearance of an object (e.g. surface, frame, view pane, window, or scene). Texture is applied to an underlying surface or pane, usually with a spatial or temporal transform.

Texture Map

A rectangular array of texture that, after decoding, represents an image independent of its presentation, sometimes with supporting layers for resolution scalability or progressive transmission.

Animated Texture

The time sequencing of texture states applied within a scene, which may or may not involve the alteration of the contents of a specified texture map (e.g. motion or warping of a given texture map with a dynamic transformation, sequencing of texture maps to obtain a "flip book" effect).
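The "flip book" case can be sketched as choosing a texture map from the presentation time rather than from a frame counter. The function name and its parameters are hypothetical.

```python
def flip_book_index(t_seconds: float, n_maps: int, maps_per_second: float) -> int:
    """Return the index of the texture map shown at time t, looping forever."""
    return int(t_seconds * maps_per_second) % n_maps
```

Because the index depends only on time, the same sequence is obtained whatever the frame rate of the presentation.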

Attribute

Elementary, auxiliary information that details the specification of an object with respect to its shape, appearance, or behavior; it is indivisible and usually not subject to manipulation.

Transformation

A logical or mathematical mapping from one object space or state to another (e.g. location/orientation of an object in a scene, the current warping of a 2D mesh to track underlying video).
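For the location/orientation example, a transformation can be sketched as a function that maps points from the object space into the scene space. The 2D rotation-plus-translation form and the name make_placement are assumptions made for illustration.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float]


def make_placement(angle_deg: float, tx: float, ty: float) -> Callable[[Point], Point]:
    """Return a mapping from object-space points to scene-space points."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))

    def apply(p: Point) -> Point:
        x, y = p
        # Rotate about the object origin, then translate into the scene.
        return (c * x - s * y + tx, s * x + c * y + ty)

    return apply


place = make_placement(90.0, 10.0, 0.0)
corner = place((1.0, 0.0))  # approximately (10.0, 1.0)
```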

Parameter

Control data, subject to manipulation by the user or content developer, that alters the shape, appearance, or behavior of an object through specifying events usually originating outside the object.

Visual Object

Any elementary or composite AV object which presents visually (e.g. video object, 2D/3D synthetic object, text object, any meaningful combination of these objects).

Note:

Special cases of a Visual Object are a Video Object and a Facial Animation Object.

Audio Object

A sound source, or a composition of sound sources or channels, specified by a model of the elementary behavior and layout of real-world or synthetic sound(s) approximated by the model (e.g. wave table, sample stream, TTS stream, control stream that drives algorithmic or sample-based synthesis of music, program defining a voice), localized within an explicit or implicit 2D or 3D aural environment.

Video Object

A temporal stream of moving images, or a spatial and temporal composition of such streams within a scene, with any supporting data for temporal, resolution, and quality scalability. A degenerate video object is a texture map. Audio and video objects are potentially independent of the viewing conditions and frame rate of their presentation, and are interpreted by a presentation layer.

Scene Composition

A hierarchical nesting of AV objects in space and time that supports the content-based access, manipulation, and scalability requirements of MPEG-4. Such a nesting may:

1. Compose spatial-temporal scenes from efficient shared representations of objects;

2. Provide alternative or subordinated representations of AV objects for scalability;

3. Provide space-time partitioning to manage large data sets and user access to portions of them;

4. Represent structural relationships that enable user manipulation or interaction between objects;

5. Provide granularity or priorities for selecting scene elements to present within terminal resources.

A scene composition may include local and/or remote, streaming and/or downloaded objects that are named, identified uniquely, spatially located, and synchronized to each other in a specific view.
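The hierarchical nesting described above can be sketched as a tree in which each AV object is placed in space and time relative to its parent. The Node class, its fields, and the priority-based selection (compare item 5) are hypothetical and serve only to illustrate the definition.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical scene node: an AV object placed relative to its parent."""
    name: str
    offset: Tuple[float, float] = (0.0, 0.0)  # spatial offset from the parent
    start: float = 0.0                        # start time relative to the parent, in seconds
    priority: int = 0                         # larger means more important
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node, origin=(0.0, 0.0), t0=0.0) -> Iterator[tuple]:
    """Yield (name, absolute position, absolute start time, priority) per node."""
    pos = (origin[0] + node.offset[0], origin[1] + node.offset[1])
    start = t0 + node.start
    yield node.name, pos, start, node.priority
    for child in node.children:
        yield from flatten(child, pos, start)


def select_for_terminal(scene: Node, max_objects: int) -> List[str]:
    """Keep the highest-priority objects that fit a terminal's object budget."""
    ranked = sorted(flatten(scene), key=lambda item: item[3], reverse=True)
    return [name for name, _, _, _ in ranked[:max_objects]]


scene = Node("scene", priority=3, children=[
    Node("background", priority=1),
    Node("presenter", offset=(100.0, 50.0), start=2.0, priority=2, children=[
        Node("voice", start=0.5, priority=2),
    ]),
])
```

Here the "voice" object inherits the position and start time of "presenter", so moving or delaying the parent moves or delays the whole subtree.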

Presentation Conditions

Any specific viewing or listening conditions imposed during a specific user session, which are not embedded in the scene composition of AV objects.