We all use MPEG-4, or what we commonly call MP4 files. The main reason we use it is its good picture and sound quality, but does anyone know its history? Where does the format come from? Who invented it? So here is an introduction to MPEG-4. This is only the first part; the next part will be posted soon. I hope you will like it.
INTRODUCTION
1.1 What is MPEG-4?
MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed MPEG-1 and MPEG-2, the standards that made interactive video on CD-ROM, DVD and Digital Television possible. MPEG-4 is the result of another international effort involving hundreds of researchers and engineers from all over the world. MPEG-4, whose formal ISO/IEC designation is 'ISO/IEC 14496', was finalized in October 1998 and became an International Standard in the first months of 1999. The fully backward compatible extensions under the title of MPEG-4 Version 2 were frozen at the end of 1999, and acquired formal International Standard status early in 2000. Several extensions have been added since, and work on some specific work items is still in progress.
MPEG-4 builds on the proven success of three fields:
Digital television;
Interactive graphics applications (synthetic content);
Interactive multimedia (World Wide Web, distribution of and access to content)
MPEG-4 provides the standardized technological elements enabling the integration of the production, distribution and content access paradigms of the three fields.
The standard, developed over five years by the Moving Picture Experts Group (MPEG) of the Geneva-based International Organization for Standardization (ISO), explores every possibility of the digital environment. Recorded images and sounds co-exist with their computer-generated counterparts; a new language for sound promises compact-disk quality at extremely low data rates; and the multimedia content could even adjust itself to suit the transmission rate and quality.
Possibly the greatest of the advances made by MPEG-4 is that viewers and listeners need no longer be passive. The height of "interactivity" in audiovisual systems today is the user's ability merely to stop or start a video in progress. MPEG-4 is completely different: it allows the user to interact with objects within the scene, whether they derive from so-called real sources, such as moving video, or from synthetic sources, such as computer-aided design output or computer-generated cartoons. Authors of content can give users the power to modify scenes by deleting, adding, or repositioning objects, or to alter the behavior of the objects; for example, a click on a box could set it spinning.
Perhaps the most immediate need for MPEG-4 is defensive. It supplies tools with which to create uniform (and top-quality) audio and video encoders and decoders on the Internet, preempting what may become an unmanageable tangle of proprietary formats. For example, users must choose among video formats such as QuickTime (from Apple Corp., Cupertino, Calif.), AVI (from Microsoft Corp., Redmond, Wash.), and RealVideo (from RealNetworks Inc., Seattle, Wash.)--as well as a bewildering number of formats for audio.
In addition to the Internet, the standard is also designed for low bit-rate communications devices, which are usually wireless. For example, mobile receivers and "Dick Tracy" wristwatches with video will have far greater success now that the standard is in place. But whether wired or not, devices can have differing access speeds depending on the type of connection and traffic. In response, MPEG-4 supports scalable content, that is, it allows content to be encoded once and automatically played out at different rates with acceptable quality for the communication environment at hand.
On the other end of the quality/bit-rate scale, future television sets will no doubt accept content from both broadcast and interactive digital sources. Accordingly, MPEG-4 provides tools for seamlessly integrating broadcast content with equally high-quality interactive MPEG-4 objects. The expectation is for content of broadcast-grade quality to be displayed within World Wide Web screen layouts that are as varied as their designers can make them. However, the standard's potential for encoding individual objects with the extremely high quality needed in studios has the recording industries very much on the alert.
Recently, digital copying of audio from the Internet has become a popular and--to the music industry--increasingly worrying practice. For video, the same situation will arise when MPEG-4 encoding and higher bandwidths become widespread and as digital storage prices continue to drop. Accordingly, MPEG designed in features for the protection of intellectual property and digital content.
1.2 Versions
MPEG-4 Version 1 was approved by MPEG in December 1998; version 2 was frozen in December 1999. After these two major versions, more tools were added in subsequent amendments that could be qualified as versions, even though they are harder to recognize as such. Recognizing the versions is not too important, however; it is more important to distinguish Profiles. Figure 1 below depicts the relationship between the versions. Version 2 is a backward compatible extension of Version 1, and version 3 is a backward compatible extension of Version 2 – and so on. The versions of all major parts of the MPEG-4 Standard (Systems, Audio, Video, DMIF) were synchronized; after that, the different parts took their own paths.
2. ABOUT MPEG-4
2.1 The Objectives and Achievements
Three major trends - the mounting importance of audiovisual media on all networks, increasing mobility, and growing interactivity - have driven, and still drive, the development of the MPEG-4 standard.
To address the identified needs and requirements, a standard was needed that could:
Efficiently represent a number of data types:
Video from very low bitrates to very high quality conditions;
Music and speech data for a very wide bitrate range, from transparent music to very low bitrate speech;
Generic dynamic 3-D objects as well as specific objects such as human faces and bodies;
Speech and music to be synthesized by the decoder, including support for 3-D audio spaces;
Text and graphics;
Provide, in the encoding layer, resilience to residual errors for the various data types, especially under difficult channel conditions such as mobile ones;
Independently represent the various objects in the scene, allowing independent access for their manipulation and re-use;
Compose audio and visual, natural and synthetic, objects into one audiovisual scene;
Describe the objects and the events in the scene;
Provide interaction and hyperlinking capabilities;
Manage and protect intellectual property on audiovisual content and algorithms, so that only authorized users have access.
Provide a delivery media independent representation format, to transparently cross the borders of different delivery environments.
A major difference with previous audiovisual standards, at the basis of the new functionalities, is the object-based audiovisual representation model that underpins MPEG-4 (see Figure 1). An object-based scene is built using individual objects that have relationships in space and time, offering a number of advantages. First, different object types may have different suitable coded representations—a synthetic moving head is clearly best represented using animation parameters, while video benefits from a smart representation of pixel values. Second, it allows harmonious integration of different types of data into one scene: an animated cartoon character in a real world, or a real person in a virtual studio set. Third, interacting with the objects and hyperlinking from them is now feasible. There are more advantages, such as selective spending of bits, easy re-use of content without transcoding, providing sophisticated schemas for scalable content on the Internet, etc.
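To make the object-based model concrete, here is a minimal sketch in Python of a scene built from independently represented objects placed in space and time. All class and field names are hypothetical illustrations, not types defined by the MPEG-4 standard; the point is only that a synthetic face carries animation parameters while a video object carries pixel data, and the scene merely records their spatio-temporal relationships.

```python
# Hypothetical sketch: an object-based scene keeps each object's own coded
# representation and only records how the objects relate in space and time.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MediaObject:
    name: str
    position: Tuple[float, float]   # placement in the scene's 2-D coordinate system
    start_time: float               # seconds on the scene timeline


@dataclass
class VideoObject(MediaObject):
    # natural video: a "smart representation of pixel values"
    frames: List[bytes] = field(default_factory=list)


@dataclass
class SyntheticFace(MediaObject):
    # synthetic head: best represented by animation parameters, not pixels
    animation_params: List[float] = field(default_factory=list)


@dataclass
class Scene:
    objects: List[MediaObject] = field(default_factory=list)

    def add(self, obj: MediaObject) -> None:
        self.objects.append(obj)


# An animated cartoon character composited into a real-world background:
scene = Scene()
scene.add(VideoObject("real_background", (0.0, 0.0), 0.0))
scene.add(SyntheticFace("cartoon_head", (120.0, 80.0), 2.5,
                        animation_params=[0.1, -0.3]))
```

Because each object keeps its own representation, the same cartoon head can be reused in a different scene without transcoding the background video.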
The applications that benefit from what MPEG-4 brings are found in many, and very different, environments. Therefore, MPEG-4 is constructed as a tool-box rather than a monolithic standard, using profiles that provide solutions in these different settings (see Section 2.5 on Profiles in MPEG-4). This means that although MPEG-4 is a rather big standard, it is structured so that solutions are available that match the needs at hand. It is the task of each implementer to extract from the MPEG-4 standard the technological solutions adequate to their needs, which are very likely a small subset of the standardized tools.
The MPEG-4 requirements have been addressed by the 6 parts of the recently finalized MPEG-4 Version 1 standard, notably:
Part 1: Systems - specifies scene description, multiplexing, synchronization, buffer management, and management and protection of intellectual property;
Part 2: Visual - specifies the coded representation of natural and synthetic visual objects;
Part 3: Audio - specifies the coded representation of natural and synthetic audio objects;
Part 4: Conformance Testing - defines conformance conditions for bitstreams and devices; this part is used to test MPEG-4 implementations;
Part 5: Reference Software - includes software corresponding to most parts of MPEG-4 (normative and non-normative tools); it can be used for implementing compliant products as ISO waives the copyright of the code;
Part 6: Delivery Multimedia Integration Framework (DMIF) - defines a session protocol for the management of multimedia streaming over generic delivery technologies.
Parts 1 to 3 and 6 specify the core MPEG-4 technology, while Parts 4 and 5 are "supporting parts". Parts 1, 2 and 3 are delivery independent, leaving to Part 6 (DMIF) the task of dealing with the idiosyncrasies of the delivery layer. While the various MPEG-4 parts are rather independent and can thus be used by themselves, or combined with proprietary technologies, they were developed so that the maximum benefit results when they are used together.
2.2 Scope and features of the MPEG-4 standard
The MPEG-4 standard provides a set of technologies to satisfy the needs of authors, service providers and end users alike.
For authors, MPEG-4 enables the production of content that has far greater reusability and flexibility than is possible today with individual technologies such as digital television, animated graphics, and World Wide Web (WWW) pages and their extensions. Also, it is now possible to better manage and protect content owner rights.
For network service providers MPEG-4 offers transparent information, which can be interpreted and translated into the appropriate native signaling messages of each network with the help of relevant standards bodies. The foregoing, however, excludes Quality of Service considerations, for which MPEG-4 provides a generic QoS descriptor for different MPEG-4 media. The exact translations from the QoS parameters set for each media to the network QoS are beyond the scope of MPEG-4 and are left to network providers. Signaling of the MPEG-4 media QoS descriptors end-to-end enables transport optimization in heterogeneous networks.
For end users, MPEG-4 brings higher levels of interaction with content, within the limits set by the author. It also brings multimedia to new networks, including those employing relatively low bitrate, and mobile ones. An MPEG-4 applications document exists on the MPEG Home page (www.cselt.it/mpeg), which describes many end user applications, including interactive multimedia broadcast and mobile communications.
For all parties involved, MPEG seeks to avoid a multitude of proprietary, non-interworking formats and players.
MPEG-4 achieves these goals by providing standardized ways to:
represent media objects (units of aural, visual or audiovisual content, of natural or synthetic origin) and describe the composition of these objects to create compound media objects that form audiovisual scenes;
multiplex and synchronize the data associated with media objects, so that they can be transported over network channels providing a QoS appropriate for the nature of the specific media objects; and
interact with the audiovisual scene generated at the receiver’s end.
The following sections illustrate the MPEG-4 functionalities described above.
2.2.1 Coded representation of media objects
MPEG-4 audiovisual scenes are composed of several media objects, organized in a hierarchical fashion. At the leaves of the hierarchy, we find primitive media objects, such as:
Still images (e.g. as a fixed background);
Video objects (e.g. a talking person, without the background);
Audio objects (e.g. the voice associated with that person, background music);
MPEG-4 standardizes a number of such primitive media objects, capable of representing both natural and synthetic content types, which can be either 2- or 3-dimensional. In addition to the media objects mentioned above and shown in Figure 1, MPEG-4 defines the coded representation of objects such as:
Text and graphics;
Talking synthetic heads and associated text used to synthesize the speech and animate the head; animated bodies to go with the faces;
Synthetic sound.
A media object in its coded form consists of descriptive elements that allow handling the object in an audiovisual scene as well as of associated streaming data, if needed. It is important to note that in its coded form, each media object can be represented independent of its surroundings or background.
The coded representation of media objects is as efficient as possible while taking into account the desired functionalities. Examples of such functionalities are error robustness, easy extraction and editing of an object, or having an object available in a scaleable form.
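The independence from the background is easiest to see for an arbitrary-shaped video object, which is carried as texture plus a shape (alpha) mask. The sketch below, which assumes NumPy is available and uses invented variable names, shows how a receiver could composite such an object over any background it chooses.

```python
# Minimal illustration: an arbitrary-shaped object = texture + alpha (shape) mask,
# so the terminal can place it over any background at composition time.
import numpy as np


def composite(background: np.ndarray, texture: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Alpha-blend a decoded video object onto an arbitrary background.

    background, texture: H x W x 3 arrays of floats in [0, 1]
    alpha:               H x W array of floats in [0, 1] (the object's shape mask)
    """
    a = alpha[..., None]                      # broadcast the mask over the colour channels
    return a * texture + (1.0 - a) * background


# The same talking-person object can be put in front of a studio set or a beach:
h, w = 4, 4
person_texture = np.random.rand(h, w, 3)
person_shape = np.zeros((h, w))
person_shape[1:3, 1:3] = 1.0                  # crude rectangular object mask
studio_background = np.zeros((h, w, 3))
frame = composite(studio_background, person_texture, person_shape)
```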
2.2.2 Composition of media objects
Compound media objects group primitive media objects together. Primitive media objects correspond to leaves in the descriptive tree while compound media objects encompass entire sub-trees. As an example: the visual object corresponding to the talking person and the corresponding voice are tied together to form a new compound media object, containing both the aural and visual components of that talking person.
Such grouping allows authors to construct complex scenes, and enables consumers to manipulate meaningful (sets of) objects.
More generally, MPEG-4 provides a standardized way to describe a scene, allowing for example to:
Place media objects anywhere in a given coordinate system;
Apply transforms to change the geometrical or acoustical appearance of a media object;
Group primitive media objects in order to form compound media objects;
Apply streamed data to media objects, in order to modify their attributes (e.g. a sound, a moving texture belonging to an object; animation parameters driving a synthetic face);
Change, interactively, the user’s viewing and listening points anywhere in the scene.
The scene description builds on several concepts from the Virtual Reality Modeling Language (VRML), in terms of both its structure and the functionality of object composition nodes, and extends them to fully enable the aforementioned features.
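As a rough illustration of such a scene description, the sketch below models a tiny VRML/BIFS-like scene tree in Python. The node classes are invented for the example and are not the normative BIFS node set; they only show grouping, placement via a transform, and streamed data feeding a node.

```python
# Hypothetical scene-tree sketch: grouping, transforms, and nodes fed by streams.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)


@dataclass
class Transform(Node):
    translation: Tuple[float, float] = (0.0, 0.0)   # place the subtree in the coordinate system
    scale: float = 1.0                              # change its geometrical appearance


@dataclass
class VideoNode(Node):
    stream_id: int = 0      # streamed data (a moving texture) attached to this node


@dataclass
class AudioNode(Node):
    stream_id: int = 0      # streamed sound attached to this node


# Compound object "talking person" = grouped video + audio, placed in the scene:
talking_person = Transform(
    translation=(100.0, 50.0),
    children=[VideoNode(stream_id=3), AudioNode(stream_id=4)],
)
scene_root = Node(children=[talking_person])
```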
2.2.3 Description and synchronization of streaming data for media objects
Media objects may need streaming data, which is conveyed in one or more elementary streams. An object descriptor identifies all streams associated to one media object. This allows handling hierarchically encoded data as well as the association of meta-information about the content (called ‘object content information’) and the intellectual property rights associated with it.
Each stream itself is characterized by a set of descriptors for configuration information, e.g., to determine the required decoder resources and the precision of encoded timing information. Furthermore, the descriptors may carry hints about the Quality of Service (QoS) the stream requests for transmission (e.g., maximum bit rate, bit error rate, priority).
Synchronization of elementary streams is achieved through time stamping of individual access units within elementary streams. The synchronization layer manages the identification of such access units and the time stamping. Independent of the media type, this layer allows identification of the type of access unit (e.g., video or audio frames, scene description commands) in elementary streams, recovery of the media object’s or scene description’s time base, and it enables synchronization among them. The syntax of this layer is configurable in a large number of ways, allowing use in a broad spectrum of systems.
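The following sketch illustrates the timing idea in simplified form: each access unit carries a timestamp on its stream's time base, and the receiver consumes units from different elementary streams in timestamp order. The field and function names are illustrative; the actual sync-layer syntax is configurable and richer than this.

```python
# Simplified sketch of synchronization via timestamped access units.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class AccessUnit:
    composition_time: float          # when to present, in seconds on the object time base
    stream_id: int
    payload: bytes = field(default=b"", compare=False)


def presentation_order(units):
    """Merge timestamped access units from several elementary streams in the
    order a compositor would present them."""
    heap = list(units)
    heapq.heapify(heap)
    while heap:
        au = heapq.heappop(heap)
        yield au.composition_time, au.stream_id


video = [AccessUnit(t / 25.0, stream_id=1) for t in range(3)]   # 25 fps video frames
audio = [AccessUnit(t / 50.0, stream_id=2) for t in range(6)]   # finer-grained audio frames
schedule = list(presentation_order(video + audio))
```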
2.2.4 Delivery of streaming data
The synchronized delivery of streaming information from source to destination, exploiting different QoS as available from the network, is specified in terms of the synchronization layer and a delivery layer containing a two-layer multiplexer, as depicted in Figure 1.
The “TransMux” (Transport Multiplexing) layer in Figure 1 models the layer that offers transport services matching the requested QoS. Only the interface to this layer is specified by MPEG-4 while the concrete mapping of the data packets and control signaling must be done in collaboration with the bodies that have jurisdiction over the respective transport protocol. Any suitable existing transport protocol stack such as (RTP)/UDP/IP, (AAL5)/ATM, or MPEG-2’s Transport Stream over a suitable link layer may become a specific TransMux instance. The choice is left to the end user/service provider, and allows MPEG-4 to be used in a wide variety of operation environments.
Figure 1 - The MPEG-4 System Layer Model
Use of the FlexMux multiplexing tool is optional and, as shown in Figure 1, this layer may be empty if the underlying TransMux instance provides all the required functionality. The synchronization layer, however, is always present.
With regard to Figure 1, it is possible to:
Identify access units, transport timestamps and clock reference information and identify data loss.
Optionally interleave data from different elementary streams into FlexMux streams
Convey control information to:
Indicate the required QoS for each elementary stream and FlexMux stream;
Translate such QoS requirements into actual network resources;
Associate elementary streams to media objects
Convey the mapping of elementary streams to FlexMux and TransMux channels
Parts of the control functionalities are available only in conjunction with a transport control entity like the DMIF framework.
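As a toy picture of this two-layer multiplex, the sketch below interleaves packets from several elementary streams into one FlexMux-like stream by tagging each packet with its channel number, and then hands the result to a stand-in for whatever TransMux the service provider picked (RTP/UDP/IP, AAL5/ATM, MPEG-2 TS, ...). The packet format is invented purely for illustration.

```python
# Toy two-layer multiplex: FlexMux-style interleaving over an abstract TransMux.
from typing import Dict, Iterator, List, Tuple


def flexmux_interleave(streams: Dict[int, List[bytes]]) -> Iterator[Tuple[int, bytes]]:
    """Round-robin interleave elementary-stream packets into (channel, payload) pairs."""
    queues = {channel: list(packets) for channel, packets in streams.items()}
    while any(queues.values()):
        for channel, queue in queues.items():
            if queue:
                yield channel, queue.pop(0)


def transmux_send(muxed: Iterator[Tuple[int, bytes]]) -> List[bytes]:
    """Stand-in for a concrete transport: frame each FlexMux packet with a 1-byte channel tag."""
    return [bytes([channel]) + payload for channel, payload in muxed]


packets = transmux_send(flexmux_interleave({1: [b"v0", b"v1"], 2: [b"a0", b"a1", b"a2"]}))
```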
2.2.5 Interaction with media objects
In general, the user observes a scene that is composed following the design of the scene’s author. Depending on the degree of freedom allowed by the author, however, the user has the possibility to interact with the scene. Operations a user may be allowed to perform include:
Change the viewing/listening point of the scene, e.g. by navigation through a scene;
Drag objects in the scene to a different position;
Trigger a cascade of events by clicking on a specific object, e.g. starting or stopping a video stream;
Select the desired language when multiple language tracks are available;
More complex kinds of behavior can also be triggered, e.g. a virtual phone rings, the user answers and a communication link is established.
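A bare-bones sketch of this author-scoped interactivity is shown below: the author attaches a handler to an object, and a user's click is routed to that handler, which here toggles a video stream. This event model is an illustrative simplification, not the BIFS sensor/ROUTE mechanism itself.

```python
# Illustrative sketch: clicking an object triggers whatever behaviour the author allowed.
class InteractiveObject:
    def __init__(self, name: str):
        self.name = name
        self.playing = False
        self.handlers = []                 # behaviours the author grants to the user

    def on_click(self, handler) -> None:
        self.handlers.append(handler)

    def click(self) -> None:
        for handler in self.handlers:
            handler(self)


def toggle_stream(obj: "InteractiveObject") -> None:
    obj.playing = not obj.playing
    print(f"{obj.name}: {'start' if obj.playing else 'stop'} video stream")


news_clip = InteractiveObject("news_clip")
news_clip.on_click(toggle_stream)          # granted by the scene's author
news_clip.click()                          # user clicks: stream starts
news_clip.click()                          # user clicks again: stream stops
```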
2.3 Major Functionalities in MPEG-4
This section lists, in an itemized fashion, the major functionalities that the different parts of the MPEG-4 Standard offer in the finalized MPEG-4 Version 1. Descriptions of these functionalities can be found in the following sections.
2.3.1 Transport
In principle, MPEG-4 does not define transport layers. In a number of cases, adaptation to a specific existing transport layer has been defined:
Transport over MPEG-2 Transport Stream (this is an amendment to MPEG-2 Systems)
Transport over IP (In cooperation with IETF, the Internet Engineering Task Force)
2.3.2 DMIF
DMIF, the Delivery Multimedia Integration Framework, is an interface between the application and the transport that allows the MPEG-4 application developer to stop worrying about the transport. A single application can run over different transport layers when supported by the right DMIF instantiation.
MPEG-4 DMIF supports the following functionalities:
A transparent MPEG-4 DMIF-application interface irrespective of whether the peer is a remote interactive peer, broadcast or local storage media.
Control of the establishment of FlexMux channels
Use of homogeneous networks between interactive peers: IP, ATM, mobile, PSTN, Narrowband ISDN.
Support for mobile networks, developed together with ITU-T
UserCommands with acknowledgment messages.
Management of MPEG-4 Sync Layer information.
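The sketch below conveys the spirit of the DMIF-application interface: the application talks to one abstract delivery interface, and a DMIF instance for the actual delivery technology (remote interactive, broadcast, or local storage) hides the differences. The method names are invented for the example and are not the normative DAI primitives.

```python
# Hypothetical sketch of a transport-independent delivery interface.
from abc import ABC, abstractmethod


class DeliverySession(ABC):
    @abstractmethod
    def attach(self, service_url: str) -> None: ...

    @abstractmethod
    def open_channel(self, elementary_stream_id: int) -> None: ...

    @abstractmethod
    def receive(self, elementary_stream_id: int) -> bytes: ...


class LocalFileSession(DeliverySession):
    """DMIF-like instance for local storage media."""

    def attach(self, service_url: str) -> None:
        self.path = service_url.removeprefix("file://")

    def open_channel(self, elementary_stream_id: int) -> None:
        pass                                # nothing to negotiate for local storage

    def receive(self, elementary_stream_id: int) -> bytes:
        return b"..."                       # would read the stream's next data from disk


def run_application(session: DeliverySession) -> None:
    # Application code stays the same whatever delivery technology sits underneath.
    session.attach("file://movie.mp4")
    session.open_channel(1)
    _ = session.receive(1)


run_application(LocalFileSession())
```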
2.4 Extensions underway
MPEG is currently working on a number of extensions:
2.4.1 The Animation Framework eXtension, AFX
The Animation Framework extension (AFX – pronounced ‘effects’) provides an integrated toolbox for building attractive and powerful synthetic MPEG-4 environments. The framework defines a collection of interoperable tool categories that collaborate to produce a reusable architecture for interactive animated contents. In the context of AFX, a tool represents functionality such as a BIFS node, a synthetic stream, or an audio-visual stream.
AFX utilizes and enhances existing MPEG-4 tools, while keeping backward-compatibility, by offering:
Higher-level descriptions of animations (e.g. inverse kinematics)
Enhanced rendering (e.g. multi-texturing, procedural texturing)
Compact representations (e.g. piecewise curve interpolators, subdivision surfaces)
Low bitrate animations (e.g. using interpolator compression and dead-reckoning)
Scalability based on terminal capabilities (e.g. parametric surfaces tessellation)
Interactivity at user level, scene level, and client-server session level
Compression of representations for static and dynamic tools
Compression of animated paths and animated models is required for improving the transmission and storage efficiency of representations for dynamic and static tools.
2.4.2 Advanced Video Coding
Work is ongoing on MPEG-4 Part 10, 'Advanced Video Coding'. This codec is being developed jointly with ITU-T, in the so-called Joint Video Team (JVT). The JVT unites the standards world's video coding experts in a single group. The work currently underway is based on earlier work in ITU-T on 'H.26L'; H.26L and MPEG-4 Part 10 will be the same. (H.26L will be renamed when it is done. The final name may be H.264, but that is not yet certain.) MPEG-4 AVC/H.26L is slated to be ready by the end of 2002.
2.4.3 Audio extensions
There are two work items underway for improving audio coding efficiency even further.
Bandwidth extension is a tool that gives a better perceived quality on top of the existing audio signal, while keeping the existing signal backward compatible.
MPEG is investigating bandwidth extensions, and may standardize one or both of:
General audio signals, to extend the capabilities currently provided by MPEG-4 general audio coders.
Speech signals, to extend the capabilities currently provided by MPEG-4 speech coders.
A single technology that addresses both of these signals is preferred. This technology shall be both forward and backward compatible with existing MPEG-4 technology; in other words, an MPEG-4 decoder can decode an enhanced stream and a new-technology decoder can decode an MPEG-4 stream. There are two possible configurations for the enhanced stream: MPEG-4 AAC streams can carry the enhancement information in the DataStreamElement, while all MPEG-4 systems know the concept of elementary streams, which allows a second elementary stream for a given audio object, containing the enhancement information.
2.5 Profiles in MPEG-4
MPEG-4 provides a large and rich set of tools for the coding of audio-visual objects. In order to allow effective implementations of the standard, subsets of the MPEG-4 Systems, Visual, and Audio tool sets have been identified that can be used for specific applications. These subsets, called ‘Profiles’, limit the tool set a decoder has to implement.
Profiles exist for various types of media content (audio, visual, and graphics) and for scene descriptions. MPEG does not prescribe or advise combinations of these Profiles, but care has been taken that good matches exist between the different areas.
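What profiling buys an implementer can be shown with a tiny sketch: a profile is a bounded set of tools a decoder commits to, so a terminal can check up front whether it can handle a given piece of content. The profile and tool names below are abbreviated stand-ins, not the exact normative tool lists.

```python
# Illustrative capability check: can this decoder profile handle this content?
SIMPLE_VISUAL = {"rectangular_video", "error_resilience"}
CORE_VISUAL = SIMPLE_VISUAL | {"arbitrary_shape", "temporal_scalability"}


def can_decode(content_tools: set, decoder_profile: set) -> bool:
    """True if every tool the content uses is within the decoder's profile."""
    return content_tools <= decoder_profile


clip_tools = {"rectangular_video", "arbitrary_shape"}
print(can_decode(clip_tools, SIMPLE_VISUAL))   # False: shaped objects need a richer profile
print(can_decode(clip_tools, CORE_VISUAL))     # True
```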
2.5.1 Visual Profiles
The visual part of the standard provides profiles for the coding of natural, synthetic, and synthetic/natural hybrid visual content. There are five profiles for natural video content:
The Simple Visual Profile provides efficient, error resilient coding of rectangular video objects, suitable for applications on mobile networks, such as PCS and IMT2000.
The Simple Scalable Visual Profile adds support for coding of temporal and spatial scalable objects to the Simple Visual Profile. It is useful for applications which provide services at more than one level of quality due to bit-rate or decoder resource limitations, such as Internet use and software decoding.
The Core Visual Profile adds support for coding of arbitrary-shaped and temporally scalable objects to the Simple Visual Profile. It is useful for applications such as those providing relatively simple content-interactivity (Internet multimedia applications).
The Main Visual Profile adds support for coding of interlaced, semi-transparent, and sprite objects to the Core Visual Profile. It is useful for interactive and entertainment-quality broadcast and DVD applications.
The N-Bit Visual Profile adds support for coding video objects having pixel-depths ranging from 4 to 12 bits to the Core Visual Profile. It is suitable for use in surveillance applications.
The profiles for synthetic and synthetic/natural hybrid visual content are:
The Simple Facial Animation Visual Profile provides a simple means to animate a face model, suitable for applications such as audio/video presentation for the hearing impaired.
The Scalable Texture Visual Profile provides spatial scalable coding of still image (texture) objects useful for applications needing multiple scalability levels, such as mapping texture onto objects in games, and high-resolution digital still cameras.
The Basic Animated 2-D Texture Visual Profile provides spatial scalability, SNR scalability, and mesh-based animation for still image (textures) objects and also simple face object animation.
The Hybrid Visual Profile combines the ability to decode arbitrary-shaped and temporally scalable natural video objects (as in the Core Visual Profile) with the ability to decode several synthetic and hybrid objects, including simple face and animated still image objects. It is suitable for various content-rich multimedia applications.
Version 2 adds the following Profiles for natural video:
The Advanced Real-Time Simple (ARTS) Profile provides advanced error-resilient coding techniques for rectangular video objects, using a back channel, and improved temporal resolution stability with low buffering delay. It is suitable for real-time coding applications such as videophony, teleconferencing, and remote observation.
The Core Scalable Profile adds support for coding of temporal and spatial scalable arbitrarily shaped objects to the Core Profile. The main functionality of this profile is object based SNR and spatial/temporal scalability for regions or objects of interest. It is useful for applications such as the Internet, mobile and broadcast.
The Advanced Coding Efficiency (ACE) Profile improves the coding efficiency for both rectangular and arbitrary shaped objects. It is suitable for applications such as mobile broadcast reception, the acquisition of image sequences (camcorders) and other applications where high coding efficiency is requested and small footprint is not the prime concern.
2.5.2 Audio Profiles
Four Audio Profiles have been defined in MPEG-4 V.1:
The Speech Profile provides HVXC, which is a very-low bit-rate parametric speech coder, a CELP narrowband/wideband speech coder, and a Text-To-Speech interface.
The Synthesis Profile provides score driven synthesis using SAOL and wavetables and a Text-to-Speech Interface to generate sound and speech at very low bitrates.
The Scalable Profile, a superset of the Speech Profile, is suitable for scalable coding of speech and music for networks such as the Internet and Narrowband Audio DIgital Broadcasting (NADIB). The bitrates range from 6 kbit/s to 24 kbit/s, with bandwidths between 3.5 and 9 kHz.
The Main Profile is a rich superset of all the other Profiles, containing tools for natural and synthetic Audio.
Another four Profiles were added in MPEG-4 V.2:
The High Quality Audio Profile contains the CELP speech coder and the Low Complexity AAC coder including Long Term Prediction. Scalable coding can be performed by the AAC Scalable object type. Optionally, the new error resilient (ER) bitstream syntax may be used.
The Low Delay Audio Profile contains the HVXC and CELP speech coders (optionally using the ER bitstream syntax), the low-delay AAC coder and the Text-to-Speech interface TTSI.
The Natural Audio Profile contains all natural audio coding tools available in MPEG-4, but not the synthetic ones.
The Mobile Audio Internetworking Profile (MAUI) contains the low-delay and scalable AAC object types including TwinVQ and BSAC. This profile is intended to extend communication applications using non-MPEG speech coding algorithms with high quality audio coding capabilities.
2.5.3 Graphics Profiles
Graphics Profiles define which graphical and textual elements can be used in a scene. These profiles are defined in the Systems part of the standard:
The Simple 2-D Graphics Profile provides for only those graphics elements of the BIFS tool that are necessary to place one or more visual objects in a scene.
The Complete 2-D Graphics Profile provides two-dimensional graphics functionalities and supports features such as arbitrary two-dimensional graphics and text, possibly in conjunction with visual objects.
The Complete Graphics Profile provides advanced graphical elements such as elevation grids and extrusions and allows creating content with sophisticated lighting. The Complete Graphics profile enables applications such as complex virtual worlds that exhibit a high degree of realism.
The 3D Audio Graphics Profile sounds like a contradiction in terms, but really isn’t. This profile does not provide visual rendering; instead, graphics tools are provided to define the acoustical properties of the scene (geometry, acoustic absorption, diffusion, transparency of the material). This profile is used for applications that do environmental spatialization of audio signals.
Next post: MPEG-4 NATURAL AUDIO CODING