      1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      2 <html>
      3 <head>
      4 
      5 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
      6 <title>Ogg Vorbis Documentation</title>
      7 
      8 <style type="text/css">
      9 body {
     10   margin: 0 18px 0 18px;
     11   padding-bottom: 30px;
     12   font-family: Verdana, Arial, Helvetica, sans-serif;
     13   color: #333333;
     14   font-size: .8em;
     15 }
     16 
     17 a {
     18   color: #3366cc;
     19 }
     20 
     21 img {
     22   border: 0;
     23 }
     24 
     25 #xiphlogo {
     26   margin: 30px 0 16px 0;
     27 }
     28 
     29 #content p {
     30   line-height: 1.4;
     31 }
     32 
     33 h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
     34   font-weight: bold;
     35   color: #ff9900;
     36   margin: 1.3em 0 8px 0;
     37 }
     38 
     39 h1 {
     40   font-size: 1.3em;
     41 }
     42 
     43 h2 {
     44   font-size: 1.2em;
     45 }
     46 
     47 h3 {
     48   font-size: 1.1em;
     49 }
     50 
     51 li {
     52   line-height: 1.4;
     53 }
     54 
     55 #copyright {
     56   margin-top: 30px;
     57   line-height: 1.5em;
     58   text-align: center;
     59   font-size: .8em;
     60   color: #888888;
     61   clear: both;
     62 }
     63 </style>
     64 
     65 </head>
     66 
     67 <body>
     68 
     69 <div id="xiphlogo">
     70   <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.org"/></a>
     71 </div>
     72 
     73 <h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>
     74 
     75 <h2>Abstract</h2>
     76 
<p>The Vorbis audio CODEC provides channel coupling
mechanisms designed to reduce effective bitrate both by eliminating
interchannel redundancy and by eliminating stereo image information
judged inaudible or undesirable according to spatial psychoacoustic
models. This document describes both the mechanical coupling
mechanisms available within the Vorbis specification and the
specific stereo coupling models used by the reference
<tt>libvorbis</tt> codec provided by xiph.org.</p>
     85 
     86 <h2>Mechanisms</h2>
     87 
<p>In encoder release beta 4 and earlier, Vorbis supported multiple
channel encoding, but the channels were encoded entirely separately
with no cross-analysis or redundancy elimination between channels.
This multichannel strategy is very similar to mp3's <em>dual
stereo</em> mode, and Vorbis uses the same name for its analogous
uncoupled multichannel modes.</p>
     94 
<p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
later implement, a coupled channel strategy. Vorbis has two specific
mechanisms that may be used alone or in conjunction to implement
channel coupling. The first is <em>channel interleaving</em> via
residue backend type 2, and the second is <em>square polar
mapping</em>. These two general mechanisms are particularly well
suited to coupling due to the structure of Vorbis encoding, as we'll
explore below. Using both, we can implement totally
<em>lossless stereo image coupling</em> [bit-for-bit decode-identical
to uncoupled modes], as well as various lossy models that seek to
eliminate inaudible or unimportant aspects of the stereo image in
order to reduce bitrate. The exact coupling implementation is
generalized to allow the encoder a great deal of flexibility in
implementing a stereo or surround model without requiring any
significant complexity increase over the combinatorially simpler
mid/side joint stereo of mp3 and other current audio codecs.</p>
    111 
    112 <p>A particular Vorbis bitstream may apply channel coupling directly to
    113 more than a pair of channels; polar mapping is hierarchical such that
    114 polar coupling may be extrapolated to an arbitrary number of channels
    115 and is not restricted to only stereo, quadraphonics, ambisonics or 5.1
    116 surround. However, the scope of this document restricts itself to the
    117 stereo coupling case.</p>
    118 
    119 <a name="sqpm"></a>
    120 <h3>Square Polar Mapping</h3>
    121 
    122 <h4>maximal correlation</h4>
    123  
<p>Recall the basic structure of a Vorbis I stream: the encoder first
generates from the input audio a spectral 'floor' function that serves as an
MDCT-domain whitening filter. This floor is meant to represent the
rough envelope of the frequency spectrum, using whatever metric the
encoder cares to define. This floor is subtracted from the log
frequency spectrum, effectively normalizing the spectrum by frequency.
Each input channel is associated with a unique floor function.</p>
    131 
    132 <p>The basic idea behind any stereo coupling is that the left and right
    133 channels usually correlate. This correlation is even stronger if one
    134 first accounts for energy differences in any given frequency band
    135 across left and right; think for example of individual instruments
    136 mixed into different portions of the stereo image, or a stereo
    137 recording with a dominant feature not perfectly in the center. The
    138 floor functions, each specific to a channel, provide the perfect means
    139 of normalizing left and right energies across the spectrum to maximize
    140 correlation before coupling. This feature of the Vorbis format is not
    141 a convenient accident.</p>
    142 
    143 <p>Because we strive to maximally correlate the left and right channels
    144 and generally succeed in doing so, left and right residue is typically
    145 nearly identical. We could use channel interleaving (discussed below)
    146 alone to efficiently remove the redundancy between the left and right
    147 channels as a side effect of entropy encoding, but a polar
    148 representation gives benefits when left/right correlation is
    149 strong.</p>
    150 
    151 <h4>point and diffuse imaging</h4>
    152 
    153 <p>The first advantage of a polar representation is that it effectively
    154 separates the spatial audio information into a 'point image'
    155 (magnitude) at a given frequency and located somewhere in the sound
    156 field, and a 'diffuse image' (angle) that fills a large amount of
    157 space simultaneously. Even if we preserve only the magnitude (point)
    158 data, a detailed and carefully chosen floor function in each channel
    159 provides us with a free, fine-grained, frequency relative intensity
    160 stereo*. Angle information represents diffuse sound fields, such as
    161 reverberation that fills the entire space simultaneously.</p>
    162 
<p>*<em>Because the Vorbis model supports a number of different possible
stereo models and these models may be mixed, we do not use the term
'intensity stereo' when talking about Vorbis; instead we use the terms
'point stereo', 'phase stereo' and subcategories of each.</em></p>
    167 
<p>The majority of a stereo image is representable by polar magnitude
alone, as strong sounds tend to be produced at near-point sources;
even non-diffuse, fast, sharp echoes track very accurately using
the magnitude representation almost exclusively (for those experimenting with
Vorbis tuning, this strategy works much better with the precise,
piecewise control of floor 1; the continuous approximation of floor 0
results in unstable imaging). Reverberation and diffuse sounds tend
to contain less energy and be psychoacoustically dominated by the
point sources embedded in them. Thus, we again tend to concentrate
more of the represented energy into a predictably smaller set of values.
Separating representation of point and diffuse imaging also allows us
to model and manipulate point and diffuse qualities separately.</p>
    180 
    181 <h4>controlling bit leakage and symbol crosstalk</h4>
    182 
    183 <p>Because polar
    184 representation concentrates represented energy into fewer large
    185 values, we reduce bit 'leakage' during cascading (multistage VQ
    186 encoding) as a secondary benefit. A single large, monolithic VQ
    187 codebook is more efficient than a cascaded book due to entropy
    188 'crosstalk' among symbols between different stages of a multistage cascade.
    189 Polar representation is a way of further concentrating entropy into
    190 predictable locations so that codebook design can take steps to
    191 improve multistage codebook efficiency. It also allows us to cascade
    192 various elements of the stereo image independently.</p>
    193 
    194 <h4>eliminating trigonometry and rounding</h4>
    195 
    196 <p>Rounding and computational complexity are potential problems with a
    197 polar representation. As our encoding process involves quantization,
    198 mixing a polar representation and quantization makes it potentially
    199 impossible, depending on implementation, to construct a coupled stereo
    200 mechanism that results in bit-identical decompressed output compared
    201 to an uncoupled encoding should the encoder desire it.</p>
    202 
<p>Vorbis uses a mapping that preserves the most useful qualities of
polar representation, relies only on addition/subtraction (during
decode; high quality encoding still requires some trig), and makes it
trivial, before or after quantization, to represent an angle/magnitude
through a one-to-one mapping from possible left/right value
permutations. We do this by basing our polar representation on the
unit square rather than the unit circle.</p>
    210 
    211 <p>Given a magnitude and angle, we recover left and right using the
    212 following function (note that A/B may be left/right or right/left
    213 depending on the coupling definition used by the encoder):</p>
    214 
<pre>
      /* Square polar decoupling: recover A and B from magnitude/angle. */
      if(magnitude>0){
        if(angle>0){
          A=magnitude;
          B=magnitude-angle;
        }else{
          B=magnitude;
          A=magnitude+angle;
        }
      }else{
        if(angle>0){
          A=magnitude;
          B=magnitude+angle;
        }else{
          B=magnitude;
          A=magnitude-angle;
        }
      }
</pre>
    234 
    235 <p>The function is antisymmetric for positive and negative magnitudes in
    236 order to eliminate a redundant value when quantizing. For example, if
    237 we're quantizing to integer values, we can visualize a magnitude of 5
    238 and an angle of -2 as follows:</p>
    239 
    240 <p><img src="squarepolar.png" alt="square polar"/></p>
    241 
<p>This representation loses or replicates no values; if A and B each
range over the integers -5 through 5, the number of possible Cartesian
permutations is 121. Represented in square polar notation, the
possible values are:</p>
    246 
    247 <pre>
    248  0, 0
    249 
    250 -1,-2  -1,-1  -1, 0  -1, 1
    251 
    252  1,-2   1,-1   1, 0   1, 1
    253 
    254 -2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3  
    255 
    256  2,-4   2,-3   ... following the pattern ...
    257 
    258  ...   5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9
    259 
    260 </pre>
    261 
    262 <p>...for a grand total of 121 possible values, the same number as in
    263 Cartesian representation (note that, for example, <tt>5,-10</tt> is
    264 the same as <tt>-5,10</tt>, so there's no reason to represent
    265 both. 2,10 cannot happen, and there's no reason to account for it.)
    266 It's also obvious that this mapping is exactly reversible.</p>
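
<p>The reversibility claim can be checked mechanically. The following C
sketch is illustrative only: the type and function names are
hypothetical and are not part of the libvorbis API. The decoupling
function restates the rule given above, the coupling function is
obtained by inverting its four branches, and the counting function
walks every Cartesian pair in a given integer range to confirm that
each one survives a round trip.</p>

<pre>
```c
/* Sketch only: these names are illustrative, not libvorbis API. */
typedef struct { int magnitude; int angle; } polar_pair;
typedef struct { int a; int b; } channel_pair;

int abs_int(int v) { return v > 0 ? v : -v; }

/* Decode: the square polar decoupling rule from the text. */
channel_pair sq_polar_decouple(polar_pair p) {
  channel_pair c;
  if (p.magnitude > 0) {
    if (p.angle > 0) { c.a = p.magnitude; c.b = p.magnitude - p.angle; }
    else             { c.b = p.magnitude; c.a = p.magnitude + p.angle; }
  } else {
    if (p.angle > 0) { c.a = p.magnitude; c.b = p.magnitude + p.angle; }
    else             { c.b = p.magnitude; c.a = p.magnitude - p.angle; }
  }
  return c;
}

/* Encode: the channel with the larger absolute value becomes the
   magnitude; the angle is the signed difference that the matching
   decode branch undoes. */
polar_pair sq_polar_couple(channel_pair c) {
  polar_pair p;
  if (abs_int(c.a) > abs_int(c.b)) {
    p.magnitude = c.a;
    p.angle = (c.a > 0) ? c.a - c.b : c.b - c.a;
  } else {
    p.magnitude = c.b;
    p.angle = (c.b > 0) ? c.a - c.b : c.b - c.a;
  }
  return p;
}

/* Count the (A,B) pairs with both values in -range..range that come
   back unchanged after coupling and decoupling. */
int sq_polar_round_trip_count(int range) {
  int a, b, ok = 0;
  for (a = -range; range >= a; a++) {
    for (b = -range; range >= b; b++) {
      channel_pair in, out;
      in.a = a;
      in.b = b;
      out = sq_polar_decouple(sq_polar_couple(in));
      if (out.a == in.a) {
        if (out.b == in.b) ok++;
      }
    }
  }
  return ok;
}
```
</pre>

<p>Under these assumptions, <tt>sq_polar_round_trip_count(5)</tt>
reports all 121 pairs, and coupling A=-5, B=5 produces the
<tt>5,-10</tt> form rather than <tt>-5,10</tt>.</p>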
    267 
    268 <h3>Channel interleaving</h3>
    269 
<p>We can remap an A/B vector using polar mapping into a magnitude/angle
vector, and it's clear that, in general, this concentrates energy in
the magnitude vector and reduces the amount of information to encode
in the angle vector. Encoding these vectors independently with
residue backend #0 or residue backend #1 will result in bitrate
savings. However, there are still implicit correlations between the
magnitude and angle vectors. The most obvious is that the amplitude
of the angle is bounded by its corresponding magnitude value.</p>
    278 
    279 <p>Entropy coding the results, then, further benefits from the entropy
    280 model being able to compress magnitude and angle simultaneously. For
    281 this reason, Vorbis implements residue backend #2 which pre-interleaves
    282 a number of input vectors (in the stereo case, two, A and B) into a
    283 single output vector (with the elements in the order of
    284 A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
    285 each vector to be coded by the vector quantization backend consists of
    286 matching magnitude and angle values.</p>
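
<p>The interleaving order described above can be sketched in a few
lines of C. This is a minimal illustration of the element ordering
only, not the residue backend itself; the function name and the use of
plain <tt>int</tt> arrays are assumptions made for the example.</p>

<pre>
```c
/* Sketch only: hypothetical helper, not libvorbis API.
   Interleaves two n-element vectors into out, which must hold 2*n
   elements, in the order A_0, B_0, A_1, B_1, ... A_n-1, B_n-1. */
void interleave_stereo(const int *a, const int *b, int n, int *out) {
  int i;
  for (i = 0; i != n; i++) {
    out[2 * i]     = a[i];
    out[2 * i + 1] = b[i];
  }
}
```
</pre>

<p>When the two inputs are the magnitude and angle vectors, each
adjacent pair in the output holds a matching magnitude and angle, which
is what lets a single codebook entry cover both.</p>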
    287 
<p>The astute reader, at this point, will notice that in the theoretical
case in which we can use monolithic codebooks of arbitrarily large
size, we can directly interleave and encode left and right without
polar mapping; the polar mapping does not appear to lend any
benefit whatsoever to the efficiency of the entropy coding. In fact,
it is perfectly possible and reasonable to build a Vorbis encoder that
dispenses with polar mapping entirely and merely interleaves the
channels. Encoders based on libvorbis may configure such an encoding and
it will work as intended.</p>
    297 
    298 <p>However, when we leave the ideal/theoretical domain, we notice that
    299 polar mapping does give additional practical benefits, as discussed in
    300 the above section on polar mapping and summarized again here:</p>
    301 
    302 <ul>
    303 <li>Polar mapping aids in controlling entropy 'leakage' between stages
    304 of a cascaded codebook.</li>
    305 <li>Polar mapping separates the stereo image
    306 into point and diffuse components which may be analyzed and handled
    307 differently.</li>
    308 </ul>
    309 
    310 <h2>Stereo Models</h2>
    311 
    312 <h3>Dual Stereo</h3>
    313 
    314 <p>Dual stereo refers to stereo encoding where the channels are entirely
    315 separate; they are analyzed and encoded as entirely distinct entities.
    316 This terminology is familiar from mp3.</p>
    317 
    318 <h3>Lossless Stereo</h3>
    319 
<p>Using polar mapping and/or channel interleaving, it's possible to
couple Vorbis channels losslessly, that is, to construct a stereo
coupling encoding that both saves space and decodes
bit-identically to dual stereo. OggEnc 1.0 and later uses this
mode in all high-bitrate encoding.</p>
    325 
    326 <p>Overall, this stereo mode is overkill; however, it offers a safe
    327 alternative to users concerned about the slightest possible
    328 degradation to the stereo image or archival quality audio.</p>
    329 
    330 <h3>Phase Stereo</h3>
    331 
    332 <p>Phase stereo is the least aggressive means of gracefully dropping
    333 resolution from the stereo image; it affects only diffuse imaging.</p>
    334 
    335 <p>It's often quoted that the human ear is deaf to signal phase above
    336 about 4kHz; this is nearly true and a passable rule of thumb, but it
    337 can be demonstrated that even an average user can tell the difference
    338 between high frequency in-phase and out-of-phase noise. Obviously
    339 then, the statement is not entirely true. However, it's also the case
    340 that one must resort to nearly such an extreme demonstration before
    341 finding the counterexample.</p>
    342 
<p>'Phase stereo' is simply a more aggressive quantization of the polar
angle vector; above 4kHz it's generally quite safe to quantize noise
and noisy elements to only a handful of allowed phases, or to thin the
phase with respect to the magnitude. The phases of high amplitude
pure tones may or may not be preserved more carefully (they are
relatively rare and L/R tend to be in phase, so there is generally
little reason not to spend a few more bits on them).</p>
    350 
    351 <h4>example: eight phase stereo</h4>
    352 
<p>Vorbis may implement phase stereo coupling by preserving the entirety
of the magnitude vector (essential to fine amplitude and energy
resolution overall) and quantizing the angle vector to one of only
four possible values. Given that the magnitude vector may be positive
or negative, this results in left and right phase having eight
possible permutations, thus 'eight phase stereo':</p>
    359 
    360 <p><img src="eightphase.png" alt="eight phase"/></p>
    361 
    362 <p>Left and right may be in phase (positive or negative), the most common
    363 case by far, or out of phase by 90 or 180 degrees.</p>
    364 
    365 <h4>example: four phase stereo</h4>
    366 
<p>Similarly, four phase stereo takes the quantization one step further;
it allows only in-phase and 180 degree out-of-phase signals:</p>
    369 
    370 <p><img src="fourphase.png" alt="four phase"/></p>
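
<p>One plausible way to picture four phase stereo in square polar terms
is to snap every angle either to zero (in phase) or to twice the
magnitude's absolute value (180 degrees out of phase). The C sketch
below does exactly that. It is an assumption-laden illustration: the
names, the nearest-value snapping rule and its threshold are all
hypothetical, and it does not describe how the rc2 encoder actually
chose phases.</p>

<pre>
```c
/* Sketch only: names and the snapping threshold are illustrative
   assumptions, not libvorbis API or rc2 encoder behaviour. */
typedef struct { int magnitude; int angle; } polar_pair;

int abs_int(int v) { return v > 0 ? v : -v; }

/* Snap a square polar pair to the nearer of "in phase" (angle 0) and
   "180 degrees out of phase" (angle -2*|magnitude|). */
polar_pair four_phase_quantize(polar_pair p) {
  int m = abs_int(p.magnitude);
  polar_pair q = p;
  if (p.angle > m) {
    /* Near A = magnitude, B = -magnitude. Flip the magnitude sign so
       that the negative angle names that same state. */
    q.magnitude = -p.magnitude;
    q.angle = -2 * m;
  } else if (-p.angle > m) {
    /* Near B = magnitude, A = -magnitude. */
    q.angle = -2 * m;
  } else {
    /* Closer to in phase: A = B = magnitude. */
    q.angle = 0;
  }
  return q;
}
```
</pre>

<p>With this rule a pair such as magnitude 5, angle 9 (A=5, B=-4)
becomes magnitude -5, angle -10, which decodes to A=5, B=-5, while
magnitude 5, angle 2 collapses to the in-phase pair A=B=5.</p>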
    371 
<h3>Point Stereo</h3>
    373 
    374 <p>Point stereo eliminates the possibility of out-of-phase signal
    375 entirely. Any diffuse quality to a sound source tends to collapse
    376 inward to a point somewhere within the stereo image. A practical
    377 example would be balanced reverberations within a large, live space;
    378 normally the sound is diffuse and soft, giving a sonic impression of
    379 volume. In point-stereo, the reverberations would still exist, but
    380 sound fairly firmly centered within the image (assuming the
    381 reverberation was centered overall; if the reverberation is stronger
    382 to the left, then the point of localization in point stereo would be
to the left). This effect is most noticeable at low and mid
frequencies and when using headphones (which grant perfect stereo
separation). Point stereo is a graceful but generally easy to
detect degradation of the sound quality and is thus used in frequency
ranges where it is least noticeable.</p>
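
<p>In square polar terms, point stereo amounts to an angle of zero, so
both channels decode to the same residue value and any left/right
level difference comes from the per-channel floors alone. The C sketch
below illustrates this for a single frequency bin. The names are
hypothetical, and the floors are assumed to be given as linear-domain
multipliers (subtracting the floor in the log domain on encode
corresponds to multiplying by it on decode).</p>

<pre>
```c
/* Sketch only: hypothetical names, not libvorbis API. */
typedef struct { float left; float right; } stereo_bin;

/* Rebuild one spectral bin under point stereo.
   residue_mag: decoded magnitude residue (angle is zero, so the left
                and right residues are identical);
   floor_l, floor_r: linear-domain floor values for each channel. */
stereo_bin point_stereo_bin(float residue_mag, float floor_l, float floor_r) {
  stereo_bin s;
  s.left  = residue_mag * floor_l;
  s.right = residue_mag * floor_r;
  return s;
}
```
</pre>

<p>The two outputs always share a sign and differ only in level, which
is why diffuse sound collapses toward a single point whose position is
set by the ratio of the two floors.</p>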
    388 
    389 <h3>Mixed Stereo</h3>
    390 
    391 <p>Mixed stereo is the simultaneous use of more than one of the above
    392 stereo encoding models, generally using more aggressive modes in
    393 higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>
    394 
    395 <p>It is also the case that near-DC frequencies should be encoded using
    396 lossless coupling to avoid frame blocking artifacts.</p>
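
<p>A very simple mixed model, combining lossless coupling at low
frequencies with point stereo above a crossover, could be sketched in C
as follows. The function name and the single crossover bin are
assumptions made for illustration; a real encoder would also weigh
amplitude and phase as described above, and would choose its magnitude
values with the collapse in mind.</p>

<pre>
```c
/* Sketch only: hypothetical helper, not libvorbis API.
   Leaves the angles of bins below point_start untouched (lossless
   coupling) and zeroes the angle of every bin from point_start up to
   n-1 (point stereo). */
void mixed_stereo_apply(int *angle, int n, int point_start) {
  int i;
  if (0 > point_start) point_start = 0;
  for (i = point_start; n > i; i++) {
    angle[i] = 0;
  }
}
```
</pre>

<p>Keeping <tt>point_start</tt> well above the lowest bins preserves the
near-DC region losslessly, in line with the note on frame blocking
artifacts.</p>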
    397 
    398 <h3>Vorbis Stereo Modes</h3>
    399 
    400 <p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes
    401 constructed out of lossless and point stereo. Phase stereo was used
    402 in the rc2 encoder, but is not currently used for simplicity's sake. It
    403 will likely be re-added to the stereo model in the future.</p>
    404 
    405 <div id="copyright">
    406   The Xiph Fish Logo is a
    407   trademark (&trade;) of Xiph.Org.<br/>
    408 
    409   These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.
    410 </div>
    411 
    412 </body>
    413 </html>