Monkey's Audio - a fast and powerful lossless audio compressor

Digital Audio

Sound is simply a wave, and digital audio is the digital representation of this wave. The digital representation is achieved by "sampling" the magnitude of an analog signal many times per second. This can be thought of conceptually as recording the "height" of the wave many times per second. Today's audio CD's store 44,100 samples per second. Since CD's are in stereo, they store both a left and a right value 44,100 times per second. These values are represented by 16 bit integers. Basically a WAV file is a header, followed by an array of R, L, R, L…. Since each sample takes 32 bits (16 for the left, and 16 for the right), and there are 44,100 samples per second, one second of audio takes 1,411,200 bits, or 176,400 bytes.

Lossless Compression

Lossless compression can be broken down into a few basic steps. They are detailed below.

1. Conversion to X,Y

The first step in lossless compression is to more efficiently model the channels L and R as some X and Y values. There is often a great deal of correlation between the L and R channels, and this can be exploited several ways, with one popular way being through the use of mid / side encoding. In this case, a mid (X) and a side (Y) value are encoded instead of a L and a R value. The mid (X) is the midpoint between the L and R channels and the side (Y) is the difference in the channels. This can be achieved:

X = (L + R) / 2
Y = (L - R)

2. Predictor

Next, the X and Y data is passed through a predictor to attempt to remove any redundancy. Basically, the goal of this stage is to make the X and Y arrays contain the smallest possible values while still remaining decompressible. This stage is what separates one compression scheme from another. There are virtually countless ways to do this. Here is a sample using simple linear algebra:

PX and PY are the predicted X and Y; X_-1is the previous X value; X_-2 is the X value two back
PX = (2 * X_-1) - X_-2
PY = (2 * Y_-1) - Y_-2

As an example, if X = (2, 8, 24, ?); PX = (2 * X_-1) - X_-2= (2 * 24) - 8 = 40

Then, these predicted values are compared with the actual value and the difference (error) is what gets sent to the next stage for encoding.

Most good predictors are adaptive, so that they adjust to how "predictable" the data currently is. For example, let's use a factor 'm' that ranges from 0 to 1024 (0 is no prediction and 1024 is full prediction). After each prediction, m is adjusted up or down depending on whether the prediction was helpful or not. So in the previous example, what leaves the predictor is this:

X = (2, 8, 24, ?)
PX = (2 * X_-1) - X_-2= (2 * 24) - 8 = 40

If ? = 45 and m = 512

[Final Value] = ? - (PX * m / 1024) = 45 - (40 * m / 1024) = 45 - (40 * 512 / 1024) = 45 - 20 = 25

After this m would be adjusted upward because a higher m would have been more efficient.

Using different prediction equations and using multiple passes through the predictor can make a fairly substantial difference in compression level. Here is a quick list of some prediction equations as shown in the Shorten technical documentation (for different orders):

P0 = 0
P1 = X_-1
P2 = (2 * X_-1) - X_-2
P3 = (3 * X_-1) - (3 *X_-2) + X_-3

3. Encoding of Data

The goal behind audio compression is to make all of the numbers as small as possible by removing any correlation that may exist between them. Once this is achieved the resulting numbers must be written to disk.

Why are smaller numbers better? They are better because they take less bits to represent. For example, say we want to encode this array of numbers (32 bit longs):

Base 10: 10, 14, 15, 46

or in binary

Base 2: 1010, 1110, 1111, 101110

Now obviously if we want to represent these numbers in the fewest possible bits, it would be quite inefficient to represent them each as separate longs with 32 bits apiece. That would take 128 bits, and just from looking at the same numbers represented in base two, it is obvious that there must be a better way. The ideal thing would be just to slap the four numbers together using the least bits necessary, so 1010, 1110, 1111, 101110 without the commas would be 101011101111101110. The problem here is that we don't know where one number starts and the next begins.

To store small numbers so that they take less bits, but can still be decompressed, we use "entropy coding". There are details about a common entropy coding system called Rice Coding below. Monkey's Audio uses a slightly more advanced entropy coder, but the fundamentals of Rice Coding are still useful.

Entropy Coding

Rice Coding

Rice coding is a way of using less bits to represent small numbers, while still maintaining the ability to tell one number from the next. Basically it works like this:

You make your best guess as to how many bits a number will take, and call that k
Take the rightmost k bits of the number and remember what they are
Imagine the binary number without those rightmost k bits and look at its new value (this is the overflow that doesn't fit in k bits)
Use these values to encode the number. This encoded value is represented as a number of zeroes corresponding to step 3, then a terminating 1 to tell that you're done sending the "overflow", then the k bits from step 2.

Example

Let's work through our example, and try to encode the fourth number in our series 10, 14, 15, 46.

You make your best guess as to how many bits a number will take, and call that k: since the previous 3 numbers took 4 bits, that seems like a reasonable guess so we will set k = 4
Take the rightmost k bits of the number and remember what they are: The right 4 bits of 46 (101110) are 1110
Imagine the binary number without those rightmost k bits and look at its new value (this is the overflow that doesn't fit in k bits): When you take the 1110 away from the right of 101110 you are left with 10 or 2 (in base 10)
Use these values to encode the number. So, we put two 0's, followed by the terminating 1, followed by the k bits 1110. All together we have: 0011110

Now to undue this operation, we just take 0011110 and k = 4 and work our way backwards. We first see that the overflow is 2 (there are two zeroes before the terminating 1). We also see that the last four bits = 1110. So, we take the value 10 (the overflow) and the values 1110 (the k) and just do a little shifting and volah! (overflow is shifted << k bits)

More Technical Version of the Same Example

Here is a little more technical and mathematical description of the same process:

Assuming some integer n is the number to encode, and k is the number of bits to encode directly.

sign (1 for positive, 0 for negative)
n / (2^k) 0's
terminating 1
k least significant bits of n

As an example, if n = 578 and k = 8: 100101000010

sign (1 for positive, 0 for negative) = [1]
n / (2^k) 0's: n / 2^k = 578 / 256 = 2 = [00]
terminating 1: [1]
k least significant bits of n: 578 = [01000010]
put the 1-4 together: [1][00][1][01000010] = 100101000010

During the encode process, the optimum k is determined by looking at the average value over the past however many values (16 - 128 works well), and choosing the optimum k for that average (basically it's guessing what the next value will be, and trying to choose the most efficient k based on that). The optimum k can be calculated as [log(n) / log(2)].

Theory