Data Compression: A little introduction for beginners

When you look at computers and the internet, data compression is everywhere. The music you listen to, the pictures you see, the movies, all that is data, and all that needs to be compressed in some way or another so it “fits” into your computer memory and so it can be downloaded or sent to someone else in a reasonable amount of time.

A picture, for example: the information on a real life image is almost infinite. You can “split” a real life image up to it’s individual atoms, and still have information to gather from that. But the first question should be: Does all this information matters to you?

I mean, I took a picture of my fiancée in front of the Eiffel Tower, during our honeymoon in Paris. Do I want to see the individual atoms of her hair? I don’t even care to see each one of her hairs individually. That’s just too much information! So what did cameras do, way before computers where even invented? They quantize information. Photographic film is covered with very small grains of sensitive material (usually some silver nitrate or derived compound) that change color whenever exposed to light. The size of these grains are what determines the size of the information the photo will store.


Quantization is the very first form of data compression and simply consists of dropping bits of information that are too small to be relevant. It is also on the base of the most modern forms of data compression, so quantization is not only the simplest, but also the most powerful form of compression.

And when we come to computers, quantization occurs all the time. How many “megapixels” does your camera have? This actually means, “how large is the quantum[1] of information it will store”? Or the sample rate of this song is 44.1 Khz, meaning that we store the sound not as an infinite continuous wave, but rather as a discrete quantized collection of sound intensities measured at every one 44.100th of a second. As you can see, whenever we talk about information on computers, we talk about discrete[2] information, and if this comes from the real world, which is not discrete, this involves quantization (or the choice of a quantum and the representation of the information in proportions of this quantum).

Lossless x Lossy

But quantization also means losing information. What if the information we have is already quantized, and we cannot lose any bit of it? Think, for example, about a book. Books are made of words, and words are made of letters. And if we change a single letter in a book, this might change the whole meaning of some important sentence, which can turn a good book into a bad book. Or an unreadable one. So in the case of written information, we already have our quantum: A single letter. (or number, or punctuation symbol, or whatever, a single type or character as they are called by the printing industries).

So if we come to think, quantization not always apply, because it involves loss of information, thus we say it is a lossy compression method. When we come to texts, for example (or some very sensitive imaging data, or anything else that cannot be further quantized), we need to guarantee there will be no loss of information, and thus need to apply lossless methods[8]. Although lossy methods are much more common and occur also outside computers, geeks tend to think lossless methods are easier to learn, and every compression book will keep lossless methods in the first chapters and lossy ones in the latest. I am still a geek (or at least a nerd), so I will stand to the tradition and start by the lossless ones:

Lossless Compression

A simple Telegraphic key.[10]

Some centuries ago (actually not even 2 centuries, but hey, I wasn’t born yet), Samuel Morse invented the telegraph. It was a very simple way of communication consisting of am electric battery, two buttons, two bells an a very, very, very long wire between them. Transmitting information through this system was easy: I press the button, the bell rings on the other side of the wire, hundreds of miles away. The guy at the other side presses his button, and my bell rings. Nice huh? But how to send a whole world? A full sentence? A huge text with several pages? Well, you code it in bell rings, some short, some long. Each letter was assigned a sequence of bell rings:

  • one short ring and one long ring means “A”,
  • one long and three short means “B”,
  • And so on…[3]

Nice, but.. how does this relates to data compression? That’s simple: Morse realized that some letters occur more frequently in the English language, while others occur rarely. So to save the time of the telegraphers (the guys who operated that thing), he made the more frequent letters shorter! Letter “E”, for example, the most common one, uses only one single ring (and a short one). Letters like “Q”, “Y” and “Z” take four rings, three of them long.

What does it mean? Lets say that we only have 3 letters, “E”, “Q” and “Y”. “E” occurs 80% of the time, while “Q” and “Y” occur 10% of the time. The telegrapher is a good one: he can “type” one ring every second (ok, he is a really bad one, but this is an example). And our message has 100 letters. We have two choices: A code in which each letter has the same size (which is 2 rings), and a code in which letter “E” takes 1 ring, letter “Q” takes 2 and letter “Y” takes 3 (similar to what the Morse code does). Then we have:

  • 80 letters “E”, 10 letters “Q” and 10 letters “Y”, which means:
    • in the first code, 200 rings, in 200 seconds.
    • in the Morse like code, 80 rings plus 10 double rings (20 rings) plus 10 triple rings (30 rings), which amounts to 130 rings, in 130 seconds!

WOW, thats a 35% saving in time (yup, I opened up the calculator just to do this math). So there’s REALLY a compression. And a good one. And better yet: No single letter was lost.

But hey, what if what I need to transmit is not written in English, but in French? Or in Russian? Well, then you have to study this language, figure out what are the good proportions among letters and build a code for it. And what if my code is a bad one? What if there’s a best one? Well, that’s where we get to Huffman codes.

Huffman Codes

It was in 1952 when an M.I.T. student came up with this idea: How can I build the best code possible to write a given message in it’s shortest representation? His name was David A. Huffman, and the set of codes he described was named after him. What’s nice about Mr. Huffman is that he did his homework well: He proved mathematically that no code can be shorter than his in representing a message[4]. And you might think: wow, but constructing such a code must be really hard. But, not even that: it’s really simple:

  1. You count all the occurrences of the letters and symbols, and sort them by this count.
  2. You pick the 2 smaller ones an assign the digits “1” and “0” to each one, then you pretend they are in fact a new symbol and count its occurrence, put it in the list and sort again. (Actually you don’t count the occurrences, you rather sum up the occurrences of the two other symbols).
  3. You go back to step one until only one big symbol that sums up all the others is left.
  4. Now you read the digits associates with each symbol and make a binary number out of it. That’s your Huffman code.

For our 3 letter example we will have the counts: “E” = 80, “Q” = 10 and “Y” = 10. We pick the smaller ones and give them a digit: “Y” takes the “0” and Q takes the one. We join them in a supersymbol “YQ” with count of 20 and go back. Now we take “YQ” and “E” and assign “0” to “YQ” and “1” to “E”. And we join them in a big symbol “EYQ” that is the last one.

So we gather the digits: “Y” took a “0” and then another “0”, so its code is “00”, “Q” took a “1” and then a “0”, so its code is “01” (yes, the codes are inverted), and “E” only got the digit “1”, so its code is now “1”. As you can see, we saved 10 “rings” still in our code: “E” and “Q” codes have the same size, but “Y” code is shorter by one digit/ring, so we saved one ring for each “Y” in the message.[5]

And as Dr. Huffman assured us, it’s as short as it can get! But wait, did he? I mean, he said that this is the shortest code for this message, but what if i can shorten the message? Like, use abbreviations instead of full names. Or even better: I’ve got this nice children’s dictionary and I’ve got an idea…

Dictionary Methods

This children’s dictionary of mine has the 200 most common English words. Each word in a different page. (And so as to make it more realistic, I also live in the 19th century and my only way to send messages is by telegraph, which is paid by the letter). My interest is to save money, and as such, to send very short messages. Each letter costs me 1 cent, since the telegrapher don’t really care if his Morse Code already compresses the data. So even if my message is transmitted 35% faster, it still costs me 1 dollar. Now I want to save money, and my friend to whom i wish to send the message also has the exact same dictionary as I do. What do we do? instead of sending whole words, we send page numbers! So word “Pneumonoultramicroscopicsilicovolcanoconiosis” is on page 130. Instead of 45 letters, i send only 3.Nice huh? But it only works for us two friends. My mom doesn’t have this dictionary, so how do i send messages to her? Well, Abraham Lempel and Jacob Ziv came up with a solution to this in the late seventies. Actually, two solutions, they called them:

LZ77 and LZ78

The numbers stand for the dates in which the published each of the methods. LZ77 was published in 1977 and LZ78 in 1978. What both methods have in common is that they say: Why do you need a dictionary? Build one from the message! Basically, they say that long messages tend to have repeated words. So whenever you see a new word, instead or sending a page number, send the word and add it to the dictionary. This is called adaptative data compression since the compression adapts itself to the message. You need no previous information, everything will be sent along with the message.

In order to build this dictionary, LZ77 and LZ78 take slightly different approaches. In LZ78 the dictionary is built in the way i described in the previous paragraph: if you find an old word, send its dictionary code instead of the real world, if you find a new word, send the full word and add it to the dictionary. The receiver knows that whenever he receives a new word, it should be added to the dictionary. The code is simply the order in which the words were added to the dictionary. But LZ77 makes it even simpler: the message is the dictionary. Instead od sending a dictionary code, find the previous occurrence of the word in the message and sent its position as the code.

Due to the intrinsic characteristics of each method, LZ78 tends to be faster and yeld better compression than LZ77. This should be so, except that Lempel and Ziv filed a patent on it. So no one except them could use LZ78 and as such, other researchers simply forgot about it. In the mean time, LZ77 was still free to use, so lots of research was done on improving and optimizing LZ77, and many variants of this method were published. The result is that today LZ77 variants yield better compression and are faster than LZ78 and its variants. The file formats zip and gzip both use a variant of LZ77 called DEFLATE. In the imaging scenario, GIF images use an LZ78 variant, while the better and faster PNG format uses LZ77.

So I guess we’ve seen all there’s to see about lossless methods, we can now move on to the lossy ones.

Lossy Compression

We have already seen the main part of what lossy compression is made: quantization. We just drop information that is irrelevant to us, that is too small to be seen or heard. But can we drop even more information without losing what is relevant? Well, think about my picture of the Eiffel Tower from the beginning of this text. It is composed by 3 basic parts: The subject, which is my fiancée, the background, which is the tower, and the sky. I do love my fiancée, so i want to keep every detail of her. I think the tower matters a little, since it is what shows people where we were. but what about the sky? I don’t even remember if it was cloudy or not, and I couldn’t care less. So what if I figure out a way of quantizing very little of my fiancée, a little bit more of the tower and a lot of the sky? This should save me plenty of space on my memory card and still give me a pretty nice picture in the end. And that’s what most lossy compression methods are: they do not really compress the image, they just choose how much quantization should be applied to each part of the image. The real compression is simply the plain old quantization we’ve seen before.

Now, deciding which information is more or less relevant is a difficult task, so if you though that the lossless methods were hard, lossy methods are pure math! I might be confusing during some explanations because there are too many math concepts involved, so feel free to drop a comment whenever you find this too hard to get. I will first start with images, which are the easier part of lossy compression. And the main tool for image compression are the mathematic transforms. Specially one called Discrete Cosine Transform, or DCT.

Discrete Cosine Transform

DCTs are a set of mathematic functions that apply to matrices. For those who don’t remember (or never learned) that from high school, matrices are like a table with several numbers in it, one in each cell of the table. Images can be treated as matrices: each color has a number, each pixel in the image is a cell in the table. So the image is really a big matrix of color numbers. But if you remember how painful it was to multiply matrices, imagine now a matrix representing a 5Mpixel image. 5 million numbers to multiply. Thats huge! So we break the image in smaller matrices, usually 8×8 or 16×16 pixels and apply the transforms to these smaller images.

DCT is one of the functions that apply to matrices, and has some very interesting characteristics: when you apply a DCT to a matrix, the first numbers in the matrix (those on the top left) become huge, while the other positions will get smaller numbers the further away from the top left they are. Another nice thing is that, if we lose some of the smaller numbers and then apply the inverse DCT, the image changes very little. Now you’re getting it, aren’t you? We apply a DCT and just drop (or quantize) the smaller numbers before transmitting. If we drop enough small numbers, the image gets smaller and smaller.

Of course DCT is not the only transform that has these characteristics. The Karhunen-Loève[6] transform, for example, yields a much better compression with even smaller numbers in the last matrix positions. But it is also much more difficult to calculate. Even for computers. So we use the DCT which is not that bad, but still fast. Other transforms that have been studied recently and show good results are the Wavelet transforms. These are quite nice transforms, but talking about them would be even more math intensive, so I’ll do it later, in another article.

So if you ever used a JPEG image, you’ve used DCT (and Huffman coding): JPEG DCT transforms the image, quantizes the small coefficients and then applies Huffman coding to them, reducing the image size up to 10 or 20 times with very little information loss. Newer methods such as JPEG2000 already use wavelets as a better way to compress (wavelet are really good at retaining details such as hair or folding marks in clothes). So what’s next? Well we covered almost all the basics of compression, but in order to apply them to sound and movies, that need way better compression, we need to improve even more, and for that we use some advanced techniques.

Sound Compression

When we are dealing with sound, another little thing is important: Our brain responds differently to abrupt changing than it does to slow change. It also perceives some sound frequencies better than others. Sound compression takes advantages of these brain characteristics in order to gain better compression using what is called a psychoacoustic model. A psychoacoustic model is the base of the successful MP3 music compression method. MP3 is in essence a DCT transform on sound, where the quantization is dictated by a psychoacoustic model. The model is not defined in the MPEG standard that defines MP3, so each vendor or developer is free to use its own, though there are some reference implementations that most people use. Those psychoacoustic models are product of long research in human hearing and define things such as: which sounds we hear better, which changes in sound are never noticed and which ones are always noticed, what sounds can be merged into a single sound, etc.

Thanks to these models, DCT’s efficiency is greatly improved in MP3, and we can have compression ratios of up to 200. But as of modern computers, sounds are still a small thing compared to movies.

Movement compression

So we come the our last stop: movies! Movies are basically a sequence of images and a sound track. So why not do just that? Compress the sound and compress all the images, then pack them in a single file? Because there’s one more thing about movies that we cannot forget, and that help improving compression even more: there are few changes from one “frame” to the other. Let’s think of a dialogue between two actors in the screen. They will hardly move. They will move their lips, and if they are of Italian origin, their hands[7]. The background and most of their bodies will not move. And if they move, they will not change shape or color.So you figure out, we can record the first frame as an image and then just record what has changed, gaining much space on the following frames. This is called temporal encoding, and is largely used in most video compression methods. But as I said, even if things do move, they do so without changing shape or color. So here comes the what the better compression methods do: motion detection.

The key is to figure out what has moved fro ma frame to the other and, instead of encoding the difference between the frames, just encode the position of what moved in both frames. When decompressing, you just copy the position from the first frame to the second one, keeping all the rest the same.

Apart from this, all the rest in movie compression is basic image and sound compression, with DCT and psychoacoustic models.

Further readings and notes

When I began to study compression I was a teenager recently out of technical high school. My work at IBM granted me some privileges: I had internet access. Or rather, I had access to FTP servers on the internet through some obscure gateway machines, so I could request things from the internet and they would be delivered to my email the following morning. This was early 1990’s, so internet at home was not a common thing, and only those who worked with technology had such privilege. At that time, my main source of information was the comp.compression newsgroup and the comp.compression FAQ. So this is my first recommended reading, for those interested in compression. But beware, it is not only outdated, but also very hard to read.

My following contact with compression was during college, in algorithms class. At that time I had a whole comp. sci. library at my disposal, so several good books about the matter were available. for the basics, from this time, I remember reading Knuth and Sedgewig. (I guess i remember wrong, I was unable to find anything about compression in Knuth’s books, but Sedgewick still holds).

Later, when I graduated and could buy my own books, I finally got this masterpiece of David Salomon (mine is actually the 1999 edition, with a blue cover).

I have plenty of other more technical books on the subject, but for the starter, these are what I recommend. I hope all of this was helpfully and fun for you, reader. Enjoy your new knowledge and use it wisely.


  1. Quantum is the latin word for “how much” and means the smallest piece of something. In our case, the smallest piece of information we can store.
  2. Discrete is the opposite of “continuous”, which means that it is counted in small units rather than in an infinite “real” value. We could compare it to beans, which can be counted in small units (one bean, 2 beans, 3 beans) and water, which cannot…
  3. Check the wikipedia article for the full code:
    Morse code. (2008, July 29). In Wikipedia, The Free Encyclopedia. Retrieved 00:43, August 2, 2008
  4. Actually, some restrictions apply: Huffman codes are the shorter codes in which each symbol is represented by an integer number of bits. Later on, some researchers found out that you can represent symbols by less than one bit (say half a bit),
    See Wikipedia, Arithmetic coding (as of Aug. 2, 2008, 01:21 GMT).
  5. Actual implementations of Huffman coding use a Binary Tree. There are plenty of implementations around the net, so if you like programing, just google for it and you’ll have nice examples to peek at.
  6. See Wikipedia, Karhunen-Loève theorem (as of Aug. 2, 2008, 03:23 GMT).
  7. That’s just a joke, if you are Italian and don’t move your hands while talking, take no offense.
  8. As Ricbit pointed out, there might be some occasions where loss can be supported on texts: loss of extra spaces or of line breaks, for example, wont affect the information.
  9. Work of Ronaldo Gazel over a picture by Scott Murdoch. Original found on Flickr:
    Tokyo Subway (3)
  10. Image from Flickr, released under creative commons license.
    MFJ Telegraph Key
  11. Image from Flickr, released under creative commojns license.
    i, v and mechanical pencil
  12. Image from flickr, released under creative commons license.

2 thoughts on “Data Compression: A little introduction for beginners

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.