Data Compression for Computer Science Students: An Introduction to the Basics

DATA REPRESENTATION
1.7 DATA COMPRESSION

WHAT IS DATA COMPRESSION?

Many files, such as images, videos and even text documents, can take up large amounts of memory. Larger files need more storage space and transfer more slowly, which has led to the need for data compression. Data compression is the process of making files require less memory to store. When people talk about 'file size' they are usually referring to the memory required to store the file, not the physical dimensions of the document or file.

There are two main methods of file compression: lossy and lossless. Each type has its benefits and disadvantages. To compress a file, software such as WinZip or Archive Utility is used; files zipped by one brand of software should be able to be uncompressed by other brands. Different file types use different methods of compression; for example, compressing an image will use a different algorithm from compressing a text document.

Some reasons for compression are:

✓ Compression makes the file size smaller so less space is needed to store the file
✓ Compression makes the file size smaller so files transfer faster over a network such as the internet
✓ Compression makes the file size smaller which helps with file streaming

LOSSY COMPRESSION

The key element of Lossy compression is that the file will lose quality when it is compressed. The loss of quality is not important for many files, and in many cases we do not even notice the reduction in quality. Some key points on Lossy compression are:

✓ Lossy compression reduces the file size by removing some of the data. Because of this, an exact match of the original data cannot be recreated: quality is lost.
✓ Lossy compression uses an algorithm that looks to remove detail that is barely noticeable. For example, if pixels next to each other in an image are almost the same colour, the Lossy algorithm will give them the same value to reduce the bytes needed to store the detail.
✓ Lossy compression is often used on images and sound files, such as JPGs and MP3s
✓ Lossy compression is often not a good option for files such as text documents
✓ Lossy compression can make file sizes smaller than is possible with Lossless compression
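
The idea of giving almost-identical neighbouring pixels the same value can be sketched in a few lines of Python. This is a simplified illustration only, not how a real format such as JPG works; the function name `quantise` and the step size of 16 are assumptions chosen for the example.

```python
def quantise(pixels, step=16):
    """Round each 0-255 brightness value down to a multiple of `step`.

    Values that were close together collapse to the same number, so the
    row contains longer runs of repeats and is easier to compress, but
    the original values cannot be recovered afterwards.
    """
    return [(p // step) * step for p in pixels]


row = [200, 201, 203, 202, 90, 91, 89, 90]   # hypothetical row of pixels
lossy_row = quantise(row)
print(lossy_row)  # [192, 192, 192, 192, 80, 80, 80, 80]
```

Eight slightly different values have become just two distinct values. The small differences between neighbours are gone for good, which is why the process is called lossy.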

LOSSLESS COMPRESSION

The key element of lossless compression is that no quality is lost during the process of compression. Lossless compression is used when it is important to maintain the original quality. Some key points of Lossless compression are:

✓ Lossless compression will not remove any quality from the file; the compressed version will be the same as the original when uncompressed.
✓ Lossless compression uses an algorithm that looks for repeated data. This data can be grouped and categorised, and a token is stored wherever each group is to be used in the reconstruction
✓ Lossless compression is often used on files such as text files and images such as DOCXs, GIFs and PNGs
✓ Lossless compression is often not a good option for audio files and high colour images
✓ Lossless compression is more limited than Lossy compression with how small the file size can be made
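
The "same as the original when uncompressed" property can be checked directly. The sketch below uses Python's standard `zlib` module, which implements a lossless method; the sample data is an assumption chosen because it is very repetitive.

```python
import zlib

# Repetitive sample data: lossless methods do well on repeated patterns.
original = b"learning is a journey, enjoy " * 50

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # size in bytes before and after
print(restored == original)            # True: nothing was lost
```

The compressed version is much smaller here only because the data repeats so often. Data with little repetition would shrink far less.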

VIDEO BY: CRASH COURSE

1: With which type of compression does the reconstructed version remain the same as the original?
2: Give two reasons for compressing files.
3: When lossy compression is used to compress an image, it cannot be reconstructed back to its original form, why?
4: Which of the three file types below represents an uncompressed audio format?
A: WAV
B: AAC
C: MP3
5: Which of the three file types below uses a lossy compression technique?
A: MP3
B: WAV
C: AIFF

COMPRESSION METHODS

Run Length Encoding (RLE)
Run Length Encoding is a method of compression that looks for runs of repeated data and encodes each run as a single item of data together with the length of the run.

IMAGE 1 : 8 x 8 pixels

Take the top row of the image: it has 8 white pixels. The second row has 1 white pixel, 2 red pixels, 2 white pixels and so on. An uncompressed representation of the image would store each pixel individually. If this was an 8-bit image, each of the 8 pixels in the top row would use 8 bits to represent its colour, meaning it takes 8 pixels x 8 bits = 64 bits to represent the 8 white pixels. With run length encoding we can simply encode this as 8 white pixels in a row, written 8W. This uses 8 bits to represent the length of the run and 8 bits to represent the colour, so with run length encoding the top row can be compressed from 64 bits to just 16 bits.
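
The top-row calculation can be reproduced with a short Python sketch. The function names `rle_encode`, `rle_decode` and `rle_bits` are made up for this example, and the 8-bit count plus 8-bit colour per run follows the assumption used in the text above.

```python
from itertools import groupby


def rle_encode(row):
    """Return a list of (count, value) pairs, one per run of equal values."""
    return [(len(list(run)), value) for value, run in groupby(row)]


def rle_decode(pairs):
    """Rebuild the original row from (count, value) pairs."""
    return [value for count, value in pairs for _ in range(count)]


def rle_bits(pairs, count_bits=8, colour_bits=8):
    """Bits needed if every run stores one count and one colour."""
    return len(pairs) * (count_bits + colour_bits)


top_row = ["W"] * 8              # 8 white pixels
encoded = rle_encode(top_row)

print(encoded)                   # [(8, 'W')]
print(len(top_row) * 8)          # 64 bits uncompressed
print(rle_bits(encoded))         # 16 bits with run length encoding
print(rle_decode(encoded) == top_row)  # True: RLE is lossless
```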

Huffman Encoding
Huffman Encoding is often used in Lossless compression. It uses a greedy algorithm to build a binary tree that allocates each item a unique code, ensuring that the most frequently occurring items get the shortest codes.

"learning is a journey, enjoy"

Doing a frequency analysis on the quote above, we can see that the most frequently occurring characters are 'n' and the space, which appear 4 times each, followed by 'e', which appears 3 times. Continuing this frequency analysis, we can put each character in a chart in order of frequency.
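
The frequency analysis can be done by hand or, as a quick check, with Python's standard `collections.Counter`. This is just one convenient way to count characters.

```python
from collections import Counter

quote = "learning is a journey, enjoy"
frequencies = Counter(quote)

# List every character with its count, most frequent first.
for symbol, count in frequencies.most_common():
    print(repr(symbol), count)

print(frequencies["n"], frequencies[" "], frequencies["e"])  # 4 4 3
```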

We can then use a tree to illustrate this and allocate each letter a binary code. The letters that occur most frequently go nearest the top of the tree, and the codes allocated to these letters require fewer bits than those for letters further down the tree.

Looking at the allocation of codes in this method, we can see that each letter in the quote was allocated the following binary representation. This allocation can then be used as the key for recovering the compressed file with zero loss of the original quality.
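
The tree-building steps can be sketched in Python as below. This is a minimal illustration using the standard `heapq` module; the function names `huffman_codes`, `encode` and `decode` are assumptions made for this example. When two items have the same frequency the tie can be broken either way, so the exact codes produced here may differ from a tree drawn by hand, although the total number of bits will be the same.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a Huffman code table {symbol: bit string} for `text`."""
    if not text:
        return {}
    # Each heap entry is (frequency, tie_breaker, {symbol: code_so_far}).
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:               # one distinct symbol still needs a bit
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Greedy step: merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, codes):
    """Replace every character with its bit string."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits until they match a code, emit that symbol, then repeat."""
    lookup = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:        # no code is a prefix of another
            out.append(lookup[current])
            current = ""
    return "".join(out)


quote = "learning is a journey, enjoy"
codes = huffman_codes(quote)
bits = encode(quote, codes)

print(len(quote) * 8, len(bits))     # bits at 8 per character vs Huffman
print(decode(bits, codes) == quote)  # True: zero loss
```

The comparison ignores the space needed to store the code table itself, which a real file would also have to include.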

Run Length Encoding Illustrated

Huffman encoding explained

VIDEO COMPRESSION
Interlaced VS Progressive
Two popular methods used to compress video files are explained in the video below by 'techquickie'.

With an interlaced method, one frame displays only the odd lines of the picture and the next frame displays only the even lines. With progressive, the entire frame is displayed each time; this gives better quality but is heavier on data transfer and bandwidth.
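
The split into odd and even lines can be pictured with a small Python sketch. The 8-line frame and the helper `weave` are hypothetical, used only to show that each interlaced frame carries half of the lines.

```python
# A hypothetical frame made of 8 numbered lines.
frame = [f"line {n}" for n in range(1, 9)]

odd_field = frame[0::2]    # lines 1, 3, 5, 7
even_field = frame[1::2]   # lines 2, 4, 6, 8


def weave(odd, even):
    """Interleave the two half-frames back into one full frame."""
    woven = []
    for odd_line, even_line in zip(odd, even):
        woven.extend([odd_line, even_line])
    return woven


print(len(frame), len(odd_field), len(even_field))  # 8 4 4
print(weave(odd_field, even_field) == frame)        # True
```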

TEMPORAL REDUNDANCY
Temporal redundancy is reduced by breaking each frame into smaller parts and only loading the parts of the frame that change. Parts of the picture that do not change between frames are encoded once and reused in each relevant subsequent frame: only the moving parts are loaded per frame, whilst a background that does not change may be encoded to stay the same for the next X frames.
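
A much simplified sketch of this idea in Python is shown below. Each frame is treated as a list of named blocks, and only the blocks that differ from the previous frame are stored. The function names `frame_delta` and `apply_delta` and the sample frames are assumptions made for illustration; real video codecs are far more sophisticated.

```python
def frame_delta(previous, current):
    """Return {position: new_block} for the blocks that changed."""
    return {i: cur
            for i, (prev, cur) in enumerate(zip(previous, current))
            if prev != cur}


def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(previous)
    for i, block in delta.items():
        frame[i] = block
    return frame


frame1 = ["sky", "sky", "tree", "car", "road", "road"]
frame2 = ["sky", "sky", "tree", "road", "car", "road"]   # the car has moved

delta = frame_delta(frame1, frame2)
print(delta)                                 # {3: 'road', 4: 'car'}
print(apply_delta(frame1, delta) == frame2)  # True
```

Only 2 of the 6 blocks need to be sent for the second frame; the unchanged background is reused from the first.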

1: Using run length encoding on the image below, which row could end up taking more data after being compressed, and why?

IMAGE 1 : 8 x 8 pixels Bit Depth 8

2: What would the file size of the image be before and after using run length encoding?
3: Using Huffman encoding, draw a tree and list the binary representation of each letter for the following quote:

'BACK TO THE FUTURE'