transparent-huff

transparent-huff implements, in C, Huffman coding for file compression in a way that allows you to see the data structures used for encoding. This, along with the easy-to-follow source code, can hopefully give you a better idea of how Huffman coding works and how it may be implemented in a non-memory-managed language.

Preview

Input File Contents:
sleeveless lee sees sleeves

Byte Frequencies:
 32 ( ): 3
101 (e): 11
108 (l): 4
115 (s): 7
118 (v): 2

Huffman Tree:
27
├ 16
│ ├ 9
│ │ ├ 5
│ │ │ ├ 3:  32 ( )
│ │ │ └ 2: 118 (v)
│ │ └ 4: 108 (l)
│ └ 7: 115 (s)
└ 11: 101 (e)

Prefix Code (Symbol-to-Codeword Mappings):
 32 ( ): 0000
101 (e): 1
108 (l): 001
115 (s): 01
118 (v): 0001

Results:
The compressed file is 70% the size of the original file.
The original file and the decompressed file match.

Purpose

Get experience writing and debugging bit-level software
Work out my C muscles, including non-trivial pointer usage and declarations
Work out my recursion muscles
Provide straightforward and well-written(?) source code as a reference for people working on similar projects
Learn how to do cross-architecture testing
Learn how to fuzz (using american fuzzy lop)

Usage (Linux)

Make sure you have gcc installed (for compiling the C source files).
Download or clone this repository and move into its directory.
Run ./build.sh to compile the source into the encoder and decoder binaries.

With Helper Script

Run the compression-and-decompression script with any file.
- ./compress-then-decompress.sh sample-files/slss

Without Helper Script

Both the encoder and decoder binaries read from the given filename, output messages to stderr, and output data to stdout. By default, both stderr and stdout get printed to the terminal, but in these steps we will separate the streams by redirecting stdout to a file. Note that this will overwrite the file you are redirecting to.

Compress any file, redirecting stdout to your desired filename.
- ./encoder sample-files/slss > slss.compressed
Decompress the compressed file, redirecting stdout to your desired filename.
- ./decoder slss.compressed > slss.decompressed

Notes

When compressing very small files, the compressed file is actually bigger than the original file because the encoded data plus the metadata needed to decode it (which is the number of bytes encoded and the Huffman tree) takes up more bytes than the original data itself.
When compressing a file that has only 1 unique byte/symbol, an extra, arbitrary node is added to maintain the fact that the Huffman tree is a binary tree, since that is what the related functions operate on. Otherwise, logic would be needed to also handle 1-node "trees".
You can quickly make your own sample file without a newline character at the end by running something like echo -n "alfalfa" > filename-here on Linux. Note that this will overwrite the file if it already exists.
I was about to make a test file that forced the maximum codeword length of 255, but if my reasoning and math are correct, that file would have this many bytes: 1 + sum 2^i, i=0 to 254
To fully understand the source code, you should have a basic idea of how Huffman coding works. One way you can learn is by watching the video in the "Thanks" section below.
For some functions, I used declarations like int function(int array[256]) instead of int function(int *array) to make it clear that the array is expected to have exactly that many elements, even though the argument just decays to a pointer anyway.
I used Valgrind to fix any memory leaks I could find. I did this by testing each return branch in the main functions of both src/encoder.c and src/decoder.c. I don't know if that is sufficient to say that there are no possible memory leaks, though.
These programs are definitely not the fastest nor the most memory efficient, but that's OK since they weren't designed to be.
After fixing problems found via fuzzing, src/decoder.c has a lot more error handling and is not as simple as it was.

Author

This project was created by Brandon Simmons.

Thanks

Thanks to Marty Stepp for explaining in this video how Huffman coding works: 42 CS 106B Lecture Huffman Encoding in detail.

Thanks to skeeto for testing these programs and offering advice on how they can be improved.

Thanks to David Dailey for making this webpage of words that are long but have few unique letters. I used some of those words here as examples.

Thanks to the countless people whose explanations I read online while researching things about C and Huffman coding.

And thanks to you for visiting.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
sample-files		sample-files
src		src
LICENSE		LICENSE
README.md		README.md
build.sh		build.sh
compress-then-decompress.sh		compress-then-decompress.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

transparent-huff

Preview

Purpose

Usage (Linux)

With Helper Script

Without Helper Script

Notes

Author

Thanks

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

transparent-huff

Preview

Purpose

Usage (Linux)

With Helper Script

Without Helper Script

Notes

Author

Thanks

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages