Huffman Coding Algorithm
In the world of data compression, one of the most efficient and widely used techniques is the Huffman Coding Algorithm.
Invented by David A. Huffman in 1952, it uses a greedy approach to generate prefix codes that minimize the overall size of the encoded data.
What is Huffman Coding?
Huffman Coding is a lossless compression algorithm. It assigns variable-length codes to input characters based on their frequency of occurrence. The basic idea is:
- Characters that occur more frequently are assigned shorter codes.
- Characters that occur less frequently are assigned longer codes.
This reduces the average length of encoded data and hence achieves compression.
Why Use Trees?
The algorithm builds a special type of binary tree, called a Huffman Tree, to represent the codes.
Each leaf node in the tree represents a character, and the path from the root to the leaf forms its binary code.
Steps of the Huffman Algorithm
Step 1: Create a Priority Queue (Min-Heap)
For each character and its frequency, create a node with:
- `char`: the character
- `freq`: the frequency
- `left` and `right`: initially null

Insert the node into a min-heap based on frequency.
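This step can be sketched in Python with the standard-library `heapq` module. The `Node` class and the variable names below are illustrative choices, not part of the original description:

```python
import heapq
from itertools import count

class Node:
    """A Huffman tree node: leaves carry a character, internal nodes don't."""
    def __init__(self, char, freq, left=None, right=None):
        self.char = char    # the character, or None for internal nodes
        self.freq = freq    # frequency of this character/subtree
        self.left = left
        self.right = right

# Build a min-heap keyed on frequency. The counter acts as a tie-breaker
# so heapq never has to compare Node objects directly.
freqs = {"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}
tie = count()
heap = [(f, next(tie), Node(ch, f)) for ch, f in freqs.items()]
heapq.heapify(heap)
```

After `heapify`, the lowest-frequency node (`F:5` in this sample table) sits at the top of the heap, ready to be removed first.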
Step 2: Build the Huffman Tree
- While the heap contains more than one node:
  - Remove the two nodes with the smallest frequencies.
  - Create a new internal node with:
    - `freq = freq1 + freq2`
    - `left = node1`, `right = node2`
    - `char = null` (since it's not a leaf)
  - Insert the new node back into the min-heap.
After the loop, the remaining node in the heap is the root of the Huffman Tree.
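Here is a minimal, self-contained Python sketch of this loop (the `Node` class and function name are my own choices for illustration):

```python
import heapq
from itertools import count

class Node:
    def __init__(self, char, freq, left=None, right=None):
        self.char, self.freq = char, freq
        self.left, self.right = left, right

def build_huffman_tree(freqs):
    """Repeatedly merge the two lowest-frequency nodes until one root remains."""
    tie = count()  # tie-breaker so heapq never compares Node objects
    heap = [(f, next(tie), Node(ch, f)) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)   # smallest frequency
        f2, _, n2 = heapq.heappop(heap)   # second smallest
        # Internal node: char is None, freq is the sum of both children.
        heapq.heappush(heap, (f1 + f2, next(tie), Node(None, f1 + f2, n1, n2)))
    return heap[0][2]  # the root of the Huffman tree

root = build_huffman_tree({"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5})
print(root.freq)  # the root's frequency equals the total frequency: 100
```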
Step 3: Generate Huffman Codes
- Define a recursive function `generateCodes(node, code)`:
  - If `node` is a leaf (i.e., `node.char` is not null), assign the current `code` as the Huffman code for the character.
  - Otherwise, recursively call:
    - `generateCodes(node.left, code + "0")`
    - `generateCodes(node.right, code + "1")`
- Call `generateCodes(root, "")` to start traversal from the root with an empty code.
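A recursive sketch of this traversal in Python follows (function and variable names are mine; note that which child receives "0" versus "1" is only a convention, so different implementations may produce mirrored codes of the same lengths):

```python
import heapq
from itertools import count

class Node:
    def __init__(self, char, freq, left=None, right=None):
        self.char, self.freq = char, freq
        self.left, self.right = left, right

def build_huffman_tree(freqs):
    tie = count()  # tie-breaker so heapq never compares Node objects
    heap = [(f, next(tie), Node(ch, f)) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), Node(None, f1 + f2, n1, n2)))
    return heap[0][2]

def generate_codes(node, code="", table=None):
    """Append '0' going left and '1' going right; record the path at each leaf."""
    if table is None:
        table = {}
    if node.char is not None:           # leaf: the accumulated path is its code
        table[node.char] = code or "0"  # edge case: a single-symbol alphabet
    else:
        generate_codes(node.left, code + "0", table)
        generate_codes(node.right, code + "1", table)
    return table

root = build_huffman_tree({"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5})
codes = generate_codes(root)
print(sorted(codes.items()))
```

Higher-frequency characters end up with shorter codes: here `A` (frequency 45) gets a 1-bit code, while `E` and `F` get 4-bit codes.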
Step 4: Calculate Total Encoded Data Length
- Initialize `totalBits = 0`.
- For each character:
  - Get its `frequency` and code length.
  - Multiply: `encodedBits = frequency × code length`.
  - Add `encodedBits` to `totalBits`.
Step 5: Calculate Average Code Length
- Compute the total frequency: `sumFreq` = sum of all character frequencies.
- Compute: `averageLength = totalBits ÷ sumFreq`.
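Both calculations fit in a few lines of Python. The tiny two-symbol `freqs`/`codes` table below is hypothetical, purely to show the arithmetic:

```python
def encoded_stats(freqs, codes):
    """Return (total encoded bits, average code length in bits per symbol)."""
    total_bits = sum(freqs[ch] * len(codes[ch]) for ch in freqs)
    sum_freq = sum(freqs.values())
    return total_bits, total_bits / sum_freq

# Hypothetical two-symbol code table, purely for illustration.
freqs = {"x": 3, "y": 1}
codes = {"x": "0", "y": "1"}
total_bits, average_length = encoded_stats(freqs, codes)
print(total_bits, average_length)  # 4 bits total, 1.0 bits per symbol
```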
Huffman Coding Calculation
Here’s a step-by-step walkthrough that calculates the Huffman codes, the length of the encoded data, and the average code length for the following frequency table: { A = 45, B = 13, C = 12, D = 16, E = 9, F = 5 }
Character | Frequency |
---|---|
A | 45 |
B | 13 |
C | 12 |
D | 16 |
E | 9 |
F | 5 |
Step 1: Build a Min-Heap (Priority Queue)
Start by placing all characters as leaf nodes in a priority queue, sorted by frequency: [F:5, E:9, C:12, B:13, D:16, A:45]
Step 2: Build the Huffman Tree
We repeatedly remove the two nodes with the lowest frequency, combine them into a new node, and reinsert it into the queue.
Iteration 1:
Remove F (5) and E (9) → New node FE = 5 + 9 = 14
Heap: [C:12, B:13, D:16, A:45, FE:14]
Iteration 2:
Remove C (12) and B (13) → New node CB = 12 + 13 = 25
Heap: [D:16, A:45, FE:14, CB:25]
Iteration 3:
Remove FE (14) and D (16) → New node FED = 14 + 16 = 30
Heap: [A:45, CB:25, FED:30]
Iteration 4:
Remove CB (25) and FED (30) → New node CBFED = 25 + 30 = 55
Heap: [A:45, CBFED:55]
Final Iteration:
Remove A (45) and CBFED (55) → New node Root = 45 + 55 = 100
Now, the tree is complete. Let’s move on to generating codes.
Step 3: Assign Huffman Codes (by Tree Traversal)
Assign:
- `0` when going left
- `1` when going right
Start from the root and trace to each character.
```
             (100)
            /     \
        A(45)     (55)
                 /    \
             (25)      (30)
             /  \      /  \
         C(12) B(13) (14) D(16)
                      /  \
                   F(5)  E(9)
```
Now, trace each character
Character | Code Path | Huffman Code |
---|---|---|
A | Left | 0 |
C | Right → Left → Left | 100 |
B | Right → Left → Right | 101 |
F | Right → Right → Left → Left | 1100 |
E | Right → Right → Left → Right | 1101 |
D | Right → Right → Right | 111 |
Step 4: Calculate Encoded Data Length
Multiply each character's frequency by its code length.
Character | Frequency | Code | Code Length | Frequency × Code Length |
---|---|---|---|---|
A | 45 | 0 | 1 | 45 × 1 = 45 |
B | 13 | 101 | 3 | 13 × 3 = 39 |
C | 12 | 100 | 3 | 12 × 3 = 36 |
D | 16 | 111 | 3 | 16 × 3 = 48 |
E | 9 | 1101 | 4 | 9 × 4 = 36 |
F | 5 | 1100 | 4 | 5 × 4 = 20 |
Total Encoded Data Length = 45 + 39 + 36 + 48 + 36 + 20 = 224 bits
Step 5: Calculate Average Code Length
Use the formula:
- Average Code Length (bits per symbol) = (Total Encoded Data Length) / (Total Frequency)
- Total Frequency = 45 + 13 + 12 + 16 + 9 + 5 = 100
- Average Code Length (bits per symbol) = 224 / 100 = 2.24 bits/character
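The whole worked example can be checked end-to-end with a short script. This is a sketch using the standard-library `heapq`; the class and function names are my own:

```python
import heapq
from itertools import count

class Node:
    def __init__(self, char, freq, left=None, right=None):
        self.char, self.freq = char, freq
        self.left, self.right = left, right

def huffman_codes(freqs):
    """Build the Huffman tree, then walk it: '0' on left branches, '1' on right."""
    tie = count()  # tie-breaker so heapq never compares Node objects
    heap = [(f, next(tie), Node(ch, f)) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), Node(None, f1 + f2, n1, n2)))
    codes = {}
    def walk(node, code):
        if node.char is not None:
            codes[node.char] = code or "0"
        else:
            walk(node.left, code + "0")
            walk(node.right, code + "1")
    walk(heap[0][2], "")
    return codes

freqs = {"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}
codes = huffman_codes(freqs)
total_bits = sum(freqs[ch] * len(codes[ch]) for ch in freqs)
average = total_bits / sum(freqs.values())
print(total_bits, average)  # 224 2.24
```

The output matches the hand calculation above: 224 total bits and an average of 2.24 bits per character.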
Final Results:
Huffman Codes
- A: `0`
- B: `101`
- C: `100`
- D: `111`
- E: `1101`
- F: `1100`
Total Encoded Length: 224 bits
Average Code Length: 2.24 bits per character
Numerical Question
1. Given the following characters and their frequencies:
Character | Frequency |
---|---|
A | 10 |
B | 15 |
C | 30 |
D | 16 |
E | 29 |
Tasks
- Build the Huffman tree.
- Generate the Huffman code for each character.
- Calculate the total encoded data length.
- Find the average code length.
2. Given these frequencies: {M: 35, N: 30, O: 20, P: 10, Q: 5}
Tasks
- Create the Huffman tree step-by-step.
- Show how the codes are generated.
- Calculate the total bits used to encode the data.
- Compute the average bits per symbol.
3. You are given the following symbols with their frequencies:
Symbol | Frequency |
---|---|
W | 7 |
X | 3 |
Y | 8 |
Z | 6 |
V | 2 |
T | 10 |
S | 5 |
Tasks:
- Construct the full Huffman tree.
- Assign codes to each character.
- Find the total encoded data length.
- Compare it with fixed-length encoding (3 bits per symbol). How many bits are saved?
Advantages of Huffman Coding
- Efficient Compression: Reduces file size by minimizing average code length.
- Lossless: Original data can be perfectly reconstructed.
- Widely Used: Employed in formats like ZIP, JPEG, MP3, etc.
Time and Space Complexity
- Time Complexity: O(n log n), where `n` is the number of unique characters (due to priority queue operations).
- Space Complexity: O(n) for storing the tree and the codes.
The Huffman Algorithm is a classic example of how greedy strategies and tree structures can be combined to solve real-world problems efficiently. It's not just theoretical: it powers everyday technologies we use to store and transmit data.
By understanding how Huffman trees are built and how codes are assigned, you gain insights into both compression and algorithmic thinking.