Huffman Coding Algorithm

In the world of data compression, one of the most efficient and widely used techniques is the Huffman Coding Algorithm.

Invented by David A. Huffman in 1952, it uses a greedy approach to generate prefix codes that minimize the overall size of the encoded data.

What is Huffman Coding?

Huffman Coding is a lossless compression algorithm. It assigns variable-length codes to input characters based on their frequency of occurrence. The basic idea is:

  • Characters that occur more frequently are assigned shorter codes.
  • Characters that occur less frequently are assigned longer codes.

This reduces the average length of encoded data and hence achieves compression.

Why Use Trees?

The algorithm builds a special type of binary tree, called a Huffman Tree, to represent the codes.

Each leaf node in the tree represents a character, and the path from the root to the leaf forms its binary code.


Steps of the Huffman Algorithm

Step 1: Create a Priority Queue (Min-Heap)

For each character and its frequency, create a node with:

  • char: the character
  • freq: the frequency
  • left and right: initially null

Insert the node into a min-heap based on frequency.
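Step 1 can be sketched in Python using the standard-library heapq module. The Node class and the example frequencies are illustrative assumptions; the counter is a tie-breaker so the heap never has to compare Node objects directly:

```python
import heapq
import itertools

# Hypothetical Node class matching the fields described above.
class Node:
    def __init__(self, char, freq, left=None, right=None):
        self.char = char      # the character (null/None for internal nodes)
        self.freq = freq      # the frequency
        self.left = left
        self.right = right

freqs = {"a": 5, "b": 2, "c": 1}          # assumed example frequencies

# Build the min-heap keyed on frequency; the counter breaks ties.
counter = itertools.count()
heap = [(f, next(counter), Node(c, f)) for c, f in freqs.items()]
heapq.heapify(heap)

# The lowest-frequency node is now at the front of the heap.
print(heap[0][2].char, heap[0][2].freq)   # → c 1
```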

Step 2: Build the Huffman Tree

  • While the heap contains more than one node:
    • Remove the two nodes with the smallest frequencies.
    • Create a new internal node:
      • freq = freq1 + freq2
      • left = node1, right = node2
      • char = null (since it's not a leaf)
    • Insert the new node back into the min-heap.

After the loop, the remaining node in the heap is the root of the Huffman Tree.
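The loop above can be sketched as a small Python function (a minimal sketch; the Node class and the tie-breaking counter are implementation assumptions, not part of the pseudocode):

```python
import heapq
import itertools

class Node:
    def __init__(self, char, freq, left=None, right=None):
        self.char, self.freq = char, freq
        self.left, self.right = left, right

def build_tree(freqs):
    """Combine the two lowest-frequency nodes until one root remains."""
    counter = itertools.count()             # tie-breaker for equal freqs
    heap = [(f, next(counter), Node(c, f)) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)     # two smallest frequencies
        f2, _, n2 = heapq.heappop(heap)
        merged = Node(None, f1 + f2, n1, n2)  # char = None: internal node
        heapq.heappush(heap, (merged.freq, next(counter), merged))
    return heap[0][2]                       # root of the Huffman Tree

root = build_tree({"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5})
print(root.freq)  # → 100 (the sum of all frequencies)
```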

Step 3: Generate Huffman Codes

  • Define a recursive function generateCodes(node, code):
    • If node is a leaf (i.e., node.char is not null), assign the current code as the Huffman code for the character.
    • Otherwise, recursively call:
      • generateCodes(node.left, code + "0")
      • generateCodes(node.right, code + "1")
  • Call generateCodes(root, "") to start the traversal from the root with an empty code.
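The traversal translates directly into Python. The small hand-built tree here is an assumed example, just to show the recursion in isolation:

```python
class Node:
    def __init__(self, char, freq, left=None, right=None):
        self.char, self.freq = char, freq
        self.left, self.right = left, right

codes = {}

def generate_codes(node, code=""):
    if node is None:
        return
    if node.char is not None:               # leaf: record its code
        codes[node.char] = code
        return
    generate_codes(node.left, code + "0")   # left edge appends "0"
    generate_codes(node.right, code + "1")  # right edge appends "1"

# A small hand-built tree: A on the left, (B, C) under the right child.
root = Node(None, 10,
            Node("A", 5),
            Node(None, 5, Node("B", 3), Node("C", 2)))
generate_codes(root)
print(codes)  # → {'A': '0', 'B': '10', 'C': '11'}
```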

Step 4: Calculate Total Encoded Data Length

  • Initialize totalBits = 0
  • For each character:
    • Get its frequency and code length
    • Multiply: encodedBits = frequency × code length
    • Add encodedBits to totalBits

Step 5: Calculate Average Code Length

  • Compute total frequency: sumFreq = sum of all character frequencies
  • Compute: averageLength = totalBits ÷ sumFreq
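Steps 4 and 5 reduce to a few lines. This sketch assumes the frequencies and codes from the worked example that follows:

```python
# Assumed inputs: frequencies and the Huffman codes derived for them.
freqs = {"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}
codes = {"A": "0", "B": "101", "C": "100", "D": "111",
         "E": "1101", "F": "1100"}

# Step 4: total encoded length = sum of frequency × code length.
total_bits = sum(freqs[c] * len(codes[c]) for c in freqs)

# Step 5: average code length = total bits / total frequency.
average_length = total_bits / sum(freqs.values())

print(total_bits, average_length)  # → 224 2.24
```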

Huffman Coding Calculation

Here’s a step-by-step walkthrough that calculates the Huffman codes, the length of the encoded data, and the average code length for the following frequency table: { A = 45, B = 13, C = 12, D = 16, E = 9, F = 5 }

Character    Frequency
A            45
B            13
C            12
D            16
E            9
F            5

Step 1: Build a Min-Heap (Priority Queue)

Start by placing all characters as leaf nodes in a priority queue, sorted by frequency: [F:5, E:9, C:12, B:13, D:16, A:45]

Step 2: Build the Huffman Tree

We repeatedly remove the two nodes with the lowest frequency, combine them into a new node, and reinsert it into the queue.

Iteration 1:

Remove F (5) and E (9) → New node FE = 5 + 9 = 14

Heap: [C:12, B:13, FE:14, D:16, A:45]

Iteration 2:

Remove C (12) and B (13) → New node CB = 12 + 13 = 25

Heap: [FE:14, D:16, CB:25, A:45]

Iteration 3:

Remove FE (14) and D (16) → New node FED = 14 + 16 = 30

Heap: [CB:25, FED:30, A:45]

Iteration 4:

Remove CB (25) and FED (30) → New node CBFED = 25 + 30 = 55

Heap: [A:45, CBFED:55]

Final Iteration:

Remove A (45) and CBFED (55) → New node Root = 45 + 55 = 100

Now, the tree is complete. Let’s move on to generating codes.

Step 3: Assign Huffman Codes (by Tree Traversal)

Assign:

  • 0 when going left
  • 1 when going right

Start from the root and trace to each character.

             (100)
            /     \
         A(45)   (55)
                /    \
            (25)      (30)
           /   \      /   \
        C(12) B(13)  (14)  D(16)
                     /  \
                  F(5)   E(9)

Now, trace the path from the root to each character:

Character    Code Path                         Huffman Code
A            Left                              0
C            Right → Left → Left               100
B            Right → Left → Right              101
F            Right → Right → Left → Left       1100
E            Right → Right → Left → Right      1101
D            Right → Right → Right             111

Step 4: Calculate Encoded Data Length

Multiply each character's frequency by its code length.

Character    Frequency    Code     Code Length    Frequency × Code Length
A            45           0        1              45 × 1 = 45
B            13           101      3              13 × 3 = 39
C            12           100      3              12 × 3 = 36
D            16           111      3              16 × 3 = 48
E            9            1101     4              9 × 4 = 36
F            5            1100     4              5 × 4 = 20

Total Encoded Data Length = 45 + 39 + 36 + 48 + 36 + 20 = 224 bits

Step 5: Calculate Average Code Length

Use the formula:

  • Average Code Length (bits per symbol) = (Total Encoded Data Length) / (Total Frequency)
  • Total Frequency = 45 + 13 + 12 + 16 + 9 + 5 = 100
  • Average Code Length (bits per symbol) = 224 / 100 = 2.24 bits/character

Final Results:

Huffman Codes

  • A: 0
  • B: 101
  • C: 100
  • D: 111
  • E: 1101
  • F: 1100

Total Encoded Length: 224 bits

Average Code Length: 2.24 bits per character
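The whole worked example can be reproduced with a short Python script (a sketch combining the steps above; in general the exact bit patterns depend on how ties are broken, but this table has no tied frequencies, so the code lengths and totals are stable):

```python
import heapq
import itertools

class Node:
    def __init__(self, char, freq, left=None, right=None):
        self.char, self.freq = char, freq
        self.left, self.right = left, right

def huffman_codes(freqs):
    """Build the Huffman tree and return {character: code}."""
    counter = itertools.count()             # tie-breaker for the heap
    heap = [(f, next(counter), Node(c, f)) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter),
                              Node(None, f1 + f2, n1, n2)))
    codes = {}
    def walk(node, code):
        if node.char is not None:           # leaf: record its code
            codes[node.char] = code
            return
        walk(node.left, code + "0")
        walk(node.right, code + "1")
    walk(heap[0][2], "")
    return codes

freqs = {"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}
codes = huffman_codes(freqs)
total_bits = sum(freqs[c] * len(codes[c]) for c in freqs)
print(codes)
print(total_bits, total_bits / sum(freqs.values()))  # → 224 2.24
```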


Numerical Question

1. Given the following characters and their frequencies:

Character    Frequency
A            10
B            15
C            30
D            16
E            29

Tasks:

  • Build the Huffman tree.
  • Generate the Huffman code for each character.
  • Calculate the total encoded data length.
  • Find the average code length.

2. Given these frequencies: {M: 35, N: 30, O: 20, P: 10, Q: 5}

Tasks:

  • Create the Huffman tree step-by-step.
  • Show how the codes are generated.
  • Calculate the total bits used to encode the data.
  • Compute the average bits per symbol.

3. You are given the following symbols with their frequencies:

Symbol    Frequency
W         7
X         3
Y         8
Z         6
V         2
T         10
S         5

Tasks:

  • Construct the full Huffman tree.
  • Assign codes to each character.
  • Find the total encoded data length.
  • Compare it with fixed-length encoding (3 bits per symbol). How many bits are saved?

Advantages of Huffman Coding

Efficient Compression: Reduces file size by minimizing average code length.

Lossless: Original data can be perfectly reconstructed.

Widely Used: Employed in formats like ZIP, JPEG, MP3, etc.


Time and Space Complexity

Time Complexity: O(n log n) where n is the number of unique characters (due to priority queue operations).

Space Complexity: O(n) for storing the tree and the codes.

The Huffman Algorithm is a classic example of how greedy strategies and tree structures can be combined to solve real-world problems efficiently. It's not just theoretical: it powers everyday technologies we use to store and transmit data.

By understanding how Huffman trees are built and how codes are assigned, you gain insights into both compression and algorithmic thinking.