Using a Huffman Tree to Compress a String in Python

Ever wondered how the modern compression algorithms used in gzip, zip, or zlib work? I did recently, and to get a better understanding I decided to reproduce, in Python, the Huffman encoding algorithm that all these programs have in common.

The problem with conventional, fixed-length encoding is that it assigns the same number of bits to each character, leading to an inefficient representation. Huffman encoding, created by David A. Huffman in 1952, employs a variable-length code that assigns shorter codes to frequently occurring characters and longer codes to less common ones. This results in a more compact representation and significant data compression. Together with the LZ77 algorithm, which replaces duplicate strings with pointers, Huffman coding is the main component of the DEFLATE algorithm used to compress, among other things, PDF streams and HTTP traffic.

The tricky part of any variable-length encoding is recovering the boundaries between the individual codes that represent the symbols. This is exactly what the Huffman tree allows us to do, provided we can obtain it. There are two options: a static Huffman tree, a fixed, standardized tree defined ahead of time (the DEFLATE specification, for example, defines one), and a dynamic Huffman tree, which is computed during compression and stored alongside the encoded data so it can be used to decompress later on.

A static Huffman tree will not be optimal, as it is not adapted to the actual data being compressed. However, if the original data is short, the non-optimal encoding can still be more compact, since we can omit the representation of the computed Huffman tree itself from the final encoding. In most practical cases, however, a dynamically computed tree is the more efficient choice.

How do we build the Huffman tree?

The first step in creating our Huffman tree is to establish the character frequencies, which are proportional to the probability of reading a given symbol in the data.

Python
# count how often each symbol occurs in the input sequence
freq_dist = {}
for symbol in seq:
  freq_dist[symbol] = freq_dist.get(symbol, 0) + 1
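As a side note, the standard library can produce the same frequency count in one call; this snippet is an equivalent alternative, not part of the original code:

Python
from collections import Counter

# equivalent frequency count using the standard library
freq_dist = dict(Counter(seq))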

To generate the binary Huffman tree from the symbols and their probabilities, we follow this algorithm:

  1. Create a leaf node for each symbol and add it to a queue.
  2. While there is more than one node in the queue:
    • Remove the two nodes with the lowest probability from the queue
    • Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes’ probabilities.
    • Add the new node to the queue.
  3. The remaining node is the root node and the tree is complete.

In Python we store the tree as a dictionary mapping each symbol to the path from the root to the leaf node representing that symbol. These paths act as the codes for the symbols, with more frequent symbols getting shorter paths and less frequent symbols longer ones. The paths are recorded as the tree is constructed according to the algorithm above.

Python
# build a tree of nested tuples: a leaf is ((symbols, count),), an internal
# node is ((symbols, count), left child, right child)
paths = {}

def merge_nodes(node_a, node_b):
  # extend all paths of symbols on the left with a leading "0"
  for s in node_a[0][0]:
      paths[s] = "0" + paths.get(s, "")
  # extend all paths of symbols on the right with a leading "1"
  for s in node_b[0][0]:
      paths[s] = "1" + paths.get(s, "")
  # concatenate the symbols and sum the counts of the two children
  return (tuple(a + b for a, b in zip(node_a[0], node_b[0])), node_a, node_b)

remaining = [((s, p),) for s, p in freq_dist.items()]
while len(remaining) > 1:
  # sort by count, breaking ties by symbol
  remaining = sorted(remaining, key=lambda x: (x[0][1], x[0][0]))
  # remove the two nodes with the lowest counts
  first = remaining.pop(0)
  second = remaining.pop(0)
  # merge the nodes and sum their counts
  new_node = merge_nodes(first, second)
  # add the merged node back into the queue
  remaining.append(new_node)

tree = remaining.pop(0)
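With the paths in place, compressing the message itself is just a lookup per symbol; because the codes are prefix-free, they can be concatenated without any delimiters. A minimal sketch (in the HuffmanTree class used at the end of the article, compress() presumably combines this step with the serialized tree):

Python
# encode the input by concatenating the path of each symbol
encoded = "".join(paths[s] for s in seq)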

Now, if we want to recover the original message from such an encoding, we read the individual bits from left to right. According to the bit currently read, we follow the tree and emit a symbol as soon as we reach a leaf node. Then we start again at the root of the tree. The symbol boundaries are given by the tree traversal, i.e. whenever we reach a leaf there is a boundary. This is what makes the variable-length encoding of the message decodable.
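Equivalently, since the paths dictionary already encodes the tree, we can decode without the tuple structure by inverting the dictionary and consuming bits greedily; a minimal sketch (decode is my name for this helper, it does not appear in the original code):

Python
# decode a bit string by accumulating code prefixes; prefix-freeness
# guarantees that each accumulated prefix matches at most one symbol
def decode(bits, paths):
  code_for = {code: symbol for symbol, code in paths.items()}
  out, current = [], ""
  for bit in bits:
      current += bit            # follow one edge down the tree
      if current in code_for:   # reached a leaf: emit and restart at the root
          out.append(code_for[current])
          current = ""
  return "".join(out)

For example, decode(encoded, paths) restores the original seq.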

How do we encode the Huffman tree?

A Huffman tree with a large number of unique symbols can take up considerable space, so we need a clever way to represent it in as few bits as possible. To achieve this, canonical Huffman coding was introduced: it restructures the path values assigned to the symbols so that all symbols with codes of a given length receive their path values sequentially (alphabetically in the case of strings). For example, if a, b, and c have code lengths 1, 2, and 2, their canonical codes are 0, 10, and 11. With this convention it is sufficient to transmit just the code length of each symbol, in sequential order, to restore the full tree. If the number of unique symbols is much smaller than the whole alphabet, it can be better to store the actual symbols plus, for each distinct code length, a number saying how many symbols share that length. Either of these methods lets us reconstruct the canonical Huffman tree. Here we use the latter, since our tree has a relatively small number of symbols.

Python
def form_canon_path(path, new_length):
  # the integer 0 sentinel marks the first code: a string of 0s of the shortest length
  if path == 0:
      path = "0"*new_length
  else:
      path = format(int(path, 2) + 1, 'b').rjust(
          len(path), "0" # fill back 0s on the left lost in int conversion
      ).ljust(
          new_length, "0" # append 0s on the right to reach the required bit length
      )
  return path
    
def canonicalize(paths):
  # sort paths by length and then alphabetically by their leaf node
  sorted_paths = sorted(paths.items(), key=lambda x:(len(x[1]), x[0]))
  # initialize with the integer sentinel; the first call to form_canon_path
  # turns it into a string of 0s with the length of the shortest code
  canon_path = 0
  for symbol, path in sorted_paths:
      # add one to previous code
      canon_path = form_canon_path(canon_path, len(path))
      paths[symbol] = canon_path
  return paths
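
# sanity check on a tiny, hypothetical example (not part of the original code)
assert canonicalize({"a": "0", "b": "11", "c": "10"}) == {"a": "0", "b": "10", "c": "11"}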
  
paths = canonicalize(paths)
# get all the symbols and their paths, sorted by path length and then alphabetically
all_symbols, all_paths = map(
    list, 
    zip(*sorted(paths.items(), key=lambda x:(len(x[1]), x[0])))
)
# count how many symbols share the same code length (code length bins)
bit_lens = [0 for _ in range(max([len(p) for p in all_paths]))]
for path in all_paths:
    bit_lens[len(path)-1]+=1

# serialize the canonical tree to a bit string of 7-bit integers
ct = "".join([
    "".join(
        [format(bl,'b').zfill(7) for bl in [len(bit_lens)]+bit_lens]
    ), 
    "".join(
        [format(len(all_symbols),'b').zfill(7)]+
        [format(ord(s),'b').zfill(7) for s in all_symbols]
    )
])

In the code example we serialize the canonicalized tree according to the second strategy. That means we first encode a number (num_bins) saying how many of the following numbers represent counts of symbols sharing the same code length. Then follow the actual counts. After that comes a number saying how many symbols follow, and then the Unicode code points of all the symbols. Schematically, the header is [num_bins | count of 1-bit codes | … | count of num_bins-bit codes | num_symbols | symbol 1 | … | symbol N], with every field 7 bits wide. We use 7-bit integers for all these numbers to save a bit of space, at the cost of restricting the possible symbol vocabulary (and all counts) to values below 128.

How do we decode the Huffman tree?

For decoding, we do the bit-string conversion in reverse, which may make the format a bit easier to follow. Then we can use the same trick from the canonicalization to recover the tree.

Python
# consume the encoded canonical tree (ct): recover the code length bins and the symbol stack
# how many bins?
num_bins = int(ct[:7], 2)
ct = ct[7:]
bit_nums = [int(ct[i:i+7], 2) for i in range(0, num_bins*7, 7)]
ct = ct[num_bins*7:]
# how many symbols?
num_symbols = int(ct[:7], 2)
ct = ct[7:]
symbolstack = [chr(int(ct[i:i+7], 2)) for i in range(0, num_symbols*7, 7)]
ct = None

# decompress a tree from the code length bins and the unique symbols recovered from the encoding       
def decompress_tree(bit_nums, symbolstack):
  paths = {}
  path = 0
  for i, nums in enumerate(bit_nums):
      code_len = i+1
      for _ in range(nums):
          # pop as many symbols as there are of the same code length
          symbol = symbolstack.pop(0)
          # use the canonicalization algorithm to recover the tree 
          path = form_canon_path(path, code_len)
          paths[symbol] = path

  return paths
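To get the original message back, we recover the paths from the header and then decode the payload bits. A sketch of the glue code (message_bits stands for whatever follows the tree header in the real encoding, and decode is the helper sketched earlier; both names are mine, not the article's):

Python
# recover the canonical paths, then decode the Huffman-coded payload
paths = decompress_tree(bit_nums, symbolstack)
original = decode(message_bits, paths)  # message_bits: the bits after the header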

Putting it all together

Finally, let’s put all these steps into methods of a HuffmanTree class and test it on a few paragraphs of Lorem ipsum…. The test string is not included here since it is quite long, but it can be found on GitHub. We also report the weighted path length of the Huffman tree, which is guaranteed to lie within one bit of the Shannon entropy of the source (H ≤ L < H + 1) and approaches it the closer the symbol probabilities are to powers of 1/2.
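Two helpers used below are not shown in the article; here are minimal sketches consistent with how they are called (my implementations, assuming freq_dist maps each symbol to its count as built earlier):

Python
import math

# Shannon entropy in bits per symbol, from the empirical distribution
def calculate_entropy(seq, freq_dist):
  n = len(seq)
  return -sum((c/n) * math.log2(c/n) for c in freq_dist.values())

# average code length: path lengths weighted by symbol probability
# (in the article this is the get_weighted_pathlength method of HuffmanTree)
def get_weighted_pathlength(paths, freq_dist):
  n = sum(freq_dist.values())
  return sum(freq_dist[s]/n * len(path) for s, path in paths.items())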

Python
# string is a paragraph of lorem ipsum and we transform it to a
# binary string; note that format(ord(c), 'b') yields a variable number
# of bits (up to 7 for ASCII); .zfill(7) would give a strictly
# fixed-width baseline
uncompressed = ''.join([format(ord(c),'b') for c in string])

print(f"uncompressed length: {len(uncompressed)}")
ht = HuffmanTree()
compressed = ht.compress(string)

print(f"compressed length: {len(compressed)}")
print(f"longest path length: {max([len(e) for e in ht.paths.values()])}")
print(f"weighted path length: {ht.get_weighted_pathlength():.4f}")
print(f"shannon entropy: {calculate_entropy(string, ht.freq_dist):.4f}")

# save tree and reset the object
old_paths = ht.paths
ht = HuffmanTree()

decompressed = ht.decompress(compressed)
# Is the tree recovered correctly from the encoding?
assert old_paths == ht.paths
# Is the decompressed string correct?
assert string == decompressed

Running this code, we see that we achieve a considerable compression of roughly 36% (16929 vs. 26485 bits) on this relatively short text, and that the sum of the path lengths of the symbols, weighted by their probabilities, comes close to the theoretical limit of optimal compression of 4.2161 bits per symbol set by the Shannon entropy.

uncompressed length: 26485
compressed length: 16929
longest path length: 12
weighted path length: 4.2537
shannon entropy: 4.2161
  1. All the code to reproduce this can be found in the GitHub repository.
  2. More about Huffman codes can be found on Wikipedia.