Sunday, January 21, 2018

Compressing short texts

I want to compress text as much as possible so it fits in the 512-byte memo field of Zcash and Hush.

I didn't have any luck searching the internet for how to compress short texts...until after I had finished doing it from scratch. Normal compression algorithms are not useful because they need to include various look-up tables in the data, which can make small texts even bigger.

I used Huffman-like look-up tables for the language at hand, using the known frequency of letters and words to make decisions, combined with other tricks based on knowing things like the character before a capital letter is probably a period. I used 2 bits to tell the decoder how many bits were about to follow, depending on which table contained the word or letter to replace: 4, 6, 7, or 12 bits followed the 2 bits. The 4-bit table was for the 16 most common characters (e, t, a, i, o, l, s, n, newline, period, comma, from memory). The 6-bit table was for the rest of the ASCII-printable characters, the 7-bit table was for the most common words, and the 12-bit table was for less-common words; that table held 4096 words.
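Here's a rough sketch of that encoding scheme in Python. The table contents below are tiny placeholders for illustration, not my real tables, and decoding is omitted:

```python
# Sketch of the 2-bit-prefix scheme: the prefix selects a table, the bits
# after it index into that table. Table contents are placeholders only.

COMMON_CHARS = list("etaiolsn\n.,")           # placeholder: padded to 16 entries in practice
OTHER_CHARS  = [c for c in map(chr, range(32, 127)) if c not in COMMON_CHARS][:64]
COMMON_WORDS = ["the", "and", "you", "that"]  # placeholder: 128 entries in practice
BIG_WORDS    = ["compression", "algorithm"]   # placeholder: 4096 entries in practice

def encode_token(token):
    """Return the bit string for one word or character as '0'/'1' characters."""
    if token in COMMON_CHARS:
        return "00" + format(COMMON_CHARS.index(token), "04b")   # 2 + 4 bits
    if token in COMMON_WORDS:
        return "10" + format(COMMON_WORDS.index(token), "07b")   # 2 + 7 bits
    if token in BIG_WORDS:
        return "11" + format(BIG_WORDS.index(token), "012b")     # 2 + 12 bits
    # otherwise spell the token out character by character
    # (characters outside these placeholder tables aren't handled in this sketch)
    bits = []
    for c in token:
        if c in COMMON_CHARS:
            bits.append("00" + format(COMMON_CHARS.index(c), "04b"))
        else:
            bits.append("01" + format(OTHER_CHARS.index(c), "06b"))  # 2 + 6 bits
    return "".join(bits)

# encode_token("e") -> '000000' (6 bits), encode_token("the") -> '100000000' (9 bits)
```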

But here's a better job of it in precise detail:
https://academic.oup.com/comjnl/article-pdf/24/4/324/1039097/240324.pdf

He and I both got compression down to about 45% of the original English files. He did it with only 250 words instead of 4096, so his method is really good. We're using almost exactly the same scheme, but he used only 4 bits for the most common letters, which is why he was able to do as well with a lot less. It's hard to get better no matter how big the tables are, but my next improvement could be to follow his method exactly and add another table of 4096 words selected by 12 bits after the 4-bit prefix.

His method has a big additional advantage: because the code lengths stay a multiple of 4 bits, a common post-processing chain can be applied: the Burrows-Wheeler transform (BWT) followed by move-to-front (MTF), then arithmetic coding (AC). BWT gets like symbols closer together, which lets MTF assign low position numbers more often. BWT and MTF do not compress anything by themselves; AC does the compression after MTF, exploiting the lower entropy of the more frequent low position numbers. By itself, the BWT-MTF-AC chain got the original file down to 60% using 4-bit sequences. It did not benefit my method because my odd code lengths keep shifting the bit alignment, making the 4-bit sequences look more random.

4-bit symbols are better than bytes for short messages because AC has to attach a "dictionary" to the data (a list of each symbol's count in the original). With 4-bit symbols I need only 16 counts of up to 256 each, assuming no symbol is more than 25% of a 1000-byte pre-compression text (my goal is 512 bytes and I won't get more than 50% on text that has already been pre-compressed), so each count fits in one byte. That means only 16 extra bytes are needed versus 256 for 8-bit symbols (I'll just list the counts in a fixed order of fixed length instead of storing both symbol and count). The MTF stage gets a little longer since 1000 bytes become up to 2000 4-bit symbols to rank instead of 1000.
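To make the BWT-MTF front end concrete, here's a rough Python sketch of it operating on 4-bit symbols (nibbles). The arithmetic coder is omitted, and the sample input and helper names are just for illustration:

```python
def to_nibbles(data: bytes):
    """Split each byte into two 4-bit symbols."""
    out = []
    for b in data:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out

def bwt(symbols):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    s = symbols + [16]                        # 16 is a unique end-of-data sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return [rot[-1] for rot in rotations]

def mtf(symbols, alphabet_size=17):
    """Move-to-front: symbols clustered by BWT turn into mostly small rank numbers."""
    table = list(range(alphabet_size))
    out = []
    for sym in symbols:
        rank = table.index(sym)
        out.append(rank)
        table.insert(0, table.pop(rank))
    return out

ranks = mtf(bwt(to_nibbles(b"the quick brown fox jumps over the lazy dog")))
# An arithmetic coder over `ranks` would do the actual compression. With 4-bit
# symbols its count header stays small: 16 counts, each fitting in one byte for
# texts up to about 1000 bytes, which is the 16-byte overhead mentioned above.
```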

The 4-bit method is also immediately applicable to UTF-8.
