Hacker News Re-Imagined

Time-Series Compression Algorithms

  • 189 points
  • 8 days ago

  • @todsacerdoti
  • Created a post

Time-Series Compression Algorithms

@marcodiego 8 days

Replying to @todsacerdoti 🎙

Simple-8b is not talked about in Wikipedia.


@dhsysusbsjsi 8 days

This was a great read. Previously I have implemented some IoT telemetry using a simple protobuf spec using delta encoding and batching a few samples together for each send, so it sends a lot of zeros.

Protobuf uses variable length encoding so most small integers will be sent as 1 byte. The first 16 in the spec use 1 byte overhead to show there is a non-zero value, and subsequent use 2 bytes. So you want to pack all your mostly-zero (or zero deltas) values at the end of the packet, and use the first 16 for your commonly changed values. And it's always useful to determine your desired data resolution and just convert a potential 8-byte double to 2 or maybe 3 bytes length integer (with a fixed multiplication). This also works storing in SQLite: use a View to read out the values in the end-format, and store them as integers in a raw table.

This was _good enough_ for what I was doing but it's great to see how far you can go if you need extreme.


@infogulch 8 days

I found TerminuDB's white paper on graph compression interesting and I think could be useful to time series compression:

Succinct Data Structures and Delta Encoding for Modern Databases - https://assets.terminusdb.com/research/succinct-data-structu...


@injinj 8 days

I didn't see binary interpolative coding (BIC) referenced. It is one of my favorites introduced to me by the book "Managing Gigabytes" by Moffett and Bell [1]. It has great compression ratio for sequences and is commonly used in inverted indexes.

There is neat implementation [2] and technical paper [3] by Giulio Ermanno Pibiri, which I just found today by looking for it.

[1] https://people.eng.unimelb.edu.au/ammoffat/mg/

[2] https://github.com/jermp/interpolative_coding

[3] http://pages.di.unipi.it/pibiri/papers/BIC.pdf


One notable omission from this piece is a technique to compress integer time series with both positive and negative values.

If you naively apply bit-packing using the Simple8b algorithm, you'll find that negative integers are not compressed. This is due to how signed integers are represented in modern computers: negative integers will have their most significant bit set [1].

Zigzag encoding is a neat transform that circumvents this issue. It works by mapping signed integers to unsigned integers so that numbers with a small absolute value can be encoded using a small number of bits. Put another way, it encodes negative numbers using the least significant bit for sign. [2]

If you're looking for a quick way to experiment with various time series compression algorithm I highly recommend Daniel Lemire's FastPFor repository [3] (as linked in the article). I've used the Python bindings [4] to quickly evaluate various compression algorithms with great success.

Finally I'd like to humbly mention my own tiny contribution [5], an adaptation of Lemire's C++ Simple8b implementation (including basic methods for delta & zigzag encoding/decoding).

I used C++ templates to make the encoding and decoding routines generic over integer bit-width, which expands support up to 64 bit integers, and offers efficient usage with smaller integers (eg 16 bit). I made a couple other minor tweaks including support for arrays up to 2^64 in length, and tweaking the API/method signatures so they can be used in a more functional style. This implementation is slightly simpler to invoke via FFI, and I intend to add examples showing how to compile for usage via JS (WebAssembly), Python, and C#. I threw my code up quickly in order to share with you all, hopefully someone finds it useful. I intend to expand on usage examples/test cases/etc, and am looking forward to any comments or contributions.

[1] https://en.wikipedia.org/wiki/Signed_number_representation

[2] https://en.wikipedia.org/wiki/Variable-length_quantity#Zigza...

[3] https://github.com/lemire/FastPFor

[4] https://github.com/searchivarius/PyFastPFor

[5] https://github.com/naturalplasmoid/simple8b-timeseries-compr...


@spullara 8 days

For Wavefront I got this down to about 1.5 bytes per metric reported (on average across customers) where each metric was a tuple of:

name, source, timestamp (s), value (int) and a set of name/value pairs

Key to optimizing it was having a database that supported different compression strategies based on the data.


@sdmike1 7 days

On a related note, we've been using InfluxDB and running into an issue where the instance will occasionally have writes take several orders of magnitude longer than expected (to the point where a single write can take upwards of a second).

Has anyone else encountered behavior like this?


@rklaehn 8 days

I have found that a very good approach is to apply some very simple transformations such as delta encoding of timestamps, and then letting a good standard compression algorithm such as zstd or deflate take care of the rest.

Delta encoding of timestamps helps a lot though, because it makes the redundancy more visible to a general purpose compression algorithm.

I used this for the telemetry storage of the Columbus module of the International Space Station, back in ~2010, and then a few times since.




@awild 8 days

Question for people who've implemented these kind of compression schemes: some of these mechanisms require local context (rle and s8b) how do you handle random access in these cases? Is the metadata embedded as a sort of key frame to which we have to advance before decoding the value? Or as a separate block in the headers?

RLE especially sounds like it could degenerate into a binary search per lookup, maybe aided by caching the bounds of the last run and assuming locality and linear-ish usage?

This might sound obvious, but wasn't mentioned in the article, you can also apply integer based compression schemes on dictionary encoded data.

And floating point values don't always need those 64bits when the use case and input data don't require that level of precision.


@anonymoushn 8 days

Are there similar algorithms useful for compressing stuff like "a list of the top 100 scores in a game?" We already store these as diffs of the list, but we could probably do better than zstd-ing the list-diffs.


About Us

site design / logo © 2022 Box Piper