Ben Chuanlong Du's Blog

It is never too late to learn.

Binary Serialization Format

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Summary

  1. Protobuf is best for message serialization. Some companies (e.g., Google) also use it extensively for disk serialization.

  2. FlatBuffers has better CPU performance.

  3. Apache Parquet is the most popular binary serialization format for data frames.

  4. For text serialization format, please refer to Serialization and deserialization in Python .

Protobuf vs FlatBuffers

Flatbuffers are mmap-able and don't have any parsing overhead compared to protos. Large protos not only have CPU overhead but cause a memory usage spike when proto is parsed during the resource loading phase. The memory usage spike can lead to more page faults and increased end user latency. Flatbuffers have none of these disadvantages.

messagepack

Apache Parquet

References

Comments