Benchmark: PHP7 Serialization

Computers are pedantic creatures. While a human would be content to simply look at a number, like 5, and call it a “number”, a computer needs more specificity. What kind of number is it? Is it an “integer” (a whole number), a “float” (a number with some degree of decimal precision, e.g. 5.0 or 5.00000), a “string” (the keyboard letter "5"), or some language-specific subset of these? In database contexts, for example, where storage efficiency matters, you end up with designations like “tinyint”, “smallint”, “int”, and “bigint” — all of which are integers — which place corresponding limits on the physical byte-size of the values being stored.

In many programming languages, there are also complex and custom types with structures defined entirely within the code. The most obvious example of this is the computer simulation in which you are currently “living” (spoiler alert). In that simulation, you, your neighbor, and your sister are each instances of a “person” object, a data type that includes all sorts of innate properties like hair color, spitefulness, taste in music, etc., as well as methods encapsulating various chained actions like do_cartwheel() and drink_whiskey().

For a program to run correctly, the code needs to properly declare its types. At runtime, this is fine; the code is there and the data is there, floats are floats, cats are cats.

But when information has to be transferred beyond the scope of the running application — whether to a local disk for storage or over the river and through the wires to Grandmother’s house — sending the value on its own may not be sufficient. If the receiving application doesn’t know what to make of a lonely 5, it will fail.

Enter Serialization

Serialization is the solution. It provides a way to represent arbitrary data in a format that can be stored or transferred, while maintaining sufficient contextual information so that, at some later date, an equal and opposite deserialization action can accurately restore the data to its original form.

So which serialization method is the best? Lonely developers have been asking each other that question for decades, but there is no universal answer. As with so much else, the devil is in the details. What platform(s) are being used? What kind of data, and how much of it, is being handled? Where are the bottlenecks?

Failure to appreciate this nuance has sparked countless wars, leaving a wake of scorched forums and wounded bios across the internet. But for PHP users, the violence seems to have largely subsided with the release of PHP 7.

Why?

The Players

Within PHP, there is an ancient pair of functions called serialize() and unserialize(). They are sturdy, reliable, and built into the core. Ubiquity makes these a default solution for many applications, including big players like WordPress.
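A quick refresher on how the round trip looks (the array here is just a stand-in for real application data):

// Serialize the value along with its type information.
$packed = serialize(['answer' => 42, 'name' => 'Bob']);
// $packed is now a plain string: a:2:{s:6:"answer";i:42;s:4:"name";s:3:"Bob";}

// Reverse the process; types and structure come back intact.
$restored = unserialize($packed);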

But there are some alternatives. A common drop-in replacement called Igbinary works pretty much the same way as the native serializer, but stores its information in binary instead of a “string”, potentially increasing performance while decreasing the footprint. PHP has also begun to embrace platform-agnostic standards like XML, JSON, and MessagePack, which have become widespread in recent years as applications have become more interconnected.
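Assuming the igbinary and msgpack PECL extensions are installed (JSON has shipped with core since PHP 5.2), the alternative function pairs look much the same:

// Igbinary: same idea as serialize(), but binary output.
$bin = igbinary_serialize($data);
$data = igbinary_unserialize($bin);

// JSON: portable text.
$json = json_encode($data);
$data = json_decode($json, true); // true = decode to associative arrays

// MessagePack: compact, binary, cross-platform.
$packed = msgpack_pack($data);
$data = msgpack_unpack($packed);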

According to conventional wisdom, a lightweight solution like JSON will have the best performance, but may not have the ability to deal with large or complex datasets. For applications with other needs, a more robust solution is required.

The following benchmarks cover the four serialization technologies I use most often: the native serializer, JSON, Igbinary, and MessagePack.

Results

Baseline

To start with, a grab bag of hundreds of simple values — a mixture of “boolean”, “float”, “integer”, “null”, and “string” types — was thrown at each of the serializers.

Debian Jessie, PHP-FPM 7.0.17, Encoding Simple/Mixed Data
Debian Jessie, PHP-FPM 7.0.17, Decoding Simple/Mixed Data

At this level, the outbound act of serialization is a bit like watching paint dry. The reverse operation, at least, is a little sexier, with Igbinary finishing 0.5 – 1.6x faster than the competition.

Lots of Data

A given code optimization will sometimes apply equally across the board, but more typically its effects will be limited to a defined range. How well a function performs at various extremes can say a lot about the use cases its developers envisioned.

To see how the serializers would cope under a high-calorie diet, each was fed a hefty, multi-dimensional “array” representation of the blob-mimes database.

Debian Jessie, PHP-FPM 7.0.17, Encoding Large Array Data
Debian Jessie, PHP-FPM 7.0.17, Decoding Large Array Data

Under a heavier load, we have something more akin to watching grass grow; aside from the occasional weed that sprouts like a basketball player, each blade is more or less uniform in height.

Things were more variable on the storage side of the equation (the physical size of the data in serialized form):

  • 67,053 bytes Igbinary
  • 79,256 bytes MessagePack
  • 94,181 bytes JSON
  • 134,746 bytes PHP Serialize
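For what it’s worth, these figures are just the byte lengths of the serialized output; a comparison like this can be sketched in a few lines (again assuming the extensions above, with $data standing in for the test array):

// strlen() counts bytes, which is what matters for storage.
$sizes = [
    'igbinary' => strlen(igbinary_serialize($data)),
    'msgpack'  => strlen(msgpack_pack($data)),
    'json'     => strlen(json_encode($data)),
    'php'      => strlen(serialize($data)),
];
asort($sizes); // smallest footprint first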

Complex Data

Last but not least, we venture into the wilderness of complexity. I wasn’t able to examine the simulated reality we call Life, so I went with the next best thing: the Blobfolio API “user” object, which contains some basic data like name, contact information, and a few notification settings. As mentioned earlier, this is the station where JSON gets off, so the results are limited to the remaining passengers.
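To make the limitation concrete, here is a toy class (not the actual API object): json_encode() only sees public properties and json_decode() hands back a generic stdClass, while the remaining serializers record the class name and private state so the original object can be rebuilt.

class User {
    private $name = 'Alice';     // invisible to json_encode()
    public $notifications = true;
}

$user = new User();

json_decode(json_encode($user)); // stdClass with only $notifications
unserialize(serialize($user));   // a full User instance, $name included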

Debian Jessie, PHP-FPM 7.0.17, Encoding Object Data
Debian Jessie, PHP-FPM 7.0.17, Decoding Object Data

The processing times, finally, showed a little entropy! However, in terms of storage, there wasn’t much variation:

  • 1042 bytes MessagePack
  • 1075 bytes Igbinary
  • 1380 bytes PHP Serialize

Analysis

The TL;DR answer to the question, “Where did all the riots go?” seems to be that in PHP7, there really isn’t much practical difference between the serialization methods. The server used for testing was a bit too powerful, so it took some doing to build up the workloads to the point where results were even hitting graphable fractions of a microsecond.

Still, there was one surprise: PHP’s native serializer is actually quite competitive. In fact, it was the fastest encoder in both the large array and object series, even beating out the bookie favorite JSON, which everyone had always assumed was so good at this sort of thing. For a standalone PHP application storing data for its own purposes, this is a totally acceptable default choice.

In terms of performance, the only option that would consistently do better is Igbinary. Its documentation says it is designed for environments that “serialize rarely [and] unserialize often”, and that is exactly where it excelled; Igbinary’s deserialization method was the clear winner in every single test run. Its serialization times weren’t always competitive, but this was in large part due to running the extension with its default configuration value compact_strings=on. With this value disabled, Igbinary’s serialization operations are on par with PHP Serialize.
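If that trade-off makes sense for a given workload, the directive can be flipped in php.ini (assuming the igbinary extension is loaded):

; trade a slightly larger payload for faster serialization
igbinary.compact_strings=Off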

Speaking of Igbinary, it can also be used as the PHP session serializer. As session data tends to be exactly of the write-little/read-lots variety, it is an ideal use case. To enable it, just set the following in php.ini:

; session serialization handler
session.serialize_handler=igbinary

JSON and MessagePack both fell somewhere in the middle.

By and large, this shouldn’t discourage the use of JSON. What it lacks in performance, it more than makes up for in portability. For interconnected applications like APIs, it is still the best choice. It is fast enough and small enough to get the job done, and the technology is so widely implemented that there is little fear a connecting party will have trouble working with it.

MessagePack, in aiming to be a viable JSON replacement, still has a little ground to cover. It was able to compete in the complex data type challenge, a point in its favor, and its data footprint was consistently smaller than JSON’s. But its performance wasn’t noticeably faster, and its community adoption is still comparatively limited. Unless your girlfriend is a core contributor, you can probably skip this one for now.

Methodology

Identical data was used for each serialization test. The start and end times were measured on either end of the execution of the serialization function. For deserialization, the input values were precalculated by storing the results of the original serialization.
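In rough pseudo-form, each measurement looked something like the following (variable names are illustrative; the real harness also handled the surrounding plumbing):

// Encoding: wall-clock time around the serialization call only.
$start = microtime(true);
$encoded = serialize($data);
$elapsed = microtime(true) - $start;

// Decoding: $encoded was precalculated and stored ahead of time.
$start = microtime(true);
$decoded = unserialize($encoded);
$elapsed = microtime(true) - $start;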

Each test was run a total of 40 times. The first 20 results from each batch were discarded. The remaining 20 were used for comparison.

A dedicated PHP-FPM socket was used to handle all requests. The PHP process was restarted before each batch.