Exploring different ways of storing a system configuration. Part 1 - How about Google Protocol Buffers?
This is the first post in a series that explores different ways of storing a system configuration. In this post I define the term “system configuration” and describe how we can leverage Google Protocol Buffers (an unlikely candidate) to store such configuration.
Some time ago I had to migrate a software component that was part of a natural language generation pipeline from C++ to Java. That component was configured differently for different languages, and the configuration was stored in Google Protocol Buffers message files. At first I was surprised that a file format I associated with RPC systems was used for storing configuration. I started to wonder whether it would make sense to implement a similar mechanism in an entirely new project, and where it fits among the more common approaches and file formats used for this purpose.
In this post I’ll cover the following topics:
- what Google Protocol Buffers are,
- how you can use them from Java,
- how you can use them for storing a system configuration,
- whether I would use Google Protocol Buffers to store a system configuration in a new project.
All the examples presented in this post are available on my GitHub.
Defining the term “system configuration”
Before I describe Protocol Buffers and how they can be used to store configuration, I’d like to define what I mean by “system configuration”. In the context of this series, the term “system configuration” refers to application parameters that we provide at application startup and don’t expect to change while the process is running.
If our application has only a few of those parameters, they can be passed directly on the command line. With time, as our application grows in complexity and we need to make it more configurable, passing hundreds of parameters directly becomes cumbersome. We may also discover that there are well-known sets of parameters we’d like to pass for certain use cases. Examples include configuring the application for different environments (test, production) or for specific clients. An easy solution to this problem is to write the configuration down in a file with a well-defined syntax, such as JSON or YAML, and pass the path to that file as a single parameter.
I won’t be evaluating how good the different approaches are at storing secrets like database passwords or AWS access keys. For the sake of this discussion I’m assuming we are not keeping any “critical” data in our system configuration.
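As a minimal sketch of that idea, the program below takes the path to a configuration file as its only argument and reads the parameters from it. It uses java.util.Properties purely as a JDK-only stand-in for a JSON or YAML parser; the class name ConfigFromFile and the keys environment and workerThreads are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class ConfigFromFile {

    // Parses key=value pairs. Properties is only a stand-in here; a real
    // application would plug in its JSON or YAML parser instead.
    static Properties parse(Reader reader) {
        Properties config = new Properties();
        try {
            config.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    // Reads the whole configuration from the file passed on the command line.
    static Properties load(Path configFile) {
        try (Reader reader = Files.newBufferedReader(configFile, StandardCharsets.UTF_8)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: ConfigFromFile <path-to-config-file>");
            System.exit(1);
        }
        Properties config = load(Path.of(args[0]));
        // Hypothetical keys, read once at startup and not changed afterwards.
        String environment = config.getProperty("environment", "test");
        int workerThreads = Integer.parseInt(config.getProperty("workerThreads", "4"));
        System.out.println(environment + " " + workerThreads);
    }
}
```

Switching between test and production then only means pointing the single argument at a different file.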
What are Protocol Buffers?
The official Protocol Buffers page has a nice, concise definition of Protocol Buffers:
Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.
Schema
To do anything useful with Protocol Buffers we first need to create a schema file that describes the structure of our data. Based on the schema file Protocol Buffers tooling can:
- automatically generate the code needed to serialize and deserialize the data,
- generate model classes that represent the structure of our data,
- verify that the data we want to deserialize corresponds to the schema.
Using Protocol Buffers in Java
Example schema
The official documentation has an extensive overview of features available in the Protocol Buffers schema language. We won’t cover most of them here, only those needed to define some nested structures with fields, lists, and maps.
I’ll use an example of modeling a person with multiple bank accounts. Remember that it’s just a toy example and shouldn’t be used as-is in a real application.
[schema listing not preserved]
The schema file consists of one or more message declarations, and each message consists of one or more typed fields. Protocol Buffers support multiple field types out of the box, and we can also define our own types.
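A schema for a person with multiple bank accounts could look roughly like the sketch below. Only the Person, AccountDetails, and AccountType names and the com.adebski package appear elsewhere in this post; the proto3 syntax choice, the field names, the field numbers, and the enum values are assumptions made for illustration.

```protobuf
syntax = "proto3";

package com.adebski;

// A user-defined type that can be used as a field type in messages.
enum AccountType {
  ACCOUNT_TYPE_UNSPECIFIED = 0;
  CHECKING = 1;
  SAVINGS = 2;
}

message AccountDetails {
  string iban = 1;
  AccountType type = 2;
  // A map field: free-form labels attached to the account.
  map<string, string> tags = 3;
}

message Person {
  string name = 1;
  int32 age = 2;
  // A list field: one person can have many accounts.
  repeated AccountDetails accounts = 3;
}
```

Each field carries a type, a name, and a unique number; the number, not the name, is what identifies the field in the binary encoding.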
Protoc compiler
Once we have our schema file, usually saved with .protoc
extension, we can generate the code necessary to serialize and deserialize Protocol Buffers messages in our language of choice. The code is generated using protoc
compiler. You can get it by downloading one of the available pre-built binaries or you can compile it straight from the source.
To generate Java source code for the example.protoc schema file we can use the following command:
~/Downloads/protoc/bin/protoc --java_out ~/IdeaProjects/JavaGoogleProtocolbuffersExample/src/main/java/ example.protoc
As a result, you’ll get a single source file that contains all the code necessary to perform basic operations against the model you’ve defined in the schema file.
Example code
You can find a full example of a unit test that constructs an object, serializes it, deserializes it, and asserts the object equality on my GitHub. Below I’ve included only the parts that deal directly with Protocol Buffers.
Creating classes defined in the schema
The code presented below creates an instance of a Person with a single AccountDetails.
[code listing not preserved]
When executed it prints:
[output not preserved]
Serialization and deserialization
Now that we know how to create classes defined in the model, we can take a look at the Protocol Buffers serialization and deserialization API. The example below shows how to use the more compact binary serialization as well as the human-readable TextFormat. I’ll discuss how the second option can be used to store a system configuration later in this post.
[code listing not preserved]
When executed it prints:
[output not preserved]
If we try to parse a Protocol Buffers TextFormat message that contains an unknown field, or a value whose type does not match the type defined in the schema, we’ll get an informative error message. The message includes the line and column number of the problem, which makes such issues easy to debug.
com.google.protobuf.TextFormat$ParseException: 6:16: Enum type "com.adebski.AccountType" has no value named "INVALID_VALUE".
A peek under the hood
Every message that we’ve described in the schema file has a corresponding class with a builder. Classes corresponding to messages are immutable, so to create a new instance we need to use the corresponding builder. A builder can create a new instance from scratch, or start from an existing builder or instance if we want to modify only a subset of fields.
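The shape of that pattern can be illustrated with a hand-written, heavily simplified class. This is not the code that protoc generates (which is far longer); the newBuilder, toBuilder, and build method names mirror the generated API, while the Person fields used here are assumptions chosen to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

class BuilderDemo {
    public static void main(String[] args) {
        Person original = Person.newBuilder().setName("Alice").addAccountId("A-1").build();
        // Derive a modified copy; the original instance is left untouched.
        Person renamed = original.toBuilder().setName("Bob").build();
        System.out.println(original.getName() + " " + renamed.getName());
    }
}

// Immutable value class: no setters, all state fixed in the constructor.
final class Person {
    private final String name;
    private final List<String> accountIds;

    private Person(Builder builder) {
        this.name = builder.name;
        // Unmodifiable copy, so later changes to the builder cannot leak in.
        this.accountIds = List.copyOf(builder.accountIds);
    }

    String getName() { return name; }
    List<String> getAccountIds() { return accountIds; }

    static Builder newBuilder() { return new Builder(); }

    // Starts a builder pre-populated with this instance's values.
    Builder toBuilder() {
        Builder builder = new Builder();
        builder.name = name;
        builder.accountIds.addAll(accountIds);
        return builder;
    }

    static final class Builder {
        private String name = "";
        private final List<String> accountIds = new ArrayList<>();

        Builder setName(String name) { this.name = name; return this; }
        Builder addAccountId(String id) { accountIds.add(id); return this; }
        Person build() { return new Person(this); }
    }
}
```

Because instances never change after build(), they can be shared freely between threads, which suits configuration that is read once at startup.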
Non-primitive fields are internally represented as private volatile java.lang.Object, and type safety is guaranteed by the getters in the class:
[code listing not preserved]
or setters in a corresponding builder:
[code listing not preserved]
Primitive fields, like int, are internally represented as Java primitive types.
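To show why an Object-typed field can still be type safe, here is a simplified, hand-written sketch of the pattern as it appears for string fields: the field holds either the undecoded bytes or the decoded String, and the getter is the only way to read it. The class name LazyStringField is hypothetical, and a plain byte[] stands in for the library's ByteString type so that the example needs nothing beyond the JDK.

```java
import java.nio.charset.StandardCharsets;

class LazyStringFieldDemo {
    public static void main(String[] args) {
        LazyStringField field = new LazyStringField("Alice".getBytes(StandardCharsets.UTF_8));
        System.out.println(field.getName());
    }
}

// Simplified stand-in for a message class with a single string field.
final class LazyStringField {
    // Holds either the raw UTF-8 bytes or, after the first read, the decoded String.
    private volatile Object name;

    LazyStringField(byte[] utf8Bytes) {
        this.name = utf8Bytes.clone();
    }

    // Callers only ever see a String, whatever the field currently holds.
    String getName() {
        Object ref = name;
        if (ref instanceof String) {
            return (String) ref;
        }
        String decoded = new String((byte[]) ref, StandardCharsets.UTF_8);
        name = decoded; // cache the decoded value for later calls
        return decoded;
    }
}
```

The trade-off is a cast on every read in exchange for not decoding strings that are never accessed.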
Binary encoding details can be found in the Protocol Buffers documentation.
Using Protocol Buffers Text Format to store a system configuration
I’ve seen multiple projects that stored a system configuration in JSON or YAML files, but I have only once encountered a project that used Protocol Buffers for that purpose.
Protocol Buffers were designed with speed and storage efficiency in mind, so binary encoding is the standard way of serializing Protocol Buffers structures. Storing configuration in a binary format may save us some disk space, but the files would be hard to edit and we would lose the ability to have meaningful diffs in our version control system. The system I worked on used an alternative, human-readable way of representing Protocol Buffers structures called Text Format.
We’ve already seen an example Protocol Buffer message encoded using the Text Format in the Example Code section:
[Text Format listing not preserved]
Protocol Buffers structures stored in this format are human-readable, and the format looks similar to JSON. Unfortunately, I couldn’t find any syntax reference guide for the format, but it’s easy to figure out the specifics by looking at some examples of serialized messages.
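For illustration, a message in Text Format could look like the sketch below. It assumes a hypothetical Person message with name and age fields and a repeated accounts field whose entries have iban, type, and a map called tags; all of these field names and values are made up for the example.

```text
# Scalar fields are written as name: value pairs.
name: "Jane Doe"
age: 42
# Nested messages use braces; a repeated field simply appears several times.
accounts {
  iban: "XX00 0000 0000 0000"
  # Enum values are written by name, without quotes.
  type: SAVINGS
  # Map entries are written as key/value pairs.
  tags {
    key: "currency"
    value: "EUR"
  }
}
```

Compared with JSON there are no commas between fields and no quotes around field names, and comments are allowed.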
Example
In this section, we’ll go through an example of configuring Java ExecutorServices with details stored in a Protocol Buffers Text Format file.
First we need to define the schema for our configuration:
[schema listing not preserved]
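A schema for this kind of configuration might look like the following sketch. The message and field names, the field numbers, and the proto3 syntax choice are assumptions; the only details taken from this post are that each ExecutorService has a number of threads and a name pattern, and that several of them are grouped together.

```protobuf
syntax = "proto3";

package com.adebski;

// Settings for a single ExecutorService.
message ExecutorServiceConfig {
  // Pattern used to name the threads, e.g. "worker-%d".
  string name_pattern = 1;
  // Number of threads in the pool.
  int32 threads = 2;
}

// Top-level message: the whole configuration file is one instance of it.
message ExecutorServices {
  repeated ExecutorServiceConfig executor_services = 1;
}
```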
Then we execute the protoc command to generate the necessary Java code:
~/Downloads/protoc/bin/protoc --java_out ~/IdeaProjects/JavaGoogleProtocolbuffersExample/src/main/java/ executor-service.protoc
We’ve seen both steps previously; now it’s time to define the actual ExecutorService configuration file that follows the schema defined earlier in this section:
[configuration file listing not preserved]
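A configuration file along these lines could look like the sketch below. It assumes a top-level message with a repeated executor_services field whose entries carry name_pattern and threads; those field names and the concrete values are hypothetical.

```text
# Pool for blocking I/O work.
executor_services {
  name_pattern: "io-worker-%d"
  threads: 4
}
# Pool for CPU-bound work.
executor_services {
  name_pattern: "cpu-worker-%d"
  threads: 8
}
```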
Here we configure two ExecutorServices with a different number of threads and different name patterns.
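Once such a configuration has been parsed, turning it into actual ExecutorServices needs only the JDK. The sketch below assumes Java 16 or newer (for records) and uses a hypothetical ExecutorServiceConfig record in place of the class that protoc would generate; the pattern strings and thread counts are example values.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

class ExecutorServiceFactoryDemo {

    // Hypothetical in-memory view of one parsed configuration entry.
    record ExecutorServiceConfig(String namePattern, int threads) {}

    // Names threads by filling the pattern's %d placeholder with a running counter.
    static ThreadFactory namedThreads(String namePattern) {
        AtomicInteger counter = new AtomicInteger();
        return runnable ->
                new Thread(runnable, String.format(namePattern, counter.getAndIncrement()));
    }

    // Builds a fixed-size pool from a single configuration entry.
    static ExecutorService create(ExecutorServiceConfig config) {
        return Executors.newFixedThreadPool(config.threads(), namedThreads(config.namePattern()));
    }

    public static void main(String[] args) throws Exception {
        List<ExecutorServiceConfig> configs = List.of(
                new ExecutorServiceConfig("io-worker-%d", 4),
                new ExecutorServiceConfig("cpu-worker-%d", 8));
        for (ExecutorServiceConfig config : configs) {
            ExecutorService executor = create(config);
            try {
                // Run one task and report which thread executed it.
                String threadName = executor.submit(() -> Thread.currentThread().getName()).get();
                System.out.println(threadName);
            } finally {
                executor.shutdown();
            }
        }
    }
}
```

Keeping the factory separate from the parsing code means the same method works regardless of which file format the configuration came from.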
The unit test below parses the configuration file and makes sure that the result is equal to a configuration built by hand. To parse the configuration file we use the TextFormat.parse method we’ve already seen in action in the Example Code section.
[code listing not preserved]
When executed the test prints that both configurations are equal:
executorServices.equals(parsedExecutorServices): true
Closing thoughts
To answer the question from the first paragraph, I would not consider using Protocol Buffers to store a system configuration in a new project for the following reasons:
- In my experience, “most” projects already have a JSON, XML, or YAML parser in their dependency tree, but they don’t depend on Protocol Buffers.
- “Most” enterprise application developers are already familiar with JSON, XML, and YAML and their syntaxes. There are also widely available tools for things like validation and pretty-printing, e.g. JSONLint. I did not find similar tools for Protocol Buffers.
I still decided to use Protocol Buffers to configure the natural language generation component I ported to Java because:
- I would have needed to invest time in rewriting (manually or in an automated fashion) the existing Protocol Buffers Text Format files in another format. That would have expanded the scope of the project without bringing significant benefits.
- Porting the configuration to a different format can be done as a separate project if the Protocol Buffers based configuration starts to slow us down significantly or blocks some other feature from being implemented. We rarely modified the configuration (a couple of times per year at most), and no one ever raised an issue about this aspect of the system being problematic. Storing the configuration in Protocol Buffers files was not an ideal solution, but it was also not something that needed to be improved right away.
In the next post in this series I’ll explore alternative, more popular (from my experience) approaches and file formats for storing a system configuration and compare them to Google Protocol Buffers.