Exploring different ways of storing a system configuration. Part 1 - How about Google Protocol Buffers?

This is the first post in a series that explores different ways of storing a system configuration. In this post I define the term “system configuration” and describe how we can leverage Google Protocol Buffers (an unlikely candidate) to store such a configuration.

Some time ago I had to migrate a software component that was part of a natural language generation pipeline from C++ to Java. That component was configured differently for different languages, and the configuration was stored in Google Protocol Buffers message files. At first, I was surprised that a file format I had associated with RPC systems was used for storing configuration. I started to wonder whether it would make sense to implement a similar mechanism in an entirely new project, and how it compares with the more common approaches and file formats used for this purpose.

In this post I’ll cover the following topics:

what Google Protocol Buffers are,
how you can use them from Java,
how you can use them for storing a system configuration,
whether I would use Google Protocol Buffers to store a system configuration in a new project.

All the examples presented in this post are available on my GitHub.

Defining the term “system configuration”

Before I describe Protocol Buffers and how they can be used to store configuration, I’d like to define what I mean by “system configuration”. In the context of this series, the term “system configuration” refers to application parameters that we provide at application startup and do not expect to change while the process is running.

If our application has only a few of those parameters, they can be passed directly on the command line. As the application grows in complexity and we need to make it more configurable, passing hundreds of parameters directly becomes cumbersome. We may also discover that there are well-known sets of parameters we’d like to pass for certain use cases, for example configuring the application for different environments (test, production) or for specific clients. An easy solution to this problem is to write the configuration down in a file with a well-defined syntax, such as JSON or YAML, and pass the path to that file as a single parameter.
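The last idea can be sketched with nothing but the JDK. This is only an illustration under stated assumptions: it reads a key=value properties file instead of JSON or YAML (to avoid third-party parsers), and the class name ConfigFromFile and the environment key are hypothetical, not part of the examples on my GitHub.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical example class, used only to illustrate the idea.
class ConfigFromFile {

    // Loads key=value pairs from the file at the given path.
    static Properties load(Path path) {
        Properties properties = new Properties();
        try (Reader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            properties.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read configuration from " + path, e);
        }
        return properties;
    }

    public static void main(String[] args) {
        // The whole configuration is passed as a single parameter: the path to the file.
        if (args.length != 1) {
            System.err.println("Usage: java ConfigFromFile <path-to-configuration-file>");
            System.exit(1);
        }
        Properties configuration = load(Paths.get(args[0]));
        // "environment" is an assumed key; "test" is the fallback when it is missing.
        System.out.println("environment = " + configuration.getProperty("environment", "test"));
    }
}
```

Switching between the test and production setups then comes down to pointing the application at a different file.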

I won’t be evaluating how well the different approaches handle secrets such as database passwords or AWS access keys. For the sake of this discussion, I’m assuming we are not keeping any “critical” data in our system configuration.

What are Protocol Buffers?

The official Protocol Buffers page has a nice, concise definition of Protocol Buffers:

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

Schema

To do anything useful with Protocol Buffers we first need to create a schema file that describes the structure of our data. Based on the schema file Protocol Buffers tooling can:

automatically generate the code needed to serialize and deserialize the data,
generate model classes that represent the structure of our data,
verify that the data we want to deserialize corresponds to the schema.

Using Protocol Buffers in Java

Example schema

The official documentation has an extensive overview of features available in the Protocol Buffers schema language. We won’t cover most of them here, only those needed to define some nested structures with fields, lists, and maps.

I’ll use an example of modeling a person with multiple bank accounts. Remember that it’s just a toy example and shouldn’t be used as-is in a real application.

// Available syntaxes are "proto2" and "proto3".
syntax = "proto3";
// Generated Java classes will be put inside the "com.adebski" package.
package com.adebski;

// message keyword defines a data type that we want to model, inside we define the fields present in a Person data type.
//
// Protocol Buffers require that each field has type, name, and a numeric identifier used during serialization and
// deserialization. In the compact message format the serialized message does not contain field names, only identifiers
// in order to reduce the message size.
message Person {
  string name = 1;
  string surname = 2;
  string email = 3;
  // repeated keyword is used to declare a field holding a sequence of items. We can declare sequences of primitive
  // type or our custom type.
  repeated AccountDetails accounts = 4;
}

enum AccountType {
  PERSONAL = 0;
  PROFESSIONAL = 1;
}

message AccountDetails {
  string id = 1;
  AccountType accountType = 2;
  map<string, string> metadata = 3;
}

The schema file consists of one or more message declarations, and each message consists of one or more typed fields. Protocol Buffers support multiple field types out of the box, and we can also define our own types.

Protoc compiler

Once we have our schema file (the examples in this post use a .protoc extension, although .proto is the more common convention), we can generate the code necessary to serialize and deserialize Protocol Buffers messages in our language of choice. The code is generated using the protoc compiler. You can get it by downloading one of the available pre-built binaries, or you can compile it from source.

To generate Java source code for the example.protoc schema file we can use the following command:

~/Downloads/protoc/bin/protoc --java_out ~/IdeaProjects/JavaGoogleProtocolbuffersExample/src/main/java/ example.protoc

As a result, you’ll get a single source file that contains all the code necessary to perform basic operations against the model you’ve defined in the .protoc file.

Example code

You can find a full example of a unit test that constructs an object, serializes it, deserializes it, and asserts the object equality on my GitHub. Below I’ve included only the parts that deal directly with Protocol Buffers.

Creating classes defined in the schema

The code presented below creates an instance of a Person with a single AccountDetails.

// Using generated builders to create instances of classes defined in the .protoc file
ExampleProtoc.AccountDetails accountDetails =
        ExampleProtoc.AccountDetails.newBuilder()
        .setId("someId")
        .setAccountType(ExampleProtoc.AccountType.PROFESSIONAL)
        .build();
ExampleProtoc.Person person =
            ExampleProtoc.Person.newBuilder()
            .setName("someName")
            .setSurname("someSurname")
            .setEmail("some_email@foo.bar")
            .addAccounts(accountDetails)
            .build();

// Taking advantage of the auto-generated toString() methods
System.out.println("AccountDetails toString() result:");
System.out.println(accountDetails);

System.out.println("Person toString() result:");
System.out.println(person);

When executed it prints:

AccountDetails toString() result:
id: "someId"
accountType: PROFESSIONAL

Person toString() result:
name: "someName"
surname: "someSurname"
email: "some_email@foo.bar"
accounts {
  id: "someId"
  accountType: PROFESSIONAL
}

Serialization and deserialization

Now that we know how to create instances of the classes defined in the schema, we can take a look at the Protocol Buffers serialization and deserialization API. The example below shows how to use both the more compact binary serialization and the human-readable TextFormat. Later in this post I’ll discuss how the second option can be used to store a system configuration.

// Serializing using binary/compact encoding
ByteArrayOutputStream personAsBytesStream =
        new ByteArrayOutputStream();
person.writeTo(personAsBytesStream);
byte[] personAsBytes = personAsBytesStream.toByteArray();

System.out.println(
        String.format(
                "Serialized Person object using binary encoder takes %d bytes in memory:",
                personAsBytes.length));
for(byte b: personAsBytes) {
    System.out.print(String.format("%02x", b));
}
System.out.println();
System.out.println("original.equals(deserialized): " + ExampleProtoc.Person.parseFrom(personAsBytes).equals(person));

// Serializing using human-readable TextFormat encoding
System.out.println();
String personAsTextFormat = TextFormat.printer().printToString(person);
System.out.println(
        String.format(
                "Serialized Person object using TextFormat encoder takes %d bytes in memory",
                personAsTextFormat.length()));
System.out.println(personAsTextFormat);
System.out.println("original.equals(deserialized): " + TextFormat.parse(personAsTextFormat, ExampleProtoc.Person.class).equals(person));

When executed it prints:

Serialized Person object using binary encoder takes 55 bytes in memory:
0a08736f6d654e616d65120b736f6d655375726e616d651a12736f6d655f656d61696c40666f6f2e626172220a0a06736f6d6549641001
original.equals(deserialized): true

Serialized Person object using TextFormat encoder takes 124 bytes in memory
name: "someName"
surname: "someSurname"
email: "some_email@foo.bar"
accounts {
  id: "someId"
  accountType: PROFESSIONAL
}

original.equals(deserialized): true

If we try to parse a Protocol Buffers TextFormat message that contains an unknown field, or a value whose type does not match the type defined in the schema, we get an informative error message. The message includes the line and column number of the problem, which makes such issues easy to debug.

com.google.protobuf.TextFormat$ParseException: 6:16: Enum type "com.adebski.AccountType" has no value named "INVALID_VALUE".

A peek under the hood

Every message that we’ve described in the .protoc file has a corresponding class with a builder. Classes corresponding to messages are immutable, so to create a new instance we need to use the corresponding builder. A builder can create a new instance from scratch, or it can start from an existing builder or instance if we only want to modify a subset of fields.

String fields are internally represented as a private volatile java.lang.Object that holds either a java.lang.String or a ByteString, and type safety is guaranteed by the getters in the class:

public static final int ID_FIELD_NUMBER = 1;
private volatile java.lang.Object id_;
/**
 * <code>string id = 1;</code>
 * @return The id.
 */
@java.lang.Override
public java.lang.String getId() {
  java.lang.Object ref = id_;
  if (ref instanceof java.lang.String) {
    return (java.lang.String) ref;
  } else {
    com.google.protobuf.ByteString bs = 
        (com.google.protobuf.ByteString) ref;
    java.lang.String s = bs.toStringUtf8();
    id_ = s;
    return s;
  }
}

or setters in a corresponding builder:

public Builder setId(java.lang.String value) {
    if (value == null) {
        throw new NullPointerException();
    }

    id_ = value;
    onChanged();
    return this;
}

Primitive fields, like int, are internally represented as Java primitive types.

Binary encoding details can be found in the Protocol Buffers documentation.
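To make the schema comment about numeric identifiers more concrete, here is a rough sketch that walks over the 55 bytes printed in the serialization example and lists the field identifiers it finds. It is only an illustration: the WireFormatPeek class is a hypothetical helper that handles just the varint and length-delimited wire types, and real code should rely on the generated parsers instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper written only to illustrate the wire format.
class WireFormatPeek {

    // Lists the top-level fields found in an encoded message.
    // Only the varint (0) and length-delimited (2) wire types are handled.
    static List<String> describe(byte[] data) {
        List<String> fields = new ArrayList<>();
        int[] position = {0};
        while (position[0] < data.length) {
            long key = readVarint(data, position);
            long fieldNumber = key >>> 3;      // upper bits: numeric identifier from the schema
            int wireType = (int) (key & 0x7);  // lower three bits: how the value is encoded
            if (wireType == 0) {
                fields.add("field " + fieldNumber + ": varint " + readVarint(data, position));
            } else if (wireType == 2) {
                int length = (int) readVarint(data, position);
                fields.add("field " + fieldNumber + ": length-delimited, " + length + " bytes");
                position[0] += length; // skip the payload itself
            } else {
                throw new IllegalArgumentException("Unsupported wire type " + wireType);
            }
        }
        return fields;
    }

    // Reads a base-128 varint: seven payload bits per byte, least significant group first.
    private static long readVarint(byte[] data, int[] position) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = data[position[0]++];
            result |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("Malformed varint");
    }

    // Converts a hex dump such as "0a08" into bytes.
    static byte[] fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return bytes;
    }

    public static void main(String[] args) {
        // The serialized Person printed by the serialization example.
        String person = "0a08736f6d654e616d65120b736f6d655375726e616d65"
                + "1a12736f6d655f656d61696c40666f6f2e626172220a0a06736f6d6549641001";
        describe(fromHex(person)).forEach(System.out::println);
        // The 10-byte payload of field 4 is itself an encoded AccountDetails message.
        describe(fromHex("0a06736f6d6549641001")).forEach(System.out::println);
    }
}
```

For the Person bytes this lists fields 1 to 4 as length-delimited values of 8, 11, 18 and 10 bytes. For the nested payload it lists field 1 (6 bytes) and field 2 with the varint value 1, which is the number assigned to PROFESSIONAL in the schema. None of the field names appear anywhere in the bytes.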

Using Protocol Buffers Text Format to store a system configuration

I’ve seen multiple projects that stored a system configuration in JSON or YAML files, but I’ve only once encountered a project that used Protocol Buffers for that purpose.

Protocol Buffers were designed with speed and storage efficiency in mind so binary encoding is the standard way of serializing Protocol Buffers structures. Storing configuration in a binary format may save us some disk space but files would be hard to edit and we would lose any ability to have meaningful diffs in our version control system. The system I’ve worked on used an alternative, human-readable way of representing Protocol Buffers structures called Text Format.

We’ve already seen an example Protocol Buffer message encoded using the Text Format in the Example Code section:

name: "someName"
surname: "someSurname"
email: "some_email@foo.bar"
accounts {
  id: "someId"
  accountType: PROFESSIONAL
}

Protocol Buffers structures stored in this format are human-readable, and the format itself looks similar to JSON. Unfortunately, I couldn’t find a syntax reference guide for the format, but it’s easy to figure out the specifics by looking at some examples of serialized messages.

Example

In this section, we’ll go through the example of configuring Java ExecutorServices with details stored in a Protocol Buffer Text Format file.

First we need to define the schema for our configuration:

syntax = "proto3";
package com.adebski;


message FixedExecutorServiceConfiguration {
  string namePattern = 1;
  int32 numberOfThreads = 2;
  bool daemon = 3;
}

message ExecutorServices {
  repeated FixedExecutorServiceConfiguration executorServiceConfigurations = 1;
}

Then we execute the protoc command to generate the necessary Java code:

~/Downloads/protoc/bin/protoc --java_out ~/IdeaProjects/JavaGoogleProtocolbuffersExample/src/main/java/ executor-service.protoc

We’ve seen both steps previously. Now it’s time to define the actual ExecutorService configuration file that follows the schema defined earlier in this section:

executorServiceConfigurations {
    namePattern: "first-test-pool-%d"
    numberOfThreads: 6
    daemon: true
}
executorServiceConfigurations {
    namePattern: "second-test-pool-%d"
    numberOfThreads: 3
    daemon: true
}

Here we configure two ExecutorServices with a different number of threads and different name patterns.

The unit test below parses the configuration file and makes sure that it’s equal to one defined by hand. To parse the configuration file we use the TextFormat.parse method we’ve already seen in action in the Example Code section.

package com.adebski;

import com.google.protobuf.TextFormat;
import org.junit.Test;

import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExecutorServiceConfigurationTest {

    @Test
    public void createsExecutorServices() throws URISyntaxException, IOException {
        ExecutorServiceProtoc.ExecutorServices executorServices = ExecutorServiceProtoc.ExecutorServices.newBuilder()
                .addExecutorServiceConfigurations(
                        ExecutorServiceProtoc.FixedExecutorServiceConfiguration.newBuilder()
                                .setDaemon(true)
                                .setNamePattern("first-test-pool-%d")
                                .setNumberOfThreads(6)
                                .build())
                .addExecutorServiceConfigurations(ExecutorServiceProtoc.FixedExecutorServiceConfiguration.newBuilder()
                        .setDaemon(true)
                        .setNamePattern("second-test-pool-%d")
                        .setNumberOfThreads(3)
                        .build())
                .build();

        String exampleExecutorServiceConfiguration =
                Files.readString(Paths.get("./example-executor-service-configuration.proto"));
        ExecutorServiceProtoc.ExecutorServices parsedExecutorServices =
                TextFormat.parse(exampleExecutorServiceConfiguration, ExecutorServiceProtoc.ExecutorServices.class);

        System.out.println("executorServices.equals(parsedExecutorServices): " + executorServices.equals(parsedExecutorServices));
    }
}

When executed the test prints that both configurations are equal:

executorServices.equals(parsedExecutorServices): true
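The test only checks that the file parses into the expected structure. As a rough sketch of the remaining step, the code below turns the three values of a single configuration entry into a fixed-size ExecutorService. The ExecutorServiceFactory class is hypothetical, and treating namePattern as a String.format pattern that receives a thread counter is my assumption about how the field is meant to be used. In real code the arguments would come from the getters of the generated FixedExecutorServiceConfiguration class.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical factory; the parameters mirror the fields of FixedExecutorServiceConfiguration.
class ExecutorServiceFactory {

    static ExecutorService create(String namePattern, int numberOfThreads, boolean daemon) {
        AtomicInteger threadCounter = new AtomicInteger();
        // Every new worker thread gets the next counter value substituted for %d.
        ThreadFactory threadFactory = runnable -> {
            Thread thread = new Thread(
                    runnable, String.format(namePattern, threadCounter.getAndIncrement()));
            thread.setDaemon(daemon);
            return thread;
        };
        return Executors.newFixedThreadPool(numberOfThreads, threadFactory);
    }

    public static void main(String[] args) throws Exception {
        // Values copied from the first entry of the example configuration file.
        ExecutorService pool = create("first-test-pool-%d", 6, true);
        Callable<String> task = () -> Thread.currentThread().getName();
        System.out.println(pool.submit(task).get()); // prints "first-test-pool-0"
        pool.shutdown();
    }
}
```

Creating all configured pools would then be a loop over the repeated executorServiceConfigurations field.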

Closing thoughts

To answer the question from the first paragraph, I would not consider using Protocol Buffers to store a system configuration in a new project for the following reasons:

In my experience, “most” projects already have a JSON/XML/YAML parser or library in their dependency tree, but they don’t depend on Protocol Buffers.
“Most” enterprise application developers are already familiar with JSON/XML/YAML and their syntaxes. There are also widely available tools for tasks like validation and pretty-printing, e.g. JSONLint; I did not find similar tools for Protocol Buffers.

I decided to still use Protocol Buffers to configure the natural language generation component I ported to Java because:

I would need to invest time to rewrite (manually or in an automated fashion) the existing Protocol Buffers Text Format files in another format. It would expand the scope of the project without bringing significant benefits.
Porting the configuration to a different format can be done as a separate project if the Protocol Buffers based configuration starts slowing us down significantly or blocks some other feature from being implemented. We rarely modified the configuration (a couple of times per year at most), and no one ever raised an issue about this aspect of the system. Storing the configuration in Protocol Buffers files was not an ideal solution, but it was also not something that needed to be improved right away.

In the next post in this series I’ll explore alternative approaches and file formats for storing a system configuration that are, in my experience, more popular, and compare them to Google Protocol Buffers.

2020-12-22