Encoding and Evolution
Applications change over time, requiring adaptations to features and data storage.
Relational databases use schema migrations to handle data changes, while schema-on-read databases allow for mixed data formats without enforcing a schema.
Application code changes are often needed when data formats or schemas change, but these changes can’t always happen instantaneously. This is especially true for large applications: server-side applications are updated via rolling upgrades, while client-side applications depend on users installing updates.
Maintaining compatibility in both directions is crucial for systems with old and new code and data formats. Backward compatibility allows newer code to read older data, while forward compatibility allows older code to read newer data.
Backward compatibility is straightforward, but forward compatibility is challenging because older code must ignore new additions. A specific challenge arises when older code updates records with new fields, potentially losing data if the fields are not preserved.
This chapter explores data encoding formats like JSON, XML, Protocol Buffers, and Avro, focusing on schema changes and coexistence of old and new data. It also discusses their applications in databases, web services, REST APIs, RPCs, workflow engines, and event-driven systems.
Formats for Encoding Data
Programs use two data representations: in-memory structures optimized for CPU access and byte sequences for file storage or network transmission. Encoding translates in-memory data to byte sequences, while decoding reverses the process.
Terminology clash
The term “serialization” is avoided in this book due to its different meaning in the context of transactions.
Most systems require conversion between in-memory objects and flat byte sequences, leading to a variety of libraries and encoding formats.
Language-Specific Formats
Built-in encoding libraries, while convenient, are problematic due to language dependency, security risks, versioning challenges, and inefficiency. These issues make them unsuitable for anything beyond transient purposes.
JSON, XML, and Binary Variants
JSON, XML, and CSV are popular standardized data encoding formats, but each has limitations. XML is verbose and lacks number encoding clarity, JSON doesn’t distinguish between integer and floating-point numbers, and CSV lacks schema support and struggles with nested data. Despite these flaws, these formats remain widely used, especially for data interchange between organizations.
JSON Schema
JSON Schema is widely used for modeling data exchanged between systems or stored, appearing in web services, schema registries, and databases.
JSON Schema includes standard primitive types and a validation specification for overlaying constraints on fields.
JSON Schemas can have an open or a closed content model. The open model is the default (additionalProperties defaults to true) and allows fields that are not defined in the schema; setting additionalProperties to false closes the model so that unknown fields are rejected.
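The difference between the two content models can be sketched in a few lines (this is an illustrative check, not a full JSON Schema validator; the schema and field names are made up for the example):

```python
# Sketch of the open-vs-closed distinction (not a full JSON Schema validator).
# The schema and field names here are illustrative.

closed_schema = {
    "type": "object",
    "properties": {"userName": {"type": "string"}},
    "additionalProperties": False,   # closed model: unknown fields are rejected
}

def extra_fields(schema, document):
    """Fields in the document that a closed content model would reject."""
    if schema.get("additionalProperties", True):   # open model is the default
        return set()
    return set(document) - set(schema.get("properties", {}))

doc = {"userName": "Martin", "favoriteNumber": 1337}
print(extra_fields(closed_schema, doc))   # {'favoriteNumber'}
```

With an open schema (no additionalProperties, or set to true), the same document passes without complaint.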
JSON Schema, while powerful, can be challenging due to features like conditional logic and remote schema references.
Binary encodings
Binary encodings for JSON and XML, such as MessagePack and WBXML, offer more compact and faster parsing than their textual counterparts. However, these binary formats are not as widely adopted.
Some formats extend JSON/XML datatypes without changing the data model, requiring all object field names to be included in encoded data.
{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

Example of MessagePack binary encoding for this JSON is shown in Figure 5-2.
- The first byte, 0x83, indicates an object with three fields.
- The second byte, 0xa8, indicates a string that is eight bytes long.
- The next eight bytes are the ASCII field name userName.
- The next seven bytes encode the six-letter string Martin with the prefix 0xa6.
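To make the byte layout concrete, the encoding described above can be reproduced by hand (a sketch for illustration; real applications would use a MessagePack library, and the fixstr header used here is only valid for short strings):

```python
import json

def fixstr(s):
    """MessagePack fixstr: one header byte 0xa0 | length, then the bytes
    (valid only for strings up to 31 bytes long)."""
    b = s.encode("ascii")
    return bytes([0xA0 | len(b)]) + b

encoded = (
    bytes([0x83])                 # 0x83: object (fixmap) with three fields
    + fixstr("userName")          # 0xa8 header + 8 bytes of ASCII
    + fixstr("Martin")            # 0xa6 header + 6 bytes
    + fixstr("favoriteNumber")
    + bytes([0xCD, 0x05, 0x39])   # 0xcd: uint16, big-endian 1337
    + fixstr("interests")
    + bytes([0x92])               # 0x92: array (fixarray) with two elements
    + fixstr("daydreaming")
    + fixstr("hacking")
)

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}
compact_json = json.dumps(record, separators=(",", ":")).encode()
print(len(encoded), len(compact_json))   # 66 vs 81 bytes: only slightly smaller
```

The comparison at the end illustrates the point made below: the binary encoding saves only a modest number of bytes over compact textual JSON.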
Binary JSON encoding is only slightly more compact than textual JSON, raising questions about its value. However, more efficient encoding methods are explored later.
Protocol Buffers
Protocol Buffers (protobuf) is a binary encoding library developed at Google, similar to Apache Thrift.
Protocol Buffers requires a schema for data encoding, defined in its interface definition language (IDL). The schema specifies record fields and types, and a code generation tool produces classes for encoding and decoding data.
Encoding Example 5-2 with Protocol Buffers requires 33 bytes, including type annotations and length indications for fields.
This example uses field tags (numbers) instead of field names, unlike Figure 5-2.
Protocol Buffers saves space by packing field type and tag number into a single byte and using variable-length integers.
Protocol Buffers uses the repeated modifier to indicate a list of values in a field, represented in the binary encoding by repeating the field tag for each element.
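The packing described above can be sketched as follows (the field number and value are illustrative, but the key-byte layout and varint scheme follow the Protocol Buffers wire format):

```python
def varint(n: int) -> bytes:
    """Protobuf variable-length integer: 7 bits per byte,
    with the high bit set on all but the last byte."""
    out = bytearray()
    while True:
        if n < 0x80:
            out.append(n)
            return bytes(out)
        out.append((n & 0x7F) | 0x80)
        n >>= 7

def key(field_tag: int, wire_type: int) -> bytes:
    """Pack the tag number and wire type into a single varint key."""
    return varint((field_tag << 3) | wire_type)

# Hypothetical field 1 = user_name, wire type 2 = length-delimited (string)
name = b"Martin"
field = key(1, 2) + varint(len(name)) + name
print(field.hex())        # 0a064d617274696e
print(varint(1337).hex()) # b90a -- two bytes instead of a fixed-width four
```

For field tags below 16 and small integer values, both the key and the value fit in a single byte each, which is where much of the space saving comes from.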
Field tags and schema evolution
Protocol Buffers handles schema changes while maintaining backward and forward compatibility.
Encoded records are concatenations of encoded fields, identified by tag numbers and annotated with datatypes. Field tags are crucial for encoded data meaning, while field names can be changed without affecting encoded data.
New fields can be added to a schema with unique tag numbers, ensuring forward compatibility. Old code can ignore unrecognized fields, while datatype annotations help preserve unknown fields.
Backward compatibility is maintained by using unique tag numbers for each field, allowing new code to read old data.
Removing a field is similar to adding one, but with reversed compatibility concerns. Old tag numbers cannot be reused due to potential data conflicts.
Changing a field’s datatype is sometimes possible, but risks truncation: widening a 32-bit integer to 64 bits, for example, lets new code read old data, but old code reading a new 64-bit value will truncate it to 32 bits.
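The forward-compatibility rule above, where old code skips fields with unrecognized tags, can be sketched with a simplified decoder (illustrative only; it handles just the varint and length-delimited wire types):

```python
def read_varint(buf, i):
    """Decode one variable-length integer starting at index i."""
    n = shift = 0
    while True:
        b = buf[i]; i += 1
        n |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return n, i

def decode_known(buf, known_tags):
    """Decode a protobuf-style message, keeping known fields and skipping
    unknown ones; the wire type says how many bytes to skip (simplified:
    only wire types 0 (varint) and 2 (length-delimited) are handled)."""
    fields, i = {}, 0
    while i < len(buf):
        k, i = read_varint(buf, i)
        tag, wire_type = k >> 3, k & 0x7
        if wire_type == 0:                 # varint value
            value, i = read_varint(buf, i)
        else:                              # length-delimited value
            length, i = read_varint(buf, i)
            value = buf[i:i + length]
            i += length
        if tag in known_tags:              # unknown tag: skip, don't fail
            fields[tag] = value
    return fields

# tag 1 = a string field old code knows; tag 4 = a new varint field it does not
msg = b"\x0a\x06Martin" + b"\x20\x01"
print(decode_known(msg, known_tags={1}))   # {1: b'Martin'}
```

Because the wire type is encoded alongside the tag, the decoder knows how to step over a field it has never heard of, which is exactly what makes new-field additions forward compatible.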
Avro
Apache Avro, a binary encoding format, was created in 2009 as a Hadoop subproject due to Protocol Buffers’ limitations.
Avro uses schemas to specify data structure, offering two languages: Avro IDL for human editing and a JSON-based language for machine readability.
Avro binary encoding, using a schema without tag numbers, results in a compact 32-byte encoded record.
Byte sequence lacks field and datatype identifiers, using concatenated values. Strings are length-prefixed UTF-8 bytes, and integers use variable-length encoding.
Binary data parsing requires matching schemas for accurate decoding.
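The string and integer encodings just described can be sketched as follows (a sketch following Avro’s zigzag varint scheme, not the real avro library):

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small:
    0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)

def avro_long(n: int) -> bytes:
    """Avro's variable-length zigzag integer encoding."""
    z = zigzag(n)
    out = bytearray()
    while True:
        if z < 0x80:
            out.append(z)
            return bytes(out)
        out.append((z & 0x7F) | 0x80)
        z >>= 7

def avro_string(s: str) -> bytes:
    b = s.encode("utf-8")
    return avro_long(len(b)) + b   # length prefix, then the UTF-8 bytes

# Values are simply concatenated in schema order: no field names, no tags.
print(avro_string("Martin").hex())  # 0c4d617274696e (0x0c = zigzag varint for 6)
print(avro_long(1337).hex())        # f214
```

Note that nothing in these bytes says which field they belong to; that information lives entirely in the schema, which is why decoding requires a matching schema.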
The writer’s schema and the reader’s schema
Applications use the writer’s schema, the version they know, to encode data for storage or transmission.
Applications use writer and reader schemas for data decoding. Avro resolves schema differences by translating data from the writer’s schema to the reader’s schema.
Avro schema resolution matches fields by name, ignoring extra fields in the writer’s schema and filling missing fields with defaults from the reader’s schema.
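Those resolution rules can be sketched with schemas simplified to plain Python structures (illustrative only, not real Avro schema objects):

```python
# Sketch of Avro-style schema resolution: fields are matched by name,
# writer-only fields are dropped, reader-only fields get their defaults.

writer_fields = ["userName", "favoriteNumber", "interests"]
reader_defaults = {"userName": "", "interests": [], "photoURL": "n/a"}

def resolve(record, reader_defaults, writer_fields):
    resolved = {}
    for name, default in reader_defaults.items():
        if name in writer_fields:
            resolved[name] = record[name]    # field known to both schemas
        else:
            resolved[name] = default         # reader-only field: use its default
    return resolved                          # writer-only fields are ignored

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}
print(resolve(record, reader_defaults, writer_fields))
# favoriteNumber is dropped; photoURL is filled in with its default
```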
Schema evolution rules
Avro supports forward and backward compatibility, allowing schema version differences between writer and reader.
To maintain compatibility, only fields with default values can be added or removed from schemas.
Adding or removing fields without default values breaks compatibility between old and new data formats.
Avro requires using a union type to allow null values for fields, unlike some programming languages where null is a default.
Avro allows changing field datatypes and, via aliases, field names; a field-name change is backward compatible but not forward compatible. Adding a branch to a union type is likewise backward compatible but not forward compatible.
But what is the writer’s schema?
The challenge is determining how to inform the reader about the schema used for encoding data without including the entire schema with each record.
Avro’s answer depends on its context of use.
Avro is commonly used for storing large files with millions of records encoded with the same schema. The writer can include the schema once at the beginning of the file.
Databases with records written at different times using different schemas can include a version number in each record. Readers can then fetch the corresponding schema version from the database to decode the record.
Two processes can negotiate a schema version on a bidirectional network connection and use it for the connection’s duration.
A database of schema versions, documented with incrementing integers or hashes, is useful for checking compatibility.
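A minimal registry along these lines might look like this (a sketch using JSON bodies and a made-up fingerprint-prefix framing, not a real registry protocol):

```python
import hashlib
import json

# Sketch of a schema registry: every schema version is stored under a
# fingerprint, and each encoded record carries the fingerprint of the
# writer's schema so readers can look it up later.

registry = {}

def register(schema: dict) -> str:
    canonical = json.dumps(schema, sort_keys=True).encode()
    fingerprint = hashlib.sha256(canonical).hexdigest()[:16]
    registry[fingerprint] = schema
    return fingerprint

def encode(record: dict, schema: dict) -> bytes:
    fp = register(schema)
    # fingerprint, NUL separator, then the record (JSON body for brevity)
    return fp.encode() + b"\x00" + json.dumps(record).encode()

def decode(blob: bytes):
    fp, body = blob.split(b"\x00", 1)
    writer_schema = registry[fp.decode()]   # fetch the writer's schema
    return writer_schema, json.loads(body)

schema = {"name": "User", "fields": ["userName"]}
blob = encode({"userName": "Martin"}, schema)
print(decode(blob))
```

Using a hash of the schema as the version identifier, as here, makes registration idempotent; incrementing integers work too but require a single writer to assign them.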
Dynamically generated schemas
Avro’s schema lacks tag numbers, unlike Protocol Buffers, raising the question of its importance.
Avro is well-suited for dynamically generated schemas, making it easy to convert relational database contents into a binary format.
When the database schema changes, a new Avro schema can be generated from it, and the data export process writes data in that new schema each time it runs. Because Avro resolves schemas by field name, readers using an older schema can still decode the newly exported files.
By contrast, dynamically generating Protocol Buffers schemas is awkward: field tags must be assigned by hand and kept consistent across schema changes, whereas Avro’s name-based matching makes generated schemas work automatically.
The Merits of Schemas
Protocol Buffers and Avro use simpler schema languages than XML and JSON, making them easier to implement and use across various programming languages.
The encoding ideas are similar to ASN.1, a schema definition language standardized in 1984, but ASN.1 is complex and poorly documented.
Proprietary binary encoding, often used by data systems like relational databases, offers advantages over textual formats. These include compactness, schema-based documentation, compatibility checks, and code generation for statically typed languages.
Schema evolution offers flexibility similar to schema-less JSON databases while providing better data guarantees and tooling.
Modes of Dataflow
Data must be encoded as bytes for transmission between processes without shared memory, such as over a network or to a file.
Compatibility, a relationship between encoding and decoding processes, is important for evolvability.
Data flow between processes can occur through databases, service calls, workflow engines, and asynchronous messages.
Dataflow Through Databases
Database processes encode data for writing and decode it for reading, ensuring backward compatibility for future access.
Multiple processes often access databases simultaneously, potentially involving different versions of code. This necessitates forward compatibility to ensure data integrity when newer code writes data that older code reads.
Different values written at different times
Databases allow values to be updated at any time, resulting in a mix of old and new data.
Data outlasts code, as database contents remain unchanged during application updates unless explicitly rewritten.
Rewriting data into a new schema is expensive for large datasets, so databases often defer the operation. Schema evolution allows databases to appear as if encoded with a single schema, even if underlying storage contains records with various historical schema versions.
Complex schema changes, like altering single-valued attributes to multivalued or moving data to separate tables, necessitate data rewriting, often at the application level.
Archival storage
Data dumps for backup or data warehousing are typically encoded using the latest schema, even if the source database contains a mixture of schema versions.
Avro object container files and Parquet are recommended formats for data dumps due to their immutability and analytics-friendly column-oriented structure.
Dataflow Through Services: REST and RPC
Network communication typically involves clients and servers, with servers exposing APIs (services) for client requests.
Clients make requests to web servers using standardized protocols and data formats, allowing any web browser to access any website.
Web browsers are not the only clients that communicate with servers. Native apps and client-side JavaScript applications also make HTTP requests, often receiving data in JSON format for further processing.
Services, unlike databases, expose application-specific APIs with predetermined inputs and outputs, providing encapsulation and fine-grained restrictions on client actions.
Service-oriented/microservices architecture aims to enhance application changeability and maintainability by enabling independent deployment and evolution of services. This requires data encoding compatibility across different versions of servers and clients, allowing teams to modify systems without coordination.
Web services
Web services, used in service-oriented and microservices architectures, are not limited to the web.
Client applications make HTTP requests to services, often over the public internet. Services also communicate with each other, either within the same organization or between different organizations, for data exchange.
REST, the most popular service design philosophy, builds upon HTTP principles, emphasizing simple data formats and using URLs for resource identification.
Service developers use IDLs like OpenAPI and Protocol Buffers to define and document API endpoints and data models, enabling other developers to query the service.
OpenAPI service definitions are written in JSON or YAML, while Protocol Buffers uses its own IDL.
Service frameworks like Spring Boot, FastAPI, or gRPC simplify API implementation by handling routing, metrics, caching, and authentication, allowing developers to focus on business logic.
Many frameworks, like FastAPI and gRPC, couple service definitions with server code, enabling client library and SDK generation. IDL tools like Swagger’s also offer documentation, schema change verification, and a GUI for service interaction.
The problems with remote procedure calls
Web services, like previous technologies for API requests, face challenges with complexity and compatibility, despite aiming for vendor interoperability.
The RPC model, introduced in the 1970s, aims to make remote network service requests appear as local function calls, but this approach is flawed due to fundamental differences between network requests and local calls.
- Network requests are unpredictable and prone to failure due to external factors, unlike local function calls. Applications must anticipate and handle network problems, such as retrying failed requests.
- A network request can time out, returning no result at all; the sender then has no way of knowing whether the request got through. A local function call, by contrast, always returns a result or raises an exception.
- Retrying failed network requests can lead to duplicate actions unless idempotence is implemented. Local function calls avoid this issue.
- Network requests are significantly slower and more variable in latency than local function calls.
- Passing references to local objects is efficient, but network requests require encoding parameters into bytes, which can be problematic for large data and mutable objects.
- RPC frameworks must translate datatypes between languages, which can be problematic due to differing types across languages.
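A common mitigation for the retry problem above is an idempotency key: the client attaches a unique identifier to each logical request, and the server remembers which identifiers it has already processed. A minimal sketch (the function and in-memory store here are hypothetical, not any real framework’s API):

```python
import uuid

processed = {}   # idempotency key -> stored response (a real server would persist this)

def handle_payment(idempotency_key, amount):
    if idempotency_key in processed:
        return processed[idempotency_key]   # duplicate request: replay stored result
    response = f"charged {amount}"          # the actual side effect happens once
    processed[idempotency_key] = response
    return response

key = str(uuid.uuid4())
first = handle_payment(key, 100)
retry = handle_payment(key, 100)            # e.g. the client retried after a timeout
print(first == retry, len(processed))       # True 1
```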
REST, by contrast, does not try to disguise network communication as a local function call: it explicitly treats transferring state over a network as something different from invoking a local object.
Load balancers, service discovery, and service meshes
Service discovery is a challenge because clients need to know the address of the service they are connecting to.
Load balancing distributes requests across multiple service instances for higher availability and scalability.
Hardware load balancers, installed in datacenters, route incoming client connections to servers and detect network failures to shift traffic.
Software load balancers, like NGINX and HAProxy, function similarly to hardware load balancers but are applications installed on standard machines.
DNS resolves domain names to IP addresses, supporting load balancing by associating multiple IPs with a single domain. However, DNS propagation delays can lead to clients connecting to stale IP addresses.
Service discovery systems use a centralized registry to track available service endpoints, allowing for dynamic environments and providing clients with metadata for smarter load-balancing decisions.
Service meshes, a sophisticated load balancing method, combine software load balancers and service discovery. Deployed as in-process client libraries or sidecar containers, they offer advantages like encrypted connections and sophisticated observability.
Service mesh solutions like Istio or Linkerd are ideal for dynamic environments with orchestrators like Kubernetes. For specialized infrastructure a purpose-built load balancer may be needed, while simpler deployments often get by with a software load balancer.
Data encoding and evolution for RPC
RPC clients and servers should be independently changeable and deployable. Backward compatibility is needed for requests, and forward compatibility for responses.
RPC schemes inherit compatibility properties from their encoding formats. RESTful APIs using JSON or URI-encoded parameters generally maintain compatibility with optional parameters and new response fields.
RPC’s use across organizational boundaries complicates service compatibility, as providers cannot force client upgrades. This necessitates long-term compatibility maintenance, often requiring multiple API versions.
There is no consensus on API versioning, with common approaches including version numbers in URLs or HTTP headers, or storing client-requested versions on the server.
Durable Execution and Workflows
Service-based architectures consist of multiple services, each responsible for a specific application portion.
Processing a single payment involves multiple service calls, forming a workflow of tasks. Workflow definitions can be written in various languages or markup languages.
Workflow engines use different terms for tasks, such as activities or durable functions, but the concepts are the same.
Workflow engines execute workflows, determining task execution timing, machine allocation, and handling failures.
Workflow engines consist of an orchestrator, which schedules tasks, and an executor, which runs them. Workflows can be triggered by time-based schedules, external sources, or human interaction.
Workflow engines like Airflow, Dagster, and Prefect integrate with data systems for ETL tasks, while Camunda and Orkes offer graphical notations for non-engineers. Temporal and Restate provide durable execution.
Durable execution frameworks are used to build service-based architectures requiring transactionality, ensuring each payment is processed exactly once.
Durable execution frameworks ensure exactly-once semantics for workflows by re-executing failed tasks while skipping successful RPC calls and state changes. This is achieved by logging all RPCs and state changes to durable storage.
Durable execution frameworks like Temporal require passing unique IDs to external services (for idempotence), and code changes that alter the recorded order of RPC calls can break replay.
Durable execution frameworks require deterministic code, making nondeterministic code problematic. Frameworks often provide deterministic implementations and static analysis tools to address this.
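The replay mechanism described above can be sketched as follows (in the spirit of frameworks like Temporal or Restate, but using made-up names, not their actual APIs):

```python
# Sketch of replay-based durable execution: every side effect's result is
# journaled; on re-execution, journaled steps return the recorded result
# instead of running again.

journal = {}          # durable log: step id -> recorded result
calls_made = []       # tracks real side effects, to show they run only once

def step(step_id, side_effect):
    if step_id in journal:          # ran before the crash: replay the result
        return journal[step_id]
    result = side_effect()          # first execution: run it and log it
    journal[step_id] = result
    return result

def charge_card(payment_id):        # stand-in for an RPC to a payment service
    calls_made.append(payment_id)
    return "charged"

def workflow(payment_id):
    return step(f"charge:{payment_id}", lambda: charge_card(payment_id))

workflow("pay-1")
workflow("pay-1")                   # simulated re-execution after a failure
print(calls_made)                   # ['pay-1'] -- the RPC happened exactly once
```

This also shows why determinism matters: if a code change altered the step IDs or their order, replay would no longer match the journal.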
Event-Driven Architectures
Event-driven architectures use events or messages, sent via a message broker, for data flow between processes. Unlike RPC, senders don’t wait for recipients to process events.
Message brokers offer advantages over direct RPC, including improved reliability, message redelivery, and decoupling of senders and recipients.
Message broker communication is asynchronous, but a synchronous RPC-like model can be implemented by having the sender wait for a response on a separate channel.
Message brokers
The message broker landscape has evolved from commercial software to open-source implementations and cloud services.
Two common message distribution patterns are used: queue-based, where one consumer receives a message from a named queue, and topic-based, where a broker delivers a message to all subscribers of a named topic.
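The two patterns can be contrasted with a toy in-memory broker (illustrative only; real brokers add durability, acknowledgements, and redelivery):

```python
from collections import defaultdict, deque

class Broker:
    def __init__(self):
        self.queues = defaultdict(deque)       # queue: each message goes to one consumer
        self.subscribers = defaultdict(list)   # topic: every subscriber gets a copy

    def send(self, queue, message):
        self.queues[queue].append(message)

    def receive(self, queue):
        return self.queues[queue].popleft()    # delivered to exactly one consumer

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:   # fan out to all subscribers
            handler(message)

broker = Broker()
broker.send("payments", "pay-1")
print(broker.receive("payments"))              # pay-1, consumed from the queue

seen = []
broker.subscribe("signups", seen.append)
broker.subscribe("signups", seen.append)
broker.publish("signups", "user-42")
print(seen)                                    # ['user-42', 'user-42']
```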
Message brokers don’t enforce data models, allowing any encoding format like Protocol Buffers, Avro, or JSON. A schema registry can store schema versions and check compatibility.
Message brokers vary in message durability, with some writing messages to disk for crash recovery and others automatically deleting them after consumption. Some brokers can be configured to store messages indefinitely for event sourcing.
A consumer that republishes messages to another topic should preserve unknown fields, to avoid losing data added by newer schema versions — the same issue that arises when old code updates database records.
Distributed actor frameworks
The actor model is a concurrency programming model where logic is encapsulated in actors, each representing a client or entity. Actors communicate asynchronously through messages, eliminating thread-related issues.
Distributed actor frameworks like Akka, Orleans, and Erlang/OTP use a message-passing model to scale applications across nodes, encoding and decoding messages for network transmission.
Location transparency is more effective in the actor model than in RPC due to the actor model’s inherent assumption of potential message loss.
Distributed actor frameworks integrate message brokers and actor programming models, but rolling upgrades require addressing forward and backward compatibility for message exchanges between nodes running different versions.