Designing Data-Intensive Applications - Chapter 4: Encoding and Evolution
Charles
Charles
Everything changes and nothing stands still. —Heraclitus of Ephesus, as quoted by Plato in Cratylus (360 BCE)
- Applications inevitably change over time..
- schema-on-read (“schemaless”) databases don’t enforce a schema, so the database can contain a mixture of older and newer data formats written at different times
- When data changes usually you will need code change, but in a large application, code changes often cannot happen instantaneously.
- With server-side applications you may want to perform a rolling upgrade (also known as a staged rollout)
- With client-side applications you’re at the mercy of the user, who may not install the update for some time.
- We need to maintain compatibility in both directions
- Backward compatibility: Newer code can read data that was written by older code. (relative easy, since you already knew the existing old data schema)
- Forward compatibility: Older code can read data that was written by newer code. (relative harder, it requires older code to ignore additions made by a newer version of the code)
Formats for Encoding Data:
- The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
- Language-Specific Formats: (bad idea, usually don’t recommended)
- Java has java.io.Serializable, Ruby has Marshal, Python has pickle, and so on. Many third-party libraries also exist, such as Kryo for Java.
- Problems:
- The encoding is often tied to a particular programming language, and reading the data in another language is very difficult.
- In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes.(Potential security risk)
- Versioning data is often an afterthought in these libraries
- Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought
- JSON, XML, and Binary Variants:
- Problems:
- There is a lot of ambiguity around the encoding of numbers. This is a problem when dealing with large numbers(e.g. 2^53);
- JSON and XML have good support for Unicode character strings (i.e., human- readable text), but they don’t support binary strings
- There is optional schema support for both XML and JSON (but complicated to learn and use)
- CSV does not have any schema, so it is up to the application to define the meaning of each row and column.
- Despite these flaws, JSON, XML, and CSV are good enough for many purposes.
- The difficulty of getting different organizations to agree on anything outweighs most other concerns;
- Problems:
- Binary encoding: For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format
- JSON is verbose but it use lots of space, hence it led to the development of a profusion of binary encodings for JSON (MessagePack, BSON, BJSON, UBJSON, BISON, and Smile, to name a few) and for XML (WBXML and Fast Infoset, for example) [but they wasn’t save too much space neither]
- Thrift and Protocol Buffers: Apache Thrift [by Facebook] and Protocol Buffers (protobuf) [by Google] are binary encoding libraries that are based on the same principle.
- Both Thrift and Protocol Buffers
- require a schema for any data that is encoded.
- each come with a code generation tool
- Thrift has two different binary encoding formats,
- BinaryProtocol
- CompactProtocol (and DenseProtocol only support C++)
- Protocol only have Binary encoding;
- does not have a list or array data type, but instead has a repeated marker for fields (which is a third option alongside required and optional).
- Field tags and schema evolution:
- to maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value.
- you can only remove a field that is optional (a required field can never be removed), and you can never use the same tag number again
- Data Types and schema evolution;
- Both Thrift and Protocol Buffers
- Apache Avro: (subproject of Hadoop) is another binary encoding format that is interestingly different from Protocol Buffers and Thrift.
- The writer’s schema and the reader’s schema:
- The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible.
- Schema evolution rules:
- To maintain compatibility, you may only add or remove a field that has a default value.
- In Avro:
- if you want to allow a field to be null, you have to use a union type. For example, union { null, long, string } field
- Doesn’t have optional and required markers.
- Change field name: the reader’s schema can contain aliases for field names, so it can match an old writer’s schema field names against the aliases.
- A database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema compatibility.
- Dynamically generated schemas
- Code generation and dynamically typed languages:
- After a schema has been defined, you can generate code that implements this schema in a programming language of your choice.
- The writer’s schema and the reader’s schema:
- The Merits of Schemas:
- schema evolution allows the same kind of flexibility as schemaless/ schema-on-read JSON databases provide, while also providing better guarantees about your data and better tooling.
Modes of Dataflow
- Forward and backward compatibility, which are important for evolvability (upgrade system independently, and not having to change everything at once)
- Via Databases:
- Different values written at different times: A database generally allows any value to be updated at any time. (aka, data outlives code) using Schema evolution rules.
- Archival storage: use latest schema
- encode the data in an analytics-friendly column-oriented format such as Parquet
- Via Synchronized Service call: REST or RPC
- REST seems to be the predominant style for public APIs.
- RPC frameworks are on requests between services owned by the same organization, typically within the same datacenter.
- The most common arrangement is to have two roles: clients and servers.
- The API exposed by the server is known as a service.
- Moreover, a server can itself be a client to another service (e.g. a web app server acts as client to a database)
- service-oriented architecture (SOA), more recently refined and rebranded as microservices architecture;
- A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable.
- Web services: REST and SOAP
- REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP. An API designed according to the principles of REST is called RESTful.
- A definition format such as OpenAPI, also known as Swagger, can be used to describe RESTful APIs and produce documentation.
- SOAP is an XML-based protocol for making network API requests.
- The API of a SOAP web service is described using an XML-based language called the Web Services Description Language**, or WSDL**, WSDL enables code generation so that a client can access a remote service using local classes and method calls; (good for static typed language like Java)
- REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP. An API designed according to the principles of REST is called RESTful.
- RPC:
- The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called location transparency).
- Although RPC seems convenient at first, the approach is fundamentally flawed. A network request is very different from a local function call.
- A network request is unpredictable;
- A network response is unpredictable;
- Idempotence is required for re-try;
- A network request is much slower than a function call, and its latency is also wildly variable;
- Current directions for RPC: Thrift and Avro come with RPC support included, gRPC is an RPC implementation using Protocol Buffers, Finagle also uses Thrift, and Rest.li uses JSON over HTTP.
- Custom RPC protocols with a binary encoding format can achieve better performance than something generic like JSON over REST;
- But REST it is good for experimentation and debugging;
- Data encoding and evolution for RPC
- Via Asynchronous message passing: asynchronous message-passing systems, which are somewhere between RPC and databases.
- Advantages:
- Act as buffer;
- Auto-redeliver message;
- Allow message fan-out;
- Decouple the sender and recipient;
- Disadvantages:
- message-passing communication is usually one-way: a sender normally doesn’t expect to receive a reply to its messages
- communication pattern is asynchronous: the sender doesn’t wait for the message to be delivered, but simply sends it and then forgets about it
- Message brokers: open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular
- one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic.
- There can be many producers and many consumers on the same topic.
- A topic provides only one-way dataflow. However, a consumer may itself publish messages to another topic (so you can chain them together), or to a reply queue that is consumed by the sender of the original message (allowing a request/response dataflow, similar to RPC)
- Distributed actor frameworks: The actor model is a programming model for concurrency in a single process. (e.g. Akka, Orleans, Erlang OTP)
- A distributed actor framework essentially integrates a message broker and the actor programming model into a single framework.
- Advantages:
Summary:
- many services need to support rolling upgrades → evolvability
- released without downtime
- make deployments less risky
- all data flowing around the system is encoded in a way that provides backward compatibility (new code can read old data) and forward compatibility (old code can read new data).
- data encoding formats and it pros and cons;
- several modes of dataflow, illustrating different scenarios in which data encodings are important.
- Databases, where the process writing to the database encodes the data and the process reading from the database decodes it.
- RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
- Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient