Qiyang Wang
Designing Data-Intensive Applications

Data Models and Query Languages

This chapter provided a brief overview of various data models, encouraging further exploration of the model that best suits specific application requirements.

Data models are crucial in software development, impacting both code and problem-solving approaches.

Applications are built by layering data models, with each layer represented in terms of the next-lower layer. This includes modeling the real world, storing data structures, representing data in storage engines, and ultimately, representing bytes in hardware.

Complex applications often use multiple intermediary levels, like APIs, to hide complexity and provide a clean data model. This abstraction enables effective collaboration between different groups, such as database engineers and application developers.

This chapter compares data models (relational, document, graph-based, event sourcing, DataFrames) and their query languages to help decide which model to use for different data and queries.

Declarative query languages like SQL and Cypher allow users to specify data patterns and transformations without detailing the execution process. This approach simplifies query writing and enables database systems to optimize performance, such as parallel execution, without requiring user intervention.

Relational Versus Document Models

The relational model, proposed by Edgar Codd in 1970, organizes data into tables and remains dominant in data management. While alternatives like the network, hierarchical, and object models have emerged, the relational model, particularly SQL, has adapted to incorporate various data types. The NoSQL movement, emphasizing flexibility and scalability, popularized the document model, represented as JSON, which is now widely supported by both NoSQL and relational databases.

The Object-Relational Mismatch

Object-oriented programming languages create an impedance mismatch with the SQL data model, requiring a translation layer between objects and the relational database model.

Impedance mismatch, borrowed from electronics, occurs when output and input impedances of connected circuits differ, potentially causing signal reflections and other issues.

Object-relational mapping

Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate simplify the translation between relational and object data models, reducing boilerplate code. However, they are criticized for their complexity, limited support for diverse data systems, potential for inefficient queries, and challenges in managing relational schemas. Despite these drawbacks, ORMs offer advantages like reduced boilerplate code for simple queries, caching, and schema migration management.
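The translation layer that an ORM automates can be sketched by hand. This is a minimal illustration, not the API of ActiveRecord or Hibernate; the `User` class and table are hypothetical.

```python
import sqlite3

# A sketch of the object<->row translation that ORM frameworks automate.
# The User class and users table are illustrative examples.
class User:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")

def save(user):
    # Object -> row: each attribute becomes a column value.
    conn.execute("INSERT INTO users VALUES (?, ?)", (user.user_id, user.name))

def load(user_id):
    # Row -> object: the boilerplate an ORM generates for you.
    row = conn.execute(
        "SELECT user_id, name FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    return User(*row) if row else None

save(User(1, "Alice"))
alice = load(1)
```

Even this tiny sketch shows where the mismatch lives: every attribute needs an explicit mapping in both directions, which is exactly the boilerplate ORMs exist to remove.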

The document data model for one-to-many relationships

The relational model has limitations, as illustrated by the example of representing a résumé. While relational schemas can represent one-to-many relationships, a JSON document may be a more natural and object-oriented representation.

JSON reduces impedance mismatch between application code and storage, offering better data locality and an explicit tree structure for one-to-many relationships. However, JSON also presents problems as a data encoding format.
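A résumé's one-to-many structure can be captured in a single self-contained JSON document, with the nested lists making the tree explicit. The field names and values below are illustrative, not taken from the book's figures.

```python
import json

# A résumé as one JSON document: one-to-many relationships (positions,
# education) become nested lists with good data locality. Illustrative data.
resume = {
    "user_id": 1,
    "first_name": "Alice",
    "last_name": "Nguyen",
    "positions": [
        {"job_title": "Staff Engineer", "organization": "Example Corp"},
        {"job_title": "Engineer", "organization": "Another Co"},
    ],
    "education": [
        {"school_name": "Example University", "start": 2010, "end": 2014},
    ],
}

# In a normalized relational schema, each nested list would be a separate
# table with a foreign key back to the users table.
doc = json.dumps(resume)
```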

When the "many" side of a one-to-many relationship holds only a small number of items (sometimes called one-to-few), embedding them in the document works well; for a large number of related items, a relational approach is better.

Normalization, Denormalization, and Joins

Using standardized lists of geographic regions with IDs, rather than plain-text strings, offers advantages like consistent style, avoiding ambiguity, ease of updates, localization support, and better search functionality. While normalized data requires additional lookups to display human-readable information, it avoids data duplication and potential inconsistencies. Document databases often favor denormalization due to their data model and limited join support.
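The trade-off above can be made concrete with a small normalized schema: users reference a regions table by ID instead of storing a free-text string. Table and column names are illustrative.

```python
import sqlite3

# Normalization sketch: users point at a regions table by ID rather than
# duplicating a plain-text region string. Illustrative names and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT,
                        region_id INTEGER REFERENCES regions);
    INSERT INTO regions VALUES (1, 'Greater Seattle Area');
    INSERT INTO users VALUES (10, 'Alice', 1), (11, 'Bob', 1);
""")

# Displaying human-readable data requires a join (the extra lookup)...
rows = conn.execute("""
    SELECT users.name, regions.name FROM users
    JOIN regions ON users.region_id = regions.region_id
""").fetchall()

# ...but renaming the region is a single update: no copies can drift
# out of sync, which is the inconsistency denormalization risks.
conn.execute("UPDATE regions SET name = 'Seattle Metro' WHERE region_id = 1")
```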

Trade-offs of normalization

Denormalized data, while faster to read, can be more expensive to write and maintain consistency, especially in large-scale systems. Normalized data, on the other hand, is generally faster to write and better suited for OLTP systems, but may be slower to query due to the need for joins. The choice between the two models depends on the specific needs and scale of the system.

Denormalization in the social networking case study

Materialized timelines at X (formerly Twitter) store only post IDs and sender IDs, requiring two joins to retrieve full post content and user information. This approach balances scalability and data freshness, as frequently changing data like likes and profile pictures are not denormalized. The decision to denormalize should consider data change frequency, read/write costs, and implementation effort.

Many-to-One and Many-to-Many Relationships

Many-to-one and many-to-many relationships, such as those between résumés and organizations, are better suited for normalized representations. In a document model, these relationships can be represented as references to other documents, while a normalized representation stores the relationship in one place and uses indexes for efficient querying. Both models can utilize indexes on relevant fields to enable bidirectional queries.

Stars and Snowflakes: Schemas for Analytics

Data warehouses use relational structures like star schemas, snowflake schemas, dimensional modeling, or one big table (OBT) to optimize data for business analysis. These schemas, often containing wide tables with numerous columns, represent events and their associated metadata through fact and dimension tables. While snowflake schemas are more normalized, star schemas are preferred for their simplicity, and OBTs, though requiring more storage, can enable faster queries.
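A minimal star schema can be sketched as one fact table referencing two dimension tables; a typical analytic query then joins them and aggregates. Table names and data below are illustrative, loosely following the fact/dimension pattern described above.

```python
import sqlite3

# A toy star schema: fact_sales events reference product and date
# dimensions. All names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE dim_date (date_sk INTEGER PRIMARY KEY,
                           year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (date_sk INTEGER, product_sk INTEGER,
                             quantity INTEGER, net_price REAL);
    INSERT INTO dim_product VALUES (1, 'Bananas', 'Fresh fruit'),
                                   (2, 'Cereal', 'Breakfast');
    INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO fact_sales VALUES (1, 1, 3, 0.5), (1, 2, 1, 4.0),
                                  (2, 1, 5, 0.5);
""")

# A typical analytic query: revenue per product category, joining the
# fact table to one of its dimensions and aggregating.
revenue = conn.execute("""
    SELECT p.category, SUM(f.quantity * f.net_price)
    FROM fact_sales f JOIN dim_product p ON f.product_sk = p.product_sk
    GROUP BY p.category ORDER BY p.category
""").fetchall()
```

An OBT variant would pre-join the dimensions into the fact table, trading storage for simpler, often faster queries.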

When to Use Which Model

The document data model offers schema flexibility, better performance, and alignment with object models, making it suitable for applications with document-like structures. However, it has limitations in referencing nested items and may be less efficient for applications requiring complex relationships or reorderable lists.

Schema flexibility in the document model

Document databases, often called schemaless, use a schema-on-read approach where data structure is interpreted at read time, unlike relational databases with schema-on-write that enforce structure at write time. This difference impacts data format changes, with schema-on-read allowing flexible handling of old and new formats, while schema-on-write requires migrations. Schema-on-read is beneficial for heterogeneous data or when data structure is externally controlled and subject to change.
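Schema-on-read means the application code, not the database, interprets the structure when it reads. A sketch of handling an old and a new document format side by side, with illustrative field names:

```python
# Schema-on-read: older documents store a single "name" field, newer ones
# store "first_name"/"last_name". The reader copes with both formats, so
# no migration is needed. Field names are illustrative.
documents = [
    {"user_id": 1, "name": "Ada Lovelace"},                       # old format
    {"user_id": 2, "first_name": "Grace", "last_name": "Hopper"}, # new format
]

def first_name(doc):
    # Interpret the structure at read time; a schema-on-write database
    # would instead require an ALTER TABLE plus an UPDATE of every row.
    if "first_name" in doc:
        return doc["first_name"]
    return doc["name"].split(" ")[0]

names = [first_name(d) for d in documents]
```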

Data locality for reads and writes

Storing documents as single continuous strings, like JSON or XML, offers performance advantages for applications that frequently access entire documents. However, this approach can be inefficient for applications that only need small parts of a document or frequently update documents.

Query languages for documents

Relational databases are queried using SQL, while document databases use varied languages and APIs. Some document databases offer rich query languages, like MongoDB’s aggregation pipeline, which can be used to generate reports, such as the number of sharks sighted per month.
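The sharks-per-month report mentioned above can be computed with a filter-group-sum pattern; MongoDB's aggregation pipeline expresses this with stages such as `$match` and `$group`. Below is the same computation in plain Python over made-up observation records, to show what the pipeline does.

```python
from collections import Counter
from datetime import date

# The sharks-per-month report as a filter + group + sum, mirroring what an
# aggregation pipeline computes. Observation records are sample data.
observations = [
    {"family": "Sharks", "observed": date(1995, 12, 25), "count": 3},
    {"family": "Sharks", "observed": date(1995, 12, 26), "count": 4},
    {"family": "Whales", "observed": date(1995, 12, 25), "count": 2},
    {"family": "Sharks", "observed": date(1996, 1, 12), "count": 1},
]

per_month = Counter()
for obs in observations:
    if obs["family"] == "Sharks":                   # filter (like $match)
        key = (obs["observed"].year, obs["observed"].month)
        per_month[key] += obs["count"]              # group + sum (like $group)
```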

Convergence of document and relational databases

Relational and document databases have converged, with relational databases adding JSON support and document databases adding relational features. This convergence allows developers to combine both models in the same database, offering flexibility and power.

Codd’s original relational model allowed nested relations (tables) as values, similar to JSON and XML support later added to SQL.

Graph-Like Data Models

The document model is suitable for applications with mostly one-to-many relationships, while the graph model is better for complex many-to-many relationships.

Graphs model data with vertices and edges, representing entities and relationships. Examples include social graphs, web graphs, and road networks.

Well-known algorithms, like those used in map navigation apps and PageRank, can operate on graphs.

Graphs can be represented using adjacency lists or matrices. Adjacency lists are suitable for graph traversals, while matrices are beneficial for machine learning.

Graphs can store different types of objects in a single database, as demonstrated by Facebook’s graph of people, locations, and events, and search engines’ knowledge graphs of entities.

The next section discusses the property graph model and the triple store model, highlighting their similarities and differences. It also explores four graph query languages (Cypher, SPARQL, Datalog, and GraphQL) and SQL support for querying graphs, using a running example to illustrate their usage.

Property Graphs

Property graph model: vertices have unique identifiers, labels, edges, and properties; edges have unique identifiers, tail and head vertices, labels, and properties.

A graph store can be represented as two relational tables, one for vertices and one for edges, with properties stored in the jsonb datatype.
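The two-table representation can be sketched in SQLite, with `TEXT` columns holding JSON standing in for `jsonb`. Vertex data and table names are illustrative.

```python
import sqlite3, json

# A property graph as two relational tables, with indexes on both edge
# endpoints so the graph can be traversed in either direction.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (vertex_id INTEGER PRIMARY KEY,
                           label TEXT, properties TEXT);
    CREATE TABLE edges (edge_id INTEGER PRIMARY KEY,
                        tail_vertex INTEGER REFERENCES vertices,
                        head_vertex INTEGER REFERENCES vertices,
                        label TEXT, properties TEXT);
    CREATE INDEX edges_tails ON edges (tail_vertex);
    CREATE INDEX edges_heads ON edges (head_vertex);
""")
conn.execute("INSERT INTO vertices VALUES (1, 'Location', ?)",
             (json.dumps({"name": "United States", "type": "country"}),))
conn.execute("INSERT INTO vertices VALUES (2, 'Location', ?)",
             (json.dumps({"name": "Idaho", "type": "state"}),))
conn.execute("INSERT INTO edges VALUES (10, 2, 1, 'WITHIN', '{}')")

# Traverse one hop: everything directly WITHIN the United States.
rows = conn.execute("""
    SELECT v.properties
    FROM edges e JOIN vertices v ON e.tail_vertex = v.vertex_id
    WHERE e.label = 'WITHIN' AND e.head_vertex = 1
""").fetchall()
inside_us = [json.loads(props)["name"] for (props,) in rows]
```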

The model allows any vertex to connect with any other, enabling efficient traversal of the graph. Different labels and properties enable storing various information types while maintaining a clean data model.

The edges table generalizes many-to-many relationships, storing various relationship types and allowing efficient retrieval of vertices or edges with specific properties.

Graph edges connect exactly two vertices (binary relationships), unlike relational join tables, which can represent three-way or higher-degree relationships. Graphs can work around this by adding a vertex for each row of such a join table, or by using hypergraphs.

Graphs offer flexibility for data modeling, allowing for the representation of complex structures and varying data granularities. They are easily extendable, making them suitable for applications with evolving data structures.

The Cypher Query Language

Cypher is a query language for property graphs, originally created for Neo4j and later developed into an open standard. It is supported by several graph databases.

Cypher query inserts graph data, using symbolic names for vertices and arrow notation to create edges.

Adding vertices and edges from Figure 3-6 to the database allows for querying, such as finding people who emigrated from the US to Europe.

Cypher query example demonstrates using arrow notation in MATCH clauses to find patterns in graphs.

Find person vertices with BORN_IN and LIVES_IN edges leading to Location vertices named United States and Europe, respectively.

One way to execute the query is to scan the database, examining each person’s birthplace and residence, and returning those who meet the criteria.

An alternative approach to finding people based on location is to start with the Location vertices for the US and Europe, then find all locations within those regions and finally identify people connected to those locations.

Graph Queries in SQL

Graph data can be represented in a relational database, but querying it with SQL is challenging due to the variable number of edges that may need to be traversed. Cypher query language, however, can express this concisely using the :WITHIN*0.. pattern.

Recursive common table expressions (WITH RECURSIVE) can express variable-length traversal paths in SQL queries, though the syntax is less intuitive than Cypher.
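A variable-length traversal with `WITH RECURSIVE` can be sketched in SQLite: find every location transitively within the United States. Table and column names are illustrative.

```python
import sqlite3

# Variable-length graph traversal in SQL via a recursive CTE. The
# locations/within tables and their contents are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (loc_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE within (child_id INTEGER, parent_id INTEGER);
    INSERT INTO locations VALUES (1, 'United States'), (2, 'Idaho'), (3, 'Boise');
    INSERT INTO within VALUES (2, 1), (3, 2);
""")
rows = conn.execute("""
    WITH RECURSIVE in_usa(loc_id) AS (
        -- base case: the US itself (the 0 hops of Cypher's *0.. pattern)
        SELECT loc_id FROM locations WHERE name = 'United States'
        UNION
        -- recursive case: anything directly within a location found so far
        SELECT w.child_id
        FROM within w JOIN in_usa ON w.parent_id = in_usa.loc_id
    )
    SELECT name FROM locations JOIN in_usa USING (loc_id) ORDER BY loc_id
""").fetchall()
```

The recursive CTE spells out explicitly what Cypher's `:WITHIN*0..` expresses in a few characters, which is the verbosity gap the text describes.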

The choice of data model and query language significantly impacts efficiency, as demonstrated by the comparison between Cypher and SQL. The Graph Query Language (GQL) ISO standard, based on Cypher, aims to improve uniformity among graph databases.

Triple Stores and SPARQL

The triple store model stores all information as three-part statements: (subject, predicate, object). It is largely equivalent to the property graph model, expressed in different words, and the ideas behind it are valuable for building applications regardless of which model you choose.

Databases with a triple-like data model, such as AWS Neptune and Datomic, often store additional metadata on each tuple, but are still referred to as triple stores.

In a graph, a triple’s subject is a vertex, and its object can be a primitive datatype value or another vertex. The predicate represents the relationship between the subject and object vertices.

Vertices in the graph are represented by _:someName, with predicates indicating edges or properties. Semicolons can be used for a more compact representation.
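The vertex-versus-value distinction can be illustrated with triples as plain tuples: an object naming another vertex (here, a string with the `_:` prefix used for blank nodes) is an edge, while a primitive-valued object is a property. The data is illustrative.

```python
# Triples as (subject, predicate, object) tuples. An object that names a
# vertex is an edge; a primitive object is a property. Illustrative data.
triples = [
    ("_:idaho", "name", "Idaho"),
    ("_:idaho", "type", "state"),
    ("_:idaho", "within", "_:usa"),
    ("_:usa", "name", "United States"),
]

def is_vertex(term):
    return isinstance(term, str) and term.startswith("_:")

edges = [t for t in triples if is_vertex(t[2])]
properties = [t for t in triples if not is_vertex(t[2])]
```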

Triple stores, a Semantic Web technology, are used for internet-wide data exchange and as an internal data model for applications.

The RDF data model

The Turtle language encodes data in RDF, a data model for the Semantic Web, which can also be encoded in XML.

RDF uses URIs for subject, predicate, and object in triples to enable data combination without conflicts from differing interpretations of predicates.

Nonresolvable URIs, like urn:example:within, can be used as namespaces in RDF, avoiding confusion with http:// URLs.

The SPARQL query language

SPARQL, a query language for RDF triple stores, predates Cypher and shares similar pattern matching features. Both languages offer concise query syntax for tasks like finding people who have moved from the US to Europe.

SPARQL queries are structurally very similar to Cypher, though RDF does not distinguish between properties and edges, using predicates for both. SPARQL is supported by various triple stores.

Datalog: Recursive Relational Queries

Datalog, a powerful language for complex queries, is less known but used in niche databases like Datomic and CozoDB.

Datalog database facts, like rows in relational tables, are represented as statements (e.g., location(2, “United States”, “country”)).

Datalog representation of graph data from Figure 3-6, using two-column join tables for edges.

The same query can then be written in Datalog, whose core is a subset of Prolog.

Datalog derives new virtual tables from underlying facts, similar to SQL views, allowing querying without storing them in the database.

Example 3-12 defines three derived tables: within_recursive, migrated, and us_to_europe.

Virtual table content is defined by the rule’s right-hand side, matching rows to patterns. A rule applies when all patterns match, adding the left-hand side to the database with variable values.

Datalog queries are built rule by rule, allowing complex queries and recursive rule invocation for tasks like graph traversal.
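The rule-by-rule, recursive evaluation described above can be sketched as a toy fixpoint computation of a `within_recursive`-style rule: keep applying the rules until no new facts are derived. The facts and rule shapes below are illustrative, and the loop is a sketch, not a real Datalog engine.

```python
# Toy fixpoint evaluation of a transitive "within" rule.
# Facts (illustrative): location IDs with names, and within(Child, Parent).
location = {1: "North America", 2: "United States", 3: "Idaho"}
within = [(3, 2), (2, 1)]

# Rule 1 (base): within_recursive(A, Name) :- location(A, Name).
derived = {(a, name) for a, name in location.items()}

# Rule 2 (recursive): within_recursive(A, Name) :-
#     within(A, B), within_recursive(B, Name).
# Repeat until no rule application adds a new fact (the fixpoint).
changed = True
while changed:
    changed = False
    for (a, b) in within:
        for (b2, name) in list(derived):
            if b2 == b and (a, name) not in derived:
                derived.add((a, name))
                changed = True
```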

GraphQL

GraphQL is a restrictive query language designed for OLTP queries, allowing client software to request specific JSON documents for UI rendering.

GraphQL interfaces offer flexibility for client code but require tooling for internal service integration and present challenges like authorization and performance.

GraphQL intentionally limits its language to prevent expensive queries from untrusted sources, disallowing recursive queries and arbitrary search conditions unless explicitly enabled by service owners.

GraphQL is useful for implementing group chat applications. Example 3-13 demonstrates how to query channels, messages, and sender information for a Discord or Slack-like application.

The response to a query is a JSON document mirroring the query structure, containing only requested attributes. This allows the client to request specific attributes needed for the user interface, making server-side changes unnecessary.
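The query-mirrors-response idea can be illustrated with a toy "shaper": the client sends the shape of the document it wants, and the server returns only those attributes. This is a sketch of the principle only, not a GraphQL implementation; all names are made up.

```python
# Toy response shaping: return only the attributes the query asks for.
# A query maps each requested field to None (a leaf) or a nested query.
def shape(data, query):
    result = {}
    for field, sub in query.items():
        value = data[field]
        if sub is None:
            result[field] = value
        elif isinstance(value, list):
            result[field] = [shape(item, sub) for item in value]
        else:
            result[field] = shape(value, sub)
    return result

channel = {
    "name": "#general",
    "topic": "Anything goes",
    "messages": [
        {"text": "hi", "sender": {"fullName": "Aaliyah", "imageUrl": "a.png"}},
    ],
}
response = shape(channel, {"name": None,
                           "messages": {"text": None,
                                        "sender": {"fullName": None}}})
```

Because the client controls the shape, adding a new attribute to the UI means changing the query, not the server.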

GraphQL prioritizes simplicity in rendering the UI over reducing response size by duplicating sender information and reply content within message objects.

GraphQL can be implemented on any database type, including relational, document, and graph databases. It allows clients to request only explicitly declared joins in the GraphQL schema.

Event Sourcing and CQRS

Complex applications may require multiple data representations optimized for different query types, rather than a single data model.

Data can be optimized for writing by choosing different representations, prioritizing writing efficiency over query performance.

Event logs are a simple and efficient way to record data: each event is an immutable, self-contained record of something that happened, typically with a timestamp, appended to the end of the log.

Conference management systems are complex, involving individual and company registrations, seat assignments, cancellations, and capacity changes.

Changes to the conference state are stored as events, triggering updates to materialized views that reflect the changes.

Event sourcing and CQRS are principles from the DDD community that involve using events as the source of truth and maintaining separate read and write representations.

User requests, called commands, are validated before execution. Valid commands become facts and are logged as events, which are then used to build materialized views.
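The command-to-event-to-view flow can be sketched for the conference example: a command is validated against current state, a valid command is appended to the log as a past-tense event, and the materialized view is a fold over the log. Event names, fields, and the capacity are illustrative.

```python
# Command -> event -> materialized view sketch. Illustrative names/data.
CAPACITY = 2
event_log = []

def seats_taken(events):
    # Materialized view: recomputable at any time by folding over the log.
    taken = 0
    for e in events:
        if e["type"] == "SeatBooked":
            taken += 1
        elif e["type"] == "BookingCancelled":
            taken -= 1
    return taken

def book_seat(attendee):
    # Command: validated before it becomes a fact in the log.
    if seats_taken(event_log) >= CAPACITY:
        return False
    event_log.append({"type": "SeatBooked", "attendee": attendee})
    return True

assert book_seat("alice") and book_seat("bob")
ok = book_seat("carol")  # rejected: the conference is full
event_log.append({"type": "BookingCancelled", "attendee": "bob"})
```

Note the past-tense event names: the log records what happened, and the view can always be deleted and recomputed from it.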

Event sourcing data modeling recommends naming events in the past tense to accurately reflect historical facts.

Event sourcing and star schema fact tables both store past events, but differ in structure and order importance.

Event sourcing and CQRS have several advantages:

  • Events are more effective than row modifications for communicating system intent, as they clearly explain the reason for changes.
  • Materialized views in event sourcing are derived from the event log, allowing for reproducible recomputation. This enables easy bug detection and correction by deleting and recomputing views with updated code.
  • Multiple materialized views can be optimized for specific queries, stored in the same or different databases, and use any data model. Views can be denormalized for fast reads and kept in memory to avoid persistence.
  • New materialized views can be built from existing event logs, and systems can be evolved by adding new event types or properties.
  • Event sourcing allows for easy reversal of erroneous events through subsequent deletion events, unlike databases where committed transactions are often difficult to reverse.
  • Event logs provide auditability, valuable in regulated industries.
  • Event logs, with their sequential access patterns, can handle higher write throughput than databases, allowing downstream systems to catch up at their own pace.

However, event sourcing and CQRS also have downsides:

  • External information, like fluctuating exchange rates, can make event processing non-deterministic. To ensure consistent results, include the exchange rate in the event or query historical rates at the event’s timestamp.
  • Immutable events pose challenges for deleting personal data, especially when events involve multiple users. Solutions like storing data separately or using crypto-shredding complicate state recomputation.
  • Reprocessing events requires caution to avoid unwanted side effects, such as resending confirmation emails.

Event sourcing can be implemented on any database, but some systems like EventStoreDB, MartenDB, and Axon Framework are specifically designed for it. The event storage system must guarantee that materialized views process events in the same order they appear in the log.

DataFrames, Matrices, and Arrays

Analytics and machine learning workloads often use data models such as DataFrames and multidimensional arrays, which are uncommon in OLTP systems.

DataFrame data model is supported by R, Pandas, Spark, ArcticDB, Dask, and other systems. DataFrames are used for data preparation, exploration, analysis, and visualization.

A DataFrame, similar to a table or spreadsheet, supports relational-like operators for bulk operations, including applying functions, filtering, grouping, aggregating, and merging.

DataFrames are manipulated through commands, aligning with data scientists’ workflows of incrementally refining data for analysis.

DataFrame APIs offer operations beyond relational databases and are often used to transform data into matrix or multidimensional array representations for machine learning algorithms.

A relational table of user movie ratings is transformed into a sparse matrix, with columns representing movies and rows representing users. This matrix, suitable for DataFrames and sparse arrays, can handle large datasets.
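The pivot from a relational ratings table to a sparse users-by-movies matrix can be sketched with a dictionary-of-keys representation, so missing ratings take no space. The data is made up for illustration.

```python
# Pivot (user_id, movie_id, rating) rows into a sparse users x movies
# matrix stored as a dictionary-of-keys. Illustrative data.
ratings = [
    (1, 37, 5), (1, 53, 3), (2, 37, 4),
]

users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})

matrix = {}  # (row, col) -> rating; absent keys mean "no rating"
for user, movie, rating in ratings:
    matrix[(users.index(user), movies.index(movie))] = rating
```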

Matrices can only contain numbers, so nonnumerical data is transformed. Dates are scaled, and one-hot encoding is used for columns with a fixed set of values.
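One-hot encoding can be sketched in a few lines: a column with a fixed set of values becomes one numeric column per distinct value. The category values below are illustrative.

```python
# One-hot encoding sketch: each distinct category becomes its own 0/1
# column, since a matrix can only hold numbers. Illustrative values.
def one_hot(values):
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

categories, encoded = one_hot(["comedy", "drama", "comedy"])
```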

DataFrames enable data to be transformed from a relational form to a matrix representation, facilitating linear algebra operations used in many machine learning algorithms. This flexibility allows data scientists to choose the most suitable representation for data analysis or model training.

Array databases, like TileDB, are specialized for storing large multidimensional arrays, commonly used for scientific datasets and time-series data in finance.
