Relational Model Versus Document Model

Dated May 28, 2023; last modified on Mon, 29 May 2023

A relational data model uses tables that consist of rows and columns. A row can be uniquely identified by a table + ID combination. A column entry can reference a row in another table through a shared key. One goal is to avoid duplicating data. However, to answer a real-world query, we end up paying the cost of joining results from multiple tables. That said, with proper indexing and prior research, combining results is pretty fast.

A document model is more like a one-stop shop for individual queries: good locality for single-entity queries, but we pay the cost when fetching population-level answers. Document models get lumped in with schemaless data models, but I’m not sure they’re equivalent; to an approximation, is a document model one big table? Either way, schemaless data models offer flexibility at the cost of no preprocessing, which might lead to scaling issues.

Funny that I used the terms “table”, “row”, and “column”, which happen to be SQL-specific terms. The relational data model, of which SQL is the most popular implementation, organizes the data into relations (tables in SQL), where each relation is an unordered collection of tuples (rows in SQL). Velcro moment.

Relational Model Versus Document Model

How can a résumé be represented in a relational schema? The profile as a whole can be identified by a unique identifier, user_id. Fields like first_name and last_name appear exactly once per user, so they can be modeled as columns on the users table. However, most people have had more than one job in their career, varying numbers of periods of education, and any number of pieces of contact information. The one-to-many relationship between a user and these items can be represented in various ways:

  • Put positions, education, and contact_info in separate tables, with a foreign key reference to the users table.
  • In later versions of SQL, multi-valued data can be stored within a single row, with support for querying and indexing, e.g., as XML or JSON data types.
  • Encode positions, education, and contact_info as JSON or XML, and store it in a TEXT column in the database. However, one can’t query the DB for values inside the encoded column.

I had the following schema:

CREATE TABLE users (user_id integer PRIMARY KEY, first_name text, last_name text, summary text);
CREATE TABLE positions (position_id integer PRIMARY KEY, title text, company text);
CREATE TABLE user_positions (user_id integer REFERENCES users, position_id integer REFERENCES positions, start_date date, end_date date);
CREATE TABLE institutions (institution_id integer PRIMARY KEY, institution_name text, degree text);
CREATE TABLE user_institution (user_id integer REFERENCES users, institution_id integer REFERENCES institutions, start_date date, end_date date);
CREATE TABLE contact_info (user_id integer REFERENCES users, info text);
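
Answering “show me user 251’s résumé” then requires joins across these tables, e.g., for positions (a sketch against the schema above; similar joins fetch education and contact info):

-- Fetch the user's name alongside every position they have held.
SELECT u.first_name, u.last_name, p.title, p.company, up.start_date, up.end_date
FROM users u
JOIN user_positions up ON up.user_id = u.user_id
JOIN positions p ON p.position_id = up.position_id
WHERE u.user_id = 251;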

But it feels as if the schema can be split further, e.g., institutions.degree can be extracted out, with degrees being their own entities.

Also didn’t know that later versions of SQL support queryable and indexable multi-value data in a column. Seems like SQL and NoSQL are converging?
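
For instance, PostgreSQL’s jsonb type supports both querying and indexing (a sketch; this positions column is a hypothetical alternative to the separate tables above):

-- Store the one-to-many positions inside the users row itself.
ALTER TABLE users ADD COLUMN positions jsonb;

-- Containment query: find users who held a position at Microsoft.
SELECT user_id FROM users WHERE positions @> '[{"organization": "Microsoft"}]';

-- A GIN index speeds up containment queries like the one above.
CREATE INDEX users_positions_idx ON users USING GIN (positions);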

For a data structure like a résumé, which is mostly a self-contained document, a JSON representation can be appropriate, e.g.,

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}

… with properties such as:

  • Lower impedance mismatch between the application code and the storage layer.
  • Locality: all the info in one place, and one query is sufficient.
  • The tree structure of the one-to-many relationships is explicit.

The use of IDs such as us:91 and 131 over free text such as Greater Seattle Area and Philanthropy enables:

  • Consistent style and spelling across profiles.
  • No ambiguity, e.g., when several cities share the same name.
  • Ease of updating, e.g., when a city changes its name.
  • Localization: a standardized list can be translated into the user’s language.
  • Better search, e.g., encoding that Greater Seattle Area is in Washington.

Anything that is meaningful to humans may need to change sometime in the future, but the ID, which has no meaning to humans, can stay the same. If information that is meaningful to humans changes, and that information is duplicated, all redundant copies need to be updated (write overheads) lest the data become inconsistent. Removing such duplication is the key idea behind normalization in databases.
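
A minimal sketch of that normalization, assuming a hypothetical regions table behind the region_id seen in the JSON example:

CREATE TABLE regions (region_id text PRIMARY KEY, region_name text);

-- Profiles store only region_id; the human-readable name lives in one place.
-- Renaming a city is a single-row update instead of touching every profile:
UPDATE regions SET region_name = 'Seattle Metropolitan Area' WHERE region_id = 'us:91';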

Are Document Databases Repeating History?

In the network (CODASYL) model of the 1970s, a record could have multiple parents. To access a record, one would follow an access path from a root record along chains of links (and keep track of the traversal, because different paths can lead to the same record). The code for querying and updating the database was complicated and inflexible.

EGR 277 might have had something to say about relational databases winning out over network databases, especially how the resurgence of document databases partially vindicates network databases.

By contrast, the relational model laid all the data in the open. The query optimizer automatically decides which parts of the query to execute in what order, and which indexes to use, effectively creating an access path that is abstracted away from the application developer. The application developer need not update their queries to take advantage of new indexes. Granted, query optimizers are complicated beasts with many years of R&D behind them, but every application that uses the database benefits from its query optimizer.

#databases

#innovation

Another complicated beast that consumes a lot of R&D is a programming language’s compiler, which comes with lots of optimizations that allow application developers to write relatively simple code.

In an ideal case, the abstraction is airtight, but the abstraction may leak, leading to complexity in the application code. For example, some SQL servers may be dramatically faster for WHERE a=b AND b=c AND a=c than for WHERE a=b AND b=c.

Document databases reverted to the hierarchical model by storing one-to-many relationships within the parent record. For many-to-one and many-to-many relationships:

  • Document databases use a document reference, and resolve it at read time using follow-up queries.
  • Relational databases use a foreign key, and resolve it at read time using a join, as sketched below.
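
For example, resolving the many-to-one region reference with a join, assuming users also carries the region_id column from the JSON example and the hypothetical regions table from earlier:

SELECT u.first_name, r.region_name
FROM users u
JOIN regions r ON r.region_id = u.region_id
WHERE u.user_id = 251;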

#databases

Relational Versus Document Databases Today

If the data has a document-like structure, splitting it into multiple tables can lead to cumbersome schemas and unnecessarily complicated application code. That said, data has a tendency to become more interconnected as features are added to applications.

Some references cannot be made directly in a document model, e.g., “the second item in the list of positions for user 251,” but as long as the documents are not too deeply nested, this is not usually a problem.

If there are many-to-many relationships, the document model becomes less appealing because of the work needed to keep denormalized data consistent, and because emulating joins in application code is slower than the specialized join code inside a relational database. For highly interconnected data, a graph model fits even better than a relational one.

#databases

While document databases are called schemaless, they are better described as schema-on-read: the structure of the data is only interpreted when it is read. Contrast this with relational databases’ schema-on-write, which is explicit and enforced by the database. The difference is most pronounced when changing the format of the data, e.g.,

// Document database: handle both formats in application code (schema-on-read).
if (user && user.name && !user.first_name) {
  // Documents written before Dec 8, 2013 don't have `first_name`.
  user.first_name = user.name.split(" ")[0];
}

-- Relational database: perform a migration up front (schema-on-write).
-- split_part is PostgreSQL; MySQL's equivalent is substring_index(name, ' ', 1).
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);

#databases

… and some tradeoffs may arise. For example, MySQL’s ALTER TABLE copies the entire table, which may mean hours of downtime for very large tables. UPDATE on a large table is slow on any database, but can be mitigated by setting a default of NULL and filling the column in at read time, as a document database would. Schema-on-read is also advantageous when there are many different types of objects, or when the structure of the data is determined by external systems that you don’t control.
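
A sketch of that mitigation, continuing the migration example above (the batched backfill is one optional way to converge without a single table-wide UPDATE):

-- Cheap: the new column defaults to NULL, so no existing rows need rewriting up front.
ALTER TABLE users ADD COLUMN first_name text DEFAULT NULL;

-- Optionally backfill in small batches instead of one giant UPDATE:
UPDATE users SET first_name = split_part(name, ' ', 1)
WHERE first_name IS NULL AND user_id BETWEEN 1 AND 10000;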

MySQL’s ALTER TABLE is still pretty slow, prompting workarounds by Percona, SoundCloud, GitHub, Facebook, Open Ark, etc.

If large parts of the document are needed at the same time, then the document model has a data locality advantage. However, updates to a document (especially those that change its size) may require rewriting the whole document, with significant performance implications. Some relational databases also group related data together for locality, e.g., Google’s Spanner database.
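
Spanner does this by interleaving child rows within the parent row’s storage; a sketch in Spanner’s DDL (schema simplified from the earlier example):

CREATE TABLE users (
  user_id INT64 NOT NULL,
  first_name STRING(MAX)
) PRIMARY KEY (user_id);

-- positions rows are stored physically close to their parent users row.
CREATE TABLE positions (
  user_id INT64 NOT NULL,
  position_id INT64 NOT NULL,
  title STRING(MAX)
) PRIMARY KEY (user_id, position_id),
  INTERLEAVE IN PARENT users ON DELETE CASCADE;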

#databases

#locality

References

  1. Designing Data-Intensive Applications: The big ideas behind reliable, scalable, and maintainable systems. Chapter 2: Data Models and Query Languages. Kleppmann, Martin. Mar 16, 2017.
  2. The Law of Leaky Abstractions. Joel Spolsky. www.joelonsoftware.com. en.wikipedia.org. Accessed May 28, 2023.
  3. sql - MySQL very slow for alter table query. stackoverflow.com. Accessed May 28, 2023.