A relational data model uses tables that consist of rows and columns. A row can be uniquely identified by a table + ID combination. A column entry can reference another row in another table through a shared key. One goal is to avoid duplicating data. However, to answer a real-world query, we end up paying the cost of joining results from multiple tables. That said, with proper indexing and decades of R&D behind query optimizers, combining results is pretty fast.
A document model is more like a one-stop shop for individual queries: good locality for single-entity queries, but we pay the cost when fetching population-level answers. Document models got lumped in with schema-less data models, but I’m not sure they’re equivalent; to an approximation, is a document model one big table? Either way, schema-less data models offer flexibility at the cost of no preprocessing, which might lead to scaling issues.
Relational Model Versus Document Model
How can a résumé be represented in a relational schema? The profile as a whole can be identified by a unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user, so they can be modeled as columns on the `users` table. However, most people have had more than one job in their career, varying numbers of periods of education, and any number of pieces of contact information. The one-to-many relationship between a user and these items can be represented in various ways:
- Put `positions`, `education`, and `contact_info` in separate tables, with a foreign key reference to the `users` table.
- In later versions of SQL, multi-valued data can be stored within a single row, with support for querying and indexing, e.g., as `XML` or `JSON` data types.
- Encode `positions`, `education`, and `contact_info` as JSON or XML, and store it in a `TEXT` column in the database. However, one can’t query the DB for values inside the encoded column.
I had the following schema:
CREATE TABLE users (user_id integer PRIMARY KEY, first_name text, last_name text, summary text);
CREATE TABLE positions (position_id integer PRIMARY KEY, title text, company text);
CREATE TABLE user_positions (user_id integer REFERENCES users, position_id integer REFERENCES positions, start_date date, end_date date);
CREATE TABLE institutions (institution_id integer PRIMARY KEY, institution_name text, degree text);
CREATE TABLE user_institution (user_id integer REFERENCES users, institution_id integer REFERENCES institutions, start_date date, end_date date);
CREATE TABLE contact_info (user_id integer REFERENCES users, info text);
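As a sketch of the join cost mentioned in the intro, reconstructing just the positions section of one user’s résumé (user 251, as in the JSON example below) already touches three tables:
-- Fetch a user's name plus all of their positions.
SELECT u.first_name, u.last_name, p.title, p.company,
       up.start_date, up.end_date
FROM users u
JOIN user_positions up ON up.user_id = u.user_id
JOIN positions p ON p.position_id = up.position_id
WHERE u.user_id = 251;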
But it feels as if it can be split further, e.g., `institutions.degree` could be extracted out, with degrees being their own entities.
Also, I didn’t know that later versions of SQL support queryable and indexable multi-valued data in a column. Seems like SQL and NoSQL are converging?
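For example, here is a sketch assuming PostgreSQL’s `jsonb` type (the column and index names are illustrative):
-- Store positions as a JSON array in a single queryable column.
ALTER TABLE users ADD COLUMN positions jsonb;
-- A GIN index supports fast containment queries over the JSON.
CREATE INDEX users_positions_idx ON users USING GIN (positions);
-- Find users who have held a position at Microsoft.
SELECT user_id FROM users
WHERE positions @> '[{"organization": "Microsoft"}]';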
For a data structure like a résumé, which is mostly a self-contained document, a JSON representation can be appropriate, e.g.,
{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}
… with properties such as:
- Lower impedance mismatch between the application code and the storage layer.
- Locality: all the info in one place, and one query is sufficient.
- The one-to-many relationships' tree structure is explicit.
The use of IDs such as `us:91` and `131` over free text such as `Greater Seattle Area` and `Philanthropy` enables:
- Consistent style and spelling across profiles.
- Avoiding ambiguity, e.g., several cities with the same name.
- Ease of updating, e.g., when a city changes its name.
- Standardized lists that can be localized into the user’s language.
- Better search, e.g., encoding that `Greater Seattle Area` is in `Washington`.
Anything that is meaningful to humans may need to change sometime in the future, but the ID, which has no meaning to humans, can stay the same. If information that is meaningful to humans changes, and that information is duplicated, all redundant copies need to be updated (write overheads) lest the data become inconsistent. Removing such duplication is the key idea behind normalization in databases.
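A minimal sketch of that idea, using the region from the JSON example: the human-meaningful name lives in exactly one row, and everything else references it by ID:
-- The name is stored once; profiles reference it via region_id.
CREATE TABLE regions (region_id text PRIMARY KEY, region_name text);
INSERT INTO regions VALUES ('us:91', 'Greater Seattle Area');
-- A (hypothetical) rename touches one row; no profile is rewritten.
UPDATE regions SET region_name = 'Seattle Metro Area'
WHERE region_id = 'us:91';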
Are Document Databases Repeating History?
The network (CODASYL) model of the 1970s had it such that a record could have multiple parents. To access a record, one would need to follow an access path from the root record along chains of links (and keep track of the traversal because different paths can lead to the same record). The code for querying and updating the database was complicated and inflexible.
EGR 277 might have had something to say about relational databases winning out over network databases, especially how the resurgence of document databases partially vindicates network databases.
By contrast, the relational model laid all the data out in the open. The query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use, effectively creating an access path that is abstracted away from the application developer. The application developer need not update their queries to take advantage of new indexes. Granted, query optimizers are complicated beasts with many years of R&D behind them, but all applications that use a database can benefit from its query optimizer.
Another complicated beast that consumes a lot of R&D is a programming language’s compiler, which comes with lots of optimizations that allow application developers to write relatively simple code.
In an ideal case, the abstraction is airtight, but the abstraction may leak, leading to complexity in the application code. For example, some SQL servers may be dramatically faster for `WHERE a=b AND b=c AND a=c` than for `WHERE a=b AND b=c`.
Document databases reverted to the hierarchical model by storing one-to-many relationships within the parent record. For many-to-one and many-to-many relationships:
- Document databases use a document reference in the document model, and resolve it at read time using follow-up queries.
- Relational databases use a foreign key in the relational model, and resolve it at read time using a join.
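A sketch of the contrast, reusing the `regions` table from the normalization example above and assuming (for illustration) that `users` also carries a `region_id` column:
-- Relational: the foreign key is resolved at read time with a join.
SELECT u.first_name, r.region_name
FROM users u
JOIN regions r ON r.region_id = u.region_id
WHERE u.user_id = 251;

-- Document: the application reads the profile document, finds
-- "region_id": "us:91", and issues a follow-up query to resolve it:
SELECT region_name FROM regions WHERE region_id = 'us:91';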
Relational Versus Document Databases Today
If the data has a document-like structure, splitting it into multiple tables can lead to cumbersome schemas and unnecessarily complicated application code. That said, data has a tendency to become more interconnected as features are added to applications.
Some references cannot be made directly in a document model, e.g., “the second item in the list of positions for user `251`,” but as long as the documents are not too deeply nested, this is not usually a problem.
If there are many-to-many relationships, the document model becomes less appealing because of the work needed to keep denormalized data consistent, and the slower emulations of joins, which could have been handled by more specialized code in a relational model. For highly interconnected data, the graph model is even better than a relational one.
While document databases are called schema-less, they are better described as schema-on-read because the structure of the data does get interpreted when it is read. Contrast this with relational databases’ schema-on-write, where the schema is explicit and enforced by the database. The difference becomes apparent when an application changes the format of its data, e.g.,
if (user && user.name && !user.first_name) {
// Documents written before Dec 8, 2013 don't have `first_name`.
user.first_name = user.name.split(" ")[0];
}
-- Perform a migration that adds first_name
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);
… and some tradeoffs may arise. For example, MySQL’s `ALTER TABLE` copies the entire table, which may mean hours of downtime for very large tables. `UPDATE` on a large table is slow on any database, but can be mitigated by setting a default of `NULL` and filling the value in at read time, like a document database would. Schema-on-read is also advantageous when there are many different types of objects, or when the structure of the data is determined by external systems that you don’t control.
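A sketch of that mitigation, continuing the migration example above (assumes PostgreSQL, as with `split_part`): add the column without backfilling, then compute missing values at read time:
-- Adding a nullable column is cheap; skip the table-wide UPDATE.
ALTER TABLE users ADD COLUMN first_name text; -- defaults to NULL
-- Fill the value in at read time, like a document database would.
SELECT COALESCE(first_name, split_part(name, ' ', 1)) AS first_name
FROM users
WHERE user_id = 251;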
MySQL’s `ALTER TABLE` is still pretty slow, prompting workarounds by Percona, Soundcloud, Github, Facebook, Open Ark, etc.
If large parts of the document are needed at the same time, then the document model has a data locality advantage. However, updates to a document (especially those that change its size) may need a full rewrite of the document, with significant perf implications. Some relational databases do group related data together for locality, e.g., Google’s Spanner database.
References
- Designing Data-Intensive Applications: The big ideas behind reliable, scalable, and maintainable systems. Chapter 2: Data Models and Query Languages. Kleppmann, Martin. Mar 16, 2017.
Funny that I used the terms “table”, “row”, and “column”, which happen to be SQL-specific terms. The relational data model, of which SQL is the most popular implementation, organizes data into relations, where each relation is an unordered collection of tuples. Velcro moment.