Query Languages for Data
If you have a list of animal species and you want to return only the sharks in the list, a relational algebra expression would be \( \text{sharks} = \sigma_{\text{family = “Sharks”}}(\text{animals}) \). SQL queries follow the structure of relational algebra closely:
SELECT * FROM animals WHERE family = 'Sharks';
… while an imperative query would be:
function getSharks() {
let sharks = [];
for (let i = 0; i < animals.length; i++) {
if (animals[i].family === "Sharks") {
sharks.push(animals[i]);
}
}
return sharks;
}
An imperative language tells the computer to perform certain operations in a certain order. A declarative language specifies the pattern of the data you want, but not how to achieve that goal.
Ah, so that’s what animals.filter(animal => animal.family === "Sharks")
is!
Advantages of declarative query languages:
- Typically more concise and easier to work with.
- Hides implementation details, and can therefore freely benefit from improvements in the query optimizer. An imperative query would need to be rewritten to take advantage of newer and better APIs and/or data layout.
- Lend themselves to parallelization because they do not specify an algorithm.
MapReduce Querying
MapReduce
is somewhere in between a declarative query language and a fully
imperative query API. Suppose you want to generate a report on how many sharks
sighted per month. In PostgreSQL, you’d have:
SELECT date_trunc('month', observation_timestamp) AS observation_month,
sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
… but the same can be expressed with MongoDB’s MapReduce
feature as:
db.observations.mapReduce(
// Called once for every document that matches `query`, with `this` set to the
// document object.
function map() {
let year = this.observationTimestamp.getFullYear();
let month = this.observationTimestamp.getMonth() + 1;
emit(`${year}-${month}`, this.numAnimals);
},
// The key-value pairs emitted by map are grouped by key, such that `reduce`
// is called once for every `key`, e.g., `reduce("2023-2", [23, 1, 43])`.
function reduce(key: string, values: numbers[]) {
return Array.sum(values);
},
{
// A MongoDB-specific extension to MapReduce that allows declarative filters
query: { family: "Sharks" },
// The final output is written to the collection `monthlySharkReport`.
out: "monthlySharkReport"
}
);
map
and reduce
must be pure functions (only depend on input params and have
no side effects). As such, the DB can run them anywhere in any order, and rerun
them on failure. That said, they are powerful in that they can parse strings,
call library functions, perform calculations, etc., and therefore convenient for
advanced queries.
However, one has to write two coordinated functions, and the imperativeness limits the opportunities for a query optimizer to improve the performance of a query. MongoDB 2.2 thus added a declarative query language called the aggregation pipeline:
db.observations.aggregate([
{ $match: { family: "Sharks" } },
{ $group: {
_id: {
year: { $year: "$observationTimestamp" },
month: { $month: "$observationTimestamp" }
},
totalAnimals: { $sum: "$numAnimals" }
} }
]);
References
- Designing Data-Intensive Applications: The big ideas behind reliable, scalable, and maintainable systems. Chapter 2: Data Models and Query Languages. Kleppmann, Martin. Mar 16, 2017.
To be fair though, JS’s
Array.prototype.filter
supports more succinct code: