Exposing pagination through URLs like example.com/items?page=25&limit=100
leads to SQL like
SELECT * FROM `items` LIMIT 100 OFFSET 2400;, which
inefficiently scans 2,500 records and discards the first 2,400.
With relative cursors, we have URLs like example.com/items?limit=50&lastID=XXX,
which lead to SQL like
SELECT * FROM `items` WHERE `id` > XXX ORDER BY `id` ASC LIMIT 50;, and this is
more efficient, assuming there is an index on `id`.
The important part is including enough information in the URL to form an
efficient SQL query that doesn’t skip/replay records. For example, if the
cursor column is not a primary key, then a URL like example.com/items?limit=50&lastTitle=XXX
will skip/replay records whose title is XXX but whose relative order is higher
than 50. Combining the title with a primary key is enough to disambiguate such ties.
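The composite cursor can be sketched in Python with SQLite (the table, column names, and data here are hypothetical; the article’s own queries are MySQL-style): comparing the row value (title, id) against the cursor keeps pagination stable even when many rows share a title.

```python
import sqlite3

# Hypothetical schema: a primary key plus a non-unique title column,
# with an index to back the cursor query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_items_title_id ON items (title, id)")
conn.executemany(
    "INSERT INTO items (id, title) VALUES (?, ?)",
    [(i, f"title-{i % 3}") for i in range(1, 201)],  # many duplicate titles
)

def page_after(last_title, last_id, limit):
    # The row-value comparison (title, id) > (last_title, last_id)
    # disambiguates rows sharing a title, so nothing is skipped or replayed.
    return conn.execute(
        "SELECT id, title FROM items "
        "WHERE (title, id) > (?, ?) "
        "ORDER BY title, id LIMIT ?",
        (last_title, last_id, limit),
    ).fetchall()

first = page_after("", 0, 50)
second = page_after(first[-1][1], first[-1][0], 50)  # cursor = last row seen
```

Each response only needs to hand the client the (title, id) of its last row; the client echoes it back as the cursor for the next page.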
In CS, substring search is a foundational problem for which there are
\(O(m + n)\) algorithms. But VB’s implementation was \(O(mn)\). Suppose we
call InStr("apple banana orange", "banana"):
- Skip to the first b.
- Check if banana is a substring starting from that position. If banana was not found, skip to the next b.
- Try again, and so forth.
In the worst case, we have no match, and a
query that makes us do a lot of
work before returning false, e.g. searching for aaab in a long string of a’s.
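The steps above can be sketched in Python (a hypothetical stand-in for VB’s InStr, using str.find as the “skip to the next b” step; like InStr, it returns 1-based positions and 0 when absent):

```python
def instr(haystack, needle):
    """Brute-force substring search in the spirit of VB's InStr:
    1-based index of the first match, 0 if absent."""
    if not needle:
        return 1
    first = needle[0]
    i = haystack.find(first)                  # skip to the first b
    while i != -1:
        if haystack.startswith(needle, i):    # check the candidate position
            return i + 1                      # VB positions are 1-based
        i = haystack.find(first, i + 1)       # skip to the next b
    return 0

print(instr("apple banana orange", "banana"))  # → 7
```

The outer skip and inner check each cost up to \(O(n)\) and \(O(m)\) respectively, giving the \(O(mn)\) worst case, but on typical text the skip rarely stops and the check rarely runs long.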
Why would VB go with an \(O(mn)\) algorithm?
- The skip-to function is a single x86 machine instruction.
- Most VB devs are querying normal strings, not DNA sequences.
In practice, the brute-force algorithm is mostly \(O(n + m)\). By the time preprocessing is done for asymptotically \(O(m + n)\) algorithms, the naive brute-force algorithm will have given you the answer.
Commentary from Hacker News
New grads tend to have exposure to new tools. However, change should be motivated by a clear benefit, e.g. Python 3 has file system improvements and f-strings are more readable. While the users may not benefit, devs will.
Start with simple working implementations. Profile before you write fancier ones. But even with the simple implementations, do the basics: cover DB queries with indexes, use maps/sets for frequent lookups, avoid re-renders of UI that doesn’t change, etc.
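The maps/sets point, as a hypothetical Python sketch: build an index once instead of scanning a list on every lookup.

```python
# Hypothetical data: looking records up by id, many times.
users = [{"id": i, "name": f"user{i}"} for i in range(10_000)]

def find_slow(uid):
    # Linear scan: O(n) work repeated on every single lookup.
    return next(u for u in users if u["id"] == uid)

# One-time index, then average O(1) per lookup.
by_id = {u["id"]: u for u in users}

assert find_slow(1234) is by_id[1234]
```

Neither version is “fancy”; the dict is just the basic structure that matches the access pattern.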
A galactic algorithm is one that runs faster than any other algorithm for problems that are sufficiently large, but where “sufficiently large” is so big that the algorithm is never used in practice. (en.wikipedia.org)
Develop an understanding of real-world data: caches, latencies, optimized instructions, etc. For small \(n\), the constants matter, e.g. how does your hash function fare against linearly searching a flat but contiguous array? What is the costly operation? Moving stuff around might be your undoing…
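One way to check such constants is to measure rather than assume, e.g. this hypothetical timeit sketch (results are machine-dependent; for tiny \(n\) a contiguous scan can be competitive with hashing):

```python
import timeit

small = tuple(range(8))        # tiny, contiguous collection
small_set = set(small)

# For small n, constant factors (hashing vs. a cache-friendly scan)
# dominate the asymptotic difference, so time both on your hardware.
t_scan = timeit.timeit(lambda: 7 in small, number=100_000)
t_hash = timeit.timeit(lambda: 7 in small_set, number=100_000)
print(f"linear scan: {t_scan:.4f}s  hash lookup: {t_hash:.4f}s")
```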
Know which battles to fight. 50x improvement? If your algorithm now takes 2K ns, but hitting the file system takes billions of ns, then there’s no perceivable benefit to the user. But if you’re going from 50 min to 1 min, awesome!
Pagination with Relative Cursors. Drew Martin. Shopify Engineering (shopify.engineering). Aug 12, 2019.