> I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and come across this serendipity was because 9 months ago, faced with the choice to either do the bad easy thing or the good nothing, I chose to do the bad easy thing. The SQLite database worked! I understood how it worked, behind the scenes with its B-trees and its Full Text Search extension.
This is the most important takeaway, imo, and a very valuable technique: Start with the obvious, stupid solution that definitely works. Then do the optimized version, while making sure it matches the naive implementation. In this case, the optimized version could even be generated from the naive one.
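A minimal sketch of what that technique looks like in practice (hypothetical names and toy data, not the article's actual code): write the dumb version, write the fast version, then hammer the fast one against the dumb one on random inputs until you trust it.

```python
import bisect
import random
import string

def contains_naive(words, query):
    # The obvious, stupid solution: linear scan. Definitely works.
    return any(w == query for w in words)

def contains_fast(sorted_words, query):
    # The optimized version: binary search over a sorted list.
    i = bisect.bisect_left(sorted_words, query)
    return i < len(sorted_words) and sorted_words[i] == query

random.seed(0)
words = ["".join(random.choices(string.ascii_lowercase, k=3)) for _ in range(500)]
sorted_words = sorted(words)

# Make sure the optimized version matches the naive implementation.
for _ in range(2000):
    q = "".join(random.choices(string.ascii_lowercase, k=3))
    assert contains_fast(sorted_words, q) == contains_naive(words, q)
```

The naive version doubles as a test oracle for free, which is exactly why it's worth writing first.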
Apparently the structure itself has a bit of a history. The word 'rediscovery' tipped me off to go to Wikipedia myself and read up more about this.
Sure enough, the first paragraph on the Wikipedia entry for DAFSA is:
> DAFSA is the rediscovery of a data structure called Directed Acyclic Word Graph (DAWG)
First, Blumer et al. (1983) came up with a "DAWG", but reading the abstract [1] left me a little confused as to how exactly we get from 'here is how to store all substrings of a string in O(|string|) space, with "is this a substring? [y/n]" recognition in O(|substring|) time' to the modern DAFSA, as cool and useful as that is. Come to think of it, I bet I could use that in some LeetCode problems.
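To make that space claim concrete, here's the naive baseline (just a sketch, not Blumer et al.'s construction): materializing every substring in a set costs up to quadratically many entries, while the DAWG answers the same yes/no queries out of only O(|string|) states.

```python
def all_substrings(s):
    # Naive: materialize every substring -- O(|s|^2) of them.
    # The DAWG answers the same membership queries with O(|s|) states.
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

subs = all_substrings("banana")
assert "nan" in subs      # is this a substring? yes
assert "nab" not in subs  # no
# "banana" has 21 (start, end) pairs but only 15 distinct substrings:
# repeats collapse in the set, which is the same sharing the automaton
# exploits structurally.
print(len(subs))  # 15
```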
But the structure we actually think of as a DAWG or DAFSA (or FST, I guess, thanks to this Rust crate) comes from Appel and Jacobson's paper "The World's Fastest Scrabble Program". That worked, but you had to construct a whole trie first and then compact it down, so the build was a memory hog. Then Dr. Daciuk of 3city sharpened the blade in 2000: that was about as good as it gets in the unsorted case, but on a sorted set you could build the DAFSA incrementally, because an ever-larger part of the graph you were building was already minimized.
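Here's a toy sketch of that trie-then-compact pipeline (hypothetical word list and representation; real implementations, including Daciuk's incremental variant, are more careful than this): build the full trie, then merge equivalent states bottom-up by hash-consing their signatures.

```python
# Hypothetical word list; a real dictionary would be vastly larger.
words = sorted(["tap", "taps", "top", "tops"])

# Phase 1: build the whole trie. This is the memory hog -- every word
# gets its own path even when suffixes repeat across words.
trie = {}
for w in words:
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})
    node["$"] = {}  # end-of-word marker edge

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

# Phase 2: compact. Two nodes are equivalent when their outgoing edges
# lead to equivalent states; merge them bottom-up by registering each
# distinct signature once. (No memoization -- fine for a toy.)
registry = {}
def compact(node):
    sig = tuple(sorted((ch, compact(child)) for ch, child in node.items()))
    return registry.setdefault(sig, sig)
compact(trie)

print(count_nodes(trie), "trie nodes ->", len(registry), "DAFSA states")
# 12 trie nodes -> 6 DAFSA states
```

The shared "-p", "-ps" suffix structure of "tap(s)"/"top(s)" collapses into one sub-automaton; Daciuk's insight was that with sorted input you can do this merging as you go instead of holding the whole trie in memory first.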
And then from there BurntSushi got involved with the implementation and the rest is history.
[1]: https://www.sciencedirect.com/science/article/pii/0304397585...
(That's what I can glean from ~30 minutes of not particularly focused reading. Forgive me if I have made any mistakes.)