Under the hood, this model resembles LDA, but replaces its Dirichlet priors with Pitman–Yor Processes (PYPs), which better capture the power-law behavior of word distributions. It also supports arbitrary hierarchical priors, allowing metadata-aware modeling.
For example, in an earnings-transcript corpus, a typical LDA might have a flat structure:
Prior → Document
Our model instead uses a hierarchical graph:
Uniform Prior
→ Global Topics
→ Ticker
→ Quarter
→ Paragraph
This hierarchical structure, combined with the PYP statistics, consistently yields more coherent and fine-grained topic structures than standard LDA does. There’s also a “fast mode” that collapses some hierarchy levels for quicker runs; it’s a handy option if you’re curious to see the impact hierarchy has on the model results (or in a rush).
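To make the power-law point concrete, here is a minimal sketch of the Chinese restaurant process view of a Pitman–Yor process. It is a textbook construction, not our production sampler; the function name and parameters (`crp_table_sizes`, `discount`, `concentration`) are chosen for illustration only. Setting `discount=0` recovers the Dirichlet-process case, where the number of clusters grows only logarithmically.

```python
import random


def crp_table_sizes(n_customers, discount, concentration, seed=0):
    """Seat customers one by one under a Pitman-Yor Chinese restaurant process.

    discount=0 gives the Dirichlet-process special case.
    Returns the list of table (cluster) sizes.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    if concentration <= 0:
        raise ValueError("this sketch assumes concentration > 0")

    rng = random.Random(seed)
    tables = []
    for n in range(n_customers):
        k = len(tables)
        # Probability of opening a new table grows with the number of tables
        # already open; this is what produces the power-law tail.
        p_new = (concentration + discount * k) / (n + concentration)
        if rng.random() < p_new:
            tables.append(1)
        else:
            # Existing table j is chosen with weight (size_j - discount).
            weights = [size - discount for size in tables]
            j = rng.choices(range(k), weights=weights)[0]
            tables[j] += 1
    return tables


dirichlet_like = crp_table_sizes(2000, discount=0.0, concentration=1.0)
pitman_yor = crp_table_sizes(2000, discount=0.5, concentration=1.0)
print(len(dirichlet_like), "tables with discount=0")
print(len(pitman_yor), "tables with discount=0.5")
```

With a positive discount, many more small clusters appear alongside a few large ones, which is much closer to how word frequencies behave.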
We have some more technical write-ups on the internals of the model that are not hosted publicly (we have some ongoing publication efforts applying these models to scRNA sequencing). But feel free to shoot me an email (in my profile) and I'd be happy to send over some of our more technical documents.
This could become the missing piece for RAG with LLMs for company data. Every query that requires a lookup can use this model, and then an agentic LLM can crawl through the hierarchy of results to extract the relevant information for the user's query. I suspect that'll work much better than the current methods of chunking and storing data with metadata like title and author in a vector database and then performing a hybrid search.
That's actually an application we've had a lot of success in. This framework makes it easy to traverse the graph at a thematic level (with SQL filtering if needed); then, for any high-level theme, you can pull up granular excerpts. This site itself is actually just a thin wrapper over our API (https://docs.sturdystatistics.com/).
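As a rough illustration of that traverse-then-drill-down pattern, here is a minimal sketch over an in-memory tree. The `Theme` class, `collect_excerpts`, and the toy data are all hypothetical and invented for this example; they are not the API's actual types or responses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Theme:
    """One node in a hypothetical theme hierarchy."""
    name: str
    children: List["Theme"] = field(default_factory=list)
    excerpts: List[Dict[str, str]] = field(default_factory=list)


def collect_excerpts(theme, keep=lambda excerpt: True):
    """Depth-first walk that gathers every excerpt under `theme`.

    `keep` plays the role of an optional metadata filter.
    """
    found = [e for e in theme.excerpts if keep(e)]
    for child in theme.children:
        found.extend(collect_excerpts(child, keep))
    return found


# Toy data, invented for illustration only.
root = Theme("all results", children=[
    Theme("policies", children=[
        Theme("leash rules", excerpts=[
            {"source": "doc-1", "text": "placeholder excerpt about leash rules"},
        ]),
        Theme("park access", excerpts=[
            {"source": "doc-2", "text": "placeholder excerpt about park access"},
        ]),
    ]),
    Theme("products", excerpts=[
        {"source": "doc-1", "text": "placeholder excerpt about gear"},
    ]),
])

policies = root.children[0]
policy_excerpts = collect_excerpts(policies)
doc_1_excerpts = collect_excerpts(root, keep=lambda e: e["source"] == "doc-1")
```

An agent could walk the same structure top-down, deciding at each level which themes are worth expanding before it reads any excerpts.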
I'm an individual, experienced FAANG software engineer looking to build something in this space. Lmk if you want to chat about building something together
I love this concept! I have always believed that the old methodologies used in NLP and statistics can be better and faster than new LLM technologies like embeddings, depending on the scenario. Will the code be open-sourced someday? I'm thrilled to learn from it.
I think there is so much value and room to grow by leveraging a statistical foundation. We’re still iterating really quickly on the low-level C code across a variety of applications (pharma, scRNA, text), so it might be a while before we release it standalone.
We do offer an API layer over the low-level statistics code (the website is a light layer above it), focused on making it super easy to apply to language data, if you are interested in playing around with it: https://docs.sturdystatistics.com
I did a Google search for "camping with dogs" and it organized the results into a set of about 30 results that span everything I'd want to know on the topic, from safety and policies to products and travel logistics.
Awesome, so glad the results were helpful! What's cool is that because it's built on hierarchical Bayesian sampling, it is extremely robust to any input; it just kinda works.
I see that the model has not yet finished training; I think you are referring to the "Raw Search Results" section.
Our tool works a little differently from LLM-style tools. We do a bulk search (for academic search, ~1000 papers) and then train a hierarchical Bayesian model to organize the results. Once the model trains, it provides a visual representation of the high-level themes that you can then use to explore the results.
The trade-off is that we are willing to lower the relevance filter to enable broader exploration.
The URL is unique to your search and saves its state!
In the technical notes I sort of laid out our model graph on the document branch. We also have a topic branch that is structured hierarchically: Uniform Prior → High Level Topic Word → Granular Topics → Document Level Variation in Topics. We just directly visualize that hierarchical representation in the sunburst.
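For anyone curious how a topic hierarchy maps onto a sunburst, here is a minimal sketch of the geometry only: each node gets an angular span proportional to its weight, nested inside its parent's span. The function names and the toy weights are made up for illustration, and this is not the site's rendering code.

```python
import math


def total(node):
    """Weight of a node: a number for a leaf, the sum of children otherwise."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


def sunburst_arcs(tree, start=0.0, end=2 * math.pi, depth=1, path=()):
    """Return (path, depth, start_angle, end_angle) for every node in `tree`.

    Children split their parent's angular span in proportion to their weight.
    """
    arcs = []
    whole = total(tree)
    angle = start
    for name, child in tree.items():
        span = (end - start) * total(child) / whole
        here = path + (name,)
        arcs.append((here, depth, angle, angle + span))
        if isinstance(child, dict):
            arcs.extend(sunburst_arcs(child, angle, angle + span, depth + 1, here))
        angle += span
    return arcs


# Toy weights, invented for illustration: high-level topics -> granular topics.
topics = {
    "policies": {"leash rules": 3, "park access": 1},
    "products": {"gear": 2},
    "safety": 2,
}

arcs = sunburst_arcs(topics)
for path, depth, a0, a1 in arcs:
    print("  " * (depth - 1) + path[-1], round(math.degrees(a1 - a0), 1), "degrees")
```

The inner ring is the set of depth-1 arcs and each outer ring is one level deeper, so the picture is a direct drawing of the hierarchy.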
The low-level model graph is all written in C and exports granular annotations of the model graph. We use the model output to annotate the original text data. We do some work to store these hierarchical results in a SQL-queryable format in DuckDB.
What's cool about this process is it's all annotation-based. You can query data at the topic level, analyze topics in SQL, and at any point pull up the exact excerpts to which the high-level data refers.
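Here is a minimal sketch of what annotation-based querying can look like. To keep it runnable without extra installs it uses Python's built-in `sqlite3` as a stand-in for DuckDB, and the `annotations` table, its columns, and the rows are all hypothetical, not our actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE annotations (
        doc_id  INTEGER,  -- which source document the excerpt came from
        excerpt TEXT,     -- the annotated span of original text
        theme   TEXT,     -- high-level topic (hypothetical column)
        topic   TEXT,     -- granular topic (hypothetical column)
        weight  REAL      -- how strongly the excerpt expresses the topic
    )
    """
)

# Toy rows, invented for illustration only.
rows = [
    (1, "placeholder excerpt about leash rules", "policies", "leash rules", 0.9),
    (2, "placeholder excerpt about park access", "policies", "park access", 0.7),
    (3, "placeholder excerpt about gear", "products", "gear", 0.8),
    (1, "placeholder excerpt about ticks", "safety", "ticks", 0.6),
]
con.executemany("INSERT INTO annotations VALUES (?, ?, ?, ?, ?)", rows)

# Step 1: analyze at the theme level.
theme_totals = con.execute(
    """
    SELECT theme, SUM(weight) AS total_weight
    FROM annotations
    GROUP BY theme
    ORDER BY total_weight DESC
    """
).fetchall()

# Step 2: drill down from a theme to the exact excerpts behind it.
policy_excerpts = con.execute(
    "SELECT doc_id, excerpt FROM annotations WHERE theme = ? ORDER BY weight DESC",
    ("policies",),
).fetchall()
```

Because every aggregate is computed from rows that still carry the original excerpt, there is always a path from a high-level number back to the text it came from.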
> Curious what you've been using it to search for?
For starters I've done some trivial things, like "emacs elisp" on HackerNews and now "git tutorial" on AcademicSearch. The latter is still running and organizing results. But the results don't seem to be relevant to "git".
I'll do some searches in French and German later to see how it works with foreign languages (not searching on HackerNews, obviously ;-)
Quick update: I ran into a rate limit issue for one of my data sources. Apologies to anyone who has hit errors in the past 15 minutes. I think the issue should be resolved.
What is the go-to "production" stack for something like this nowadays? Is Stan dead? Do you do HMC or approximations with e.g. Pyro?
On top of the C code, we built a Python wrapper to help construct arbitrary Dirichlet and Pitman–Yor process graphs.
From there, we have some Python wrappers and store it all in a hierarchical DuckDB schema for fast query access.
The site itself is actually just a light wrapper around our API that simplifies this process.
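To give a feel for what "constructing a graph" means here, this is a minimal sketch of declaring a chain of priors in Python. `PriorNode`, `child`, and `lineage` are hypothetical names made up for this comment; they are not the wrapper's real interface, and no sampling happens in this snippet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriorNode:
    """One level in a hypothetical prior graph."""
    name: str
    kind: str = "pitman_yor"          # or "dirichlet"
    parent: Optional["PriorNode"] = None

    def child(self, name, kind="pitman_yor"):
        """Declare a new level whose prior is centered on this node."""
        return PriorNode(name, kind, parent=self)


def lineage(node):
    """Render the chain of priors from the root down to `node`."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.parent
    return " → ".join(reversed(names))


# The earnings-transcript hierarchy described in this thread.
root = PriorNode("Uniform Prior", kind="dirichlet")
paragraph = (
    root.child("Global Topics")
        .child("Ticker")
        .child("Quarter")
        .child("Paragraph")
)
print(lineage(paragraph))
```

The design point is that each level only needs to know its parent, so adding a metadata level is one more `child` call rather than a new model.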
Does this work on any type of data?
https://sturdystatistics.com/deepdive?fast=0&q=reinforcement...
I think only 1/10 of the articles are really on topic.
BTW, the circular graphics of the results are really cool! How did you do this?
The doc also explains the UX issue with a simple sunburst graph, which is why a tiered sunburst graph is used instead.