As many others pointed out, the released files are nearly nothing compared to the full dataset. Personally I've been fiddling a lot with OSINT and analytics over the publicly available Reddit data (a considerable amount of my spare time over the last year) and the one thing I can say is that LLMs are under-performing (huge understatement) - they are borderline useless compared to traditional ML techniques. But as far as LLMs go, the best performers are the open source uncensored models (the most uncensored and unhinged), while the worst performers are the proprietary and paid models, especially over the last 2-3 months: they have been nerfed into oblivion - to the extent that a simple prompt like "who is eligible to vote in US presidential elections" is considered a controversial question. So in the unlikely event that the full files are released, I personally would look at the traditional NLP techniques long before investing any time into LLMs.
On the limited dataset: Completely agree - the public files are a fraction of what exists and I should have mentioned that it is not all files but all publicly available ones. But that's exactly why making even this subset searchable matters. The bar right now is people manually ctrl+F-ing through PDFs or relying on secondhand claims. This at least lets anyone verify what is public.
On LLMs vs traditional NLP: I hear you, and I've seen similar issues with LLM hallucination on structured data. That's why the architecture here is hybrid:
- Traditional exact regex/grep search for names, dates, identifiers
- Vector search for semantic queries
- LLM orchestration layer that must cite sources and can't generate answers without grounding
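A minimal sketch of how the three layers in the list above could fit together, not the project's actual code. Everything here is an assumption for illustration: the toy `DOCS` corpus, the function names, and in particular the bag-of-words cosine score, which only stands in for a real embedding-based vector search.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; IDs and text are made up for illustration.
DOCS = {
    "DOC_001": "Memo dated 2002-03-14 about the office relocation.",
    "DOC_002": "Email about scheduling a dinner meeting.",
    "DOC_003": "Transcript mentioning travel to Springfield.",
}

def exact_search(pattern, docs):
    """Return IDs of documents whose text matches the regex (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if rx.search(text)]

def _bow(text):
    """Bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_search(query, docs, k=2):
    """Stand-in for vector search: rank by bag-of-words cosine similarity."""
    q = _bow(query)
    scored = [(_cosine(q, _bow(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)[:k] if score > 0]

def retrieve(query, pattern=None, docs=DOCS):
    """Exact hits first, then semantic hits, de-duplicated with order preserved."""
    hits = exact_search(pattern, docs) if pattern else []
    for doc_id in semantic_search(query, docs):
        if doc_id not in hits:
            hits.append(doc_id)
    return hits

def grounded_context(query, pattern=None, docs=DOCS):
    """Return (doc_id, text) pairs to hand to the LLM, or None if nothing was retrieved."""
    hits = retrieve(query, pattern, docs)
    return [(d, docs[d]) for d in hits] or None
```

In this sketch the orchestration layer would refuse to call the model at all when `grounded_context` returns `None`. That guarantees there is always something to cite, but it does not by itself guarantee the answer is derived from it.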
"can't" seems like quite a strong claim. Would you care to elaborate?
I can see how one might use a JSON schema that enforces source references in the output, but there is no technique I'm aware of to constrain a model to only come up with data based on the grounding docs, vs. making up a response based on pretrained data (or hallucinating one) and still listing the provided RAG results as attached reference.
It feels like your "can't" would be tantamount to having single-handedly solved the problem of hallucinations, which if you did, would be a billion-dollar-plus unlock for you, so I'm unsure you should show that level of certainty.
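To make the distinction concrete, here is a hedged sketch of what a post-hoc citation check can and cannot do. The answer layout (`claims` with `text`, `source_id`, `quote`) is a hypothetical schema, not something the project is known to use.

```python
import re

def _norm(s):
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return re.sub(r"\s+", " ", s).strip().lower()

def check_citations(answer, retrieved):
    """Validate a structured answer against the retrieved documents.

    answer:    {"claims": [{"text": str, "source_id": str, "quote": str}, ...]}
    retrieved: {doc_id: full_text}
    Returns a list of (claim_index, problem); an empty list means every claim
    cites a retrieved document and its quote appears verbatim in that document.
    """
    problems = []
    for i, claim in enumerate(answer.get("claims", [])):
        src = claim.get("source_id")
        quote = claim.get("quote", "")
        if src not in retrieved:
            problems.append((i, "unknown source"))
        elif not quote or _norm(quote) not in _norm(retrieved[src]):
            problems.append((i, "quote not found in source"))
    return problems
```

A check like this catches invented source IDs and invented quotes. It does not establish that the claim text actually follows from the quote, which is the part that stays unsolved.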
I understand uncensored in the context of LLMs, what is unhinged? Fine tuning specifically to increase likelihood of entering controversial topics without specific prompting?
I keep thinking that the lack of children’s faces in the blacked out rectangles makes the files much less shocking. I wonder if AI could put back fake images to make it clearer to people how sick all this is.
A lot of people are now struggling to detect which images are AI generated, and inferring reality from illusions.
To an extent, this was already the case with many other things, including stuff that was expressly labelled as fiction. But I recall the old quote about fooling all of the people some of the time and some of the people all of the time: it is now easier to fool more people all the time and to fool all people an increasing fraction of the time.
This isn't only limited to fake pics of kids, but kids are weak and struggle to defend themselves, and in this context the tools faking them seem to me likely to increase rates of harm against them.
The history of age of consent laws including Pitcairn Island, the observed results of sexualised deepfakes in classrooms by other students, and the observation that according to sexual therapists "fetishisation" is the development of a sexual response and conversion into a requirement over the course of repeated exposure rather than any innate tendency that a person is born with.
> Mr. Gates, in turn, praised Mr. Epstein’s charm and intelligence. Emailing colleagues the next day, he said: “A very attractive Swedish woman and her daughter dropped by and I ended up staying there quite late.”
What if I told you that the child sitting on Epstein's lap, the teenager he French-kissed, the girl whose skin he covered with fragments from Nabokov's Lolita, the one who had an entire corridor filled with her pictures in one of his properties, who appeared in every framed photograph on his desk and whose name is on the CD-ROMs, the only woman Epstein said he would ever marry – what if that girl is the daughter Bill Gates mentions? And that she and her mother were Epstein's main romantic interests and most percussive tools?
Please create a way to share conversations. I think that can be really relevant here
I am not a huge fan of AI but I allow this use case. This is really good in my opinion
If you add the ability to share convos, I hope you can also make those convos archivable in web.archive.org/wayback machine
So I am thinking that instead of having some random UUID, it can have something like https://duckduckgo.com/?q=hello+test (the query parameter for hello test)
Maybe it's me but archive can show all the links archived by it of a particular domain, so if many people ask queries and archive them, you almost get a database of good queries and answers. Archive features are severely underrated in many cases
Shareable conversations would definitely make the tool more useful yeah.
I really like the query parameter approach over UUIDs, since it would make links human-readable.
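A small sketch of the query-parameter idea using only the Python standard library. `BASE_URL` and the parameter name `q` are placeholders, not the site's real endpoint.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://example.org/chat"  # hypothetical endpoint

def share_url(query):
    """Build a human-readable, archivable link that carries the query itself."""
    return f"{BASE_URL}?{urlencode({'q': query})}"

def query_from_url(url):
    """Recover the original query from a shared link."""
    return parse_qs(urlparse(url).query).get("q", [""])[0]
```

Note this only encodes a single question, not a multi-turn conversation, and an archived snapshot preserves whatever answer was rendered at capture time rather than a reproducible one.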
Feedback: This agent didn't really work well when I tried it with a specific non-famous, but definitely publicly known individual with known connections to Epstein. I'd rather not post a specific name here. I found more documents with keyword searches. I guess it did get me to the conclusion that there wasn't much out there, but it didn't even mention stuff that showed up in name keyword searches.
To replicate though, you might look at the list of individuals mentioned in the brief email from Epstein to Bannon a couple weeks before Epstein died containing ~30 names and how your engine works with each one. See how a keyword search does on Library of Congress vs your agent.
Thanks for testing this. The Bannon email from June 30, 2019 is in there (HOUSE_OVERSIGHT_029622). Good stress test idea.
Couple things happening:
- Semantic search limitation: Less-famous names don't have strong embeddings, so it defaults to general connections rather than specific mentions
- Keyword search gap: You're right, raw grep can catch exact names I'm missing
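One way to turn the suggested stress test into something repeatable: for each name, compare what a plain keyword search finds against what the agent returns. This is a sketch under assumptions; `agent_search` is a hypothetical callable that returns the document IDs the agent cited for a name.

```python
import re

def keyword_hits(name, docs):
    """IDs of documents containing the exact name (case-insensitive, whole words)."""
    rx = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    return {doc_id for doc_id, text in docs.items() if rx.search(text)}

def recall_gaps(names, docs, agent_search):
    """Map each name to the documents keyword search found but the agent did not return."""
    report = {}
    for name in names:
        missed = keyword_hits(name, docs) - set(agent_search(name))
        if missed:
            report[name] = sorted(missed)
    return report
```

Any non-empty entry in the report is a case where raw grep beat the agent, which is the gap described above.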
I saw a similar problem. Roger Schank had some conversations with Epstein and the emails can be seen in Epsteinvisualizer.com but your site claimed there were no emails or connection. To be fair to Roger, who was an AI legend of his time and someone I knew personally before his untimely death, he really was not a pedo, and most likely never got involved with the girls, I think he and Epstein just talked about AI and education mostly.
Trump famously told New York Magazine in 2002: "I've known Jeff for 15 years. Terrific guy. He's a lot of fun to be with. It is even said that he likes beautiful women as much as I do, and many of them are on the younger side."
Trump and Epstein were social acquaintances in Palm Beach and New York circles during the 1990s-early 2000s. They socialized together at Mar-a-Lago and other venues
This is a good idea. One thing I never understand about these kinds of projects though: why are the standard questions provided to the user as prompts never cached?
"He participated regularly in paying money to force me to ___ with him and he was present when my uncle murdered my newborn child and disposed of the body in Lake Michigan."
https://www.freep.com/story/news/local/michigan/2025/12/27/a.... This mentions the Trump angle. It also mentions that the report came out before the 2020 election and could be fake. I'm a little confused because the report itself says nothing about Trump so I don't know where the Free Press gets that, and they don't tell you what the source is, or I missed it.
Edit: Oh I get it. In the woman's statement Donald Trump is named as one of the witnesses. She says that he watched the murder. He wasn't the uncle. He is listed as a witness to the murder. This is highly highly suspect in my opinion. Seems very sensationalistic and no reason is given as to why Trump was there. His name is just thrown in.
The allegation is quite clearly that Trump participated in [ redacted ] this pregnant 13 year-old.
> [Trump] participated regularly in paying money to force me to [ redacted ] with him
The reason he was allegedly there was probably to [ redacted ] a 13 year old... That's what convicted rapists with deep connections to child sex traffickers do...?
I would expect a large portion of the remaining records to be internal emails about memos about the process of building a case around evidence, rather than the root evidence itself.
Not that that would excuse the administration's unlawful behavior so far, or indicate the unreleased 99% can't have some big bombshells.
Ah, yes. Post is an LLM-something project: top comment is a general critique of LLMs. Waiting for this to get old. Meanwhile, at least you get points for being funny.
I think the GP was unfairly downvoted, as their comment wasn't a critique of LLMs but a comical attempt at critique of the source files themselves being redacted into uselessness.
All these attempts look like an emulation of "Pen (software) is mightier than Sword", or a belief that if only more people believed in the cause, we would be close to resolution.
Remember folks, soft power is nothing in front of hard power.
> "who is eligible to vote in US presidential elections"
“Uncensored” is simply a branding trick that a lot of seemingly intelligent people seem to fall for.
Look for anything that includes the word “woke” in any marketing/tweet material
> This isn't only limited to fake pics of kids, but kids are weak and struggle to defend themselves, and in this context the tools faking them seem to me likely to increase rates of harm against them.
Why does it seem this way to you?
looks like it’s getting hugged
Good luck for your project!
(not including the new millions upon millions of documents and photos)
https://storage.courtlistener.com/recap/gov.uscourts.nysd.47...
from a 2017 FOIA they had to provide it
https://www.bloomberg.com/news/newsletters/2025-08-08/here-s...
Might be possible for machine-learning to determine what is missing?
(which is basically 99% missing as we already know less than 1% released)
It's worth noting that only about 1% of the files have been released, according to the DOJ.
Of the released files, many have redactions.
"He participated regularly in paying money to force me to ___ with him and he was present when my uncle murdered my newborn child and disposed of the body in Lake Michigan. "
The uncle is allegedly referring to Trump
Can't edit it anymore, but it would be "\u25A0" * n
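For anyone wondering, that expression is valid Python as written (assuming Python is what was meant): it repeats U+25A0, the black square, n times. A tiny illustrative helper, with the function name made up here:

```python
def redact(text, start, end):
    """Replace text[start:end] with the same number of U+25A0 black squares."""
    n = end - start
    return text[:start] + "\u25A0" * n + text[end:]
```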