I would suggest changing the title to the actual title of the article: Adaptive PDFs.
Assuming the program works, the PDF will not actually look different to me than to anyone else looking at it, so there is nothing that "changes based on who is reading". It is just that text extraction, a wholly different (and much fuzzier) process than viewing the PDF, and something that the same person can do, will now return structured (Markdown) text. (One might say the PDF changes based on how you are reading it.) A great idea, IMHO.
The writer uses this expression; the other 'who' that's reading it is a machine, e.g. an LLM.
It seems the purely typographical character of pdf format might have good security uses in business etc. But it is a perpetual torture that one cannot extract some sort of structured text from e.g. scientific and academic pdfs. Markdown-level simplicity would be fine - one would concede that some things in the pdf will go beyond this.
I often want to read non-English works in my discipline, so it is natural to work with some translation mechanism, but everything is a chamber of horrors, e.g. no one can distinguish footnotes. If, for example, I try to produce a working translation as two-column HTML, the labor of getting the text into a position to be mirrored by a prospective English translation is half the work. This trouble is not too surprising if, e.g., the PDF is images from an 1867 book https://wertform.github.io/texts/capital.html
Having slightly different versions per recipient would certainly increase the odds of identifying leakers of certain kinds of documents. That would be of interest to some kinds of organizations or departments within organizations.
Just because everything is a potential threat vector now: doesn't this also mean you could easily put AI specific malicious instructions into the PDF that the regular human would never notice?
Like the "white text between the lines that only appears when copy-pasted" hack that some professors have been using in their exercises to get students to include pink elephants in the output and stuff. But worse. Just thinking of an electricity bill PDF you provide as proof of address to some company that uses an LLM to extract that address and pre-process that doc. But instead we can command it to do something else that a regular human wouldn't ever notice...
Yes, although that's not new. The number of different exploits and RCEs I've seen in the past decade from just "opening" a PDF is mind-blowing. Not sure if it's slowed down, but around 8 years ago Ghostscript would patch a couple of RCEs from PDF processing every few months.
In the US, publicly funded organizations are required to code their PDF with semantic structure to support machine access by screen readers and other assistive technologies [1], [2].
Given the low adherence to accessibility standards, e.g. in academic publishing [3], it would be marvelous if LLM parsing needs created a commercial incentive for comparable structured access.
Cool but it's relying on every extractor honoring that replacement-text property which you said yourself is hit or miss. So it's clean markdown until someone runs it through a tool that ignores it and quietly gets the messy version and has no idea that happened.
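To make the silent fallback concrete, here is a minimal sketch. It assumes the replacement-text property in question is the /ActualText entry on a marked-content span, and it uses a toy content stream with a regex instead of a real PDF parser. The stream, the `extract` function, and the `honor_actual_text` flag are all hypothetical, made up for illustration.

```python
import re

# Toy content stream (an assumption, not output from any real tool):
# the visible glyphs say "Overview", the /ActualText replacement carries Markdown.
STREAM = (
    "/Span <</ActualText (## Overview)>> BDC BT (Overview) Tj ET EMC "
    "BT (Body text.) Tj ET"
)

# Either a marked-content span with /ActualText, or a plain text-showing operator.
TOKEN = re.compile(
    r"/Span\s*<<\s*/ActualText\s*\((?P<actual>[^)]*)\)\s*>>\s*BDC(?P<body>.*?)EMC"
    r"|\((?P<plain>[^)]*)\)\s*Tj",
    re.S,
)
SHOWN = re.compile(r"\(([^)]*)\)\s*Tj")


def extract(stream: str, honor_actual_text: bool) -> str:
    """Return extracted text, with or without honoring /ActualText."""
    parts = []
    for m in TOKEN.finditer(stream):
        if m.group("actual") is not None:
            if honor_actual_text:
                # Extractor that respects the replacement text.
                parts.append(m.group("actual"))
            else:
                # Extractor that ignores it and reads the drawn glyphs instead.
                parts.extend(SHOWN.findall(m.group("body")))
        else:
            parts.append(m.group("plain"))
    return "\n".join(parts)


clean = extract(STREAM, honor_actual_text=True)
messy = extract(STREAM, honor_actual_text=False)
# Neither call raises or warns, so the caller cannot tell which one it got.
silently_degraded = clean != messy
```

Both calls succeed without any signal, which is the concern: the only way to notice the difference is to run both kinds of extraction and compare.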
This looks really interesting. Optimizing for humans vs. agents feels like the new wave of Desktop vs. Mobile (where mobile won) - agents are going to win even faster.
Where is the repo? It's mentioned but I can't find it.
Hasn’t it been possible since forever to put machine-readable source information into PDF metadata? It’s more a problem of the tools and programs generating the PDFs.
We spend millions turning structured information into PDFs and billions to extract the same data from a printer rendering language
You can put the actual document content in an image and duplicate the textual data it contains using invisible text objects (popular for scanned books). You can specify what Unicode characters underlie the glyph used in your text objects (essentially required for copy&paste to work once the document goes beyond ASCII, or even just uses prebaked ligatures in the font). You can attach arbitrary files, which may contain the document’s plaintext source if you so choose (some do this with their LaTeX documents).
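As a rough illustration of the glyph-to-Unicode point, the sketch below parses the `beginbfchar` section of a ToUnicode CMap and uses it to decode a hex string of glyph codes. The CMap fragment, the glyph codes, and the function names are made up for the example, and it assumes fixed 2-byte glyph codes and ignores `beginbfrange` sections, so treat it as a sketch and not a full CMap reader.

```python
import re

# Hypothetical ToUnicode CMap fragment: glyph 0x0001 is "H",
# glyph 0x0002 is a prebaked "fi" ligature that maps to two characters.
CMAP = """
2 beginbfchar
<0001> <0048>
<0002> <00660069>
endbfchar
"""


def parse_bfchar(cmap: str) -> dict[int, str]:
    """Map glyph codes to Unicode strings (destinations are UTF-16BE hex)."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(hex_string: str, mapping: dict[int, str]) -> str:
    """Decode a hex string of 2-byte glyph codes; unmapped codes become U+FFFD."""
    raw = bytes.fromhex(hex_string)
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(CMAP)
text = decode("00010002", mapping)  # the two glyphs drawn on the page
```

Without such a mapping an extractor only sees the glyph codes, which is why copy and paste breaks on ligatures and non-ASCII text.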
Finally, the closest to what you want is “tagged PDF”, required by some accessibility and archival profiles. As best as I understand, it essentially annotates the text content of the document with semantic markup (which in normal viewers is invisible and completely ignored). Unfortunately, tagging is only specified in PDF ≥2.0, which ISO in its infinite wisdom decided (in spite of its promises to Adobe once upon a time) to put behind a paywall, unlike the earlier, Adobe-produced versions; and associated best-practices profiles like PDF/A and PDF/UA were born paywalled. Nowadays PDF and PDF/UA, at least, are login-walled and watermarked but gratis[1], yet tagging still seems to mostly be treated as an expensive compliance concern for those subject to such. There is in particular no decent way to make tagged PDFs from LaTeX despite ongoing work (unsurprisingly, as it would need to be an ecosystem-wide effort on the scale of tex4ht).
[1] Remember to hoard copies: e.g. quite a few public standards from the 2000s reference specifically Unicode 3.0 and not any later version, while linking to the free copy of ISO 10646-1:2000 on the ISO website. ISO has now deleted that copy because of a policy to only make the latest version freely available.
Exactly. It’s pretty insane that we have converged on storing documents as PDF. And it looks like no work is done on making PDF files machine readable.
Exactly. But we have no real coordination or uniform application in how we're creating PDFs across all these programs, so we always end up with a fun mix of what will and won't be static, scalable, searchable.
I always export my Typst with PDF/A. It basically guarantees maximal compatibility and none of the annoying dynamic bullshit. I wish everyone would do this, at least for documents that don't need the fancy dynamic PDF features.
>This didn't matter when humans were the only readers. But now most PDFs end up in an LLM.
But it did matter, a lot. The PDF format was originally proprietary and was designed to be proprietary and to disallow casual text extraction. I just didn't like the way you glossed over that: "it was OK that people for over 30 years were not given any way for the information they were given to be unshackled, but now it matters because our AI overlords would prefer that, so we must change things!"
Just a thought
LaTeX is actually one of the best ways to create tagged PDF: https://latex3.github.io/tagging-project/tagging-status/ and https://www.overleaf.com/learn/latex/An_introduction_to_tagg...
[1] https://www.section508.gov/create/pdfs/common-tags-and-usage...
[2] https://pdfa.org/resource/tagged-pdf-best-practice-guide-syn...
[3] https://arxiv.org/html/2410.03022v1
I guess the exact same technique can actually be used.