> But like humans — and unlike computer programs — they do not produce the exact same results every time they are used. This is fundamental to the way that LLMs operate: based on the "weights" derived from their training data, they calculate the likelihood of possible next words to output, then randomly select one (in proportion to its likelihood).
This is emphatically not fundamental to LLMs! Yes, the next token is selected randomly; but "randomly" could mean "chosen using an RNG with a fixed seed." Indeed, many APIs used to support a "temperature" parameter that, when set to 0, would result in fully deterministic output. These parameters were slowly removed or made non-functional, though, and the reason has never been entirely clear to me. My current guess is that it is some combination of A) 99% of users don't care, B) perfect determinism would require not just a seeded RNG, but also fixing a bunch of data races that are currently benign, and C) deterministic output might be exploitable in undesirable ways, or lead to bad PR somehow.
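To make the seeded-RNG point concrete, here is a minimal toy sketch in Python. It is not any provider's actual sampler: `sample_next_token`, `generate`, and the four made-up logits are all hypothetical names and values for illustration. The only claim is that sampling "in proportion to likelihood" becomes repeatable once the RNG is seeded, and that temperature 0 reduces to always picking the top token.

```python
import math
import random


def sample_next_token(logits, temperature, rng):
    """Pick a token index from raw scores (logits)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Draw one index in proportion to its weight.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits, temperature, seed, steps=20):
    """Sample `steps` tokens using a private RNG seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_next_token(logits, temperature, rng) for _ in range(steps)]


# Made-up scores for a 4-token vocabulary (an assumption for the demo).
toy_logits = [2.0, 1.0, 0.5, 0.1]

run_a = generate(toy_logits, temperature=1.0, seed=1234)
run_b = generate(toy_logits, temperature=1.0, seed=1234)
greedy = generate(toy_logits, temperature=0, seed=99)

print(run_a == run_b)  # same seed, same sequence
print(greedy)          # temperature 0 ignores the RNG entirely
```

This only covers the sampling step. It says nothing about whether the logits themselves come out bit-identical from run to run on real hardware, which is point B above.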
Deterministic output is incompatible with batching, which in turn is critical to high utilization on GPUs, which in turn is necessary to keep costs low.
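One commonly cited mechanism behind this (I am assuming it is what the parent has in mind) is that floating-point addition is not associative, so if the batch size changes how a reduction gets split up, the same inputs can sum to slightly different values. A toy Python sketch, with hypothetical helper names and hand-picked numbers, and no actual GPU involved:

```python
def left_to_right_sum(values):
    """Add floats strictly in order, one at a time."""
    total = 0.0
    for v in values:
        total = total + v
    return total


def chunked_sum(values, chunk_size):
    """Sum fixed-size chunks first, then add up the partial sums."""
    partials = [
        left_to_right_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return left_to_right_sum(partials)


# Hand-picked values where rounding is easy to see (an assumption for the demo).
values = [1e16, 1.0, -1e16, 1.0]

one_chunk = chunked_sum(values, chunk_size=4)   # a single pass over everything
two_chunks = chunked_sum(values, chunk_size=2)  # same numbers, different grouping

print(one_chunk, two_chunks, one_chunk == two_chunks)
```

The two groupings disagree even though nothing random happened. Whether that makes determinism "incompatible" with batching or merely more expensive is the actual disagreement in this thread; the sketch only shows why grouping matters.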
I don't believe it. This seems more like laziness and not caring about the problem, than something fundamental. (FWIW ChatGPT agrees)
Once commercial LLM providers care about this (and I think eventually they will, there are many use cases), we'll get seed support. It might not be completely trivial given the complexity of the stacks, but it's nothing compared to what they've already accomplished. It's just a compute graph, and GPUs are actually extremely INcompatible with randomness if anything.
Actually older LLMs did support seeds, but support got dropped a couple of years ago, I guess when they decided scale was more important than supporting that feature.
At what point does this just wrap all the way back around to being genetic algorithms?
I'm also reminded of the old software called Formulize, which could take in a set of arbitrary data and find a function that described it. http://nutonian.wikidot.com/
I'm finding code falls into two categories. Code that produces known results and code that produces results that are not known. For example, creating a table with a pagination component with a backend that loads the first 30 rows ordered by date descending from the database on page 1 and the second set of 30 rows on page 2. We know what the code is supposed to output, we know what the result looks like. On the other hand, there is code that does statistical analysis on the 30 rows of data. This is different because we don't know what the result is.
The known-result code is easy to use an LLM with. I have a skill that iterates in an OODA-style loop: observe, act, and validate. In the validate step it takes screenshots and, even without being told to, queries the database from the CLI and compares the rendered row data to the database data. More surprisingly, it also makes sure that all the components are responsive and render beautifully on mobile. I'm orders of magnitude past linting here, which is solved with Biome.
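For a sense of what "validate against the database" can look like for the pagination case, here is a minimal sketch using Python's built-in `sqlite3`. The `events` table, its columns, `fetch_page`, and the 75 generated rows are all hypothetical stand-ins, not the setup described above; a real check would compare rendered rows instead of calling the query function directly.

```python
import sqlite3

PAGE_SIZE = 30


def fetch_page(conn, page):
    """Return one page of rows, newest first (id breaks ties between equal dates)."""
    offset = (page - 1) * PAGE_SIZE
    return conn.execute(
        "SELECT id, created_at FROM events "
        "ORDER BY created_at DESC, id DESC LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()


# Hypothetical in-memory table with 75 rows spread over 28 dates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"2024-01-{(i % 28) + 1:02d}") for i in range(1, 76)],
)

# Ground truth: read everything, then sort independently of the paging query.
all_rows = conn.execute("SELECT id, created_at FROM events").fetchall()
expected = sorted(all_rows, key=lambda r: (r[1], r[0]), reverse=True)

page1 = fetch_page(conn, 1)
page2 = fetch_page(conn, 2)
page3 = fetch_page(conn, 3)

assert page1 == expected[:30]
assert page2 == expected[30:60]
assert page3 == expected[60:]
```

The point is that the expected output is known before any code is generated, so the check can be written independently of whatever produces the page.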
The statistical analysis is different. The only way I can be sure of the result is by painstakingly writing the code by hand. The LLM will always produce specious lies. It will fabricate and show me what I want to see, not the truth. This is because until the code is written by hand, there is no ground truth. In this case, there is no code checking code.
Obviously this won't work if your tools are not deterministic, but reproducible builds is a well-trodden discipline.
it goes on for ages just to reach the point of "write the tests first"
LLMs really cause diminished reasoning, or in terms that LLM people might understand: Your minds have been quantized!