SQL NULLs aren't weird, they're just based on Kleene's TRUE-FALSE-UNKNOWN logic!
If you want you can read NULL as UNKNOWN and suddenly a whole bunch of operations involving them become a lot more intuitive:
1. TRUE OR UNKNOWN = TRUE, because you know you have at least one TRUE already.
2. TRUE AND UNKNOWN = UNKNOWN, because you don't know whether you have two TRUEs or not. It's just out there.
3. UNKNOWN XOR UNKNOWN = UNKNOWN, because it could darn near be anything: TRUE XOR TRUE, TRUE XOR FALSE, FALSE XOR FALSE, FALSE XOR TRUE... Internalizing this is where SQL's use of NULL / UNKNOWN really becomes intuitive.
4. (TRUE AND FALSE) XOR (TRUE OR UNKNOWN) = (FALSE) XOR (TRUE) per #1 = TRUE. See, it's consistent, you just need to keep in mind that if you have a lot of known UNKNOWNs they're quite parasitic and your final outcome is likely to be, itself, an UNKNOWN. Just like in real life!
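The four examples above are easy to spot-check against a real engine. Here's a minimal sketch using SQLite via Python's sqlite3 module, with 1 for TRUE, 0 for FALSE, and NULL for UNKNOWN (`<>` acts as XOR on booleans):

```python
import sqlite3

# Spot-check the three-valued truth table above with SQLite.
# 1 = TRUE, 0 = FALSE, NULL = UNKNOWN.
con = sqlite3.connect(":memory:")

def ask(expr):
    """Evaluate a scalar SQL expression and return its value (None = UNKNOWN)."""
    return con.execute(f"SELECT {expr}").fetchone()[0]

print(ask("1 OR NULL"))     # 1    -- TRUE OR UNKNOWN = TRUE
print(ask("1 AND NULL"))    # None -- TRUE AND UNKNOWN = UNKNOWN
print(ask("NULL <> NULL"))  # None -- UNKNOWN XOR UNKNOWN = UNKNOWN
print(ask("(1 AND 0) <> (1 OR NULL)"))  # 1 -- example 4 above
```

Any SQL engine should agree on these results; only the spelling of booleans varies between dialects.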
That doesn't address anything in the second half of the post though, starting with this pull quote:
> The fact that NULLs are distinct for UNIQUE columns but are indistinct for SELECT DISTINCT and UNION continues to be puzzling. It seems that NULLs should be either distinct everywhere or nowhere. And the SQL standards documents suggest that NULLs should be distinct everywhere. Yet as of this writing, no SQL engine tested treats NULLs as distinct in a SELECT DISTINCT statement or in a UNION.
That's because "different" and "distinct" don't mean the same thing.
Two unknown values are assumed to be different, but they are not distinct from each other.
For example, take two boxes; in each box is a die, and the value of the box is the value shown on the die inside. You don't know the value since you can't see the die (it may even change as you manipulate the box), so it is unknown: NULL in SQL. Because of that, you assume the two boxes have different values. They are, however, indistinguishable, so, not distinct. All you need to know is that you hold two boxes with dice in them; which one you hold doesn't matter, and that's what "SELECT DISTINCT" tells you: that you have two boxes and it doesn't matter which is which, even though they have different values.
>That's because "different" and "distinct" don't mean the same thing.
The literal definition of distinct is:
>recognizably different in nature from something else of a similar type.
If you want to get down to it nothing is "equal" or the same.
Is a temperature measurement of 25C the same as another of 25C? No: these measurements are approximations of the actual values, which are not equal to each other. They are distinct; they have just been lumped into the same 25C group due to the resolution of the measurement. Yet equality works just fine on that value in SQL.
I have used SQL for a long time; null handling is weird, inconsistent, and a waste of time. For all the language bugs due to the existence of null, at least in other languages I can count on null = null and don't have to write garbage like value = param or (param is null and value is null).
I wanted to briefly reinforce this point with the fact that SQL has multiple equality operators - there is both `=` and `IS NOT DISTINCT FROM`. The latter operator will treat null values as equal, in alignment with the `DISTINCT` and `DISTINCT ON` operators.
It is extremely easy, using partial unique indexes and the proper operators, to treat nulls as non-distinct values, and the approach we have allows some very important functionality that aligns with the standard statistical treatment of data, which was influential in how SQL's logic was originally designed.
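A quick sketch of the two operators in SQLite, which spells the null-safe comparison `IS` (PostgreSQL and the standard spell it `IS NOT DISTINCT FROM`; recent SQLite versions accept that spelling as well):

```python
import sqlite3

# Compare "=" (three-valued) with the null-safe comparison ("IS" in SQLite).
con = sqlite3.connect(":memory:")

plain = con.execute("SELECT NULL = NULL").fetchone()[0]   # None: "=" yields UNKNOWN
safe = con.execute("SELECT NULL IS NULL").fetchone()[0]   # 1: NULLs compare equal
print(plain, safe)
```

On non-NULL operands the null-safe comparison behaves exactly like `=`; the two only diverge when a NULL is involved.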
An unknown measurement isn't a measurement value; it's a statement of (lack of) knowledge about a measurement, and it doesn't tell you what the measurement is. Knowledge about a measurement is as different from the measurement as the measurement itself is from the thing measured.
Whether two unknown measurements are the same is unknown.
Whether two measurements of 25C are the same is unknown, these are just values recorded in a database. 25 is a value, null is a value.
The values in the db are the same in both cases which is what I would like my db language to deal with and not make assumptions about what that value actually means.
I see no value in treating null as special when it comes to equality in a SQL db; in fact, in my experience, it is a hindrance that it does so.
The SQL null is a database-specific keyword and not something that's part of the domain of your measurements. If you want some kind of sentinel value for your measurements that doesn't have the "unknown" meaning of SQL null, then you should use your own sentinel value and not reuse null for that purpose
Sentinel values suck especially when the language already has a perfectly good one built in.
Is 0 for a temp measurement an unknown sentinel or an actual measurement? How about 2,147,483,647? Great, probably not a measurement, but now it's always included in greater-than queries, and the same goes for the maximum negative value with less-than.
Null separates the value into its own distinct group and prevents it from being included in range queries, since it is not an actual numeric value, while most languages still allow you to compare equality using standard operators.
Sum types would be great in SQL, but currently we get a union of the SQL type and null, so null for sentinel values it is, except for having to use weird syntax to compare it.
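The range-query behavior described above is easy to demonstrate; here's a sketch in SQLite (table name and values are made up):

```python
import sqlite3

# A NULL row falls out of BOTH sides of a range predicate.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings(temp INTEGER)")
con.executemany("INSERT INTO readings VALUES (?)", [(10,), (30,), (None,)])

above = con.execute("SELECT COUNT(*) FROM readings WHERE temp > 20").fetchone()[0]
below = con.execute("SELECT COUNT(*) FROM readings WHERE temp <= 20").fetchone()[0]
print(above, below)  # 1 1 -- three rows total, but the NULL row is in neither bucket
```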
NULL is a statement that a value is not available, for unspecified reasons.
If you want a particular value where a query would return NULL, it's your job to replace the NULLs with the contextually appropriate value, e.g., using COALESCE().
It's a convenience shortcut that allows more complicated data models to be represented in simpler table structures than a fully normalized NULL-free data model would require, and that provides information about missing data (which can be used with things like COALESCE, where appropriate) when a more complex data model is simplified into a resultset via a query with JOINs, etc.
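A minimal COALESCE sketch, substituting a contextually appropriate value for NULL at query time (table name and default are made up):

```python
import sqlite3

# COALESCE returns its first non-NULL argument, so missing discounts read as 0.0.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(discount REAL)")
con.executemany("INSERT INTO orders VALUES (?)", [(0.1,), (None,)])

rows = con.execute("SELECT COALESCE(discount, 0.0) FROM orders").fetchall()
print(rows)  # [(0.1,), (0.0,)]
```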
> recognizably different in nature from something else of a similar type.
But anyways, the point wasn't to justify the choices of SQL but rather as a way to make intuitive sense of its logic. SQL is one of the oldest and most successful programming languages in existence, we are not going to change it, and it is not going to disappear anytime soon, so we have to go with it, like it or not. There have been some attempts at alternatives, both at changing the paradigm (NoSQL) and at cleaning up the language, which, to be fair, would be a good thing, but without much success. The relational paradigm just works, and SQL is usable enough to make the cost of switching not worth it.
Edit:
And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic. You are using it wrong basically. Sometimes, it is the right thing to do, but if that pattern starts appearing all over the place, it is usually a result of thinking "NULL is broken, I have to use this pattern to handle NULL properly". That's cargo culting, don't fix problems you don't understand by copy-pasting code you don't understand.
Note: this is not addressed to "you" in particular, there can be good reasons, no offense intended. But I think that in general, it is a code smell.
If it is not recognizably different, then it is the same in that context, correct?
Two measurements of 25C are not recognizably different, therefore they are equal, correct, regardless of whether the actual temperatures are the same?
Two measurements of unknown are not recognizably different, therefore they are equal in the context of the database.
Having null != null has never been intuitive to me, especially since every other programming language treats them as equal. I am not hoping this gets changed, I know SQL is too far along for that, but I can still complain about it and agree it's weird.
>And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic.
It's needed with parametrized SQL when your db doesn't support "is not distinct from", which is itself a silly way to just write '=' or '==' like a normal programming language. The distinct predicate exists for this very reason: to have yet another way to express equality, one that includes nulls: https://modern-sql.com/caniuse/T151
This is confusing when you know that NULLs are not comparable, but it makes some sense if you consider the result of distinct/union as the output of a GROUP BY. You can consider everything that's NULL to be part of the same group, all the values are unknown.
So NULLs are not comparable but they are part of the same set.
It doesn't treat them as equal, it treats them as one group. It does this because the result is more useful.
It is not the case that nulls are always the same as one another. It is also not the case that nulls are always distinct from each other. Thus, the normal rule of grouping, that same values are combined and distinct values make different groups, does not apply. Another principle is needed to determine the result of grouping with nulls.
Logic which allows for an unknown value can't be shoehorned into always giving definite true or false answers.
This is where the foundation of a relational database semantics in set theory shows through. You can’t model the behaviour with pure boolean logic.
In the SQL spec, unique indexes consider nulls distinct by default because you're adding a tuple to the relation, and this is done by equality.
When doing a select distinct or group by you’re not doing a boolean grouping, you’re doing a set projection. NULLs are considered part of the set of “unknown values”, so NULLs are grouped together but they’re still not equal to each other.
The behaviour is mathematically well defined, but it doesn’t match boolean logic.
I’ve been dealing with databases for well over 20 years now and I can only think of a couple of times when this behaviour wasn’t wanted, so I think it’s the right choice.
I believe this confusion comes from conflating the tool with the thing being measured. For simplicity, I will use the analogy of a record (stored as a row in the database) as an observation in a scientific experiment. If the tool was able to record a value, I enter a value like 579.13. If the tool was not able to record a value, the tool will enter NULL. I make a total of one hundred observations. Of one hundred rows, some have values and some are NULL.
Are NULLs distinct values? No, they are simply a failure in measurement; it is like asking if all errors are distinct or the same. Are NULLs part of the same dataset? Yes, because they are all observations for the same scientific experiment. What does it mean when "select distinct ..." returns several rows for known/measurable values but only one row for NULL? If this is confusing, the scientist can update the rows and substitute "UNKNOWN/ERROR" for every NULL. When you do "select distinct ...", you will get the same thing: several rows for known/measurable values but only one row for "UNKNOWN/ERROR".
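The experiment above can be sketched in SQLite: many NULL "failed" observations collapse to a single row under SELECT DISTINCT, exactly as a single sentinel value would (table name and values are made up):

```python
import sqlite3

# Several NULL observations become one row under SELECT DISTINCT.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs(value REAL)")
con.executemany("INSERT INTO obs VALUES (?)",
                [(579.13,), (579.13,), (600.0,), (None,), (None,), (None,)])

rows = con.execute("SELECT DISTINCT value FROM obs ORDER BY value").fetchall()
print(rows)  # three rows: one NULL, 579.13, 600.0 (NULLs sort first in SQLite)
```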
Yes, ergonomics dictates some weird behavior for nulls.
Luckily, Postgres nowadays lets you declare the behavior of each null in unique constraints, like it should be. We can expect this to creep down to the other DBMSs with time.
Making nulls distinct on a "select distinct" or a "group by" would be completely useless.
If you're compromising on your high-minded and unorthodox purist concept for ergonomics, you may as well finish the job of ergonomics and just use normal nulls where X=X is true.
If you're including possibly NULL columns in a distinct or group by and you want to treat them in a particular way, use the COALESCE() or NVL() or whatever similar function to give a real value to the NULL for that purpose.
Yeah nothing wrong with “select distinct” itself if it’s used correctly for its intended reasons.
But when I see select distinct at the start of a big and complex query, I do immediately suspect that the developer might have missed some join condition down the line and “got too many rows” back from the query. And since the rows look like duplicates due to the missing join predicate, for a junior (or careless) developer, the quick “solution” is to just apply distinct in the top level query to get rid of these pesky duplicates, declare success and move on.
It's possible that this is due to the underlying implementation.
In a unique column normally you'll have an index, so NULL becomes a special value in an index, but in SELECT DISTINCT you probably won't have an index, which means a full scan is performed, then every row has to be compared with every other row.
there is a pattern starting to emerge here on Hacker News of highly voted posts by people who present themselves as experts and thought leaders while shamelessly putting their lack of understanding on display. it's frightening.
The idea that someone should refrain from publishing a blog post about _anything_ unless they are a certifiable expert is not reasonable. Many people (correctly) write to learn, and even if they are publishing just to "present themselves as experts", it's on the reader to determine value.
In a world where there was less false bullshit, people believed smoking was fine and sugar was healthy.
The amount of false bullshit doesn't make a qualitative difference.
The only difference to make is that people should not take something as truth just because it is written in a book or a blog post, or because the person has a degree.
Why not both? In my career, I have met countless people who are experts in programming in general, but with relatively modest skills in database systems.
Which is fine! It's really hard to be truly expert in both. There's a reason why "programmer" and "database administrator" used to be two different professions. I'd like to think that I'm better than your average developer at flogging RDBMSes, but most DBAs I've worked with can still run circles around me when it comes to information modeling and database & query optimization.
At a lot of companies, there are still full teams of people slinging t-sql or pl/SQL all day long to support their organization. Not DBAs, just developers who primarily work inside the database system their entire life.
I keep my old SQL Server Anki cards alive for just such a use case. It's been a minute since I had to jump into a 3-digit-LOC SQL script that does some arcane financial processing or what have you, but there's a nice steady niche there in case I ever want to throw my hat back into the ring.
I think that's more a pattern of your understanding growing over time.
Most technical writing is actually at the start of the technical journey; there are more people there, and it's more interesting to talk about strategy and ideas when it's not "messy" with internal details and the trade-offs you make.
I encourage folks at any technical level to write, and I encourage folks to check authors work with a critical eye no matter how long the source has been in the industry, no amount of experience can protect you from being wrong.
Like on Reddit etc., which I deliberately avoid for this reason. The hiding of the vote count and the heavy moderation help a lot; HN is still a massively better platform than any of its alternatives.
eh. This probably shouldn't have gotten so many votes, but it's a little interesting from a logic standpoint. It falls somewhere in the region of a StackOverflow question that makes some people scratch their heads and functions as nerdbait for everyone who knows the answer. These things don't rank for that long on HN, (and I agree that the self-important "expert" blog posture is silly), but I do find them to be a better daily checkin for my brain than actually going on S.O. anymore...
This is the correct way of thinking about things. Null is one of the hardest things for traditional software engineers in my experience as a guy who came up as a data admin.
That's because null in not-SQL is a rather different concept, and while it's pretty easy to understand, it absolutely is hard to actually work with. Hoare didn't call it a "billion dollar mistake" on a whim.
Because they're two orthogonal problems. It's not like you do select distinct from program variables group by scope in your frontend programming language.
SQL is not three valued. Neither is NULL. BOOLEAN is accused of being three-valued but it has two values and like all values they can be unknown. Similarly a SMALLINT has 65,536 possible values not 65,537.
Your link makes the same mistake I already addressed. It conflates nullable booleans with tri-state logic.
Null is not a value. It is the absence of a value.
> The SQL null value basically means “could be anything”.
This is wrong. Null means it could be any valid value but that value is unknown. If the datatype is DATE then the value cannot be boolean TRUE or the string ‘purple’.
If I’m comparing a value of type date to a null I still think it works as it should if value is “unknown”. What greater insight or context do we have if it’s a small-int null?
Suppose you have table with two columns to represent a classroom’s exam scores. It has the columns student_id (varchar) and score (int).
If a student is sick and has not taken the exam, yes you could enter -99 to represent they did not take the test. But if you want to find the class average, you would have to do something like this:
select avg(case when score = -99 then null else score end) as class_avg from …
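The CASE expression above can also be written with NULLIF, which returns NULL when its two arguments are equal. A sketch using the same made-up exam-score schema, in SQLite:

```python
import sqlite3

# AVG ignores NULL, so mapping the -99 sentinel to NULL drops it from the average.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores(student_id TEXT, score INTEGER)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("a", 80), ("b", 90), ("c", -99)])  # -99 = sick, did not take exam

avg = con.execute("SELECT AVG(NULLIF(score, -99)) FROM scores").fetchone()[0]
print(avg)  # 85.0 -- the sentinel row drops out of the aggregate
```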
> What greater insight or context do we have if it’s a small-int null?
The insight is that null is not a value. It’s not a smallint or a boolean or a date. It’s the absence of a possible value. The only way to see a null as tri-state is to conflate it with a nullable boolean. This is an incorrect mental model which leads to confusion.
Javascript has both null and undefined and I'm not sure that's a good idea.
At least in SQL we only have one of them, but it can mean unknown or it can mean N/A or even false. It's like a joker, what it means depends on how you use it.
No, it's not those other things, that's just using the tool incorrectly. A NULL is definitely "we dont know", not false, not N/A, especially not any known value.
Honestly, at this point I just wish SQL servers supported proper discriminated union types and nullable columns were implemented as some kind of MaybeKnown<T> and I could just have a normal Maybe<T> with normal equality semantics if I wanted.
SQL needs to join 21st century type systems... or needs to get replaced altogether. SQL is the FORTRAN of relational programming languages, but hangs around because every time somebody tries to replace it they throw the relational-theory baby out with the bath-water.
It's unfortunate to namesquat on 'boolean' if your elements have three possible values. Just call it 'ternary logic' and let individuals decide which system to use.
Who's name squatting boolean? The bool column is exactly what it claims to be, you just have the option of introducing unknowability if you define it allow nulls.
The distinction is that not all formal logic systems are Boolean. Meaning that it is nonsensical and confusing to use "Boolean" as a generic synonym for "truth value" in the same way that it's nonsensical to use "Pantone" as a generic synonym for "color value", including when the specific kind of color value you're talking about is CMYK or HSV and definitely not Pantone.
There are two values, TRUE and FALSE. Null is not a value, it is the lack of a value.
You have a list of people and you ask if they own a car. You didn't get around to asking George, so does that somehow mean he owns a car, because you are using boolean logic? Or does it mean he doesn't own a car, because you are using boolean logic?
No, it means you haven't gathered this data point and don't know.
If there are exactly two possible values, TRUE and FALSE, you're working with Boolean logic.
If there are three possible values, TRUE, FALSE and NULL (unknown), then you're probably working with something like Kleene logic. You can't truly be working with Boolean logic, though, any more than you can be doing integer arithmetic when 15.37 is a member of your domain.
To put it another way, if we're talking about the mathematical definition of boolean algebra and not just some programming language's quirky implementation that happens to be called "bool", then boolean values would by definition be non-nullable. That logic that allows nulls has been implemented using the same unified set of keywords and operator names is a pragmatic decision that simplifies the language implementation and spec, not a principled one that tries to be pedantic about mathematical terminology.
> That logic that allows nulls has been implemented using the same unified set of keywords and operator names is a pragmatic decision
That's why it's name-squatting. Rather than introduce a 'kleene' datatype & operations, and let the user opt-in, they decided that in our 'bool' world, the following is not a truism:
> No, it means you haven't gathered this data point and don't know.
This is how it should be.
> Somehow means he owns a car because you are using boolean logic?
This is how it unfortunately is. There are 3 people, and there are 3 people who don't have a NULL car. Therefore George has a car.
CREATE TABLE people(name text, carId uuid);
INSERT INTO people values('Bill', '40c8a2d7-1eb9-40a9-b064-da358d6cee2b');
INSERT INTO people values('Fred', '3446364a-e4a5-400f-bb67-cbcac5dc2254');
INSERT INTO people values('George', NULL);
SELECT Count(*) FROM people WHERE name NOT IN (
SELECT name FROM people WHERE carId = NULL
);
Elsewhere people have argued that NULL propagates, so that your small unknowns infect larger queries. I could get behind that line of thinking, but the query above confidently returns 3.
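The query above can be run as-is in SQLite (ids shortened to placeholders since their values don't matter): `carId = NULL` is UNKNOWN on every row, so the subquery is empty, `NOT IN` an empty set is TRUE, and the count is a confident 3. `IS NULL` is the predicate that actually finds George:

```python
import sqlite3

# Reproduce the "confidently returns 3" query, plus the IS NULL fix.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people(name TEXT, carId TEXT)")
con.executemany("INSERT INTO people VALUES (?, ?)",
                [("Bill", "id-1"), ("Fred", "id-2"), ("George", None)])

n = con.execute("""SELECT COUNT(*) FROM people WHERE name NOT IN (
                     SELECT name FROM people WHERE carId = NULL
                   )""").fetchone()[0]
carless = con.execute(
    "SELECT COUNT(*) FROM people WHERE carId IS NULL").fetchone()[0]
print(n, carless)  # 3 1
```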
But it is a boolean value; there are only two possible values, TRUE and FALSE. But because it's SQL, you can define any column as TYPE | NULL.
You could say that a boolean column with a NULL value is FALSE like how a lot of programming languages coerce it but if you wanted that you would just make a default of FALSE. The meaning of NULL in general being "value not specified" lends itself pretty nicely to "either true or false."
What I want is for e.g. "x OR y" where y is NULL (and/or of nullable type) to be an error rather than silently giving surprising results. Just like in a decent programming language I can't do x || y where x and y are of type boolean?, I have to explicitly handle the case where one or other of them is null (or e.g. write x!! || y!! - and that will at still error if either is null rather than silently evaluating to a funny result).
I think I mean either. So yes NULL is a distinct value from true and false so I think it's also right to say it's neither true nor false. But the value NULL does represent is roughly "could be true or false, we don't know yet."
It could mean anything or nothing depending on context, which is part of the problem. (Plenty of people think the meaning of NULL is clear and obvious and consistent, unfortunately they give different answers about what that meaning is)
It could also be neither. It's whatever you define it to be. Null could mean you don't know if it's true or if it's false, or it could mean you know it's neither true nor false.
It could be true or false, but it’s unknown. For example. a user doing a survey is yet to fill in the answer. That doesn’t mean there is no answer, it’s just unrecorded.
Even simpler, note that the first two examples are essentially just two of the most basic boolean algebra identities. Replace UNKNOWN with a variable and change the notation to the more conventional boolean algebra notation and you've got the annihilator and identity laws: TRUE OR x = TRUE, and TRUE AND x = x.
> And in many contexts in SQL, NULL is treated as equivalent to false, such as within a WHERE clause.
NULL is not equivalent to FALSE; it is neither FALSE nor TRUE. It has the same effect as FALSE as the final result of evaluating a WHERE clause condition only because WHERE clauses include a row only when the condition evaluates strictly to TRUE. But if NULL were equivalent to FALSE in a WHERE clause, then negating a condition that evaluates to NULL would yield TRUE; instead it remains NULL, which is still not TRUE.
No, because NOT NULL/UNKNOWN is still NULL/UNKNOWN.
I realized earlier I was using the term NULL, but going forwards let's use the ANSI SQL concept where the null state of a Boolean is called UNKNOWN. You'll have to forgive me for using the term NULL this far, but in my defense the concept of NULL column-values and UNKNOWN boolean expression results are pretty intertwingled.
SELECT * FROM foo WHERE NOT (1 = NULL)
resolves to
SELECT * FROM foo WHERE UNKNOWN
which is the same value that
SELECT * FROM foo WHERE 1 = NULL
resolves to.
So the WHERE clause is treating UNKNOWN/NULL as equivalent to false. The rest of the Boolean algebra is not.
SQL likes to have it both ways. Sometimes UNKNOWN/NULL is equivalent to false sometimes it's not.
It does this because UNKNOWN/NULL Booleans are incredibly inconvenient and tedious and sometimes they'd rather not think about them.
I'd like to do that as well, but this hoary old language won't give me the same permission it gives itself.
> And in many contexts in SQL, NULL is treated as equivalent to false, such as within a WHERE clause.
I don't think any databases treat `NULL` as `FALSE` in the WHERE clause. `SELECT * FROM foo WHERE bar = NULL` doesn't return rows with a NULL in the bar column. `SELECT * FROM foo WHERE bar != NULL` doesn't return rows without NULL in the bar column. `SELECT * FROM foo WHERE (bar = 'a') = NULL;` doesn't return rows where bar is not equal to `a`[1]. As far as I know every DB treats NULL as what it is, an unknown value.
It also doesn't to my mind violate the reflexive property because NULL is not equal to anything. It is a marker for an unknown value, not a value in and of itself. If you have a database of every person in a room and what color shirt they're wearing, and in your database, Alice and Bob both have NULL in their "shirt_color" column, that does not mean that Alice and Bob have the same color shirt. Nor does it mean that they don't have the same color shirt. Nor does it mean that someone with a green colored shirt has the same color shirt as Bob or Alice. It doesn't mean they don't have a shirt either. It means you don't have a record of/don't know what color their shirts are. You can't violate the reflexive property because you can't say what color shirt they have. You're not doing `X = X -> false`, you're doing `X = UNKNOWN -> UNKNOWN`
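The claim that no predicate involving NULL matches anything can be spot-checked in SQLite (table name and values are made up):

```python
import sqlite3

# Each predicate evaluates to UNKNOWN on every row, so none of them match.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo(bar TEXT)")
con.executemany("INSERT INTO foo VALUES (?)", [("a",), ("b",), (None,)])

counts = {}
for pred in ["bar = NULL", "bar != NULL", "(bar = 'a') = NULL"]:
    counts[pred] = con.execute(
        f"SELECT COUNT(*) FROM foo WHERE {pred}").fetchone()[0]
print(counts)  # every count is 0
```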
> That's the thing that's being treated as FALSE. That UNKNOWN. Not the value of NULL itself.
I think the difference between how we're looking at this is for me there is no "value" of NULL. NULL has NO value which is why you can't do `1 = NULL` or `1 <> NULL` and have to specifically use `1 IS NOT NULL` or `1 IS DISTINCT FROM NULL`
>That's not how "=" works. If you want a relationship for testing equality
>[that] handles unknown, don't call it equality.
>Basic properties of equality, from Wikipedia
>https://en.wikipedia.org/wiki/Equality_(mathematics)
>- Reflexivity: for every a, one has a = a.
>- Symmetry: for every a and b, if a = b, then b = a.
>- Transitivity: for every a, b, and c, if a = b and b = c, then a = c.
Sure, that's all well and good, but equality only can work IF you know what the values are on either side. NULL means you don't know what a value is and therefore you can't answer whether or not a = NULL because NULL isn't known yet.
Or let me put it another way: is the launch price of the iPhone 17 != $799? The answer is neither true nor false, because the launch price of the iPhone 17 is unknown. We can make reasonable guesses, but if I give you a database with entries for iPhones from the first model and placeholder rows for the 17, 18, 19 and 20 models, none of them belong in the list when someone asks "which iPhones cost $799 at launch?" But it's equally true that none of them belong in the list when someone asks "which iPhones did not cost $799 at launch?"
Except that NULL is not the same as UNKNOWN! NULL is a data value (like integers) that can appear in data expressions (like NULL + 1) and comparisons (like NULL = 1) whereas UNKNOWN is a truth value that can appear in boolean/logical expressions constructed from logical connectives like AND, OR, NOT.
A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.
A comparison evaluates to a truth value, and usually when a comparison involves a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` return nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).
What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.
And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html
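The two "returns nothing" queries from above, verified in SQLite:

```python
import sqlite3

# Both the "=" and "<>" comparisons against NULL yield UNKNOWN, which emits no rows.
con = sqlite3.connect(":memory:")
eq = con.execute("SELECT 3 WHERE NULL = NULL").fetchall()
ne = con.execute("SELECT 3 WHERE NULL <> NULL").fetchall()
ok = con.execute("SELECT 3 WHERE NULL IS NULL").fetchall()
print(eq, ne, ok)  # [] [] [(3,)]
```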
Indeed, they're not identical - that's why I just said "based on", and that's likely why the word UNKNOWN itself isn't used in SQL.
Nevertheless I find it a useful intuition pump. I wager that most people reading `UNKNOWN = UNKNOWN` or `UNKNOWN <> UNKNOWN` and thinking about the examples above would stop and say, "Wait, I actually don't know the value of that statement for sure either, since the LHS and the RHS could be completely different things," and would then double check what their SQL dialect would actually do in this situation.
That leads to an even more confusing point: some systems (at least SQLite) overload NULL to mean UNKNOWN, for example `SELECT 1 WHERE NULL = (NULL = NULL)`.
And the dangerous thing about NULLs is not when they are explicitly used in the query as a literal (as I did for brevity), but when they appear in tables. It's perfectly reasonable to assume `SELECT COUNT(*) FROM t;` should be the same as `SELECT COUNT(*) FROM t WHERE t.x = t.x OR t.x <> t.x`, but they are not the same, because the latter does not return NULL rows. This has led to real query optimizer bugs. For more examples, see this paper: https://dl.acm.org/doi/10.14778/3551793.3551818
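The `t.x = t.x OR t.x <> t.x` pitfall, sketched in SQLite: the predicate is UNKNOWN on NULL rows, so the two counts differ.

```python
import sqlite3

# A "tautology" that isn't: NULL = NULL OR NULL <> NULL is UNKNOWN, not TRUE.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

all_rows = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
tautology = con.execute(
    "SELECT COUNT(*) FROM t WHERE t.x = t.x OR t.x <> t.x").fetchone()[0]
print(all_rows, tautology)  # 3 2 -- the NULL row is silently dropped
```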
This is the mental model that I use and teach: a NULL is a virus that will likely make everything it touches NULL... unless cordoned off by a COALESCE or IFNULL or something else that explicitly speaks about its NULL-sensitive behavior in its documentation.
So if you see a bug where a WHERE or FILTER clause is unexpectedly failing, and you're in a time crunch, rather than pounding your head against the wall, start sprinkling COALESCE statements around (while being thoughtful about the domain meaning of what you're doing) and it's more likely than not to solve your problem!
> select null = null; returns NULL, because each NULL is basically a placeholder representing any “unknown value”. Two unknown values are not necessarily the same value; we can’t say that they are equal, because we don’t know the value of either of them.
Agreed with all of this, it would probably have been better if they were named `unknown` instead of reusing the `null` keyword.
Note also that since PostgreSQL 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.
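The default behavior can be sketched in SQLite: a UNIQUE column happily accepts many NULLs, because NULLs count as distinct for uniqueness. PostgreSQL 15's `UNIQUE NULLS NOT DISTINCT` flips that and would reject the second NULL.

```python
import sqlite3

# Default SQL behavior: NULLs are distinct for uniqueness, so duplicates are allowed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE u(v TEXT UNIQUE)")
con.execute("INSERT INTO u VALUES (NULL)")
con.execute("INSERT INTO u VALUES (NULL)")  # no constraint violation

n = con.execute("SELECT COUNT(*) FROM u").fetchone()[0]
print(n)  # 2
```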
But the "known absent" value is going to be different for different domains. For example, in EEOC databases the "known absent" value for a race would be "declined to answer". In a database of test scores, it might be "didn't complete", but it could also be "was absent from class on exam day", so SQL can't specify what that is. On the other hand, "this value is unknown" can use the same marker in all domains, and SQL chose NULL as that marker. To be completely strict about it, "have a value/don't have a value" is one piece of data, if that's something you care about, and "what is that value" is another one. So in an ideal system, you should have a column for "value is [present | known absent | unknown]" and a separate column for the actual value when the value is present.
Most of the time it's not that important, and people can and do shortcut "null" to mean "not present", but then the issues with using null in equality statements are a result of taking the shortcut, not necessarily of the logic around null.
It is encoded as such. That's why most columns are made nullable. It's crazy to say you need to use the full power of a 1:N relation with some child table when you know N cannot be greater than 1, when a nullable column already exactly encodes a 1:(0..1) relation. I'm not trying to shill for null here: one of null's great problems is exactly the fact that null can represent "unknown", "known absent", "not applicable", or even some sentinel-ish "other" escape condition, each with their own subtle differences in how they should be handled.
Null has tons of problems, of course. But it's patently absurd to claim that you "ought to be" making a different 1-column table with a unique foreign key or shared primary key for every column that might be absent, because of some vague appeal to the fact that you can write it as a 1:N relation with N<=1. You can just as easily claim that every non-nullable column is a 1:N relation where N==1 and "should be encoded as such". It is encoded as such! That's what a column is!
Best way to think of NULL is "no value." Not "unknown value," as that implies it is a value, you just don't know which. Not "missing value," as even the notion of a value being "missing" tells you something. NULL is no value. It's like a black hole: it consumes anything it touches. Any expression involving a NULL term becomes NULL.
And to further apply semantics to this just to be snide, the value is only "missing" if it could possibly ever be defined for that tuple. There are cases where you expect the value to be "missing", and it not being "missing" would be considered a data integrity issue. Fun.
UNKNOWN isn’t always correct though. Let’s say your data is input by users filling out a form and some of the fields are allowed to be left blank. NULL captures both the case where the user intentionally left the field blank but also the case where they accidentally skipped that field.
So NULL can capture multiple distinct concepts: unknown values (say, as a result of a calculation), not applicable (where the data has been deliberately left out), and missing data (possibly due to instrumentation or user error).
Is it really desirable to combine "unknown" and "no value" into the same identifier? They seem like very distinct concepts. Using your form example, you might have a name field that includes a spot for a suffix (John Doe III) that many people would leave blank because they don't have a suffix. In that case you should encode it as a NULL, but if they fail to fill in their family name then that's UNKNOWN.
If you do it this way you can avoid some confusion as to what NULL represents.
John Doe NULL IS EQUAL John Doe NULL
John Doe NULL NOT EQUAL John Doe UNKNOWN
John Doe UNKNOWN NOT EQUAL John Doe UNKNOWN
Determining if any particular input is NULL or UNKNOWN is a tricky problem, but at least this gets the programmer thinking about it up front and structuring their code to behave in the sanest possible manner.
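The proposed comparison rules above can be sketched in Python (class names are invented for illustration, and the comparisons are simplified to plain booleans; real SQL comparisons would yield UNKNOWN rather than False):

```python
class KnownAbsent:
    """The proposed NULL: we know there is no value (e.g. no name suffix)."""
    def __eq__(self, other):
        # Two known-absent values are the same: both definitely have no value.
        return isinstance(other, KnownAbsent)

class Unknown:
    """The proposed UNKNOWN: there may be a value, we just don't know it."""
    def __eq__(self, other):
        # An unknown value never compares equal, even to itself.
        return False

NULL, UNKNOWN = KnownAbsent(), Unknown()

print(NULL == NULL)        # True
print(NULL == UNKNOWN)     # False
print(UNKNOWN == UNKNOWN)  # False
```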
Have you ever seen SQL? Blowing lots of bytes on a query was not a concern. They could have made a much more compact syntax with little effort, but it wouldn't have looked kind of like English the way SQL does.
You could imagine a world where instead of:
SELECT ( email, name, outstanding_balance ) FROM accounts WHERE outstanding_balance > 0 AND last_payment > 60 ORDER BY name
There were plenty of contemporary languages that looked like the latter, but few survive to this day. SQL's relative verbosity was not seen as a problem then and is definitely not one today.
Besides, if the verbosity was a problem they could easily shorten it to UNK. That would have been perfectly normal.
I don't have an issue as such, I was a fairly experienced developer first time I had to dabble with SQL, but sometimes it can still surprise.
For example I learned the hard way that the DB we use at work does not index NULL values.
And once in a while if I'm tired or stressed I might forget about UNKNOWN and thus that "Col <> 42" does not return rows where Col is NULL.
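A quick demonstration of that trap, with SQLite via Python (hypothetical table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(42,), (7,), (None,)])

# "Everything except 42" -- but the NULL row is NOT returned, because
# NULL <> 42 evaluates to UNKNOWN, which WHERE treats like FALSE.
rows = con.execute("SELECT col FROM t WHERE col <> 42").fetchall()
print(rows)  # [(7,)]

# To include NULL rows you must ask for them explicitly:
rows_with_null = con.execute(
    "SELECT col FROM t WHERE col <> 42 OR col IS NULL"
).fetchall()
print(rows_with_null)  # [(7,), (None,)]
```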
Not that better naming would prevent such surprises, but I still think the current naming is less than optimal from a pedagogical perspective. At least I see this at times when teaching our support folks SQL (many have domain background and not a technical background).
Just to clarify, I'm not advocating to introduce a new `unknown` keyword. I'm saying that the existing `null` in SQL was not named properly and that the name `unknown` would have been more fitting. SQL's `null` already has the semantics of `unknown` as explained in the part of the article that I quoted.
SQL NULL is not "exactly equivalent" to unknown. E.g. in an outer join, there's nothing unknown about the result that is missing a row from one side, yet SQL semantics is to fill it with nulls.
In practice, it behaves as "unknown" in some contexts, as "missing value" in other contexts, and sometimes it's just plain WTF like SUM() returning NULL rather than 0 if there are no rows.
The term "null" comes from C. A. R. Hoare, who in November 1965 proposed various additions to the programming language ALGOL 60, including pointers (called references by Hoare; the word "pointer" comes from IBM PL/I, July 1966) and a special value "null" for references a.k.a. pointers, denoting unknown or not-applicable values.
C. A. R. Hoare, 1965-11:
"In order to enable references to represent partial functional relationships, i.e. ones which do not necessarily yield a value, a special reference value null is introduced.
This value fails to refer to a record, and any attempt to use it to refer to a record leads to an undefined result."
In his proposal, Hoare followed the COBOL 60 terminology for some of the additions to ALGOL, i.e. "record" instead of the word "structure" (introduced by IBM PL/I in December 1964) and "record class" instead of "structure type"; hence the "class" of SIMULA 67, from where the word "class" spread into all OOP languages.
In Hoare's proposal, references a.k.a. pointers could point only to values belonging to user-defined types, i.e. records a.k.a. structures, not to primitive types.
I wouldn't necessarily define `null` as "unknown" -- it's just "no value" -- which is nearly the same thing, and also somewhat equivalent to "unset". But null pointers aren't really "unset": pointers aren't initialized to null in C, and you can explicitly set a pointer to null.
E. F. Codd added nulls to the relational model in 1970, so that does pre-date C. The concept is even older than that, I imagine.
In nth normal form, you can't have 'no value'. That would mean your model is wrong. In academic relational data books, null does mean "unknown". There is a value, we just don't know what it is (yet).
If there might actually not be such a value, you're supposed to change your schema to reflect that.
> There is a value, we just don't know what it is (yet).
In all my years, I've never used null for that. If I don't have a value yet then generally I'm not writing any part of the record. I only ever use nulls for the absence of a value. Creating a new table for every potential optional column is the "academic relational way" and also insane. :)
What happens if your data is produced by some automated process such as a sensor reading and occasionally the sensor fails to return a value? NULL seems exactly the appropriate value to use.
I'm still a bit confused. Suppose you have another table, call it temperatures with columns id and temperature, where every row contains only a valid temperature (no NULL records), and you have a main logging table with date and temperature_id so that you can join on temperature_id = temperatures.id. This seems to be what you mean, with a canonical measurement record table related via the temperature_id foreign key.
But then if your sensor fails to record a measurement don't you end up with NULL for that row's temperature_id?
Ah but then how do you record that the measurement actually happened but did not produce a value? I want a record of that failure.
I mean sure, you could do yet another table. But honestly that level of normalization is much more work than it's worth. Just because it's some academic definition doesn't make it right.
You'd also have to ask when NULL came into common use in C (to which I do not know the answer). AFAIK NULL was not considered to be standard until C89. As far as I'm aware, all C compilers implement it as a #define in a standard header somewhere; it's not actually part of the core language itself.
I wonder who first added that macro? Was it there from the beginning?
Simplicity. PG is often overkill for simple apps, where MySQL/Maria/et al is easier to maintain, or even SQLite for very simple apps where zero DB maintenance is preferable.
The main pain point for smaller apps is that every major Postgres version requires an explicit migration of the underlying data representation. I get why it's there, but for simpler apps I would appreciate a flag to do it transparently.
It's not client side, it's server side. The binary format between Postgres major versions is generally not compatible so you need to do a pg_dump from the old database and do a pg_restore into the new one. pg_upgrade [1] can let you bypass this by migrating the binary data directly, but still requires having both the new and old postgres versions installed. There's also things you can do with replication, but since we're talking simplicity I don't think that really applies.
Personally I think upgrades are the one thing MySQL has on Postgres at this point.
It requires manual interventions because the upgrade process is basically dump + restore. MySQL and MariaDB upgrade between major versions automatically — you simply install the next version (or change the version tag in your container definition) and restart the server.
Usually it takes almost no time, although it might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0, when Oracle completely rewrote the data format (making things like atomic DDL possible).
The GP is complaining about the need to run the upgrade-cluster command and reconfigure your server when you want to use a different version of Postgres itself. MySQL does it transparently if you just open the database in the new server.
Of course, lots of people also think it's a feature. But those aren't very vocal. Anyway, this is a main reason why people keep using old postgres versions, they have to think about upgrading, and they often don't.
You need to dump the database on the old PostgreSQL version and then load the dump on the new PostgreSQL version. Some other database servers can just use the old table data or migrate the data files on the fly transparently.
If your data's large and changing enough that you have to care about vacuuming, any reasonable database is going to require some tuning, tending and management.
I'd posit that only a tiny fraction of PostgreSQL users have to know or care that vacuuming is a thing, because the autovacuum defaults handle it for them.
Sure, it's never going to be plug and play, but it doesn't mean that all the issues will be equivalent. Vacuuming doesn't really have an equivalent in say, MySQL. It's something you don't have to worry about if you use the latter.
For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with postgres than with MySQL. Postgres also has a lot of benefits obviously, though.
Also the reverse — MySQL et al support much more complex replication topologies out of the box, including multi master for the very rare use case when you need it.
It's also much easier to tune, most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.
SQLite has its own use cases. And MySQL was all-around better than Postgres in the past, so it still has more usage in some areas. Nowadays I'll take Postgres over MySQL by default, but it's not a huge difference.
> ... and this is even less obvious if you’re used to using ORMs.
Which is why I continue to be such an ORM skeptic. I agree that they're convenient. But I do worry that we've now got an entire generation of engineers who regularly interact with relational databases, but have largely been spared the effort of learning how they actually work.
As another commenter pointed out, if you've learned basic relational algebra then the way SQL nulls behave seems obvious and logically consistent. The logic is the same as the logic behind the comparison rules for NaN in IEEE floats. It's the behavior of C-style nulls that is, always and forever, a billion-dollar mistake.
Stop thinking of ORMs as trying to hide the details of SQL and you'll stop hating them. Instead think of them as a way to compose relational queries dynamically, with the full power of your primary language, instead of inside of database stored procedures in a language totally devoid of any support for basic software engineering best practices. They shouldn't be hiding SQL from your primary language, they should be exposing the relational model to it! SQL is not the only possible implementation of the relational model, and it's not even a particularly good one. Even SQL's founders don't think it implements EF Codd's relational model very faithfully. Good ORMs act as a domain-specific language for the relational model embedded inside the parent language.
That's an important one. It would be super nice to have a SQL dialect that works more like LINQ where you can compose your queries easily. I always hate it when I have to write SQL directly. It's super powerful but the syntax just isn't designed well. To me it feels like a throwback to the good old FORTRAN or COBOL days: you can get stuff done but modern languages are so much better.
>They shouldn't be hiding SQL from your primary language, they should be exposing the relational model to it!
But this has never been their primary purpose and it's not what they are good at. ORMs are supposed to map the relational model into an object oriented model so that you can work with objects rather than sets of tuples. And that's exactly how people use them.
ORMs incentivise people to replace simple and declarative set operations with complex procedural code operating on individual objects.
ORMs are just a terrible idea - conceptually messy, hard to debug and optimise, full of needless complexity.
Exposing the relational model to be manipulated dynamically within the parent language is exactly what LINQ in C# is. That was its primary purpose. True, LINQ is not itself an ORM -- it was built to support other ORMs like LINQ to SQL and Entity Framework, which aren't as "pure" on this subject. I don't actually like the LINQ syntax that much since it's not as extensible, but its existence is proof that the C# team did in fact intend to expose the relational model to C#.
Entity Framework did try to cater to the "SQL is scary, let me use objects" crowd, and that is the majority of how it's used, and that is a mistake in my opinion. But it is also very good at supporting relational algebra within C# and composing queries dynamically; ironically, it's best at it if you disable or avoid many of its features (dynamic subclassing, linked objects). Pass IQueryables around and compose them together and life is good. Updating needs work, but updates have always been a weakness for SQL too.
yeah I mean, isn't all of this true for ANY abstraction? Once you're off the beaten path, they're all hard to debug and optimize, they introduce extra complexity, etc. BECAUSE they are attempting to abstract away certain details
This is true for an HTTP library as much as it is an ORM.
If you can avoid learning SQL or the underlying DBMS then great, I have no problem with people ignoring things that are ignore-able. Problem is you can't. You will need to learn the DBMS, and the ORM will end up just getting in the way of what you really want to do.
I've used ORMs extensively in my career, and I've seen it trotted out as a Boogeyman here a million times. Why do I still prefer writing queries with a good ORM over awkwardly composing SQL by conditionally concatenating strings? Is buggy, error prone string concatenation with a bunch of if statements the true way? What am I missing?
My experience with ORMs is that most of the time you end up needing to write some queries by hand, in raw SQL. Usually these are the most complex, that you can't express in your ORM's DSL. My point being, I don't think using an ORM really shields you from having to learn how it works behind the scenes.
It's not even about having to write SQL by hand. In an ORM like Django that's exceedingly rare. But you still need to understand what's going on underneath. In other words, it's the most leaky abstraction there is. I think the popularity is mostly aesthetic and convenience. Most people into ORMs like Django don't really know about layered architecture and that you can keep all your SQL in one place in the data access layer. They just scatter that stuff everywhere in the codebase.
I don't know Django specifically but I'm always floored by how people talk about ORMs. They're only a leaky abstraction if you believe their point is to shield terrified junior devs of the inner workings of the scary relational database. That's an awful way to use ORMs, and the source of most of the flak they get. To be fair, some are designed that way, or at least strongly push you toward it.
Stop thinking of ORMs as trying to hide the details of SQL and you'll stop hating them. Instead think of them as a way to compose SQL dynamically, with the full power of your language. SQL is an awful language to write application logic in, because it has horrible support for abstraction, composition, encapsulation, dependency injection, etc. The ORM gives you a way to produce SQL in an environment that actually supports basic software engineering principles. Scattering ORM logic everywhere in the codebase is the point: putting all your SQL in one data access layer is like putting all your arithmetic in one calculation layer. Why would you ever do that? What's wrong with seeing a plus sign in more than one file? What's wrong with seeing language-encoded relational logic in more than one file?
I can guarantee you the popularity is not "aesthetic". And convenience is a real thing that actually does reduce costs. People complain about ORMs, but have you seen the absolute horse-shit-level code that people jam into SQL functions and procedures to do the utterly most basic things? The standard for what ends up in SQL Stored procedures is the most unmaintainable garbage in the entire software engineering ecosystem.
> I don't know Django specifically but I'm always floored by how people talk about ORMs.
Django seems to be an outlier on the side of "an ORM that's actually good". Whenever people have specific technical complaints about ORMs, it's generally not a problem or there's already a solution in Django you just have to use. It's usually only when you get to the more conceptual stuff like object-relational mismatch that such complaints tend to apply to Django.
Are you thinking of something like LINQ or SQLAlchemy Core? You do not need to use an ORM to interface with a SQL database with basic software engineering principles.
The reason for a data layer is because the underlying data representation might change. For example, you might change the schema to handle some new performance requirement. Now you have to hunt down and change everywhere you've queried that table in your whole code base. Every time you directly call SQL you are coupling your code strongly to the database schema which is ultimately an implementation detail.
> And convenience is a real thing that actually does reduce costs.
I know convenience is a real thing. I also know that it very often increases costs in the long run.
> Are you thinking of something like LINQ or SQLAlchemy Core? You do not need to use an ORM to interface with a SQL database with basic software engineering principles.
Well you certainly can't do it all in SQL, because SQL doesn't support basic software engineering principles at all. That means you're picking a parent language that does support the principles you want, and attempting to build a library that allows you to build, pass around, and compose relational queries in a way that leverages the benefits of the parent language. To do this I'd argue you need these things:
1. a structured representation of the (possibly incomplete) relational query as it's being built, e.g. SQLAlchemy's internal representation or IQueryable in C#
2. a method of building those representations in a way that conforms to the norms of the parent language and leverages its benefits as much as possible, e.g. LINQ or direct use of the various extension methods on IQueryable. It seems like you could pick either SQLAlchemy or SQLAlchemy Core depending on whether you want to think in terms of objects or queries (I'd usually pick queries), but I'm not that familiar with SQLAlchemy.
3. a method of translating the query representation into a particular SQL dialect.
I don't know what exactly your definition of "ORM" is, but I'd argue that if any of those steps involve an object-level representation of the schema of your tables (e.g. "class Employee { int ID }" type of thing) then it's an ORM. Do you need that? Well, no, probably not, but why not? C# is strongly typed, so why wouldn't I tell its type system what the schema of my table is? That's part of leveraging the benefits of the parent language.
> The reason for a data layer is because the underlying data representation might change. For example, you might change the schema to handle some new performance requirement. Now you have to hunt down and change everywhere you've queried that table in your whole code base.
This is an argument for strong typing, not for putting all your code that has the word "select" in it in the same file. And building queries with a strongly typed representation of your schema in the parent language is the point of the ORM!
Without an ORM, you still have to hunt down tons of spots where things change when you change your schema. How do you ever not have to do this?
> Every time you directly call SQL you are coupling your code strongly to the database schema which is ultimately an implementation detail.
There exists some canonical relational model of your domain. This model is not an implementation detail, it is the foundation of most of your code. I choose to keep my SQL schema as close as possible to this canonical relational model as I can. The fact that a Monitoring Instrument has an Operating Status is a fundamental fact about my model; whether that column is called "operating_status" or "OperatingStatus" or "operating_status_ID" is an implementation detail. It's just hard to agree with you that "the database schema" is merely implementation detail -- clearly parts of it are not, unless it's completely untethered from your canonical domain model (which would be a nightmare, of course). Of course your code is strongly coupled to the fundamental domain model upon which it is operating. I'd certainly agree that it'd be nice to not have to worry about implementation details like the exact naming convention of the columns -- which, of course, ORMs let you do.
>> And convenience is a real thing that actually does reduce costs.
> I know convenience is a real thing. I also know that it very often increases costs in the long run.
Code that is easier to write is generally also easier to read and maintain. That's what I mean when I say "convenience". Simplicity and convenience go together.
I was actually going to comment on how breathtakingly ignorant the blog post author was for making this statement. What on earth ORM would make a UNIQUE constraint against multiple columns without an explicit instruction, much less where some of the columns are NULLable and then leave the user entirely unaware that anything like that was happening? ORMs do nothing like this and the developer would always have built this constraint explicitly and therefore is working at the DDL layer. ORMs have nothing to do with whether or not someone knows SQL just like cars have nothing to do with whether or not someone knows how to walk.
When the null concept was introduced to me in the seventies, the only thing I could say was that it would be causing a lot of unnecessary confusion in the future.
If you have missing values in your datarecord then that datarecord belongs in an exception-queue.
And now, some 45 years later, people are still discussing it like we did then.
> If you have missing values in your datarecord then that datarecord belongs in an exception-queue.
This depends on the context, no? I doubt there exists someone with a contact list on their phone which has every single field for every single contact populated.
There needs to be some way to codify that a field in a datarecord is unpopulated. Using the "zero value" for the type of the field (e.g., the empty string) is reasonable, but is this necessarily better than NULL? I reckon an argument can be made that this approach is just as likely to lead to bugs.
I'm not necessarily in favor of NULL, for what it's worth, but I can't think of an adequate replacement which doesn't reduce to "NULL in sheep's clothing".
Missing values are not always an exception. There's a reason modern languages almost universally include an Option type which may contain a Null and allow you to pass it around as a first class value. Good data representations allow you to express missing values.
Oh man. I had a talk with a DBA about how Oracle could not deal with an address with no street name - literally a tiny village with 10 houses on 1 nameless town square. It was unsearchable in parts of the app because street = '' was interpreted as street IS NULL. The DBA kept claiming Oracle was right and the town should adapt their naming to our software.
This attitude was so prevalent at the time, I sometimes wonder if the rise of noSQL was simply people sick of dealing with Oracle DBAs
> This attitude was so prevalent at the time, I sometimes wonder if the rise of noSQL was simply people sick of dealing with Oracle DBAs
That was definitely one part; another part was sharp corners in MySQL (at least as of 20 years ago; I would be surprised if many of them haven't been rounded off in the meantime). The last part was places with no DBA with developers unaware of how to handle schema migrations.
It's weirder. If you insert an empty string into a VARCHAR field in Oracle, it returns NULL back to you when you query that same field. At the very least, I'd expect a software system to behave in a deterministic way: either throw an error because you're not doing something right (whatever Oracle deems right in this case), or give you back what you gave it, especially for database software whose entire reason for existence is to persist data without side effects.
Even if we set three-value logic aside for a moment, this behavior of NULL still makes sense intuitively.
The value of NULL in a particular table cell is simply a way to indicate 'no value'. If you want the values in a column to be unique, cases where there are no values shouldn't be considered.
This plays out similarly in practice. For example, you may want to allow users to optionally reserve a username, and if they do, those usernames should be unique. It's hard to imagine a use case where by wanting a field to be both optional (nullable) and unique, you mean that the field should be optional for a single record (!) and required for all the rest. Of course, you mean that IF there is a value, THEN it should be unique.
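SQLite shows exactly this default behavior (illustrative schema; the optional-but-unique username case from above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# username is optional, but must be unique when present.
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE)")
con.execute("INSERT INTO users (username) VALUES ('alice')")
con.execute("INSERT INTO users (username) VALUES (NULL)")
con.execute("INSERT INTO users (username) VALUES (NULL)")  # OK: NULLs don't collide

# A duplicate non-NULL username is still rejected:
try:
    con.execute("INSERT INTO users (username) VALUES ('alice')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = con.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count, duplicate_rejected)  # 3 True
```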
Exactly this. SQL is based on the relational algebra and that's well-defined, NULL along with other features of SQL work in an entirely regular and predictable way. The only time it's weird is when a developer decides that it should work the way Javascript (or whatever) NULLs work because that's the last time they saw the same word used in a programming language, in which case it's the assumption that's weird.
That's not the only time it is weird. There's even a whole book by one of the pioneers of the relational DB model, Date's "Database Technology: Nulls Considered Harmful" [1], covering many of the ways it is weird.
The part that’s weird with nulls is that it’s a trinary logic stuffed into a boolean algebra. The use of x = NULL instead of x IS NULL is pretty much always a mistake.
More importantly, x = value instead of (x = value and x IS NOT NULL) is almost always a mistake, and a stupidly subtle one at that. And for this curse, we get… nothing particularly useful from these semantics.
Even null in programming languages isn't so bad if it's a distinct type. The problem with null in languages like Java is that null is part of every reference type (C's pointers are another world of broken, null being basically just another unsafe invalid address).
Most languages nowadays do get nulls right, even PHP of all things.
Ironically NULL is probably the safest pointer value in C, as any dereferences from it (and thousands of pages that follow it) are guaranteed to crash.
Well, that's a problem on many of the devices I've seen: zero is a valid memory address and dereferencing it does not cause any kind of crash. In fact some hardware requires reading or maybe even writing to that address. In an age of virtual memory there's no reason why zero should cause a crash and it wastes an entire page of memory for every application to make that happen, if it does.
I haven't had the chance to work on any MMU-less devices, but I don't quite follow your remark about wasting a page. Crashing is just the default behaviour in the absence of a mapping and consumes no resources.
The granularity of virtual address mapping is usually a page. On many systems, that's 4 kilobytes of address space. In order to trigger a fault when the address 0x0000000000000000 is dereferenced, it's necessary to map the entire address range from 0x0000000000000000 to 0x0000000000000fff to the same faulting behaviour.
Yeah but there is no "page" there. Assuming a process starts with an empty page table (meaning every single address will segfault), you don't have to do anything else to get the crashing behaviour for null.
Unless you're talking about some kind of device which supports virtual memory, but also by default sets up mappings, including at 0 which seems weird to me.
In an object-oriented context, "null" is useful to indicate that some object doesn't have a value for that property.
What's interesting is, do we mean that in our data that attribute has no value? Or do we mean the real-world object represented by the data does not have that attribute?
Does null mean:
a) we don't know the value of this attribute for this object, or
b) we do know that there is no value for this attribute in the real-world object represented by our data?
In JavaScript because there is both null and undefined it is easy to assume that undefined means we don't know the value and null means we do know it has no value.
EXAMPLE:
The attribute 'spouse'. Some people have a spouse some don't. So what does it mean if the value of the field 'spouse' is null? That we know there is no spouse, or that we don't know who the spouse is if any.
In practical terms we can say null means "We don't know" which includes the case that there is no spouse.
> In JavaScript because there is both null and undefined it is easy to assume that undefined means we don't know the value and null means we do know it has no value.
JavaScript objects have two kinds of undefined that are both represented by the same value: a property explicitly set to undefined, and a property that was never set at all. You have to use another mechanism, such as the "in" operator, to tell which one you have.
But in a _relational_ database lack of spouse would not be modeled with a nullable column "spouse" but rather an absence of a spouse row/relation. Which is very real-world-like.
As a sort of challenge I had an idea of building an app using SQL as a pseudo ECS system where every table was either a 1 column table with only an id or a 2 column table with an id and a value.
And yet when you do a join because you need to actually use that data, the resulting table will have a column with nulls in it. Any way you squeeze it, you need a way to represent empty values for your database to be useful.
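That's easy to see with a minimal sketch (Python's built-in sqlite3, made-up person/spouse tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE spouse (person_id INTEGER REFERENCES person(id),
                         spouse_name TEXT NOT NULL);
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO spouse VALUES (1, 'Aldo');
""")
# Neither table has a nullable column, yet the join result does:
rows = con.execute("""
    SELECT person.name, spouse.spouse_name
    FROM person LEFT JOIN spouse ON spouse.person_id = person.id
    ORDER BY person.id
""").fetchall()
print(rows)  # [('Alice', 'Aldo'), ('Bob', None)]
```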
I remember from my databases course at university that NULL means that the database doesn't contain that data, and empty string means that it is known to be empty.
That's your professor's opinion, and probably one that does not come from industry experience. Look in 4 different databases and you'll see 9 different conventions. A common one is to have all strings non-null with a default value of empty string. And not all columns are strings; there is no "obviously empty" integer or boolean.
Let's also all be reminded about how Oracle DB doesn't let you insert empty strings, and instead treats them as NULLs even if you gave it an empty string initially.
That was a fun bug to find out, after having dealt with quite a few other DBs over the years. It was one of those "No, but surely" and "This can't be! This is Oracle!" moments. Found it while porting some old code that needed to store an empty string as being distinct from a NULL in that same column.
SQL nulls in some ways behave similarly to floating-point NaNs. Of course NaNs are also weird in their own way, but it is a bit comforting that it's not so completely, singularly weird.
NaN is cool because it's almost like a type that constrains uncertainty. What do we know about this entity? not much! but it's definitely not a number. Calling it anything else would have been a mistake.
Null is more confusing because it means different things in different languages. Sometimes it's a more constrained uncertainty, eg this definitely doesn't exist. But in sql it's a less constrained uncertainty, like "undefined" in math. The value of this thing couldn't make sense in this context, but we can't make assertions about its existence.
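The parallel (and the difference) is easy to poke at in Python, where NaN's comparison rules are on display:

```python
nan = float("nan")
print(nan == nan)  # False: like SQL NULL, NaN is not equal to itself
print(nan != nan)  # True: unlike SQL NULL, the negation gives a definite answer
print(nan < 1.0, nan >= 1.0)  # False False: ordering comparisons all fail
```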
There's another comment in here that talks about thinking of NULL as UNKNOWN, and I quite like that. It makes a lot more sense, and I think it would've been a better choice to standardize on. UNDEFINED would also be an improvement.
UNDEFINED would not be accurate. If your signup form has an optional field for a full name which I don’t fill in, I still have a name. Just because a value is not known by your database doesn’t mean it isn’t defined.
E. F. Codd thought about this issue.[0]
> Codd indicated in his 1990 book The Relational Model for Database Management, Version 2 that the single Null mandated by the SQL standard was inadequate, and should be replaced by two separate Null-type markers to indicate why data is missing. In Codd's book, these two Null-type markers are referred to as 'A-Values' and 'I-Values', representing 'Missing But Applicable' and 'Missing But Inapplicable', respectively. Codd's recommendation would have required SQL's logic system be expanded to accommodate a four-valued logic system. Because of this additional complexity, the idea of multiple Nulls with different definitions has not gained widespread acceptance in the database practitioners' domain. It remains an active field of research though, with numerous papers still being published.
> UNDEFINED would not be accurate. If your signup form has an optional field for a full name which I don’t fill in, I still have a name.
If your signup form has an optional field for middle name which I don’t fill, it can absolutely be because I don’t have a middle name. It’s undefined and known to be so.
> E. F. Codd thought about this issue.[0]
And because four-valued logic was left out, nulls have to fulfil multiple incompatible roles, and end up being weird.
> If your signup form has an optional field for middle name which I don’t fill, it can absolutely be because I don’t have a middle name.
The fact that some NULLs are indeed undefined doesn’t contradict my point that it would be inaccurate to refer to all NULLs as UNDEFINED.
> It’s undefined and known to be so… And because four value logic was left out, nulls have to fulfil multiple incompatible roles, and ends up being weird.
You know it to be undefined, but if you just leave that field blank on the form, the database still can’t know whether that’s because you don’t have one or because you didn’t provide the one you have, unless there was a field on the form that allowed you to explicitly indicate that you don’t have one.
> If your signup form has an optional field for a full name which I don’t fill in, I still have a name. Just because a value is not known by your database doesn’t mean it isn’t defined.
There's the Closed World Assumption in a database.
NULLs are weird because they are basically two different types under the same name. The 3-value logic type is useful for representing "missing" foreign keys, but 2-value logic type is arguably more useful when searching/sorting/aggregating.
I think we would have been better-off by treating FKs (and maybe outer JOINs) as a special case, and using 2-value logic everywhere else.
I don't see why this is weird. Unique means no duplicates. Nulls can't be duplicates of each other because they're not equal to each other.
If you don't like null semantics, you're free to use sentinel values. You can make all the sentinel values the same, or you can make them all different. Either way, you or someone who has to use your system will be back here tomorrow complaining about how weird it is.
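For the curious, here's how "nulls can't be duplicates" plays out in practice (a minimal sqlite3 sketch; most engines default to this behavior in unique constraints):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER UNIQUE)")
con.execute("INSERT INTO t VALUES (NULL)")
con.execute("INSERT INTO t VALUES (NULL)")  # fine: NULLs aren't duplicates
n = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
con.execute("INSERT INTO t VALUES (1)")
try:
    con.execute("INSERT INTO t VALUES (1)")  # a real duplicate is rejected
    dup_rejected = False
except sqlite3.IntegrityError:
    dup_rejected = True
print(n, dup_rejected)  # 2 True
```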
I actually like how NULLs behave in SQL. They mean "I don't know." In modern programming languages we all care about null safety, but no matter how you model your data, you will always run into situations where you don't know everything. So I believe NOT NULL is not very practical. NULLs in SQL handle these cases very well: when the input is unknown, your output is unknown.
Agreed. If SQL didn't have NULL, we'd have other special values meaning "I don't know" or "no data" all over the place.
Too many newbies hear that NULL is bad, so they declare all columns as NOT NULL and end up inserting ad hoc values like 0, -1, '', or {} when they inevitably come across cases where they don't have data. Which is even worse than NULL.
I'm glad SQL Server finally got this, but I wish the syntax was nicer. It's a multi-word infix operator that gets tough to read. I've been using Snowflake SQL recently and I like that they just made it a function called EQUAL_NULL
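For illustration, SQLite spells its null-safe comparison `IS`, which behaves like the standard's IS NOT DISTINCT FROM or Snowflake's EQUAL_NULL (sketch via Python's sqlite3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Plain = yields NULL (None here) as soon as either side is NULL...
eq = con.execute("SELECT NULL = NULL").fetchone()[0]
print(eq)  # None
# ...while IS always yields a definite 0 or 1
both_null = con.execute("SELECT NULL IS NULL").fetchone()[0]
one_null = con.execute("SELECT 1 IS NULL").fetchone()[0]
print(both_null, one_null)  # 1 0
```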
This reminds me back in the day when I was writing a DSL for a project.
Since the data we were getting was sourced from an RDBMS, I wanted NULL to be a first class concept in the DSL, with similar traits.
Early on, I simply made any expression that involved a NULL result in NULL. Naively this was all well and good, but it failed spectacularly in condition statements.
Instead of A = NULL evaluating to FALSE, I had A = NULL evaluating to NULL. And, as you can imagine, a single NULL in the expression would just pollute the entire thing, and since NULL was considered FALSE for conditionals, any NULL in an expression made the entire thing, eventually, FALSE.
Naturally I went back and made the comparison operators always return booleans. But it was a fun little side effect at the time.
This has always made queries unpredictable in many scenarios and it should be a feature to turn nulls off entirely and swap them out with Option<T> instead.
Isn't null supposed to be the bottom value in the universe of all the values that your program can recognize? Why do people need to complicate it? And yeah, by that definition `null == null`, but a `null_pointer != null`, because the null pointer is at the bottom of all the possible pointer values, and null by itself is not a pointer. The same for (0, null), (false, null) and ("", null). null should only be equal to itself.
And lastly undefined != null, because undefined relates to structures, indicating that a field was not defined when the structure was created.
And SQL null shares the feature with many other languages that any type can be a NULL (although a column can be set NOT NULL). Much like Java, it is no end of grief that a type that claims to be, say, a “NUMBER” is actually “NUMBER | NULLType”.
SQL NULL is also coerced to different boolean values depending on context e.g. in a WHERE clause NULL is coerced to false, whereas in a CHECK constraint NULL is coerced to true
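A small sqlite3 sketch of that asymmetry (made-up table; the same rule applies in other engines):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (reading REAL CHECK (reading > 0))")
con.execute("INSERT INTO m VALUES (25.0)")
con.execute("INSERT INTO m VALUES (NULL)")  # CHECK evaluates to NULL: accepted
total = con.execute("SELECT COUNT(*) FROM m").fetchone()[0]
matched = con.execute("SELECT COUNT(*) FROM m WHERE reading > 0").fetchone()[0]
print(total, matched)  # 2 1 -- the same NULL condition excludes the row in WHERE
```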
null != null is pretty bizarre at first, until you understand that the reason they did it was to try to make sense of nullable indexed columns. Not sure why we couldn't have our cake and eat it too, but instead we got IS NOT NULL, which is not the same as != NULL.
Ah yes, someone discovering the existence of three value logic in SQL and expecting 2VL behavior. Classic. We've all been there, right?
Personally I wish more languages were like Python or Ruby and had chosen None or Nil over Null, which alleviates the confusion a bit, as those names better indicate that it's NOT an "unknown" (1 unknown value != 1 other unknown, which intuitively makes sense). In Ruby or Python it's more obvious that None and Nil are "nothing" types, and therefore equivalence makes sense (nil == nil and None == None are both true).
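The contrast in one place (Python plus its bundled sqlite3):

```python
import sqlite3

py_eq = None == None
print(py_eq)  # True: Python's None is a definite "nothing", equal to itself

con = sqlite3.connect(":memory:")
# SQL's NULL is an "unknown": comparing two of them yields NULL (None here)
sql_eq = con.execute("SELECT NULL = NULL").fetchone()[0]
print(sql_eq)  # None
```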
I think (blasphemous hot take ahead) that the standards of implementation of relational models are wrong.
NULLs still have their (rare) place, but the foremost issue with query results is that they are tabular rather than hierarchical. The main culprits being (1) outer joins that represent or induce nonsensical operations and (2) lack of non-null “zero” values for types like date. Of course hierarchies can make querying more complex, but mostly in cases where the relational logic goes crazy itself and you had to go tabular anyway.
If you think about it, distinct, group by and windowing feel like workarounds in tabular mode but would be natural with hierarchies, because everything is naturally distinct and grouped-by by design, and windows are basically subtables in these rows.
Bonus points you could fetch “SELECT FROM a, b_rows LEFT JOIN b AS b_rows …” in a single query without duplicating `a`s and nullifying `b`s when N <> 1. And when you aggregate through a column in `b`, there’s no headache what to do with join-produced NULLs (unless `b` columns are nullable by your design, then it’s on you). And when it all arrives to a client, it’s already well-shaped for ui, processing, etc. No more:
let last_a_id = undefined
for (const row of rows) {
  if (row.id != last_a_id) {
    …
    last_a_id = row.id
  }
  …
}
I’m pretty sure you recognize this programming idiom immediately.
Before you criticize, I’m not talking about hierarchical/OO tables. Only about ways of getting and handling query results. You still can reshape a relation like you want. The difference is that a database engine doesn’t have to put it all onto a (N x M x …) table and instead creates sort of a subset of relations which is efficient in space and natural to walk through. It already does that when walking through indexes, selects are naturally hierarchical. All it has to do is to track relations it went through rather than just dumping rows from a set of cursors that it knows the start and end points of, but loses this knowledge by writing into a plain table.
I have deep but vague thoughts around the concept. My first intuition is that we have put too many things under NULL and None and such.
Partially, we use ... and I'll be very broad here ... "variables" as boxes we look in for answers. Answers to questions, answers to "Hey I put something in there for the time being to reference later." If I went into programming terms rather than just SQL, sometimes we get meta-answers.
You haven't made the box yet (declared the variable).
You haven't decided how the box is structured (picked a type or a length or something).
Okay, you did those but the box is virgin (nothing has been placed in the box yet).
That kind of thing. An empty set for "yes, you asked but nothing meets those criteria."
NULL is the absence of a value. If you try and treat it as a value, you're going to have a bad time. So an attempted UNIQUE(email_address, deleted_at) constraint is fundamentally flawed. If you treated NULL as a value that could be unique, you're going to break foreign keys.
But let's continue the logic of deleted_at being NULL indicating an active account, which seems to the intent here. You end up doing things like:
SELECT /* ... */
FROM accounts
WHERE email_address = '...'
AND deleted_at IS NULL
Depending on your database, that may or may not index well. More problematic, you may end up with privacy leaks if someone forgets the last conditional.
If anything, you want to reverse this so someone has to go out of their way to explicitly select deleted accounts. There are multiple strategies for this eg using an active_accounts view or table.
Lastly, there are lots of potential reasons for an account to be disabled or otherwise not visible/accessible. Takedowns, court orders, site safety, hacked accounts and so on.
Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
> Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
Then don't do that. It's kind of a leap to say soft deletes are categorically bad because someone might confuse "deleted" with "inactive". My users table does the super-advanced thing of having both columns. The ORM also doesn't forget to add the not-null criterion. There are also zero databases in active use where it poses a problem for indexing.
Soft deletes suck in their own way, but none of the alternatives are perfect either.
You put the "is not null" on the index itself and then simply don't use it for the much rarer queries that are on deleted accounts. Or just use a view for active accounts.
Overloading timestamps to carry a boolean on null is awesome as long as you decide that's what you're doing and use one of the several standard techniques to dodge the easily avoided potential downside.
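One of those standard techniques is a partial unique index; a minimal sqlite3 sketch with a made-up accounts table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (email TEXT NOT NULL, deleted_at TEXT)")
# Enforce uniqueness only for live rows (deleted_at IS NULL)
con.execute("""CREATE UNIQUE INDEX one_live_email
               ON accounts(email) WHERE deleted_at IS NULL""")
con.execute("INSERT INTO accounts VALUES ('a@example.com', NULL)")
con.execute("INSERT INTO accounts VALUES ('a@example.com', '2024-01-01')")
con.execute("INSERT INTO accounts VALUES ('a@example.com', '2024-06-01')")
try:
    # A second *live* row for the same email is the only thing refused
    con.execute("INSERT INTO accounts VALUES ('a@example.com', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```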
This isn't a valid security concern, any more than any other incorrect SQL query would be. A dev can always write a bad query; you need another way to address that, and it's not more likely here because of the null.
The NULLs-in-unique-constraints quirk actually works differently in Oracle databases, which is infuriating to say the least. Apparently this comes from some ambiguity in the SQL standard; does anyone know more about this?
The literal definition of distinct is:
>recognizably different in nature from something else of a similar type.
If you want to get down to it nothing is "equal" or the same.
Is a temperature measurement of 25C the same as another of 25C? No: these measurements are approximations of the actual values, which are actually not equal to each other. They are distinct; they have just been lumped into the same 25C group due to the resolution of measurement. Yet equality works just fine on that value in SQL.
I have used SQL for a long time; null handling is weird, inconsistent, and a waste of time. For all the language bugs due to the existence of null, at least in other languages I can count on null = null, instead of writing garbage like value = param or (param is null and value is null).
Irrelevant. What matters is the meaning in the context of SQL.
> weird and inconsistent and a waste of time. For all the language bugs due to the existence of null
There are necessary, semantic cases that need to be dealt with. How else would you do it?
Also, it's really weird to use "bugs" to refer to well defined and well documented behavior.
It is extremely easy, using partial uniques and the proper operators, to treat nulls as non-distinct values, and the approach we have allows some very important functionality that aligns with the standard statistical treatment of data, which was influential in how SQL logic was originally designed.
Yes, the measurements are the same.
The actual temperatures probably are not, but measurements are not the same as the thing measured.
By that logic, two unknown (null) measurements are the same regardless of the actual value, which I agree with.
Whether two unknown measurements are the same is unknown.
The values in the db are the same in both cases which is what I would like my db language to deal with and not make assumptions about what that value actually means.
I see no value in treating null as special when it comes to equality in a SQL db; in fact, in my experience it is a hindrance that it does so.
Is 0 for a temperature measurement an unknown sentinel or an actual measurement? How about 2,147,483,647? Great, probably not a measurement, but now it's always included in greater-than queries; same with the max negative value and less-than.
Null separates the value into its own distinct group and prevents it from being included in range queries, since it is not an actual numeric value, while most languages still allow you to compare equality using standard operators.
Sum types would be great in SQL, but currently we get a union of the SQL type and null. So null for sentinel values it is, except for having to use weird syntax to compare it.
A real sum type would be nice, but when you're using null then you need to accept that null was not designed with your specific use case in mind.
NULL is not a value.
NULL is a statement that a value is not available, for unspecified reasons.
If you want a particular value where a query would return NULL, it's your job to replace the NULLs with the contextually-appropriate value, e.g., using COALESCE(), to provide it.
It's a convenience shortcut that allows more complicated data models to be represented in simpler table structures than a fully normalized, NULL-free data model would require, and that provides information about missing data (which can be used with things like COALESCE, where appropriate) when a more complex data model is simplified into a result set via a query with JOINs, etc.
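E.g., a small sqlite3 sketch of COALESCE doing exactly that replacement (made-up table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (score INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(90,), (None,), (70,)])
# Substitute a contextually appropriate value for "not available" at query time
rows = con.execute("SELECT COALESCE(score, 0) FROM t").fetchall()
print(rows)  # [(90,), (0,), (70,)]
```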
> recognizably different in nature from something else of a similar type.
But anyways, the point wasn't to justify the choices of SQL but rather as a way to make intuitive sense of its logic. SQL is one of the oldest and most successful programming languages in existence, we are not going to change it, and it is not going to disappear anytime soon, so we have to go with it, like it or not. There have been some attempts at alternatives, both at changing the paradigm (NoSQL) and at cleaning up the language, which, to be fair, would be a good thing, but without much success. The relational paradigm just works, and SQL is usable enough to make the cost of switching not worth it.
Edit:
And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic. You are using it wrong basically. Sometimes, it is the right thing to do, but if that pattern starts appearing all over the place, it is usually a result of thinking "NULL is broken, I have to use this pattern to handle NULL properly". That's cargo culting, don't fix problems you don't understand by copy-pasting code you don't understand.
Note: this is not addressed to "you" in particular, there can be good reasons, no offense intended. But I think that in general, it is a code smell.
Two measurements of 25C are not recognizably different therefore they are equal, correct, regardless if the actual temperatures are not the same?
Two measurements of unknown are not recognizably different therefore they are equal in the context of the database.
Having null != null has never been intuitive to me, especially since every other programming language treats them as equal. I am not hoping this gets changed, I know SQL is too far along for that, but I can still complain about it and agree it's weird.
>And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic.
It's needed with parameterized SQL when your db doesn't support "is not distinct from", which is itself a silly way to just write '=' or '==' like a normal programming language. The distinct predicate exists for this very reason: to have yet another way to express equality that includes nulls: https://modern-sql.com/caniuse/T151
So NULLs are not comparable but they are part of the same set.
It is not the case that nulls are always the same as one another. It is also not the case that nulls are always distinct from each other. Thus, the normal rule of grouping, that same values are combined and distinct values make different groups, does not apply. Another principle is needed to determine the result of grouping with nulls.
Logic which allows for an unknown value can't be shoehorned into always giving definite true or false answers.
In the SQL spec, unique indexes consider nulls distinct by default because you're adding a tuple to the relation, and this is checked by equality.
When doing a select distinct or group by you’re not doing a boolean grouping, you’re doing a set projection. NULLs are considered part of the set of “unknown values”, so NULLs are grouped together but they’re still not equal to each other.
The behaviour is mathematically well defined, but it doesn’t match boolean logic.
I’ve been dealing with databases for well over 20 years now and I can only think of a couple of times when this behaviour wasn’t wanted, so I think it’s the right choice.
I believe this confusion comes from conflating the tool with the thing being measured. For simplicity, I will use the analogy of a record (stored as a row in the database) as an observation in a scientific experiment. If the tool was able to record a value, I enter a value like 579.13. If the tool was not able to record a value, the tool will enter NULL. I make a total of one hundred observations. Of one hundred rows, some have values and some are NULL.
Are NULLs distinct values? No, they are simply a failure in measurement; it is like asking if all errors are distinct or the same. Are NULLs part of the same dataset? Yes, because they are all observations for the same scientific experiment. What does it mean when "select distinct ..." returns several rows for known/measurable values but only one row for NULL? If this is confusing, the scientist can update the rows and substitute "UNKNOWN/ERROR" for every NULL. When you do "select distinct ...", you will get the same thing: several rows for known/measurable values but only one row for "UNKNOWN/ERROR".
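A compact version of that experiment in Python's sqlite3 (made-up readings):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (reading REAL)")
con.executemany("INSERT INTO obs VALUES (?)",
                [(579.13,), (579.13,), (None,), (None,), (None,)])
rows = con.execute(
    "SELECT DISTINCT reading FROM obs ORDER BY reading").fetchall()
print(rows)  # [(None,), (579.13,)] -- the three NULLs collapse into one row
```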
Indeed, the sqlite page the pull quote is from says as much.
Postgres lets you control that behaviour when creating the constraint (or index)
Luckily, Postgres nowadays lets you declare the behavior of each null in unique constraints, like it should be. We can expect this to creep down to the other DBMSs with time.
Making nulls distinct on a "select distinct" or a "group by" would be completely useless.
But when I see select distinct at the start of a big and complex query, I do immediately suspect that the developer might have missed some join condition down the line and “got too many rows” back from the query. And since the rows look like duplicates due to the missing join predicate, for a junior (or careless) developer, the quick “solution” is to just apply distinct in the top level query to get rid of these pesky duplicates, declare success and move on.
In a unique column normally you'll have an index, so NULL becomes a special value in an index, but in SELECT DISTINCT you probably won't have an index, which means a full scan is performed, then every row has to be compared with every other row.
The amount of false bullshit doesn't make a qualitative difference.
The only difference to make is that people should not take something as truth just because it is written in a book or a blog post, or because a person has a degree.
People think that if someone wrote a blog post with technical details and it got upvoted, they somehow must be an expert.
I quote:
"I graduated top of my class with a BSc in Computer Science [...]. I have a strong background in software engineering and technical leadership"
Which is fine! It's really hard to be truly expert in both. There's a reason why "programmer" and "database administrator" used to be two different professions. I'd like to think that I'm better than your average developer at flogging RDBMSes, but most DBAs I've worked with can still run circles around me when it comes to information modeling and database & query optimization.
It's not getting worse, you're getting better.
HN has for a long time been where I go for whatever you call the tech equivalent of watching stoners think they're having a deep conversation.
Most technical writing actually happens at the start of the technical journey; there are more people there, and it's more interesting to talk about strategy and ideas when it's not "messy" with internal details and the trade-offs you make.
I encourage folks at any technical level to write, and I encourage folks to check an author's work with a critical eye no matter how long the source has been in the industry; no amount of experience can protect you from being wrong.
Ergonomics matter.
Your link makes the same mistake I already addressed. It conflates nullable booleans with tri-state logic.
Null is not a value. It is the absence of a value.
> The SQL null value basically means “could be anything”.
This is wrong. Null means it could be any valid value but that value is unknown. If the datatype is DATE then the value cannot be boolean TRUE or the string ‘purple’.
If I’m comparing a value of type date to a null I still think it works as it should if value is “unknown”. What greater insight or context do we have if it’s a small-int null?
If a student is sick and has not taken the exam, yes you could enter -99 to represent they did not take the test. But if you want to find the class average, you would have to do something like this:
select avg(case when score = -99 then null else score end) as class_avg from …
Or you could have entered null to begin with.
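A sketch of the difference (Python's sqlite3, hypothetical scores): AVG simply skips NULLs, while a sentinel silently poisons the mean.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE exam (score INTEGER)")
con.executemany("INSERT INTO exam VALUES (?)", [(80,), (100,), (None,)])
# AVG ignores NULLs: the sick student doesn't drag the class average down
avg = con.execute("SELECT AVG(score) FROM exam").fetchone()[0]
print(avg)  # 90.0
# With a -99 sentinel instead, AVG would see (80 + 100 - 99) / 3 = 27.0
```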
Because the possible values are known.
> What greater insight or context do we have if it’s a small-int null?
The insight is that null is not a value. It’s not a smallint or a boolean or a date. It’s the absence of a possible value. The only way to see a null as tri-state is to conflate it with a nullable boolean. This is an incorrect mental model which leads to confusion.
Making them harder is not better.
https://excelbaby.com/learn/the-difference-between-empty-nul...
(and sometimes Missing)
SQL needs to join 21st century type systems... or needs to get replaced altogether. SQL is the FORTRAN of relational programming languages, but hangs around because every time somebody tries to replace it they throw the relational-theory baby out with the bath-water.
and what is an alternative to sql ... quel?
You have a list of people and you ask if they own a car. You didn't get around to asking George, so does that somehow mean he owns a car, because you are using boolean logic? Or does it mean he doesn't own a car, because you are using boolean logic?
No, it means you haven't gathered this data point and don't know.
If there are three possible values, TRUE, FALSE and NULL (unknown), then you're probably working with something like Kleene logic. You can't truly be working with Boolean logic, though, any more than you can be doing integer arithmetic when 15.37 is a member of your domain.
To put it another way, if we're talking about the mathematical definition of boolean algebra and not just some programming language's quirky implementation that happens to be called "bool", then boolean values would by definition be non-nullable. That logic that allows nulls has been implemented using the same unified set of keywords and operator names is a pragmatic decision that simplifies the language implementation and spec, not a principled one that tries to be pedantic about mathematical terminology.
That's why it's name-squatting. Rather than introduce a 'kleene' datatype and operations and let the user opt in, they decided that in our 'bool' world, x = x is no longer a truism.
This is how it should be.
> Somehow means he owns a car because you are using boolean logic?
This is how it unfortunately is. There are 3 people, and there are 3 people who don't have a NULL car. Therefore George has a car.
Elsewhere people have argued that NULL propagates, so that your small unknowns infect larger queries. I could get behind that line of thinking, but the query above confidently returns 3.
This is not what you are asking with your query: as someone else stated, NULL is meant to be "UNKNOWN", or "it could be any valid value".
So nothing is ever equal to something that can be anything, because even another NULL (i.e. unknown) value is in general different.
So in a line like "WHERE carId = NULL" the condition will always be false. Now if instead you meant to search for the rows where carId is actually unknown, you have to write "WHERE carId IS NULL", and your query will return, as one may expect, 2.
It's worse than that: something is neither equal nor not-equal to something else.
Whether you can implement something differently as a workaround is immaterial. It's weird, per the title.
You could say that a boolean column with a NULL value is FALSE like how a lot of programming languages coerce it but if you wanted that you would just make a default of FALSE. The meaning of NULL in general being "value not specified" lends itself pretty nicely to "either true or false."
You mean neither true or false?
We bring in the 'Trinity of thought' a priori and forget about the advantages and costs.
You have the principle of excluded middle, the principle of identity, and the principle of non-contradiction.
If your problem's territory fits, things are convenient.
Once you introduce known, unknown, or unknowable unknowns, the classic model falls down.
Unfortunately you have to choose what to sacrifice based on the context of the problem at hand.
This is exactly where Rice's theorm, the halting problem etc.. arise.
AC from ZF(C) and IID from statistics bring PEM, which gives or forces the classical behavior.
The 'non-trivial' properties of Rice's theorm are exactly those properties where PEM doesn't hold.
The frame problem in machine learning is also related. As is Gödels (in) completeness theories.
Sometimes you can use failure as negation, other times you can use methods like accepting that NULL is an unknown.
But you have to decide what you can sacrifice and still solve real problems.
I think thinking of a problem and its complement is better.
E.g. P is the truthy T and co-P is the truthy F. Obviously P = co-P means that the traditional method works, but jump to NP vs co-NP and it doesn't.
NP is chained ORs, co-NP is chained ANDs, is another way of thinking about it.
Being based on someone's logic is not sufficient. Most weird things are based on some (weird) logic.
1. x ∧ 1 = x (identity law for conjunction)
2. x ∨ 0 = x (identity law for disjunction)
Kleene's TRUE-FALSE-UNKNOWN logic is weird.
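For what it's worth, the three-valued connectives are tiny to write down. A minimal sketch in Python, using None for UNKNOWN (note that the identity laws above do hold in Kleene logic; what fails is the law of the excluded middle):

```python
# Kleene's strong three-valued logic, with None standing in for UNKNOWN.
T, F, U = True, False, None

def k_not(a):
    return None if a is None else not a

def k_and(a, b):
    if a is False or b is False:
        return False      # one known FALSE decides the conjunction
    if a is None or b is None:
        return None       # otherwise any UNKNOWN poisons the result
    return True

def k_or(a, b):
    if a is True or b is True:
        return True       # one known TRUE decides the disjunction
    if a is None or b is None:
        return None
    return False

# Identity laws hold: x AND TRUE = x, x OR FALSE = x, for all three values.
assert all(k_and(x, T) == x and k_or(x, F) == x for x in (T, F, U))
# Excluded middle does not: UNKNOWN OR NOT UNKNOWN is UNKNOWN, not TRUE.
assert k_or(U, k_not(U)) is None
```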
SQL nulls effectively violate the reflexive property of equality, because X=X does not result in a value of TRUE.
And in many contexts in SQL, NULL is treated as equivalent to false, such as within a WHERE clause.
So that means that X=X is effectively FALSE in SQL*.
That is a clown language.
*(yes, I have the same opinion about ANSI NaN, but that wouldn't come up so often if Javascript didn't love NaN for some stupid reason.)
NULL is not equivalent to FALSE; it is neither FALSE nor TRUE. It has the same effect as FALSE as the final result of evaluating a WHERE clause condition only because WHERE clause conditions allow a row to be included only when they evaluate strictly to TRUE. But if NULL were equivalent to FALSE in a WHERE clause, then negating a condition that evaluates to NULL would yield TRUE; instead it remains NULL, which remains not TRUE.
I realized earlier I was using the term NULL, but going forwards let's use the ANSI SQL concept where the null state of a Boolean is called UNKNOWN. You'll have to forgive me for using the term NULL this far, but in my defense the concept of NULL column-values and UNKNOWN boolean expression results are pretty intertwingled.
A WHERE clause whose condition resolves to UNKNOWN excludes the row, which is the same result a condition resolving to FALSE produces. So the WHERE clause is treating UNKNOWN/NULL as equivalent to false. The rest of the Boolean algebra is not.
SQL likes to have it both ways. Sometimes UNKNOWN/NULL is equivalent to false sometimes it's not.
It does this because UNKNOWN/NULL Booleans are incredibly inconvenient and tedious and sometimes they'd rather not think about them.
I'd like to do that as well, but this hoary old language won't give me the same permission it gives itself.
I don't think any databases treat `NULL` as `FALSE` in the WHERE clause. `SELECT * FROM foo WHERE bar = NULL` doesn't return rows with a NULL in the bar column. `SELECT * FROM foo WHERE bar != NULL` doesn't return rows without NULL in the bar column. `SELECT * FROM foo WHERE (bar = 'a') = NULL;` doesn't return rows where bar is not equal to `a`[1]. As far as I know every DB treats NULL as what it is, an unknown value.
It also doesn't to my mind violate the reflexive property because NULL is not equal to anything. It is a marker for an unknown value, not a value in and of itself. If you have a database of every person in a room and what color shirt they're wearing, and in your database, Alice and Bob both have NULL in their "shirt_color" column, that does not mean that Alice and Bob have the same color shirt. Nor does it mean that they don't have the same color shirt. Nor does it mean that someone with a green colored shirt has the same color shirt as Bob or Alice. It doesn't mean they don't have a shirt either. It means you don't have a record of/don't know what color their shirts are. You can't violate the reflexive property because you can't say what color shirt they have. You're not doing `X = X -> false`, you're doing `X = UNKNOWN -> UNKNOWN`
[1]: https://www.db-fiddle.com/f/iVDDRJos1pUqxnuy1jTEEe/0
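The shirt-color reasoning plays out the same way in an actual engine; a sketch with an in-memory SQLite database (table and names made up to match the example):

```python
import sqlite3

# Alice and Bob both have an unrecorded shirt color; Carol's is known.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, shirt_color TEXT)")
con.executemany("INSERT INTO people VALUES (?, ?)",
                [("Alice", None), ("Bob", None), ("Carol", "green")])

# Self-join looking for pairs with provably equal shirt colors:
# NULL = NULL is UNKNOWN, so no pair involving a NULL survives the filter.
same = con.execute("""
    SELECT a.name, b.name FROM people a JOIN people b
    ON a.name < b.name AND a.shirt_color = b.shirt_color
""").fetchall()
print(same)  # [] -- neither Alice=Bob nor Alice=Carol can be affirmed
```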
1 <> NULL => Boolean UNKNOWN,
so SELECT * FROM foo WHERE 1 <> NULL returns nothing.
1 = NULL => Boolean UNKNOWN,
so SELECT * FROM foo WHERE 1 = NULL returns nothing.
That's the thing that's being treated as FALSE. That UNKNOWN. Not the value of NULL itself.
> You're not doing `X = X -> false`, you're doing `X = UNKNOWN -> UNKNOWN`
That's not how "=" works. If you want a relationship for testing equality that handles unknown, don't call it equality.
Basic properties of equality, from Wikipedia
https://en.wikipedia.org/wiki/Equality_(mathematics)
- Reflexivity: for every a, one has a = a.
- Symmetry: for every a and b, if a = b, then b = a.
- Transitivity: for every a, b, and c, if a = b and b = c, then a = c.
edit:
We can also see the incoherence of this concept when we look at set theory.
Because UNKNOWN booleans are neither true nor false, if you use them in a WHERE clause you get the ugly result that the set of rows matching a condition, together with the set matching its negation, is not everything.

I think the difference between how we're looking at this is for me there is no "value" of NULL. NULL has NO value, which is why you can't do `1 = NULL` or `1 <> NULL` and have to specifically use `1 IS NOT NULL` or `1 IS DISTINCT FROM NULL`
> That's not how "=" works. If you want a relationship for testing equality [that] handles unknown, don't call it equality.
> Basic properties of equality, from Wikipedia
> https://en.wikipedia.org/wiki/Equality_(mathematics)
> - Reflexivity: for every a, one has a = a.
> - Symmetry: for every a and b, if a = b, then b = a.
> - Transitivity: for every a, b, and c, if a = b and b = c, then a = c.
Sure, that's all well and good, but equality only can work IF you know what the values are on either side. NULL means you don't know what a value is and therefore you can't answer whether or not a = NULL because NULL isn't known yet.
Or let me put it another way: is the launch price of the iPhone 17 != $799? The answer is neither true nor false, because the launch price of the iPhone 17 is unknown. We can make reasonable guesses, but if I give you a database with entries for iPhones from the first model and placeholder rows for the 17, 18, 19 and 20 models, none of them belong in the list when someone asks "which iPhones cost $799 at launch?" But it is equally true that none of them belong in the list when someone asks "which iPhones did not cost $799 at launch?"
No, it's being treated as UNKNOWN, and the semantics of SELECT...WHERE only returns rows where the value of the condition is TRUE.
I think you need to look into https://en.wikipedia.org/wiki/Negation_as_failure
A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.
A comparison evaluates to a truth value, and usually when a comparison involves a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` return nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).
What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.
And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html
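One divergence from that SQLite table is easy to reproduce directly: UNIQUE treats NULLs as distinct while SELECT DISTINCT does not. A sketch against in-memory SQLite (most engines behave this way here, though not all):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER UNIQUE)")

# UNIQUE treats NULLs as distinct from each other: two NULLs coexist
# in a unique column without a constraint violation...
con.executemany("INSERT INTO t VALUES (?)", [(None,), (None,), (1,)])

# ...but SELECT DISTINCT collapses them into a single group.
rows = con.execute("SELECT DISTINCT x FROM t").fetchall()
print(len(rows))  # 2 -- one collapsed NULL row plus the 1
```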
Nevertheless I find it a useful intuition pump. I wager that most people reading `UNKNOWN = UNKNOWN` or `UNKNOWN <> UNKNOWN` and thinking about the examples above would stop and say, "Wait, I actually don't know the value of that statement for sure either, since the LHS and the RHS could be completely different things," and would then double check what their SQL dialect would actually do in this situation.
That leads to an even more confusing point: some systems (at least SQLite) overload NULL to mean UNKNOWN, for example `SELECT 1 WHERE NULL = (NULL = NULL)`.
And the dangerous thing about NULLs is not when they are explicitly used in the query as a literal (as I did for brevity), but when they appear in tables. It's perfectly reasonable to assume `SELECT COUNT(*) FROM t;` should be the same as `SELECT COUNT(*) FROM t WHERE t.x = t.x OR t.x <> t.x`, but they are not, because the latter does not return NULL rows. This has led to real query optimizer bugs. For more examples see this paper: https://dl.acm.org/doi/10.14778/3551793.3551818
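The COUNT(*) trap above is reproducible in a few lines (SQLite shown; the linked paper catalogues optimizer bugs of exactly this shape):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

n_all = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# For the NULL row, x = x and x <> x are both UNKNOWN, and
# UNKNOWN OR UNKNOWN is still UNKNOWN, so that row is filtered out.
n_filtered = con.execute(
    "SELECT COUNT(*) FROM t WHERE x = x OR x <> x").fetchone()[0]

print(n_all, n_filtered)  # 3 2
```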
This is the mental model that I use and teach: a NULL is a virus that will likely make everything it touches NULL... unless cordoned off by a COALESCE or IFNULL or something else that explicitly speaks about its NULL-sensitive behavior in its documentation.
So if you see a bug where a WHERE or FILTER clause is unexpectedly failing, and you're in a time crunch, rather than pounding your head against the wall, start sprinkling COALESCE statements around (while being thoughtful about the domain meaning of what you're doing) and it's more likely than not to solve your problem!
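A sketch of that COALESCE "cordon" in SQLite (the `orders` table and the 0.0 default are invented for illustration; whether a NULL discount really means "no discount" is exactly the domain question you have to answer before sprinkling):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, discount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 0.1), (2, None), (3, 0.0)])

# The NULL row fails 'discount = 0.0' and 'discount <> 0.0' alike.
strict = con.execute(
    "SELECT id FROM orders WHERE discount = 0.0").fetchall()
print(strict)  # [(3,)]

# COALESCE cordons off the NULL by mapping it to an explicit default,
# so the row with the missing discount is back in play.
coalesced = con.execute(
    "SELECT id FROM orders WHERE COALESCE(discount, 0.0) = 0.0").fetchall()
print(sorted(coalesced))  # [(2,), (3,)]
```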
Agreed with all of this, it would probably have been better if they were named `unknown` instead of reusing the `null` keyword.
Note also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.
0: https://www.postgresql.org/docs/15/sql-createtable.html
Most of the time it's not that important, and people can and do shortcut "null" to mean "not present", but then the issues with using null in equality statements are a result of taking the shortcut, not necessarily of the logic around null.
(Of course, this is rather awkward in practice, and when NULL is there, it's inevitably going to be used for that instead.)
Null has tons of problems, of course. But it's patently absurd to claim that you "ought to be" making a different 1-column table with a unique foreign key or shared primary key for every column that might be absent, because of some vague appeal to the fact that you can write it as a 1:N relation with N<=1. You can just as easily claim that every non-nullable column is a 1:N relation where N==1 and "should be encoded as such". It is encoded as such! That's what a column is!
It’s not.
That said, as someone self-taught in SQL, I agree NULL was not a good choice.
Replacing NULL with UNKNOWN and the third boolean value as INDETERMINATE for example would have been better.
[1]: https://stackoverflow.com/a/79270181
[2]: https://learn.microsoft.com/en-us/sql/t-sql/language-element...
[3]: https://www.postgresql.org/docs/current/functions-comparison...
So NULL can capture multiple distinct concepts: unknown values (say, as a result of a calculation), not applicable (where the data has been deliberately left out), and missing data (possibly due to instrumentation or user error).
If you do it this way you can avoid some confusion as to what NULL represents.
John Doe NULL IS EQUAL John Doe NULL
John Doe NULL NOT EQUAL John Doe UNKNOWN
John Doe UNKNOWN NOT EQUAL John Doe UNKNOWN
Determining if any particular input is NULL or UNKNOWN is a tricky problem, but at least this gets the programmer thinking about it up front and structuring their code to behave in the sanest possible manner.
You could imagine a world where instead of:
SELECT ( email, name, outstanding_balance ) FROM accounts WHERE outstanding_balance > 0 AND last_payment > 60 ORDER BY name
the queries looked more like:
accounts: outstanding_balance > 0 & last_payment > 60 => email, ^name, outstanding_balance
There were plenty of contemporary languages that looked like the latter, but few survive to this day. SQL's relative verbosity was not seen as a problem then and is definitely not one today.
Besides, if the verbosity was a problem they could easily shorten it to UNK. That would have been perfectly normal.
I also don't use UNIQUE constraints, so maybe that has something to do with it.
For example I learned the hard way that the DB we use at work does not index NULL values.
And once in a while if I'm tired or stressed I might forget about UNKNOWN and thus that "Col <> 42" does not return rows where Col is NULL.
Not that better naming would prevent such surprises, but I still think the current naming is less than optimal from a pedagogical perspective. At least I see this at times when teaching our support folks SQL (many have domain background and not a technical background).
In practice, it behaves as "unknown" in some contexts, as "missing value" in other contexts, and sometimes it's just plain WTF like SUM() returning NULL rather than 0 if there are no rows.
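The SUM() surprise takes one line to demonstrate (SQLite shown; its nonstandard TOTAL() aggregate exists precisely to paper over this):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")  # deliberately left empty

# SUM over zero rows is NULL, not 0; COALESCE restores the expected 0.
s_null = con.execute("SELECT SUM(x) FROM t").fetchone()[0]
s_zero = con.execute("SELECT COALESCE(SUM(x), 0) FROM t").fetchone()[0]
print(s_null, s_zero)  # None 0
```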
Going by Wikipedia, I see that SQL is from 1974 and C from 1972. Were there earlier uses/drafts where `null` is "unknown" instead of "unset"?
C. A. R. Hoare, 1965-11:
"In order to enable references to represent partial functional relationships, i.e. ones which do not necessarily yield a value, a special reference value null is introduced. This value fails to refer to a record, and any attempt to use it to refer to a record leads to an undefined result."
In his proposal, Hoare has followed the COBOL 60 terminology for some of the additions to ALGOL, i.e. "record" instead of the word "structure", introduced by IBM PL/I in 1964-12, and "record class" instead of "structure type", hence the "class" of SIMULA 67, from where the word "class" has spread into all OOP languages.
In Hoare's proposal, references a.k.a. pointers could point only to values belonging to user-defined types, i.e. records a.k.a. structures, not to primitive types.
E.F. Codd added nulls to relational model in 1970 so that does pre-date C. The concept is even older than that I imagine.
If there might actually not be such a value, you're supposed to change your schema to reflect that.
In all my years, I've never used null for that. If I don't have a value yet then generally I'm not writing any part of the record. I only ever use nulls for the absence of a value. Creating a new table for every potential optional column is the "academic relational way" and also insane. :)
What you're describing is closer to how people do it in practice.
But then if your sensor fails to record a measurement don't you end up with NULL for that row's temperature_id?
LogEntry(LogEntryId, Date)
Temperature(TemperatureId, LogEntryId, DegreesF)
If there is no temperature measured, then you don't create a record in Temperature.
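Sketching that two-table design in SQLite shows the trade-off: the missing measurement is stored as an absent row, but it comes back as a NULL the moment you LEFT JOIN the tables together again (dates and values invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE LogEntry (LogEntryId INTEGER PRIMARY KEY, Date TEXT)")
con.execute("""CREATE TABLE Temperature (
    TemperatureId INTEGER PRIMARY KEY, LogEntryId INTEGER, DegreesF REAL)""")

con.executemany("INSERT INTO LogEntry VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-01-02")])
# The sensor failed on day 2, so there is simply no Temperature row for it.
con.execute("INSERT INTO Temperature VALUES (1, 1, 41.5)")

rows = con.execute("""
    SELECT le.Date, t.DegreesF FROM LogEntry le
    LEFT JOIN Temperature t ON t.LogEntryId = le.LogEntryId
""").fetchall()
print(rows)  # [('2024-01-01', 41.5), ('2024-01-02', None)]
```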
I mean sure, you could do yet another table. But honestly that level of normalization is much more work than it's worth. Just because it's some academic definition doesn't make it right.
> But honestly that level of normalization is much more work than it's worth
Yes. I question whether it's worth anything to begin with.
But there is no stored null representing 'no value'.
I wonder who first added that macro? Was it there from the beginning?
Just random thoughts...
Why would anyone want to use another database?
Maybe it is a driver-specific issue? I have used Python/Java, and haven't updated any of my code or dependencies because of a major Postgres update
Personally I think upgrades are the one thing MySQL has on Postgres at this point.
[1] https://www.postgresql.org/docs/current/pgupgrade.html
Usually it takes almost no time, although it might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0, when Oracle completely rewrote the data format (making things like atomic DDL possible).
Of course, lots of people also think it's a feature. But those aren't very vocal. Anyway, this is a main reason why people keep using old postgres versions, they have to think about upgrading, and they often don't.
I'd posit that only a tiny fraction of PostgreSQL uses have to know or care that vacuuming is a thing because the autovacuum default handle it for them.
For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with postgres than with MySQL. Postgres also has a lot of benefits obviously, though.
It's also much easier to tune, most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.
* use of a cloud provider that favours another database (like SQL server)
* some people claim mysql/maria is faster for them, simpler, or has a better support for replication
* use of sqlite for lightweight or portable apps
* domain specific databases that still use sql as their query language
* someone may want to use another database for fun or to learn something new
Which is why I continue to be such an ORM skeptic. I agree that they're convenient. But I do worry that we've now got an entire generation of engineers who regularly interact with relational databases, but have largely been spared the effort of learning how they actually work.
As another commenter pointed out, if you've learned basic relational algebra then the way SQL nulls behave seems obvious and logically consistent. The logic is the same as the logic behind the comparison rules for NaN in IEEE floats. It's the behavior of C-style nulls that is, always and forever, a billion-dollar mistake.
That's an important one. It would be super nice to have a SQL dialect that works more like LINQ where you can compose your queries easily. I always hate it when I have to write SQL directly. It's super powerful but the syntax just isn't designed well. To me it feels like a throwback to the good old FORTRAN or COBOL days: you can get stuff done but modern languages are so much better.
But this has never been their primary purpose and it's not what they are good at. ORMs are supposed to map the relational model into an object oriented model so that you can work with objects rather than sets of tuples. And that's exactly how people use them.
ORMs incentivise people to replace simple and declarative set operations with complex procedural code operating on individual objects.
ORMs are just a terrible idea - conceptually messy, hard to debug and optimise, full of needless complexity.
Entity Framework did try to cater to the "SQL is scary, let me use objects" crowd, and that is the majority of how it's used, and that is a mistake in my opinion. But it is also very good at supporting relational algebra within C# and composing queries dynamically; ironically, it's best at it if you disable or avoid many of its features (dynamic subclassing, linked objects). Pass IQueryables around and compose them together and life is good. Updating needs work, but updates have always been a weakness for SQL too.
and that's why ORMs are so unpopular and entirely absent from successful production applications for the past 30 years
This is true for an HTTP library as much as it is an ORM.
No, it is only true for bad abstractions.
Stop thinking of ORMs as trying to hide the details of SQL and you'll stop hating them. Instead think of them as a way to compose SQL dynamically, with the full power of your language. SQL is an awful language to write application logic in, because it has horrible support for abstraction, composition, encapsulation, dependency injection, etc. The ORM gives you a way to produce SQL in an environment that actually supports basic software engineering principles. Scattering ORM logic everywhere in the codebase is the point: putting all your SQL in one data access layer is like putting all your arithmetic in one calculation layer. Why would you ever do that? What's wrong with seeing a plus sign in more than one file? What's wrong with seeing language-encoded relational logic in more than one file?
I can guarantee you the popularity is not "aesthetic". And convenience is a real thing that actually does reduce costs. People complain about ORMs, but have you seen the absolute horse-shit-level code that people jam into SQL functions and procedures to do the utterly most basic things? The standard for what ends up in SQL Stored procedures is the most unmaintainable garbage in the entire software engineering ecosystem.
Django seems to be an outlier on the side of "an ORM that's actually good". Whenever people have specific technical complaints about ORMs, it's generally not a problem or there's already a solution in Django you just have to use. It's usually only when you get to the more conceptual stuff like object-relational mismatch that such complaints tend to apply to Django.
The reason for a data layer is because the underlying data representation might change. For example, you might change the schema to handle some new performance requirement. Now you have to hunt down and change everywhere you've queried that table in your whole code base. Every time you directly call SQL you are coupling your code strongly to the database schema which is ultimately an implementation detail.
> And convenience is a real thing that actually does reduce costs.
I know convenience is a real thing. I also know that it very often increases costs in the long run.
Well you certainly can't do it all in SQL, because SQL doesn't support basic software engineering principles at all. That means you're picking a parent language that does support the principles you want, and attempting to build a library that allows you to build, pass around, and compose relational queries in a way that leverages the benefits of the parent language. To do this I'd argue you need these things:
1. a structured representation of the (possibly incomplete) relational query as it's being built, e.g. SQLAlchemy's internal representation or IQueryable in C#
2. a method of building those representations in a way that conforms to the norms of the parent language and leverages its benefits as much as possible, e.g. LINQ or direct use of the various extension methods on IQueryable. It seems like you could pick either SQLAlchemy or SQLAlchemy Core depending on whether you want to think in terms of objects or queries (I'd usually pick queries), but I'm not that familiar with SQLAlchemy.
3. a method of translating the query representation into a particular SQL dialect.
I don't know what exactly your definition of "ORM" is, but I'd argue that if any of those steps involve an object-level representation of the schema of your tables (e.g. "class Employee { int ID }" type of thing) then it's an ORM. Do you need that? Well, no, probably not, but why not? C# is strongly typed, so why wouldn't I tell its type system what the schema of my table is? That's part of leveraging the benefits of the parent language.
> The reason for a data layer is because the underlying data representation might change. For example, you might change the schema to handle some new performance requirement. Now you have to hunt down and change everywhere you've queried that table in your whole code base.
This is an argument for strong typing, not for putting all your code that has the word "select" in it in the same file. And building queries with a strongly typed representation of your schema in the parent language is the point of the ORM!
Without an ORM, you still have to hunt down tons of spots where things change when you change your schema. How do you ever not have to do this?
> Every time you directly call SQL you are coupling your code strongly to the database schema which is ultimately an implementation detail.
There exists some canonical relational model of your domain. This model is not an implementation detail, it is the foundation of most of your code. I choose to keep my SQL schema as close as possible to this canonical relational model as I can. The fact that a Monitoring Instrument has an Operating Status is a fundamental fact about my model; whether that column is called "operating_status" or "OperatingStatus" or "operating_status_ID" is an implementation detail. It's just hard to agree with you that "the database schema" is merely implementation detail -- clearly parts of it are not, unless it's completely untethered from your canonical domain model (which would be a nightmare, of course). Of course your code is strongly coupled to the fundamental domain model upon which it is operating. I'd certainly agree that it'd be nice to not have to worry about implementation details like the exact naming convention of the columns -- which, of course, ORMs let you do.
>> And convenience is a real thing that actually does reduce costs.
> I know convenience is a real thing. I also know that it very often increases costs in the long run.
Code that is easier to write is generally also easier to read and maintain. That's what I mean when I say "convenience". Simplicity and convenience go together.
This depends on the context, no? I doubt there exists someone with a contact list on their phone which has every single field for every single contact populated.
There needs to be some way to codify that a field in a data record is unpopulated. Using the "zero value" for the type of the field (e.g., the empty string) is reasonable, but is this necessarily better than NULL? I reckon an argument can be made that this approach is just as likely to lead to bugs.
I'm not necessarily in favor of NULL, for what it's worth, but I can't think of an adequate replacement which doesn't reduce to "NULL in sheep's clothing".
This attitude was so prevalent at the time, I sometimes wonder if the rise of noSQL was simply people sick of dealing with Oracle DBAs
That was definitely one part; another part was sharp corners in MySQL (at least as of 20 years ago; I would be surprised if many of them haven't been rounded off in the meantime). The last part was places with no DBA with developers unaware of how to handle schema migrations.
The value of NULL in a particular table cell is simply a way to indicate 'no value'. If you want the values in a column to be unique, cases where there are no values shouldn't be considered.
This plays out similarly in practice. For example, you may want to allow users to optionally reserve a username, and if they do, those usernames should be unique. It's hard to imagine a use case where by wanting a field to be both optional (nullable) and unique, you mean that the field should be optional for a single record (!) and required for all the rest. Of course, you mean that IF there is a value, THEN it should be unique.
The princess eyes are as blue as the stone of Galveston
Have you seen the princess eyes?
No!
Have you seen the blue stone of Galveston?
No!
So you're comparing something you've never seen with something else you've never seen!
That's NULL comparison
[1] https://www.amazon.com/Database-Technology-Nulls-Considered-...
More importantly, x = value instead of (x = value and x IS NOT NULL) is almost always a mistake, and a stupidly subtle one at that. And for this curse, we get… nothing particularly useful from these semantics.
Also the x != NULL case is completely cursed
I will die on the hill that regular C-like nulls are the actual thing that's weird.
The real billion dollar mistake [1] was the damage it made on the minds of developers.
[1] https://en.wikipedia.org/wiki/Tony_Hoare
Most languages nowadays do get nulls right, even PHP of all things.
Ironically NULL is probably the safest pointer value in C, as any dereferences from it (and thousands of pages that follow it) are guaranteed to crash.
That's a waste of a page.
Unless you're talking about some kind of device which supports virtual memory, but also by default sets up mappings, including at 0 which seems weird to me.
What's interesting is, do we mean that in our data that attribute has no value? Or do we mean the real-world object represented by the data does not have that attribute?
Does null mean
a) We don't know the value of this attribute for this object, or
b) We do know that there is no value for this attribute in the real-world object represented by our data.
In JavaScript because there is both null and undefined it is easy to assume that undefined means we don't know the value and null means we do know it has no value.
EXAMPLE: The attribute 'spouse'. Some people have a spouse some don't. So what does it mean if the value of the field 'spouse' is null? That we know there is no spouse, or that we don't know who the spouse is if any.
In practical terms we can say null means "We don't know" which includes the case that there is no spouse.
JavaScript objects have two kinds of undefined that are both represented by the same value: a key explicitly set to undefined, and a key that is absent entirely. You have to use another method to see which it is, and I've seen the absent key ("foo" in this example) used for the same thing as "null" in your example.
This is something you have to account for, because the key is still there if you try to remove a value by just setting it to undefined; the right way to remove the key is to delete it.
https://stackoverflow.com/questions/203493/why-does-oracle-9...
That was a fun bug to find out, after having dealt with quite a few other DBs over the years. It was one of those "No, but surely" and "This can't be! This is Oracle!" moments. Found it while porting some old code that needed to store an empty string as being distinct from a NULL in that same column.
Null is more confusing because it means different things in different languages. Sometimes it's a more constrained uncertainty, eg this definitely doesn't exist. But in sql it's a less constrained uncertainty, like "undefined" in math. The value of this thing couldn't make sense in this context, but we can't make assertions about its existence.
E. F. Codd thought about this issue.[0]
> Codd indicated in his 1990 book The Relational Model for Database Management, Version 2 that the single Null mandated by the SQL standard was inadequate, and should be replaced by two separate Null-type markers to indicate why data is missing. In Codd's book, these two Null-type markers are referred to as 'A-Values' and 'I-Values', representing 'Missing But Applicable' and 'Missing But Inapplicable', respectively. Codd's recommendation would have required SQL's logic system be expanded to accommodate a four-valued logic system. Because of this additional complexity, the idea of multiple Nulls with different definitions has not gained widespread acceptance in the database practitioners' domain. It remains an active field of research though, with numerous papers still being published.
[0] https://en.wikipedia.org/wiki/Null_(SQL)
If your signup form has an optional field for middle name which I don’t fill, it can absolutely be because I don’t have a middle name. It’s undefined and known to be so.
> E. F. Codd thought about this issue.[0]
And because four-valued logic was left out, nulls have to fulfil multiple incompatible roles, and end up being weird.
The fact that some NULLs are indeed undefined doesn’t contradict my point that it would be inaccurate to refer to all NULLs as UNDEFINED.
> It’s undefined and known to be so… And because four value logic was left out, nulls have to fulfil multiple incompatible roles, and ends up being weird.
You know it to be undefined, but if you just leave that field blank on the form, the database still can’t know whether that’s because you don’t have one or because you didn’t provide the one you have, unless there was a field on the form that allowed you to explicitly indicate that you don’t have one.
There's the Closed World Assumption in a database.
I think we would have been better-off by treating FKs (and maybe outer JOINs) as a special case, and using 2-value logic everywhere else.
If you don't like null semantics, you're free to use sentinel values. You can make all the sentinel values the same, or you can make them all different. Either way, you or someone who has to use your system will be back here tomorrow complaining about how weird it is.
- col1 = 1 should not return NULLs
- !(col1 = 1) should return NULLs
- col1 <> 1 should not return NULLs
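For what it's worth, actual SQL engines disagree with the second wish: NOT (UNKNOWN) is still UNKNOWN, so the NULL row is filtered out in all three cases. A quick check with SQLite via Python's stdlib sqlite3 module (table `t` is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col1 INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# Each predicate yields UNKNOWN for the NULL row, and WHERE keeps only
# rows where the predicate is TRUE -- so the NULL row is excluded in
# all three cases, including NOT (col1 = 1).
for pred in ["col1 = 1", "NOT (col1 = 1)", "col1 <> 1"]:
    rows = con.execute(f"SELECT col1 FROM t WHERE {pred}").fetchall()
    print(pred, "->", rows)
```

To get the NULL row back you have to ask for it explicitly with `col1 IS NULL`.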
Too many newbies hear that NULL is bad, so they declare all columns as NOT NULL and end up inserting ad hoc values like 0, -1, '', or {} when they inevitably come across cases where they don't have data. Which is even worse than NULL.
(Although in rare cases that is even weirder: https://stackoverflow.com/a/58998043 )
`1 is not distinct from NULL` => false
`NULL is not distinct from NULL` => true
`0 is not distinct from 1` => false
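These can be checked directly; SQLite spells IS NOT DISTINCT FROM as plain IS, a null-safe comparison that always yields true or false, never NULL (a quick sketch using the stdlib sqlite3 module):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# IS (SQLite's IS NOT DISTINCT FROM) never returns NULL:
print(con.execute("SELECT 1 IS NULL").fetchone()[0])     # 0 (false)
print(con.execute("SELECT NULL IS NULL").fetchone()[0])  # 1 (true)
print(con.execute("SELECT 0 IS 1").fetchone()[0])        # 0 (false)

# Ordinary = returns NULL when either side is NULL:
print(con.execute("SELECT NULL = NULL").fetchone()[0])   # None
```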
Since the data we were getting was sourced from an RDBMS, I wanted NULL to be a first class concept in the DSL, with similar traits.
Early on, I simply made any expression that involved a NULL result in NULL. Naively this was all well and good, but it failed spectacularly in condition statements.
Instead of `A = NULL` evaluating to false, I had `A = NULL` evaluate to NULL. And, as you can imagine, a single NULL in an expression would just pollute the entire thing; since NULL was treated as FALSE in conditionals, any NULL in an expression eventually made the whole thing FALSE.
Naturally I went back and made the comparison operators always return booleans. But it was a fun little side effect at the time.
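The trap is easy to reproduce. This is a toy sketch (not the author's DSL) of the two designs: a comparison that propagates NULL (modeled as Python's None) versus one that always returns a plain boolean.

```python
def eq_propagating(a, b):
    # Early design: any NULL operand makes the comparison NULL.
    if a is None or b is None:
        return None
    return a == b

def eq_boolean(a, b):
    # Later design: comparisons always return True/False,
    # so A = NULL is simply False.
    return a == b and a is not None

def kleene_not(x):
    # Three-valued NOT: NOT UNKNOWN is still UNKNOWN.
    return None if x is None else (not x)

def where(pred):
    # Conditionals treat NULL as false.
    return pred is True

# With the propagating design, a test and its negation BOTH come out
# NULL, so with NULL-as-false both branches fail:
p = eq_propagating(5, None)
print(where(p), where(kleene_not(p)))  # False False

# The boolean design at least gives a definite answer:
print(where(eq_boolean(5, None)))  # False
```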
Or, the user must define a default value in the query itself.
Yes, tedious; but, precise and forces the programmer to really prepare for the "unknown" scenario.
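In standard SQL that "default in the query itself" is typically spelled COALESCE. A minimal illustration with SQLite (table `orders` is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (qty INTEGER)")
con.executemany("INSERT INTO orders VALUES (?)", [(3,), (None,)])

# COALESCE substitutes the default at query time, so the caller
# has explicitly decided what "unknown" means here:
rows = con.execute("SELECT COALESCE(qty, 0) FROM orders").fetchall()
print(rows)  # [(3,), (0,)]
```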
And lastly, undefined != null: undefined relates to structures, indicating that a field was not defined when the structure was created.
https://blog.rustprooflabs.com/2022/07/postgres-15-unique-im...
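That link covers Postgres 15's new UNIQUE NULLS NOT DISTINCT option. The default behavior it changes (the same puzzle as in the pull quote) is easy to reproduce even in SQLite; table `u` is made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE u (x INTEGER UNIQUE)")

# Two NULLs do not violate UNIQUE: for constraints they count as distinct.
con.execute("INSERT INTO u VALUES (NULL)")
con.execute("INSERT INTO u VALUES (NULL)")

# ...yet SELECT DISTINCT collapses them into a single row.
print(con.execute("SELECT DISTINCT x FROM u").fetchall())  # [(None,)]
```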
Practically speaking, I go with NOT NULL, and always set a default value.
Imo, SQL should add "NOTHING", add "UNKNOWN" as a synonym for "NULL", and deprecate "NULL".
https://dbfiddle.uk/C5JqMP8O
edit: an empty string, false, 0 are all values.
Personally I wish more languages were like Python or Ruby and had chosen None or Nil over Null, which alleviates the confusion a bit: those names better indicate that it's NOT an "unknown" (where 1 unknown value != 1 other unknown value, which intuitively makes sense). In Ruby or Python it's more obvious that None and Nil are "nothing" types, and therefore equivalence makes sense (nil == nil and None == None are both true).
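The contrast is visible side by side: Python's None is a "nothing" value that equals itself, while SQL's NULL is an "unknown" whose self-comparison is itself unknown.

```python
import sqlite3

# Python: None is a definite "nothing" value, so it equals itself.
print(None == None)  # True

# SQL: NULL is "unknown", so NULL = NULL is itself unknown (NULL).
con = sqlite3.connect(":memory:")
print(con.execute("SELECT NULL = NULL").fetchone()[0])  # None, i.e. NULL
```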
NULLs still have their (rare) place, but the foremost issue with query results is that they are tabular rather than hierarchical. The main culprits being (1) outer joins that represent or induce nonsensical operations and (2) lack of non-null “zero” values for types like date. Of course hierarchies can make querying more complex, but mostly in cases where the relational logic goes crazy itself and you had to go tabular anyway.
If you think of it, distinct, group by and windowing feel like workarounds in tabular mode but would be natural to hierarchies, because everything is naturally distinct and grouped-by by design and windows are basically subtables in these rows.
Bonus points: you could fetch “SELECT FROM a, b_rows LEFT JOIN b AS b_rows …” in a single query without duplicating `a`s and nullifying `b`s when N <> 1. And when you aggregate through a column in `b`, there’s no headache about what to do with join-produced NULLs (unless `b` columns are nullable by your design, then it’s on you). And when it all arrives at a client, it’s already well-shaped for UI, processing, etc. No more:
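(The snippet appears to have been lost here; presumably it's the familiar client-side regrouping loop over flat LEFT JOIN rows, something like the following sketch, where the `(a_id, a_name, b_value)` row layout is hypothetical.)

```python
# Collapse duplicated parent rows from "a LEFT JOIN b" back into one
# object per `a` with a list of its `b` rows.
rows = [
    (1, "first", "x"),
    (1, "first", "y"),
    (2, "second", None),  # LEFT JOIN produced a NULL b side
]

result = {}
for a_id, a_name, b_value in rows:
    if a_id not in result:
        result[a_id] = {"name": a_name, "b_rows": []}
    if b_value is not None:  # skip the join-produced NULL
        result[a_id]["b_rows"].append(b_value)

print(result)
# {1: {'name': 'first', 'b_rows': ['x', 'y']}, 2: {'name': 'second', 'b_rows': []}}
```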
I’m pretty sure you recognize this programming idiom immediately. Before you criticize, I’m not talking about hierarchical/OO tables, only about ways of getting and handling query results. You still can reshape a relation like you want. The difference is that a database engine doesn’t have to put it all onto a (N x M x …) table and instead creates a sort of subset of relations, which is efficient in space and natural to walk through. It already does that when walking through indexes; selects are naturally hierarchical. All it has to do is track the relations it went through rather than just dumping rows from a set of cursors that it knows the start and end points of, but loses this knowledge by writing into a plain table.
https://modern-sql.com/concept/null
(Note: I am not affiliated with that blog/website in any way, shape, or form.)
Partially, we use ... and I'll be very broad here ... "variables" as boxes we look in for answers. Answers to questions, answers to "Hey I put something in there for the time being to reference later." If I went into programming terms rather than just SQL, sometimes we get meta-answers.
You haven't made the box yet (declared the variable).
You haven't decided how the box is structured (picked a type or a length or something).
Okay, you did those but the box is virgin (nothing has been placed in the box yet).
That kind of thing. An empty set for "yes, you asked but nothing meets those criteria."
But let's continue the logic of deleted_at being NULL indicating an active account, which seems to be the intent here. You end up doing things like:
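(The query seems to have been lost from the comment; presumably it's every read path having to carry the extra predicate. A sketch with SQLite, with a made-up `users` table:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT, deleted_at TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?, ?)",
                [(1, "alice", None), (2, "bob", "2024-01-01")])

# Every query over "active" users must remember the extra predicate:
active = con.execute(
    "SELECT name FROM users WHERE deleted_at IS NULL"
).fetchall()
print(active)  # [('alice',)]

# Forget it once, and soft-deleted rows leak back out:
leaky = con.execute("SELECT name FROM users").fetchall()
print(leaky)  # [('alice',), ('bob',)]
```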
Depending on your database, that may or may not index well. More problematic, you may end up with privacy leaks if someone forgets the last conditional. If anything, you want to reverse this so someone has to go out of their way to explicitly select deleted accounts. There are multiple strategies for this, e.g. using an active_accounts view or table.
Lastly, there are lots of potential reasons for an account to be disabled or otherwise not visible/accessible. Takedowns, court orders, site safety, hacked accounts and so on.
Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
Then don't do that. It's kind of a leap to say soft deletes are categorically bad because someone might confuse "deleted" with "inactive". My users table does the super-advanced thing of having both columns. The ORM also doesn't forget to add the not-null criterion. There are also zero databases in active use where it poses a problem for indexing.
Soft deletes suck in their own way, but none of the alternatives are perfect either.
Overloading timestamps to carry a boolean on null is awesome as long as you decide that's what you're doing and use one of the several standard techniques to dodge the easily avoided potential downside.
This isn't a valid security concern, any more than any other incorrect SQL query would be. A dev can always write a bad query; you need another way to address that, and it's not more likely here because of the NULL.