Log level 'error' should mean that something needs to be fixed

(utcc.utoronto.ca)

482 points | by todsacerdoti 52 days ago

51 comments

layer8 49 days ago
> When implementing logging, it's important to distinguish between an error from the perspective of an individual operation and an error from the perspective of the overall program or system. Individual operations may well experience errors that are not error level log events for the overall program. You could say that an operation error is anything that prevents an operation from completing successfully, while a program level error is something that prevents the program as a whole from working right.
This is a nontrivial problem when using properly modularized code and libraries that perform logging. They can’t tell whether their operational error is also a program-level error, which can depend on usage context, but they still want to log the operational error themselves, in order to provide the details that aren’t accessible to higher-level code. This lower-level logging has to choose some status.
Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of a top-level failure. It also can hamper modularization, because it means you can’t repackage one program’s high-level code as a library for use by other programs, without somehow factoring out the logging code again.
- Too 49 days ago
  This is why it’s almost always wrong for library functions to log anything, even on “errors”. Pass the status up through return values or exceptions. As a library author you have no clue as to how an application might use it. Multithreading, retry loops and expected failures will turn what’s a significant event in one context into what’s not even worthy of a debug log in another. No rule without exceptions, of course: one valid case could be, for example, truly slow operations where progress reports are expected. Modern tracing telemetry with sampling can be another solution for the paranoid.
  - cogman10 49 days ago
    Depending on the language and logging framework, debug/trace logging can be acceptable in a library. But you have to be extra careful to make sure that it's ultimately a no-op.
    A common problem in Java is someone will drop a log that looks something like this `log.trace("Doing " + foo + " to " + bar);`
    The problem is that, especially in a hot loop, that throwaway string concatenation can ultimately be a performance problem, especially if `foo` or `bar` have particularly expensive `toString` functions.
    The proper way to do something like this in Java is either
```
    log.trace("Doing $1 to $2", foo, bar);
```
    or
```
    if (log.traceEnabled()) {
      log.trace("Doing " + foo + " to " + bar);
    }
```
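    For comparison, here is a minimal sketch of the same two patterns using Python's standard `logging` module. Python has no TRACE level, so DEBUG stands in for it; the `Expensive` class and the logger name are made up for illustration.

```python
import logging


class Expensive:
    """Stand-in for an object whose string conversion is costly."""

    calls = 0

    def __str__(self) -> str:
        Expensive.calls += 1
        return "expensive"


log = logging.getLogger("lazy_demo")
log.setLevel(logging.INFO)  # DEBUG is disabled for this logger

foo, bar = Expensive(), Expensive()

# Eager: the f-string is built (so __str__ runs twice) before debug() is called.
log.debug(f"Doing {foo} to {bar}")
eager_calls = Expensive.calls

# Lazy: template and arguments are passed separately, so nothing is formatted
# while DEBUG is disabled.
Expensive.calls = 0
log.debug("Doing %s to %s", foo, bar)
lazy_calls = Expensive.calls

# Guarded: the concatenation sits behind an explicit level check.
Expensive.calls = 0
if log.isEnabledFor(logging.DEBUG):
    log.debug("Doing " + str(foo) + " to " + str(bar))
guarded_calls = Expensive.calls
```

    Only the eager form pays for the string conversions while the level is off. Note that the lazy form still evaluates the argument expressions themselves; it only skips the formatting.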
    - usefulcat 48 days ago
      Ideally a logging library should at least not make it easy to make that kind of mistake.
      - Lvl999Noob 48 days ago
        This isn't really something the logging library can do. If the language provides a string interpolation mechanism then that mechanism is what the programmers will reach for first. And the library cannot know that interpolation happened because the language creates the final string before passing it in.
        If you want the builtin interpolation to become a no-op in the face of runtime log disabling, then the logging library has to be a builtin too.
        demurgos 48 days ago
        I feel like there's a parallel with SQL where you want to discourage manual interpolation. Taking inspiration from it may help: you may not fully solve it but there are some API ideas and patterns.
        A logging framework may have the equivalent of prepared statements. You may also nudge usage where the raw string API is `log.traceRaw(String rawMessage)` while the parametrized one has the nicer naming `log.trace(Template t, param1, param2)`.
        NewJazz 48 days ago
        You can have 0 parameters and the template is a string...
        demurgos 48 days ago
        The point of my message is that you should avoid the `log(string)` signature. Even if it's appealing, it's an easy perf trap.
        There are many ideas if you look at SQL libs. In my example I used a different type, but there are other solutions. Be creative.
        `logger.log(new Template("foo"))`
        `logger.log("foo", [])`
        `logger.prepare("foo").log()`
        tharkun__ 48 days ago
        And none of those solve the issue.
        You pass "foo" to Template. The Template will be instantiated before log ever sees it. You conveniently left out where the Foo string is computed from something that actually needs computation.
        Like both:
        new Template("doing X to " + thingBeingOperatedOn)
        new Template("doing " + expensiveDebugThing(thingBeingOperatedOn))
        You just complicated everything to get the same class of error.
        Heck even the existing good way of doing it, which is less complicated than your way, still isn't safe from it.
        logger("doing {}", expensiveDebugThing(thingBeingOperatedOn))
        All your examples have the same issue, both with just string concatenation and more expensive calls. You can only get around an unknowing or lazy programmer if the compiler can be smart enough to entirely skip these (JIT or not - a JIT would need to see that these calls never amount to anything and decide to skip them after a while. Not deterministically useful of course).
        demurgos 48 days ago
        Yeah, it's hard to prevent a sufficiently motivated dev from shooting themselves in the foot; but these still help.
        > You conveniently left out where the Foo string is computed from something that actually needs computation.
        I left it out because the comment I was replying to was pointing that some logs don't have params.
        For the approach using a `Template` class, the expectation would be that the docs call out why this class exists in the first place: to enable lazy computation. Doing string concatenation inside a template constructor should raise a few eyebrows when writing or reviewing code.
        I wrote `logger.log(new Template("foo"))` in my previous comment for brevity as it's merely an internet comment and not a real framework. In real code I would not even use stringy logs but structured data attached to a unique code. But since this thread discusses performance of stringy logs, I would expect log templates to be defined as statics/constants that don't contain any runtime value. You could also integrate them with metadata such as log levels, schemas, translations, codes, etc.
        Regarding args themselves, you're right that they can also be expensive to compute in the first place. You may then design the args to be passed by a callback, which would allow deferring the param computation.
        A possible example would be:
        const OPERATION_TIMEOUT = new Template("the operation $operationId timed-out after $duration seconds", {level: "error", code: "E_TIMEOUT"});
        // ...
        function handler(...) {
          // ..
          logger.emit(OPERATION_TIMEOUT, () => ({operationId: "foo", duration: someExpensiveOperationToRetrieveTheDuration()}))
        }
        This is still not perfect as you may need to compute some data before the log "just in case" you need it for the log. For example you may want to record the current time, do the operation. If the operation times out, you use the time recorded before the op to compute for how long it ran. If you did not time out and don't log, then getting the current system time is "wasted".
        All I'm saying is that `logger.log(str)` is not the only possible API; and that splitting the definition of the log from the actual "emit" is a good pattern.
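        To make the split between defining a log and emitting it concrete, here is a rough Python sketch of the same shape. `LogTemplate` and `emit` are hypothetical names invented for this illustration, not part of any real logging framework; only `logging` and `string.Template` come from the standard library.

```python
import logging
from dataclasses import dataclass
from string import Template
from typing import Callable, Mapping


@dataclass(frozen=True)
class LogTemplate:
    """Hypothetical: a log definition declared once, apart from the emit site."""

    text: Template
    level: int
    code: str


OPERATION_TIMEOUT = LogTemplate(
    text=Template("the operation $operationId timed out after $duration seconds"),
    level=logging.ERROR,
    code="E_TIMEOUT",
)


def emit(logger: logging.Logger, tmpl: LogTemplate,
         params: Callable[[], Mapping[str, object]]) -> bool:
    """Call params() only if the template's level is enabled.

    Returns True when a record was handed to the logger.
    """
    if not logger.isEnabledFor(tmpl.level):
        return False
    logger.log(tmpl.level, "%s %s", tmpl.code, tmpl.text.substitute(params()))
    return True


calls = []


def expensive_params() -> Mapping[str, object]:
    calls.append(1)  # records that the costly lookup actually ran
    return {"operationId": "foo", "duration": 30}


quiet = logging.getLogger("template_demo.quiet")
quiet.setLevel(logging.CRITICAL)  # ERROR records are disabled here
emitted_quiet = emit(quiet, OPERATION_TIMEOUT, expensive_params)
```

        With the level disabled, the callback never runs, so the expensive parameter lookup is skipped entirely. As noted above, this does nothing for values you had to compute ahead of time "just in case".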
        bluGill 48 days ago
        Unless log() is a macro of some sort that expands to if(logEnabled){internalLog(string)} - which a good optimizer will see through and not expand the string when logging is disabled.
      - lock1 48 days ago
        Ideally, but realistically, I have never heard of any major programming language that allows you to express "this function only accepts static constant string literal".
        purkka 48 days ago
        Python has LiteralString for this exact purpose. It's only on the type checker level, but type checking should be part of most modern Python workflows anyway. I've seen DB libraries use this a lot for SQL parameters.
        https://typing.python.org/en/latest/spec/literal.html#litera...
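        A small sketch of how that might look for a logging wrapper. The `trace` and `handle_login` functions are hypothetical; the annotation is written as a string and imported only under `TYPE_CHECKING` so the snippet also runs on interpreters older than 3.11, where `typing.LiteralString` does not exist.

```python
import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only the type checker needs this name; at runtime the annotation
    # below stays an unevaluated string.
    from typing import LiteralString

log = logging.getLogger("literal_demo")


def trace(template: "LiteralString", *args: object) -> None:
    """Hypothetical wrapper: the template must be a literal, values go in args."""
    log.debug(template, *args)


def handle_login(user: str) -> None:
    # Accepted: the template is a string literal and the value is passed separately.
    trace("login attempt by %s", user)
    # A checker that implements LiteralString should reject the next line,
    # because interpolating a plain `str` produces a plain `str`:
    # trace(f"login attempt by {user}")
```

        Nothing is enforced at runtime; the protection exists only if a type checker that supports `LiteralString` is part of the workflow.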
        Too 48 days ago
        Beyond LiteralString there are now also t-strings, introduced in Python 3.14, which ease writing templated strings without losing out on security. Java has something similar with string templates, a preview feature in Java 21.
        8n4vidtmkvmk 48 days ago
        We have this in c++ at Google. It's like securitytypes::StringLiteral. I don't know how it works under the hood, but it indeed only allows string literals.
        stefanfisk 48 days ago
        Even PHP has that these days via static analysis https://phpstan.org/writing-php-code/phpdoc-types#other-adva...
        MereInterest 48 days ago
        In Rust, this can almost be expressed as `arg: &'static str` to accept a reference to a string whose lifetime never ends. I say “almost” because this allows both string literals and references to static (but dynamically generated) strings.
        For Rust’s macros, a literal can be expressed as `$arg:lit`. This does allow other literals as well, such as int or float literals, but typically the generated code would only work for a string literal.
        pxx 48 days ago
        c++20 offers `consteval` to make this clear, but you can do some simple macro wizardry in c++11 to do this:
        #define foo(x) ( \
            (void)std::integral_constant<char, (x)[0]>::value, \
            foo_impl(x) \
        )
        (the re-evaluation of x doesn't matter if it compiles). You can also use a user-defined literal which has a different ergonomic problem.
        jval43 48 days ago
        Not the language, but the linter can do it. IntelliJ inspections warn you if you do it: https://www.jetbrains.com/help/inspectopedia/StringConcatena...
        zem 48 days ago
        it does seem like something a good static analysis tool should be able to catch though
    - ignoramous 48 days ago
      > The problem is, especially in a hot loop ... The proper way to do something like this in java is either log.trace(..., ...) or if (log.traceEnabled()) log.trace(...)
      The former still creates strings, for the garbage collector to mop up even when log.traceEnabled() is false, no?
      Also, even if the former or latter is implemented as:
      fn trace(log, str, args...) {
        if (!log.tracing) return;
        // ...
      }
      Most optimising JIT compilers will hoist the if-condition when log.tracing is false, anyway.
    - prithvip 48 days ago
      This is not true. Any modern Java compiler will generate identical bytecode for both. Try it yourself and see! As a programmer you do not need to worry about such details, this is what the compiler is for. Choose whatever style feels best for you.
      - Hackbraten 48 days ago
        > Any modern Java compiler will generate identical bytecode for both. Try it yourself and see!
        You may be misunderstanding something here.
        If you follow the varargs-style recommendation, then concatenation occurs in the log class.
        If you follow the guard-style recommendation, then the interpolated expressions will not be evaluated unless the log level matches.
        In the naive approach, concatenation always occurs and all expressions which are part of the interpolation will be evaluated no matter the log level.
        Could it be that you were thinking about StringBuffer vs. concatenation, an entirely unrelated problem?
    - rr808 48 days ago
      I still quite like the Windows log approach, which (if logged) stores the template as just the ID, along with the values, saving lots of storage as well, e.g. 123, foo, bar. You can concatenate in the reader.
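      A toy sketch of that idea, to show where the work moves. This is not how the Windows event log is actually encoded; the `TEMPLATES` catalog, the JSON line format, and both function names are assumptions made up for illustration.

```python
import json

# Hypothetical message catalog: template id -> template text, shipped with the reader.
TEMPLATES = {123: "Doing {} to {}"}


def write_event(template_id: int, *values: object) -> str:
    """Writer side: store only the id and the raw values, no rendered text."""
    return json.dumps([template_id, *values])


def read_event(line: str) -> str:
    """Reader side: look up the template and render it on demand."""
    template_id, *values = json.loads(line)
    return TEMPLATES[template_id].format(*values)


stored = write_event(123, "foo", "bar")   # what lands in the log
rendered = read_event(stored)             # what a viewer shows
```

      The writer never touches the template text, so the hot path is a lookup-free append; the cost of rendering is paid only for the entries someone actually reads.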
      - buggjenrmf 48 days ago
        So, it costs perf every time it’s read, instead of when it’s written (once). And of course has a lot of overhead to store metadata. Bad design. As usual.
        kubelsmieci 48 days ago
        Most logs are probably never read, but nevertheless should be written (fast) for unexpected situations when you will later need them. And logging has to be fast, with minimal performance overhead.
        rr808 43 days ago
        No, the size is a fraction of a text file, much faster to write and read. The only difference is you can't grep like text.
        just6979 47 days ago
        Except it's always written, but almost never read. Something that is fast/non-resource-intensive to write is definitionally a better design for logging.
        What metadata? The raw template? That's data in this case, data for the later rendering of logs. Yes, the template plus the params is going to be slightly bigger than a rendered string, but that's the speed/size tradeoff inherent almost everywhere. It may even keep things like the subsystem, event type, and log level separate, which trades off size (again) for speed and ease of filtering. It's all trade-offs, and to blanket declare one method (the Windows method in this case) as just bad design is only displaying your own ignorance, or bias.
    - TZubiri 48 days ago
      How about wrapping the log.trace param in a lambda and monkeypatching log.trace to take a function that returns a string, and of course pushing the conditional to the monkeypatched func.
      - 01HNNWZ0MV43FF 48 days ago
        That is why the popular `tracing` crate in Rust uses macros for logging instead of functions. If the log level is too low, it doesn't evaluate the body of the macro
        tsimionescu 48 days ago
        Does that mean the log level is a compilation parameter? Ideally, log levels shouldn't even be startup parameters; they should be changeable on the fly, at least for any server-side code. Having to restart is bad enough; having to recompile to get debug logs would be an extraordinary nightmare (not only do you need to get your customers to reproduce the issue with debug logs, you actually have to ship them new binaries, which likely implies export controls and security validations etc).
        bluGill 48 days ago
        I don't know how rust does it, but my internal C++ framework has a global static array so that we can lookup the current log level quickly, and change it at runtime as needed. It is very valuable to turn on specific debug logs at times, when someone has a problem and we want to know what some code is doing
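        Roughly the shape of that, sketched in Python rather than C++. The subsystem ids, the `levels` table, and the `log` function are all hypothetical stand-ins; the real framework is internal and not shown here.

```python
from enum import IntEnum


class Level(IntEnum):
    DEBUG = 10
    INFO = 20
    ERROR = 40


# Hypothetical subsystem ids used as indexes into one flat, process-wide table.
NET, STORAGE = 0, 1
levels = [Level.INFO, Level.INFO]

output = []  # stands in for the real log sink


def log(subsystem, level, make_message):
    """Cheap table lookup first; the message is built only when enabled."""
    if level >= levels[subsystem]:
        output.append(make_message())


log(STORAGE, Level.DEBUG, lambda: "cache miss for key 42")  # dropped
levels[STORAGE] = Level.DEBUG  # flipped at runtime, e.g. by an admin command
log(STORAGE, Level.DEBUG, lambda: "cache miss for key 42")  # now recorded
log(NET, Level.DEBUG, lambda: "socket opened")              # still dropped
```

        Because the check is a single indexed read, debug logging for one subsystem can be turned on in a running process without touching the others.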
        TZubiri 48 days ago
        I know this is standard practice, but I personally think it's more professional to attach a gdb-like debugger to a process instead of depending on coded log statements.
        tsimionescu 48 days ago
        A very common thing that will happen in professional environments is that you ship software to your customers, and they will occasionally complain that in certain situations (often ones they don't fully understand) the software misbehaves. You can't attach a debugger to your customer's setup that had a problem over the weekend and got restarted: the only solution to debug such issues is to have had programmed logs set up ahead of time.
        ekidd 48 days ago
        In my professional life, somewhere over 99% of time, the code suffering the error has either been:
        1. Production code running somewhere on a cluster.
        2. Released code running somewhere on an end-user's machine.
        3. Released production code running somewhere on an end-user's cluster.
        And errors happen at weird times, like 3am on a Sunday morning on someone else's cluster. So I'd just as soon not have to wake up, figure out all the paperwork to get access to some other company's cluster, and then figure out how to attach a debugger. Especially when the error is some non-reproducible corner case in a distributed algorithm that happens once every few months, and the failing process is long gone. Just no.
        It is so much easier to ask the user to turn up logging and send me the logs. Nine times out of ten, this will fix the problems. The tenth time, I add more logs and ask the user to keep an eye open.
        TZubiri 48 days ago
        I think I get the idea, gdb is too powerful. For contexts where operator is distinct from manufacturer, the debug/logging tool needs to be weaker and not ad-hoc so it can be audited and to avoid exfiltrating user data.
        tsimionescu 47 days ago
        It's not so much about power, but about the ad-hoc nature of attaching a debugger. If you're not there to catch and treat the error as it happens, a debugger is not useful in the slightest: by the time you can attach it, the error, or the context where it happened, are long gone. Not to mention, even if you can attach a debugger, it's most often not acceptable to pause the execution of the entire process for you to debug the error.
        Especially since a lot of the time an exception being raised is not the actual bug: the bug happened many functions before. By logging key aspects of the state of the program, even in non-error cases, when an error happens, you have a much better chance of piecing together how you got to the error state in the first place.
        jeeeb 48 days ago
        The idea in Java is to let the JIT optimise away the logging code.
        This is more flexible as it still allows runtime configuration of the logging level.
        The OP is simply pointing out that some programmers are incompetent and call the trace function incorrectly.
      - cluckindan 48 days ago
        Then you still have the overhead of the log.trace function call and the lambda construction (which is not cheap because it has closure over the params being logged and is passed as a param to a function call, so probably gets allocated on the heap)
        TZubiri 48 days ago
        >Then you still have the overhead of the log.trace function call
        That's not an overhead at all. Even if it were, it's not comparable to string concatenation.
        Regarding overhead of lambda and copying params. Depends on the language, but usually strings are pass by ref and pass by values are just 1 word long, so we are talking one cycle per variable and 8 bytes of memory. Which were already paid anyways.
        That said, logging functions that just take a list of vars are even better, like python's print()
        def printtrace(*args):
            if trace:
                print(*args)
        printtrace("var x and y", x, y)
        Python gets a lot of flak for being a slow language, but you get so much expressiveness that you can invest in optimization after paying a flat cycle cost.
        jeeeb 48 days ago
        That’s what most languages, including Java do.
        The problem the OP is pointing out is that some programmers are incompetent and do string concatenation anyway. A mistake which if anything is even easier in Python thanks to string interpolation.
  - MobiusHorizons 49 days ago
    What you are proposing sounds like a nightmare to debug. The high level perspective of the operation is of course valuable for determining if an investigation is necessary, but the low level perspective in the library code is almost always where the relevant details are hiding. Not logging these details means you are in the dark about anything your abstractions are hiding from higher level code (which is usually a lot)
    - cwillu 48 days ago
      Those details don't belong in the error log level, that's what info or trace is for.
      - dpark 48 days ago
        They were replying to a person who says “it’s almost always wrong for library functions to log anything”. Not just errors.
        Retric 48 days ago
        If it’s not your code how is a log useful vs returning an error?
        Even relatively complex operations like, say, “convert this document into a PDF” basically have only two useful states: either it worked, or something specific failed, at which point just tell me that thing.
        Now, independent software like web servers or databases can have useful logs because they have completely independent interfaces with the outside world. But I call libraries; they don’t call me.
        lazyasciiart 48 days ago
        That’s a very simple operation. Try “take these 100 user generated pdfs and translate all of them”. Oh, “cannot parse unexpected character 0x001?” Cool beans, I wish I knew more.
        Retric 48 days ago
        That’s ok, I’ll just check the log. 50MB of ‘This is my happy place.’ followed by a one-liner: ‘cannot parse unexpected character 0x001?’
        Any library can do a bad job here, that doesn’t come down to logging vs error messages.
        dpark 48 days ago
        The unspoken assumption you are making is that anyone who would disagree with your philosophy on this is incompetent.
        Retric 47 days ago
        Being incorrect doesn’t imply general incompetence.
        dpark 47 days ago
        Your statement that logging would contain zero useful information indicates an assumption of incompetence.
        Retric 47 days ago
        No, I’m only saying a useless error code and a useless log are both possible. Either could be useful or they could both be useless because the creator was actively malicious etc. Thus, the possibility of a useless error code doesn’t inherently mean a log would improve things.
        Really the only thing we can definitely say is that when both approaches are executed well, it’s harder to use log entries in your code. A returned error is tied to a specific call to a specific bit of code, whereas a log entry could in theory be from anything.
      - heisenbit 48 days ago
        Trace can become so voluminous that it is switched on only on an as-needed basis, which can be too late for rare events. Also, trace level, being more of an as-needed debug tool, tends to be less scrutinized for exposing sensitive data, making it unsuitable for continuous operation or use in live production.
    - Too 48 days ago
      Simple: include those relevant details in the exceptions instead of hiding them.
      [-]
      - awesome_dude 48 days ago
        At the extreme end: If my Javascript frontend is being told about a database configuration error happening in the backend when a call with specific parameters is made - that is a SERIOUS security problem.
        Errors are massaged for the reader - a database access library will know that a DNS error occurred and that is (the first step for debugging) why it cannot connect to the specified datastore. The service layer caller does not need to know that there is a DNS error, it just needs to know that the specified datastore is uncontactable (and then it can move on to the appropriate resilience strategy, retry that same datastore, fallback to a different datastore, or tell the API that it cannot complete the call at all).
        The caller can then decide what to do (typically say "Well, I tried, but nothing's happening, have yourself a merry 500").
        It makes no sense for the Service level to know the details of why the database access layer could not connect, no more than it makes any sense for the database access layer to know why there is a DNS configuration error - the database access just needs to log the reasons (for humans to investigate), and tell the caller (the service layer) that it could not do the task it was asked to do.
        If the service layer is told that the database access layer encountered a DNS problem, what is it going to do?
        Nothing, the best it can do is log (tell the humans monitoring it) that a DB access call (to a specific DB service layer) failed, and try something else, which is a generic strategy, one that applies to a host of errors that the database call could return.
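        A minimal sketch of that layering, assuming a made-up `DatastoreUnavailable` exception and hypothetical `connect` / `fetch_with_fallback` functions. The resolver is injectable so the example runs without real DNS; `fake_resolve` and the host names are placeholders.

```python
import logging
import socket

log = logging.getLogger("db_access")


class DatastoreUnavailable(Exception):
    """Hypothetical error the access layer exposes to its callers."""


def connect(host, resolve=socket.gethostbyname):
    """Access layer: log the low-level reason, raise a generic failure."""
    try:
        return resolve(host)
    except socket.gaierror as exc:
        # The DNS detail goes to the log for humans to investigate.
        log.error("DNS lookup failed for datastore host %r: %s", host, exc)
        # Callers only need to match on the exception type.
        raise DatastoreUnavailable(host) from exc


def fetch_with_fallback(hosts, resolve=socket.gethostbyname):
    """Service layer: generic resilience strategy, no knowledge of DNS."""
    for host in hosts:
        try:
            return connect(host, resolve)
        except DatastoreUnavailable:
            continue  # try the next datastore
    return None  # the caller would turn this into a 500


def fake_resolve(host):
    if host == "primary.invalid":
        raise socket.gaierror(-2, "Name or service not known")
    return "10.0.0.2"


result = fetch_with_fallback(["primary.invalid", "replica.invalid"], fake_resolve)
```

        The service layer's handling is identical for any cause of `DatastoreUnavailable`, while the DNS detail is written exactly once, by the layer that understands it.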
        kgklxksnrb 48 days ago
        That’s how we get errors like ”file not found”, without a file name. A pain for mankind.
        lelanthran 47 days ago
        > At the extreme end: If my Javascript frontend is being told about a database configuration error happening in the backend when a call with specific parameters is made - that is a SERIOUS security problem.
        I'll accept that it is a security problem; why would it be a serious security problem? Any error that the client knows about the configuration is unlikely to be one that is exploitable anyway, and if it is (for example, the client gets told "could not connect to 192.168.1.139:5432"), then you have bigger problems than sending error messages to clients.
        What sort of example did you have in mind that makes this a serious security problem?
        awesome_dude 46 days ago
        2. Verbose Error Messages: When Your Application Talks Too Much
        Verbose error messages represent another common misconfiguration that gifts critical information to attackers. When applications encounter errors, they often generate detailed messages intended for developers. In production environments, these messages can reveal:
        Technical infrastructure details: Database types, versions, server configurations
        File paths and directory structures: Enabling directory traversal attacks
        Programming logic: Including code snippets that expose application behavior
        Sensitive credentials: Database connection strings, usernames, passwords
        Software versions: Allowing attackers to identify known vulnerabilities
        The impact of this vulnerability is significant. Error messages can expose not just that a system runs PHP, but that it runs a specific, unsupported version, providing attackers with a clear exploitation path.
        Security researchers have documented numerous instances where verbose error messages enabled breaches:
        Dating App Vulnerability (2016): Tinder’s login system displayed error messages indicating whether specific email addresses were registered, enabling brute-force attacks to identify valid accounts.
        Password Manager Leak (2019): A popular password manager’s login form disclosed through error messages whether email addresses were registered with the service, facilitating targeted attacks.
        Government Agency Breach (2020): A major US government agency’s website displayed error messages revealing whether specific usernames existed in the system, enabling attackers to enumerate valid accounts.
        [1] https://medium.com/@instatunnel/security-misconfiguration-th...
        lelanthran 46 days ago
        First, I disagree that "user emails can be brute-forced" is a serious security issue.
        I mean, sure, it's a security issue, but on a scale of 1-10, with 1 being "security issue, we'll fix in next point release" and 10 being "All-hands until this emergency patch goes out, and we keep the system offline while fixing it", this is definitely a 1.
        Secondly, this barely counts as a security issue; some systems I worked on recently required error messages to tell the user how to fix the error they got. You don't simply say (for example) "attachment not found", you say "Field $FIELD is empty. This is a mandatory field" or similar.
        There are still plenty of secure systems out there that will direct the user to create an account if an unregistered user attempts to log in.
        It's a trade-off in usability: some places go the "Authentication failed (but we won't tell you why)" route, and others go the "Click here to sign up" route.
        awesome_dude 45 days ago
        > First, I disagree that "user emails can be brute-forced" is a serious security issue.
        > I mean, sure, it's a security issue, but on a scale of 1-10, with 1 being "security issue, we'll fix in next point release" and 10 being "All-hands until this emergency patch goes out, and we keep the system offline while fixing it", this is definitely a 1.
        Jesus no.
        Aside from this now being an argument on semantics, someone enumerating every customer/user account you have is serious.
        It opens the door for privacy leaks and targeted attacks (like password attempts, phishing, or account lockouts).
        If you don't want to take that seriously, thank you for your honesty, I will ensure that I never have an account on any service you work on.
        lelanthran 45 days ago
        > If you don't want to take that seriously, thank you for your honesty, I will ensure that I never have an account on any service you work on.
        That's fine; you already have multiple accounts on various providers that can be trivially massaged by a client into providing proof of life of an email address.
        Microsoft, OpenAI, Anthropic, Oracle, Amazon; I tried them all now, and they let you enumerate emails trivially by clicking "signup" and then informing you if you choose an email that is already registered.
        > Jesus no.
        You haven't really thought this through as thoroughly as you think you have - email enumeration is still, at the tail end of 2025, possible across all major sites, providers, etc.
      - layer8 48 days ago
        It’s not that simple. First, this results in exception messages that are a concatenation of multiple levels of error escalation. These become difficult to read and have to be broken up again in reverse order.
        Second, it can lose information about at what exact time and in what exact order things happened. For example, cleanup operations during stack unwinding can also produce log messages, and then it’s not clear anymore that the original error happened before those.
        Even when you include a timestamp at each level, that’s often not sufficient to establish a unique ordering, unless you add some sort of unique counter.
        It gets even more complicated when exceptions are escalated across thread boundaries.
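        The "unique counter" can be as small as a filter that stamps each record with a sequence number. A sketch with Python's standard `logging`; the `SequenceFilter` and `ListHandler` classes and the `seq` attribute are names invented here, and relying on `next()` over `itertools.count` being atomic is a CPython-specific assumption.

```python
import itertools
import logging


class SequenceFilter(logging.Filter):
    """Stamp each record with a process-wide, strictly increasing number."""

    _counter = itertools.count(1)

    def filter(self, record: logging.LogRecord) -> bool:
        record.seq = next(self._counter)
        return True  # never drops a record, only annotates it


class ListHandler(logging.Handler):
    """Collects formatted lines in memory so the ordering can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


handler = ListHandler()
handler.addFilter(SequenceFilter())
handler.setFormatter(logging.Formatter("%(seq)d %(message)s"))

log = logging.getLogger("seq_demo")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

log.error("original failure")
log.info("cleanup during unwinding")
```

        The sequence number gives a total order even when two records share a timestamp. It does not help if the error text is only attached to an exception and logged later, which is the ordering problem described above.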
        ninkendo 48 days ago
        > First, this results in exception messages that are a concatenation of multiple levels of error escalation. These become difficult to read and have to be broken up again in reverse order
        Personally I don't mind it... the whole "$outer: $inner" convention naturally lends itself to messages that still parse in my brain and actually include the details in a pretty natural way. Something like:
        "Error starting up: Could not connect to database: Could not read database configuration: Could not open config file: Permission denied"
        Tells me the config file for the database has broken permissions. Because the permission denied error caused a failure opening the config file, which caused a failure reading the database configuration, which caused a failure connecting to the database, which caused an error starting up. It's deterministic in that for "$outer: $inner", $inner always caused $outer.
        Maybe it's just experience though, in a sense that it takes a lot of time and familiarity for someone to actually prefer the above. Non-technical people probably hate such messages and I don't necessarily blame them.
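        In Python the same convention falls out of exception chaining. A small sketch: `describe`, `open_config`, and `connect` are made-up helpers, and only the standard `raise ... from ...` mechanism and the `__cause__` attribute are relied on.

```python
def describe(exc: BaseException) -> str:
    """Render an exception and its causes as "outer: inner: innermost"."""
    parts = []
    current = exc
    while current is not None:
        parts.append(str(current))
        current = current.__cause__  # set by `raise ... from ...`
    return ": ".join(parts)


def open_config() -> None:
    try:
        raise PermissionError("Permission denied")
    except PermissionError as exc:
        raise RuntimeError("Could not open config file") from exc


def connect() -> None:
    try:
        open_config()
    except RuntimeError as exc:
        raise RuntimeError("Could not connect to database") from exc


try:
    connect()
except RuntimeError as exc:
    message = describe(exc)
    # message is now:
    # "Could not connect to database: Could not open config file: Permission denied"
```

        Each layer adds only what it knows, and the join order guarantees that every segment was caused by the one to its right.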
      - MobiusHorizons 48 days ago
        Sometimes you don’t have all the relevant details in scope at the point of error. For instance some recoverable thing might have happened first which exercises a backup path with slightly different data. This is not exception worthy and execution continues. Then maybe some piece of data in this backup path interacts poorly with some other backend causing an error. The exception won’t tell you how you got there, only where you got stuck. Logging can tell you the steps that led up to that, which is useful. Of course you need a way to deal with verbose logs effectively, but such systems aren’t exactly rare these days.
        [-]
        Hackbraten 48 days ago
        > Then maybe some piece of data in this backup path interacts poorly with some other backend causing an error. The exception won’t tell you how you got there, only where you got stuck.
        Then catch the exception on the backup path and wrap it in a custom exception that conveys to the handler the fact that you were on the backup path. Then throw the new exception.
      - Kwpolska 48 days ago
        Not all problems cause exceptions.
        [-]
        energy123 48 days ago
        That's a matter of good taste, but there's nothing preventing you from throwing exceptions on every issue and requiring consumers to handle them
        [-]
        makeitdouble 48 days ago
        Imagine you have a caching library that handles DB fallback. A cache that should be there but goes missing is arguably an issue.
        Should it throw an exception for that to let you know, or should it gracefully fall back so your service stays alive? The middle ground is leaving a log and chugging along; your proposition throws that out of the window.
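        A minimal sketch of that middle ground, in Python. Everything here is hypothetical: the `cachelib` logger name, the `CacheWithDbFallback` class, and the choice of `ConnectionError` as the signal that the cache has gone missing are assumptions made for the example.

```python
import logging

logger = logging.getLogger("cachelib")  # hypothetical library name


class CacheWithDbFallback:
    def __init__(self, cache_get, db_get):
        self._cache_get = cache_get  # callable: key -> value, may raise
        self._db_get = db_get        # callable: key -> value

    def get(self, key):
        try:
            return self._cache_get(key)
        except ConnectionError as exc:
            # The cache should be there but is not: leave a trace for the
            # operator, then keep the service alive by reading from the DB.
            logger.warning(
                "cache unavailable (%s); falling back to DB for %r", exc, key
            )
            return self._db_get(key)
```

        The caller gets its value either way; whether that warning deserves attention is left to whoever configures the `cachelib` logger.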
    - TZubiri 48 days ago
      You can log your IO and as long as your functions are idempotent that should be enough info to replicate.
      [-]
      - dpark 48 days ago
        Assuming everything is idempotent is a tall order.
        There are a lot of libraries that have non-idempotent actions. There are a lot of inputs that can be problematic to log, too.
        [-]
        TZubiri 48 days ago
        Say like opening a file?
        I guess in those cases standard practice is for lib to return a detailed error yeah.
        As far as traces, trying to solve issues that depend on external systems is indeed a tall order for your code. Isn't it beyond the scope of the thing being programmed?
        [-]
        sigseg1v 48 days ago
        From my experience working on B2B applications, I am happy that everything is generally spammed to the logs because there would simply be no other reasonable way to diagnose many problems.
        It is very, very common that the code that you have written isn't even the code that executes. It gets modified by enterprise anti virus or "endpoint security". All too often I see "File.Open" calls report that the caller has access, when what has actually happened is that AV has intercepted the call, blocked it improperly, and returned a 0-byte file that exists (even though there is actually a larger file there) instead of saying the file cannot be opened.
        I will never, in a million years, be granted access to attach a debugger to such a client computer. In fact, they will not even initially disclose that they are using anti virus. They will just say the machine is set up per company policy and that your software doesn't work, fix it. The assumption is always that your software is to blame and they give you nearly nothing, except for the logs.
        The only way I ever get this solved in a reasonable amount of time is by looking at verbose logs, determining that the scenario they have described is impossible, explaining which series of log messages should not be able to occur, yet occurred on their system, and asking them to investigate further. Usually this ends up being closed with a resolution like "Checked SuperProtectPro360 logs and found it was writing infernal error logs at the same time as using the software. Adjusted the monitoring settings and problem is now resolved."
        dpark 48 days ago
        I don’t really understand what you mean about opening files. Is this just an example of an idempotent action or is there some specific significance here?
        Either way logging the input (file name) is notably not sufficient for debugging if the file can change between invocations. The action can be idempotent and still be affected by other changes in the system.
        > trying to solve issues that depend on external systems is indeed a tall order for your code. Isn't it beyond the scope of the thing being programmed?
        If my program is broken I need it fixed regardless of why it’s broken. The specific example here of a file changing is likely to manifest as flakiness that’s impossible to diagnose without detailed logs from within the library.
        [-]
        TZubiri 48 days ago
        I was just trying to think of an example of a non idempotent function. As in it depends on an external IO device.
        I will say that error handling and logging in general is one of my weak points, but I made a comment about my approach so far being dbg/pdb based, attaching a debugger and creating breakpoints and prints ad-hoc rather than writing them in code. I'm sure there are reasons why it isn't used as much and logging in code is so much more common, but I have faith that it's a path worth specializing in.
        Back to the file reading example, for a non-idempotent function. Considering we are using an encapsulating approach we have to split ourselves into 3 roles. We can be the IO library writer, we can be the calling code writer, and we can be an admin responsible for the whole product. I think a common trap engineers fall for is trying to keep all of the "global" context (or as much as they can handle) at all times.
        In this case of course we wouldn't be writing the non-idempotent library, so that's not a hat we wear; we do not quite care about the innards of the function and its state, rather we have a well defined set of errors that are part of the interface of the function (EINVAL, EACCES, EEXIST).
        In this sense we respect the encapsulation boundaries and are provided the information necessary by the library. If we ever need to dive into the actual library code, first the encapsulation is broken and we are dealing with a leaky abstraction, second we just dive into the library code, (or the filesystem admin logs themselves).
        It's not precisely the type of responsibility that can be handled at design time and in code anyways; when we code we are wearing the calling-module programmer hat. We cannot think of everything that the sysadmin might need at the time of experiencing an error; we have to trust that they will be armed with enough other tools to gather the necessary information. And thank god for that! Checking /proc/fs, looking at crash dumps, and attaching to processes with dbg will yield far better info than relying on whatever print statements you somehow added to your program.
        Anyways at least that's my take on the specific example of glibc-like implementations of POSIX file operations like open(). I'm sure the implications may change for other non-idempotent functions, but at some point, talking about specifics is a bit more productive than talking in the abstract.
        [-]
        dpark 48 days ago
        The issue with relying on gdb is that you generally cannot do this in production. You can’t practically attach a debugger to a production instance of a service for both performance and privacy reasons, and the same generally applies to desktop and mobile applications being run by your customers. Gdb is mostly for local debugging and the truth is that “printf debugging” is how it often works for production. (Plus exception traces, crash dumps, etc. But there is a lot of debugging based on logging.) Interactive debugging is so much more efficient for local development but capable preexisting logging is so much more efficient for debugging production issues.
        I generally agree that I would not expect a core library to do a bunch of logging, at least not onto your application logs. This stuff generally is very stable with a clean interface and well defined error reporting.
        But there’s a whole world of libraries that are not as clean, not as stable, and not as well defined. Most libraries in my experience are nowhere near as clean as standard IO libraries. They often do very complex stuff to simplify for the calling application and have weakly defined error behavior. The more complexity a library contains, the more it likely has this issue. Arguably that is leaky abstraction but it’s also the reality of a lot of software and I’m not even sure that’s a bad thing. A good library that leaks in unexpected conditions might be just fine for many real world purposes.
        [-]
        TZubiri 47 days ago
        It's coming together more clearly now.
        I guess my experience is more from the role of a startup or even in-house software. So we both develop and operate the software. But in scenarios where you ship the software and it's operated by someone else, it makes sense to have more auditable and restricted logging instead of all-too-powerful ad-hoc debugging.
  - jeroenhd 48 days ago
    I very much appreciate libraries that provide optional logging. Tracing error causes in network protocol calls can be pretty near impossible without throwing a library/package/crate/whatever into TRACE mode.
    Of course they shouldn't just be dumping text to stdout/stderr, but as long as the library logging is optional (or only logs when the library has reached some kind of unrecoverable state with instructions to file a bug report), logging is often the right call.
    It's easier to have logs and turn them off at compile time/runtime than to not have logs and need them once deployed.
  - esrauch 49 days ago
    I think an example where libraries could sensibly log an error is if you have a condition which is recoverable but may cause a significant slowdown, including a potential DoS issue, and the application owner can remediate.
    You don't want to throw because destroying someone's production isn't worth it. You don't want to silently continue in that state because realistically there's no way for the application owner to understand what is happening and why.
    [-]
    - TZubiri 48 days ago
      We call those warnings, and it's very common to downgrade errors to warnings by wrapping an exception and printing the trace as you would an exception.
      [-]
      - kgklxksnrb 48 days ago
        Logging warnings is cowardly, you just push the decision to the log consumer to decide if the error should be acted on.
        Warnings are just errors that no one wants to deal with.
        [-]
        bluGill 48 days ago
        Warnings are for where you expect someplace else to know/log if it really is an error but it might also be normal. You might log why a file io operation failed: if the caller recovers somehow it isn't an error, but if they can't they log an error and, when investigating, the warning gives the detail you need to figure it out.
        [-]
        kgklxksnrb 48 days ago
        Who proactively investigates warnings?
        [-]
        bluGill 47 days ago
        Statistics are sometimes run and the most common warnings investigated (normally to shut up the noise).
        Mostly, though, when you are working on a known problem, warnings should be a useful filter to find where in the logs the problem might have started; then you use that timestamp to find info logs in the same area.
      - makeitdouble 48 days ago
        Warning logs are usually polluted with stuff nobody wants to fix but tries to wash their hands of with a log. Like deprecated calls or error logs that got demoted because they didn't matter in practice.
        Anything that has a measurable impact on production should be logged above that, except if your system ignores log levels in the first place, but that's another can of worms.
    - ivan_gammel 48 days ago
      In such scenarios it makes sense to give clients an opportunity to react to such conditions programmatically, so just logging is the wrong choice, and if there’s a callback to the client, the client can decide whether to log it and how.
      [-]
      - esrauch 48 days ago
        It's a nice idea but I've literally never seen it done, so I would be interested if you have examples of major libraries that do this. Abstractly it doesn't really seem to work to me in place of simple logs.
        One test case here is that your library has existed for a decade and was fast, but Java removed a method that let you make it fast, but you can still run slow without that API. Java the runtime has a flag that the end user can enable to turn it back on as a stopgap. How do you expect this to work in your model? You expect to have an onUnnecessarilySlow() callback already set up that all of your users have hooked up which is never invoked for a decade, and then once it actually happens you start calling it and expect it to do something at all sane in those systems?
        Second example is all of the scenarios where you're some transitively used library for many users; it makes any callback strategy immediately not work if the person who needs to know about the situation and could take action is the application owner rather than the people writing library code which called you. It would require every library to offer these same callbacks and transitively propagate things, which would only work if it was just such a firm idiomatic pattern in some language ecosystem and I don't believe that it is in any language ecosystem.
        [-]
        ivan_gammel 48 days ago
        > library has existed for a decade
        > but Java removed a method that let you make it fast, but you can still run slow without that API
        I’d like to see an example of that, because this is an extremely hypothetical scenario. I don’t think any library is so advanced as to anticipate such scenarios and write something to log. And of course Java specifically has a longer cycle of deprecation and removal. :)
        As for your second example, let’s say library A is smart and can detect certain issues. Library B depending on it is at higher abstraction level, so it has enough business context to react on them. I don’t think it’s necessary to propagate the problem and leak implementation details in this scenario.
        [-]
        esrauch 48 days ago
        Protobuf is the example I had in mind. It uses sun.misc.Unsafe which is being removed in upcoming Java releases, but it has a slow fallback path. It logs a warning when it runs if it can tell it's only using the fallback path but the fast path is still available if the application owner sets a flag to turn it back on if they want to:
        https://github.com/protocolbuffers/protobuf/issues/20760
        Java Protobuf also logs a warning now if it can tell you are using gencode old enough that it's covered by a DoS CVE. They actually did a release that broke compatibility with the CVE-covered gencode but restored it and print a warning in a newer release.
        [-]
        vips7L 48 days ago
        What’s stopping you from using the replacements provided in VarHandle and MemorySegment? Just wanting to support the 10 year old JDK 8?
        [-]
        esrauch 48 days ago
        There's a lot here, to be honest these things always come back to investment cost and ROI compared to everything else that could be worked on.
        Java 8 is still really popular, probably the most popular single version. It's not just servers in context, but also Android where Java 8 is the highest safe target, it's not clear what decade we'll be in when VarHandle would be safe to use there at all.
        VarHandle was Java 9 but MemorySegment was Java 17. And the rest of FFM is only in 25 which is fully bleeding edge.
        Protobuf may realistically try to move off of sun.misc.Unsafe without the performance regressions and without adopting MemorySegment, to avoid the versioning problem, but it takes significant and careful engineering time.
        That said, it's always possible to have a waterfall of preferred implementations based on what's supported; it's just always an implementation/verification cost.
      - dpark 48 days ago
        I’ve written code that followed this model, but it almost always just maps to logging anyway, and the rest of the time it’s narrow options presented in the callback. e.g. Retry vs wait vs abort.
        It’s very rarely realistic that a client would code up meaningful paths for every possible failure mode in a library. These callbacks are usually reserved for expected conditions.
        [-]
        ivan_gammel 48 days ago
        > almost always just maps to logging anyway
        Yes, that’s the point. You log it until you encounter it for the first time, then you know more and can do something meaningful. E.g. let’s say you build an API client and the library offers a callback for HTTP 429. You don’t expect it to happen, so you just log the errors in a generic handler in client code, but then after some business logic change you hit 429 for the first time. If the library offers you control over what is going to happen next, you may decide how exactly you will retry and what happens to your state in between the attempts. If the library just logs and starts a retry cycle, you may get a performance hit that will be harder to fix.
        [-]
        dpark 48 days ago
        Defining a callback for every situation where a library might encounter an unexpected condition and pointing them all at the logs seems like a massive waste of time.
        I would much prefer a library have sane defaults, reasonable logging, and a way for me to plug in callbacks where needed. Writing On429 and a hundred other functions that just point to Logger.Log is not a good use of time.
        [-]
        ivan_gammel 48 days ago
        This sub-thread in my understanding is about a special case (a non-error mode that client may want to avoid, in which case explicit callback makes sense), not about all possible unexpected errors. I’m not suggesting hooks as the best approach. And of course “on429” is the last thing I would think about when designing this. There are better ways.
        [-]
        dpark 48 days ago
        If the statement is just that sometimes it’s appropriate to have callbacks, absolutely. A library that only logs in places where it really needs a callback is poorly designed.
        I still don’t want to have to provide a 429 callback just to log, though. The library should log by default if the callback isn’t registered.
        [-]
        ivan_gammel 48 days ago
        It doesn’t have to provide a specific callback. This can be a starting point (Java):
        var client = aClient().onError((request, response) -> { LOG.debug(…); return FAIL; }).build();
        And eventually you do this:
        return switch(response.code()) { case 429 -> RETRY; default -> FAIL; }
        Or something more interesting, e.g. with more details of retry strategy.
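        The same shape can be sketched in Python, here combined with the earlier request that the library log by default when no callback is registered. This is a rough sketch under assumptions: `Client`, `Action`, `log_and_fail`, and `retry_on_429` are hypothetical names, and `send` stands in for whatever actually performs the request and returns a status code.

```python
import enum
import logging

logger = logging.getLogger("apiclient")  # hypothetical library name


class Action(enum.Enum):
    RETRY = "retry"
    FAIL = "fail"


def log_and_fail(status):
    """Default handler: record the failure and give up."""
    logger.debug("request failed with HTTP %s", status)
    return Action.FAIL


class Client:
    def __init__(self, send, on_error=log_and_fail, max_attempts=3):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._send = send          # callable returning an HTTP status code
        self._on_error = on_error  # callable: status -> Action
        self._max_attempts = max_attempts

    def request(self):
        for _ in range(self._max_attempts):
            status = self._send()
            if status < 400:
                return status
            if self._on_error(status) is not Action.RETRY:
                break
        raise RuntimeError(f"request failed with HTTP {status}")


def retry_on_429(status):
    """Application-side handler written after 429 shows up in practice."""
    return Action.RETRY if status == 429 else Action.FAIL
```

        An application starts with `Client(send)` and gets debug logging for free; once rate limiting actually appears, it passes `on_error=retry_on_429` without the library changing.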
  - Etherlord87 49 days ago
    This seems like such an obvious answer to the problem, your program isn't truly modularized if logging is global. If an error is unexpected it should bubble all the way up, but if it's expected and dealt with, the error message should be suppressed or its type changed to a warning.
    [-]
    - dpark 48 days ago
      I’ve worked on systems with “modularized” logging. It’s never been pleasant because investigations involve stitching together a bunch of different log sources to understand what actually happened. A global log dump with attribution (module/component/file/line) is far easier to work with.
  - echelon 49 days ago
    You need a tuple: (context, level)
    The application owner should be able to adjust the contexts up or down. This is the point of ownership and where responsibility over which logs matter is handled.
    A library author might have ideas and provide useful suggestions, but it's ultimately the application owner who decides. Some libraries have huge blast radius and their `error` might be your `error` too. In other contexts, it could just be a warning. Library authors should make a reasonable guess about who their customer is and try to provide semantic, granular, and controllable failure behavior.
    As an example, Rust's logging ecosystem provides nice facilities for fine-grained tamping down of errors by crate (library) or module name. Other languages and logging libraries let you do this as well.
    That capability just isn't adopted everywhere.
    [-]
    - Izkata 48 days ago
      Python's built-in logging is the same if used correctly, where the library gets a logger based on its module name (this part isn't enforced) and the application can add a handler to that logger to route the logs differently if needed.
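      As a small sketch of that pattern with the standard `logging` module: the library logs through a logger named after its module, and the application routes or silences everything under that name. The `noisylib` package name, `fetch`, `ListHandler`, and `route_library_logs` are hypothetical and exist only for this example.

```python
import logging

# Library side (hypothetical package "noisylib"): log through a logger
# named after the module, never through the root logger.
lib_logger = logging.getLogger("noisylib.net")


def fetch():
    lib_logger.error("connection reset; retrying")
    return "ok"


# Application side: route and filter the library's records by logger name.
class ListHandler(logging.Handler):
    """Collects records in memory; a stand-in for a file or network handler."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def route_library_logs(handler, level=logging.WARNING):
    parent = logging.getLogger("noisylib")  # covers all "noisylib.*" loggers
    parent.addHandler(handler)
    parent.setLevel(level)
    parent.propagate = False  # keep these records out of the app's main log
```

      Calling `route_library_logs(handler, level=logging.CRITICAL)` is enough to make the library's `error` records disappear for an application that considers them noise.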
  - cyphar 48 days ago
    On paper, USDT probes are the best way for libraries (and binaries) to provide information for debugging because they can be used programmatically and have no performance overhead until they are measured but unfortunately they are not widely used.
  - renewiltord 48 days ago
    Conflicting goals for the predominant libraries is what causes this. Log4J2 has a rewrite appender that solves the problem. But if you want zero-copy etc I don’t think there’s such a solution.
  - paulddraper 48 days ago
    It may be unwise to log errors at low layers but logging informational and debug messages is useful (at least, when the caller enables them).
  - pca006132 48 days ago
    Wonder if someone used effect handlers for error logging. Sounds like a natural and modular way of handling this problem.
- ivan_gammel 49 days ago
  Libraries should not log on levels above DEBUG, period. If there’s something worthy for reporting on higher levels, pass this information to client code, either as an event, or as an exception or error code.
  [-]
  - layer8 49 days ago
    From a code modularization point of view, there shouldn’t really be much of a difference between programs and libraries. A program is just a library with a different calling convention. I like to structure programs such that their actual functionality could be reused as a library in another program.
    This is difficult to reconcile with libraries only logging on a debug level.
    [-]
    - schrodinger 49 days ago
      I see your point, but disagree on a practical level. Libraries are being used while you’re in “developer” mode, while programs are used in “user” mode (trying awkwardly to differentiate between _being_ a developer and currently developing code around that library).
      Usually a program is being used by the user to accomplish something, and if logging is meaningful then it’s either in a CLI context or a server context. In both cases, errors are more often being seen by people/users than by code. Therefore printing them to logs makes sense.
      A lib, meanwhile, is being used by a program, so it has a better way to communicate problems to the caller (exceptions, error values, choose the poison of your language). But I almost never want a library to start logging shit because it’s almost guaranteed to not follow the same conventions as I do in my program elsewhere. Return me the error and let me handle it.
      It’s analogous to how Go has an implicit rule that a library should never let a panic occur outside the library. Internally, fine. But at the package boundary, you should catch panics and return them as an error. You don’t know if the caller wants the app to die because of an error in your lib!
    - ivan_gammel 49 days ago
      The main difference is that library is not aware of the context of the execution of the code, so cannot decide, whether the problem is expected, recoverable or severe.
      [-]
      - dpark 48 days ago
        And the program doesn’t know if the user is expecting failure, either. The library case is not actually much different.
        It’s very reasonable that a logging framework should allow higher levels to adjust how logging at lower levels is recorded. But saying that libraries should only log debug is not. It’s very legitimate for a library to log “this looks like a problem to me”.
      - layer8 49 days ago
        The same is true for programs that are being invoked. The program only knows relative to its own purpose, and the same is again true for libraries. I don’t see the difference, other than, as already mentioned, the mechanism of program vs. library invocation.
        Consider a Smalltalk-like system, or something like TCL, that doesn’t distinguish between programs and libraries regarding invocation mechanism. How would you handle logging in that case?
        [-]
        msteffen 48 days ago
        Okay, but…most programs are written in Python or Rust or something, where invoking library functions is a lot safer, more ergonomic, more performant, and more common than spawning a subprocess and executing a program in it. Like you can’t really ignore the human expectations and conventions that are brought to bear when your code is run (the accommodation of which is arguably most of the purpose of programming languages).
        When you publish a library, people are going to use it more liberally and in a wider range of contexts (which are therefore harder to predict, including whether a given violation requires human intervention)
        ivan_gammel 48 days ago
        The purpose of a program and of a library is different and intent of the authors of the code is usually clear enough to make the distinction in context. Small composable programs aren’t interesting case here, they shouldn’t be verbose anyway even to justify multiple logging levels (it’s probably just set to on/off using a command line argument).
        hrimfaxi 49 days ago
        The mechanism of invocation is important. Most programs allow you to set the logging verbosity at invocation. Libraries may provide an interface to do so but their entry points tend to be more numerous.
  - lanstin 48 days ago
    I have a logging level I call "log lots" where it will log the first time with probability 1, but as it hits the same line more often, it will log with lower and lower probability, bottoming out around 1/20000 times. Sort of a "log with probability proportional to the unlikeliness of the event". So if I get e.g. sporadic failures to some back end, I will see them all, but if it goes down hard I will see it is still down but also be able to read other log msgs.
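    One way such a level could be sketched in Python is below. The details are assumptions rather than the scheme described above: the probability decays as 1/n for the n-th hit of a caller-supplied key, the floor defaults to 1/20000, and the `LogLots` class and its `rng` parameter are invented for the example.

```python
import logging
import random
from collections import Counter


class LogLots:
    """Log the n-th occurrence of a key with probability max(1/n, floor)."""

    def __init__(self, logger, floor=1 / 20000, rng=random.random):
        self._logger = logger
        self._floor = floor
        self._rng = rng          # injectable so the behaviour can be tested
        self._hits = Counter()

    def log(self, key, level, msg, *args):
        self._hits[key] += 1
        n = self._hits[key]
        probability = max(1.0 / n, self._floor)
        # random.random() returns a value in [0.0, 1.0), so the first hit
        # (probability 1.0) is always logged.
        if self._rng() < probability:
            self._logger.log(level, msg + " (seen %d times)", *args, n)
            return True
        return False
```

    Sporadic failures each have a low count and are nearly always logged, while a backend that is hard down is still reported occasionally without drowning out everything else.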
  - 1718627440 49 days ago
    Why? What's wrong with logging it and passing the log object to the caller? The caller can still modify the log entry however it pleases.
    [-]
    - ivan_gammel 49 days ago
      Practicality. It is excessive for client code to calibrate library logging level. It’s ok to do it in logging configuration, but having an entry for every library there is also excessive. It is reasonable to expect that dev/staging may have base level at DEBUG and production will have base level at INFO, so that a library following the convention will not require extra effort to prevent log spam in production. Yes, we have an entire logging industry around aggregation of terabytes of logs, with associated costs, but do you really need that? In other words, are we developers too lazy to adopt a sane logging policy, which actually requires minimum effort, and will just burn the company money for nothing?
      [-]
      - dolmen 48 days ago
        TLDR: I agree.
        A library might also be used in multiple places, maybe deep in a dependency stack, so the execution context (high level stack) matters more than which library got a failure.
        So handling failures should stay in the hands of the developer calling the library and this should be a major constraint for API design.
  - kelnos 49 days ago
    Eh, as with anything there are always exceptions. I generally agree with WARN and ERROR, though I can imagine a few situations where it might be appropriate for a library to log at those levels. Especially for a warning, like a library might emit "WARN Foo not available; falling back to Bar" on initialization, or something like that. And I think a library is fine logging at INFO (and DEBUG) as much as it wants.
    Ultimately, though, it's important to be using a featureful logging framework (all the better if there's a "standard" one for your language or framework), so the end user can enable/disable different levels for different modules (including for your library).
    [-]
    - ivan_gammel 49 days ago
      > WARN Foo not available; falling back to Bar
      In server contexts this is usually unnecessary noise, if it’s not change detection. Of course, good logging framework will help you to mute irrelevant messages, but as I said in another comment here, it’s a matter of practicality. Library shouldn’t create extra effort for its users to fine-tune logging output, so it must use reasonable defaults.
- HendrikHensen 48 days ago
  > Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of a top-level failure.
  Some languages (e.g. Java) include a stack trace when reporting an error, which is extremely useful when logging the error. It shows at exactly which point in the code the error was generated, and what the full call stack was to get there.
  It's a real shame that "modern" languages or "low level" languages (e.g. Go, Rust) don't include this out of the box, it makes troubleshooting errors in production much more difficult, for exactly the reason you mention.
  [-]
  - StellarScience 48 days ago
    C++ with Boost has let you grab a stacktrace anywhere in the application for years. But in April 2024 Boost 1.85 added a big new feature: stacktrace from arbitrary exception ( https://www.boost.org/releases/1.85.0/ ), which shows the call stack at the time of the throw. We added it to our codebase, and suddenly errors where exceptions were thrown became orders of magnitude easier to debug.
    C++23 added std::stacktrace, but until it includes stacktrace from arbitrary exception, we're sticking with Boost.
  - layer8 48 days ago
    The point in the code is not the same information as knowing the time, or knowing the order with respect to operations performed during stack unwinding. Stacktraces are very useful, but they don’t replace lower-level logging.
  - dolmen 48 days ago
    The idiomatic practice in Go for libraries is to wrap returned errors and errors can be unwrapped with stdlib tooling. This is more useful to handle errors at runtime than digging into a stack trace.
- hinkley 49 days ago
  Log4j has had the ability to filter log levels by subject matter for twenty years. In Java you end up having to use that a lot for this reason.
  [-]
  - PartiallyTyped 49 days ago
    Logging in rust also does that, you can set logging levels for individual modules deep within your dependency tree.
  - TZubiri 48 days ago
    Oh that library that gives you a write() wrapper in exchange for RCE vulns
    [-]
    - ivan_gammel 48 days ago
      Log4j is basically a design pattern. If you don’t like the library, Slf4j/logback are based on the same principles.
- 0x696C6961 49 days ago
  Libraries should not log, instead they should allow registering hooks which get called with errors and debug info.
  [-]
  - kelnos 49 days ago
    I think this is useful for libraries in a language like C, where there is no standardized logging framework, so there's no way for the application to control what the library logs. But in a language (Java, Rust, etc.) where there are standard, widely-used logging frameworks that give people fine-grained control over what gets logged, libraries should just use those frameworks.
    (Even in C, though... errors should be surfaced as return values from functions causing the error, not just logged somewhere. Debug info, sure, have a registerable callback for that.)
    [-]
    - dolmen 48 days ago
      Log4J style logging is effectively a hook system. But it is too easy to misuse it by logging at too high a level and delegating level fixing to the end user.
  - ivan_gammel 49 days ago
    They can log if platform permits, i.e. when you can set TRACE and DEBUG to no-op, but of course it should be done reasonably. Having hooks is often an overkill compared to this.
  - esrauch 49 days ago
    It doesn't seem to work this way in practice, not least because most libraries will be transitive deps of the application owner.
    I think creating the hooks is very close to just not doing anything here, if no one is going to use the hooks anyway then you might as well not have them.
  - Blackthorn 48 days ago
    Libraries should log in a way that is convenient to the developer rather than a way that is ideologically consistent. Oftentimes, that means logging as we know it.
- peacebeard 48 days ago
  I've been thinking about this all day. I think the best approach is probably twofold:
  1) Thrown errors should track the original error to retain its context. In JavaScript errors have a `cause` option which is perfect for this. You can use the `cause` to hold a deep stack trace even if the error has been handled and wrapped in a different error type that may have different semantics in the application.
  2) For logging that does not stop program execution, I think this is a great case for dependency injection. If a library allows its consumer to provide a logger, the application has complete control over how and when the library logs, and can even change it at runtime. If you have a disagreement with a library, for example it logs errors that you want to treat as warnings, your injected logger can handle that.
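  A rough Python analogue of both points, as a sketch only: `raise ... from` keeps the original exception on `__cause__`, and the library function takes its logger as a parameter. The names `ConfigError`, `load_port` and `DowngradingLogger` are made up for illustration.

  ```python
  import logging


  class ConfigError(Exception):
      """Application-level error; the low-level cause is kept on __cause__."""


  def load_port(raw, logger):
      """Hypothetical library function: the caller supplies the logger."""
      try:
          return int(raw)
      except ValueError as exc:
          logger.error("could not parse port %r", raw)
          # Wrap in a different error type without losing the original.
          raise ConfigError(f"invalid port setting: {raw!r}") from exc


  class DowngradingLogger:
      """Injected by an application that treats this library's errors as warnings."""

      def __init__(self, inner):
          self.inner = inner

      def error(self, msg, *args):
          self.inner.warning(msg, *args)


  try:
      load_port("eighty", DowngradingLogger(logging.getLogger("app")))
  except ConfigError as err:
      original = err.__cause__  # the ValueError raised by int()
  ```

  The library still calls `logger.error`, but what that means is decided by whoever constructed the logger.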
eterm 49 days ago
How I'd personally like to treat them:
```
  - Critical / Fatal:  Unrecoverable without human intervention, someone needs to get out of bed, now.
  - Error : Recoverable without human intervention, but not without data / state loss. Must be fixed asap. An assumption didn't hold.
  - Warning: Recoverable without intervention. Must have an issue created and prioritised. ( If business as usual, this could be downgraded to INFO. )
```
The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".
So for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.
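A minimal sketch of that JSON example in Python. The function name `parse_payload` and the `we_produced_it` flag are hypothetical; the level mapping follows the list above.

```python
import json
import logging

log = logging.getLogger("ingest")


def parse_payload(raw, we_produced_it):
    """Return the decoded object, or None if the payload is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Our own serialiser emitting bad JSON breaks an assumption: ERROR.
        # Bad JSON from someone else is something we thought might happen: WARNING.
        level = logging.ERROR if we_produced_it else logging.WARNING
        log.log(level, "invalid JSON (%s): %r", exc, raw[:80])
        return None
```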
[-]
- arwhatever 49 days ago
  I like to think of “warning” as something to alert on statistically, e.g. incorrect password attempt rate jumps from 0.4% of login attempts to 99%.
  [-]
  - mzi 48 days ago
    This sounds more like metrics than a log statement.
    For me logs should complement metrics, and can in many instances be replaced by tracing if the spans are annotated sufficiently. But making metrics out of logs is both costly and a bit brittle.
  - lanstin 48 days ago
    This point is important - the value of a log is inextricably tied to its unlikelihood. Which depends on so many things in the context.
    [-]
    - bluGill 48 days ago
      The value of all logs is tied only to if there is a problem will it help you find and debug it. If you never do statistics that password log is useless. If you never encounter a problem where the log helps debug it was useless.
      God doesn't tell you the future so good luck figuring out which logs you really need.
- masswerk 49 days ago
  Also, warnings for ambiguous results.
  For example, when a process implies a conversion according to the contract/convention, but we know that this conversion may be not the expected result and the input may be based on semantic misconceptions. E.g., assemblers and contextually truncated values for operands: while there's no issue with the grammar or syntax or intrinsic semantics, a higher level misconception may be involved (e.g., regarding address modes), resulting in a correct but still non-functional output. So, "In this individual case, there may be or may be not an issue. Please, check. (Not resolvable on our end.)"
  (Disclaimer: I know that this is a very much classic computing and that this is now mostly moved to the global TOS, but still, it's the classic example for a warning.)
- RaftPeople 49 days ago
  > The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".
  What about conditions like "we absolutely knew this would happen regularly, but it's something that prevents the completion of the entire process which is absolutely critical to the organization"
  The notion of an "error" is very context dependent. We usually use it to mean "can not proceed with action that is required for the successful completion of this task"
  [-]
  - wizzwizz4 49 days ago
    Those conditions would be "Critical", no? The error vs warning distinction doesn't apply.
    [-]
    - fhcuvyxu 48 days ago
      No, many applications need to be fault tolerant.
      Crashing your web stack because one route hit an error is a dumb idea.
      And no, calling it a warning is also dumb idea. It is an error.
      This article is a navel gazing expedition.
      They're kind of right but you can turn any warning into an error and vice versa depending on business needs that outweigh the technical categorisation.
      [-]
      - wizzwizz4 48 days ago
        A log entry marked "CRITICAL" does not imply crashing the web stack.
        [-]
        fhcuvyxu 47 days ago
        Right. Was thinking of fatal.
- mewpmewp2 49 days ago
  What if you are integrated to a third party app and it gives you 5xx once? What do you log it as, and let's say after a retry it is fine.
  [-]
  - kiicia 49 days ago
    As always „it depends”
    - info - when this was expected and system/process is prepared for that (like automatic retry, fallback to local copy, offline mode, event driven with persistent queue etc)
    - warning - when system/process was able to continue but in degraded manner, maybe leaving decision to retry to user or other part of system, or maybe just relying on someone checking logs for unexpected events, this of course depends if that external system is required for some action or in some way optional
    - error - when system/process is not able to continue and particular action has been stopped immediately, this includes situation where retry mechanism is not implemented for step required for completion of particular action
    - fatal - you need to restart something, either manually or by external watchdog, you don’t expect this kind of logs for simple 5xx
  - bqmjjx0kac 49 days ago
    I would log a warning when an attempt fails, and an error when the final attempt fails.
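    As a minimal Python sketch of that policy: `call_with_retries` and its parameters are hypothetical, and `operation` is any callable that raises on failure.

    ```python
    import logging
    import time

    log = logging.getLogger("client")


    def call_with_retries(operation, attempts=3, delay=0.0):
        """Run operation(); WARNING for each failed attempt, ERROR once we give up."""
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except Exception as exc:
                if attempt == attempts:
                    log.error("attempt %d/%d failed, giving up: %s", attempt, attempts, exc)
                    raise
                log.warning("attempt %d/%d failed, retrying: %s", attempt, attempts, exc)
                time.sleep(delay)
    ```

    A call that succeeds on the second try leaves one WARNING behind and no ERROR.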
    [-]
    - mewpmewp2 49 days ago
      You are not the OP, but I think I was trying to point out this example case in relation to their descriptions of Error/Warnings.
      This scenario may or may not yield in data/state loss, it may also be something that you, yourself can't immediately fix. And if it's temporary, what is the point of creating an issue and prioritizing.
      I guess my point is that to any such categorization of errors or warnings there are way too many counter examples to be able to describe them like that.
      So I'd usually think that Errors are something that I would heuristically want to quickly react to and investigate (e.g. being paged), while Warnings are something I would periodically check in on (e.g. weekly).
      [-]
      - wredcoll 49 days ago
        Like so many things in this industry the point is establishing a shared meaning for all the humans involved, regardless of how uninvolved people think.
        That being said, I find tying the level to expected action a more useful way to classify them.
        [-]
        mewpmewp2 49 days ago
        But what I also see frequently is people trying to do the impossible and idealistic things because they read somewhere that something should mean X, when things are never so clearly cut, so either it is not such a simplistic issue and should be understood as not such a simple issue, or there might be a better more practical definition for it. We should first start from what are we using Logs for. Are we using those for debugging, or so we get alerted or both?
        If for debugging, the levels seem relevant in the sense of how quickly we are able to use that information to understand what is going wrong. Out of potential sea of logs we want to see first what were the most likely culprits for something causing something to go wrong. So the higher the log level, the higher likelihood of this event causing something to go wrong.
        If for alerting, they should reflect on how bad is this particular thing happening for the business and would help us to set a threshold for when we page or have to react to something.
  - marcosdumay 49 days ago
    Well, the GP's criteria are quite good. But what you should actually do depends on a lot more things than the ones you wrote in your comment. It could be so irrelevant as to only deserve a trace log, or so important as to get a warning.
    Also, you should have event logs you can look at to make administrative decisions. That information surely fits into those, you will want to know about it when deciding to switch to another provider or renegotiate something.
  - cpburns2009 49 days ago
    It really depends on the third party service.
    For service A, a 500 error may be common and you just need to try again, and a descriptive 400 error indicates the original request was actually handled. In these cases I'd log as a warning.
    For service B, a 500 error may indicate the whole API is down, in which case I'd log a warning and not try any more requests for 5 minutes.
    For service C, a 500 error may be an anomaly and treat it as hard error and log as error.
    [-]
    - srdjanr 48 days ago
      What's the difference between B and C? API being down seems like an anomaly.
      Also, you can't know how frequently you'll get 500s at the time you're doing integration, so you'll have to go back after some time to revisit log severities. Which doesn't sound optimal.
      [-]
      - IgorPartola 48 days ago
        Exactly. What’s worse is that if you have something like a web service that calls an external API, when that API goes down your log is going to be littered with errors and possibly even tracebacks which is just noise. If you set up a simple “email me on error” kind of service you will get as many emails as there were user requests.
        In theory some sort of internal API status tracking thing would be better that has some heuristic of is the API up or down and the error rate. It should warn you when the API is down and when it comes back up. Logging could still show an error or a warning for each request but you don’t need to get an email about each one.
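        A minimal sketch of such a tracker in Python. The class name `ApiStatus`, the consecutive-failure heuristic and the default threshold of 3 are assumptions for illustration; a real heuristic might use an error rate over a time window instead.

        ```python
        import logging

        log = logging.getLogger("upstream")


        class ApiStatus:
            """Tracks whether an external API looks up or down; logs only on transitions."""

            def __init__(self, name, down_after=3):
                self.name = name
                self.down_after = down_after  # consecutive failures before we call it down
                self.failures = 0
                self.is_up = True

            def record(self, ok):
                if ok:
                    if not self.is_up:
                        log.warning("%s is back up after %d failed calls", self.name, self.failures)
                    self.failures = 0
                    self.is_up = True
                    return
                self.failures += 1
                if self.is_up and self.failures >= self.down_after:
                    self.is_up = False
                    log.warning("%s looks down (%d consecutive failures)", self.name, self.failures)
        ```

        Alerting can then hang off the two transition messages, while per-request failures are logged at whatever lower level suits the service.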
      - cpburns2009 47 days ago
        I forgot to mention that for service B, the API being down is a common, daily occurrence and does not last long. The behavior of services A-C is from my real world experience.
        I do mean revisiting the log severities as the behavior of the API becomes known. You start off treating every error as a hard error. As you learn the behavior of the API over time, you adjust the logging and error handling accordingly.
  - eterm 49 days ago
    This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.
    Because what I'd want to know is how often does it fail, which is a metric not a log.
    So expose <third party api failure rate> as a metric not a log.
    If feeding logs into datadog or similar is the only way you're collecting metrics, then you aren't treating your observability with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.
    If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).
    By implementing a retry you planned for that third party to be down, so it's just business as usual if it succeeds on retry.
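    A minimal sketch of the counter idea in Python, using a plain in-process `collections.Counter` as a stand-in for a real metrics client. The metric names and `fetch_with_retry` are hypothetical.

    ```python
    from collections import Counter

    metrics = Counter()  # stand-in for a real metrics client


    def fetch_with_retry(fetch, retries=1):
        """Call fetch(); count failures as metrics instead of logging each one."""
        for attempt in range(retries + 1):
            metrics["third_party.calls"] += 1
            try:
                return fetch()
            except ConnectionError:
                metrics["third_party.failures"] += 1
                if attempt == retries:
                    raise  # out of retries: now it is the caller's problem


    def failure_rate():
        """Fraction of third-party calls that failed so far."""
        calls = metrics["third_party.calls"]
        return metrics["third_party.failures"] / calls if calls else 0.0
    ```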
    [-]
    - mewpmewp2 49 days ago
      > If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).
      How do you define uptime? What if e.g. it's a social login / data linking and that provider is down? You could have multiple logins and your own e-mail and password, but you still might lose users because the provider is down. How do you log that? Or do you only put it as a metric?
      You can't always easily replace providers.
      [-]
      - ivan_gammel 49 days ago
        You may log that or count failures in some metric, but the correct answer is to have a health check on third party service and an alert when that service is down. Logs may help to understand the nature of the incident, but they are not the channel through which you are informed about such problems.
        The different issue is when third party broke the contract, so suddenly you get a lot of 4xx or 5xx responses, likely unrecoverable. Then you get ERROR level messages in the log (because it’s unexpected problem) and an alert when there’s a spike.
    - hk__2 49 days ago
      > This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.
      >
      > Because what I'd want to know is how often does it fail, which is a metric not a log.
      It’s not controversial; you just want something different. I want the opposite: I want to know why/how it fails; counting how often it does is secondary. I want a log that says "I sent this payload to this API and I got this error in return", so that later I can debug if my payload was problematic, and/or show it to the third party if they need it.
    - hamandcheese 49 days ago
      My main gripe with metrics is that they are not easily discoverable like logs are. Even if you capture a list of all the metrics emitted from an application, they often have zero context and so the semantics are a bit hard to decipher.
- p2detar 48 days ago
  Yea but instead of logging Critical/Fatal and going on, I would just panic() the program. To the other definitions I agree - everything else is recoverable, because the program still runs.
  Warning to me is an error that has very little business logic side effects/impact as opposed to an Error, but still requires attention.
  [-]
  - IgorPartola 48 days ago
    I write a lot of backend web code that often talks to external services. So for example the user wants to add a shipping address to their profile but the address verification API responds with a 500. That is an expected error: sometimes it can happen. I want to log it but I do not want a trace back or anything like that.
    On the other hand it could be that the API had changed slightly. Say they for some reason decided to rename the input parameter postcode to postal_code and I didn’t change my code to fix this. This is 100% a programming error that would be classified as critical but I would not want to panic() the entire server process over it. I just want an alert that hey there is a programming error, go fix it.
    But what could also happen is that when I try to construct a request for the external API and the OS is out of memory. Then I want to just crash the process and rely on automatic process restarts to bring it back up. BTW logging an error after malloc() returns NULL needs to be done carefully since you cannot allocate more memory for things like a new log string.
- sysguest 49 days ago
  hmm maybe we need extra representation?
  eg: 2.0 for "trace" / 1.0 for "debug" / 0.0 for "info" / -1.0 for "warn" / -2.0 for "error that can be handled"
  [-]
  - wredcoll 49 days ago
    I said this elsewhere, but the point here is what the humans involved are supposed to do with this info. Do I literally get out of bed on an error log or do I grep for them once or twice a month?
    [-]
    - ivan_gammel 49 days ago
      You should never get out of bed on an error in the log. Logs are for retrospective analysis, health checks and metrics are for situational awareness, alerts are for waking people up.
mfuzzey 49 days ago
I think it's difficult to say without knowing how the system is deployed and administered. "If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system"
Maybe or maybe not. If the connection problem is really due to the remote host then that's not the problem of the sender. But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...
If you know the deployment scenario then you can make reasonable decisions on logging levels but quite often code is generic and can be deployed in multiple configurations so that's hard to do
[-]
- colechristensen 49 days ago
  How about this:
  - An error is an event that someone should act on. Not necessarily you. But if it's not an event that ever needs the attention of a person then the severity is less than an error.
  Examples: Invalid credentials. HTTP 404 - Not Found, HTTP 403 Forbidden, (all of the HTTP 400s, by definition)
  It's not my problem as a site owner if one of my users entered the wrong URL or typed their password wrong, but it's somebody's problem.
  A warning is something that A) a person would likely want to know and B) wouldn't necessarily need to act on
  INFO is for something a person would likely want to know and unlikely needs action
  DEBUG is for something likely to be helpful
  TRACE is for just about anything that happens
  EMERG/CRIT are for significant errors of immediate impact
  PANIC the sky is falling, I hope you have good running shoes
  [-]
  - DanHulton 49 days ago
    If you're logging and reporting on ERRORs for 400s, then your error triage log is going to be full of things like a user entering a password with insufficient complexity or trying to sign up with an email address that already exists in your system.
    Some of these things can be ameliorated with well-behaved UI code, but a lot cannot, and if your primary product is the API, then you're just going to have scads of ERRORs to triage where there's literally nothing you can do.
    I'd argue that anything that starts with a 4 is an INFO, and if you really wanted to be thorough, you could set up an alert on the frequency of these errors to help you identify if there's a broad problem.
    [-]
    - colechristensen 49 days ago
      You have HTTP logs tracked, you don't need to report them twice, once in the HTTP log and once on the backend. You're just effectively raising the error to the HTTP server and its logs are where the errors live. You don't alert on single HTTP 4xx errors because nobody does, you only raise on anomalous numbers of HTTP 4xx errors. You do alert on HTTP 5xx errors because as "Internal" http errors those are on you always.
      In other words, of course you don't alert on errors which are likely somebody else's problem. You put them in the log stream where that makes sense and can be treated accordingly.
    - lanstin 48 days ago
      The frequency is important and so is the answer to "could we have done something different ourselves to make the request work". For example in credit card processing, if the remote network declines, then at first it seems like not your problem. But then it turns out for many BINs there are multiple choices for processing and you could add dynamic routing when one back end starts declining more than normal. Not a 5xx and not a fault in your process, but a chance to make your customer experience better.
  - adrianmonk 49 days ago
    > An error is an event that someone should act on. Not necessarily you.
    Personally, I'd further qualify that. It should be logged as an error if the person who reads the logs would be responsible for fixing it.
    Suppose you run a photo gallery web site. If a user uploads a corrupt JPEG, and the server detects that it's corrupt and rejects it, then someone needs to do something, but from the point of view of the person who runs the web site, the web site behaved correctly. It can't control whether people's JPEGs are corrupt. So this shouldn't be categorized as an error in the server logs.
    But if you let users upload a batch of JPEG files (say a ZIP file full of them), you might produce a log file for the user to view. And in that log file, it's appropriate to categorize it as an error.
    [-]
    - colechristensen 48 days ago
      That's the difference between an HTTP 4xx and 5xx
      4xx is for client side errors, 5xx is for server side errors.
      For your situation you'd respond with an HTTP 400 "Bad Request" and not an HTTP 500 "Internal Server Error" because the problem was with the request not with the server.
    - Arrowmaster 48 days ago
      Counter argument. How do you know the user uploaded a corrupted image and it didn't get corrupted by your internet connection, server hardware, or a bug in your software stack?
      You cannot accurately assign responsibility until you understand the problem.
      [-]
      - jeremyjh 48 days ago
        This is just trolling. The JPEG is corrupt if the library that reads it says it is corrupt. You log it as a warning. If you upgrade the library or change your upstream reverse proxy, and start getting 1000x the number of warnings, you can still recognize that and take action without personally inspecting each failed upload to be sure you haven't yet stumbled on the one edge case where the JPEG library is out of spec.
- greatgib 49 days ago
  The point is that if your program itself take note of the error from the library it is ok. You, as the program owner, can decide what to do with it (error log or not).
  But if you are the SMTP library and that you unilaterally log that as an error. That is an issue.
  [-]
  - dminuoso 49 days ago
    This would require a complete new ecosystem and likely new language where any degradation of code flow becomes communicatable in a standardized and fully documented fashion.
    The closest we have is something like Java with exceptions in type signatures, but we would have to ban any kind of exception capture except from final programs, and promote basically any logger call into an exception that you could remotely suppress.
    We could philosophize about a world with compilers made out of unobtanium - but in this reality a library author cannot know what conditions are fixable or necessitate a fix or not. And structured logging has way too many deficiencies to make it work from that angle.
  - zamadatix 49 days ago
    The counterpoint made above is while what you describe is indeed the way the author likes to see it that doesn't explain why "an error is something which failed that the program was unable to fix automatically" is supposed to be any less valid a way to see it. I.e. should error be defined as "the program was unable to complete the task you told it to do" or only "things which could have worked but you need to explicitly change something locally".
    I don't even know how to say whether these definitions are right or wrong, it's just whatever you feel like it should be. The important thing is what your program logs should be documented somewhere, the next most important thing is that your log levels are self consistent and follow some sort of logic, and that I would have done it exactly the same is not really important.
    At the end of the day, this is just bikeshedding about how to collapse ultra specific alerting levels into a few generic ones. E.g. RFC 5424 defines 8 separate log levels for syslog and, while that's not a ceiling by any means, it's easy to see how there's already not really going to be a universally agreed way to collapse even just these down to 4 categories.
    [-]
    - hinkley 49 days ago
      Any robust system isn’t going to rely on reading logs to figure out what to do about undelivered email anyway. If you’re doing logistics the failure to send an order confirmation needs to show up in your data model in some manner. Managing your application or business by logs is amateur hour.
      There’s a whole industry of “we’ll manage them for you” which is just enabling dysfunction.
- solatic 48 days ago
  > But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...
  That's exactly why you log it as a warning. People get warned all the time about the dangers of smoking. It's important that people be warned about smoking; these warnings save lives. People should pay attention to warnings, which let them know about worrisome concerns that should be heeded. But guess what? Everyone has a story about someone who smoked until they were 90 and died in a car accident. It is not an error that somebody is smoking. Other systems will make their own bloody decisions and firewalling you off might be one of them. That is normal.
  What do you think a warning means?
jayofdoom 49 days ago
In OpenStack, we explicitly document what our log levels mean; I think this is valuable from both an Operator and Developer perspective. If you're a new developer, without a sense of what log levels are for, it's very prescriptive and helpful. For an operator, it sets expectations.
https://docs.openstack.org/oslo.log/latest/user/guidelines.h...
FWIW, "ERROR: An error has occurred and an administrator should research the event." (vs WARNING: Indicates that there might be a systemic issue; potential predictive failure notice.)
[-]
- quectophoton 49 days ago
  Thank you, this (and jillesvangurp's comment) sounds way more reasonable than the article's suggestion.
  If I have a daily cron job that is copying files to a remote location (e.g. backups), and the _operation_ fails because for some reason the destination is not writable.
  Your suggestion would get me _both_ alerts, as I want; the article's suggestion would not alert me about the operation failing because, after all, it's not something happening in the local system, the local program is well configured, and it's "working as expected" because it needs neither code nor configuration fixing.
  [-]
  - __turbobrew__ 49 days ago
    Agreed, I don’t get the OPs delineation between local and non-local error sources. If your code has a job to do it doesn’t matter if the error was local or non-local, the operator needs to know that the code is not doing its job. In the case of something like you cannot backup files to a remote you can try to contact the humans who own the remote or come up with an alternative backup mechanism.
rwmj 49 days ago
And the second rule is make all your error messages actionable. By that I mean it should tell me what action to take to fix the error (even if that action means hard work, tell me what I have to do).
[-]
- chongli 49 days ago
  Suppose I'm writing an http server and the error is caused by a flaky power supply causing the disk to lose power when the server attempts to read a file that's been requested. How is the http server supposed to diagnose this or any other hardware fault? Furthermore, why should it even be the http server's responsibility to know about hardware issues at all?
  [-]
  - uniq7 49 days ago
    The error doesn't need to be extremely specific or point to the actual root cause.
    In your example,
    - "Error while serving file" would be a bad error message,
    - "Failed to read file 'foo/bar.html'" would be acceptable, and
    - "Failed to read file 'foo/bar.html' due to EIO: Underlying device error (disk failure, I/O bus error). Please check the disk integrity." would be perfect (assuming the http server has access to the underlying error produced by the read operation).
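    A minimal Python sketch of producing the second and third kinds of message. `describe_read_failure`, `read_for_response` and the `HINTS` table are hypothetical; `errno.errorcode` and `OSError.strerror` are standard library.

    ```python
    import errno

    # Hypothetical hints keyed by errno; extend as operational experience accumulates.
    HINTS = {
        errno.EIO: "Please check the disk integrity.",
        errno.EACCES: "Check the file permissions of the server's user.",
    }


    def describe_read_failure(path, exc):
        """Build a message naming the file, the OS error code, and a hint if known."""
        code = errno.errorcode.get(exc.errno, "unknown")
        hint = HINTS.get(exc.errno, "")
        return f"Failed to read file '{path}' due to {code}: {exc.strerror}. {hint}".rstrip()


    def read_for_response(path):
        """Read a file to serve; re-raise OS errors with a descriptive message."""
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as exc:
            raise RuntimeError(describe_read_failure(path, exc)) from exc
    ```

    When no hint is known for the error code, the message still names the file and the underlying error, which is the "acceptable" tier above.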
  - Copenjin 48 days ago
    Some of these replies make me wonder if you have ever written any code at all, nonsensical example.
- andoando 49 days ago
  Error: Possible race condition, rewrite codebase
  [-]
  - morkalork 49 days ago
    I have written out-of-band sanity checks that have caught race conditions, the recommendation is more like "<Thing> that should be locked, isn't. Check what was merged and deployed in the last 24h, someone ducked it up"
- 1123581321 49 days ago
  Can you please explain this? That sounds like identifying bugs but not fixing them but I realize you don’t mean that. One hopes the context information in the error will make it actionable when it occurs, never completely successfully, of course.
  [-]
  - rwmj 49 days ago
    Here's an example of a bug that I filed about non-actionable error messages: https://github.com/karmab/kcli/issues/456
    The first error message was "No usable public key found, which is required for the deployment" which doesn't tell me what I have to do to correct the problem. Nothing about even where it's looking for keys, what is supposed to create the key or how I am supposed to create the key.
    There are other examples and discussion of what they should say in the link.
    Edit: Here's another one that I filed: https://github.com/containers/podman/issues/20775
    [-]
    - 1123581321 49 days ago
      That makes sense and good examples; thanks.
      At work, I can think of cases where we error when data mismatches between two systems. It’s almost always the fault of system B but we present the mismatch error neutrally. Experienced developers just know to fix B but we shouldn’t rely on that.
  - Copenjin 48 days ago
    You can hope that the person reading the context will always be able to understand it like you would have. Bad assumption in my experience.
    [-]
    - 1123581321 48 days ago
      Quite true. It can be a bad assumption when I’m the one trying to understand it weeks later. :)
- pixl97 49 days ago
  So what error do you put if the server is over 500 miles away?
  https://web.mit.edu/jemorris/humor/500-miles
  Or you can't connect because of a path MTU error.
  Or because the TTL is set too low?
  Your software at the server level has no idea what's going wrong at the network level, all you can send is some kind of network problem message.
- lanstin 48 days ago
  Also put the fucking data in the message that led to the decision to emit the logs. I can't remember how many times I have had a three part test trigger a log "blah: called with illegal parameters, shouldn't happen" and the illegal parameters were not logged.
- throw3e98 49 days ago
  Maybe that makes sense for a single-machine application where you also control the hardware. But for a networked/distributed system, or software that runs on the user's hardware, the action might involve a decision tree, and a log line is a poor way to convey that. We use instrumentation, alerting and runbooks for that instead, with the runbooks linking into a hyperlinked set of articles.
  My 3D printer will try to walk you through basic fixes with pictures on the device's LCD panel, but for some errors it will display a QR code to their wiki which goes into a technical troubleshooting guide with complex instructions and tutorial videos.
- magicalhippo 49 days ago
  This can be difficult or just not possible.
  What is possible is to include as much information as possible about what the system was trying to do. If there's a file IO error, include the full path name. Saying "file not found" without saying which file was not found infuriates me like few other things.
  If some required configuration option is not defined, include the name of the configuration option and from where it tried to find said configuration (config files, environment, registry etc). And include the detailed error message from the underlying system if any.
  Regular users won't have a clue how to deal with most errors anyway, but by including details at least someone with some system knowledge has a chance of figuring out how to fix or work around the issue.
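  A minimal Python sketch of the configuration case. The function `require_setting`, the `MissingSetting` exception and the lookup order (environment variable, then a dict loaded from a config file) are assumptions for illustration.

  ```python
  import os


  class MissingSetting(Exception):
      """Raised when a required configuration option cannot be found anywhere."""


  def require_setting(name, file_config, config_path):
      """Look up a required option; on failure, say what was missing and where we looked."""
      env_name = name.upper()
      if env_name in os.environ:
          return os.environ[env_name]
      if name in file_config:
          return file_config[name]
      raise MissingSetting(
          f"required option '{name}' is not set; looked in environment variable "
          f"{env_name} and in config file {config_path}"
      )
  ```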
- Copenjin 48 days ago
  Exactly. Some applications keep running way after you have long gone. If there is useful information to provide give it.
- hyperadvanced 49 days ago
  This is just plain wrong, I vehemently disagree. What happens if a payment fails on my API, and today that means I need to go through a 20-step process with this pay provider, my database, etc. to correct that. But what’s worse is if this error happens 11,000 times and I run a script to do my 20 step process 11,000 times, but it turns out the error was raised in error. Additionally, because the error was so explicit about how to fix it, I didn’t talk to anyone. And of course, the suggested fix was out of date because docs lag vs. production software. Now I have 11,000 pissed off customers because I was trying to be helpful.
alex-moon 48 days ago
"If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'."
But it is still an error condition, i.e. something does need to be fixed - either something about the connection string (i.e. in the local system) is wrong, or something in the other system or somewhere between the two is wrong (i.e. and therefore needs to be fixed). Either way, developers on this end (I mean someone reading the logs - true that it might not be the developers of the SMTP mailer) need to get involved, even if it is just to reach out to the third party and ask them to fix it on their end.
A condition that fundamentally prevents a piece of software from working not being considered an error is mad to me.
[-]
- jeremyjh 48 days ago
  There is no "connection string" in mail software that defines the remote host. The other party's MX records do that. If you are sending mail to thousands of remote hosts and one is unreachable, that is NOT a problem a mail administrator is going to be researching or trying to fix because they cannot, and it is not their problem. Either the email address is wrong, the remote host is down, or its DNS is misconfigured. This happens constantly all day long everywhere. The errors are reported to the sender of the email, which is the person who has the problem to solve.
  [-]
  - alex-moon 48 days ago
    OK yeah I think I see what you're saying, if the SMTP mailer is a hosted service and we're talking about the logs for the service itself then failed connections are not an error - this I agree with. I also wouldn't be logging anything transactional at all in this case - the transactional logs are for the user, they are functionality of the service itself in that case, and those logs should absolutely log a failure to connect as an error.
    [-]
    - jeremyjh 48 days ago
      It doesn't matter if it is a hosted service or if it's just your local mail transfer agent, every "SMTP mailer" works the same way. There are lots of ways to send email that don't involve a locally administered SMTP mailer (such as an API which indeed has a connection string to a hosted service) but none would be described with that term.
- jeroenhd 48 days ago
  Exactly this, a remote error may still be your problem. If your SMTP mailer is failing to send out messages on behalf of your customer because their partners' email servers cannot be reached, your customer is still going to ask you why the documents never arrived.
  Plus, a remote server not being reachable doesn't say anything about where the problem lies. Did you mess up a routing table? Did your internet connection get severed? Did you firewall off an important external server? Did you end up on a blacklist of some kind?
  These types of messages are important error messages for plenty of people. Just because your particular use case doesn't care about the potential causes behind the error doesn't mean nobody does.
bytefish 48 days ago
Making software is 20% actual development and 80% is maintenance. Your code and your libraries need to be easy to debug, and this means logs, logs, logs, logs and logs. The more the better. It makes your life easy in the long run.
So the library you are using fires too many debug messages? You know, that you can always turn it off by ignoring specific sources, like ignoring namespaces? So what exactly do you lose? Right. Almost nothing.
As for my code and libraries I always tend to do both, log the error and then throw an exception. So I am on the safe side both ways. If the consumer doesn’t log the exception, then at least my code does it. And I give them the chance to do logging their way and ignore mine. I am doing a best-guess for you… thinking to myself, what’s an error when I’d use the library myself.
You don’t trust me? Log it the way you need to log it, my exception is going to transport all relevant data to you.
This has saved me so many times, when getting bug reports by developers and customers alike.
There are duplicate error logs? Simply turn my logging off and use your own. Problem solved.
If it is a program level error, maybe a warning and returning the error is the correct way to go. Maybe it’s not? It depends on the context.
And this basically is the answer to any software design question: It depends.
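For what it's worth, with Python's standard `logging` module the "turn my logging off" part is one line per namespace. A minimal sketch, assuming a made-up library that logs under the name `noisylib` and a made-up application logger `myapp` (neither is a real package):

```python
import io
import logging

# Capture output in memory so the sketch is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.DEBUG)

lib_log = logging.getLogger("noisylib.client")  # hypothetical library logger
app_log = logging.getLogger("myapp")            # hypothetical application logger


def fetch(url):
    """Library side: log the failure, then raise with the same details."""
    lib_log.error("fetch failed for %s", url)
    raise ConnectionError(f"fetch failed for {url}")


def call_fetch():
    """Application side: the consumer picks its own level for the failure."""
    try:
        fetch("http://example.invalid/")
    except ConnectionError as exc:
        app_log.warning("degraded: %s", exc)


call_fetch()  # two lines: the library's ERROR plus the app's WARNING

# Raise the threshold for the whole "noisylib" namespace to silence it.
logging.getLogger("noisylib").setLevel(logging.CRITICAL + 1)

call_fetch()  # one line: only the app's WARNING
output_lines = stream.getvalue().splitlines()
```

The exception still carries the details either way, so silencing the library's own logger loses nothing the application can't log itself.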
[-]
- szundi 48 days ago
  [dead]
teo_zero 49 days ago
This doesn't resonate with my experience. I draw the line between a warning and an error at whether the operation can or can't be completed.
A connection timed out, retrying in 30 secs? That's a warning. Gave up connecting after 5 failed attempts? Now that's an error.
I don't care so much if the origin of the error is within the program, or the system, or the network. If I can't get what I'm asking for, it can't be a mere warning.
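Roughly, in code. This is only a sketch of that rule, assuming Python's standard `logging`; the function name, the logger name, and the injectable `sleep` parameter are all made up for illustration:

```python
import logging
import time

log = logging.getLogger("connector")  # hypothetical logger name


def connect_with_retries(connect, attempts=5, delay=30.0, sleep=None):
    """Call connect() up to `attempts` times.

    Each failed attempt that will be retried is a WARNING; only the final
    failure, where the operation cannot be completed, is an ERROR.
    """
    sleep = sleep or time.sleep
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:  # includes TimeoutError and ConnectionError
            if attempt == attempts:
                log.error("gave up connecting after %d failed attempts: %s",
                          attempts, exc)
                raise
            log.warning("connection failed (%s), retrying in %g secs",
                        exc, delay)
            sleep(delay)
```

The exception is re-raised after the ERROR so the caller still decides what the failure means for the program as a whole.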
AndroTux 49 days ago
“cannot contact port 25 on <remote host>” may very well be a configuration error. How should the program know?
[-]
- notatoad 49 days ago
  >How should the program know?
  if we're talking about logs from our own applications that we have written, the program should know because we can write it in a way that it knows.
  user-defined config should be verified before it is used. make a ping to port 25 to see if it works before you start using that config for actual operation. if it fails the verification step, that's not an error that needs to be logged.
  [-]
  - tcpkump 49 days ago
    What about when the mail server endpoint has changed, and for whatever reason, this configuration wasn’t updated? This is a common scenario when dealing with legacy infrastructure in my experience.
    [-]
    - notatoad 49 days ago
      the whole point of the essay here is that you should make a distinction between errors that you care about and plan to fix, and errors that you don't care about and don't intend to do anything about. and if you don't intend to do anything about it, it shouldn't be logged as error.
      i'm following the author's example that an SMTP connection error is something you want to investigate and fix. if you have a different system with different assumptions where your response to a mailserver being unreachable is to ignore it, obviously that example doesn't apply for you. i'm not saying, and i don't think the author is saying that SMTP errors should always or never be logged as errors.
      when the mailserver endpoint has changed, you should do the thing that makes sense in the context of your application. if it's not something that the person responsible for reviewing the logs needs to know about, don't log it. if it is, then log it.
  - 1718627440 49 days ago
    So when the random error on a remote party happens at one time your system ignores it, but when it happens at another time, it prevents the server from booting? That's a very brittle system.
    [-]
    - notatoad 49 days ago
      log level error prevents your server from booting? i'm pretty sure that's not how logging works.
- HankB99 49 days ago
  Would it make sense to consider anything that prevents a process from completing its intended function an error? It seems like this message would fall into that category and, as you pointed out, could result from a local fault as well.
- kijin 49 days ago
  SMTP clients are designed to try again with exponential backoff. If the final attempt fails and your email gets bounced, now that's an error. Until then, it's just a delay, business as usual.
  [-]
shadowgovt 49 days ago
This is the standard I use as well. In general, my rule of thumb is that if something is logging error, it would have been perfectly reasonable for the program to respond by crashing, and the only reason it didn't is that it's executing in some kind of larger context that wants to stay up in the event of the failure of an individual component (like one handler suffering a query that hangs it and having to be terminated by its monitoring program in a program with multiple threads serving web requests). In contrast, something like an ill-formed web query from an untrusted source isn't even an error because you can't force untrusted sources to send you correctly formed input.
Warning, in contrast, is what I use for a condition that the developer predicted and handled but probably indicates the larger context is bad, like "this query arrived from a trusted source but had a configuration so invalid we had to drop it on the floor, or we assumed a default that allowed us to resolve the query but that was a massive assumption and you really should change the source data to be explicit." Warning is also where I put things like "a trusted source is calling a deprecated API, and the deprecation notification has been up long enough that they really should know better by now."
Where all of this matters is process. Errors trigger pages. Warnings get bundled up into a daily report that on-call is responsible for following up on, sometimes by filing tickets to correct trusted sources and sometimes by reaching out to owners of trusted sources and saying "Hey, let's synchronize on your team's plan to stop using that API we declared is going away 9 months ago."
[-]
- nlawalker 49 days ago
  It seems that the easier rule of thumb, then, is that "application logic should never log an error on its own behalf unless it terminates immediately after", and that error-level log entries should only ever be generated from a higher-level context by something else that's monitoring for problems that the application code itself didn't anticipate.
- raldi 49 days ago
  Right. If staging or the canary is logging errors, you block/abort the deploy. If it’s logging warnings, that’s normal.
  [-]
  - lanstin 48 days ago
    Unless it is logging more warnings because your new code is failing somehow; maybe it stopped parsing the reply correctly from a "is this request rate limited" service so it is only returning 429 to callers never accepting work.
Xss3 48 days ago
Some programs are error resistant and need an additional level: Fatal.
A warning can be ignored safely. Warnings may be 'debugging enabled, results cannot be certified' or something similar.
An error should not be ignored, an operation is failing, data loss may be occurring, etc.
Some users may be okay with that data loss or failing operation. Maybe it isn't important to them. If the program continues and does not error in the parts that matter to the user, then they can ignore it, but it is still objectively an error occurring.
A fatal message cannot be ignored, the system has crashed. It's the last thing you see before shutdown is attempted.
[-]
jillesvangurp 49 days ago
Errors mean I get alerted. Zero tolerance on that from my side.
yoan9224 48 days ago
I've found the most practical rule is: "Would I want to be paged for this at 2 AM?"
If yes: ERROR
If I want to check it tomorrow: WARNING
If it's useful for debugging: INFO
Everything else: DEBUG
The problem with the article's approach is that libraries don't have enough context. A timeout calling an external API might be totally fine if you're retrying, but it's an ERROR if you've exhausted retries and failed the user's request.
We solve this by having libraries emit structured events with severity hints, then the application layer decides the final log level based on business impact. A 500 from a recommendation service? Warning. A 500 from the payment processor? Error.
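A stripped-down sketch of the idea in Python, not our actual code. The `LibraryEvent` type, the `CRITICAL_SOURCES` set, and the `"http_5xx"` event kind are all invented for this example:

```python
import logging
from dataclasses import dataclass, field


@dataclass
class LibraryEvent:
    """What a library hands upward instead of logging at a fixed level."""
    source: str                    # which dependency produced the event
    kind: str                      # e.g. "http_5xx"
    hint: int = logging.WARNING    # the library's best guess at severity
    details: dict = field(default_factory=dict)


# Application-level policy: which dependencies are business-critical.
CRITICAL_SOURCES = {"payments"}


def final_level(event):
    """Decide the real log level from business impact, not the library's hint."""
    if event.kind == "http_5xx":
        if event.source in CRITICAL_SOURCES:
            return logging.ERROR
        return logging.WARNING
    return event.hint  # no specific policy: fall back to the hint


def report(event, logger):
    """Log the event at the level the application decided on."""
    level = final_level(event)
    logger.log(level, "%s from %s: %s", event.kind, event.source, event.details)
    return level
```

The library only ever supplies a hint; the mapping from event to level lives in one place in the application, where the business context is known.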
hedayet 49 days ago
I agree with the principle: log level error should mean someone needs to fix something.
This post frames the problem almost entirely from a sysadmin-as-log-consumer perspective, and concludes that a correctly functioning system shouldn’t emit error logs at all. That only holds if sysadmins are the only "someone" who can act.
In practice, if there is a human who needs to take action - whether that’s a developer fixing a bug, an infra issue, or coordinating with an external dependency - then it’s an error. The solution isn’t to downgrade severity, but to route and notify the right owner.
Severity should encode actionability, not just system correctness.
aqme28 49 days ago
I agree with this take in a steady state, but the process of building software is just that-- it's a process.
So it's natural for error messages to be expected, as you progressively add and then clear up edge cases.
[-]
- raldi 49 days ago
  Exactly: When you're building software, it has lots of defects (and, thus, error logging). When it's mature, it should have few defects, and thus few error logs, and each one that remains is a bug that should be fixed.
  [-]
  - plorkyeran 49 days ago
    I don't understand why you seem to think you're disagreeing with the article? If you're producing a lot of error logs because you have bugs that you need to fix then you aren't violating the rule that an error log should mean that something needs to be fixed.
    [-]
    - raldi 48 days ago
      I couldn’t agree more with the article. What made you think I disagreed?
georgefrowny 49 days ago
Easy to say, but there's "yes we know this is wrong but this will have to do for now" and "we don't expect to see this in real life unless something has gone sideways".
[-]
- oofbey 49 days ago
  At scale the rare events start to happen reliably. Hardware failures almost certainly cause ERROR conditions. Network glitches.
  Our production system pages oncall for any errors. At night it will only wake somebody up for a whole bunch of errors. This discipline forces us to take a look at every ERROR and decide if it is spurious and out of our control or something we can deal with. At some point our production system will reach a scale where there are errors logged constantly and this strategy doesn't make sense any more. But for now it helps keep our system clean.
  [-]
  - georgefrowny 48 days ago
    I think if someone is going to be gotten out of bed that would be a critical rather than an error. Generally I'd say in a large "live" system, errors end up raising Jira tickets, criticals end up ringing phones.
    [-]
    - oofbey 47 days ago
      Most systems I’ve worked with can go completely offline without ever logging a critical error. Some coding errors or misconfiguration or failure in a critical system - enough to log an error - and nobody can get any useful work done. I’ve never seen something that can convert those into critical errors. I’m used to critical errors being rare - certain failures of a server to start. Or infra problems.
jedberg 48 days ago
I feel like it's more nuanced than OP writes. Presumably every log line comes from something like a try/catch. An edge case was identified, and the code did something differently.
Did it do what it was supposed to do, but in a different way or defer for retrying later? Then WARN.
Did it fail to do what it needed to do? ERROR
Did it do what it needed to do in the normal way because it was totally recoverable? INFO
Did data get destroyed in the process? FATAL
It should be about what the result was, not who will fix it or how. Because that might change over time.
[-]
- Joker_vD 48 days ago
  > Did it do what it was supposed to do, but in a different way or defer for retrying later? Then WARN.
  > Did it fail to do what it needed to do? ERROR
  > Did it do what it needed to do in the normal way because it was totally recoverable? INFO
  We have a web-facing system (it uses a custom request-response protocol on top of Websocket... it's an old system) that users are routinely trying to, ahem, automate even though it's technically against ToS but hey, as long as we don't catch them? Anyway, it's quite common to see user connections that send malformed commands and then get disconnected after we send them a critical_error/protocol_error message — we do have quite extensive validation logic for user commands.
  So, how should such errors be logged in your opinion? I know that we originally logged them as errors but very quickly changed to warnings, and precisely for the reasons outlined in TFA: if some kewl haxxor can't figure out how to quote strings in JSON, it's not really something we can fix. We probably should keep the records, just to know that "oh, some script kiddie was trying to hack us during that time period" but nothing more than that; it definitely doesn't warrant the "hey, there are too many errors in sfo2 location, please take a look" summons at 3:00 AM from the ops team.
  [-]
  - jedberg 47 days ago
    It sounds like it did exactly what it was supposed to do -- reject the bad input. Looks like an INFO to me.
umpalumpaaa 48 days ago
What I like about Objective-C’s error handling approach is that a method that can fail is able to tell whether the caller intends to handle errors or not. If the passed *error is NULL you know that there is no way for the caller to properly handle the error. My implementations usually have this logic:
if error == NULL and operationFailed then log error
Otherwise let client side do the error handling (in terms of logging)
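For anyone who doesn't read Objective-C, a rough Python analogue of the same pattern. Python has no `NSError **` out-parameter, so an optional `on_error` callback stands in for it here; the `save` function and the logger name are made up for illustration:

```python
import logging

log = logging.getLogger("storage")  # hypothetical logger name


def save(data, path, on_error=None):
    """Write `data` to `path` and return True on success.

    If the caller passed no `on_error`, nobody upstream can handle the
    failure, so log it here. Otherwise hand the error to the caller and
    let them decide whether and how to log it.
    """
    try:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(data)
        return True
    except OSError as exc:
        if on_error is None:
            log.error("could not write %s: %s", path, exc)
        else:
            on_error(exc)
        return False
```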
aunty_helen 48 days ago
Good logging is critical and actually having the logs turned on in production. No point writing logs if you silence them.
My company now has a log aggregator that scans the logs for errors, when it finds one, creates a Trello card, uses opus to fix the issue and then propose a PR against the card. These then get reviewed, finished if tweaks are necessary and merged if appropriate.
t43562 48 days ago
Errors can be recovered automatically sometimes but at the level at which you log them you don't know if that's going to happen. I therefore think this suggestion is not easy to follow.
Even if your libraries use nothing but exceptions or return codes you still end up with levels. You still end up with logs that have information in them that gets ignored when it shouldn't be because there's so much noise that people get tired of all the "cries of wolf."
Occasionally one is at a high enough level to know for sure that something needs fixing and for this I use "CRITICAL" which is my code for "absolutely sure that you can't ignore this."
IMO it's about time AI was looking at the logs to find out if there was something we really need to be alerted to.
Waterluvian 48 days ago
I think this is one of those discussions where there's no one right answer (though there's many wrong answers). All you have to do is pick a reasonable definition, write it down, socialize it, and be consistent when using it.
I think discussions that argue over a specific approach are a form of playing checkers.
HarHarVeryFunny 49 days ago
I agree with the sentiment, although not sure if "error" is the right category/verbiage for actionable logs.
In an ideal world things like logs and alarms (alerting product support staff) should certainly cleanly separate things that are just informative, useful for the developer, and things that require some human intervention.
If you don't do this then it's like "the boy that cried wolf", and people will learn to ignore errors and alarms since you've trained them to understand that usually no action is needed. It's also useful to be able to grep through log files and distinguish failures of different categories, not just grep for specific failures.
raldi 49 days ago
Yes. Examples of non-defects that should not be in the ERROR loglevel:
* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)
* Network error
* Downstream service overloaded
* Invalid request
Basically, when you make a request to another service and get back a status code, your handler should look like:
```
    logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning
```
(Unless you have an SLO with the service about how often you’re allowed to hit it and they only send 429 when you’re over, which is how it’s supposed to work but sadly rare.)
[-]
- Hizonner 49 days ago
  > Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
  So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?
  [-]
  - raldi 49 days ago
    No; I’m not understanding your point about guessing. Could you restate?
    As for queries that time out, that should definitely be a metric, but not pollute the error loglevel, especially if it’s something that happens at some noisy rate all the time.
    [-]
    - electroly 49 days ago
      I think OP is making two separate but related points, a general point and a specific point. Both involve guessing something that the error-handling code, on the spot, might not know.
      1. When I personally see database timeouts at work it's rarely the database's fault, 99 times out of 100 it's the caller's fault for their crappy query; they should have looked at the query plan before deploying it. How is the error-handling code supposed to know? I log timeouts (that still fail after retry) as errors so someone looks at it and we get a stack trace leading me to the bad query. The database itself tracks timeout metrics but the log is much more immediately useful: it takes me straight to the scene of the crime. I think this is OP's primary point: in some cases, investigation is required to determine whether it's your service's fault or not, and the error-handling code doesn't have the information to know that.
      2. As with exceptions vs. return values in code, the low-level code often doesn't know how the higher-level caller will classify a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, but the low-level code is the one doing the logging. The low-level logging might even be a third party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another consumer.
      I think the more general point (that you should avoid logging errors for things that aren't your service's fault) stands. It's just tricky in some cases.
      [-]
      - lanstin 48 days ago
        Also everywhere I have worked there are transient network glitches from time to time. Timeout can often be caused by these.
    - makeitdouble 49 days ago
      > the database is owned by a separate oncall rotation
      Not OP, but this part hits the same for me.
      In the case your client app is killing the DB through too many calls (e.g. your cache is not working) you should be able to detect it and react, without waiting for the DB team to come to you after they investigated the whole thing.
      But you can't know in advance if the DB connection errors are your fault or not, so logging it to cover the worst-case scenario (you're the cause) is sensible.
      [-]
      - raldi 49 days ago
        I agree that you should detect this, just through a metric rather than putting DB timeouts in the ERROR loglevel.
        [-]
        makeitdouble 48 days ago
        But what's the base of your metric ?
        I feel you're thinking about system wide downtime with everything timing out consistently, which would be detected by the generic database server vitals and basic logs.
        But what if the timeouts are sparse and only 10 or 20% more than usual from the DB POV, but it affects half of your registration services' requests ? You need it logged application side so the aggregation layer has any chance of catching it.
        On writing to ERROR or not, the thresholds should be whatever your dev and oncall teams decide. Nobody outside of them will care, I feel it's like arguing which drawer the socks should go in.
        I was in an org where any single error below CRITICAL was ignored by the oncall team, and everything below that only triggered alerts on aggregation or special conditions. Pragmatically, we ended up slicing it as ERROR=goes to the APM, anything below=no aggregation, just available when a human wants to look at it for whatever reason. I'd expect most orgs to come with that kind of split, where the levels are hooked to processes, and not some base meaning.
    - Hizonner 49 days ago
      > No; I’m not understanding your point about guessing. Could you restate?
      In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart.
      Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or whether it's going to report the salient details. The database may not even know that the request was ever made. Maybe the requests are timing out because they never got there. And definitely maybe the requests are timing out because you're sending too many of them.
      I mean, no, it doesn't make sense to log a million identical messages, but that's rate limiting. It's still an error if you can't access your database, and for all you know it's an error that your admin will have to fix.
      As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.
- zbentley 49 days ago
  I wish I lived in a world where that worked. Instead, I live in a world where most downstream service issues (including database failures, network routing misconfigurations, giant cloud provider downtime, and more ordinary internal service downtime) are observed in the error logs of consuming services long before they’re detected by the owners of the downstream service … if they ever are.
  My rough guess is that 75% of incidents on internal services were only reported by service consumers (humans posting in channels) across everywhere I’ve worked. Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.
  All the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30sec of a GSO”, in other words.
  The corollary is “internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more observably important by other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”
  [-]
  - raldi 49 days ago
    But everything you’re describing can be done with metrics and alerts; there’s no need to spam the ERROR loglevel.
    [-]
    - zbentley 49 days ago
      My point is that just because those problems can be solved with better telemetry doesn’t mean that is actually done in practice. Most organizations are much more aware of/sensitive to failures upstream/at the edge than they are in backend services. Once you account for alert fatigue, crappy accountability distribution, and organizational pressures, even the places that do this well often backslide over time.
      In brief: drivers don’t obey the speed limit and backend service operators don’t prioritize monitoring. Both groups are supposed to do those things, but they don’t and we should assume they won’t change. As a result, it’s a good idea to wear seatbelts and treat downstream failures as urgent errors in the logs of consuming services.
- jonathrg 49 days ago
  4xx is for invalid requests. You wouldn't log a 404 as an error
  [-]
  - raldi 49 days ago
    I’m talking about codes you receive from services you call out to.
    [-]
    - mewpmewp2 49 days ago
      What if user sends some sort of auth token or other type of data that you yourself can't validate and third party gives you 4xx for it? You won't know ahead of time whether that token or data is valid, only after making a request to the third party.
    - jonathrg 49 days ago
      Oh that makes sense.
      [-]
      - raldi 49 days ago
        There are still some special cases, because 404 is used for both “There’s no endpoint with that name” and “There’s no record with the ID you tried to look up.”
        [-]
makeitdouble 49 days ago
> This assumes an error/warning/info/debug set of logging levels instead of something more fine grained, but that's how many things are these days.
Does it ?
Don't most stacks have an additional level of triaging logs to detect anomalies etc ? It can be your New Relic/DataDog/Sentry or a self made filtering system, but nowadays I'd assume the base log levels are only a rough estimate of whether a single event has any chance of being problematic.
I'd bet the author also has strong opinions about http error codes, and while I empathize, those ships have long sailed.
jmull 48 days ago
I encourage people to think a few moments about what to log and at what level.
You’re kind of telling a story to future potential trouble-shooters.
When you don’t think about it at all (it doesn’t take much), you tend to log too much and too little and at the wrong level.
But this article isn’t right either. Lower-level components typically don’t have the context to know whether a particular fault requires action or not. And since systems are complex, with many levels of abstractions and boxes things live in, actually not much is in a position to know this, even to a standard of “probably”.
twosdai 48 days ago
I understand where the author is coming from but frankly I think this design pattern is not really correct.
Obviously this depends on teams, application context and code bases. But "knowing if action needs to be taken" can't be boiled into a simple log level for most cases.
There is a reason most alerting software like PagerDuty is just a trigger interface and the logic for what constitutes the "error" is typically some data level query in something like Datadog, Sumo Logic, Elasticsearch, or Grafana, that either looks for specific string messages, error types, or a collection of metric conditions.
Cool if you want to consider that any error level log needs to be an actionable error but what quickly happens is that some error cases are auto-retryable due to infrastructure conditions that the application has completely no knowledge of. And to run some sort of infrastructure query at error write time in code, e.g.
1. Error is thrown
2. Prior to logging, guess/determine if the case can be retried through a few HTTP calls.
3. Log either a warning or an error
Seems to be a complete waste when we could just write some sort of query in our log/metrics management platform of choice which takes into account the infrastructure conditions for us.
mschuster91 49 days ago
> If error level messages are not such a sign, I can assure you that most system administrators will soon come to ignore all messages from your program rather than try to sort out the mess, and any actual errors will be lost in the noise and never be noticed in advance of actual problems becoming obvious.
Bold of you to assume that there are system administrators. All too often these days it's "devops" aka some devs you taught how to write k8s yamls.
alexwasserman 49 days ago
I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.
eg. log level WARN, message "This error is...", but it then trips an error in monitoring and pages out.
Probably breaching multiple rules here around not parsing logs like that, etc. But it's cropped up so many times I get quite annoyed by it.
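To make the failure mode concrete, here is a small Python sketch. It assumes the logs are emitted as one JSON object per line with a `level` field; that layout and the `is_error_line` helper are hypothetical, not something from the system I was describing:

```python
import json

def is_error_line(line):
    """Return True only when the record's own level field says ERROR."""
    record = json.loads(line)
    return record.get("level", "").upper() == "ERROR"

# A WARN record whose message happens to contain the word "error".
line = json.dumps({"level": "WARN", "message": "This error is transient, retrying"})

naive = "error" in line.lower()    # substring match over the whole line
structured = is_error_line(line)   # reads the level field only
```

Here `naive` is true and would page someone, while `structured` is false.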
[-]
- dragonwriter 49 days ago
  > I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.
  If your parsing, filtering, and monitoring setup parses strings that happen to correspond to log level names in positions other than that of log levels as having the semantics of log levels, then that's a parsing/filtering error, not a logging error.
- jonathrg 49 days ago
  Stuff like that is a good argument for using structured logging, but even if you are just parsing text logs, surely you can make the parser be a bit more specific when retrieving the log level.
rsanek 48 days ago
If something needs to be fixed, why is it just a log? How is someone supposed to even notice a random error log? At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice. Throw an exception / exit the program if it's something that actually needs fixing!
[-]
- Copenjin 48 days ago
  > If something needs to be fixed, why is it just a log?
  What he meant is that it is an unexpected condition that should never have happened but did, so it needs to be fixed.
  > How is someone supposed to even notice a random error log?
  Logs should be monitored.
  > At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice.
  Because the logs sucked. It's not common practice, but it should be best practice.
  > Throw an exception / exit the program if it's something that actually needs fixing!
  I understand the sentiment, but some programs cannot/should not exit. Or you have an error in a subsystem that should not bring down everything.
  I completely agree with the approach of the author, but also understand that good logging discipline is rare. I worked in many places where logs sucked, they just dumped stuff, and had to restructure them.
  [-]
  - lanstin 48 days ago
    While it is fun to have your code run for 500 days without restart, it is a bad architecture. You should be able to move load around from host to host or network to network without losing any work. This involves graceful draining and then shutting down the old.
    For impossible errors exiting and sending the dev team as much info as possible (thread dump, memory dump, etc) is helpful.
    In my experience logs are good for finding out what is wrong once you know something is wrong. Also if the server is written to have enough but not too much logging you can read them over and get a feel for normal operation.
peanut-walrus 49 days ago
Disagree. If you have an error that NEEDS fixing, your program should exit. Error level logs for operation level errors are fine.
knallfrosch 48 days ago
> If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system and should not be logged at level 'error'.
A mail program not being able to *checks notes* send emails sounds like an error to me. (Unless you implement retries.)
Insanity 48 days ago
Coincidentally was reviewing code yesterday that had a confusing/contradictory statement..
```
  error_msg = "xyz went wrong"
  log.warn(error_msg)
```
My comment on the CR was about this being an inherent contradiction and incredibly confusing to know if it's actually an error or a warning..
bandrami 48 days ago
It's like how an alert system that sends more than ~8 alerts a day effectively sends zero alerts.
Glyptodon 48 days ago
I agree errors should be errors. Many things that are logged for other reasons should use a different label.
That said, the thing I've come to find useful as a subcategory of error is errors due to data problems vs errors due to other issues.
theli0nheart 49 days ago
I agree with this.
Not everything that a library considers an error is an application error. If you log an error, something is absolutely wrong and requires attention. If you consider such a log as "possibly wrong", it should be a warning instead.
dpc_01234 48 days ago
Error log level should be renamed. It's just a terrible name that confuses usage.
[-]
- fogzen 48 days ago
  Yeah, even alert/warn/info would be an improvement.
  I hate the concept of “errors” in general. They’re an excuse to avoid responsibility, and ship broken software with known undefined behavior.
  The very notion of an error basically means “there was behavior I chose to not handle and do anything about but which I knew would happen” which is essentially just negligence.
dnautics 49 days ago
let's say you get a bunch of database timeouts in a row. this might mean that nothing needs to be fixed. But also, the "thing that needs to be fixed" might be "the ethernet cable fell out the back of your server".
How do you know?
[-]
- raldi 49 days ago
  You have an alert on what users actually care about, like the overall success rate. When it goes off, you check the WARNING log and metric dashboard and see that requests are timing out.
  [-]
  - ImPostingOnHN 49 days ago
    That is a lagging indicator. By the time you're alerted, you've already failed by letting users experience an issue.
    [-]
    - danaris 49 days ago
      Well, yes. If the cable falls out of the server (or there's a power outage, or a major DDoS attack, or whatever), your users are going to experience that before you are aware of it. Especially if it's in the middle of the night and you don't have an active night shift.
      Expecting arbitrary services to be able to deal with absolutely any kind of failure in such a way that users never notice is deeply unrealistic.
      [-]
      - lanstin 48 days ago
        It continues to become more realistic with the passing of time.
    - raldi 49 days ago
      What alternative would you propose? Page the oncall whenever there's a single query timeout?
      [-]
      - dnautics 49 days ago
        the alternative i propose is have deep understanding of your system before popping off with dumb one size fits all rules that don't make sense.
vpribish 49 days ago
I just started playing in the Erlang ecosystem and they have EIGHT levels of logging messages. it seems crazily over-specific, but they are the champions of robust systems.
I could live with 4
Error - alert me now.
Warning - examine these later.
Info - important context for investigations.
Debug - usually off in prod.
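A rough sketch of what I mean, using Python's standard `logging` module rather than Erlang. The `ListHandler` and `OnlyLevel` classes are hypothetical stand-ins for a real pager and a real review queue:

```python
import logging

class ListHandler(logging.Handler):
    """Collects records in memory; stands in for a pager or a review queue."""
    def __init__(self, level):
        super().__init__(level)
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")

class OnlyLevel(logging.Filter):
    """Pass records of exactly one level."""
    def __init__(self, levelno):
        super().__init__()
        self.levelno = levelno

    def filter(self, record):
        return record.levelno == self.levelno

log = logging.getLogger("four_levels_demo")
log.setLevel(logging.INFO)   # Debug stays off, as it usually would in prod
log.propagate = False

pager = ListHandler(logging.ERROR)       # Error: alert me now
review = ListHandler(logging.WARNING)    # Warning: examine these later
review.addFilter(OnlyLevel(logging.WARNING))
log.addHandler(pager)
log.addHandler(review)

log.debug("cache miss for key 42")
log.info("request served")
log.warning("disk at 85%")
log.error("cannot write to data directory")
```

After this runs, `pager.messages` holds only the error and `review.messages` holds only the warning. Info would go to whatever ordinary log sink you attach.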
[-]
- emmelaich 48 days ago
  I need Notice (between Info and Warning), for important events such as start and shutdown, and successfully connecting to the database, and ready to start serving. These otherwise would be in Info; and enabling Info level produces a torrent of uninteresting muck.
- regularfry 49 days ago
  The eight levels in Erlang are inherited from syslog, rather than something specific to Erlang itself.
- groundzeros2015 49 days ago
  The first one should be crashing.
winningChild 49 days ago
I have a collection of cameras, I can take a picture or photos with those cameras. Some of the lenses may not work properly with the lighting. That doesn’t mean the object being photographed is faulty.
tgv 49 days ago
I log authorization errors as errors. Are they errors? It depends on how you read the logs. Perhaps you want to distinguish between internal, external and non-attributable errors for easier grepping.
Too 49 days ago
Agree with the post. The job of blackbox is to turn probes into metrics. If a probe fails, that should just become a probe_success=0 metric. Blackbox did its job and should not log an error.
Kinrany 49 days ago
Why are logs usually assumed to be for human consumption only? It seems weird to me that log storage usually exists outside of the system and isn't a general purpose message bus.
BiraIgnacio 49 days ago
It means something is wrong, yes. Now, if it's worth fixing (granted, most of the time it would), that's another story.
leni536 49 days ago
I make error logs fail happy path functional/integration tests for the backend applications I'm currently writing.
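Roughly like this, as a simplified Python sketch rather than my actual test harness. The `ErrorCollector` and `run_and_check` names and the `"app"` logger name are placeholders:

```python
import logging

class ErrorCollector(logging.Handler):
    """Keeps every record logged at ERROR or above."""
    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)

def run_and_check(func, logger_name="app"):
    """Run func and raise AssertionError if it logged at ERROR or above."""
    logger = logging.getLogger(logger_name)
    collector = ErrorCollector()
    logger.addHandler(collector)
    try:
        result = func()
    finally:
        logger.removeHandler(collector)
    assert not collector.records, [r.getMessage() for r in collector.records]
    return result
```

A happy-path test wraps its body in `run_and_check`, so a stray error log fails the test even when the return value looks fine.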
plandis 49 days ago
I agree. Error or higher should result in an alarm and indicates that some corrective action needs to be taken.
mkoubaa 49 days ago
To me it's always a neat trick when you're not allowed to use print() in production code
mycall 48 days ago
Severity is the value and you set thresholds based on context of the error type.
29athrowaway 49 days ago
Input errors do not need fixing, so no.
[-]
- lanstin 48 days ago
  If they cause your customers to ditch your product but calling them and saying "your calls are all getting 4xx because you are not putting the state code into the call parameters" would keep them as customers, then you would be wise to make that communication.
  [-]
  - dolmen 48 days ago
    But first ensure that the input error is properly reported to the client in the response body (ideally in a structured way), so the client could have figured it out on their own.
    If a fix is needed on your side for this matter, having a conversation with a customer might be useful before breaking more stuff. ("We have no state code in EU. Why is that mandatory?").
    [-]
    - lanstin 45 days ago
      If you are trying to sell a product, it is sometimes useful to solve people's problems for them, rather than counting on them to figure them out on their own.
azov 49 days ago
If my system doesn’t work - I want to be alerted. If notification was supposed to be sent but wasn’t - it’s an error regardless of whether it wasn’t sent because of a bug in my code or external service being down. It may be a warning if I’m still retrying, but if I gave up - it’s an error.
“External service down, not my problem, nothing I can do” is hardly ever the case - e.g. you may need to switch to a backup provider, initiate a support call, or at least try to figure out why it’s down and for how long.
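As a minimal Python sketch of the warning-while-retrying, error-when-giving-up rule. The `send_with_retries` function and the `notifier` logger name are invented for illustration, and a real version would add backoff between attempts:

```python
import logging

log = logging.getLogger("notifier")

def send_with_retries(send, attempts=3):
    """Call send() up to `attempts` times; return True on success.

    A failed attempt that will be retried logs a warning.
    Giving up after the last attempt logs an error.
    """
    for attempt in range(1, attempts + 1):
        try:
            send()
            return True
        except ConnectionError as exc:
            if attempt < attempts:
                log.warning("send failed (attempt %d/%d), retrying: %s",
                            attempt, attempts, exc)
            else:
                log.error("send failed after %d attempts, giving up: %s",
                          attempts, exc)
    return False
```

Whether the cause is my bug or their outage, the error line only appears once the notification is actually lost.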
blkflcn3 49 days ago
> What an error log level should mean (a system administrator's view)
That says it all:
- Backseat driving
- Not a developer by trade