The art of logging: the whys, whens, hows, and whats to log.
Mainly because good logging sometimes means fun weekends and long afterwork hours 😉
“Error: max retries exceeded,” chimed the ever-so-helpful Slack alert.
I opened our logs visualisation tool and filtered for the offending service as usual. Nothing came up.
The error was coming from an upstream system that was calling a peer service to evaluate data. Based on context, I had a hunch where else to search. Again, nothing came up.
No logs, nada.
And this was a production system.
On a Friday at 5 PM.
Effective logging is one of those things that no one really focuses on but that makes a world of difference for high-performance teams.
High-performance teams, to me, are teams that score high on the DORA metrics, and the metric which good logging seriously bumps is Mean Time to Recovery (MTTR).
Read on to discover the secrets of logging which I’ve slowly learnt along the way over many “your weekends might be gone” moments.
Why logging matters
The end goal of good logging is to reduce potential system downtime. In professional terms: reducing the MTTR. In personal terms: having undisturbed weekends and afterwork hours.
Within properly designed systems, it is rare that many people will have access to the deployed code in production. Engineers will instead typically have access to a metrics and logs visualisation tool which should serve in times of need to tell a “good enough” story that allows them to know where the root cause of an issue might be.
Ineffective logging results in bad stories where the characters are unknown, no one knows what they were remotely up to, and someone lies dead on the floor. Well, I mean if you like mystery novels…
Anyway.
Effective logging results in a full story being told when an error alert fires. This enables engineers to scope in on what went wrong based on the trail of logs.
Logging as a practice sits in the domain of observability. It is as it sounds: its concern is when to log something, what to log, and how to log it.
The art of logging
TL;DR, this is one senior-ish engineer’s POV of how to have better logging:
Use lots of metadata
Format your logs for ingestion
Timestamp your logs in UTC
Set your log levels correctly
Using numbers for log levels?
Use warning for business errors
Use debug like your life depends on it
Writing the log content
Log the start and end of all transactions
Bubble up all error messages
Standardise & simplify the semantics in use
Do not log sensitive data
Use lots of metadata
The first thing to know about logging is that your logs will be filtered. No one ever reads logs end-to-end except in the case of a security or data breach.
Given the intended user experience, logs are most useful when metadata about them is present. This comes in many forms; some common metadata includes:
Hostname - Which machine did a log come from?
Process ID (PID) - Which process did a log come from?
Log level - What level of severity should this log be considered at?
Timestamp - When did this happen?
Container ID - If applicable, which container did it come from?
More advanced/enterprisey runtimes/frameworks might even include details like the name of the function or the line of code where the log entry was made.
It’s recommended to err on the side of caution when it comes to details. Include as much as you reasonably can for the resources you have been assigned.
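As a sketch of what this looks like in practice, a log entry can be enriched with common metadata before being emitted. The field names below are illustrative, not a standard — align them with whatever your organisation's ingestion pipeline expects:

```python
import json
import os
import socket
from datetime import datetime, timezone

def make_entry(message, level):
    # Illustrative field names; not a standard schema.
    return {
        "log": message,
        "level": level,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "hostname": socket.gethostname(),  # which machine did this come from?
        "pid": os.getpid(),                # which process did this come from?
    }

print(json.dumps(make_entry("completed data retrieval", "info")))
```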
Format your logs for ingestion
One common problem with how most engineers log things is writing to stdout directly. While this works well enough for local development, a plaintext log entry is all but useless in the grander scheme of things.
Difficult as it may be, logs must follow an organisation-wide format as far as possible. A common format to use in production systems is JSON. This enables a single log entry to look like:
{
  "log": "attention attention, error while adding x and y!",
  "hostname": "ec2-1.2.3.4-etcetc",
  "level": "warn",
  "timestamp": "2006-01-02T15:04:05Z",
  "pid": 42069
}
Most popular logging libraries check whether the process is running with a TTY (meaning it’s a local run) or without one (meaning it’s a remote deployment). With a TTY, the logger sends human-readable output to stdout; without one, it automatically formats logs into machine-parseable JSON.
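A minimal sketch of that TTY check with Python's standard logging module, purely to illustrate the mechanism (in practice a good library handles this for you; the JSON field names mirror the example above and are not a standard):

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # render timestamps in UTC

    def format(self, record):
        # One JSON object per line, ready for machine ingestion.
        return json.dumps({
            "log": record.getMessage(),
            "level": record.levelname.lower(),
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "pid": record.process,
        })

handler = logging.StreamHandler(sys.stdout)
if sys.stdout.isatty():
    # Local run: keep output human-readable.
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
else:
    # Remote deployment: emit machine-parseable JSON.
    handler.setFormatter(JsonFormatter())

logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```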
TL;DR: Use a popular library and don’t invent your own.
As an addendum: in medium-to-large organisations, you’ll very likely have differences in field names given the technology sprawl across teams. What’s important is to align on specific field names rather than mandating a single hard format for everyone to follow (hint: it will never happen; give up, there are better mountains to die on).
Timestamp your logs in UTC
One decision that could seriously mess with your logs is customising your logger to use a local time.
It should seem obvious enough, but always log in UTC and let the logs visualisation tool adjust the timestamp to your timezone.
Most - if not all - loggers will use UTC by default. Avoid trying to customise your logger even if your engineer spidey senses try to tell you an advanced configuration is usually better. It can be, but only when you’re fully aware of what you’re achieving in the grander scheme of things.
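Python's standard logging is one exception worth knowing about: `%(asctime)s` renders in local time by default, and a one-line converter swap keeps it in UTC (a sketch, not a required configuration):

```python
import logging
import time

formatter = logging.Formatter(
    "%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
)
# %(asctime)s uses local time by default; gmtime switches it to UTC.
formatter.converter = time.gmtime
```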
Set your log levels correctly
The majority of logger libraries have a configuration known as the log level. Log levels are useful because they allow details to be abstracted away.
Given a story book and an in-story event to search for, you’d likely start with the chapter overviews to see which chapter an event could most likely come from before delving into each chapter. Debugging code is similar and is a process of refining your scope until you get to the root of the problem.
And that is what log levels enable us to do.
We want to find the error (the story event), which lets us see what happened before it. So we go to the chapter overviews: the generic logs that indicate the flow of code. Then we get to the root cause by scoping further into the details to find out what exactly happened.
To achieve the above, we can use the 5 generic categories of log levels that come with most logger libraries. These 5 categories are:
Trace - Use this level for extremely inconsequential things. For example reaching a branch in the code that might be problem-prone.
Debug - Use this for logging details like the contents of variables. Think of how a traditional debugger works - it lets you view the content of each variable with every step you move along the code flow.
Information - Use this for announcing an activity/function has started/concluded.
Warning - Use this for logic that errors out due to bad input.
Error - Use this to log events which unexpectedly disrupt the code flow or could cause the process’s state to become unknown.
Using numbers for log levels?
Some logger libraries define the log level in numbers.
For such libraries, define constants which translate into the above categories. Something like 10 meaning trace, 30 meaning information, and 50 meaning error (with the others in between) works. Why intervals of 10? The gaps allow you to have things like a 35 in future if the need ever arises.
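Python's standard logging module happens to follow exactly this numbers-with-gaps convention (though its mapping differs slightly from the sketch above: DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50), and it exposes a hook for slotting in-between levels:

```python
import logging

# The gaps in the built-in numbering make room for custom levels later.
NOTICE = 25  # hypothetical level between INFO and WARNING
logging.addLevelName(NOTICE, "NOTICE")

logger = logging.getLogger("demo")
logger.setLevel(NOTICE)
logger.log(NOTICE, "something noteworthy but not alarming")
```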
Use warning for business errors
One mistake that’s commonly made is using the error level of logs where the warning level should’ve been used.
Logs at the error level should be about events which impact the process’s integrity. This means:
Events resulting in the process being in an undefined state.
Events which cause the process to terminate unexpectedly, potentially in between transactions.
Events which might result in data loss.
Business logic errors should be at the warning level since they are generally accounted for: a caught exception, a handled error, or a failed validation check. None of these risk the system getting into a state where unexpected things might have happened.
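The distinction can be sketched as follows (the `ValidationError` type and validation rule are hypothetical stand-ins for whatever your business logic checks):

```python
import logging

logger = logging.getLogger("orders")

class ValidationError(Exception):
    """Hypothetical business-rule failure: bad input, not a broken process."""

def validate(quantity):
    if quantity <= 0:
        raise ValidationError("quantity must be positive")

def handle_order(quantity):
    try:
        validate(quantity)
    except ValidationError as error:
        # Expected, fully handled business error: warning, not error.
        logger.warning("failed to validate order: %s", error)
        return False
    except Exception:
        # Anything unexpected may leave state unknown: this earns error.
        logger.error("unexpected error while handling order", exc_info=True)
        raise
    return True
```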
Use debug like your life depends on it
Because in some cases your weekend might depend on it.
Not to sound like an elderly boomer, but back in the day when programmers used software called debuggers, debugging was a process of setting breakpoints in the code, running the code through the debugger, and noting the values of variables of interest along the way.
This enabled a level of state observation which made it possible to know when a variable was modified in an unexpected way.
In cloud-based deployments, attaching a debugger is rarely possible.
In its place, we can use the debug level of logs to achieve the same effect by preemptively logging the value of variables that undergo many state transitions.
Typically, these would be variables which are processed by a high number of branch conditions (if-else/switch statements), or variables which are eventually used in business-critical transaction points.
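A sketch of what that looks like (the pricing rules and coupon code are hypothetical; the point is the debug entry after every branch that mutates the variable):

```python
import logging

logger = logging.getLogger("pricing")

def apply_discounts(price, is_member, coupon):
    # Pre-emptively log the variable after each branch that mutates it,
    # mimicking stepping through the function with a debugger.
    logger.debug("value of price: %s", price)
    if is_member:
        price *= 0.9
        logger.debug("value of price after member discount: %s", price)
    if coupon == "SAVE5":  # hypothetical coupon code
        price -= 5
        logger.debug("value of price after coupon: %s", price)
    return price
```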
Writing the log content
Beyond the metadata, which serves as a good chapter guide when the logs are likened to a storybook, the content of the logs matters as well for us to understand the story better.
Exactly what to log is usually contextual to most teams and is something that should be left to the discretion of the implementing engineer. In general though, the contents of the logs should:
Be sufficient to know the flow of code through branch conditions
Not be so verbose that there are consecutive lines of logs which are inconsequential (can be skipped).
Some general tips that work for most situations follow.
Log the start and end of all transactions
When writing code, it’s common and good practice to implement some form of encapsulation to keep code readable at the higher levels of implementation.
This results in multiple layers of code where a top-level API may call a controller component, which calls a shared library component, which uses the standard library.
For example, given a piece of code (overengineered for the sake of example) which calls a URL and gets the response body:
urlToCall = "https://api.domain.tld/api/v1/something"
log.Info("getting data from ${urlToCall}...")
data = utils.Get(urlToCall)
log.Info("completed data retrieval from ${urlToCall}")
We might have at the controller component level:
func Get(url, optionalConfigurations...) {
// ... process configuration ...
log.Info("starting http GET on ${url}...")
result = HttpCall(METHOD_GET, url, configuration)
log.Info("completed http GET on ${url}")
// ... process and return data ...
}
And finally at the shared library component level which ensures all calls have a User Agent header with the organisation’s name:
func HttpCall(method, url, configuration) {
configuration.UserAgent = "organisation/HttpCall 1.0"
if !allowlistCheck(url) {
throw "url is not allowlisted"
}
log.Info("calling ${url} via http GET...")
standardLibrary.CallHttp(method, url, configuration)
log.Info("completed call to ${url} via http GET")
}
The full logs will look like:
getting data from ${urlToCall}...
starting http GET on ${url}...
calling ${url} via http GET...
completed call to ${url} via http GET
completed http GET on ${url}
completed data retrieval from ${urlToCall}
When the logs stop at any point, it is immediately obvious where the error could have triggered from. This is admittedly an extremely simple example, but following this line of logic as best as you can will help to create decent “chapter overviews” for your logs.
Bubble up all error messages
Some legacy enterprise loggers support this out-of-the-box, but in more modern loggers that prioritise simplicity and speed of development, the passing of context from a low level function to a high level function is absent.
It’s usually complex to create a shared logger component which enables contextual information to be bubbled up, but the next best thing you can do is to append error messages from lower level code to error messages at higher level code.
This could look like
func A() {
log.Info("processing data...")
try {
B()
} catch (error) {
log.Warn("failed call to B: ${error}")
return ... // depends on context
}
log.Info("completed processing of data")
return ... // depends on context
}
func B(input) {
try {
augmentedInput = processInputB(input)
C(augmentedInput)
} catch (error) {
throw "failed call to C: ${error}"
}
return ... // depends on context
}
func C(finalInput) {
var validatedInput
try {
validatedInput = validateInput(finalInput)
} catch (error) {
throw "failed to process input: ${error}"
}
process(validatedInput)
return ... // depends on context
}
Which results in logs that look like:
processing data...
failed call to B: failed call to C: failed to process input: ${error}
This allows for errors from multiple levels of code to be immediately visible, improving how an error presents itself to an engineer who is debugging an error using the logs.
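In a language with exceptions, the same pattern can be sketched by wrapping and re-raising with each layer's context prepended (the function names and the `int` parse standing in for validation are illustrative):

```python
def c(final_input):
    try:
        validated = int(final_input)  # stand-in for validateInput
    except ValueError as error:
        raise RuntimeError(f"failed to process input: {error}")
    return validated

def b(data):
    try:
        return c(data)
    except RuntimeError as error:
        # Append this layer's context before re-raising upwards.
        raise RuntimeError(f"failed call to C: {error}")

def a(data):
    try:
        return b(data)
    except RuntimeError as error:
        # The top level logs the full, concatenated trail in one entry.
        print(f"failed call to B: {error}")
        return None
```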
Standardise & simplify the semantics in use
Once a service has reached a certain maturity, it’s usually useful to have a common language which the whole team that owns the service is aware of. This makes the story accessible to the diverse set of people that working teams generally comprise.
Semantics don’t make the story, but they allow it to be read more easily, especially by non-native speakers.
It might be easy for you as a native English speaker to mentally process “starting”, “commencing”, “initialising”, “beginning”, et cetera, but the same might not be true for a non-native English speaker.
For example:
Trailing ellipses (...) could be consistently used to indicate that something will be processing after that log entry.
Code issues could always begin with and be phrased as “failed to ...”.
Beginnings of transactions could always use “starting ...”.
Endings of transactions could always use “completed ...”.
Debug logs for state observation could use “value of var[varName]: ...”.
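One lightweight way to enforce such conventions is to centralise the phrasing in tiny helpers so no one hand-writes the verbs. A sketch, with the wording following the conventions above:

```python
def starting(activity):
    # Trailing ellipsis signals that processing follows this entry.
    return f"starting {activity}..."

def completed(activity):
    return f"completed {activity}"

def failed(activity, error):
    return f"failed to {activity}: {error}"

def value_of(name, value):
    return f"value of var[{name}]: {value}"
```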
Do not log sensitive data
When’s the last time you read a story book where the street address of a character was fully revealed?
Jokes aside, logs serve to tell a story, they don’t serve as documentation for an issue.
In the same way that wildly popular stories have lorebooks or encyclopedia-like publications, issues and bugs have incident reports. Leave the details for incident reports and focus on the story to be told, which describes events and flow, not exact details.
While it’s ideal to be able to log any state-related information, avoid logging sensitive data like Personally Identifiable Information (PII) and secrets. This is because logs occasionally need to be shared with team members for faster incident recovery, and they might not have the security clearance to view sensitive data/secrets. In some cases, logs might need to be shared with 3rd party service providers, and in some regulatory contexts, your logs could end up with government officials.
To reduce the likelihood of sensitive data ending up in the logs, it’s best to never log whole objects when doing debug logging.
Value masking is another option which some log collators provide. Although these are largely based on regular expressions, they generally do a good-enough job of preventing sensitive data from being sent into log visualisation tools. Contact your friendly App Infrastructure team for more information :D
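If your log collator doesn't offer masking, a regex-based filter at the application edge is a reasonable fallback. The pattern below only catches email addresses and is illustrative, not exhaustive:

```python
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class MaskingFilter(logging.Filter):
    def filter(self, record):
        # Redact before the record reaches any handler or collator.
        record.msg = EMAIL.sub("<redacted>", str(record.msg))
        return True

logger = logging.getLogger("demo")
logger.addFilter(MaskingFilter())
```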
Cheers
And this brings us to the end of this piece. It was intended to be short, but it doesn’t seem so now that I’ve completed the first draft.
Anyway, I hope this might be a useful reference to someone out there. Subscribe for more of similar pieces + insights into my life.