Logging #9: Observability via Logging

Problem

Application monitoring has grown more complex in the days of SaaS and microservices.

Solution

Observability, which for me means the ability to analyze the system usually based on its metrics, traces and logs.

Example

From the 3 standard tools used to achieve an observable system, logging is the most powerful one. It is possible to have an observable system only using properly structured logs. Let me present a naive example of code that produces metrics and traces via logging using logrusutil (full code).

func JobHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	storeJob(ctx)
	logctx.From(ctx).Info("job done")
	io.WriteString(w, "OK")
}

func storeJob(ctx context.Context) {
	ctx, closeSpan := NewSpan(ctx, "storeJob")
	defer closeSpan()
	time.Sleep(time.Second * time.Duration(rand.Int63n(5)))
	logctx.From(ctx).Info("job stored")
}

func NewSpan(ctx context.Context, name string) (context.Context, func()) {
	log := logctx.From(ctx).WithField("spanID", rand.Int())
	log.WithField("spanName", name).Info("span started")
	t1 := time.Now()
	
	// setting contextual log entry and function to close the span
	return logctx.New(ctx, log), func() {
		log.WithField("duration", time.Since(t1)).Info("span finished")
	}
}

func logMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log := logctx.Default.WithField("reqID", rand.Int())
		log.Info("request started")
		t1 := time.Now()

		// setting contextual log entry for the handler
		ctx := logctx.New(r.Context(), log)
		next.ServeHTTP(w, r.WithContext(ctx))

		log.WithField("duration", time.Since(t1)).Info("request finished")
	})
}

// somewhere: logMiddleware(http.HandlerFunc(JobHandler))

Log output:

time="2009-11-10T23:00:00Z" level=info msg="request started" reqID=134020434
time="2009-11-10T23:00:00Z" level=info msg="span started" reqID=134020434 spanID=1721070109 spanName=storeJob
time="2009-11-10T23:00:00Z" level=info msg="request started" reqID=1597969999
time="2009-11-10T23:00:00Z" level=info msg="span started" reqID=1597969999 spanID=1237770961 spanName=storeJob
time="2009-11-10T23:00:00Z" level=info msg="job stored" reqID=1597969999 spanID=1237770961
time="2009-11-10T23:00:00Z" level=info msg="span finished" duration=0s reqID=1597969999 spanID=1237770961
time="2009-11-10T23:00:00Z" level=info msg="job done" reqID=1597969999
time="2009-11-10T23:00:00Z" level=info msg="request finished" duration=0s reqID=1597969999
time="2009-11-10T23:00:01Z" level=info msg="job stored" reqID=134020434 spanID=1721070109
time="2009-11-10T23:00:01Z" level=info msg="span finished" duration=1s reqID=134020434 spanID=1721070109
time="2009-11-10T23:00:01Z" level=info msg="job done" reqID=134020434
time="2009-11-10T23:00:01Z" level=info msg="request finished" duration=1s reqID=134020434

As you can see we can analyze (even as a human) a lot of stuff from these logs such us:

  • Find the longest request (metric)
  • Filter out logs associated with given request using its id (trace)
  • Analyze how much of the time was consumed the persistence (span in the trace)

Advantages

  1. Usage of only logging (without dedicated tracing and metrics libraries) reduces the number of software’s dependencies.
  2. Fast and easy to adopt early.

Disadvantages

  1. Logging is the most expensive instrumentation technique in terms of analysis and storage. Setting up efficient monitoring based on logs is a lot a harder.
  2. There are dedicated libraries to handle metrics and traces such us OpenTelemetry. Using logging instead can end up in using custom instrumentation API, especially for tracing.

Summary

Logging can be used to achieve kind of “poor man’s” observability with low initial cost. It feels to be a very Cloud Native way to use logging to stdout and use other software to decompose logs to metrics and traces. However, currently I do not think it would be maintainable, because this solution would be rather non-standard. Moreover, I think that decomposing logs into metrics and traces may be too slow for big scale SaaS software. Right now it becomes more popular to use fast “tail” log management system like Loki together with other OOS instrumentation tools like Grafana, Prometheus, Jaeger or PaperTrail together with SaaS instrumentation tools like AppOptics.

Logging series

Leave a comment