Friday, January 19, 2024

Traces in Tempo vs logs in Loki

In my last post I described how to use the http-log plugin in Kong to ship logs to Loki, and how we are going to use the OpenTelemetry plugin to ship traces to Tempo.

The OpenTelemetry plugin requires a change in the Kong configuration: tracing has to be enabled by setting tracing_instrumentations to all, followed by a restart of the data plane.
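
For reference, a minimal sketch of the relevant kong.conf settings (the matching KONG_* environment variables work as well); treat the exact values as an illustration of our setup:

# kong.conf - enable tracing for all built-in instrumentations
tracing_instrumentations = all
# 100% sampling (see the traces paragraph further down)
tracing_sampling_rate = 1.0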

In the plugin configuration we had to raise queue.max_batch_size from its default of 1 to 1000, to avoid full queue errors.
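
As a sketch, the plugin configuration then looks roughly like this in declarative form (kong.yml); the Tempo endpoint here is an assumption for an OTLP/HTTP receiver:

plugins:
  - name: opentelemetry
    config:
      # assumption: Tempo's OTLP/HTTP traces endpoint
      endpoint: http://tempo:4318/v1/traces
      queue:
        # the default of 1 caused full queue errors under load
        max_batch_size: 1000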

Without repeating my last post: the http log provides valuable information such as the received time in milliseconds, source IP, incoming endpoint, method and HTTP headers, authenticated ID, the Kong service and route invoked, the upstream IP and port, and the HTTP status code.

The traces provide similar information: the same start time in milliseconds, source IP, incoming endpoint and method, the Kong route invoked, the upstream name, IP and port, and the HTTP status code.

In Grafana we can explore both the logs from Loki and the traces from Tempo, but we want to take advantage of the built-in Observability, which has now been rebranded to Applications. Initially this looks promising: we get metrics generated from the traces and can quickly see the duration and duration distribution of all requests.

Traces: both in Explore (Tempo) and in Application Kong we see all traces, and each trace contains its set of spans. No further configuration is needed; in Kong the sampling rate is configured to 1, i.e. 100%, and so far we see no reason to lower it.

Logs: in Explore (Loki) we see all logs, but not in Application Kong. Since the Application Kong log query defaults to {exporter="OTLP", job="${serviceName}"}, we have to change the log stream Kong sends to Loki; the new custom_fields_by_lua entry is streams with the following value:

local cjson = require "cjson"

-- timestamp in nanoseconds (whole-second precision), as a string for the Loki push API
local ts = string.format('%18.0f', os.time() * 1000000000)

-- serialize the Kong log payload and wrap it in a Loki stream with our labels
local log_payload = kong.log.serialize()
local json_payload = cjson.encode(log_payload)
local service = log_payload['service']

local t = {
  {
    stream = { exporter = 'OTLP', job = 'kong', service = service['name'] },
    values = { { ts, json_payload } }
  }
}
return t

After this change all Kong http logs appear in Application Kong; of course we also have to update our dashboards from the label selector kong_http_log="log-payload" to job="kong".
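
For illustration, a dashboard panel query then changes roughly like this (the aggregation is just an example):

# before: the old stream label
sum(count_over_time({kong_http_log="log-payload"} [1m]))
# after: labels matching the Application Kong default query
sum(count_over_time({exporter="OTLP", job="kong"} [1m]))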

Now for the correlation between traces and logs: we learned that this doesn't work out of the box with Kong version 3.4; we need to upgrade to 3.5 in order to get the field trace_id in the logs.

As a workaround we can use the timestamp with millisecond precision; for each request this value is identical in the log and in the trace.
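
In LogQL the workaround is simply a line filter on the millisecond timestamp taken from the trace, assuming the value appears literally in the serialized payload (in the standard Kong log serializer that is the started_at field):

{exporter="OTLP", job="kong"} |= "1705661030364"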

As an example I've exported a trace (5.0 kB, length 5102) containing 9 spans: the parent and 8 children, from kong.router to kong.header_filter.plugin.opentelemetry, see the screenshot below.

Surely this is just for fun: we see that durations are shown down to a hundredth of a microsecond, e.g. for the key-auth plugin: Duration: 71.94μs, Start Time: 658.25μs (11:43:50.364)

In the span we find "startTimeUnixNano": 1705661030364658200, "endTimeUnixNano": 1705661030364730000

Now when I compute the duration from these values I come to 71.8 microseconds; googling both values with a minus in between returns 71936, and Grafana comes to 71.94μs.
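
Just to double-check that subtraction, a quick sketch in plain Lua 5.4 (64-bit integers, so these values are handled exactly):

-- subtract the exported span timestamps
local start_ns = 1705661030364658200
local end_ns   = 1705661030364730000
print(end_ns - start_ns)           -- 71800 nanoseconds
print((end_ns - start_ns) / 1000)  -- 71.8 microseconds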

All nano timestamps in the exported trace end in '00', so they are exact only to 100 nanoseconds, which is 0.1 microseconds.

Clever that Google and Grafana can get more precise, but yeah, this is already about a tenth of a thousandth of a thousandth of a second...

Taking the milliseconds (1705661030364), the correlated log can be found easily. Saving this json to a file gives 3.3 kB (length 3390), around 70% of the size of the trace. These numbers are interesting because the average ingestion rates of logs and traces are the other way around:

A single log is roughly 2/3 the size of the trace of the same request, yet the average log ingestion rate is more than 3 times the average trace ingestion rate: 14.5 GiB of logs versus 4.50 GiB of traces. This seems like a mystery, which I leave unsolved for now.

As mentioned, this exercise is more fun than practical: Grafana can provide insights on Kong latencies, number of errors, alerts and so on, but detailed information on sub-components is overkill. Only once our landscape is OpenTelemetry-enabled, especially our upstream microservices, do I expect to gain useful insights and nice service maps. Till that time I enjoy playing with dashboards on the http logs in Loki 🤣

