Showing posts with label Monitoring.

Friday, January 19, 2024

Traces in Tempo vs logs in Loki

In my last post I described how to use the http-log plugin in Kong to send logs to Loki, and how we would use OpenTelemetry to send traces to Tempo.

The OpenTelemetry plugin requires a change in the Kong configuration: enable tracing by setting tracing_instrumentations to all and restart the data plane.
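
As a minimal sketch, the relevant kong.conf settings could look like this (equivalently the KONG_TRACING_INSTRUMENTATIONS and KONG_TRACING_SAMPLING_RATE environment variables); the sampling rate of 1.0 matches the 100% sampling mentioned below:

# kong.conf
tracing_instrumentations = all
# 1.0 = sample 100% of requests
tracing_sampling_rate = 1.0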

In the plugin configuration we had to raise queue.max_batch_size from the default of 1 to 1000, to avoid full-queue errors.
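
As a hedged sketch, enabling the plugin with that queue setting via the Admin API could look like this (the Tempo URL is a placeholder for your OTLP/HTTP endpoint):

# NOTE: placeholder endpoint; adjust to your Tempo OTLP/HTTP collector
curl -X POST http://localhost:8001/plugins \
  --data name=opentelemetry \
  --data config.endpoint=http://tempo:4318/v1/traces \
  --data config.queue.max_batch_size=1000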

Without repeating my whole last post: the http log provides valuable information like received time in milliseconds, source ip, incoming endpoint, method and http headers, authenticated id, Kong service and route invoked, upstream ip and port, and http status code.

The traces provide similar information: the same start time in milliseconds, source ip, incoming endpoint and method, Kong route invoked, upstream name, ip and port, and http status code.

In Grafana we can explore both logs from Loki and traces from Tempo, but we want to take advantage of the built-in Observability, now rebranded to Applications. Initially this looks promising: we get metrics generated from traces and quickly see the duration and duration distribution of all requests.

Traces: both in Explore (Tempo) and in Application Kong we see all traces; each trace contains its set of spans. No further configuration is needed. In Kong we have the sampling rate configured to 1, which is 100%, and so far we see no reason to lower this.

Logs: in Explore (Loki) we see all logs, but not in Application Kong. As the Application Kong log query defaults to {exporter="OTLP", job="${serviceName}"}, we have to change our log stream from Kong towards Loki; the new custom_fields_by_lua field Streams gets the value:

local cjson = require "cjson"
-- Loki's push API expects the timestamp as a string in nanoseconds;
-- os.time() has one-second resolution
local ts = string.format('%18.0f', os.time() * 1000000000)
local log_payload = kong.log.serialize()
local json_payload = cjson.encode(log_payload)
local service = log_payload['service']
-- one stream per request, labelled so the Application Kong log query matches
local t = { { stream = { exporter = 'OTLP', job = 'kong', service = service['name'] },
              values = { { ts, json_payload } } } }
return t

After this change all Kong http logs appear in Application Kong; of course we have to update our dashboards from kong_http_log="log-payload" to job="kong".
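
For example, the service-per-minute dashboard query from my previous post then becomes:

sum by(service) (count_over_time({job="kong"} | json [1m]))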

Now for the correlation between traces and logs: we learned that this doesn't work out of the box with Kong version 3.4; we need to upgrade to 3.5 in order to get the field trace_id in the logs.

As a workaround we can use the timestamp up to milliseconds; this value is identical for the log and the trace of each request.
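
For instance, with the millisecond start time from the example below, a simple line-filter query in Loki finds the matching log:

{job="kong"} |= "1705661030364"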

For example, I've exported a trace (5.0 kB, length 5102) containing 9 spans: the parent and 8 children, from kong.router to kong.header_filter.plugin.opentelemetry, see the screenshot below:

Surely this is just for fun; we see that durations are reported down to hundredths of microseconds, e.g. for the key-auth plugin: Duration: 71.94μs, Start Time: 658.25μs (11:43:50.364)

In the span we find "startTimeUnixNano": 1705661030364658200, "endTimeUnixNano": 1705661030364730000

Now when I compute the duration by hand I come to 71800 ns, i.e. 71.8 microseconds; googling both values with a minus in between returns 71936, and Grafana comes to 71.94μs.

All nano timestamps in the exported trace end with '00', so at first sight they look exact only to 100 nanoseconds, which is 0.1 microseconds.

Google and Grafana are not actually more precise here; the likely explanation is 64-bit floating point. Values around 1.7×10^18 exceed the 53-bit mantissa of a double, which can only represent them in steps of 256 ns: the exported timestamps appear to have passed through doubles (hence the trailing zeros), and both Google and Grafana evidently subtract them as doubles, which yields exactly 71936. Either way, we're quibbling over a tenth of a microsecond...
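
A quick sketch to reproduce this, e.g. in Lua 5.3+, where integer arithmetic is exact up to 2^63 while float arithmetic rounds to a 53-bit mantissa:

-- exact 64-bit integer subtraction: prints 71800, i.e. 71.80 microseconds
print(1705661030364730000 - 1705661030364658200)
-- the same subtraction with 64-bit floats: near 1.7e18 a double can only
-- represent multiples of 256 ns, so this prints 71936.0, i.e. 71.94 microseconds
print(1705661030364730000.0 - 1705661030364658200.0)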

Taking the milliseconds (1705661030364), the correlated log can be found easily. Saving this json to a file gives 3.3 kB (length 3390), around 70% of the size of the trace. These numbers are interesting because the average ingestion rates of these logs and traces are the other way around:

1 log is about 2/3 the size of the trace of the same request, while the average log ingestion rate is more than 3 times the average trace ingestion rate: 14.5 GiB of logs versus 4.50 GiB of traces. This seems like a mystery, which I leave unsolved for now.

As mentioned, this exercise is more fun than practical; Grafana can provide insights on Kong latencies, error counts, alerts and so on, but detailed information on Kong's sub-components is overkill. Once our landscape is OpenTelemetry-enabled, especially our upstream microservices, I expect to gain really useful insights and nice service maps. Until then I'll enjoy playing with dashboards on the http logs in Loki 🤣


Monday, December 18, 2023

Monitoring Kong with Grafana

After being quiet for a decade I'd love to start sharing some experiences again!


My new posts will focus on Kong API Gateway Enterprise Edition. We started years ago with Oracle API Gateway, switched two years ago, and migrated all our APIs to Kong 2.3.

This year I've had many interesting journeys with Kong: I successfully migrated the datastore from Cassandra to PostgreSQL, upgraded Kong to 3.4, performed some POCs with Grafana and Dynatrace, and updated and improved our CI/CD pipelines.


This post is about monitoring: how to monitor Kong API Gateway with Grafana Cloud. Grafana is the dashboard that visualizes all metrics (Mimir), logs (Loki) and traces (Tempo).


Metrics:

Prometheus: Kong offers the Prometheus plugin, which exposes metrics on Kong's /metrics endpoint to be scraped by an agent, such as the Grafana agent.
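
A hedged sketch of the scrape config for the Grafana agent, assuming the metrics are exposed via Kong's Status API on port 8100 (the remote_write section towards Mimir is omitted):

# grafana-agent.yaml (static mode); remote_write to Mimir omitted
metrics:
  configs:
    - name: kong
      scrape_configs:
        - job_name: kong
          metrics_path: /metrics
          static_configs:
            # assumption: Prometheus plugin metrics exposed on the Status API at :8100
            - targets: ['localhost:8100']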


Statsd: statsd with Kong 2.8 didn't work smoothly and required a lot of field mappings; in 2.8 there were both a statsd and a statsd-advanced plugin. Statsd should work better from 3.x onwards, see https://konghq.com/blog/engineering/how-to-use-prometheus-to-monitor-kong-gateway


Logs:

Http-log: with the http-log plugin Kong can send the log of each request to Grafana; a setup sketch follows the list below.

  • Advantages: all request meta-information is available in Grafana, from latencies to upstream IP. Minimal performance impact as the HTTP Log plugin uses internal queues to decouple the production of log entries from their transmission to the upstream log server.
  • Disadvantages: No out-of-the-box dashboards available
  • Links:
    https://docs.konghq.com/hub/kong-inc/http-log/
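
As a hedged sketch, enabling the plugin via the Admin API could look like this (the Loki push URL and credentials are placeholders for your Grafana Cloud endpoint):

# NOTE: placeholder Loki push URL and credentials; use your Grafana Cloud endpoint
curl -X POST http://localhost:8001/plugins \
  --data name=http-log \
  --data config.content_type=application/json \
  --data 'config.http_endpoint=https://USER:TOKEN@logs-prod-example.grafana.net/loki/api/v1/push'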

Note that this plugin works nicer with a custom field by lua added (name: Streams) with the following value:

local cjson = require "cjson"
-- Loki's push API expects the timestamp as a string in nanoseconds;
-- os.time() has one-second resolution
local ts = string.format('%18.0f', os.time() * 1000000000)
local log_payload = kong.log.serialize()
local json_payload = cjson.encode(log_payload)
local service = log_payload['service']
-- one stream per request, with the labels used by the queries below
local t = { { stream = { kong_http_log = 'log-payload', service = service['name'] },
              values = { { ts, json_payload } } } }
return t

When using cjson, the following should be added to kong.conf: untrusted_lua_sandbox_requires=cjson

Now we can explore the logs in Grafana more easily:

  • all requests as they come in, simply select log-payload: {kong_http_log="log-payload"} |= ``
  • parsed as json: {kong_http_log="log-payload"} | json
  • filter on service, as Kong sent the service in the custom field by lua Streams: {service="ASW_Europe_Standards_API"} | json

Some example dashboards:

  • Workspace request per minute: sum by(workspace_name) (count_over_time({kong_http_log="log-payload"} | json [1m]))
  • Response status per minute: sum by(response_status) (count_over_time({kong_http_log="log-payload"} | json [1m]))
  • Service per minute: sum by(service) (count_over_time({kong_http_log="log-payload"} | json [1m]))
  • Service status per minute: sum by(service, response_status) (count_over_time({kong_http_log="log-payload"} | json [1m]))

File logs: the Grafana agent can monitor the access and error logs; especially the access logs provide useful information about total latency and upstream path, but the overall set of information is less than what the http-log provides. Also, the format of the log lines needs to be defined in order to get them parsed; a sketch follows below.
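
For completeness, a hedged sketch of the agent's logs section for tailing the access log, assuming Kong's default log location and the agent's promtail-style config; the regex only captures a few fields and would need tuning to the actual log format:

# grafana-agent.yaml (static mode), promtail-style logs config
logs:
  configs:
    - name: kong-files
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: kong-access
          static_configs:
            - targets: [localhost]
              labels:
                job: kong-access
                # assumption: default Kong prefix
                __path__: /usr/local/kong/logs/access.log
          pipeline_stages:
            # assumption: default nginx-style access log; adjust to your log_format
            - regex:
                expression: '^(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'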


Traces:

OpenTelemetry: often abbreviated OTel (not to be confused with o11y, the numeronym for observability), this is a framework for sending spans and traces from Kong to Grafana, or any other OTLP-enabled application.

Currently I'm exploring this feature to see what kind of extra insights this gives us when e.g. our upstream microservices also enable OpenTelemetry.

Tuesday, November 6, 2012

FOTY0001 and logfile rotation

A certain person named, or using the alias, Vivek wrote some interesting articles a few years ago on his blog [ref: OracleD] about Oracle SOA Suite 10g. He experienced many flaws in the Oracle OC4J Application Server and pointed to the OTN discussion "Oracle BPEL + Oc4j + Jdeveloper = brain damage". Clearly he was not so happy with the features in the 10g version of Oracle SOA Suite...

He also referred to the so-called FOTY0001 errors, which often occur during XRef calls and XSL transformations. More information about FOTY0001 can be found in the OPMN log files. A typical error looks like this:
subLanguageExecutionFault - XPathExecutionError
  XPath expression failed to execute.
  Error while processing xpath expression, the expression is "ora:processXSLT('myTransformation.xsl',bpws:getVariableData('myInputMessage'))", the reason is FOTY0001: type error.
  Please verify the xpath query.

This error occurs e.g. when the transform activity in JDeveloper is opened and closed immediately: there was no time for the message part to load, so it will be missing in the generated code, leading to this FOTY0001 error at runtime. The proper syntax in the code is ora:processXSLT('myTransformation.xsl',bpws:getVariableData('myInputMessage','myMessagePart'))
To prevent this, either click cancel in JDeveloper or wait for the message parts to load completely.

To view the FOTY0001 errors in detail, the obvious way is Enterprise Manager, but it's easier to open the log files directly on the filesystem. Depending on the logging level the server log files can quickly become huge. Log rotation can be configured in the OPMN configuration: open $SOA_HOME/opmn/conf/opmn.xml and add -Dstdstream.filesize (file size in megabytes) and -Dstdstream.filenumber (number of files) to the java-options of the oc4j_soa process type:
<ias-component id="soa_group" status="enabled">
  <process-type id="oc4j_soa" module-id="OC4J" status="enabled">
    <module-data>
      <category id="start-parameters">
        <data id="java-options"
          value="-server

Add here the following parameters:
          -Dstdstream.filesize=10 - File size in MegaBytes
          -Dstdstream.filenumber=50 - Number of files
To separate the output and error messages add the following data element with oc4j-options inside the same category:
<data id="oc4j-options" value="-out $ORACLE_HOME/opmn/logs/oc4j_soa.out -err $ORACLE_HOME/opmn/logs/oc4j_soa.err"/>
To view the FOTY0001 error details simply open (in a good text-editor) the *.err file containing the timestamp of the error.

Thursday, May 12, 2011

BPELConsole session timeout

Want to disable the continuous and annoying re-logins into the BPELConsole?
The solution is simple: just change the session timeout so you won't be logged out automatically anymore:

Log in to the Enterprise Manager (em), open the oc4j container oc4j_soa and click on application orabpel; then click on module 'console' and go to tab Administration. Next click on the configuration properties Go to Task to view/edit the configuration properties for this web module. Here you can set the value of the Session Timeout (seconds): a value of -1 will disable the timeout, the default is 1200 (20 min.), and 12 hours would be 43200.

As usual: the change will become effective after a restart.

This setting can be changed for all sites (em, esb, AIA); just find the right module by clicking the oc4j container oc4j_soa and clicking on the tab Administration. Next click on the J2EE Websites Go to Task to manage the J2EE websites in this OC4J instance. Click on the default-web-site and you see the list of all Web Modules, one per application, with their web address (root context). For the root context BPELConsole you see application orabpel and module console; you can change the timeout for every application (em: ascontrol/ascontrol, esb: esb-dt/esbconsole, AIA: AIAApplication/AIAUserInterface) the same way as described above.

Don't forget to restart, once more:

  1. Login to the Enterprise Manager (em).
  2. Click the oc4j container [oc4j_soa].
  3. Click on the tab Administration.
  4. Click on the J2EE Websites - Go to Task to Manage the J2EE websites in this OC4J instance.
  5. Click on the default-web-site.
  6. You see the list of all Web Modules; click on the web module of the application you want to change the timeout of. For example, click on the web module [console] of application orabpel.
  7. Click on the tab Administration.
  8. Click on the configuration properties - Go to Task to View/edit configuration properties for this web module.
  9. Set a new value of the Session Timeout (seconds): a value of -1 will disable the timeout, the default is 1200 (20 min.), 12 hours will be 43200.
  10. Restart.

Wednesday, May 11, 2011

BPEL instance title in BPELConsole

For monitoring purposes it's good practice to set the BPEL instance title to a unique and descriptive value. For example: for order and invoice messages you might want to see the orderNumber in the instance title, like OrderNumber - [orderNumber].

To achieve this use the following code in the BPEL process:
<variable name="Title" type="xsd:string"/>
and
<assign name="GetTitle">
 <copy>
  <from expression="concat('OrderNumber - ',bpws:getVariableData('InputMessage','InputPart','/msg:rootElement/msg:orderNumber'))"/>
  <to variable="Title"/>
 </copy>
</assign>
<bpelx:exec name="SetTitle" language="java" version="1.5">
 <![CDATA[setTitle((String)getVariableData("Title"));]]>
</bpelx:exec>


Next it might be useful to search for all instances related to this order number; for this a small fix is needed to make the BPELConsole Instances search use wildcards automatically:
Modify the following file: $SOA_HOME/j2ee/oc4j_soa/applications/orabpel/console/ngInstanceList.jsp
Replace
// construct where condition step 4: instanceTitle
//
String instanceTitleQ = request.getParameter( "instanceTitle" );
if ( instanceTitleQ != null && instanceTitleQ.length( ) != 0 )
{
  buf.setLength( 0 );
  tmpWhere.setClause( buf.append( n++ > 0 ? " AND " : "" )
    .append( SQLDefs.AL_ci_title )
    .append( " = ? " )
    .toString() );
  tmpWhere.setString( 1, instanceTitleQ );
  where.append( tmpWhere );
}


with

// construct where condition step 4: instanceTitle
//
String instanceTitleQ = request.getParameter( "instanceTitle" );
if ( instanceTitleQ != null && instanceTitleQ.length( ) != 0 )
{
  buf.setLength( 0 );
  tmpWhere.setClause( buf.append( n++ > 0 ? " AND " : "" )
    .append( SQLDefs.AL_ci_title )
//Start search with wildcards
//    .append( " = ? " )
//    .toString() );
//  tmpWhere.setString( 1, instanceTitleQ );
    .append( " LIKE ? " )
    .toString() );
  String wildcard = "%";
  instanceTitleQ = wildcard.concat(instanceTitleQ.concat(wildcard));
  tmpWhere.setString( 1, instanceTitleQ );
//End search with wildcards
  where.append( tmpWhere );
}


Now you can search by the ordernumber in the 'title'-field and the BPELConsole returns all instances with this number somewhere in the instance title.

Tuesday, May 10, 2011

EBMTracking stamp in both BPEL and ESB

According to AIA [ref: Oracle® Application Integration Architecture - Foundation Pack 2.5: Integration Developer's Guide, Release 2.5, Part No. E16465-01, December 2009] the messages get tracking information added in both the BPEL and ESB components. Below is some updated code to get this EBMTracking segment populated (the sample EBM here is of type SyncObjectEBM):

The transformation within the BPEL Component RequesterABCSImpl (xsl) which creates the EBM:

<xsl:variable name="ServiceName" select="'{http://xmlns.oracle.com/ABCSImpl/[SenderApplication]/Core/[ABCSName]/V1}[ABCSName]'"/>

<corecom:EBMTracking>
 <corecom:SequenceNumber>
  <xsl:value-of select="position()"/>
 </corecom:SequenceNumber>
 <corecom:ExecutionUnitID/>
 <corecom:ExecutionUnitName>
  <xsl:value-of select="$ServiceName"/>
 </corecom:ExecutionUnitName>
 <corecom:ImplementationCode>
  <xsl:text disable-output-escaping="no">BPEL</xsl:text>
 </corecom:ImplementationCode>
 <corecom:ActivityDateTime>
  <xsl:value-of select="xp20:current-dateTime()"/>
 </corecom:ActivityDateTime>
</corecom:EBMTracking>


After the transformation the ExecutionUnitID gets populated in the BPEL code with the following assign:

<assign name="Assign_InstanceID">
 <copy>
  <from expression="ora:getInstanceId()"/>
  <to variable="SyncObjectEBSReqMsg" part="SyncObjectEBM" query="/objectebs:SyncObjectEBM/corecom:EBMHeader/corecom:EBMTracking/corecom:ExecutionUnitID"/>
 </copy>
</assign>


The following transformation has to be created inside the ESB component; the population of the ExecutionUnitID is described in the Best Practices Guide [ref: Oracle® SOA Suite Best Practices Guide, 10g Release 3 (10.1.3.3.0), E10971-01, December 2007].
Tip: first let JDeveloper create the transformation and map one field to get all the namespaces populated; afterwards copy/paste the following code:

<xsl:template match="/">
 <ebo:SyncObjectListEBM>
  <corecom:EBMHeader>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:EBMID"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:EBMName"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:EBOName"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:CreationDateTime"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:VerbCode"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:MessageProcessingInstruction"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Sender"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Target"/>
   <xsl:copy-of select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:BusinessScope"/>
   <xsl:for-each select="ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:EBMTracking">
    <xsl:copy-of select="."/>
   </xsl:for-each>

<corecom:EBMTracking>
 <corecom:SequenceNumber>
  <xsl:value-of select="position() + 1"/>
 </corecom:SequenceNumber>
 <corecom:ExecutionUnitID>
  <xsl:value-of select="ehdr:getInstanceID()"/>
 </corecom:ExecutionUnitID>
 <corecom:ExecutionUnitName>
  <xsl:text disable-output-escaping="no">{http://xmlns.oracle.com/EnterpriseServices/Core/Object/V1}ObjectEBS</xsl:text>
 </corecom:ExecutionUnitName>
 <corecom:ImplementationCode>
  <xsl:text disable-output-escaping="no">ESB</xsl:text>
 </corecom:ImplementationCode>
 <corecom:ActivityDateTime>
  <xsl:value-of select="xp20:current-dateTime()"/>
 </corecom:ActivityDateTime>
</corecom:EBMTracking>

  </corecom:EBMHeader>
  <xsl:for-each select="ebo:SyncObjectListEBM/ebo:DataArea">
   <xsl:copy-of select="."/>
  </xsl:for-each>
 </ebo:SyncObjectListEBM>
</xsl:template>


Last thing: add the following property to your esb component:
name:    enableAccessBusinessEvent
value:   true

Monday, May 9, 2011

Routing rules

Basically, routing rules are built upon the following three parts: sender, receiver and message type.

With AIA comes a library of EBOs (Enterprise Business Objects, the canonical model), EBMs (Enterprise Business Messages: EBO + envelope per action/verb) and EBSs (Enterprise Business Services: one routing service per EBO, with one operation for each EBM).
As the EBS already contains one operation per EBM for an EBO, AIA already takes care of the message type. The routing rule in the EBS consists of filter, mapping and target. This leaves no place for the sender as part of the routing rule as defined by AIA.

Therefore we want to include the sender ourselves; two scenarios are described here.

First scenario: direct integration from ApplicationX to ApplicationY. Apply the filter on the sender and on an empty target, to exclude calls from the EBF (the only component allowed to stamp the target):
{/ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Sender/corecom:ID = 'ApplicationX_01' and not(/ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Target/corecom:ID/text())};

Second scenario: direct integration with an EBF (Enterprise Business Flow) involved for data enrichment in ApplicationZ. Apply the filter on the sender and on the target; the target is stamped by the EBF (the only component allowed to stamp the target).
Data enrichment - send to ApplicationZ_01:
{/ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Sender/corecom:ID = 'ApplicationX_01' and /ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Target/corecom:ID = 'ApplicationZ_01'};
Receiver - send to ApplicationY_01:
{/ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Sender/corecom:ID = 'ApplicationX_01' and /ebo:SyncObjectListEBM/corecom:EBMHeader/corecom:Target/corecom:ID = 'ApplicationY_01'};

The advantage is a clear overview and rules that are easier to extend and maintain; this advantage only grows over time.