The Three Pillars of Observability: A MuleSoft Architect's Guide

Our Mule Runtime is running. Our APIs are deployed. Our flows are live. But do we know what is happening inside them right now?

Observability answers that question. It gives us the ability to understand our system from the outside — by examining what our system produces. Without it, we operate blind. We wait for users to report problems. We scramble to find root causes after damage is done. We can do better.

In this post, we build an observability strategy for our Mule deployments using three pillars: logs, metrics, and traces. Each pillar answers a different question. Together, they give us full visibility into every integration we run.


What Is Observability?

Observability is not a tool. It is a property of our system. A system is observable when we can determine its internal state by examining its external outputs. The more our system tells us about itself, the more observable it is.

The three pillars are the outputs we collect:

Pillar    Question It Answers
------    -------------------
Logs      What happened?
Metrics   How is the system performing over time?
Traces    Where did a request go and how long did each step take?

We need all three. A single pillar gives us a partial picture. Together, they give us the full story.


Pillar 1: Logs

What Logs Are

A log is a timestamped record of a discrete event. When our Mule flow starts, logs fire. When a connector throws an error, logs fire. When a message transforms, we can make logs fire.

Logs tell us what happened and when it happened.


What We Log in MuleSoft

We log at meaningful points in our flows. Here is a practical checklist:

  • Flow entry — record the correlation ID, HTTP method, and path
  • Connector calls — record the target system, operation, and outcome
  • Error handlers — record the error type, message, and full stack trace
  • Flow exit — record the status code and total elapsed time

We use the Logger component in Anypoint Studio. We set the log level appropriately:

Level   When We Use It
-----   --------------
ERROR   Unhandled failures. Pages the on-call team.
WARN    Unexpected but recoverable conditions.
INFO    Normal flow milestones. Startup, shutdown, key transitions.
DEBUG   Detailed data for troubleshooting. Never on in production.
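Putting the checklist and the level table together, a flow-entry log might look like the following Mule 4 configuration sketch. The flow, config, and field names are illustrative, and XML namespaces are omitted for brevity:

```xml
<!-- Sketch: structured flow-entry logging in Mule 4.
     Flow, config-ref, and field names are illustrative. -->
<flow name="order-processing-flow">
    <http:listener config-ref="api-httpListenerConfig" path="/orders"/>

    <!-- Flow entry: one structured INFO event carrying the correlation ID,
         HTTP method, and path -->
    <logger level="INFO" message='#[output application/json --- {
        event: "flow_entry",
        correlationId: correlationId,
        "method": attributes.method,
        path: attributes.requestPath
    }]'/>

    <!-- rest of the flow -->
</flow>
```

The DataWeave expression in the message attribute emits one JSON object per event, which keeps the log entry machine-readable from the start.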


Structured Logs Win

We do not write plain text logs. We write structured logs in JSON.

Plain text:

2024-11-01 12:00:00 INFO Order received for customer 12345

Structured JSON:

{
  "timestamp": "2024-11-01T12:00:00Z",
  "level": "INFO",
  "correlationId": "abc-123",
  "flow": "order-processing-flow",
  "event": "order_received",
  "customerId": "12345",
  "orderId": "ORD-9876"
}

The plain text version is hard to search. The JSON version is machine-readable. We can filter, aggregate, and alert on any field.

We configure structured logging in our log4j2.xml file. We route all log output to stdout. Our infrastructure captures stdout and ships it to our log platform.
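As one way to wire this up, here is a sketch of a log4j2.xml that writes JSON to stdout using Log4j's JSON Template Layout. This layout requires the log4j-layout-template-json module on the classpath; the appender name is illustrative:

```xml
<!-- Sketch: log4j2.xml emitting structured JSON to stdout.
     Assumes the log4j-layout-template-json module is available. -->
<Configuration>
    <Appenders>
        <Console name="Stdout" target="SYSTEM_OUT">
            <!-- EcsLayout.json is one of the layout's bundled templates -->
            <JsonTemplateLayout eventTemplateUri="classpath:EcsLayout.json"/>
        </Console>
    </Appenders>
    <Loggers>
        <Root level="INFO">
            <AppenderRef ref="Stdout"/>
        </Root>
    </Loggers>
</Configuration>
```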

Correlation IDs Are Non-Negotiable

Every request that enters our Mule deployment must carry a correlation ID. We propagate this ID through every flow, every connector call, and every outbound HTTP request.

The correlation ID links every log entry for a single transaction. Without it, we cannot reconstruct what happened during a complex integration.

In MuleSoft, we use the built-in correlationId attribute on the Mule event. We log it on every entry. We pass it downstream in HTTP headers.
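A sketch of the outbound half in Mule 4 configuration XML. The header name and config reference here are illustrative choices, not fixed conventions:

```xml
<!-- Sketch: forwarding the Mule event's correlationId on an outbound call.
     "X-Correlation-ID" and the config-ref are illustrative names. -->
<http:request method="POST" config-ref="System_API_config" path="/payments">
    <http:headers><![CDATA[#[{
        "X-Correlation-ID": correlationId
    }]]]></http:headers>
</http:request>
```

The downstream API reads the header, logs it, and forwards it again, so the same ID appears in every system the transaction touches.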


Pillar 2: Metrics

What Metrics Are

A metric is a numeric measurement recorded at a point in time. We collect metrics continuously. We plot them over time.

Metrics tell us how our system is performing and whether performance is trending in the right direction.


Metrics That Matter for Mule Deployments

We track metrics at three levels.


JVM Metrics — the health of our Java runtime:

Metric                Why It Matters
------                --------------
Heap used / heap max  Memory pressure. High heap triggers GC storms.
GC pause duration     Long pauses stall our flows.
Thread count          Thread pool exhaustion blocks message processing.
CPU utilization       Sustained high CPU signals a processing bottleneck.


Mule Runtime Metrics — the health of our runtime:

Metric                          Why It Matters
------                          --------------
Active flows                    Confirms our applications deployed correctly.
Worker thread pool utilization  Shows capacity headroom.
Message processing rate         Our baseline throughput.


API and Flow Metrics — the health of our integrations:

Metric                         Why It Matters
------                         --------------
Request rate                   Volume of traffic hitting each API.
Error rate                     Percentage of requests ending in failure.
Response time (p50, p95, p99)  Latency that our consumers experience.
Timeout rate                   Connections that exceeded our time limit.


The Golden Signals

Google's Site Reliability Engineering discipline defines four golden signals. We apply them to every Mule API we deploy:

  1. Latency — How long does it take to handle a request?
  2. Traffic — How many requests per second are we handling?
  3. Errors — What fraction of requests are failing?
  4. Saturation — How full is our system? (CPU, memory, threads)

If we instrument nothing else, we instrument these four signals.
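To make the latency signal concrete, here is a small self-contained sketch of the nearest-rank method for computing p50/p95/p99 from a window of response times. The sample values are invented for illustration; a real deployment would get these from its metrics library:

```java
import java.util.Arrays;

// Sketch: computing latency percentiles (the "latency" golden signal)
// over a window of response times, using the nearest-rank method.
public class LatencyPercentiles {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] sortedMillis, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMillis.length);
        return sortedMillis[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // Illustrative response times in milliseconds; note the one outlier
        long[] responseTimesMs = {12, 15, 18, 22, 25, 30, 41, 55, 90, 480};
        Arrays.sort(responseTimesMs);

        System.out.println("p50 = " + percentile(responseTimesMs, 50) + "ms");
        System.out.println("p95 = " + percentile(responseTimesMs, 95) + "ms");
        System.out.println("p99 = " + percentile(responseTimesMs, 99) + "ms");
    }
}
```

Note how the single 480ms outlier dominates the tail percentiles while leaving the median untouched; this is why we track p95 and p99 rather than averages.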


Alerting on Metrics

Metrics without alerts are just history. We set alert thresholds on every golden signal. We alert on trends, not just thresholds. A p99 latency that doubles over two hours is a warning sign even if it has not crossed our threshold yet.
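As one concrete form such an alert can take, here is a sketch of a Prometheus alerting rule for the error-rate signal. The metric name, job label, and thresholds are illustrative and depend on what our exporter actually emits:

```yaml
# Sketch of a Prometheus alerting rule for the error-rate golden signal.
# Metric and job names are illustrative; adjust to the exporter's output.
groups:
  - name: mule-apis
    rules:
      - alert: OrdersApiHighErrorRate
        # Fire when more than 5% of requests fail over a 5-minute window,
        # sustained for 10 minutes.
        expr: |
          sum(rate(http_requests_total{job="orders-api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="orders-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
```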


Pillar 3: Traces

What Traces Are

A trace records the full path of a single request through our entire system. A trace is made up of spans. Each span represents one unit of work — one flow execution, one HTTP call, one database query.

Traces tell us where a request went and how long each step took.


Why Traces Are Critical for MuleSoft

Our MuleSoft integrations rarely live in isolation. A single API request might:

  1. Hit our Experience API
  2. Call our Process API
  3. Fan out to two System APIs
  4. Call a third-party REST service
  5. Write to a database

Logs tell us each of those steps happened. Metrics tell us the p99 latency. But only a trace shows us the exact path, in order, with timing for every hop.

When our SLA is breached, we use the trace to identify which hop caused the delay. Without traces, we guess.


Distributed Tracing in Practice

Distributed tracing requires every service in the chain to participate. We propagate trace context through HTTP headers using the W3C Trace Context standard. The two headers are:

  • traceparent — carries the trace ID and span ID
  • tracestate — carries vendor-specific metadata

In MuleSoft, we read these headers on inbound requests. We pass them on every outbound HTTP call. We generate a new span for each connector call and flow execution.

Traces are sent to the Anypoint control plane and made available in Anypoint Monitoring. They can also be exported to our preferred tracing backend (Splunk, ELK, Datadog, and others) using Telemetry Exporters in Anypoint.
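The traceparent value packs four hyphen-separated hex fields: version, trace ID, parent span ID, and flags. A minimal parsing sketch (plain Java, not MuleSoft-specific; the example header value follows the W3C specification's format):

```java
// Sketch: parsing a W3C traceparent header into its four fields.
// Format: version "-" trace-id "-" parent-id "-" trace-flags
public class TraceParent {
    final String version;   // 2 hex chars
    final String traceId;   // 32 hex chars, identifies the whole trace
    final String parentId;  // 16 hex chars, the caller's span ID
    final String flags;     // 2 hex chars, e.g. 01 = sampled

    TraceParent(String version, String traceId, String parentId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        return new TraceParent(parts[0], parts[1], parts[2], parts[3]);
    }

    public static void main(String[] args) {
        TraceParent tp = TraceParent.parse(
            "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("traceId  = " + tp.traceId);
        System.out.println("parentId = " + tp.parentId);
    }
}
```

When we create our own span, we keep the trace ID, substitute our new span's ID as the parent ID, and send the rebuilt header downstream.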

Reading a Trace

A trace waterfall looks like this:

[Experience API - /orders POST]            0ms → 210ms
├─ [Process API - order-process-flow]      5ms → 195ms
│  ├─ [System API - inventory check]       6ms → 45ms
│  ├─ [System API - payment charge]       50ms → 185ms  ← bottleneck
│  └─ [DB write - orders table]          186ms → 193ms

The payment charge span took 135ms. It is our bottleneck. Without the trace, we would have seen a slow p99 in our metrics and spent time searching logs. The trace gave us the answer in seconds.
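Finding the bottleneck is just a scan over span durations. A sketch using the leaf spans from the waterfall above (names and millisecond offsets copied from the example):

```java
import java.util.Comparator;
import java.util.List;

// Sketch: picking the bottleneck leaf span from a trace by duration.
public class SlowestSpan {
    // A span with start/end offsets in milliseconds from trace start.
    record Span(String name, long startMs, long endMs) {
        long duration() { return endMs - startMs; }
    }

    // The leaf span with the longest duration is the bottleneck candidate.
    static Span slowestLeaf(List<Span> leaves) {
        return leaves.stream()
                .max(Comparator.comparingLong(Span::duration))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Span> leaves = List.of(
                new Span("inventory check", 6, 45),
                new Span("payment charge", 50, 185),
                new Span("db write", 186, 193));
        Span bottleneck = slowestLeaf(leaves);
        System.out.println(bottleneck.name() + " took "
                + bottleneck.duration() + "ms");
        // prints: payment charge took 135ms
    }
}
```

Tracing backends run this kind of comparison for us and highlight the slow span in the waterfall view; the point is that the answer falls out of the span timings directly.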


Putting the Three Pillars Together

Each pillar has a job. They work as a team. The workflow looks like this:

  1. Our alert fires. Error rate on /orders spiked. (metric)
  2. We open our log platform. We filter by flow and time window. We see NullPointerException errors in order-process-flow. (log)
  3. We grab the correlation ID from a failing log entry. We paste it into our tracing tool. We see the full request path. The payment System API returned a malformed response. (trace)
  4. We fix the System API. The error rate drops.


This is the observability loop. Metrics alert us. Logs tell us what went wrong. Traces tell us where it went wrong.


A Practical Observability Stack for MuleSoft

Here is a reference stack we can adopt. These are all open standards and common tools.

Concern                  Tool                        Notes
-------                  ----                        -----
Log shipping             Fluent Bit                  Lightweight. Runs as a sidecar. Reads stdout.
Log storage              OpenSearch / Elasticsearch  Full-text search across all log fields.
Metrics collection       Prometheus                  Scrapes JVM and custom metrics via a /metrics endpoint.
Metrics visualization    Grafana                     Dashboards for all golden signals.
Tracing backend          Grafana Tempo               Stores and queries distributed traces.
Tracing instrumentation  OpenTelemetry               Vendor-neutral. Instruments our Mule flows.

Observability Is a Design Decision

We do not add observability after deployment. We design it in from the start. Every API we build answers these questions before it goes live:

  • What are the correlation IDs, and how do we propagate them?
  • What log events do we emit, and at what level?
  • What metrics do we expose, and what are the alert thresholds?
  • How do we inject and forward trace context?

When we ask these questions up front, we deploy with confidence. We know that when something goes wrong — and something always goes wrong — we have the data to find it fast.


Summary

We covered the three pillars of observability and how they apply to our Mule deployments:

  • Logs record discrete events. We write structured JSON logs. We always propagate correlation IDs.
  • Metrics measure performance over time. We track the four golden signals for every API we deploy.
  • Traces map the full path of a request. We use distributed tracing to find bottlenecks across our API layers.

The three pillars work together. Metrics trigger the investigation. Logs tell us what broke. Traces show us where it broke.

In the next post in this series, we will wire these three pillars into a real Mule Runtime deployment — with a working OpenTelemetry configuration and Grafana dashboards for the golden signals.
