Our Mule Runtime is running. Our APIs are deployed. Our flows are live. But do we know what is happening inside them right now?
Observability answers that question. It gives us the ability to understand our system from the outside — by examining what our system produces. Without it, we operate blind. We wait for users to report problems. We scramble to find root causes after damage is done. We can do better.
In this post, we build an observability strategy for our Mule deployments using three pillars: logs, metrics, and traces. Each pillar answers a different question. Together, they give us full visibility into every integration we run.
What Is Observability?
Observability is not a tool. It is a property of our system. A system is observable when we can determine its internal state by examining its external outputs. The more our system tells us about itself, the more observable it is.
The three pillars are the outputs we collect:
| Pillar | Question It Answers |
|---|---|
| Logs | What happened? |
| Metrics | How is the system performing over time? |
| Traces | Where did a request go and how long did each step take? |
We need all three. A single pillar gives us a partial picture. Together, they give us the full story.
Pillar 1: Logs
What Logs Are
A log is a timestamped record of a discrete event. When our Mule flow starts, logs fire. When a connector throws an error, logs fire. When a message is transformed, we can make logs fire.
Logs tell us what happened and when it happened.
What We Log in MuleSoft
We log at meaningful points in our flows. Here is a practical checklist:
- Flow entry — record the correlation ID, HTTP method, and path
- Connector calls — record the target system, operation, and outcome
- Error handlers — record the error type, message, and full stack trace
- Flow exit — record the status code and total elapsed time
We use the Logger component in Anypoint Studio. We set the log level appropriately:
| Level | When We Use It |
|---|---|
| ERROR | Unhandled failures. Pages the on-call team. |
| WARN | Unexpected but recoverable conditions. |
| INFO | Normal flow milestones. Startup, shutdown, key transitions. |
| DEBUG | Detailed data for troubleshooting. Never on in production. |
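As an illustration, a Logger step at flow entry might look like the sketch below. The flow name, listener config name, and logging category are our own placeholders, not fixed MuleSoft conventions:

```xml
<flow name="order-processing-flow">
  <http:listener config-ref="api-httpListenerConfig" path="/orders"/>

  <!-- Flow entry: correlation ID, HTTP method, and path at INFO -->
  <logger level="INFO"
          category="com.example.orders"
          message='#[output application/json --- {
            event: "flow_entry",
            correlationId: correlationId,
            method: attributes.method,
            path: attributes.requestPath
          }]'/>

  <!-- ... connector calls, transformations, error handlers ... -->
</flow>
```

Emitting the message as a DataWeave expression keeps the entry machine-readable, which pays off once structured logging (below) is in place.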
Structured Logs Win
We do not write plain text logs. We write structured logs in JSON.
Plain text:

```
2024-11-01 12:00:00 INFO Order received for customer 12345
```

Structured JSON:

```json
{
  "timestamp": "2024-11-01T12:00:00Z",
  "level": "INFO",
  "correlationId": "abc-123",
  "flow": "order-processing-flow",
  "event": "order_received",
  "customerId": "12345",
  "orderId": "ORD-9876"
}
```

The plain text version is hard to search. The JSON version is machine-readable. We can filter, aggregate, and alert on any field.
We configure structured logging in our log4j2.xml file. We route all log output to stdout. Our infrastructure captures stdout and ships it to our log platform.
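A minimal log4j2.xml along these lines routes JSON output to stdout. This is a sketch: it assumes the log4j-layout-template-json module is on the classpath, and the appender name is our own choice:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
  <Appenders>
    <!-- JSON events to stdout; our infrastructure ships stdout onward -->
    <Console name="Stdout" target="SYSTEM_OUT">
      <!-- EcsLayout.json is one of the templates bundled with
           log4j-layout-template-json; any JSON event template works -->
      <JsonTemplateLayout eventTemplateUri="classpath:EcsLayout.json"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="INFO">
      <AppenderRef ref="Stdout"/>
    </Root>
  </Loggers>
</Configuration>
```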
Correlation IDs Are Non-Negotiable
Every request that enters our Mule deployment must carry a correlation ID. We propagate this ID through every flow, every connector call, and every outbound HTTP request.
The correlation ID links every log entry for a single transaction. Without it, we cannot reconstruct what happened during a complex integration.
In MuleSoft, we use the built-in correlationId attribute on the Mule event. We log it on every entry. We pass it downstream in HTTP headers.
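In Mule 4 that looks roughly like the sketch below: log the built-in correlationId, then forward it on the outbound call. The header name X-Correlation-ID and the request config name are our own choices, not a Mule standard:

```xml
<!-- Log the event's built-in correlation ID at flow entry -->
<logger level="INFO" message='#["correlationId=" ++ correlationId]'/>

<!-- Forward it downstream so the next hop logs the same ID -->
<http:request method="POST" config-ref="systemApiConfig" path="/inventory">
  <http:headers>#[{'X-Correlation-ID': correlationId}]</http:headers>
</http:request>
```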
Pillar 2: Metrics
What Metrics Are
A metric is a numeric measurement recorded at a point in time. We collect metrics continuously. We plot them over time.
Metrics tell us how our system is performing and whether performance is trending in the right direction.
Metrics That Matter for Mule Deployments
We track metrics at three levels.
JVM Metrics — the health of our Java runtime:
| Metric | Why It Matters |
|---|---|
| Heap used / heap max | Memory pressure. High heap triggers GC storms. |
| GC pause duration | Long pauses stall our flows. |
| Thread count | Thread pool exhaustion blocks message processing. |
| CPU utilization | Sustained high CPU signals a processing bottleneck. |
Mule Runtime Metrics — the health of our runtime:
| Metric | Why It Matters |
|---|---|
| Active flows | Confirms our applications deployed correctly. |
| Worker thread pool utilization | Shows capacity headroom. |
| Message processing rate | Our baseline throughput. |
API and Flow Metrics — the health of our integrations:
| Metric | Why It Matters |
|---|---|
| Request rate | Volume of traffic hitting each API. |
| Error rate | Percentage of requests ending in failure. |
| Response time (p50, p95, p99) | Latency that our consumers experience. |
| Timeout rate | Connections that exceeded our time limit. |
The Golden Signals
Google's Site Reliability Engineering discipline defines four golden signals. We apply them to every Mule API we deploy:
- Latency — How long does it take to handle a request?
- Traffic — How many requests per second are we handling?
- Errors — What fraction of requests are failing?
- Saturation — How full is our system? (CPU, memory, threads)
If we instrument nothing else, we instrument these four signals.
Alerting on Metrics
Metrics without alerts are just history. We set alert thresholds on every golden signal. We alert on trends, not just thresholds. A p99 latency that doubles over two hours is a warning sign even if it has not crossed our threshold yet.
Pillar 3: Traces
What Traces Are
A trace records the full path of a single request through our entire system. A trace is made up of spans. Each span represents one unit of work — one flow execution, one HTTP call, one database query.
Traces tell us where a request went and how long each step took.
Why Traces Are Critical for MuleSoft
Our MuleSoft integrations rarely live in isolation. A single API request might:
- Hit our Experience API
- Call our Process API
- Fan out to two System APIs
- Call a third-party REST service
- Write to a database
Logs tell us each of those steps happened. Metrics tell us the p99 latency. But only a trace shows us the exact path, in order, with timing for every hop.
When our SLA is breached, we use the trace to identify which hop caused the delay. Without traces, we guess.
Distributed Tracing in Practice
Distributed tracing requires every service in the chain to participate. We propagate trace context through HTTP headers using the W3C Trace Context standard. The two headers are:
- `traceparent` — carries the trace ID and span ID
- `tracestate` — carries vendor-specific metadata
In MuleSoft, we read these headers on inbound requests. We pass them on every outbound HTTP call. We generate a new span for each connector call and flow execution.
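Assuming no auto-instrumentation is in place, manually passing the W3C headers from an inbound listener to an outbound call could be sketched like this. The config name is illustrative, and in practice we would usually delegate this to an OpenTelemetry agent or module rather than hand-wiring it:

```xml
<http:request method="GET" config-ref="downstreamConfig" path="/payments">
  <!-- Pass through W3C Trace Context headers only if the caller sent them -->
  <http:headers>#[
    {
      'traceparent': attributes.headers['traceparent'],
      'tracestate':  attributes.headers['tracestate']
    } filterObject ((value) -> value != null)
  ]</http:headers>
</http:request>
```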
Traces are sent to the Anypoint control plane and surfaced in Anypoint Monitoring. They can also be exported through Anypoint Telemetry Exporters to our preferred tracing backend (Splunk, ELK, Datadog, and others).
Reading a Trace
A trace waterfall looks like this:

```
[Experience API - /orders POST]           0ms → 210ms
├─ [Process API - order-process-flow]     5ms → 195ms
│  ├─ [System API - inventory check]      6ms → 45ms
│  ├─ [System API - payment charge]       50ms → 185ms  ← bottleneck
│  └─ [DB write - orders table]           186ms → 193ms
```

The payment charge span took 135ms. It is our bottleneck. Without the trace, we would have seen a slow p99 in our metrics and spent time searching logs. The trace gave us the answer in seconds.
Putting the Three Pillars Together
Each pillar has a job. They work as a team. The workflow looks like this:
- Our alert fires. Error rate on `/orders` spiked. (metric)
- We open our log platform. We filter by flow and time window. We see `NullPointerException` errors in `order-process-flow`. (log)
- We grab the correlation ID from a failing log entry. We paste it into our tracing tool. We see the full request path. The payment System API returned a malformed response. (trace)
- We fix the System API. The error rate drops.
This is the observability loop. Metrics alert us. Logs tell us what went wrong. Traces tell us where it went wrong.
A Practical Observability Stack for MuleSoft
Here is a reference stack we can adopt. These are all open standards and common tools.
| Concern | Tool | Notes |
|---|---|---|
| Log shipping | Fluent Bit | Lightweight. Runs as a sidecar. Reads stdout. |
| Log storage | OpenSearch / Elasticsearch | Full-text search across all log fields. |
| Metrics collection | Prometheus | Scrapes JVM and custom metrics via a `/metrics` endpoint. |
| Metrics visualization | Grafana | Dashboards for all golden signals. |
| Tracing backend | Grafana Tempo | Stores and queries distributed traces. |
| Tracing instrumentation | OpenTelemetry | Vendor-neutral. Instruments our Mule flows. |
Observability Is a Design Decision
We do not add observability after deployment. We design it in from the start. Every API we build answers these questions before it goes live:
- What are the correlation IDs, and how do we propagate them?
- What log events do we emit, and at what level?
- What metrics do we expose, and what are the alert thresholds?
- How do we inject and forward trace context?
When we ask these questions up front, we deploy with confidence. We know that when something goes wrong — and something always goes wrong — we have the data to find it fast.
Summary
We covered the three pillars of observability and how they apply to our Mule deployments:
- Logs record discrete events. We write structured JSON logs. We always propagate correlation IDs.
- Metrics measure performance over time. We track the four golden signals for every API we deploy.
- Traces map the full path of a request. We use distributed tracing to find bottlenecks across our API layers.
The three pillars work together. Metrics trigger the investigation. Logs tell us what broke. Traces show us where it broke.
In the next post in this series, we will wire these three pillars into a real Mule Runtime deployment — with a working OpenTelemetry configuration and Grafana dashboards for the golden signals.