Our Mule Runtime is running. Our APIs are deployed. Our flows are live. But do we know what is happening inside them right now?
Observability answers that question. It gives us the ability to understand our system from the outside — by examining what our system produces. Without it, we operate blind. We wait for users to report problems. We scramble to find root causes after damage is done. We can do better.
In this post, we build an observability strategy for our Mule deployments using three pillars: logs, metrics, and traces. Each pillar answers a different question. Together, they give us full visibility into every integration we run.
What Is Observability?
Observability is not a tool. It is a property of our system. A system is observable when we can determine its internal state by examining its external outputs. The more our system tells us about itself, the more observable it is.
The three pillars are the outputs we collect:
| Pillar | Question It Answers |
|---|---|
| Logs | What happened? |
| Metrics | How is the system performing over time? |
| Traces | Where did a request go and how long did each step take? |
We need all three. A single pillar gives us a partial picture. Together, they give us the full story.
Pillar 1: Logs
What Logs Are
A log is a timestamped record of a discrete event. When our Mule flow starts, logs fire. When a connector throws an error, logs fire. When a message is transformed, we can make logs fire.
Logs tell us what happened and when it happened.
What We Log in MuleSoft
We log at meaningful points in our flows. Here is a practical checklist:
- Flow entry — record the correlation ID, HTTP method, and path
- Connector calls — record the target system, operation, and outcome
- Error handlers — record the error type, message, and full stack trace
- Flow exit — record the status code and total elapsed time
We use the Logger component in Anypoint Studio. We set the log level appropriately:
| Level | When We Use It |
|---|---|
| ERROR | Unhandled failures. Pages the on-call team. |
| WARN | Unexpected but recoverable conditions. |
| INFO | Normal flow milestones. Startup, shutdown, key transitions. |
| DEBUG | Detailed data for troubleshooting. Never on in production. |
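As an illustration, a Logger step at flow entry might look like the sketch below. The flow name, listener config name, and logging category are our own placeholders, not fixed MuleSoft conventions:

```xml
<flow name="order-processing-flow">
  <http:listener config-ref="api-httpListenerConfig" path="/orders"/>

  <!-- Flow entry: correlation ID, HTTP method, and path at INFO -->
  <logger level="INFO"
          category="com.example.orders"
          message='#[output application/json --- {
            event: "flow_entry",
            correlationId: correlationId,
            method: attributes.method,
            path: attributes.requestPath
          }]'/>

  <!-- ... connector calls, transformations, error handlers ... -->
</flow>
```

Emitting the message as a DataWeave expression keeps the entry machine-readable, which pays off once structured logging (below) is in place.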
Structured Logs Win
We do not write plain text logs. We write structured logs in JSON.
Plain text:

```
2024-11-01 12:00:00 INFO Order received for customer 12345
```

Structured JSON:

```json
{
  "timestamp": "2024-11-01T12:00:00Z",
  "level": "INFO",
  "correlationId": "abc-123",
  "flow": "order-processing-flow",
  "event": "order_received",
  "customerId": "12345",
  "orderId": "ORD-9876"
}
```

The plain text version is hard to search. The JSON version is machine-readable. We can filter, aggregate, and alert on any field.
We configure structured logging in our log4j2.xml file. We route all log output to stdout. Our infrastructure captures stdout and ships it to our log platform.
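A minimal log4j2.xml along these lines routes JSON output to stdout. This is a sketch: it assumes the log4j-layout-template-json module is on the classpath, and the appender name is our own choice:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
  <Appenders>
    <!-- JSON events to stdout; our infrastructure ships stdout onward -->
    <Console name="Stdout" target="SYSTEM_OUT">
      <!-- EcsLayout.json is one of the templates bundled with
           log4j-layout-template-json; any JSON event template works -->
      <JsonTemplateLayout eventTemplateUri="classpath:EcsLayout.json"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="INFO">
      <AppenderRef ref="Stdout"/>
    </Root>
  </Loggers>
</Configuration>
```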
Correlation IDs Are Non-Negotiable
Every request that enters our Mule deployment must carry a correlation ID. We propagate this ID through every flow, every connector call, and every outbound HTTP request.
The correlation ID links every log entry for a single transaction. Without it, we cannot reconstruct what happened during a complex integration.
In MuleSoft, we use the built-in correlationId attribute on the Mule event. We log it on every entry. We pass it downstream in HTTP headers.
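In Mule 4 that looks roughly like the sketch below: log the built-in correlationId, then forward it on the outbound call. The header name X-Correlation-ID and the request config name are our own choices, not a Mule standard:

```xml
<!-- Log the event's built-in correlation ID at flow entry -->
<logger level="INFO" message='#["correlationId=" ++ correlationId]'/>

<!-- Forward it downstream so the next hop logs the same ID -->
<http:request method="POST" config-ref="systemApiConfig" path="/inventory">
  <http:headers>#[{'X-Correlation-ID': correlationId}]</http:headers>
</http:request>
```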
Pillar 2: Metrics
What Metrics Are
A metric is a numeric measurement recorded at a point in time. We collect metrics continuously. We plot them over time.
Metrics tell us how our system is performing and whether performance is trending in the right direction.
Metrics That Matter for Mule Deployments
We track metrics at three levels.
JVM Metrics — the health of our Java runtime:
| Metric | Why It Matters |
|---|---|
| Heap used / heap max | Memory pressure. High heap triggers GC storms. |
| GC pause duration | Long pauses stall our flows. |
| Thread count | Thread pool exhaustion blocks message processing. |
| CPU utilization | Sustained high CPU signals a processing bottleneck. |
Mule Runtime Metrics — the health of our runtime:
| Metric | Why It Matters |
|---|---|
| Active flows | Confirms our applications deployed correctly. |
| Worker thread pool utilization | Shows capacity headroom. |
| Message processing rate | Our baseline throughput. |
API and Flow Metrics — the health of our integrations:
| Metric | Why It Matters |
|---|---|
| Request rate | Volume of traffic hitting each API. |
| Error rate | Percentage of requests ending in failure. |
| Response time (p50, p95, p99) | Latency that our consumers experience. |
| Timeout rate | Connections that exceeded our time limit. |
The Golden Signals
Google's Site Reliability Engineering discipline defines four golden signals. We apply them to every Mule API we deploy:
- Latency — How long does it take to handle a request?
- Traffic — How many requests per second are we handling?
- Errors — What fraction of requests are failing?
- Saturation — How full is our system? (CPU, memory, threads)
If we instrument nothing else, we instrument these four signals.
Alerting on Metrics
Metrics without alerts are just history. We set alert thresholds on every golden signal. We alert on trends, not just thresholds. A p99 latency that doubles over two hours is a warning sign even if it has not crossed our threshold yet.
Pillar 3: Traces
What Traces Are
A trace records the full path of a single request through our entire system. A trace is made up of spans. Each span represents one unit of work — one flow execution, one HTTP call, one database query.
Traces tell us where a request went and how long each step took.
Why Traces Are Critical for MuleSoft
Our MuleSoft integrations rarely live in isolation. A single API request might:
- Hit our Experience API
- Call our Process API
- Fan out to two System APIs
- Call a third-party REST service
- Write to a database
Logs tell us each of those steps happened. Metrics tell us the p99 latency. But only a trace shows us the exact path, in order, with timing for every hop.
When our SLA is breached, we use the trace to identify which hop caused the delay. Without traces, we guess.
Distributed Tracing in Practice
Distributed tracing requires every service in the chain to participate. We propagate trace context through HTTP headers using the W3C Trace Context standard. The two headers are:
- `traceparent` — carries the trace ID and span ID
- `tracestate` — carries vendor-specific metadata
In MuleSoft, we read these headers on inbound requests. We pass them on every outbound HTTP call. We generate a new span for each connector call and flow execution.
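Assuming no auto-instrumentation is in place, manually passing the W3C headers from an inbound listener to an outbound call could be sketched like this. The config name is illustrative, and in practice we would usually delegate this to an OpenTelemetry agent or module rather than hand-wiring it:

```xml
<http:request method="GET" config-ref="downstreamConfig" path="/payments">
  <!-- Pass through W3C Trace Context headers only if the caller sent them -->
  <http:headers>#[
    {
      'traceparent': attributes.headers['traceparent'],
      'tracestate':  attributes.headers['tracestate']
    } filterObject ((value) -> value != null)
  ]</http:headers>
</http:request>
```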
Traces are sent to the Anypoint control plane and surfaced in Anypoint Monitoring. They can also be exported through Anypoint Telemetry Exporters to our preferred tracing backend (Splunk, ELK, Datadog, and others).
Reading a Trace
A trace waterfall looks like this:

```
[Experience API - /orders POST]           0ms → 210ms
├─ [Process API - order-process-flow]     5ms → 195ms
│  ├─ [System API - inventory check]      6ms → 45ms
│  ├─ [System API - payment charge]       50ms → 185ms  ← bottleneck
│  └─ [DB write - orders table]           186ms → 193ms
```

The payment charge span took 135ms. It is our bottleneck. Without the trace, we would have seen a slow p99 in our metrics and spent time searching logs. The trace gave us the answer in seconds.
Putting the Three Pillars Together
Each pillar has a job. They work as a team. The workflow looks like this:
- Our alert fires. Error rate on `/orders` spiked. (metric)
- We open our log platform. We filter by flow and time window. We see `NullPointerException` errors in `order-process-flow`. (log)
- We grab the correlation ID from a failing log entry. We paste it into our tracing tool. We see the full request path. The payment System API returned a malformed response. (trace)
- We fix the System API. The error rate drops.
This is the observability loop. Metrics alert us. Logs tell us what went wrong. Traces tell us where it went wrong.
A Practical Observability Stack for MuleSoft
Here is a reference stack we can adopt. These are all open standards and common tools.
| Concern | Tool | Notes |
|---|---|---|
| Log shipping | Fluent Bit | Lightweight. Runs as a sidecar. Reads stdout. |
| Log storage | OpenSearch / Elasticsearch | Full-text search across all log fields. |
| Metrics collection | Prometheus | Scrapes JVM and custom metrics via a `/metrics` endpoint. |
| Metrics visualization | Grafana | Dashboards for all golden signals. |
| Tracing backend | Grafana Tempo | Stores and queries distributed traces. |
| Tracing instrumentation | OpenTelemetry | Vendor-neutral. Instruments our Mule flows. |
Observability Is a Design Decision
We do not add observability after deployment. We design it in from the start. Every API we build answers these questions before it goes live:
- What are the correlation IDs, and how do we propagate them?
- What log events do we emit, and at what level?
- What metrics do we expose, and what are the alert thresholds?
- How do we inject and forward trace context?
When we ask these questions up front, we deploy with confidence. We know that when something goes wrong — and something always goes wrong — we have the data to find it fast.
Summary
We covered the three pillars of observability and how they apply to our Mule deployments:
- Logs record discrete events. We write structured JSON logs. We always propagate correlation IDs.
- Metrics measure performance over time. We track the four golden signals for every API we deploy.
- Traces map the full path of a request. We use distributed tracing to find bottlenecks across our API layers.
The three pillars work together. Metrics trigger the investigation. Logs tell us what broke. Traces show us where it broke.
In the next post in this series, we will wire these three pillars into a real Mule Runtime deployment — with a working OpenTelemetry configuration and Grafana dashboards for the golden signals.