Connection & Channel Exhaustion: RabbitMQ Incident Guide

1. Symptom

A CloudWatch alarm fires on FileDescriptorsUsed approaching the ulimit, or ConnectionCount shows continuous, unbounded growth instead of a stable plateau. Around the same time, you may see reports like:

“We’re getting Connection refused / Channel closed errors trying to publish to RabbitMQ”, from a team whose app has nothing to do with the original offender.

That last part is the key signature of this playbook: connection/channel exhaustion is a broker-wide resource limit problem. Once the broker’s OS file descriptor limit (or a configured connection_max/channel_max) is hit, every app trying to open a new connection or channel starts failing, not just the one causing the leak. This makes it feel like “the broker is down” even though the cluster itself is healthy.

Recall from Core Concepts: a Connection is a TCP connection to the broker, and a Channel is a lightweight multiplexed virtual connection inside one Connection (like an HTTP/2 stream). Each Connection holds a broker-side file descriptor (its socket) plus memory; each Channel adds broker-side memory and processes on top of the Connection that carries it, but no socket of its own. Spring’s CachingConnectionFactory is designed to keep a small, stable number of each, reused across your whole app. This playbook is about what happens when something bypasses that and creates new ones constantly instead.

2. Likely Causes

Broker-side

Cause	How it manifests
OS file descriptor `ulimit` set too low for the node’s actual connection load	Node hits the ceiling well before any single app looks obviously abusive: `FileDescriptorsUsed` tracks `ulimit` closely across all apps combined
No `channel_max` configured on the broker	A misbehaving client can open an effectively unbounded number of channels on a single connection with nothing to stop it; channels do not hold sockets of their own, so the pressure shows up as broker memory growth rather than file descriptors
No `connection_max` configured on the broker	Same idea at the connection level: nothing rejects excessive connection counts from one source before the whole node runs out of headroom

App-side (Spring Boot): the common real-world case

Cause	How it manifests
Manually calling `new ConnectionFactory().newConnection()` per request instead of injecting the shared Spring-managed `ConnectionFactory`/`RabbitTemplate` bean	Every HTTP request or message handled creates a brand-new TCP connection to the broker that’s never reused and often never closed
Creating a raw `Channel` via `connection.createChannel()` per message without closing/returning it	Channel count climbs steadily even if the connection count stays flat: each channel still holds broker-side resources
Connection/channel leak from an exception path that skips `channel.close()`	Looks fine under happy-path testing; only leaks under real traffic when the failure path actually gets hit repeatedly
A batch job or scheduled task opening a fresh connection per item in a loop	Short, sharp spikes in connection count correlated with the batch job’s schedule (e.g., every night at 2am)
`spring.rabbitmq.cache.channel.size` undersized for actual concurrency	Channels churn (created and closed rapidly) rather than being reused from the cache: shows up as elevated rate of channel creation/closure in the Management UI, not necessarily a runaway total count
Channel checkout timeouts (`spring.rabbitmq.cache.channel.checkout-timeout`) too aggressive under load	With a checkout timeout set, the channel cache size becomes a hard cap; threads that cannot obtain a channel in time get an exception, and retry logic around that failure compounds the load

The broker-side causes are almost always about headroom (a limit set too low for legitimate load). The app-side causes are almost always about code creating far more connections/channels than are legitimately needed. In practice, you’ll spend most of your time on the app side. Misusing Connections/Channels is a very common bug for developers new to messaging, because the pattern that works for a REST client (open a connection, do one thing, close it) is exactly wrong here.
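The exception-path leak from the table above is easy to miss in review, so here is a minimal sketch of it. `TrackedChannel` and `ChannelLeakDemo` are hypothetical stand-ins that only count open handles; they are not the RabbitMQ client. The same try-with-resources shape should carry over to the real client, assuming a recent Java client version in which `Channel` is `AutoCloseable`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical stand-in for a broker channel; it only tracks how many are open. */
class TrackedChannel implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();

    TrackedChannel() {
        OPEN.incrementAndGet();
    }

    /** Simulates a publish that fails for some payloads. */
    void publish(String body) {
        if (body.isEmpty()) {
            throw new IllegalArgumentException("empty payload");
        }
    }

    @Override
    public void close() {
        OPEN.decrementAndGet();
    }
}

class ChannelLeakDemo {

    /** Leaky: close() is skipped whenever publish() throws. */
    static void publishLeaky(String body) {
        TrackedChannel channel = new TrackedChannel();
        channel.publish(body);
        channel.close(); // never reached on the failure path
    }

    /** Safe: try-with-resources closes the channel on every path. */
    static void publishSafe(String body) {
        try (TrackedChannel channel = new TrackedChannel()) {
            channel.publish(body);
        }
    }

    /** Sends n failing messages through the publisher and returns how many channels stay open. */
    static int openAfterFailures(Consumer<String> publisher, int n) {
        TrackedChannel.OPEN.set(0);
        for (int i = 0; i < n; i++) {
            try {
                publisher.accept(""); // an empty payload always fails in this sketch
            } catch (IllegalArgumentException expected) {
                // the caller logs and moves on, as request handlers usually do
            }
        }
        return TrackedChannel.OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + openAfterFailures(ChannelLeakDemo::publishLeaky, 50)); // 50 left open
        System.out.println("safe:  " + openAfterFailures(ChannelLeakDemo::publishSafe, 50));  // 0 left open
    }
}
```

Closing correctly only fixes the leak. Opening a channel per message is still churn, so the longer-term fix remains the shared Spring-managed bean.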

3. Diagnostic Steps

Work top to bottom: the cheapest, fastest checks come first.

Step 1: Check the Management UI → Connections and Channels tabs. Note the total count on each, and reload after 10-15 seconds. A stable/flat count under normal traffic is healthy. A count that visibly climbs between reloads is churn or a leak in progress.
Step 2: Confirm churn with the CLI, not just a snapshot. A single list_connections call only tells you the count right now. Run it twice a few seconds apart (or watch it with watch) to see the trend:
```
watch -n 2 'rabbitmqctl list_connections name peer_host state | wc -l'
```
The line count includes any header lines rabbitmqctl prints, so read the trend rather than the exact value. A steadily increasing number confirms active churn rather than a one-time spike that’s already leveled off.
Step 3: Check CloudWatch (or OS-level) file descriptor usage on the broker nodes (FileDescriptorsUsed vs. the node’s ulimit). If this is climbing in step with the connection count, you’re heading toward a broker-wide outage, not just a nuisance for one app.
Step 4: Identify the offending app using peer_host (source IP) and name (which includes the connecting host/port) from list_connections. If Spring’s connection naming is configured, the connection name itself may include the app/instance identifier; compare it against your service inventory to find the owning team.
Step 5: Check that app’s code and config for the anti-pattern:
- Grep for manual client construction as a red flag:
```
grep -rn "new ConnectionFactory()" src/
grep -rn "createChannel()" src/
```
  Any hit outside of Spring’s own internals is suspect, because legitimate Spring AMQP usage almost never calls these directly.
- Check application.yml/application.properties for spring.rabbitmq.cache.channel.size and spring.rabbitmq.cache.connection.mode. A cache size left at a low default under high concurrency, or connection.mode: CONNECTION (which caches and hands out multiple connections, instead of multiplexing cached channels over one shared connection as the default CHANNEL mode does), can itself cause elevated churn or connection counts even without a code bug.
Step 6: Cross-check list_channels for the offending connection to see if the problem is connection-level, channel-level, or both:
```
rabbitmqctl list_channels connection consumer_count
```
Many channels with consumer_count = 0 hanging off a small number of connections often points to publisher-side channel-per-message code rather than a consumer misconfiguration.
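When there are hundreds of rows, the peer_host check is quicker as a tally than by eye. The sketch below groups rows by peer_host and sorts the busiest source first. It assumes the output of `rabbitmqctl list_connections name peer_host state` has been captured as tab-separated lines; `PeerHostCounter` and the sample rows are hypothetical.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PeerHostCounter {

    /** Counts connections per peer_host from tab-separated "name, peer_host, state" rows. */
    static Map<String, Long> countByPeerHost(List<String> lines) {
        Map<String, Long> counts = lines.stream()
                .map(line -> line.split("\t"))
                .filter(cols -> cols.length >= 3)                     // skips banners and blank lines
                .filter(cols -> !cols[1].trim().equals("peer_host")) // skips the column header row
                .collect(Collectors.groupingBy(cols -> cols[1].trim(), Collectors.counting()));

        // Sort descending so the likely offender comes first.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Hypothetical captured output; real rows would come from the CLI.
        List<String> sample = List.of(
                "Listing connections ...",
                "name\tpeer_host\tstate",
                "10.0.1.23:54021 -> 10.0.2.10\t10.0.1.23\trunning",
                "10.0.1.55:61010 -> 10.0.2.10\t10.0.1.55\trunning",
                "10.0.1.55:61011 -> 10.0.2.10\t10.0.1.55\trunning",
                "10.0.1.55:61012 -> 10.0.2.10\t10.0.1.55\trunning");
        countByPeerHost(sample).forEach((host, n) -> System.out.println(host + " " + n));
    }
}
```

One host holding most of the connections is the usual signature of a single leaking app instance. An even spread across many hosts points more toward a headroom problem.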

Step	Question it answers	Typical time cost
1. Management UI Connections/Channels	Is there a problem at all, and roughly how big?	seconds
2. `list_connections` watched over time	Is this active churn or a settled spike?	30 sec - 1 min
3. CloudWatch / OS fd usage	How close are we to a broker-wide failure?	1-2 min
4. `peer_host` / connection name	Who owns this?	1-2 min
5. Code/config grep	Is this a code bug or a cache-sizing issue?	3-5 min (needs repo access)
6. `list_channels`	Connection-level or channel-level leak?	1 min
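The "active churn or settled spike" question can be made mechanical once you have a handful of counts sampled at a fixed interval. The following is a rough sketch: `ConnectionTrend`, its tolerance parameter, and the midpoint comparison are illustrative assumptions, not part of any RabbitMQ tooling.

```java
class ConnectionTrend {

    enum Trend { STABLE, CLIMBING, SETTLED_SPIKE }

    /**
     * Classifies evenly spaced connection-count samples.
     * Growth since the midpoint means still climbing; growth only before it means a spike that leveled off.
     */
    static Trend classify(int[] samples, int tolerance) {
        if (samples.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        int first = samples[0];
        int middle = samples[(samples.length - 1) / 2];
        int last = samples[samples.length - 1];

        if (last - middle > tolerance) {
            return Trend.CLIMBING;
        }
        if (last - first > tolerance) {
            return Trend.SETTLED_SPIKE;
        }
        return Trend.STABLE;
    }

    public static void main(String[] args) {
        System.out.println(classify(new int[] {10, 10, 11, 10, 10, 10}, 2));      // STABLE
        System.out.println(classify(new int[] {10, 40, 80, 120, 160, 200}, 2));   // CLIMBING
        System.out.println(classify(new int[] {10, 60, 120, 120, 121, 120}, 2));  // SETTLED_SPIKE
    }
}
```

A jump that lands in the second half of a short window still reads as CLIMBING, so take a few more samples before treating the result as conclusive.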

4. Safe Remediations

Situation	Safe action
Offending app identified, clear code-level anti-pattern (per-message connection/channel creation, missed `close()`)	This is a code fix, not something support can patch live. Open a ticket/page to the owning app team with the specific evidence (connection name, growth rate, grep hits if you have repo access) so they can fix it and redeploy.
Leak is actively growing and threatens broker-wide impact before a code fix can land	As a stopgap, restart the offending app instance(s) to release its leaked connections/channels immediately.
Cache sizing issue only (`spring.rabbitmq.cache.channel.size` undersized, no actual leak)	Recommend the owning team raise `spring.rabbitmq.cache.channel.size` to match their real concurrency needs: a config change they own, not a broker-side fix.
Broker’s file descriptor `ulimit` is confirmed too low for legitimate, otherwise-healthy load	Raising the ulimit is an infra change: only do this with escalation approval, not as a routine fix (see Section 5).

⚠️ Caution: restarting the offending app is only temporary relief. It frees the leaked connections/channels immediately, which can be the right call if a broker-wide outage is imminent, but the leak will recur at the same rate as soon as the app resumes normal traffic, because the underlying code is unchanged. Always pair a restart with a tracked follow-up to the owning team; never treat the restart itself as “resolved.”

⚠️ Caution: do not raise broker-side ulimit, connection_max, or channel_max unilaterally. These are cluster-wide infra settings requiring a config change and often a node restart to take effect. Treat this the same as any other broker topology change: escalation-approved only, never a routine response to an alert.

5. Escalation Trigger

Stop and page on-call engineering (per Escalation and Communication) if any of these are true:

New connection attempts are failing broker-wide, affecting apps that have nothing to do with the original offender. This means the file descriptor (or configured connection/channel) limit has already been hit.
The leak’s rate of growth is fast enough that it will hit the limit before the owning app team can realistically ship and deploy a fix (e.g., growing hundreds of connections per minute).
The fix requires a broker-side infra change (raising ulimit, setting connection_max/channel_max) rather than an app-side code/config fix.
You cannot identify the offending app from peer_host/connection name within a few minutes and the growth is ongoing. Escalate rather than let it run while you keep digging.
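The "will it hit the limit before a fix can land" trigger comes down to simple arithmetic, sketched below. `FdHeadroom`, the constant-growth assumption, and every number in the example are hypothetical. Substitute the descriptor count, limit, and growth rate you actually observe.

```java
class FdHeadroom {

    /** Minutes until usage reaches the limit at a constant growth rate; infinite if not growing. */
    static double minutesUntilLimit(long used, long limit, double growthPerMinute) {
        if (used >= limit) {
            return 0.0;
        }
        if (growthPerMinute <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (limit - used) / growthPerMinute;
    }

    /** True when the limit would be reached before the owning team could realistically ship a fix. */
    static boolean shouldEscalate(long used, long limit, double growthPerMinute, double minutesToFix) {
        return minutesUntilLimit(used, limit, growthPerMinute) < minutesToFix;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 40,000 of 65,536 descriptors used, growing by 300 per minute.
        double minutes = minutesUntilLimit(40_000, 65_536, 300.0);
        System.out.printf("about %.0f minutes of headroom%n", minutes);     // about 85 minutes
        System.out.println(shouldEscalate(40_000, 65_536, 300.0, 120.0));   // true: a 2-hour fix is too slow
    }
}
```

Leak rates are rarely constant in practice, so treat the estimate as an upper bound on how long you can wait.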

6. Relevant Commands/Queries

```
# Connection count and identity: run repeatedly to detect churn, not just a snapshot
rabbitmqctl list_connections name peer_host state

# Healthy example: small, stable count, one entry per app instance
name                          peer_host        state
10.0.1.23:54021 -> 10.0.2.10  10.0.1.23        running
10.0.1.24:54022 -> 10.0.2.10  10.0.1.24        running

# Alerting example: rapidly growing count, many short-lived connections from one source
name                          peer_host        state
10.0.1.55:61010 -> 10.0.2.10  10.0.1.55        running
10.0.1.55:61011 -> 10.0.2.10  10.0.1.55        running
10.0.1.55:61012 -> 10.0.2.10  10.0.1.55        running
... (hundreds more from the same peer_host, count climbing on every re-run)
```

```
# Channel-level detail per connection
rabbitmqctl list_channels connection consumer_count

# Healthy example: few channels, matching expected consumer/publisher pool size
connection                                  consumer_count
<[email protected]>                       2

# Alerting example: many channels, most with 0 consumers (publisher churn)
connection                                  consumer_count
<[email protected]>                       0
<[email protected]>                       0
<[email protected]>                       0
... (rapidly growing)
```

```
# Watch connection count trend live over a short interval
watch -n 2 'rabbitmqctl list_connections name peer_host state | wc -l'

# OS-level file descriptor usage on a broker node (via SSM Session Manager)
cat /proc/$(pgrep beam.smp)/limits | grep "open files"
ls /proc/$(pgrep beam.smp)/fd | wc -l

# Grep an app's codebase for the anti-pattern
grep -rn "new ConnectionFactory()" src/
grep -rn "createChannel()" src/
```

```
# Spring Boot cache settings worth checking (application.yml)
spring:
  rabbitmq:
    cache:
      channel:
        size: 25              # too low under high concurrency -> excess churn
        checkout-timeout: 0   # when > 0, size becomes a hard cap; too-short waits fail under load
      connection:
        mode: CHANNEL          # should be CHANNEL (default), not CONNECTION
```
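To see why an undersized channel cache produces churn rather than a runaway total, here is a simplified model of a channel cache with no checkout timeout: idle channels are reused, and channels returned to a full cache are physically closed. `ChannelCacheSim` is a hypothetical illustration of that behavior, not Spring’s implementation, and the workload of synchronized bursts is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ChannelCacheSim {
    final int cacheSize;
    final Deque<Integer> idle = new ArrayDeque<>();
    int created; // channels physically opened
    int closed;  // channels physically closed because the cache was full

    ChannelCacheSim(int cacheSize) {
        this.cacheSize = cacheSize;
    }

    /** Reuses an idle channel when one exists, otherwise opens a new one. */
    int checkout() {
        Integer cached = idle.poll();
        return cached != null ? cached : ++created;
    }

    /** Returns a channel to the cache, or closes it when the cache is already full. */
    void release(int channel) {
        if (idle.size() < cacheSize) {
            idle.push(channel);
        } else {
            closed++;
        }
    }

    /** Simulates rounds in which `concurrency` threads each hold one channel at the same time. */
    static ChannelCacheSim run(int cacheSize, int concurrency, int rounds) {
        ChannelCacheSim sim = new ChannelCacheSim(cacheSize);
        for (int r = 0; r < rounds; r++) {
            int[] inUse = new int[concurrency];
            for (int i = 0; i < concurrency; i++) {
                inUse[i] = sim.checkout();
            }
            for (int channel : inUse) {
                sim.release(channel);
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        ChannelCacheSim fits = run(25, 20, 100);
        System.out.println(fits.created + " created, " + fits.closed + " closed");   // 20 created, 0 closed
        ChannelCacheSim churns = run(25, 40, 100);
        System.out.println(churns.created + " created, " + churns.closed + " closed"); // 1525 created, 1500 closed
    }
}
```

In the second run the open-channel count never exceeds 40, yet the broker sees roughly 1,500 channel opens and closes. That matches the elevated creation/closure rate described in Section 2.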

7. Mini Practical

Reproduce the anti-pattern locally, watch it exhaust connections, then fix it.

Step 1: Start from the RabbitMQ container from Environment Setup (still running on localhost:5672), and have the Management UI open at localhost:15672 → Connections tab.

Step 2: Write the deliberately-bad publisher. This opens a brand-new Connection (and channel) on every single publish call and never closes it, which is the exact anti-pattern from Section 2:

```
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

public class BadPublisher {

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 200; i++) {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");
            factory.setUsername("guest");
            factory.setPassword("guest");

            Connection connection = factory.newConnection();  // new TCP connection every iteration
            Channel channel = connection.createChannel();     // new channel every iteration

            // Publish via the default exchange, routed by queue name
            channel.basicPublish("", "orders.created.queue", null,
                    ("bad message " + i).getBytes(StandardCharsets.UTF_8));

            // No channel.close(), no connection.close(), leaked on purpose
            System.out.println("Published " + i);
            Thread.sleep(100);
        }
    }
}
```

Step 3: Run it and watch the Management UI Connections tab (or run the CLI watch command) while it executes:

```
watch -n 1 'docker exec rabbitmq-crashcourse rabbitmqctl list_connections name peer_host state | wc -l'
```

You should see the connection count climb steadily, roughly one new connection every 100ms, and never come back down, reproducing exactly the “rapidly growing count” alerting pattern from Section 6. Leave it running long enough and you’ll see the same shape of growth that eventually exhausts a broker’s file descriptor limit at real production volume.

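Step 4: Let the loop finish without killing the process (the open connections typically keep the JVM alive after main returns), then confirm the leaked connections don’t clean themselves up: they linger until the JVM process exits or the broker eventually times them out, unlike a properly closed connection. Then stop the bad publisher and watch the count drop back down.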

Step 5: Fix it with a shared, Spring-managed bean. Replace the manual client code with an injected RabbitTemplate (backed by CachingConnectionFactory, which Spring Boot auto-configures for you):

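```
import lombok.RequiredArgsConstructor;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class GoodPublisher {

    // Injected by Spring; shares the pooled connection and cached channels
    private final RabbitTemplate rabbitTemplate;

    public void publishBatch() {
        for (int i = 0; i < 200; i++) {
            rabbitTemplate.convertAndSend("", "orders.created.queue", "good message " + i);
        }
    }
}
```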

Step 6: Re-run and re-watch. Trigger publishBatch() (e.g., from a throwaway @RestController endpoint or a CommandLineRunner) and watch the same list_connections count. This time it should stay flat at one connection (plus whatever channels Spring’s cache pool needs, also stable) for the entire 200-message run. It is the same workload, but it reuses the pooled Connection/Channel instead of creating new ones per message.

✅ Checkpoint

You should now be able to:

Explain why creating a new Connection or Channel per message is a broker-wide risk, not just a problem for the app doing it.
Use list_connections/list_channels (watched over time, not as a single snapshot) plus peer_host to identify a churning/leaking app.
Reproduce a connection leak locally, confirm it in the Management UI, and fix it by switching to an injected RabbitTemplate/CachingConnectionFactory.