
Conversation

@fielding

Overview

This PR focuses on improving connection reliability in nats.py, addressing persistent stability issues observed in long-running applications and in scenarios with intermittent network disruptions, like the ones we've been experiencing from one of our compute providers (lol). I was able to mitigate some of it, but not all of it, with additional logic in our own codebase that helped nats reconnect in these instances and rebind the subscriptions, but honestly that was an ugly band-aid and it didn't work 100% of the time. So, after repeated encounters with elusive connectivity bugs and noting similar experiences among others in the community (#598), I decided to take a stab at improving the issue at its core.

Here's what this PR includes:

1. Improved Ping Loop Error Handling

  • Enhanced error handling within the ping loop to prevent silent failures.
  • Properly catches and handles asyncio.InvalidStateError.
  • Adds a catch-all exception handler to ensure the ping loop never silently stalls.
  • Forces a proper disconnect with ErrStaleConnection when ping anomalies occur (see the sketch below).
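
To make this concrete, here is a minimal, self-contained sketch of the ping-loop hardening described above. This is not the actual diff: send_ping, on_stale, and the interval/threshold values are placeholders for the client's real PING machinery and stale-connection handling.

```python
import asyncio

async def ping_loop(send_ping, on_stale, interval=2.0, max_outstanding=2):
    """Generic watchdog: send PINGs and treat unanswered ones as a stale connection."""
    outstanding = 0  # a real client would reset this whenever a PONG arrives
    while True:
        await asyncio.sleep(interval)
        try:
            if outstanding >= max_outstanding:
                # Too many unanswered PINGs: force the stale-connection path
                # instead of letting the loop stall silently.
                await on_stale("too many outstanding pings")
                return
            await send_ping()
            outstanding += 1
        except asyncio.CancelledError:
            raise  # normal shutdown of the task
        except asyncio.InvalidStateError as exc:
            # A future the ping path relied on was already resolved or cancelled.
            await on_stale(f"invalid state in ping loop: {exc!r}")
            return
        except Exception as exc:
            # Catch-all so the ping loop never dies without triggering recovery.
            await on_stale(f"unexpected ping loop error: {exc!r}")
            return
```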

2. Enhanced Read Loop Stability

  • Implements timeout detection for read operations, introducing a consecutive timeout counter to identify potentially stalled connections.
  • Adds a configurable client option (max_read_timeouts, defaults to 3) to fine-tune sensitivity.
  • Explicit handling for ConnectionResetError and asyncio.InvalidStateError to improve resilience and provide clearer debug information (sketched below).
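
In the same spirit, here is a rough sketch of the consecutive-timeout idea; read_some, on_data, and on_stalled are hypothetical hooks, and only the max_read_timeouts name mirrors the new option.

```python
import asyncio

async def read_loop(read_some, on_data, on_stalled,
                    max_read_timeouts=3, read_timeout=5.0):
    """Read until repeated timeouts suggest the connection has stalled."""
    consecutive_timeouts = 0
    while True:
        try:
            data = await asyncio.wait_for(read_some(), timeout=read_timeout)
        except asyncio.TimeoutError:
            consecutive_timeouts += 1
            if consecutive_timeouts >= max_read_timeouts:
                # The socket may still look open while nothing arrives;
                # hand off to reconnect logic instead of spinning forever.
                await on_stalled("too many consecutive read timeouts")
                return
            continue
        except (ConnectionResetError, asyncio.InvalidStateError) as exc:
            # Handle these explicitly so the failure is visible and debuggable.
            await on_stalled(f"read loop error: {exc!r}")
            return
        consecutive_timeouts = 0  # any successful read resets the counter
        if not data:
            await on_stalled("EOF from transport")
            return
        await on_data(data)
```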

3. Reliable Request/Response Handling

  • Adds pre-flight connection health checks before issuing requests.
  • Improves internal cleanup for request/response calls to prevent subtle resource leaks.
  • Strengthens timeout and cancellation logic to guard against orphaned or stale futures (see the sketch below).
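
The cleanup changes boil down to ensuring the per-request future is always removed from the pending map, whether it completes, times out, or is cancelled. A hedged sketch of that pattern, using an illustrative resp_map and token rather than the client's real inbox handling:

```python
import asyncio
import uuid

async def request(publish, resp_map, subject, payload, timeout=1.0):
    """Issue a request and await the reply, always removing the pending future."""
    token = uuid.uuid4().hex
    future = asyncio.get_running_loop().create_future()
    resp_map[token] = future

    def cleanup_resp_map(f):
        # Fires on completion, cancellation, or timeout, so no entry is orphaned.
        resp_map.pop(token, None)

    future.add_done_callback(cleanup_resp_map)
    await publish(subject, payload, reply=token)
    try:
        return await asyncio.wait_for(future, timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the future on timeout, which triggers the cleanup
        # callback; the explicit cancel() here is a harmless safeguard.
        future.cancel()
        raise
```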

4. Proactive Connection Health Checks

  • Introduces _check_connection_health(), a method designed to proactively test and re-establish the connection if necessary.
  • Used in critical paths like request handling to ensure robustness under varying network conditions (sketched below).
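
Finally, a minimal sketch of the shape of such a pre-flight check; the real _check_connection_health() lives on the client and ties into its reconnect machinery, so the client object and its hooks below are assumptions, not the exact implementation.

```python
async def check_connection_health(client, flush_timeout=1.0):
    """Pre-flight check before issuing a request on a hypothetical client.

    is_closed, is_connected, is_reconnecting, reconnect, and flush are assumed
    hooks, not necessarily the exact nats.py attribute names.
    """
    if client.is_closed:
        raise RuntimeError("connection is closed")
    if not client.is_connected or client.is_reconnecting:
        # Give the client a chance to re-establish the connection rather
        # than letting the request fail later with an opaque timeout.
        await client.reconnect()
    # Round-trip a PING/PONG to confirm the server is actually responsive.
    await client.flush(timeout=flush_timeout)
```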

Linked Issues

This PR addresses stability concerns raised in:

  • #598

Recommended Testing Scenarios

These improvements should noticeably enhance stability, particularly in environments with:

  • Long-running applications (24+ hours uptime)
  • Frequent or intermittent connectivity disruptions
  • Intensive request-response workloads
  • Heavy usage of JetStream or Key-Value operations

Impact and Compatibility

  • Backward Compatibility: Fully backward compatible. Existing interfaces remain unchanged.
  • Configuration: New options (max_read_timeouts) default to safe values and can be tuned as needed without affecting existing usage (see the example after this list).
  • Robustness: Designed to gracefully handle and recover from various edge cases previously causing silent connection failures.
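
For instance, a connection that opts into the new setting might look like the following; the option name comes from this PR, while the server URL and subject are just placeholders.

```python
import asyncio
import nats

async def main():
    # max_read_timeouts is the option introduced by this PR; 3 is the default,
    # so passing it explicitly only matters if you want to tune sensitivity.
    nc = await nats.connect("nats://127.0.0.1:4222", max_read_timeouts=3)
    try:
        await nc.publish("updates", b"hello")
        await nc.flush()
    finally:
        await nc.close()

asyncio.run(main())
```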

Contributing Statement

  • This contribution is my original work.
  • I license the work to the NATS project under the project's Apache 2.0 license.

As always, feedback is welcome, and I'm happy to iterate as needed!

Cheers,
Fielding

@fielding
Author

fielding commented Mar 13, 2025

I'm going to leave this as a draft until I can get some concrete data from my own testing in the wild or from others who might be having similar issues and can help to test.

If you want to install and test this commit, fielding@4a20463, you can simply do:

pip install git+https://github.com/fielding/nats.py.git@4a20463b521962c83ec58e0cc1a4d1f72fd98440

or there is a way to install a specific PR by #, but I can't remember exactly what it is -.-

@fielding
Author

My own concerns with this:

  • Performance Overhead: Timeouts and health checks add slight overhead. For high-throughput applications, this could be noticeable. Consider profiling under load. This is my main concern.

  • Testing Gaps: The bug may or may not be limited to the specific scenarios I described above, which makes it difficult to reproduce, test, and assess reliably.

@fielding
Author

So far, testing shows a positive change in behavior over the last 5 days. Prior to these changes, it would normally fail to reconnect somewhere around the 36-48 hour mark.

@MattiasAng

Just some feedback on this.

We've run with the changes committed in this pull request for a month now, as we were experiencing the same kind of reliability issues where the connection would randomly fail to be re-established without any notice of it happening.

Since we introduced these changes we've not had any reliability issues and that is with a NATS upgrade along with multiple restarts of our NATS servers.

@fielding
Author

@MattiasAng Thanks so much for the feedback. This is great news! To be completely honest, we've been using these changes in the wild as well and they have fully cleared up the issues we were having. To the point that I kind of forgot about this PR lol.

Your feedback + my experience with these changes gives me the reassurance I need to get this out of draft mode and in front of the team. Thank you

@Toruitas

Would be great to have these, as I'm also facing some connection reliability issues.

@swelborn

This would be good to upstream. We're facing a lot of these issues! @fielding, have you rebased this recently?

@pisymbol

2. Enhanced Read Loop Stability

  • Implements timeout detection for read operations, introducing a consecutive timeout counter to identify potentially stalled connections.

Same. We sometimes see the client go into an infinite loop around uvloop on connection timeouts.

@superlevure

We're facing these issues as well in production, this PR would be very much welcomed here too! Did you get any traction from the maintainers?

cc @wallyqs @caspervonb

@fielding
Author

fielding commented Oct 11, 2025

Hey guys, sorry, just now catching up on this. I'm no longer working on the project I was originally using this on, but as far as I know they continued to use this branch in production for a while. I see there is definitely some interest in getting this to a point where we can try to get it merged, so I will help as much as possible. It looks like @superlevure is already on it some =)

It looks like a rebase is in order, if nothing else =)

@caspervonb caspervonb requested a review from Copilot October 11, 2025 15:54

Copilot AI left a comment


Pull Request Overview

This PR improves connection reliability in the NATS Python client by enhancing error handling and implementing proactive connection health checks to address stability issues in long-running applications with intermittent network disruptions.

  • Enhanced ping loop and read loop error handling with proper exception catching and forced disconnections
  • Added connection health checks before critical operations like requests
  • Improved request/response handling with better timeout management and resource cleanup


@fielding fielding force-pushed the fix/connection-stability branch from 4a20463 to 632f8c4 Compare October 11, 2025 15:59
@fielding fielding marked this pull request as ready for review October 11, 2025 16:00
fielding and others added 5 commits October 11, 2025 11:01
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@fielding
Author

This now includes the fix for the race condition that @superlevure caught, thanks!

@fielding fielding changed the title [DRAFT]: Connection Reliability Improvements Connection Reliability Improvements Oct 11, 2025
@caspervonb
Collaborator

Looks promising, will do some testing locally with some chaos sprinkled in.

Collaborator

@caspervonb caspervonb left a comment


PTAL @wallyqs


# Publish the request
await self.publish(subject, payload, reply=inbox.decode(), headers=headers)
def cleanup_resp_map(f):
Collaborator


Doesn't matter, but just makes the diff larger. Why change from the lambda to a named local fn?

Author


Err, I meant to change it back... I had added some debugging statements at some point, and it was easier to comment them out when they were on distinct lines... come to think of it, that might not be the only place.
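
For reference, the two forms behave identically; the named version just puts each statement on its own line, which is handier for temporary instrumentation. A small illustrative example (the resp_map and token names are made up):

```python
import asyncio

async def main():
    resp_map = {}
    token = "inbox-token"
    future = asyncio.get_running_loop().create_future()
    resp_map[token] = future

    # Lambda form: compact, keeps the diff to one line.
    # future.add_done_callback(lambda f: resp_map.pop(token, None))

    # Named local function: identical behavior, but each statement sits on its
    # own line, which makes temporary debug prints or breakpoints easier.
    def cleanup_resp_map(f):
        resp_map.pop(token, None)

    future.add_done_callback(cleanup_resp_map)
    future.set_result(b"reply")
    await asyncio.sleep(0)  # give the done callback a chance to run
    assert token not in resp_map

asyncio.run(main())
```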

if self.is_closed:
raise errors.ConnectionClosedError
elif self.is_reconnecting:
raise errors.ConnectionReconnectingError
Collaborator


So, this is now raising a bunch of new errors, which need to be documented. Also, this bit feels breaking, PTAL @wallyqs
