Connection Reliability Improvements #675
base: main
Conversation
I'm going to leave this as a draft until I can get some concrete data from my own testing in the wild, or from others who might be having similar issues and can help test. If you want to install and test this commit, fielding@4a20463, you can simply do: pip install git+https://github.com/fielding/nats.py.git@4a20463b521962c83ec58e0cc1a4d1f72fd98440 or there is a way to install a specific PR by #, but I can't remember exactly what it is -.-
My own concerns with this:
So far, testing shows a positive change in behavior over the last 5 days. Prior to these changes, it would normally fail to reconnect somewhere around the 36-48 hour mark.
Just some feedback on this. We've run with the changes committed in this pull request for a month now, as we were experiencing the same kind of reliability issues where the connection randomly failed to be reestablished without any notice. Since we introduced these changes we've had no reliability issues, and that is with a NATS upgrade along with multiple restarts of our NATS servers.
@MattiasAng Thanks so much for the feedback. This is great news! To be completely honest, we've been using these changes in the wild as well, and they have fully cleared up the issues we were having, to the point that I kind of forgot about this PR lol. Your feedback plus my experience with these changes gives me the reassurance I need to get this out of draft mode and in front of the team. Thank you!
Would be great to have these, as I'm also facing some connection reliability issues.
This would be good to get upstreamed. We're facing a lot of these issues! @fielding, have you rebased this recently?
Same. We sometimes see the client go into an infinite loop around
We're facing these issues in production as well; this PR would be very much welcomed here too! Did you get any traction from the maintainers?
Hey guys, sorry, just now catching up on this. I'm no longer working on the project I was originally using this on, but as far as I know they continued to use this branch in production for a while. I see there's definitely some interest in getting this to a point where we can try to get it merged, so I will help as much as possible. It looks like @superlevure is already on it some =) It looks like a rebase is in order, if nothing else =)
Pull Request Overview
This PR improves connection reliability in the NATS Python client by enhancing error handling and implementing proactive connection health checks to address stability issues in long-running applications with intermittent network disruptions.
- Enhanced ping loop and read loop error handling with proper exception catching and forced disconnections
- Added connection health checks before critical operations like requests
- Improved request/response handling with better timeout management and resource cleanup
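The "health check before critical operations" idea above can be sketched with a toy client. This is a minimal illustration of the pattern, not nats.py's actual implementation; `_check_connection_health()` is the method name the PR introduces, but everything else here (the `FakeClient` class, its attributes) is made up for the example:

```python
import asyncio

class FakeClient:
    """Toy stand-in illustrating the 'health check before request'
    pattern; only _check_connection_health() mirrors the PR's naming."""

    def __init__(self):
        self.is_connected = False
        self.reconnects = 0

    async def _check_connection_health(self):
        # If the transport looks dead, force a reconnect before
        # letting the request proceed instead of silently hanging.
        if not self.is_connected:
            await self._reconnect()

    async def _reconnect(self):
        self.reconnects += 1
        self.is_connected = True

    async def request(self, subject, payload):
        await self._check_connection_health()
        return b"ok:" + payload

async def main():
    nc = FakeClient()
    resp = await nc.request("greet", b"hi")
    assert resp == b"ok:hi" and nc.reconnects == 1

asyncio.run(main())
```

The point of the pattern is that a request issued against a dead connection fails fast (or triggers recovery) rather than waiting out a full timeout.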
Force-pushed from 4a20463 to 632f8c4
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fix: `is_reconnecting` case
This now includes the fix for the race condition that @superlevure caught, thanks!
Looks promising, will do some testing locally with some chaos sprinkled in.
PTAL @wallyqs
# Publish the request
await self.publish(subject, payload, reply=inbox.decode(), headers=headers)
def cleanup_resp_map(f):
Doesn't matter, but just makes the diff larger. Why change from the lambda to a named local fn?
err, I meant to change it back... I had added some debugging statements at some point and it was easier to comment them out when they were on distinct lines... come to think of it that might not be the only place
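For context, the two forms being discussed are functionally equivalent done-callbacks on the response future; the named version just makes it easier to toggle debug lines independently. The identifiers below (`resp_map`, the tokens) are illustrative, not the actual nats.py internals:

```python
import asyncio

async def demo():
    resp_map = {}
    loop = asyncio.get_running_loop()

    # Named-function form (as in the diff): debug lines can be
    # commented out on their own line without touching the logic.
    def cleanup_resp_map(f):
        # print("cleaning up", f)  # easy to toggle when debugging
        resp_map.pop("tok1", None)

    fut1 = loop.create_future()
    resp_map["tok1"] = fut1
    fut1.add_done_callback(cleanup_resp_map)
    fut1.set_result(b"a")

    # Equivalent lambda form (the original style): terser, but any
    # debugging requires rewriting it as a function anyway.
    fut2 = loop.create_future()
    resp_map["tok2"] = fut2
    fut2.add_done_callback(lambda f: resp_map.pop("tok2", None))
    fut2.set_result(b"b")

    await asyncio.sleep(0)  # yield once so the done callbacks run
    assert resp_map == {}

asyncio.run(demo())
```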
if self.is_closed:
    raise errors.ConnectionClosedError
elif self.is_reconnecting:
    raise errors.ConnectionReconnectingError
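If this lands, callers that previously only ever saw timeouts may want to handle the new states explicitly. A hedged sketch of one possible caller-side policy (the exception names come from the diff above, but here they are stubbed locally; the retry/backoff logic is mine, not the PR's):

```python
import asyncio

# Local stand-ins for the errors the diff raises; in real code these
# would come from nats.py's errors module.
class ConnectionClosedError(Exception): ...
class ConnectionReconnectingError(Exception): ...

async def request_with_retry(do_request, retries=3, backoff=0.01):
    """Retry while the client is mid-reconnect; give up if it never
    comes back. Illustrative policy only."""
    for attempt in range(retries):
        try:
            return await do_request()
        except ConnectionReconnectingError:
            # Transient: give the reconnect loop a moment, then retry.
            await asyncio.sleep(backoff * (attempt + 1))
    raise ConnectionClosedError

async def main():
    calls = {"n": 0}

    async def flaky():
        # Fails twice while "reconnecting", then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionReconnectingError
        return b"pong"

    assert await request_with_retry(flaky) == b"pong"
    assert calls["n"] == 3

asyncio.run(main())
```

This is also why the change feels potentially breaking: code that only caught timeout errors around `request()` would now see these exceptions escape.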
So, this now raises a bunch of new errors, which need to be documented. Also, this bit feels breaking. PTAL @wallyqs
Overview
This PR focuses on improving connection reliability in nats.py, addressing persistent stability issues observed in long-running applications or in scenarios with intermittent network disruptions, like the ones we've been experiencing from one of our compute providers (lol). I was able to mitigate some of it, but not all, with additional logic in our codebase that helped nats reconnect in these instances and rebind the subscriptions, but honestly that was an ugly bandaid and it didn't work 100% of the time. So, after repeated encounters with elusive connectivity bugs, and noting similar experiences among others in the community (#598), I decided to take a stab at improving the issue at the core. Here's what this PR includes:
1. Improved Ping Loop Error Handling
- Catches `asyncio.InvalidStateError`.
- Raises `ErrStaleConnection` when ping anomalies occur.
2. Enhanced Read Loop Stability
- Adds a configurable threshold (`max_read_timeouts`, defaults to 3) to fine-tune sensitivity.
- Catches `ConnectionResetError` and `asyncio.InvalidStateError` to improve resilience and provide clearer debug information.
3. Reliable Request/Response Handling
4. Proactive Connection Health Checks
- Introduces `_check_connection_health()`, a method designed to proactively test and re-establish the connection if necessary.
Linked Issues
This PR addresses stability concerns raised in:
Recommended Testing Scenarios
These improvements should noticeably enhance stability, particularly in environments with:
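One cheap way to exercise the read-timeout threshold described in this PR without a real network is to simulate a sequence of reads. `max_read_timeouts` and its default of 3 come from the PR; the harness below is entirely illustrative and not the actual read-loop code:

```python
import asyncio

MAX_READ_TIMEOUTS = 3  # mirrors the PR's default for max_read_timeouts

async def read_loop(reads, timeouts_seen):
    """Toy read loop: count consecutive timeouts and force a
    disconnect once the threshold is crossed (illustrative only)."""
    consecutive = 0
    for outcome in reads:
        try:
            if outcome == "timeout":
                raise asyncio.TimeoutError
            consecutive = 0  # a successful read resets the counter
        except asyncio.TimeoutError:
            consecutive += 1
            timeouts_seen.append(consecutive)
            if consecutive >= MAX_READ_TIMEOUTS:
                # In the real client this would trigger the forced
                # disconnect / reconnect path.
                return "force-disconnect"
    return "ok"

async def main():
    seen = []
    result = await read_loop(["data", "timeout", "timeout", "timeout"], seen)
    assert result == "force-disconnect"
    assert seen == [1, 2, 3]

asyncio.run(main())
```

The threshold-of-consecutive-timeouts shape means a single slow read doesn't tear the connection down, while a sustained stall does.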
Impact and Compatibility
- Configurable thresholds (`max_read_timeouts`) default to safe values and can be tuned as needed without affecting existing usage.
Contributing Statement
As always, feedback is welcome, and I'm happy to iterate as needed!
Cheers,
Fielding