fix #5152: expanding the error detection #5153
@manusa @scholzj this refines the changes in #5047 to address #5047 (comment), which was observed in #5152. We'll specifically look for a ProtocolException as the only case where we assume we should proactively terminate, with the expectation that the HttpClient implementations will have normalized to that. For everything else there will be some logging based upon the inferred severity and the number of times we've seen that termination since we last made progress with the watch. The calls to scheduleReconnect have been consolidated to ensure consistent handling of exceptions. The HTTP watch exceptions have not been fully normalized, but at worst this will result in additional info logs or looping on unresolvable protocol errors.
Thanks @shawkins
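The classification idea in the comment above could be sketched as follows. This is a hypothetical illustration, not the actual client code: only a ProtocolException anywhere in the cause chain is treated as terminal, on the assumption that the HttpClient implementations normalize protocol-level failures to that type; everything else is left to be logged and retried.

```java
import java.net.ProtocolException;

// Illustrative sketch of the terminal-vs-retry decision described above.
public class WatchErrorClassifier {

    /** Only a ProtocolException is considered terminal; walk the cause
     *  chain in case the client implementation wrapped it. */
    public static boolean isTerminal(Throwable t) {
        while (t != null) {
            if (t instanceof ProtocolException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }
}
```

Anything non-terminal would then flow into the consolidated scheduleReconnect path rather than tearing the watch down.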
…d logging also obeying the Status retryAfterSeconds if provided
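The retryAfterSeconds handling mentioned in the commit message could look roughly like this. The names here are illustrative, not the client's actual API: the server-supplied Status.retryAfterSeconds hint, when present and positive, takes precedence over the client's own backoff.

```java
// Hypothetical sketch of choosing a reconnect delay that obeys the
// Status retryAfterSeconds hint if the server provided one.
public class ReconnectDelay {

    public static long nextDelayMillis(Integer retryAfterSeconds, long backoffMillis) {
        if (retryAfterSeconds != null && retryAfterSeconds > 0) {
            // honor the server's hint
            return retryAfterSeconds * 1000L;
        }
        // otherwise fall back to the client's computed backoff
        return backoffMillis;
    }
}
```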
@manusa at least for the JDK client this causes pretty bad behavior with watches / informers, so another 6.6 release may be in order. For the other clients it shouldn't be as impactful.
@manusa @scholzj After updating to master, this was further updated to harden the WebSocket implementation contracts. Returning a boolean for send and sendClose is a little problematic: a non-terminal error while writing would mean that any additional writes would be invalid. To ensure the expected behavior and to let us know if something is wrong, callbacks will log and even terminate the WebSocket if possible. Similar logic / logging was added and made more robust for sendClose, with a delayed termination if a remote close has not been received. There is currently no method exposed for the Vert.x WebSocket to force a termination, so if anyone hits one of the error logs, we'll need to have something added or that specific error fixed.
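The delayed-termination behavior described above could be sketched like this. All names here are hypothetical, not the actual client API: after sending the close frame, a timer is armed that forcibly terminates the connection unless the remote close arrives first.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of sendClose with a delayed forced termination.
public class ClosingWebSocket {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final long closeTimeoutMillis;
    private volatile boolean terminated;
    private ScheduledFuture<?> forcedTermination;

    public ClosingWebSocket(long closeTimeoutMillis) {
        this.closeTimeoutMillis = closeTimeoutMillis;
    }

    public void sendClose() {
        // ... write the close frame here, then arm the safety net
        forcedTermination = scheduler.schedule(
            this::terminate, closeTimeoutMillis, TimeUnit.MILLISECONDS);
    }

    public void onRemoteClose() {
        if (forcedTermination != null) {
            forcedTermination.cancel(false); // remote close arrived in time
        }
        scheduler.shutdown();
    }

    void terminate() {
        terminated = true; // would abort the underlying connection here
    }

    public boolean isTerminated() {
        return terminated;
    }
}
```

The design point is that a well-behaved peer cancels the timer via the remote close, while a hung peer is cleaned up without waiting forever.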
Thanks a lot for the effort you put into this @shawkins!
also adding logging and refining termination
SonarCloud Quality Gate failed.
Description
As seen on #5152, some JDK client implementations do call onError with connection-related errors (my local one would call onClose instead). The proposed change is to further differentiate which errors are protocol errors, so that only select exceptions are treated as terminal. Note that logic like this would be applicable in the HTTP watch scenario as well, but no changes were made there.
Another possible issue I noticed while making these changes is the behavior in WatchConnectionManager.start: if a websocket has already been established, then a subsequent connection attempt can short-circuit on failure, before scheduling another attempt, if a 200 or 503 occurs. That short-circuit should only apply to the first attempt, where we fall back to an HTTP watch instead. It seems like there is probably some improvement to be had in the HTTP case as well; also, #4624 was never addressed.
For comparison, the Go client will only retry a watch when the connection is refused or when there have been too many requests: https://sourcegraph.com/github.com/kubernetes/client-go@2a5f18df73b70cb85c26a3785b06162f3d513cf5/-/blob/tools/cache/reflector.go?L418 - so it seems like it would have a similar issue if a 200 or 503 were returned.
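The first-attempt-only short-circuit described above could be sketched as follows. This is an illustrative stand-in, not the actual WatchConnectionManager code: on the very first attempt a 200 or 503 triggers the HTTP-watch fallback, while on later reconnect attempts the same responses are simply rescheduled.

```java
// Hypothetical sketch: only the first connection attempt may fall back
// to an HTTP watch; subsequent attempts always reschedule instead.
public class WatchStart {
    private boolean firstAttempt = true;

    /** Returns true if the caller should fall back to an HTTP watch. */
    public boolean shouldFallBack(int httpCode) {
        boolean nonUpgradable = httpCode == 200 || httpCode == 503;
        boolean fallback = firstAttempt && nonUpgradable;
        firstAttempt = false;
        return fallback;
    }
}
```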