Skip to content

Conversation

shawkins
Copy link
Contributor

Description

As seen on #5152 some jdk client implementations do call onError with connection related errors (my local one would call onClose instead). The proposed change is to further differentiate what are protocol errors, so that only select exceptions are treated as terminal. Note that logic like this would be applicable in the http scenario as well, but no changes were made there.

Another possible issue I noticed in doing these changes is the behavior in WatchConnectionManager.start - if a websocket has already been established, then a subsequent connection attempt can short-circuit on failure before scheduling another attempt if a 200 or 503 occurs. That short-circuit should only be for the first attempt, where we fall-back to an http watch instead. It seems like there is probably some improvement to be had with the http case - also #4624 was never addressed.

However comparing to the go client, it will only retry a watch when the connection is refused or if there have been too many requests: https://sourcegraph.com/github.com/kubernetes/client-go@2a5f18df73b70cb85c26a3785b06162f3d513cf5/-/blob/tools/cache/reflector.go?L418 - so it seems like they would have a similar issue if 200 or 503 where returned.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • Code contributed by me aligns with current project license: Apache 2.0
  • I Added CHANGELOG entry regarding this change
  • I have implemented unit tests to cover my changes
  • I have added/updated the javadocs and other documentation accordingly
  • No new bugs, code smells, etc. in SonarCloud report
  • I tested my code in Kubernetes
  • I tested my code in OpenShift

@shawkins shawkins marked this pull request as ready for review May 19, 2023 13:44
@shawkins shawkins force-pushed the iss5152 branch 4 times, most recently from 47111a2 to 77f1319 Compare May 19, 2023 15:48
@shawkins
Copy link
Contributor Author

@manusa @scholzj this refines the changes in #5047 to address #5047 (comment) which was observed in #5152. We'll specifically look for a protocolexception as the only case to assume we should proactively terminate - with the expectation that the httpclient will have normalized to that. For everything else there will be some logging based upon the inferred severity and number of times we've seen that termination since we've made progress with the watch.

The calls to scheduleReconnect have been consolidated to ensure consistent handling of exceptions - the http watch exceptions have not been fully normalized, but at worst this will result in additional info logs or looping on unresolvable protocol errors.

@scholzj
Copy link
Contributor

scholzj commented May 19, 2023

Thanks @shawkins

…d logging

also obeying the Status retryAfterSeconds if provided
@shawkins
Copy link
Contributor Author

@manusa at least for the jdk client this causes pretty bad behavior with watches / informers, so another 6.6 may be in order. For the other clients it shouldn't be as impactful.

@shawkins
Copy link
Contributor Author

@manusa @scholzj After updating to master this was updated to harden the websocket implementation contracts - returning a boolean for send and sendClose is a little problematic. A non-terminal error writing would mean that any additional writes would be invalid. To ensure the expected behavior and let us know if something is wrong, callbacks will log and even terminate the websocket if possible. Similar logic / logging was added / made more robust for sendClose - with a delayed termination if a remote close has not been received.

There is currently no method exposed for the vertx websocket to force a termination, so if anyone hits one of the error logs, we'll need to have something added or that specific error fixed.

Copy link
Contributor

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the effort you put into this @shawkins!

also adding logging and refining termination
@shawkins shawkins added this to the 6.7.0 milestone May 29, 2023
@sonarqubecloud
Copy link

SonarCloud Quality Gate failed.    Quality Gate failed

Bug B 2 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

59.0% 59.0% Coverage
0.0% 0.0% Duplication

@manusa manusa merged commit 1c3baf9 into fabric8io:master May 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants