
Conversation

0xmichalis
Contributor

This PR adds failure inspection in the deploymentConfig controller.
Every time a deploymentConfig is reconciled (every two minutes), we try
to find any details about possible failures and report them to the user
via the status.details.message field in the deploymentConfig. Users can
use `oc describe dc/config` and see if there is any warning in their
config. `oc status` should follow up on these changes and report likewise.
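
For context, a rough sketch of the field being populated; the type and field names below are assumptions inferred from the status.details.message path described above, not the exact deployapi definitions.

// Hypothetical sketch only; the real deployapi types may differ in
// naming and nesting.
type DeploymentConfig struct {
	// ... other fields elided ...
	// Details carries any failure details found during reconciliation,
	// surfaced to users as status.details in the API output.
	Details *DeploymentDetails
}

type DeploymentDetails struct {
	// Message is a human-readable explanation of why the latest
	// deployment may be missing or failing; it is what
	// `oc describe dc/config` surfaces as a warning.
	Message string
}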

@ironcladlou @deads2k @smarterclayton PTAL

Fixes #3526
Fixes #3311

@0xmichalis 0xmichalis changed the title "Report transient deployment trigger erros via API field" to "Report transient deployment trigger errors via API field" on Oct 5, 2015
@deads2k
Contributor

deads2k commented Oct 5, 2015

Update oc status to know this too.

@ironcladlou
Contributor

Discussed on IRC, but to recap: the change needs to originate in the deployment config generator, and the caller (e.g. the configChangeController) needs to be able to interpret that error and generate a more meaningful message. The current approach won't handle reporting the actual causes for a non-existent initial deployment, which is the primary use case.

@ironcladlou
Contributor

Talked more on IRC, and I'm now thinking it might make more sense to have the deploymentController do the state inspection you've got here every time it reconciles a DC. This would ensure the message is set upon initial DC creation (which is the primary use case), and also ensure the message is cleared/updated as the triggers mutate the config.

So basically, when handling a DC, check to see "does a deployment for this DC.LatestVersion exist?" and if not, figure out some possible reasons why and update the message. If there is a deployment for the current version, clear the message.
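
A minimal sketch of that flow, assuming hypothetical helper names (deploymentExistsFor, findDetails, updateDetails) rather than whatever the PR actually ends up calling them:

// Sketch only: helper names are illustrative, not the PR's API.
func (c *DeploymentConfigController) reconcileDetails(config *deployapi.DeploymentConfig) error {
	if c.deploymentExistsFor(config) {
		// A deployment for config.LatestVersion exists; clear any stale message.
		return c.updateDetails(config, "")
	}
	// No deployment yet: collect likely causes (missing image stream tag,
	// unresolved trigger, etc.) and surface them to the user.
	return c.updateDetails(config, c.findDetails(config))
}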

@0xmichalis
Contributor Author

@ironcladlou makes sense, will update accordingly

@0xmichalis
Contributor Author

So basically, when handling a DC, check to see "does a deployment for this DC.LatestVersion exist?" and if not, figure out some possible reasons why and update the message. If there is a deployment for the current version, clear the message.

Thinking about this more, I believe we should do that kind of inspection not only when there is no respective deployment for the current dc version but also when the latest deployment has failed. WDYT?

ps. in your first paragraph, you probably meant the deploymentConfig controller.

@ironcladlou
Contributor

Thinking about this more, I believe we should do that kind of inspection not only when there is no respective deployment for the current dc version but also when the latest deployment has failed. WDYT?

Makes sense to me... if the deployment is visible and we see it failed, I don't see why not try to add more info to the config status. That might even be a good time to go look up the deployer pod and, if it still exists, get some info from that as well to add to the message. The more clues we can collect and provide to the user, the better IMO (a rough sketch follows below).

ps. in your first paragraph, you probably meant the deploymentConfig controller.

You're right, I meant to say this work should happen in the deploymentConfig controller.
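
A rough sketch of the deployer-pod inspection suggested above; the deployutil helpers and client calls are modeled on what existed in origin at the time, but treat the exact names and signatures as assumptions.

// Illustrative only: helper and client signatures are assumptions.
details := ""
name := deployutil.LatestDeploymentNameForConfig(config)
deployment, err := c.kubeClient.ReplicationControllers(config.Namespace).Get(name)
if err == nil && deployutil.DeploymentStatusFor(deployment) == deployapi.DeploymentStatusFailed {
	// The latest deployment failed; if the deployer pod still exists,
	// fold anything useful from it into the details message.
	podName := deployutil.DeployerPodNameForDeployment(deployment.Name)
	if pod, podErr := c.kubeClient.Pods(config.Namespace).Get(podName); podErr == nil {
		details = fmt.Sprintf("deployment %q failed: %s", name, pod.Status.Message)
	}
}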

@0xmichalis
Contributor Author

@ironcladlou is this closer to what you were thinking? What other failures can we detect based on a dc? Inspect the template, maybe? I could also see an unschedulable node as a possible failure, but I am not sure about having a node client in here. Thoughts?


// failureInspection inspects the given deployment configuration for any failure causes
// and updates the status of the configuration accordingly
func (c *DeploymentConfigController) failureInspection(config *deployapi.DeploymentConfig, outErr error) error {
Contributor

How about making this function just return the message and do the update from the calling side? Makes it easier to test and less side-effecty. I don't like doing the update in multiple places within the function, via defer, etc.

Contributor Author

Agreed, the code is cleaner now that the inspection happens once at start. Updated

@0xmichalis
Contributor Author

Need to add tests

}

// Handle processes config and creates a new deployment if necessary.
func (c *DeploymentConfigController) Handle(config *deployapi.DeploymentConfig) error {
	// Inspect a deployment configuration every time the controller reconciles it
	if details := c.failureInspection(config); len(details) > 0 {
		c.detailsUpdate(config, details)
Contributor

Closer... there are a few mods to fit the general reconciliation model of other controllers. Something like:

details := c.findDetails(config, client)
c.updateDetails(config, details)

The findDetails function could immediately check to see if there's a deployment for config, and if so, return an empty string (to cause details to be cleared). It could then do all its other checks. That way findDetails only needs to be called once, from here (the one on L94 would not be necessary).

The updateDetails function would update config if details != config.details.Message. That handles setting new messages and clearing outdated ones, and handles no-op'ing if there's no difference.

You should also ditch all those Namespacers and just use a single client.Interface. The test support works just as well that way, but with a lot less code.
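
A sketch of the shape being suggested, with c.client assumed to be a single client.Interface and the field layout of Details treated as an assumption:

// Sketch only: names follow the suggestion above; details are assumptions.
func (c *DeploymentConfigController) Handle(config *deployapi.DeploymentConfig) error {
	// findDetails returns "" when a deployment exists for config.LatestVersion,
	// which causes any previously set message to be cleared.
	details := c.findDetails(config)
	if err := c.updateDetails(config, details); err != nil {
		return err
	}
	// ... existing deployment-creation logic continues here ...
	return nil
}

// updateDetails no-ops when the message hasn't changed.
func (c *DeploymentConfigController) updateDetails(config *deployapi.DeploymentConfig, details string) error {
	if config.Details != nil && config.Details.Message == details {
		return nil
	}
	config.Details = &deployapi.DeploymentDetails{Message: details}
	_, err := c.client.DeploymentConfigs(config.Namespace).Update(config)
	return err
}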

Contributor Author

That way findDetails only needs to be called once, from here (the one on L94 would not be necessary).

I was aware of it but wanted to keep population and cleaning in separate places. I guess though it doesn't really matter. Updating...

Contributor Author

@ironcladlou I am not sure what to do with errors returned from the clients when querying for resources. Pass those as the details into the DC? Let the controller handle them? Or ignore them?

Contributor Author

@ironcladlou comments addressed, unit tests for findDetails added, PTAL

@0xmichalis
Contributor Author

@ironcladlou updated the messages and tests (please use table tests with test names :)).
Still have to update oc status but I am more inclined to do it in a follow-up (I would prefer to set up markers for dcs and we currently implement none - @deads2k WDYT?).

@deads2k
Contributor

deads2k commented Oct 8, 2015

@ironcladlou updated the messages and tests (please use table tests with test names :)).
Still have to update oc status but I am more inclined to do it in a follow-up (I would prefer to set up markers for dcs and we currently implement none - @deads2k WDYT?).

I don't see a reason to block this pull on oc status, but I would say that bad oc status for every deployment (what we have today) is a p1 for the release.

}, nil
},
},
{
Contributor

The nesting here renders like a flock of geese on my screen

Contributor Author

lol

I will add mock funcs to reduce it

Contributor

lol.

I was just kidding, don't sweat it

@0xmichalis
Contributor Author

@ironcladlou @deads2k I think this is ready

The oc status fix will be my next PR

@ironcladlou
Contributor

This LGTM, thanks a lot for working on it.

@0xmichalis
Contributor Author

Added the warning to oc status too. We could definitely traverse the graph and find out the cause sooner than the dc controller, but then we would be inconsistent. I think the right way is for status to follow what's reported by the controller.
@deads2k PTAL in the last commit

[vagrant@openshiftdev sample-app]$ oc status
In project test on server https://localhost:8443

service/database - 172.30.32.247:5434 -> 3306
  dc/database deploys docker.io/openshift/mysql-55-centos7:latest 
    #1 deployed 2 minutes ago - 1 pod

service/frontend - 172.30.252.126:5432 -> 8080
  dc/frontend deploys istag/origin-ruby-sample:latest <-
    builds https://github.com/openshift/ruby-hello-world.git with test/ruby-20-centos7:latest through bc/ruby-sample-build 
      #1 build failed 24 seconds ago (can't push to image)
    #1 deployment waiting on image or update

Warnings:
  bc/ruby-sample-build is pushing to imagestreamtag/origin-ruby-sample:latest that is using is/origin-ruby-sample, but the administrator has not configured the integrated Docker registry.  (oadm registry)
  The image trigger for "origin-ruby-sample:latest" will have no effect because image stream "origin-ruby-sample" does not exist!

To see more, use 'oc describe <resource>/<name>'.
You can use 'oc get all' to see a list of other objects.

@0xmichalis
Contributor Author

This means that we have to report multiple bad triggers. Separate line for each? The message you are asking for feels like a separate line too.

On Fri, Oct 9, 2015 at 2:49 PM, David Eads wrote:

Isn't this what the missing istag is about?

A deploymentconfig can have multiple triggers. It could have one bad trigger and one good one.



@ironcladlou
Contributor

I do agree that this work really belongs in the graph code - I brought this up before we started, and @smarterclayton and I decided to do it ASAP inside the controller to get started. Please correct me if I misunderstood the preferred initial direction.

My impression is since we're building a freeform message, it shouldn't be a big deal to swap in a different rendering from the graph library later.

@smarterclayton
Contributor

Controller doesn't have to use the graph library - especially if the controller is checking conditions directly as part of its core logic.

@deads2k
Contributor

deads2k commented Oct 9, 2015

This means that we have to report multiple bad triggers. Separate line for each?

You want to do status as a follow-up? I do think we should handle multiple bad triggers as markers for warnings and then separately mark a deployment that can't succeed on the same line. Something like "#1 deployment waiting on image or update (image stream doesn't exist)"

@0xmichalis
Contributor Author

@ironcladlou what's your opinion on reporting multiple bad triggers in the details field, or on the "can't succeed because there is no image" message? I have updated findDetails accordingly.

@ironcladlou
Contributor

I think adding some sort of line per trigger would be just fine... I see no harm in making the message longer/better formatted for that.

@0xmichalis
Contributor Author

Removed status changes and squashed down

glog.V(2).Infof("Error while trying to get image stream %q: %v", name, err)
if errors.IsNotFound(err) {
	details = fmt.Sprintf("The image trigger for istag/%s will have no effect because is/%s does not exist.", istag, name)
}
Contributor

Don't use "istag/" syntax in messages from a controller. Say "image stream tag foo:bar"
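
For instance, the Sprintf above might be reworded roughly like this (illustrative only; the final wording is whatever the PR settles on):

// Illustrative rewording only: spell out the resource types instead of
// using CLI-specific istag/ and is/ shorthand.
details = fmt.Sprintf("The image trigger for image stream tag %q will have no effect because image stream %q does not exist.", istag, name)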

Contributor Author

updated

@0xmichalis
Contributor Author

[test]

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5761/)

@deads2k
Contributor

deads2k commented Oct 12, 2015

Don't use "istag/" syntax in messages from a controller. Say "image stream tag foo:bar"

Not trying to re-open this issue for this pull, but why not keep the TYPE/NAME syntax consistent throughout all messages? It's really convenient to be able to get an error message from anywhere and copy/paste a resource reference into an oc describe, oc get -o yaml, a label filter, or any other command with zero friction. Given the usage of that format throughout every command in arguments and output, I think it's reasonable to expect people to be able to read it.

@smarterclayton
Contributor

Because istag/ is a specific construct of one client (the CLI) and has no relevance for the UI or an Eclipse client (or a log). "istag" is insufficient info for a new user to determine the referenced type. No message in our core API should assume a particular client.


@0xmichalis
Contributor Author

Squashed down to one commit and updated the commit message

This commit adds failure inspection in the deploymentConfig controller.
Every time a deploymentConfig is reconciled (every two minutes), we try
to find any details about possible failures and report them to the user
via the status.details.message field in the deploymentConfig. Users can
use `oc describe dc/config` and see if there is any warning in their
config. `oc status` should follow up these changes and report likewise.

@openshift-bot
Contributor

Evaluated for origin test up to 0b38d81

@ironcladlou
Contributor

This still LGTM but I think @deads2k and @smarterclayton will need to take another look to see if they're okay with the latest verbiage.

@deads2k
Contributor

deads2k commented Oct 14, 2015

I'm fine with the messages.

@deads2k
Contributor

deads2k commented Oct 14, 2015

[merge]

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/3621/) (Image: devenv-fedora_2465)

@openshift-bot
Contributor

Evaluated for origin merge up to 0b38d81

openshift-bot pushed a commit that referenced this pull request Oct 14, 2015
@openshift-bot openshift-bot merged commit 1bc373b into openshift:master Oct 14, 2015
@0xmichalis 0xmichalis deleted the report-trigger-errors-via-api-field branch October 15, 2015 07:57
@ironcladlou ironcladlou mentioned this pull request Oct 16, 2015