Skip to content

openshift-sdn failed health check with "link not found" on GCP, failed test #18317

@smarterclayton

Description

@smarterclayton

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18229/test_pull_request_origin_extended_conformance_gce/15151/#conformanceareanetworkingfeaturerouter-openshift-routers-the-haproxy-router-should-serve-the-correct-routes-when-scoped-to-a-single-namespace-and-label-set-suiteopenshiftconformanceparallel

exec pod was trying to curl the router (which was up) but wasn't able to create a connection.

/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:42
Expected error:
    <*errors.errorString | 0xc420bf1170>: {
        s: "last response from server was not 200:\n",
    }
    last response from server was not 200:
    
not to have occurred
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:79

Router was up a few seconds after the exec pod was created, but it was never able to connect to the destination pod:

Jan 27 23:50:48.628: INFO: Running '/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl --server=https://internal-api.prtest-5a37c28-15151.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=extended-test-scoped-router-s9pmd-2sj74 execpod -- /bin/sh -c 
		set -e
		for i in $(seq 1 180); do
			code=$( curl -k -s -o /dev/null -w '%{http_code}\n' --header 'Host: 172.16.2.45' "http://172.16.2.45:1936/healthz" ) || rc=$?
			if [[ "${rc:-0}" -eq 0 ]]; then
				echo $code
				if [[ $code -eq 200 ]]; then
					exit 0
				fi
				if [[ $code -ne 503 ]]; then
					exit 1
				fi
			else
				echo "error ${rc}" 1>&2
			fi
			sleep 1
		done
		'
Jan 27 23:53:53.739: INFO: stderr: "error 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\n"

Logs from router

Jan 27 23:53:53.828: INFO: Scoped Router test should serve the correct routes when scoped to a single namespace and label set [Suite:openshift/conformance/parallel] - waiting for the healthz endpoint to respond logs:
 I0127 23:51:07.732698       1 template.go:260] Starting template router (v3.9.0-alpha.3+5ec2b2f-218)
I0127 23:51:07.733031       1 merged_client_builder.go:123] Using in-cluster configuration
I0127 23:51:07.737366       1 reflector.go:202] Starting reflector *core.Service (10m0s) from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.737420       1 reflector.go:240] Listing and watching *core.Service from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.739146       1 router.go:154] Creating a new template router, writing to /var/lib/haproxy/router
I0127 23:51:07.739275       1 router.go:228] Template router will coalesce reloads within 5s of each other
I0127 23:51:07.739318       1 router.go:278] Router default cert from router container
I0127 23:51:07.739327       1 router.go:215] Reading persisted state
I0127 23:51:07.739355       1 router.go:219] Committing state
I0127 23:51:07.739363       1 router.go:333] Writing the router state
I0127 23:51:07.739855       1 router.go:340] Writing the router config
I0127 23:51:07.741933       1 router.go:354] Reloading the router
E0127 23:51:07.760463       1 reflector.go:205] github.com/openshift/origin/pkg/router/template/service_lookup.go:32: Failed to list *core.Service: services is forbidden: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list services at the cluster scope: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list all services in the cluster
I0127 23:51:07.816597       1 router.go:441] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).

The s4sd node ovs health check failed immediately after the router started:

Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd origin-node[2074]: F0127 23:50:31.053104    2074 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting: Link not found
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Unit origin-node.service entered failed state.
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service failed.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Starting OpenShift Node...

@openshift/sig-networking @openshift/networking

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions