OCPBUGS-67161: Replace HTTP backend liveness check with admin socket check #737

Open

alebedev87 wants to merge 1 commit into openshift:master from alebedev87:OCPBUGS-67161-liveness-probe-show-info

Conversation

@alebedev87 (Contributor) commented Feb 24, 2026

Summary

  • Replace the HTTP-based HAProxy liveness check with an admin socket "show info" command (see the sketch after this list)
  • The HTTP liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running
  • The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load
  • The readiness probe continues to use the HTTP backend check
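
A minimal sketch of the admin-socket liveness check described above, assuming the stock HAProxy stats socket path; this illustrates the idea and is not the PR's exact code:

package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// checkAdminSocket dials HAProxy's admin (stats) unix socket, sends
// "show info", and treats any non-empty reply as proof the process is
// alive. The socket is not subject to maxconn, so this keeps working
// even when client traffic has exhausted the connection limit.
func checkAdminSocket(path string) error {
	conn, err := net.DialTimeout("unix", path, 2*time.Second)
	if err != nil {
		return fmt.Errorf("dial admin socket: %w", err)
	}
	defer conn.Close()
	_ = conn.SetDeadline(time.Now().Add(2 * time.Second))

	if _, err := conn.Write([]byte("show info\n")); err != nil {
		return fmt.Errorf("write command: %w", err)
	}
	out, err := io.ReadAll(conn)
	if err != nil {
		return fmt.Errorf("read response: %w", err)
	}
	if len(out) == 0 {
		return fmt.Errorf("empty response from admin socket")
	}
	return nil
}

func main() {
	// The socket path is an assumption for illustration.
	if err := checkAdminSocket("/var/lib/haproxy/run/haproxy.sock"); err != nil {
		fmt.Println("liveness check failed:", err)
		return
	}
	fmt.Println("liveness check ok")
}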

Manual test

Using https://github.com/mparram/test-backend.

Standard router image:

# each test-client pod sends ~500 req/s
$ oc -n test-client get pods
NAME                           READY   STATUS    RESTARTS   AGE
test-client-58c4687f55-5skdb   1/1     Running   0          9m13s
test-client-58c4687f55-bqffc   1/1     Running   0          9m13s
test-client-58c4687f55-gvsl7   1/1     Running   0          9m13s
test-client-58c4687f55-kpvm9   1/1     Running   0          9m13s
test-client-58c4687f55-wkxpp   1/1     Running   0          9m13s

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep image:
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep -A1 MAX_CONN
    - name: ROUTER_MAX_CONNECTIONS
      value: "2000"

# Router pods are restarting
$ oc -n openshift-ingress  get pods
NAME                             READY   STATUS    RESTARTS        AGE
router-default-d5db46b5d-2q7mh   1/1     Running   2 (8m11s ago)   11m
router-default-d5db46b5d-p6k9w   1/1     Running   4 (7m31s ago)   11m

$ oc -n openshift-ingress logs router-default-d5db46b5d-p6k9w -p
. . .
I0224 17:45:24.118761       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:27.608009       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:34.118481       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:37.607969       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:43.161206       1 template.go:844] "msg"="Shutdown requested, waiting 45s for new connections to cease" "logger"="router"
I0224 17:45:44.119374       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49768->127.0.0.1:80: i/o timeout
I0224 17:45:47.607867       1 healthz.go:311] backend-proxy-http,process-running check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49782->127.0.0.1:80: i/o timeout
[-]process-running failed: process is terminating

Router image with the fix:

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep -A1 MAX_CONN
    - name: ROUTER_MAX_CONNECTIONS
      value: "2000"

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep image:
    image: quay.io/alebedev/router:2.24.173
    image: quay.io/alebedev/router:2.24.173

# readiness probe goes on/off
$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m9s
router-default-7fc6b96c5b-thbsm   1/1     Running   0          4m10s

$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m41s
router-default-7fc6b96c5b-thbsm   0/1     Running   0          4m42s

# while liveness probe keeps running ok (no admin-socket healthz failures)
$ oc -n openshift-ingress logs router-default-7fc6b96c5b-qbnzp | grep -c admin-socket
0

$ oc -n openshift-ingress logs router-default-7fc6b96c5b-thbsm | grep CurrConns
CurrConns: 0
CurrConns: 1
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 6
CurrConns: 7
CurrConns: 8
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 931
CurrConns: 2000
CurrConns: 2000
CurrConns: 1998
CurrConns: 1994
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 824
CurrConns: 359
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 1810
CurrConns: 9
CurrConns: 10
CurrConns: 11
CurrConns: 12
CurrConns: 13
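
CurrConns is one of the counters in HAProxy's "show info" reply, and the values above show the admin socket still answering while the router sits at its maxconn limit (2000). A hypothetical sketch of extracting that counter from a "show info" reply; the parsing code is illustrative, not the router's:

package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// currConns scans a "show info" reply and returns the CurrConns value.
func currConns(showInfo string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(showInfo))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CurrConns:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "CurrConns:")))
		}
	}
	return 0, fmt.Errorf("CurrConns not found in show info output")
}

func main() {
	// Hypothetical reply excerpt.
	reply := "Name: HAProxy\nCurrConns: 2000\nMaxconn: 2000\n"
	n, err := currConns(reply)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("CurrConns =", n)
}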

@openshift-ci-robot added the jira/severity-critical (Referenced Jira bug's severity is critical for the branch this PR is targeting), jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type), and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels on Feb 24, 2026.
@openshift-ci-robot (Contributor)

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Replace the HTTP-based HAProxy liveness check with an admin socket show info command
  • The HTTP liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running
  • The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load
  • The readiness probe continues to use the HTTP backend check

Test plan

  • go build ./... passes
  • go test ./pkg/router/metrics/... ./pkg/cmd/infra/router/... passes
  • Deploy to a cluster and verify curl localhost:1936/healthz returns 200 when HAProxy is running
  • Verify the liveness probe restarts the container if HAProxy becomes unresponsive

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot (Contributor) commented Feb 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rfredette for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…check

Use HAProxy admin socket "show info" command for the liveness probe
instead of sending an HTTP request to the backend. This directly tests
whether the HAProxy process is alive and responsive, rather than
testing through the data plane.

The HTTP-based liveness check counts against HAProxy's maxconn limit.
When maxconn is reached due to client traffic, the liveness probe HTTP
request gets queued or rejected, causing probe failures and unnecessary
container restarts even though HAProxy is still running. The admin
socket is not subject to maxconn, so the liveness probe remains
reliable under high connection load.

The readiness probe continues to use the HTTP backend check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alebedev87 force-pushed the OCPBUGS-67161-liveness-probe-show-info branch from 9b63bde to 6329b86 on February 24, 2026 20:58.
@alebedev87 (Contributor, Author)

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Feb 24, 2026.
@openshift-ci-robot (Contributor)

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot requested a review from ShudiLi on February 24, 2026 21:02.
@alebedev87 (Contributor, Author)

/retest

@openshift-ci-robot (Contributor)

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

Details

In response to this:

(Quotes the PR description above: Summary and Manual test.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87 (Contributor, Author)

/retest

openshift-ci bot (Contributor) commented Feb 25, 2026

@alebedev87: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

12 similar comments

@jcmoraisjr (Member)

/assign

openshift-ci bot (Contributor) commented Feb 25, 2026

@alebedev87: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ShudiLi (Member) commented Feb 26, 2026

Tested it with 4.22.0-0-2026-02-26-031346-test-ci-ln-jir9lwt-latest.

1.
% oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.22.0-0-2026-02-26-031346-test-ci-ln-jir9lwt-latest   True        False         124m    Cluster version is 4.22.0-0-2026-02-26-031346-test-ci-ln-jir9lwt-latest

2. Create pods, services, and the route
% oc get pods,svc,route  
NAME                                 READY   STATUS    RESTARTS   AGE
pod/appach-server-66b4878747-29ntd   1/1     Running   0          48m
pod/appach-server-66b4878747-6fjxr   1/1     Running   0          48m
pod/appach-server-66b4878747-8tqrv   1/1     Running   0          48m
pod/appach-server-66b4878747-dvjd4   1/1     Running   0          48m
pod/appach-server-66b4878747-h8ftw   1/1     Running   0          48m
pod/appach-server-66b4878747-pvtql   1/1     Running   0          48m
pod/appach-server-66b4878747-rb9x2   1/1     Running   0          50m
pod/appach-server-66b4878747-v4wtf   1/1     Running   0          50m
pod/appach-server-66b4878747-z6dqm   1/1     Running   0          48m
pod/appach-server-66b4878747-zxcg8   1/1     Running   0          48m
pod/perf-tool-6c6f847b8-258tq        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-4lkf2        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-58fm6        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-6hj9f        1/1     Running   0          47m
pod/perf-tool-6c6f847b8-762nb        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-79kcd        1/1     Running   0          47m
pod/perf-tool-6c6f847b8-bb8gl        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-jrlvv        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-n2hpt        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-x8dqs        1/1     Running   0          47m

NAME                  TYPE           CLUSTER-IP      EXTERNAL-IP                            PORT(S)     AGE
service/kubernetes    ClusterIP      172.30.0.1      <none>                                 443/TCP     162m
service/openshift     ExternalName   <none>          kubernetes.default.svc.cluster.local   <none>      152m
service/unsec-apach   ClusterIP      172.30.52.119   <none>                                 28080/TCP   22m

NAME                                   HOST/PORT                                                             PATH   SERVICES      PORT          TERMINATION   WILDCARD
route.route.openshift.io/unsec-apach   unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org          unsec-apach   unsec-apach                 None

3. Let the 10 clients send traffic, each with 50k HTTP requests.
4. Let one client send 10k HTTP requests
sh-4.4# hey -n 10000 -c 10000 http://unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org

Summary:
  Total:	0.6044 secs
  Slowest:	0.5908 secs
  Fastest:	0.2375 secs
  Average:	0.4415 secs
  Requests/sec:	16544.2112
  
  Total data:	59024 bytes
  Size/request:	14 bytes

Response time histogram:
  0.237 [1]	|
  0.273 [0]	|
  0.308 [63]	|■■■
  0.343 [217]	|■■■■■■■■■■
  0.379 [463]	|■■■■■■■■■■■■■■■■■■■■■
  0.414 [822]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.449 [781]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.485 [740]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.520 [210]	|■■■■■■■■■
  0.555 [890]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.591 [29]	|■


Latency distribution:
  10% in 0.3536 secs
  25% in 0.3934 secs
  50% in 0.4351 secs
  75% in 0.5182 secs
  90% in 0.5309 secs
  95% in 0.5344 secs
  99% in 0.5394 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0441 secs, 0.2375 secs, 0.5908 secs
  DNS-lookup:	0.0788 secs, 0.0000 secs, 0.3231 secs
  req write:	0.0034 secs, 0.0000 secs, 0.1169 secs
  resp wait:	0.0814 secs, 0.0086 secs, 0.2291 secs
  resp read:	0.0029 secs, 0.0000 secs, 0.0846 secs

Status code distribution:
  [200]	4216 responses

Error distribution:
  [2974]	Get "http://unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org": dial tcp 52.54.60.220:80: socket: too many open files
  [2810]	Get "http://unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org": dial tcp: lookup unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org on 172.30.0.10:53: no such host

sh-4.4#

5. Checked the logs of the router pod; could not see reload or restart failures.
% oc -n openshift-ingress logs router-default-777cb868c5-vhdzj --tail=10                      
I0226 03:56:38.827512       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:56:43.791524       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:11.470493       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:19.234594       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:24.211068       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:33.155680       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:38.153022       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:46.442838       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:58:22.013179       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 05:54:23.334872       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"

@ShudiLi (Member) commented Feb 26, 2026

/label qe-approved
/verified by @ShudiLi

openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) on Feb 26, 2026.
openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) on Feb 26, 2026.
@openshift-ci-robot (Contributor)

@ShudiLi: This PR has been marked as verified by @ShudiLi.

Details

In response to this:

/label qe-approved
/verified by @ShudiLi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87 (Contributor, Author)

@ShudiLi: In your test, did you see failed readiness probes (due to maxconn being reached)?

Name: o.RouterName,
},
LiveChecks: liveChecks,
ReadyChecks: []healthz.HealthChecker{checkBackend, checkSync, metrics.ProcessRunning(stopCh)},
Member:

Any benefit or caveat on changing here as well?

// expecting a non-empty response.
func AdminSocketAvailable(u *url.URL) healthz.HealthChecker {
return healthz.NamedCheck("admin-socket", func(_ *http.Request) error {
conn, err := net.DialTimeout("unix", u.Path, 2*time.Second)
Member:

A raw socket works nicely, but the router already imports an HAProxy client that should make the code smaller and easier to understand; how does that sound?

		client := &haproxy.HAProxyClient{ // github.com/bcicen/go-haproxy
			Addr:    "unix:///var/lib/haproxy/run/haproxy.sock",
			Timeout: 2,
		}
		out, err := client.RunCommand("show info")

Member:

This seems to be the third place configuring the admin socket; maybe there's a place to configure it just once so everyone else can reuse it instead?
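
A hypothetical sketch of that suggestion, with illustrative package and names (not the repository's): define the socket address once and let the liveness check, metrics code, and any other callers reuse it.

package haproxyconf

import "net/url"

// adminSocketAddr is the single place the stats socket address is defined.
// The path here is an assumption for illustration.
const adminSocketAddr = "unix:///var/lib/haproxy/run/haproxy.sock"

// AdminSocketURL returns the parsed admin socket address so the liveness
// check, the metrics scraper, and other callers don't each hard-code it.
func AdminSocketURL() (*url.URL, error) {
	return url.Parse(adminSocketAddr)
}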


Labels

jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
qe-approved: Signifies that QE has signed off on this PR.
verified: Signifies that the PR passed pre-merge verification criteria.
