OCPBUGS-67161: Replace HTTP backend liveness check with admin socket check #737

Open

alebedev87 wants to merge 1 commit into openshift:master from alebedev87:OCPBUGS-67161-liveness-probe-show-info

Conversation

@alebedev87 (Contributor) commented Feb 24, 2026

Summary

  • Replace the HTTP-based HAProxy liveness check with an admin socket "show info" command (see the sketch after this list)
  • The HTTP liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running
  • The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load
  • The readiness probe continues to use the HTTP backend check
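
A minimal sketch of the admin-socket liveness check described above, assuming the stock HAProxy stats socket path; this illustrates the idea and is not the PR's exact code:

package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// checkAdminSocket dials HAProxy's admin (stats) unix socket, sends
// "show info", and treats any non-empty reply as proof the process is
// alive. The socket is not subject to maxconn, so this keeps working
// even when client traffic has exhausted the connection limit.
func checkAdminSocket(path string) error {
	conn, err := net.DialTimeout("unix", path, 2*time.Second)
	if err != nil {
		return fmt.Errorf("dial admin socket: %w", err)
	}
	defer conn.Close()
	_ = conn.SetDeadline(time.Now().Add(2 * time.Second))

	if _, err := conn.Write([]byte("show info\n")); err != nil {
		return fmt.Errorf("write command: %w", err)
	}
	out, err := io.ReadAll(conn)
	if err != nil {
		return fmt.Errorf("read response: %w", err)
	}
	if len(out) == 0 {
		return fmt.Errorf("empty response from admin socket")
	}
	return nil
}

func main() {
	// The socket path is an assumption for illustration.
	if err := checkAdminSocket("/var/lib/haproxy/run/haproxy.sock"); err != nil {
		fmt.Println("liveness check failed:", err)
		return
	}
	fmt.Println("liveness check ok")
}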

Manual test

Using https://github.com/mparram/test-backend.

Standard router image:

# each test-client pod sends ~500 req/s
$ oc -n test-client get pods
NAME                           READY   STATUS    RESTARTS   AGE
test-client-58c4687f55-5skdb   1/1     Running   0          9m13s
test-client-58c4687f55-bqffc   1/1     Running   0          9m13s
test-client-58c4687f55-gvsl7   1/1     Running   0          9m13s
test-client-58c4687f55-kpvm9   1/1     Running   0          9m13s
test-client-58c4687f55-wkxpp   1/1     Running   0          9m13s

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep image:
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep -A1 MAX_CONN
    - name: ROUTER_MAX_CONNECTIONS
      value: "2000"

# Router pods are restarting
$ oc -n openshift-ingress  get pods
NAME                             READY   STATUS    RESTARTS        AGE
router-default-d5db46b5d-2q7mh   1/1     Running   2 (8m11s ago)   11m
router-default-d5db46b5d-p6k9w   1/1     Running   4 (7m31s ago)   11m

$ oc -n openshift-ingress logs router-default-d5db46b5d-p6k9w -p
. . .
I0224 17:45:24.118761       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:27.608009       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:34.118481       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:37.607969       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:43.161206       1 template.go:844] "msg"="Shutdown requested, waiting 45s for new connections to cease" "logger"="router"
I0224 17:45:44.119374       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49768->127.0.0.1:80: i/o timeout
I0224 17:45:47.607867       1 healthz.go:311] backend-proxy-http,process-running check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49782->127.0.0.1:80: i/o timeout
[-]process-running failed: process is terminating

Router image with the fix:

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep -A1 MAX_CONN
    - name: ROUTER_MAX_CONNECTIONS
      value: "2000"

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep image:
    image: quay.io/alebedev/router:2.24.173
    image: quay.io/alebedev/router:2.24.173

# readiness probe goes on/off
$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m9s
router-default-7fc6b96c5b-thbsm   1/1     Running   0          4m10s

$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m41s
router-default-7fc6b96c5b-thbsm   0/1     Running   0          4m42s

# while liveness probe keeps running ok (no admin-socket healthz failures)
$ oc -n openshift-ingress logs router-default-7fc6b96c5b-qbnzp | grep -c admin-socket
0

$ oc -n openshift-ingress logs router-default-7fc6b96c5b-thbsm | grep CurrConns
CurrConns: 0
CurrConns: 1
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 6
CurrConns: 7
CurrConns: 8
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 931
CurrConns: 2000
CurrConns: 2000
CurrConns: 1998
CurrConns: 1994
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 824
CurrConns: 359
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 1810
CurrConns: 9
CurrConns: 10
CurrConns: 11
CurrConns: 12
CurrConns: 13
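
CurrConns is one of the counters in HAProxy's "show info" reply, and the values above show the admin socket still answering while the router sits at its maxconn limit (2000). A hypothetical sketch of extracting that counter from a "show info" reply; the parsing code is illustrative, not the router's:

package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// currConns scans a "show info" reply and returns the CurrConns value.
func currConns(showInfo string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(showInfo))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CurrConns:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "CurrConns:")))
		}
	}
	return 0, fmt.Errorf("CurrConns not found in show info output")
}

func main() {
	// Hypothetical reply excerpt.
	reply := "Name: HAProxy\nCurrConns: 2000\nMaxconn: 2000\n"
	n, err := currConns(reply)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("CurrConns =", n)
}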

@openshift-ci-robot added the jira/severity-critical (Referenced Jira bug's severity is critical for the branch this PR is targeting), jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type), and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels on Feb 24, 2026.
@openshift-ci-robot (Contributor)

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Replace the HTTP-based HAProxy liveness check with an admin socket show info command
  • The HTTP liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running
  • The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load
  • The readiness probe continues to use the HTTP backend check

Test plan

  • go build ./... passes
  • go test ./pkg/router/metrics/... ./pkg/cmd/infra/router/... passes
  • Deploy to a cluster and verify curl localhost:1936/healthz returns 200 when HAProxy is running
  • Verify the liveness probe restarts the container if HAProxy becomes unresponsive

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot (Contributor) commented Feb 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rfredette for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…check

Use HAProxy admin socket "show info" command for the liveness probe
instead of sending an HTTP request to the backend. This directly tests
whether the HAProxy process is alive and responsive, rather than
testing through the data plane.

The HTTP-based liveness check counts against HAProxy's maxconn limit.
When maxconn is reached due to client traffic, the liveness probe HTTP
request gets queued or rejected, causing probe failures and unnecessary
container restarts even though HAProxy is still running. The admin
socket is not subject to maxconn, so the liveness probe remains
reliable under high connection load.

The readiness probe continues to use the HTTP backend check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alebedev87 force-pushed the OCPBUGS-67161-liveness-probe-show-info branch from 9b63bde to 6329b86 on February 24, 2026 20:58.
@alebedev87 (Contributor, Author)

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Feb 24, 2026.
@openshift-ci-robot (Contributor)

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot requested a review from ShudiLi on February 24, 2026 21:02.
@alebedev87 (Contributor, Author)

/retest

@openshift-ci-robot (Contributor)

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

Details

In response to this:

(Quotes the PR description above: Summary and Manual test.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87 (Contributor, Author)

/retest

openshift-ci bot (Contributor) commented Feb 25, 2026

@alebedev87: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

12 similar comments

@jcmoraisjr (Member)

/assign

openshift-ci bot (Contributor) commented Feb 25, 2026

@alebedev87: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ShudiLi (Member) commented Feb 26, 2026

Tested it with 4.22.0-0-2026-02-26-031346-test-ci-ln-jir9lwt-latest.

1.
% oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.22.0-0-2026-02-26-031346-test-ci-ln-jir9lwt-latest   True        False         124m    Cluster version is 4.22.0-0-2026-02-26-031346-test-ci-ln-jir9lwt-latest

2. Create pods, services, and the route
% oc get pods,svc,route  
NAME                                 READY   STATUS    RESTARTS   AGE
pod/appach-server-66b4878747-29ntd   1/1     Running   0          48m
pod/appach-server-66b4878747-6fjxr   1/1     Running   0          48m
pod/appach-server-66b4878747-8tqrv   1/1     Running   0          48m
pod/appach-server-66b4878747-dvjd4   1/1     Running   0          48m
pod/appach-server-66b4878747-h8ftw   1/1     Running   0          48m
pod/appach-server-66b4878747-pvtql   1/1     Running   0          48m
pod/appach-server-66b4878747-rb9x2   1/1     Running   0          50m
pod/appach-server-66b4878747-v4wtf   1/1     Running   0          50m
pod/appach-server-66b4878747-z6dqm   1/1     Running   0          48m
pod/appach-server-66b4878747-zxcg8   1/1     Running   0          48m
pod/perf-tool-6c6f847b8-258tq        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-4lkf2        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-58fm6        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-6hj9f        1/1     Running   0          47m
pod/perf-tool-6c6f847b8-762nb        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-79kcd        1/1     Running   0          47m
pod/perf-tool-6c6f847b8-bb8gl        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-jrlvv        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-n2hpt        1/1     Running   0          46m
pod/perf-tool-6c6f847b8-x8dqs        1/1     Running   0          47m

NAME                  TYPE           CLUSTER-IP      EXTERNAL-IP                            PORT(S)     AGE
service/kubernetes    ClusterIP      172.30.0.1      <none>                                 443/TCP     162m
service/openshift     ExternalName   <none>          kubernetes.default.svc.cluster.local   <none>      152m
service/unsec-apach   ClusterIP      172.30.52.119   <none>                                 28080/TCP   22m

NAME                                   HOST/PORT                                                             PATH   SERVICES      PORT          TERMINATION   WILDCARD
route.route.openshift.io/unsec-apach   unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org          unsec-apach   unsec-apach                 None

3. Let the 10 clients send traffic, each with 50k HTTP requests.
4. Let one client send 10k HTTP requests
sh-4.4# hey -n 10000 -c 10000 http://unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org

Summary:
  Total:	0.6044 secs
  Slowest:	0.5908 secs
  Fastest:	0.2375 secs
  Average:	0.4415 secs
  Requests/sec:	16544.2112
  
  Total data:	59024 bytes
  Size/request:	14 bytes

Response time histogram:
  0.237 [1]	|
  0.273 [0]	|
  0.308 [63]	|■■■
  0.343 [217]	|■■■■■■■■■■
  0.379 [463]	|■■■■■■■■■■■■■■■■■■■■■
  0.414 [822]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.449 [781]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.485 [740]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.520 [210]	|■■■■■■■■■
  0.555 [890]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.591 [29]	|■


Latency distribution:
  10% in 0.3536 secs
  25% in 0.3934 secs
  50% in 0.4351 secs
  75% in 0.5182 secs
  90% in 0.5309 secs
  95% in 0.5344 secs
  99% in 0.5394 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0441 secs, 0.2375 secs, 0.5908 secs
  DNS-lookup:	0.0788 secs, 0.0000 secs, 0.3231 secs
  req write:	0.0034 secs, 0.0000 secs, 0.1169 secs
  resp wait:	0.0814 secs, 0.0086 secs, 0.2291 secs
  resp read:	0.0029 secs, 0.0000 secs, 0.0846 secs

Status code distribution:
  [200]	4216 responses

Error distribution:
  [2974]	Get "http://unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org": dial tcp 52.54.60.220:80: socket: too many open files
  [2810]	Get "http://unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org": dial tcp: lookup unsec-apach-default.apps.ci-ln-jir9lwt-76ef8.aws-4.ci.openshift.org on 172.30.0.10:53: no such host

sh-4.4#

5. Checked the logs of the router pod; could not see reload or restart failures.
% oc -n openshift-ingress logs router-default-777cb868c5-vhdzj --tail=10                      
I0226 03:56:38.827512       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:56:43.791524       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:11.470493       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:19.234594       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:24.211068       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:33.155680       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:38.153022       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:57:46.442838       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 03:58:22.013179       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0226 05:54:23.334872       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"

@ShudiLi (Member) commented Feb 26, 2026

/label qe-approved
/verified by @ShudiLi

openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) on Feb 26, 2026.
openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) on Feb 26, 2026.
@openshift-ci-robot (Contributor)

@ShudiLi: This PR has been marked as verified by @ShudiLi.

Details

In response to this:

/label qe-approved
/verified by @ShudiLi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87 (Contributor, Author)

@ShudiLi: In your test, did you see failed readiness probes (due to maxconn being reached)?

Name: o.RouterName,
},
LiveChecks: liveChecks,
ReadyChecks: []healthz.HealthChecker{checkBackend, checkSync, metrics.ProcessRunning(stopCh)},
Member:

Any benefit or caveat on changing here as well?

// expecting a non-empty response.
func AdminSocketAvailable(u *url.URL) healthz.HealthChecker {
return healthz.NamedCheck("admin-socket", func(_ *http.Request) error {
conn, err := net.DialTimeout("unix", u.Path, 2*time.Second)
Member:

A raw socket works nicely, but the router already imports an HAProxy client that should make the code smaller and easier to understand; how does that sound?

		client := &haproxy.HAProxyClient{ // github.com/bcicen/go-haproxy
			Addr:    "unix:///var/lib/haproxy/run/haproxy.sock",
			Timeout: 2,
		}
		out, err := client.RunCommand("show info")

Member:

This seems to be the third place configuring the admin socket; maybe there's a place to configure it just once so everyone else can reuse it instead?
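
A hypothetical sketch of that suggestion, with illustrative package and names (not the repository's): define the socket address once and let the liveness check, metrics code, and any other callers reuse it.

package haproxyconf

import "net/url"

// adminSocketAddr is the single place the stats socket address is defined.
// The path here is an assumption for illustration.
const adminSocketAddr = "unix:///var/lib/haproxy/run/haproxy.sock"

// AdminSocketURL returns the parsed admin socket address so the liveness
// check, the metrics scraper, and other callers don't each hard-code it.
func AdminSocketURL() (*url.URL, error) {
	return url.Parse(adminSocketAddr)
}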


Labels

jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
qe-approved: Signifies that QE has signed off on this PR.
verified: Signifies that the PR passed pre-merge verification criteria.
