OCPBUGS-67161: Replace HTTP backend liveness check with admin socket check #737
alebedev87 wants to merge 1 commit into openshift:master
…check

Use HAProxy admin socket "show info" command for the liveness probe instead of sending an HTTP request to the backend. This directly tests whether the HAProxy process is alive and responsive, rather than testing through the data plane.

The HTTP-based liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the liveness probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running.

The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load. The readiness probe continues to use the HTTP backend check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
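To make the mechanism in the commit message concrete, here is a minimal Go sketch of an admin-socket liveness check. It is not the code from this PR: `checkAdminSocket`, `probeTimeout`, and `serveFakeAdminSocket` are hypothetical names, and the fake listener only stands in for HAProxy so the example runs without a router. The idea is simply to dial the Unix socket, send `show info`, and treat any non-empty reply as a sign that the process is responsive.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// probeTimeout bounds the dial and the exchange; 2s mirrors the dial
// timeout visible in the diff, but the value is an assumption here.
const probeTimeout = 2 * time.Second

// checkAdminSocket sends "show info" to the admin socket at path and
// returns nil only if a non-empty first line comes back in time.
func checkAdminSocket(path string, timeout time.Duration) error {
	conn, err := net.DialTimeout("unix", path, timeout)
	if err != nil {
		return fmt.Errorf("dial admin socket: %w", err)
	}
	defer conn.Close()
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return fmt.Errorf("set deadline: %w", err)
	}
	if _, err := conn.Write([]byte("show info\n")); err != nil {
		return fmt.Errorf("send command: %w", err)
	}
	line, err := bufio.NewReader(conn).ReadString('\n')
	if strings.TrimSpace(line) == "" {
		return fmt.Errorf("empty response from admin socket (read error: %v)", err)
	}
	return nil
}

// serveFakeAdminSocket is a stand-in for HAProxy used only for this demo:
// it answers a single connection with reply and returns the socket path
// plus a cleanup function.
func serveFakeAdminSocket(reply string) (string, func(), error) {
	dir, err := os.MkdirTemp("", "fake-admin")
	if err != nil {
		return "", nil, err
	}
	path := filepath.Join(dir, "haproxy.sock")
	ln, err := net.Listen("unix", path)
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		bufio.NewReader(conn).ReadString('\n') // consume the command line
		conn.Write([]byte(reply))
	}()
	cleanup := func() { ln.Close(); os.RemoveAll(dir) }
	return path, cleanup, nil
}

func main() {
	path, cleanup, err := serveFakeAdminSocket("Name: HAProxy\n")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	fmt.Println("liveness error:", checkAdminSocket(path, probeTimeout))
}
```

Because the check never opens a frontend connection, it does not compete with client traffic for a connection slot, which is the property the commit message relies on.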
|
/jira refresh |
|
@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. |
|
/retest |
|
/assign |
|
@alebedev87: all tests passed! |
|
tested it with 4.22.0-0-2026-02-26-031346-test-ci-ln-jir9lwt-latest |
|
/label qe-approved |
|
@ShudiLi: This PR has been marked as verified. |
		Name: o.RouterName,
	},
	LiveChecks:  liveChecks,
	ReadyChecks: []healthz.HealthChecker{checkBackend, checkSync, metrics.ProcessRunning(stopCh)},
Any benefit or caveat on changing here as well?
// expecting a non-empty response.
func AdminSocketAvailable(u *url.URL) healthz.HealthChecker {
	return healthz.NamedCheck("admin-socket", func(_ *http.Request) error {
		conn, err := net.DialTimeout("unix", u.Path, 2*time.Second)
The raw socket works nicely, but the router already imports an HAProxy client that should make the code smaller and easier to understand. How does that sound?
client := &haproxy.HAProxyClient{ // github.com/bcicen/go-haproxy
	Addr:    "unix:///var/lib/haproxy/run/haproxy.sock",
	Timeout: 2,
}
out, err := client.RunCommand("show info")
This seems to be the third place configuring the admin socket. Maybe there's a place to configure it just once, so everyone else can reuse it?
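One possible shape for the "configure it once" idea is sketched below. This is only an illustration under stated assumptions, not how the router is organized: `AdminSocket` and `ParseAdminSocket` are hypothetical names. The address is parsed a single time, and each consumer picks the form it needs, either the filesystem path for a raw `net.DialTimeout("unix", ...)` call or the full `unix://` address for a client library.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// AdminSocket is a hypothetical single source of truth for the admin
// socket settings, so that callers do not each hard-code the path.
type AdminSocket struct {
	Raw     string        // full address, e.g. "unix:///path/to/haproxy.sock"
	Path    string        // filesystem path, suitable for net.Dial("unix", Path)
	Timeout time.Duration // timeout shared by every consumer
}

// ParseAdminSocket validates raw once and returns the shared settings.
func ParseAdminSocket(raw string, timeout time.Duration) (AdminSocket, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return AdminSocket{}, fmt.Errorf("parse admin socket address: %w", err)
	}
	if u.Scheme != "unix" || u.Path == "" {
		return AdminSocket{}, fmt.Errorf("admin socket address %q must look like unix:///path", raw)
	}
	return AdminSocket{Raw: raw, Path: u.Path, Timeout: timeout}, nil
}

func main() {
	sock, err := ParseAdminSocket("unix:///var/lib/haproxy/run/haproxy.sock", 2*time.Second)
	if err != nil {
		fmt.Println("invalid admin socket:", err)
		return
	}
	fmt.Println(sock.Path, sock.Timeout)
}
```

The trade-off is one more small type to thread through constructors in exchange for a single place to change the path or timeout.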
Summary
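- Use the HAProxy admin socket "show info" command for the liveness probe instead of sending an HTTP request to the backend.
- The HTTP-based liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running.
- The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load.

Manual test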
Using https://github.com/mparram/test-backend.
Standard router image:
Router image with the fix: