Better shutdown. #867

ZanCorDX · 2026-01-22T20:59:44Z

📝 Summary

Shutdown is now performed in 2 steps:
1- Cancellation: We propagate a cancel signal so we stop building, data stops flowing and channels close.
We expect all critical threads to finish.
2- Abort: If some critical thread is still alive we trigger a second signal so critical tasks stop waiting for data (assuming the source is hanged) and perform any necessary clean up.
Sleeps were replaced by JoinHandles.

✅ I have completed the following steps:

Run make lint
Run make test
Added tests (if applicable)

claude · 2026-01-22T21:00:05Z

Claude finished @ZanCorDX's task in 3m 28s —— View job

🔍 PR Review: Better Shutdown

This PR improves the shutdown mechanism by replacing sleep-based waiting with JoinHandle-based task tracking. This is a significant improvement for graceful shutdown handling.

Summary

The PR introduces a two-phase shutdown mechanism:

Graceful phase: Wait for all tasks (inner jobs + critical tasks) with a timeout (GRACEFUL_SHUTDOWN_TIMEOUT = 12s)
Critical phase: If timeout reached, signal abort and wait only for critical tasks with a second timeout (CRITICAL_SHUTDOWN_TIMEOUT = 10s)

This replaces the previous approach that used arbitrary sleeps to wait for tasks to complete.

✅ Strengths

Proper task tracking: Using JoinHandles instead of sleeps ensures tasks actually complete rather than just hoping they finish within a sleep duration
Two-phase shutdown: The separation between graceful and critical shutdown phases is a well-designed pattern for handling different priorities during shutdown
Abort token: The new global_abort token allows critical tasks (like clickhouse) to know when upstream is hung and perform cleanup immediately
FuturesUnordered: Efficient concurrent waiting on multiple handles using biased select for deterministic behavior
Clear timeout constants: Well-documented constants in process_killer.rs explaining each timeout purpose
Clean JoinHandle wrapping: The pattern in clickhouse/mod.rs of wrapping both inserter and backup handles in a single task is elegant

⚠️ Minor Issues Identified (see inline comments)

Unused constant: RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS is now unused after removing RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME
Log message clarity: The log at line 605 says "No critical tasks remaining" but logs non_critical_task_remaining, which is slightly confusing
Timeout logging: Consider logging the count of remaining tasks when graceful timeout is reached to aid debugging

📋 Checklist Verification

Item	Status
Code compiles	✅ Appears correct
No security issues	✅ No concerns
Error handling	✅ Properly handles JoinHandle errors
Documentation	✅ Good doc comments on new structs/functions
Tests	⚠️ PR notes "Added tests (if applicable)" is unchecked - consider adding tests for the shutdown logic

🎯 Overall Assessment

Approve with minor suggestions. This is a well-designed improvement to the shutdown mechanism. The two-phase approach with proper task tracking is much more robust than the previous sleep-based approach. The inline comments contain minor suggestions that can be addressed in a follow-up if desired.

claude · 2026-01-22T21:02:51Z

crates/rbuilder/src/live_builder/process_killer.rs

-    Duration::from_secs(RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS);
+/// v3 might need to keep the server running for a while until the slot ends.
+pub const OPTIMISTIC_V3_CLOSE_TIME_SECONDS: u64 = SLOT_DURATION_SECS;
+


Good addition of the OPTIMISTIC_V3_CLOSE_TIME_SECONDS constant. However, it appears that RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS (defined at line 14) is now unused since RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME was removed. Consider cleaning up that unused constant to avoid dead code.

crates/rbuilder/src/live_builder/mod.rs

claude · 2026-01-22T21:03:08Z

crates/rbuilder-operator/src/clickhouse.rs

-            tokio::time::sleep(RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME).await;
-            task_manager.graceful_shutdown_with_timeout(Duration::from_secs(5));
+            abort_token.cancelled().await;
+            task_manager.graceful_shutdown();


Good design pattern: This task forwards the abort signal to the task_manager, which is cleaner than the previous approach that used a hardcoded sleep. However, ensure that task_manager.graceful_shutdown() properly signals all child tasks to complete their cleanup work before the clickhouse_shutdown_handle completes.

crates/rbuilder/src/live_builder/mod.rs

claude · 2026-01-22T21:03:22Z

crates/rbuilder-utils/src/clickhouse/mod.rs

+    let backup_handle = backup.spawn(task_executor, backup_table_name, tracing_target);
+
+    // Spawn a wrapper task that waits for both to complete
+    tokio::spawn(async move {


Good pattern: Wrapping both handles in a single task that waits for both to complete is a clean solution. This ensures the returned JoinHandle only completes when both the inserter and backup tasks have finished their cleanup.

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

claude · 2026-01-22T21:32:43Z

Claude finished @ZanCorDX's task in 1m 41s —— View job

🔍 PR Review: Better Shutdown (Re-review after sync)

Summary

Reviewed the latest sync which added logging of remaining tasks when graceful shutdown timeout is reached. Found one issue that should be addressed.

⚠️ Issue Found: Duplicate Warning Message

The latest commit introduced a duplicate warning message at crates/rbuilder/src/live_builder/mod.rs:577-582. The first warn! at line 577 is now redundant with the more informative version below it. See inline comment for suggested fix.

⚠️ Minor Issue from Previous Review

Log message at line 610: Says "No critical tasks remaining" but logs non_critical_task_remaining = inner_jobs.len(). The field name is slightly confusing since inner_jobs are the non-critical tasks that may still be running.

✅ Overall Assessment

The PR is in good shape. The two-phase shutdown mechanism with proper task tracking is a significant improvement over the previous sleep-based approach. Just needs the duplicate warning removed.

claude · 2026-01-22T21:34:26Z

crates/rbuilder/src/live_builder/mod.rs

+                }
+            }
+            _ = tokio::time::sleep_until(graceful_deadline) => {
+                warn!("Graceful shutdown timeout reached, signaling abort");


This line appears to be redundant now that the more informative warning with task counts was added below. Consider removing it:

Suggested change

warn!("Graceful shutdown timeout reached, signaling abort");

claude · 2026-01-23T13:34:59Z

Claude finished @ZanCorDX's task in 1m 19s —— View job

🔍 PR Review: Better Shutdown (Re-review after sync)

Summary

Reviewed the latest sync (commits 4211920 - lint, 845b32b - V3 wait handle). The PR is in good shape overall with the two-phase shutdown mechanism being a significant improvement over sleep-based waiting.

Previous Review Status

Issue	Status
Unused `RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS` constant	✅ Fixed (removed from `process_killer.rs`)
Duplicate warning message at lines 577-578	⚠️ Still present (see inline comment)
Log message clarity at line 610	⚠️ Minor issue remains (see inline comment)
Logging remaining tasks on timeout	✅ Added in commit `38bcc3e`

⚠️ Minor Issues (see inline comments)

Duplicate warning at mod.rs:577-578: The first warn! is now redundant since the second one includes task counts. Simply remove line 577.
Field naming at mod.rs:610: The field non_critical_task_remaining could be renamed to remaining_inner_jobs for consistency with the warning message above.

✅ Overall Assessment

Approve with minor nits. The core shutdown mechanism is well-designed:

✅ Proper task tracking with JoinHandles
✅ Two-phase shutdown (graceful → critical)
✅ FuturesUnordered for efficient concurrent waiting
✅ Clean timeout constants in process_killer.rs
✅ Good pattern for combining handles in clickhouse/mod.rs:105-107

The inline comments are minor suggestions that can be addressed in a follow-up if desired.

claude · 2026-01-23T13:36:15Z

crates/rbuilder/src/live_builder/mod.rs

+                warn!("Graceful shutdown timeout reached, signaling abort");
+                warn!(


This line is now redundant with the more informative warning that includes task counts on the next line. Consider removing the duplicate:

Suggested change

warn!("Graceful shutdown timeout reached, signaling abort");

warn!(

warn!(

crates/rbuilder/src/live_builder/mod.rs

claude · 2026-01-23T15:23:15Z

Claude finished @ZanCorDX's task in 1m 23s —— View job

🔍 PR Review: Better Shutdown (Re-review after latest sync)

Summary

Reviewed the latest state after commit 92fb077 (removed warning). The PR is in excellent shape with the two-phase shutdown mechanism being a significant improvement over the previous sleep-based approach.

✅ Previous Issues Status

Issue	Status
Unused `RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS` constant	✅ Fixed
Duplicate warning message	✅ Fixed in commit `92fb077`
Logging remaining tasks on timeout	✅ Added

⚠️ Minor Suggestion (Optional)

Field naming at mod.rs:609: The field non_critical_task_remaining logs inner_jobs.len(). Consider renaming to remaining_inner_jobs for consistency with the warning message at line 578-580, though this is a very minor nit.

✅ Architecture Review

The two-phase shutdown mechanism is well-designed:

Phase 1 (Graceful): Wait for all tasks (inner jobs + critical tasks) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Strengths:

✅ Proper task tracking with JoinHandles instead of sleeps
✅ FuturesUnordered with biased select for efficient concurrent waiting
✅ Clean separation of critical vs non-critical tasks
✅ Clear timeout constants in process_killer.rs with good documentation
✅ Good pattern in clickhouse/mod.rs:105-107 wrapping both inserter and backup handles
✅ Optimistic V3 server properly waits OPTIMISTIC_V3_CLOSE_TIME_SECONDS before shutdown to avoid relay demotion

📋 Final Assessment

Approve ✅

The PR is ready for merge. All previously identified issues have been addressed. The shutdown mechanism is well-designed and properly handles graceful degradation when tasks don't complete in time.

claude · 2026-01-23T15:53:58Z

Claude finished @ZanCorDX's task in 1m 15s —— View job

🔍 PR Review: Better Shutdown (Final Re-review)

Summary

Reviewed the latest state after commit ffed962 (comment, renames). The PR is in excellent shape. All previously identified issues have been addressed, and the latest commit improves code clarity with better variable naming.

✅ Previous Issues Status

Issue	Status
Unused `RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS` constant	✅ Fixed
Duplicate warning message	✅ Fixed in commit `92fb077`
Logging remaining tasks on timeout	✅ Added
Variable naming (`inner_jobs` → `non_critical_tasks`)	✅ Fixed in commit `ffed962`

📝 Latest Commit Analysis (`ffed962`)

The latest commit improves code clarity:

Renamed inner_jobs_handles → non_critical_tasks_join_handles
Renamed inner_jobs → non_critical_tasks
Updated documentation comment to reference "non-critical and critical tasks"

This makes the two-phase shutdown logic much clearer to understand.

⚠️ Minor Remaining Issue

Log field name at mod.rs:609: The field non_critical_task_remaining is slightly awkward given the variable was renamed to non_critical_tasks. Consider either:

Renaming to remaining_non_critical_tasks for consistency, OR
This is minor enough to leave as-is

// Current:
non_critical_task_remaining = non_critical_tasks.len(),

// Suggested (optional):
remaining_non_critical_tasks = non_critical_tasks.len(),

✅ Architecture Review

The two-phase shutdown mechanism is well-designed:

Phase 1 (Graceful): Wait for all tasks (non-critical + critical) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Strengths:

✅ Proper task tracking with JoinHandles instead of sleeps
✅ FuturesUnordered with biased select for efficient concurrent waiting
✅ Clean separation of critical vs non-critical tasks
✅ Clear timeout constants in process_killer.rs with good documentation
✅ Good pattern in clickhouse/mod.rs:105-107 wrapping both inserter and backup handles
✅ Optimistic V3 server properly waits OPTIMISTIC_V3_CLOSE_TIME_SECONDS before shutdown to avoid relay demotion
✅ Abort token forwarding in clickhouse.rs:249-253 cleanly handles upstream hang scenarios

📋 Final Assessment

Approve ✅

The PR is ready for merge. This is a significant improvement to the shutdown mechanism:

Replaces arbitrary sleeps with proper task tracking - Tasks are now awaited directly via JoinHandles rather than hoping they complete within a sleep duration
Two-phase shutdown ensures data integrity - Critical tasks (like clickhouse backup) get additional time to complete even if non-critical tasks hang
Clean code organization - Timeout constants are centralized in process_killer.rs with clear documentation of each timeout's purpose
Proper abort signaling - The global_abort token allows tasks to know when upstream is hung and perform immediate cleanup

The one minor nit about field naming is truly optional and doesn't affect functionality.

claude · 2026-01-23T15:59:23Z

Claude finished @ZanCorDX's task in 1m 23s —— View job

🔍 PR Review: Better Shutdown (Final Review)

Summary

Reviewed the latest state after commit 72f789b (rename). The PR is in excellent shape. All previously identified issues have been addressed, and the latest commit fixes the field naming inconsistency at line 609.

✅ All Previous Issues Resolved

Issue	Status
Unused `RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS` constant	✅ Fixed
Duplicate warning message	✅ Fixed in commit `92fb077`
Logging remaining tasks on timeout	✅ Added
Variable naming (`inner_jobs` → `non_critical_tasks`)	✅ Fixed in commit `ffed962`
Field naming at line 609 (`non_critical_task_remaining` → `remaining_non_critical_tasks`)	✅ Fixed in commit `72f789b`

⚠️ One Minor Remaining Nit (Optional)

There's one small inconsistency remaining: at line 578, the field is named remaining_inner_jobs while the variable is non_critical_tasks. For full consistency with line 609, this could be remaining_non_critical_tasks. I've added an inline comment with a suggestion.

✅ Architecture Review

The two-phase shutdown mechanism is well-designed:

Phase 1 (Graceful): Wait for all tasks (non-critical + critical) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Strengths:

✅ Proper task tracking with JoinHandles instead of sleeps
✅ FuturesUnordered with biased select for efficient concurrent waiting
✅ Clean separation of critical vs non-critical tasks
✅ Clear timeout constants in process_killer.rs with good documentation
✅ Good pattern in clickhouse/mod.rs:105-107 wrapping both inserter and backup handles in a single task
✅ Optimistic V3 server properly waits OPTIMISTIC_V3_CLOSE_TIME_SECONDS before shutdown to avoid relay demotion
✅ Abort token forwarding in clickhouse.rs:249-253 cleanly handles upstream hang scenarios

📋 Final Assessment

Approve ✅

The PR is ready for merge. This is a significant improvement to the shutdown mechanism:

Replaces arbitrary sleeps with proper task tracking - Tasks are now awaited directly via JoinHandles rather than hoping they complete within a sleep duration
Two-phase shutdown ensures data integrity - Critical tasks (like clickhouse backup) get additional time to complete even if non-critical tasks hang
Clean code organization - Timeout constants are centralized in process_killer.rs with clear documentation of each timeout's purpose
Proper abort signaling - The global_abort token allows tasks to know when upstream is hung and perform immediate cleanup

The one remaining nit about field naming at line 578 is truly optional and doesn't affect functionality.

claude · 2026-01-23T16:00:36Z

crates/rbuilder/src/live_builder/mod.rs

+            }
+            _ = tokio::time::sleep_until(graceful_deadline) => {
+                warn!(
+                    remaining_inner_jobs = non_critical_tasks.len(),


Minor: For consistency with the renamed variable (non_critical_tasks) and the field name at line 609 (remaining_non_critical_tasks), consider using the same naming pattern here:

Suggested change

remaining_inner_jobs = non_critical_tasks.len(),

remaining_non_critical_tasks = non_critical_tasks.len(),

claude · 2026-01-23T17:09:33Z

Claude finished @ZanCorDX's task in 1m 47s —— View job

🔍 PR Review: Better Shutdown

Progress

Read CLAUDE.md for repository guidelines
Analyze the PR changes
Review modified files in detail
Check for any remaining issues from previous reviews
Provide final assessment

✅ All Previous Issues Resolved

Issue	Status
Unused `RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS` constant	✅ Fixed
Duplicate warning message	✅ Fixed in commit `92fb077`
Logging remaining tasks on timeout	✅ Added
Variable naming (`inner_jobs` → `non_critical_tasks`)	✅ Fixed in commit `ffed962`
Field naming at line 578	✅ Fixed in commit `72f789b`
Field naming at line 609	✅ Fixed in commit `72f789b`

✅ Architecture Review

The two-phase shutdown mechanism is well-designed and significantly improves upon the previous sleep-based approach:

Phase 1 (Graceful): Wait for all tasks (non-critical + critical) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Key Strengths:

Proper task tracking with JoinHandles (mod.rs:552-554): Tasks are now awaited directly rather than hoping they complete within a sleep duration
FuturesUnordered with biased select (mod.rs:564-586): Efficient concurrent waiting with deterministic behavior
Clean separation of task types: Critical tasks (like clickhouse) get additional time even if non-critical tasks hang
Centralized timeout constants (process_killer.rs:14-37): Well-documented constants explain each timeout's purpose
Good handle wrapping pattern (clickhouse/mod.rs:105-107): Wrapping both inserter and backup handles in a single task ensures both complete before shutdown
Abort token forwarding (clickhouse.rs:249-253): Cleanly handles upstream hang scenarios by forwarding abort to task_manager
V3 server critical task (flashbots_config.rs:211-213, config.rs:564-566): Optimistic V3 server properly waits to avoid relay demotion

📋 Final Assessment

Approve ✅

The PR is ready for merge. This is a significant improvement to the shutdown mechanism:

Replaces arbitrary sleeps with proper task tracking - Tasks are now awaited directly via JoinHandles
Two-phase shutdown ensures data integrity - Critical tasks get additional time to complete
Clean code organization - Timeout constants are centralized with clear documentation
Proper abort signaling - The global_abort token allows tasks to perform immediate cleanup when upstream is hung

All previously identified issues have been addressed. The code is well-structured and the shutdown logic is clear and robust.

waits more

f103986

claude bot reviewed Jan 22, 2026

View reviewed changes

crates/rbuilder/src/live_builder/mod.rs Outdated Show resolved Hide resolved

claude bot reviewed Jan 22, 2026

View reviewed changes

crates/rbuilder/src/live_builder/mod.rs Show resolved Hide resolved

claude bot reviewed Jan 22, 2026

View reviewed changes

Update crates/rbuilder/src/live_builder/mod.rs

38bcc3e

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

claude bot reviewed Jan 22, 2026

View reviewed changes

ZanCorDX added 3 commits January 23, 2026 10:33

removed sleep

86329a1

V3 wait handle

845b32b

lint

4211920

claude bot reviewed Jan 23, 2026

View reviewed changes

crates/rbuilder/src/live_builder/mod.rs Outdated Show resolved Hide resolved

removed warning

92fb077

comment, renames

ffed962

rename

72f789b

claude bot reviewed Jan 23, 2026

View reviewed changes

ZanCorDX marked this pull request as ready for review January 23, 2026 16:06

ZanCorDX requested a review from dvush as a code owner January 23, 2026 16:06

log text

b4beb75

dvush approved these changes Jan 26, 2026

View reviewed changes

ZanCorDX merged commit 924b662 into develop Jan 26, 2026
7 checks passed

ZanCorDX deleted the dx/better-shut-down branch January 26, 2026 14:03

		warn!("Graceful shutdown timeout reached, signaling abort");
		warn!(

	remaining_inner_jobs = non_critical_tasks.len(),
	remaining_non_critical_tasks = non_critical_tasks.len(),

Better shutdown. #867

Better shutdown. #867

Uh oh!

Conversation

ZanCorDX commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📝 Summary

✅ I have completed the following steps:

Uh oh!

claude bot commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 PR Review: Better Shutdown

Summary

✅ Strengths

⚠️ Minor Issues Identified (see inline comments)

📋 Checklist Verification

🎯 Overall Assessment

Uh oh!

claude bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 PR Review: Better Shutdown (Re-review after sync)

Summary

⚠️ Issue Found: Duplicate Warning Message

⚠️ Minor Issue from Previous Review

✅ Overall Assessment

Uh oh!

claude bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 PR Review: Better Shutdown (Re-review after sync)

Summary

Previous Review Status

⚠️ Minor Issues (see inline comments)

✅ Overall Assessment

Uh oh!

claude bot Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude bot commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 PR Review: Better Shutdown (Re-review after latest sync)

Summary

✅ Previous Issues Status

⚠️ Minor Suggestion (Optional)

✅ Architecture Review

📋 Final Assessment

Uh oh!

claude bot commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 PR Review: Better Shutdown (Final Re-review)

Summary

✅ Previous Issues Status

📝 Latest Commit Analysis (ffed962)

⚠️ Minor Remaining Issue

✅ Architecture Review

📋 Final Assessment

Uh oh!

claude bot commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 PR Review: Better Shutdown (Final Review)

Summary

✅ All Previous Issues Resolved

⚠️ One Minor Remaining Nit (Optional)

✅ Architecture Review

📋 Final Assessment

Uh oh!

ZanCorDX commented Jan 22, 2026 •

edited

Loading

claude bot commented Jan 22, 2026 •

edited

Loading

claude bot commented Jan 22, 2026 •

edited

Loading

claude bot commented Jan 23, 2026 •

edited

Loading

claude bot commented Jan 23, 2026 •

edited

Loading

claude bot commented Jan 23, 2026 •

edited

Loading

📝 Latest Commit Analysis (`ffed962`)

claude bot commented Jan 23, 2026 •

edited

Loading

claude bot commented Jan 23, 2026 •

edited

Loading