Skip to content

Conversation

@ZanCorDX
Copy link
Collaborator

@ZanCorDX ZanCorDX commented Jan 22, 2026

📝 Summary

Shutdown is now performed in 2 steps:
1- Cancellation: We propagate a cancel signal so we stop building, data stops flowing and channels close.
We expect all critical threads to finish.
2- Abort: If some critical thread is still alive we trigger a second signal so critical tasks stop waiting for data (assuming the source is hanged) and perform any necessary clean up.
Sleeps were replaced by JoinHandles.

✅ I have completed the following steps:

  • Run make lint
  • Run make test
  • Added tests (if applicable)

@claude
Copy link
Contributor

claude bot commented Jan 22, 2026

Claude finished @ZanCorDX's task in 3m 28s —— View job


🔍 PR Review: Better Shutdown

This PR improves the shutdown mechanism by replacing sleep-based waiting with JoinHandle-based task tracking. This is a significant improvement for graceful shutdown handling.

Summary

The PR introduces a two-phase shutdown mechanism:

  1. Graceful phase: Wait for all tasks (inner jobs + critical tasks) with a timeout (GRACEFUL_SHUTDOWN_TIMEOUT = 12s)
  2. Critical phase: If timeout reached, signal abort and wait only for critical tasks with a second timeout (CRITICAL_SHUTDOWN_TIMEOUT = 10s)

This replaces the previous approach that used arbitrary sleeps to wait for tasks to complete.


✅ Strengths

  1. Proper task tracking: Using JoinHandles instead of sleeps ensures tasks actually complete rather than just hoping they finish within a sleep duration

  2. Two-phase shutdown: The separation between graceful and critical shutdown phases is a well-designed pattern for handling different priorities during shutdown

  3. Abort token: The new global_abort token allows critical tasks (like clickhouse) to know when upstream is hung and perform cleanup immediately

  4. FuturesUnordered: Efficient concurrent waiting on multiple handles using biased select for deterministic behavior

  5. Clear timeout constants: Well-documented constants in process_killer.rs explaining each timeout purpose

  6. Clean JoinHandle wrapping: The pattern in clickhouse/mod.rs of wrapping both inserter and backup handles in a single task is elegant


⚠️ Minor Issues Identified (see inline comments)

  1. Unused constant: RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS is now unused after removing RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME

  2. Log message clarity: The log at line 605 says "No critical tasks remaining" but logs non_critical_task_remaining, which is slightly confusing

  3. Timeout logging: Consider logging the count of remaining tasks when graceful timeout is reached to aid debugging


📋 Checklist Verification

Item Status
Code compiles ✅ Appears correct
No security issues ✅ No concerns
Error handling ✅ Properly handles JoinHandle errors
Documentation ✅ Good doc comments on new structs/functions
Tests ⚠️ PR notes "Added tests (if applicable)" is unchecked - consider adding tests for the shutdown logic

🎯 Overall Assessment

Approve with minor suggestions. This is a well-designed improvement to the shutdown mechanism. The two-phase approach with proper task tracking is much more robust than the previous sleep-based approach. The inline comments contain minor suggestions that can be addressed in a follow-up if desired.

Duration::from_secs(RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS);
/// v3 might need to keep the server running for a while until the slot ends.
pub const OPTIMISTIC_V3_CLOSE_TIME_SECONDS: u64 = SLOT_DURATION_SECS;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition of the OPTIMISTIC_V3_CLOSE_TIME_SECONDS constant. However, it appears that RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS (defined at line 14) is now unused since RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME was removed. Consider cleaning up that unused constant to avoid dead code.

tokio::time::sleep(RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME).await;
task_manager.graceful_shutdown_with_timeout(Duration::from_secs(5));
abort_token.cancelled().await;
task_manager.graceful_shutdown();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good design pattern: This task forwards the abort signal to the task_manager, which is cleaner than the previous approach that used a hardcoded sleep. However, ensure that task_manager.graceful_shutdown() properly signals all child tasks to complete their cleanup work before the clickhouse_shutdown_handle completes.

let backup_handle = backup.spawn(task_executor, backup_table_name, tracing_target);

// Spawn a wrapper task that waits for both to complete
tokio::spawn(async move {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good pattern: Wrapping both handles in a single task that waits for both to complete is a clean solution. This ensures the returned JoinHandle only completes when both the inserter and backup tasks have finished their cleanup.

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
@claude
Copy link
Contributor

claude bot commented Jan 22, 2026

Claude finished @ZanCorDX's task in 1m 41s —— View job


🔍 PR Review: Better Shutdown (Re-review after sync)

Summary

Reviewed the latest sync which added logging of remaining tasks when graceful shutdown timeout is reached. Found one issue that should be addressed.


⚠️ Issue Found: Duplicate Warning Message

The latest commit introduced a duplicate warning message at crates/rbuilder/src/live_builder/mod.rs:577-582. The first warn! at line 577 is now redundant with the more informative version below it. See inline comment for suggested fix.


⚠️ Minor Issue from Previous Review

Log message at line 610: Says "No critical tasks remaining" but logs non_critical_task_remaining = inner_jobs.len(). The field name is slightly confusing since inner_jobs are the non-critical tasks that may still be running.


✅ Overall Assessment

The PR is in good shape. The two-phase shutdown mechanism with proper task tracking is a significant improvement over the previous sleep-based approach. Just needs the duplicate warning removed.

}
}
_ = tokio::time::sleep_until(graceful_deadline) => {
warn!("Graceful shutdown timeout reached, signaling abort");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line appears to be redundant now that the more informative warning with task counts was added below. Consider removing it:

Suggested change
warn!("Graceful shutdown timeout reached, signaling abort");

@claude
Copy link
Contributor

claude bot commented Jan 23, 2026

Claude finished @ZanCorDX's task in 1m 19s —— View job


🔍 PR Review: Better Shutdown (Re-review after sync)

Summary

Reviewed the latest sync (commits 4211920 - lint, 845b32b - V3 wait handle). The PR is in good shape overall with the two-phase shutdown mechanism being a significant improvement over sleep-based waiting.


Previous Review Status

Issue Status
Unused RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS constant ✅ Fixed (removed from process_killer.rs)
Duplicate warning message at lines 577-578 ⚠️ Still present (see inline comment)
Log message clarity at line 610 ⚠️ Minor issue remains (see inline comment)
Logging remaining tasks on timeout ✅ Added in commit 38bcc3e

⚠️ Minor Issues (see inline comments)

  1. Duplicate warning at mod.rs:577-578: The first warn! is now redundant since the second one includes task counts. Simply remove line 577.

  2. Field naming at mod.rs:610: The field non_critical_task_remaining could be renamed to remaining_inner_jobs for consistency with the warning message above.


✅ Overall Assessment

Approve with minor nits. The core shutdown mechanism is well-designed:

  • ✅ Proper task tracking with JoinHandles
  • ✅ Two-phase shutdown (graceful → critical)
  • FuturesUnordered for efficient concurrent waiting
  • ✅ Clean timeout constants in process_killer.rs
  • ✅ Good pattern for combining handles in clickhouse/mod.rs:105-107

The inline comments are minor suggestions that can be addressed in a follow-up if desired.

Comment on lines 577 to 578
warn!("Graceful shutdown timeout reached, signaling abort");
warn!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is now redundant with the more informative warning that includes task counts on the next line. Consider removing the duplicate:

Suggested change
warn!("Graceful shutdown timeout reached, signaling abort");
warn!(
warn!(

@claude
Copy link
Contributor

claude bot commented Jan 23, 2026

Claude finished @ZanCorDX's task in 1m 23s —— View job


🔍 PR Review: Better Shutdown (Re-review after latest sync)

Summary

Reviewed the latest state after commit 92fb077 (removed warning). The PR is in excellent shape with the two-phase shutdown mechanism being a significant improvement over the previous sleep-based approach.


✅ Previous Issues Status

Issue Status
Unused RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS constant ✅ Fixed
Duplicate warning message ✅ Fixed in commit 92fb077
Logging remaining tasks on timeout ✅ Added

⚠️ Minor Suggestion (Optional)

Field naming at mod.rs:609: The field non_critical_task_remaining logs inner_jobs.len(). Consider renaming to remaining_inner_jobs for consistency with the warning message at line 578-580, though this is a very minor nit.


✅ Architecture Review

The two-phase shutdown mechanism is well-designed:

  1. Phase 1 (Graceful): Wait for all tasks (inner jobs + critical tasks) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
  2. Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Strengths:

  • ✅ Proper task tracking with JoinHandles instead of sleeps
  • FuturesUnordered with biased select for efficient concurrent waiting
  • ✅ Clean separation of critical vs non-critical tasks
  • ✅ Clear timeout constants in process_killer.rs with good documentation
  • ✅ Good pattern in clickhouse/mod.rs:105-107 wrapping both inserter and backup handles
  • ✅ Optimistic V3 server properly waits OPTIMISTIC_V3_CLOSE_TIME_SECONDS before shutdown to avoid relay demotion

📋 Final Assessment

Approve

The PR is ready for merge. All previously identified issues have been addressed. The shutdown mechanism is well-designed and properly handles graceful degradation when tasks don't complete in time.

@claude
Copy link
Contributor

claude bot commented Jan 23, 2026

Claude finished @ZanCorDX's task in 1m 15s —— View job


🔍 PR Review: Better Shutdown (Final Re-review)

Summary

Reviewed the latest state after commit ffed962 (comment, renames). The PR is in excellent shape. All previously identified issues have been addressed, and the latest commit improves code clarity with better variable naming.


✅ Previous Issues Status

Issue Status
Unused RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS constant ✅ Fixed
Duplicate warning message ✅ Fixed in commit 92fb077
Logging remaining tasks on timeout ✅ Added
Variable naming (inner_jobsnon_critical_tasks) ✅ Fixed in commit ffed962

📝 Latest Commit Analysis (ffed962)

The latest commit improves code clarity:

  • Renamed inner_jobs_handlesnon_critical_tasks_join_handles
  • Renamed inner_jobsnon_critical_tasks
  • Updated documentation comment to reference "non-critical and critical tasks"

This makes the two-phase shutdown logic much clearer to understand.


⚠️ Minor Remaining Issue

Log field name at mod.rs:609: The field non_critical_task_remaining is slightly awkward given the variable was renamed to non_critical_tasks. Consider either:

  • Renaming to remaining_non_critical_tasks for consistency, OR
  • This is minor enough to leave as-is
// Current:
non_critical_task_remaining = non_critical_tasks.len(),

// Suggested (optional):
remaining_non_critical_tasks = non_critical_tasks.len(),

✅ Architecture Review

The two-phase shutdown mechanism is well-designed:

Phase 1 (Graceful): Wait for all tasks (non-critical + critical) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Strengths:

  • ✅ Proper task tracking with JoinHandles instead of sleeps
  • FuturesUnordered with biased select for efficient concurrent waiting
  • ✅ Clean separation of critical vs non-critical tasks
  • ✅ Clear timeout constants in process_killer.rs with good documentation
  • ✅ Good pattern in clickhouse/mod.rs:105-107 wrapping both inserter and backup handles
  • ✅ Optimistic V3 server properly waits OPTIMISTIC_V3_CLOSE_TIME_SECONDS before shutdown to avoid relay demotion
  • ✅ Abort token forwarding in clickhouse.rs:249-253 cleanly handles upstream hang scenarios

📋 Final Assessment

Approve

The PR is ready for merge. This is a significant improvement to the shutdown mechanism:

  1. Replaces arbitrary sleeps with proper task tracking - Tasks are now awaited directly via JoinHandles rather than hoping they complete within a sleep duration

  2. Two-phase shutdown ensures data integrity - Critical tasks (like clickhouse backup) get additional time to complete even if non-critical tasks hang

  3. Clean code organization - Timeout constants are centralized in process_killer.rs with clear documentation of each timeout's purpose

  4. Proper abort signaling - The global_abort token allows tasks to know when upstream is hung and perform immediate cleanup

The one minor nit about field naming is truly optional and doesn't affect functionality.

@claude
Copy link
Contributor

claude bot commented Jan 23, 2026

Claude finished @ZanCorDX's task in 1m 23s —— View job


🔍 PR Review: Better Shutdown (Final Review)

Summary

Reviewed the latest state after commit 72f789b (rename). The PR is in excellent shape. All previously identified issues have been addressed, and the latest commit fixes the field naming inconsistency at line 609.


✅ All Previous Issues Resolved

Issue Status
Unused RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS constant ✅ Fixed
Duplicate warning message ✅ Fixed in commit 92fb077
Logging remaining tasks on timeout ✅ Added
Variable naming (inner_jobsnon_critical_tasks) ✅ Fixed in commit ffed962
Field naming at line 609 (non_critical_task_remainingremaining_non_critical_tasks) ✅ Fixed in commit 72f789b

⚠️ One Minor Remaining Nit (Optional)

There's one small inconsistency remaining: at line 578, the field is named remaining_inner_jobs while the variable is non_critical_tasks. For full consistency with line 609, this could be remaining_non_critical_tasks. I've added an inline comment with a suggestion.


✅ Architecture Review

The two-phase shutdown mechanism is well-designed:

Phase 1 (Graceful): Wait for all tasks (non-critical + critical) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Strengths:

  • ✅ Proper task tracking with JoinHandles instead of sleeps
  • FuturesUnordered with biased select for efficient concurrent waiting
  • ✅ Clean separation of critical vs non-critical tasks
  • ✅ Clear timeout constants in process_killer.rs with good documentation
  • ✅ Good pattern in clickhouse/mod.rs:105-107 wrapping both inserter and backup handles in a single task
  • ✅ Optimistic V3 server properly waits OPTIMISTIC_V3_CLOSE_TIME_SECONDS before shutdown to avoid relay demotion
  • ✅ Abort token forwarding in clickhouse.rs:249-253 cleanly handles upstream hang scenarios

📋 Final Assessment

Approve

The PR is ready for merge. This is a significant improvement to the shutdown mechanism:

  1. Replaces arbitrary sleeps with proper task tracking - Tasks are now awaited directly via JoinHandles rather than hoping they complete within a sleep duration

  2. Two-phase shutdown ensures data integrity - Critical tasks (like clickhouse backup) get additional time to complete even if non-critical tasks hang

  3. Clean code organization - Timeout constants are centralized in process_killer.rs with clear documentation of each timeout's purpose

  4. Proper abort signaling - The global_abort token allows tasks to know when upstream is hung and perform immediate cleanup

The one remaining nit about field naming at line 578 is truly optional and doesn't affect functionality.

}
_ = tokio::time::sleep_until(graceful_deadline) => {
warn!(
remaining_inner_jobs = non_critical_tasks.len(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: For consistency with the renamed variable (non_critical_tasks) and the field name at line 609 (remaining_non_critical_tasks), consider using the same naming pattern here:

Suggested change
remaining_inner_jobs = non_critical_tasks.len(),
remaining_non_critical_tasks = non_critical_tasks.len(),

@ZanCorDX ZanCorDX marked this pull request as ready for review January 23, 2026 16:06
@ZanCorDX ZanCorDX requested a review from dvush as a code owner January 23, 2026 16:06
@claude
Copy link
Contributor

claude bot commented Jan 23, 2026

Claude finished @ZanCorDX's task in 1m 47s —— View job


🔍 PR Review: Better Shutdown

Progress

  • Read CLAUDE.md for repository guidelines
  • Analyze the PR changes
  • Review modified files in detail
  • Check for any remaining issues from previous reviews
  • Provide final assessment

✅ All Previous Issues Resolved

Issue Status
Unused RUN_SUBMIT_TO_RELAYS_JOB_CANCEL_TIME_SECONDS constant ✅ Fixed
Duplicate warning message ✅ Fixed in commit 92fb077
Logging remaining tasks on timeout ✅ Added
Variable naming (inner_jobsnon_critical_tasks) ✅ Fixed in commit ffed962
Field naming at line 578 ✅ Fixed in commit 72f789b
Field naming at line 609 ✅ Fixed in commit 72f789b

✅ Architecture Review

The two-phase shutdown mechanism is well-designed and significantly improves upon the previous sleep-based approach:

Phase 1 (Graceful): Wait for all tasks (non-critical + critical) with GRACEFUL_SHUTDOWN_TIMEOUT (12s)
Phase 2 (Critical): If timeout reached, signal global_abort and wait only for critical tasks with CRITICAL_SHUTDOWN_TIMEOUT (10s)

Key Strengths:

  1. Proper task tracking with JoinHandles (mod.rs:552-554): Tasks are now awaited directly rather than hoping they complete within a sleep duration

  2. FuturesUnordered with biased select (mod.rs:564-586): Efficient concurrent waiting with deterministic behavior

  3. Clean separation of task types: Critical tasks (like clickhouse) get additional time even if non-critical tasks hang

  4. Centralized timeout constants (process_killer.rs:14-37): Well-documented constants explain each timeout's purpose

  5. Good handle wrapping pattern (clickhouse/mod.rs:105-107): Wrapping both inserter and backup handles in a single task ensures both complete before shutdown

  6. Abort token forwarding (clickhouse.rs:249-253): Cleanly handles upstream hang scenarios by forwarding abort to task_manager

  7. V3 server critical task (flashbots_config.rs:211-213, config.rs:564-566): Optimistic V3 server properly waits to avoid relay demotion


📋 Final Assessment

Approve

The PR is ready for merge. This is a significant improvement to the shutdown mechanism:

  1. Replaces arbitrary sleeps with proper task tracking - Tasks are now awaited directly via JoinHandles
  2. Two-phase shutdown ensures data integrity - Critical tasks get additional time to complete
  3. Clean code organization - Timeout constants are centralized with clear documentation
  4. Proper abort signaling - The global_abort token allows tasks to perform immediate cleanup when upstream is hung

All previously identified issues have been addressed. The code is well-structured and the shutdown logic is clear and robust.


@ZanCorDX ZanCorDX merged commit 924b662 into develop Jan 26, 2026
7 checks passed
@ZanCorDX ZanCorDX deleted the dx/better-shut-down branch January 26, 2026 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants