snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing #7570

amass-jump · 2025-12-09T17:27:27Z

No description provided.

amass-jump · 2025-12-09T17:29:19Z

src/discof/restore/fd_snapdc_tile.c

-      if( FD_UNLIKELY( ctx->is_zstd && ctx->dirty ) ) {
-        FD_LOG_WARNING(( "encountered end-of-file in the middle of a compressed frame" ));
-        ctx->state = FD_SNAPSHOT_STATE_ERROR;
-        fd_stem_publish( stem, 0UL, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL );
-        return;
-      }


This check could never trigger before, because we always reset `dirty` to 0 above. Returning here would also deadlock the pipeline.

amass-jump · 2025-12-09T17:31:44Z

src/discof/restore/fd_snapdc_tile.c

-      FD_TEST( ctx->state==FD_SNAPSHOT_STATE_PROCESSING ||
-               ctx->state==FD_SNAPSHOT_STATE_ERROR );


It's also possible for tiles to receive the FAIL control message in the IDLE state. This is the cause of several of the FD_TEST assertion failures we've seen.

This happens when snapct sends out a DONE and an early tile immediately handles it and goes to IDLE. A later tile may then fail and generate ERROR, which causes snapct to send out a FAIL control message that is looped back through the pipeline.
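A minimal sketch of the race described above, using hypothetical stand-in names (`sketch_state`, `sketch_fail_ok_old`, `sketch_fail_ok_new`) rather than the real definitions in `fd_ssctrl.h`. It only models which states are accepted when FAIL arrives. Whether the PR widens the assertion or drops it entirely is not visible beyond the removed lines shown here.

```c
/* Hypothetical stand-ins for the FD_SNAPSHOT_STATE_* constants.  The
   real values are not shown in this thread. */
enum sketch_state {
  SKETCH_STATE_IDLE,
  SKETCH_STATE_PROCESSING,
  SKETCH_STATE_ERROR
};

/* The removed assertion: FAIL was only expected while PROCESSING or
   after this tile had already entered ERROR. */
static int
sketch_fail_ok_old( enum sketch_state s ) {
  return s==SKETCH_STATE_PROCESSING || s==SKETCH_STATE_ERROR;
}

/* Relaxed rule (an assumption): an early tile may already have handled
   DONE and returned to IDLE before a later tile's ERROR makes snapct
   send FAIL, so IDLE must be accepted as well. */
static int
sketch_fail_ok_new( enum sketch_state s ) {
  return s==SKETCH_STATE_IDLE || sketch_fail_ok_old( s );
}
```

With these definitions, an early tile sitting in IDLE would trip the old check but pass the relaxed one.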

amass-jump · 2025-12-09T17:32:53Z

src/discof/restore/fd_snapin_tile.c

 transition_malformed( fd_snapin_tile_t *  ctx,
                      fd_stem_context_t * stem ) {
+  if( FD_UNLIKELY( ctx->state==FD_SNAPSHOT_STATE_ERROR ) ) return;
  ctx->state = FD_SNAPSHOT_STATE_ERROR;
  fd_stem_publish( stem, ctx->out_ct_idx, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL );
 }


Not a bug, but there is no need to generate extra ERROR messages when we are already in that state.
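The effect of the early return can be sketched in isolation. The names below (`sketch_tile_t`, `sketch_transition_malformed`, the counter) are hypothetical, and the counter stands in for the `fd_stem_publish` call so the example runs without the real stem.

```c
/* Hypothetical stand-ins for the snapshot state constants. */
enum { SKETCH_STATE_PROCESSING, SKETCH_STATE_ERROR };

typedef struct {
  int           state;
  unsigned long error_msgs_published; /* stands in for fd_stem_publish */
} sketch_tile_t;

/* Idempotent error transition: the first call flips the state and
   publishes one ERROR message, later calls are no-ops. */
static void
sketch_transition_malformed( sketch_tile_t * ctx ) {
  if( ctx->state==SKETCH_STATE_ERROR ) return; /* already reported */
  ctx->state = SKETCH_STATE_ERROR;
  ctx->error_msgs_published++;
}
```

Calling the function repeatedly on the same tile leaves exactly one ERROR message published.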

amass-jump · 2025-12-09T17:37:32Z

src/discof/restore/fd_snapls_tile.c

-static void
-after_credit( fd_snapls_tile_t *  ctx,
-              fd_stem_context_t *  stem,
-              int *                opt_poll_in FD_PARAM_UNUSED,
-              int *                charge_busy FD_PARAM_UNUSED ) {
-  if( FD_UNLIKELY( ctx->hash_accum.received_lthashes==ctx->num_hash_tiles && ctx->hash_accum.awaiting_ack ) ) {
-    fd_lthash_sub( &ctx->hash_accum.calculated_lthash, &ctx->running_lthash );
-    if( FD_UNLIKELY( memcmp( &ctx->hash_accum.expected_lthash, &ctx->hash_accum.calculated_lthash, sizeof(fd_lthash_value_t) ) ) ) {


This doesn't need to be asynchronous: receiving all the NEXT/DONE "acks" from the snapla tiles implies that we must have already seen the HASH_RESULT message from each of them. So as soon as we get the last ack, we can check the lthash for correctness.
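A rough sketch of checking on the last ack instead of polling in `after_credit`. Everything here is a simplified assumption: the hash is four words with wrapping addition instead of a real `fd_lthash_value_t`, the subtraction of `running_lthash` is left out, and the names (`sketch_accum_t`, `sketch_on_hash_result`, `sketch_on_ack`) are hypothetical.

```c
#include <string.h>

#define SKETCH_HASH_WORDS 4 /* the real lthash value is much larger */

typedef struct {
  unsigned long num_hash_tiles;
  unsigned long received_lthashes;
  unsigned long received_acks;
  unsigned long expected  [ SKETCH_HASH_WORDS ];
  unsigned long calculated[ SKETCH_HASH_WORDS ];
} sketch_accum_t;

/* HASH_RESULT from one hashing tile: fold it into the accumulated sum
   with element-wise wrapping addition. */
static void
sketch_on_hash_result( sketch_accum_t * a, unsigned long const * h ) {
  for( int i=0; i<SKETCH_HASH_WORDS; i++ ) a->calculated[ i ] += h[ i ];
  a->received_lthashes++;
}

/* NEXT/DONE ack from one hashing tile.  Returns -1 while acks are still
   outstanding, 1 if the accumulated hash matches, 0 on mismatch.  Per
   the comment above, the last ack implies every HASH_RESULT has already
   arrived, so the comparison can run right here. */
static int
sketch_on_ack( sketch_accum_t * a ) {
  a->received_acks++;
  if( a->received_acks<a->num_hash_tiles ) return -1;
  if( a->received_lthashes!=a->num_hash_tiles ) return 0; /* invariant broken */
  return !memcmp( a->expected, a->calculated, sizeof(a->expected) );
}
```

The caller would transition to the error state on a 0 return and continue normally on 1, without needing an `awaiting_ack` flag.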

github-actions · 2025-12-10T10:29:49Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.073258 s` | `0.074401 s` | `1.560%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `3.141 s` | `3.155 s` | `0.446%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `73.258409 s` | `74.401362 s` | `1.560%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `1023.23 GiB` | `1023.23 GiB` | `0.000%` ✅ |

github-actions · 2025-12-10T11:05:18Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.073616 s` | `0.073573 s` | `-0.058%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `3.151 s` | `3.209 s` | `1.841%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `73.615818 s` | `73.573048 s` | `-0.058%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `1023.23 GiB` | `1023.23 GiB` | `0.000%` ✅ |

github-actions · 2025-12-17T13:52:20Z

Performance Measurements ⏳

| Suite | Baseline | New | Change |
| --- | --- | --- | --- |
| backtest `mainnet-368528500-perf` per slot | `0.050826 s` | `0.050693 s` | `-0.262%` ✅ |
| backtest `mainnet-368528500-perf` snapshot load | `1.67 s` | `1.659 s` | `-0.659%` ✅ |
| backtest `mainnet-368528500-perf` total elapsed | `50.826305 s` | `50.692915 s` | `-0.262%` ✅ |
| `firedancer mem` usage with `mainnet.toml` | `1005.23 GiB` | `1005.23 GiB` | `0.000%` ✅ |

cali-jumptrading · 2025-12-19T20:08:16Z

src/discof/restore/utils/fd_ssctrl.h

+   error control messages from tiles in the snapshots pipeline in appropriate
+   states. */
+
+#define FD_SNAPSHOT_TEST_ERROR_NANOS (0L)


I think this test logic should live in a proper unit test, or maybe in a sub-command of snapshot-load.

cali-jumptrading · 2025-12-19T20:48:46Z

src/discof/restore/fd_snapdc_tile.c

                  fd_stem_context_t * stem,
                  ulong               chunk,
                  ulong               sz ) {
+  if( FD_UNLIKELY( fd_ssctrl_test_maybe_error( &ctx->state, stem, 0UL ) ) ) return 0;


This looks confusing without additional context. Apparently this call does nothing when the define FD_SNAPSHOT_TEST_ERROR_NANOS is not set.

If you truly want something like this, I think it would be better to guard this call with a defined macro.
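One way the suggested guard could look, as a sketch only. `SKETCH_TEST_ERROR_NANOS`, `sketch_test_maybe_error`, and `sketch_handle_frag` are hypothetical stand-ins for `FD_SNAPSHOT_TEST_ERROR_NANOS`, `fd_ssctrl_test_maybe_error`, and the tile's fragment handler. The real helper's behavior is not shown in this thread.

```c
/* Hypothetical stand-in for FD_SNAPSHOT_TEST_ERROR_NANOS.  Zero means
   error injection is compiled out. */
#ifndef SKETCH_TEST_ERROR_NANOS
#define SKETCH_TEST_ERROR_NANOS (0L)
#endif

enum { SKETCH_STATE_PROCESSING, SKETCH_STATE_ERROR };

#if SKETCH_TEST_ERROR_NANOS
/* Test-only hook: force the tile into ERROR.  A real hook would decide
   randomly based on elapsed time; this one always fires. */
static int
sketch_test_maybe_error( int * state ) {
  *state = SKETCH_STATE_ERROR;
  return 1;
}
#endif

/* Returns 1 if the fragment was processed, 0 if an injected error
   short-circuited it. */
static int
sketch_handle_frag( int * state ) {
#if SKETCH_TEST_ERROR_NANOS
  /* The guard makes it obvious at the call site that this is test-only. */
  if( sketch_test_maybe_error( state ) ) return 0;
#endif
  (void)state; /* normal processing would go here */
  return 1;
}
```

With the guard at the call site, a reader of the tile code can see that the hook is absent in a normal build without looking up the helper's definition.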

cali-jumptrading · 2025-12-19T20:52:24Z

src/discof/restore/fd_snapin_tile.c

      if( FD_UNLIKELY( ctx->state!=FD_SNAPSHOT_STATE_FINISHING ) ) {
        transition_malformed( ctx, stem );
-        return;
+        break;


Why the break here instead of return? Do we still need to forward the control message if we are generating an error?
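The control-flow difference behind this question can be sketched as follows. This assumes, without confirmation from the diff shown, that the handler forwards the control message downstream after the `switch`. All names (`sketch_tile_t`, `sketch_ctrl_with_return`, `sketch_ctrl_with_break`, `SKETCH_MSG_CTRL`) are hypothetical.

```c
enum { SKETCH_STATE_IDLE, SKETCH_STATE_FINISHING, SKETCH_STATE_ERROR };
enum { SKETCH_MSG_CTRL }; /* some control message expected in FINISHING */

typedef struct {
  int state;
  int errors;    /* ERROR messages published */
  int forwarded; /* control messages passed downstream */
} sketch_tile_t;

static void
sketch_malformed( sketch_tile_t * ctx ) {
  if( ctx->state==SKETCH_STATE_ERROR ) return;
  ctx->state = SKETCH_STATE_ERROR;
  ctx->errors++;
}

/* Old shape: return leaves the handler, so the message is not forwarded. */
static void
sketch_ctrl_with_return( sketch_tile_t * ctx, int sig ) {
  switch( sig ) {
    case SKETCH_MSG_CTRL:
      if( ctx->state!=SKETCH_STATE_FINISHING ) {
        sketch_malformed( ctx );
        return;
      }
      ctx->state = SKETCH_STATE_IDLE;
      break;
    default:
      break;
  }
  ctx->forwarded++; /* assumed forward step after the switch */
}

/* New shape: break only leaves the switch, so the forward still runs. */
static void
sketch_ctrl_with_break( sketch_tile_t * ctx, int sig ) {
  switch( sig ) {
    case SKETCH_MSG_CTRL:
      if( ctx->state!=SKETCH_STATE_FINISHING ) {
        sketch_malformed( ctx );
        break;
      }
      ctx->state = SKETCH_STATE_IDLE;
      break;
    default:
      break;
  }
  ctx->forwarded++; /* assumed forward step after the switch */
}
```

Both variants publish one ERROR. Only the `break` variant also passes the original control message on to downstream tiles.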

amass-jump commented Dec 9, 2025

amass-jump force-pushed the amass/snap-error branch from d435810 to 9ac6b51 Compare December 10, 2025 07:32

amass-jump changed the title ~~snapshots: fix pipeline state machine edge cases~~ [DO NOT MERGE] snapshots: fix pipeline state machine edge cases Dec 10, 2025

amass-jump requested review from cali-jumptrading and ripatel-fd December 10, 2025 10:24

amass-jump self-assigned this Dec 10, 2025

amass-jump marked this pull request as ready for review December 10, 2025 10:25

amass-jump force-pushed the amass/snap-error branch from d9a07ab to 61e85bc Compare December 10, 2025 11:00

amass-jump changed the title ~~[DO NOT MERGE] snapshots: fix pipeline state machine edge cases~~ [DO NOT MERGE] snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing Dec 10, 2025

amass-jump changed the title ~~[DO NOT MERGE] snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing~~ snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing Dec 12, 2025

amass-jump added 2 commits December 17, 2025 13:43

snapshots: fix pipeline state machine edge cases

afccbc5

snapshots: randomly generate error control signals in testing

4536805

amass-jump force-pushed the amass/snap-error branch from 61e85bc to 4536805 Compare December 17, 2025 13:44

cali-jumptrading reviewed Dec 19, 2025
