-
Notifications
You must be signed in to change notification settings - Fork 373
snapshots: fix pipeline state machine edge cases, randomly generate error control signals in testing #7570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| if( FD_UNLIKELY( ctx->is_zstd && ctx->dirty ) ) { | ||
| FD_LOG_WARNING(( "encountered end-of-file in the middle of a compressed frame" )); | ||
| ctx->state = FD_SNAPSHOT_STATE_ERROR; | ||
| fd_stem_publish( stem, 0UL, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL ); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check could never trigger before (we always reset dirty to 0 above), but also returning here deadlocks the pipeline
| FD_TEST( ctx->state==FD_SNAPSHOT_STATE_PROCESSING || | ||
| ctx->state==FD_SNAPSHOT_STATE_ERROR ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also possible for tiles to receive the FAIL control message in the IDLE state. This is the cause of several of the FD_TEST assertions we've seen.
This happens when snapct sends out a DONE, and an early tile immediately handles it and goes to IDLE. But a later tile may fail and generate ERROR, which causes snapct to send out a FAIL control message which is looped back through the pipeline
| transition_malformed( fd_snapin_tile_t * ctx, | ||
| fd_stem_context_t * stem ) { | ||
| if( FD_UNLIKELY( ctx->state==FD_SNAPSHOT_STATE_ERROR ) ) return; | ||
| ctx->state = FD_SNAPSHOT_STATE_ERROR; | ||
| fd_stem_publish( stem, ctx->out_ct_idx, FD_SNAPSHOT_MSG_CTRL_ERROR, 0UL, 0UL, 0UL, 0UL, 0UL ); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bug, but no need to generate extra ERROR messages when we are already in that state
| static void | ||
| after_credit( fd_snapls_tile_t * ctx, | ||
| fd_stem_context_t * stem, | ||
| int * opt_poll_in FD_PARAM_UNUSED, | ||
| int * charge_busy FD_PARAM_UNUSED ) { | ||
| if( FD_UNLIKELY( ctx->hash_accum.received_lthashes==ctx->num_hash_tiles && ctx->hash_accum.awaiting_ack ) ) { | ||
| fd_lthash_sub( &ctx->hash_accum.calculated_lthash, &ctx->running_lthash ); | ||
| if( FD_UNLIKELY( memcmp( &ctx->hash_accum.expected_lthash, &ctx->hash_accum.calculated_lthash, sizeof(fd_lthash_value_t) ) ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be asynchronous, because receiving all the NEXT/DONE "acks" from the snapla's implies that we must have already seen the HASH_RESULT messages from each of them. So as soon as we get the last ack we can check the lthash for correctness.
d435810 to
9ac6b51
Compare
Performance Measurements ⏳
|
d9a07ab to
61e85bc
Compare
Performance Measurements ⏳
|
61e85bc to
4536805
Compare
Performance Measurements ⏳
|
| error control messages from tiles in the snapshots pipeline in appropriate | ||
| states. */ | ||
|
|
||
| #define FD_SNAPSHOT_TEST_ERROR_NANOS (0L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this test logic should be done in a proper unit test or maybe as a sub-command of snapshot-load.
| fd_stem_context_t * stem, | ||
| ulong chunk, | ||
| ulong sz ) { | ||
| if( FD_UNLIKELY( fd_ssctrl_test_maybe_error( &ctx->state, stem, 0UL ) ) ) return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks confusing without additional context. Apparently nothing happens in this function when the define FD_SNAPSHOT_TEST_ERROR_NANOS is not set.
If you truly want something like this I think it would be better to guard this call with a defined macro
| if( FD_UNLIKELY( ctx->state!=FD_SNAPSHOT_STATE_FINISHING ) ) { | ||
| transition_malformed( ctx, stem ); | ||
| return; | ||
| break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the break here instead of return? Do we still need to forward the control message if we are generating an error?
No description provided.