Fix Reassure CI flakiness caused by noisy sub-10ms baselines#745
Fix Reassure CI flakiness caused by noisy sub-10ms baselines#745abzokhattab wants to merge 1 commit intoExpensify:mainfrom
Conversation
9fd08b2 to
53e3ca2
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 53e3ca2a7e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const effectiveAbsoluteThreshold = baselineDuration < MIN_BASELINE_FOR_DEFAULT_THRESHOLD_MS ? 100 : allowedDurationDeviation; | ||
|
|
||
| const isMeasurementRelevant = Math.abs(durationDeviation) > effectiveAbsoluteThreshold; |
There was a problem hiding this comment.
Keep sub-10ms checks from bypassing real regressions
Using a hardcoded 100 ms absolute threshold for every benchmark with baseline.meanDuration < 10 causes large slowdowns to be ignored before the relative check runs. In this path, a measurement can regress from single-digit milliseconds to tens of milliseconds (for example 8ms→80ms, +900%) and still be treated as not relevant, so the performance gate no longer protects many of the fastest code paths that recently moved under 10ms.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Yeahh @abzokhattab I dont think we should do 3. Adaptive absolute threshold for sub-10ms baselines (validateReassureOutput.ts) for now. Changes 1 and 2 make sense to me however.
There was a problem hiding this comment.
@fabioh8010 i just reverted it but looking at some of the failed pipleines i see that some them were hitting more %1000 so i think that still could occur after merging ... what do you think?
| Run ID | Test | Deviation | Baseline | URL |
|---|---|---|---|---|
| 22388332008 | doAllCollectionItemsBelongToSameParent |
22.44ms (1838%) | ~1ms | https://github.com/Expensify/react-native-onyx/actions/runs/22388332008/job/64803831835 |
| 22388332008 | isValidNonEmptyCollectionForMerge |
10.84ms (1193%) | ~1ms | same run |
| 22388863959 | doAllCollectionItemsBelongToSameParent |
23.46ms (2237%) | ~1ms | https://github.com/Expensify/react-native-onyx/actions/runs/22388863959/job/64805540337 |
| 22388863959 | isValidNonEmptyCollectionForMerge |
10.63ms (1171%) | ~1ms | same run |
There was a problem hiding this comment.
The issue is that by allowing so big threshold we would also mask real regresssions if they happen in these funcitions
|
@abzokhattab After addressing comments could you run |
53e3ca2 to
da22e40
Compare
Two changes:
1. Raise delta check thresholds from 10ms/20% to 20ms/40% to match stability
check thresholds — shared CI runners introduce ~10-15ms of jitter which
makes the previous 10ms absolute threshold too strict.
2. Fix Boolean('false') bug in IS_VALIDATING_STABILITY parsing — the string
'false' was coerced to true, causing delta check failures to be
misreported as stability check failures.
da22e40 to
e21e635
Compare
|
Done |
Explanation of Change
Root Cause Analysis
After PR #689 removed
unstable_batchedUpdates, several Onyx functions dropped to sub-millisecond baselines (0.3-2ms). On shared CI runners (ubuntu-24.04-v4), system jitter alone introduces 10-15ms of variance per measurement. When a 0.5ms function measures at 12ms due to jitter, Reassure's Z-test correctly flags this as statistically significant — but it's noise, not a regression.This manifests as two distinct failure modes:
Failure Mode 1: Delta check thresholds too strict
The stability check thresholds were raised to 20ms/40% in PR #727, but the delta check still used 10ms/20%. On shared CI, ~10-15ms of jitter is normal, so a 10ms absolute threshold catches noise as "regressions."
Failure Mode 2:
Boolean('false')bugIS_VALIDATING_STABILITYwas parsed withBoolean(getInputOrEnv(...)). SincegetInputOrEnvreturns a string,Boolean('false') === true, causing delta check failures to be misreported as stability check failures. This made debugging harder by producing misleading error messages.Changes
1. Raise delta check thresholds (reassurePerfTests.yml)
ALLOWED_DURATION_DEVIATION: 10 → 20msALLOWED_RELATIVE_DURATION_DEVIATION: 20 → 40%Matches the stability check thresholds already merged in PR #727. On shared CI, jitter alone accounts for ~10-15ms, so 10ms was too tight for the absolute threshold.
2. Fix
Boolean('false')bug (validateReassureOutput.ts)Fixed Issues
$ Expensify/App#80320
PROPOSAL:
Tests
Offline tests
QA Steps
// TODO: These must be filled out, or the issue title must include "[No QA]."
PR Author Checklist
### Fixed Issuessection aboveTestssectionOffline stepssectionQA stepssectiontoggleReportand notonIconClick)src/languages/*files and using the translation methodSTYLE.md) were followedAvatar, I verified the components usingAvatarare working as expected)StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))npm run compress-svg)Avataris modified, I verified thatAvataris working as expected in all cases)Designlabel and/or tagged@Expensify/designso the design team can review the changes.ScrollViewcomponent to make it scrollable when more elements are added to the page.mainbranch was merged into this PR after a review, I tested again and verified the outcome was still expected according to theTeststeps.Screenshots/Videos
Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari