Fix Reassure CI flakiness caused by noisy sub-10ms baselines #745

Open
abzokhattab wants to merge 1 commit into Expensify:main from abzokhattab:abzokhattab/fix-reassure-ci-flakiness

Contributor

@abzokhattab abzokhattab commented Mar 1, 2026

Explanation of Change

Root Cause Analysis

After PR #689 removed unstable_batchedUpdates, several Onyx functions dropped to very small baselines (0.3-2ms). On shared CI runners (ubuntu-24.04-v4), system jitter alone introduces 10-15ms of variance per measurement. When a 0.5ms function measures at 12ms due to jitter, Reassure's Z-test correctly flags the difference as statistically significant, but it is noise, not a regression.

This manifests as two distinct failure modes:

Failure Mode 1: Delta check thresholds too strict
The stability check thresholds were raised to 20ms/40% in PR #727, but the delta check still used 10ms/20%. On shared CI, ~10-15ms of jitter is normal, so a 10ms absolute threshold catches noise as "regressions."

Failure Mode 2: Boolean('false') bug
IS_VALIDATING_STABILITY was parsed with Boolean(getInputOrEnv(...)). Since getInputOrEnv returns a string, Boolean('false') === true, causing delta check failures to be misreported as stability check failures. This made debugging harder by producing misleading error messages.

Changes

1. Raise delta check thresholds (reassurePerfTests.yml)
ALLOWED_DURATION_DEVIATION: 10 → 20ms
ALLOWED_RELATIVE_DURATION_DEVIATION: 20 → 40%

Matches the stability check thresholds already merged in PR #727. On shared CI, jitter alone accounts for ~10-15ms, so 10ms was too tight for the absolute threshold.
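
To see why the absolute threshold matters for very fast functions, here is a minimal sketch of a two-part delta check. It assumes a measurement is flagged only when both the absolute and the relative deviation exceed their limits; the function and type names are hypothetical and are not taken from validateReassureOutput.ts.

```typescript
// Hypothetical shape of the two thresholds; names are illustrative only.
type Thresholds = {
    allowedDurationDeviation: number; // milliseconds
    allowedRelativeDurationDeviation: number; // percent
};

// Assumption: a result counts as a regression only when BOTH limits are exceeded.
function isFlagged(baselineMs: number, currentMs: number, thresholds: Thresholds): boolean {
    const deviation = currentMs - baselineMs;
    const relativeDeviation = (deviation / baselineMs) * 100;
    return deviation > thresholds.allowedDurationDeviation && relativeDeviation > thresholds.allowedRelativeDurationDeviation;
}

const oldThresholds: Thresholds = {allowedDurationDeviation: 10, allowedRelativeDurationDeviation: 20};
const newThresholds: Thresholds = {allowedDurationDeviation: 20, allowedRelativeDurationDeviation: 40};

// A 0.5ms function that measures 12ms because of runner jitter (+11.5ms, +2300%).
const flaggedBefore = isFlagged(0.5, 12, oldThresholds);
const flaggedAfter = isFlagged(0.5, 12, newThresholds);
console.log(flaggedBefore, flaggedAfter); // true false
```

Under these assumptions, the jitter spike trips the old 10ms limit but stays below the new 20ms limit, while a slowdown larger than 20ms on the same function would still be flagged.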

2. Fix Boolean('false') bug (validateReassureOutput.ts)

// Before (always true for any non-empty string):
const isValidatingStability = Boolean(getInputOrEnv('IS_VALIDATING_STABILITY'));

// After:
const isValidatingStability = getInputOrEnv('IS_VALIDATING_STABILITY') === 'true';
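
The difference between the two forms can be checked in isolation. The sketch below uses a hypothetical stand-in for getInputOrEnv that reads from a plain object, since the real helper's signature is not shown in this description.

```typescript
// Hypothetical stand-in for getInputOrEnv: returns the raw string value, or '' when unset.
function readSetting(source: Record<string, string | undefined>, name: string): string {
    return source[name] ?? '';
}

const env: Record<string, string | undefined> = {IS_VALIDATING_STABILITY: 'false'};

// Boolean() only checks whether the string is non-empty, so 'false' becomes true.
const coerced = Boolean(readSetting(env, 'IS_VALIDATING_STABILITY'));

// Comparing against the literal 'true' treats 'false', '' and unset values as false.
const compared = readSetting(env, 'IS_VALIDATING_STABILITY') === 'true';

console.log(coerced, compared); // true false
```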

Fixed Issues

$ Expensify/App#80320
PROPOSAL:

Tests

Offline tests

QA Steps

// TODO: These must be filled out, or the issue title must include "[No QA]."

  • Verify that no errors appear in the JS console

PR Author Checklist

  • I linked the correct issue in the ### Fixed Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I added steps for the expected offline behavior in the Offline steps section
    • I added steps for Staging and/or Production testing in the QA steps section
    • I added steps to cover failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android: Native
    • Android: mWeb Chrome
    • iOS: Native
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
      • If any non-english text was added/modified, I used JaimeGPT to get English > Spanish translation. I then posted it in #expensify-open-source and it was approved by an internal Expensify engineer. Link to Slack message:
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.ts or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
  • If new assets were added or existing ones were modified, I verified that:
    • The assets are optimized and compressed (for SVG files, run npm run compress-svg)
    • The assets load correctly across all supported platforms.
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • I added unit tests for any new feature or bug fix in this PR to help automatically prevent regressions in this user flow.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.

Screenshots/Videos

Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari

@abzokhattab abzokhattab requested a review from a team as a code owner March 1, 2026 23:43
@melvin-bot melvin-bot bot requested review from arosiclair and removed request for a team March 1, 2026 23:44
@abzokhattab abzokhattab changed the title Fix Reassure CI flakiness caused by noisy sub-10ms baselines WIP: Fix Reassure CI flakiness caused by noisy sub-10ms baselines Mar 1, 2026
@abzokhattab abzokhattab marked this pull request as draft March 1, 2026 23:45
@abzokhattab abzokhattab force-pushed the abzokhattab/fix-reassure-ci-flakiness branch 2 times, most recently from 9fd08b2 to 53e3ca2 Compare March 1, 2026 23:57
@abzokhattab abzokhattab marked this pull request as ready for review March 1, 2026 23:59
@abzokhattab abzokhattab changed the title WIP: Fix Reassure CI flakiness caused by noisy sub-10ms baselines Fix Reassure CI flakiness caused by noisy sub-10ms baselines Mar 1, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53e3ca2a7e

Comment on lines +56 to +58
const effectiveAbsoluteThreshold = baselineDuration < MIN_BASELINE_FOR_DEFAULT_THRESHOLD_MS ? 100 : allowedDurationDeviation;

const isMeasurementRelevant = Math.abs(durationDeviation) > effectiveAbsoluteThreshold;

P1: Keep sub-10ms checks from bypassing real regressions

Using a hardcoded 100 ms absolute threshold for every benchmark with baseline.meanDuration < 10 causes large slowdowns to be ignored before the relative check runs. In this path, a measurement can regress from single-digit milliseconds to tens of milliseconds (for example 8ms→80ms, +900%) and still be treated as not relevant, so the performance gate no longer protects many of the fastest code paths that recently moved under 10ms.
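
The masking effect described in this comment can be reproduced with a small sketch. It reuses the expression from the commented lines; the value 10 for MIN_BASELINE_FOR_DEFAULT_THRESHOLD_MS is inferred from the "baseline.meanDuration < 10" wording above, and the wrapper function is hypothetical.

```typescript
// Assumed value, inferred from the review comment rather than read from the source file.
const MIN_BASELINE_FOR_DEFAULT_THRESHOLD_MS = 10;

// Hypothetical wrapper around the relevance check under review.
function isMeasurementRelevant(baselineDuration: number, currentDuration: number, allowedDurationDeviation: number): boolean {
    const durationDeviation = currentDuration - baselineDuration;
    const effectiveAbsoluteThreshold = baselineDuration < MIN_BASELINE_FOR_DEFAULT_THRESHOLD_MS ? 100 : allowedDurationDeviation;
    return Math.abs(durationDeviation) > effectiveAbsoluteThreshold;
}

// 8ms -> 80ms is a +900% slowdown, but the 72ms deviation stays under the 100ms override.
console.log(isMeasurementRelevant(8, 80, 20)); // false

// The same 72ms deviation on a 12ms baseline uses the configured 20ms limit instead.
console.log(isMeasurementRelevant(12, 84, 20)); // true
```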

Contributor

Yeah @abzokhattab, I don't think we should do "3. Adaptive absolute threshold for sub-10ms baselines (validateReassureOutput.ts)" for now. Changes 1 and 2 make sense to me, however.

Contributor Author

@abzokhattab abzokhattab Mar 3, 2026

@fabioh8010 I just reverted it, but looking at some of the failed pipelines, I see that some of them were hitting deviations of more than 1000%, so I think that could still occur after merging. What do you think?

| Run ID | Test | Deviation | Baseline | URL |
| --- | --- | --- | --- | --- |
| 22388332008 | doAllCollectionItemsBelongToSameParent | 22.44ms (1838%) | ~1ms | https://github.com/Expensify/react-native-onyx/actions/runs/22388332008/job/64803831835 |
| 22388332008 | isValidNonEmptyCollectionForMerge | 10.84ms (1193%) | ~1ms | same run |
| 22388863959 | doAllCollectionItemsBelongToSameParent | 23.46ms (2237%) | ~1ms | https://github.com/Expensify/react-native-onyx/actions/runs/22388863959/job/64805540337 |
| 22388863959 | isValidNonEmptyCollectionForMerge | 10.63ms (1171%) | ~1ms | same run |
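
As a rough cross-check of the figures listed above, the sketch below derives the implied baseline from each reported deviation and percentage, and compares the deviation with the 20ms absolute threshold proposed in this PR. It assumes the percentage is the deviation divided by the baseline; the type and variable names are illustrative.

```typescript
type Row = {test: string; deviationMs: number; deviationPercent: number};

// Values copied from the failed runs listed above.
const rows: Row[] = [
    {test: 'doAllCollectionItemsBelongToSameParent', deviationMs: 22.44, deviationPercent: 1838},
    {test: 'isValidNonEmptyCollectionForMerge', deviationMs: 10.84, deviationPercent: 1193},
    {test: 'doAllCollectionItemsBelongToSameParent', deviationMs: 23.46, deviationPercent: 2237},
    {test: 'isValidNonEmptyCollectionForMerge', deviationMs: 10.63, deviationPercent: 1171},
];

const ABSOLUTE_THRESHOLD_MS = 20;

const summary = rows.map((row) => ({
    test: row.test,
    // Assumption: percent = deviation / baseline * 100, so baseline = deviation / (percent / 100).
    impliedBaselineMs: row.deviationMs / (row.deviationPercent / 100),
    exceedsAbsoluteThreshold: row.deviationMs > ABSOLUTE_THRESHOLD_MS,
}));

for (const entry of summary) {
    console.log(entry.test, entry.impliedBaselineMs.toFixed(2), entry.exceedsAbsoluteThreshold);
}
```

Under these assumptions the implied baselines all land close to 1ms, and only the two doAllCollectionItemsBelongToSameParent rows exceed 20ms, so those would presumably still fail with the 20ms/40% thresholds.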

Contributor

The issue is that by allowing such a big threshold, we would also mask real regressions if they happen in these functions.

@fabioh8010
Contributor

@abzokhattab After addressing comments could you run npm run gh-actions-build again? I'm getting different output from yours.

@abzokhattab abzokhattab force-pushed the abzokhattab/fix-reassure-ci-flakiness branch from 53e3ca2 to da22e40 Compare March 3, 2026 00:36
Two changes:

1. Raise delta check thresholds from 10ms/20% to 20ms/40% to match stability
   check thresholds — shared CI runners introduce ~10-15ms of jitter which
   makes the previous 10ms absolute threshold too strict.

2. Fix Boolean('false') bug in IS_VALIDATING_STABILITY parsing — the string
   'false' was coerced to true, causing delta check failures to be
   misreported as stability check failures.
@abzokhattab abzokhattab force-pushed the abzokhattab/fix-reassure-ci-flakiness branch from da22e40 to e21e635 Compare March 3, 2026 00:40
@abzokhattab
Contributor Author

Done

2 participants