Commit 5649cbf

DR-002-Infra: Integration Testing in a Distributed Monolith (eclipse-score#1689)
1 parent 22dd313 commit 5649cbf

<!--
Copyright (c) 2025 Contributors to the Eclipse Foundation

See the NOTICE file(s) distributed with this work for additional
information regarding copyright ownership.

This program and the accompanying materials are made available under the
terms of the Apache License Version 2.0 which is available at
https://www.apache.org/licenses/LICENSE-2.0

SPDX-License-Identifier: Apache-2.0
-->

# DR-002-Infra: Integration Testing in a Distributed Monolith

* **Status:** Agreed within Community
* **Owner:** Infrastructure Community
* **Date:** 2025-09-01

---
## Executive Summary

Large systems often span multiple repositories. Each repository can look “green” on its own, yet problems only show up when everything is combined. These late surprises slow down development and make debugging painful.

The concept described here turns a collection of separate repositories into a system that behaves like a single, continuously tested whole, so that the main line stays integrable across all components.

### Proposed Approach
- Every change in any repository is tested **in combination with the rest of the system**, not just in isolation.
- There are **two testing layers**:
  - a **fast feedback loop** (lightweight tests that run on every pull request), and
  - a **deeper validation** (heavier tests that run after merges or on a schedule).
- Together, these layers give developers good reason to trust that the system as a whole keeps working.

### Benefits
- Problems across repositories are caught early.
- Developers spend less time coordinating merges (“merge after me” scenarios disappear).
- The project always has a “known good” baseline to fall back on, enabling stability while still moving fast.

Note: this concept can be extended to support multiple versions of S-CORE, but that is currently not required.

---
## Introduction

Teams often split what is functionally a single system across many repositories. Each
repository can show a green build while the assembled system is already broken. This
article looks at how to get system-level feedback earlier when you work that way. It
does not argue for pull requests, trunk-based development, or continuous integration
itself; those are well covered elsewhere. It also does not evaluate specific tools or
implementations, apart from a conceptual GitHub-based example.

The context here assumes three things: you develop through pull requests with required
checks; you have multiple interdependent repositories that ship together; and you either
have or will create a central integration repository used only for orchestration. If any
of those are absent you will need to establish them first; the rest of the discussion
builds on them.

---
## Motivation / Where Problems Usually Appear
Consider an interface change, for example a renamed field in a shared schema. Two direct
consumers are updated to match, and their pull requests pass. Another consumer several
repositories away still depends on the old interface and only fails once the whole set
of changes reaches main and a later integration run executes. The defect was present
early but only became visible late. Investigation now means hunting through logs across
repositories instead of making a quick fix while the change was still in flight.

Running full end-to-end environments on every pull request is rarely affordable.
Coordinated multi-repository changes are therefore often handled informally through
ad-hoc ordering: “merge yours after mine”. Late detection raises cost and makes
regression origins harder to locate.

---
## Core Concepts
We model the integrated system as an explicit set of (component, commit) pairs captured
in a manifest. Manifests are derived deterministically from events: a single pull
request, a coordinated group of pull requests, or a post-merge refresh. A curated fast
subset of integration tests provides pre-merge feedback; a deeper suite runs after
merge. Passing suites produce a recorded manifest (“known good”). Coordinated
multi-repository change is treated as a first-class case: we validate the set as a unit
rather than relying on merge ordering.

Terminology (brief):
* Component - repository that participates in the assembled product (e.g. service API
  repo, shared library).
* Fast subset - curated integration tests finishing in single-digit minutes (protocol
  seams, migration boundaries, adapters).
* Tuple - mapping of component names to commit SHAs for one integrated build (e.g.
  { users: a1c3f9d, billing: 9e02b4c }).
* Known good - tuple + metadata (timestamp, suite, manifest hash) stored for later
  reproduction.

History & context: classic continuous integration assumed a single codebase; splitting
one system across repositories reintroduces the coordination issues CI was intended to
remove. The approach here adapts familiar CI principles (frequent integration, fast
feedback, reproducibility) to a multi-repository boundary. The central integration
repository is a neutral place to define participating components, build manifests, hold
integration-specific helpers (overrides, fixtures, seam tests), and persist known-good
records. It should not contain business logic; keeping it lean reduces accidental
coupling and simplifies review.

---
## Integration Workflows
We use three recurring workflows: a single pull request, a coordinated subset when
multiple pull requests must land together, and a post-merge fuller suite. Each produces
a manifest, runs an appropriate depth of tests, and may record the tuple if successful.

### Visual Overview
```{mermaid}
flowchart TB
    subgraph COMP[Component Repos]
        pr[PR opened / updated<br/>&lt;event&gt;]:::event --> comp_ci[Component tests]:::step

        trigger1[Merge to main<br/>&lt;event&gt;]:::event
    end

    subgraph INT[Integration Repo]
        comp_ci --> |dispatch|detect_changeset[Detect multi repository PRs]:::step
        knownGood[(Known good store)]:::artifact

        %% PR
        detect_changeset --> buildMan[Build PR/PRs manifest using PR/PRs SHA + known good others]:::step
        knownGood --> buildMan
        buildMan --> runSubset[Run fast subset of integration tests]:::step
        runSubset --> prFeedback[Provide Feedback in PR / all PRs]:::step

        %% Post-merge / scheduled full suite
        trigger1 -->|dispatch| fullMan[Build full manifest from latest mains of all repos]:::step
        trigger2[schedule<br/>&lt;event&gt;]:::event --> fullMan
        fullMan --> fullSuite[Run full integration test suite]:::step
        fullSuite --> fullPass{Full suite pass?}:::decision
        fullPass -->|Yes| knownGood
        fullPass -->|No| issue["Create Issue<br>(or a more clever automated bisect solution)"]:::red
    end
```
*High-level flow of integration workflows. Known good store feeds manifest construction for single and coordinated paths; full test suite success updates the store.*

### Single Pull Request
When a pull request opens or updates, its repository runs its normal fast tests. The
integration repository is also triggered with the repository name, pull request number,
and head SHA. It builds a manifest using that SHA for the changed component and the last
known-good SHAs for the others, then runs the curated fast subset. The result is
reported back to the pull request. The manifest and logs are stored even when the run
fails, so a developer can reproduce the failure locally.

The subset is explicit rather than dynamically inferred. Tests in it should fail quickly
when contracts or shared schemas drift. If the list grows until it is slow, it will
end up disabled or ignored; regular curation keeps it useful.
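
The manifest construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_pr_manifest`, the in-memory layout of the known-good data (component name mapped to `repo` and `ref`), and the payload keys are hypothetical, chosen to line up with the manifest and dispatch examples in this document rather than with any existing S-CORE tooling.

```python
from datetime import datetime, timezone


def build_pr_manifest(known_good, payload, subset="pr_fast", now=None):
    """Pin the changed component to the PR head SHA; all others stay at known-good.

    known_good: dict of component name -> {"repo": ..., "ref": ...}  (assumed layout)
    payload:    dict with "repo", "pr" and "sha" as sent by the component repository
    """
    matches = [name for name, c in known_good.items() if c["repo"] == payload["repo"]]
    if len(matches) != 1:
        raise ValueError(f"{payload['repo']} must match exactly one known component")
    name = matches[0]
    now = now or datetime.now(timezone.utc)
    return {
        "pr": int(payload["pr"]),
        "component_under_test": {
            "name": name,
            "repo": payload["repo"],
            "sha": payload["sha"],
        },
        # Sorted so that the same inputs always yield the same manifest.
        "others": [
            {"name": n, "repo": c["repo"], "ref": c["ref"]}
            for n, c in sorted(known_good.items())
            if n != name
        ],
        "subset": subset,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Rejecting a payload whose repository is not a registered component keeps the derivation deterministic: a manifest is either fully pinned or not produced at all.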

### Coordinated Multi-Repository Subset
Some changes require multiple repositories to move together (for example a schema
evolution, a cross-cutting refactor, a protocol tightening). We mark related pull
requests using a stable mechanism such as a common label (e.g. changeset:feature-x). The
integration workflow discovers all open pull requests sharing the label, builds a
manifest from their head SHAs (with the last known-good SHAs for all other components),
and runs the same fast subset. A unified status is posted back to each pull request.
None of them merges until the coordinated set is green. This removes informal merge
ordering as a coordination mechanism.
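
A minimal sketch of the coordinated case, again in Python. It assumes the list of open pull requests has already been fetched (for example through the hosting platform's API) into plain dictionaries; the keys `repo`, `number`, `branch`, `head_sha` and `labels`, as well as the function name, are hypothetical and exist only for this illustration.

```python
def build_coordinated_manifest(known_good, open_prs, changeset, subset="pr_fast"):
    """Build one manifest from all open PRs labelled `changeset:<name>`.

    known_good: dict of component name -> {"repo": ..., "ref": ...}  (assumed layout)
    open_prs:   list of dicts with "repo", "number", "branch", "head_sha", "labels"
    """
    label = f"changeset:{changeset}"
    members = [pr for pr in open_prs if label in pr["labels"]]
    if not members:
        raise ValueError(f"no open pull request carries the label {label}")

    name_by_repo = {c["repo"]: name for name, c in known_good.items()}
    under_test = []
    for pr in sorted(members, key=lambda p: p["repo"]):
        under_test.append({
            "name": name_by_repo[pr["repo"]],  # KeyError if the repo is not a component
            "repo": pr["repo"],
            "branch": pr["branch"],
            "ref": pr["head_sha"],
            "pr": pr["number"],
        })

    names = [c["name"] for c in under_test]
    if len(set(names)) != len(names):
        raise ValueError("a changeset may contain at most one PR per component")

    others = [
        {"name": n, "repo": c["repo"], "ref": c["ref"]}
        for n, c in sorted(known_good.items())
        if n not in names
    ]
    return {
        "components_under_test": under_test,
        "others": others,
        "subset": subset,
        "changeset": changeset,
    }
```

Because every labelled pull request resolves to the same manifest, the integration run can post one identical status to each of them.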

### Post-Merge Full Suite
After merges we run a deeper suite. Some teams trigger on every push to main; others run
on a schedule (hourly seems to be a common practice). Per-merge runs localise failures
but cost more; batched runs save resources but expand the search space when problems
appear. When the suite fails, retaining the manifest lets you bisect between the last
known-good tuple and the current manifest (using a scripted search across the changed
SHAs if multiple components advanced). On success we append a record for the tuple with
a manifest hash and timing data.
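
The scripted search mentioned above might look like the following sketch. It bisects at component granularity only: it advances components from their known-good SHA to their current SHA in a fixed order and looks for the first step at which the suite starts failing. It assumes that the known-good tuple passes, that the current tuple fails, and that the failure persists once the offending component has advanced; `passes` is a hypothetical callback that builds and tests a given tuple.

```python
def first_bad_component(good, bad, passes):
    """Return the component whose advance from `good` to `bad` first breaks the suite.

    good, bad: dicts of component name -> commit SHA
    passes:    callable taking such a dict and returning True if the suite passes
    """
    changed = sorted(name for name in bad if good.get(name) != bad[name])
    if not changed:
        raise ValueError("tuples are identical; nothing to bisect")

    def candidate(k):
        # Known-good tuple with the first k changed components advanced.
        tuple_ = dict(good)
        for name in changed[:k]:
            tuple_[name] = bad[name]
        return tuple_

    lo, hi = 0, len(changed)  # invariant: candidate(lo) passes, candidate(hi) fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(candidate(mid)):
            lo = mid
        else:
            hi = mid
    return changed[hi - 1]
```

With n changed components this needs about log2(n) suite runs. Narrowing down further to a single commit inside the reported component is a separate, per-repository bisect.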

### Manifests
Manifests are minimal documents describing the composition. They allow reconstruction of
the integrated system later.

Single pull request example:
```
pr: 482
component_under_test:
  name: docs-as-code
  repo: eclipse-score/docs-as-code
  sha: 6bc901f2
others:
  - name: component-a
    repo: eclipse-score/component-a
    ref: 34985hf8 # based on last known-good
  - name: component-b
    repo: eclipse-score/component-b
    ref: a4fd56re # based on last known-good
subset: pr_fast
timestamp: 2025-08-13T12:14:03Z
```

Coordinated example:
```
components_under_test:
  - name: users-service
    repo: eclipse-score/users-service
    branch: feature/new_email_index
    ref: a57hrdfg
    pr: 16
  - name: auth-service
    repo: eclipse-score/auth-service
    branch: feature/lenient-token-parser
    ref: q928d46b75
    pr: 150
others:
  - name: billing-service
    repo: eclipse-score/billing-service
    ref: a4fd56re # based on last known-good
subset: pr_fast
changeset: feature-x
```

Large configuration belongs elsewhere; manifests should stay readable and diffable.
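
Keeping manifests small pays off when two compositions need to be compared. As a rough illustration, assuming each manifest has already been reduced to a tuple (component name mapped to SHA), a diff is a few lines of Python; the function name `diff_tuples` is hypothetical.

```python
def diff_tuples(old, new):
    """Return {component: (old_sha, new_sha)} for every component that differs.

    A component present on only one side is reported with None on the other side.
    """
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

An empty result means both manifests describe the same composition, so a difference in test outcome points at the environment or at a flaky test rather than at a component change.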

---
## Example: GitHub Actions (Conceptual)
*Conceptual outline; not yet implemented here.*

Trigger from a component repository:
```
name: integration-pr
on: [pull_request]
jobs:
  dispatch:
    runs-on: ubuntu-latest
    steps:
      - name: Dispatch to integration repo
        uses: peter-evans/repository-dispatch@v3
        with:
          token: ${{ secrets.INTEGRATION_TRIGGER_TOKEN }}
          repository: eclipse-score/reference_integration
          event-type: pr-integration
          # Send the PR head SHA; on pull_request events github.sha is the temporary merge commit.
          client-payload: >-
            {"repo":"${{ github.repository }}","pr":"${{ github.event.pull_request.number }}","sha":"${{ github.event.pull_request.head.sha }}"}
```

234+
Integration repository receiver (subset):
235+
```
236+
on:
237+
repository_dispatch:
238+
types: [pr-integration]
239+
jobs:
240+
pr-fast-subset:
241+
runs-on: ubuntu-latest
242+
steps:
243+
- uses: actions/checkout@v4
244+
- name: Parse payload
245+
run: echo '${{ toJson(github.event.client_payload) }}' > payload.json
246+
247+
- name: Materialize composition
248+
run: gen_pr_manifest.py last_known_good.yaml payload.json > manifest.pr.yaml
249+
250+
- name: Render MODULE overrides
251+
run: render_overrides.py manifest.pr.yaml > MODULE.override.bzl
252+
253+
- name: Bazel test (subset)
254+
run: bazel test //integration/subset:pr_fast --override_module_files=MODULE.override.bzl
255+
256+
- name: Store manifest & results
257+
uses: actions/upload-artifact@v4
258+
with:
259+
name: pr-subset-${{ github.run_id }}
260+
path: |
261+
manifest.pr.yaml
262+
bazel-testlogs/**/test.log
263+
```
264+
265+
Post-merge full suite:
266+
```
267+
on:
268+
schedule: [{cron: "15 * * * *"}]
269+
repository_dispatch:
270+
types: [component-merged]
271+
jobs:
272+
full-suite:
273+
runs-on: ubuntu-latest
274+
steps:
275+
- uses: actions/checkout@v4
276+
277+
- name: Generate new last_known_good.yaml
278+
run: update_last_known_good.py last_known_good.yaml > last_known_good.yaml
279+
280+
- name: Bazel test (full)
281+
run: bazel test //integration/full:all --test_tag_filters=-flaky
282+
283+
- name: Persist known-good tuple (on success)
284+
if: success()
285+
run: |
286+
git add last_known_good.yaml
287+
git commit -m "update known good"
288+
289+
- name: Upload artifacts
290+
uses: actions/upload-artifact@v4
291+
with:
292+
name: full-${{ github.run_id }}
293+
path: |
294+
bazel-testlogs/**/test.log
295+
```
296+
297+
### Recording Known-Good Tuples
298+
Known-good records are stored append-only.
299+
```
300+
[
301+
{
302+
"timestamp": "2025-08-13T12:55:10Z",
303+
"tuple": {
304+
"docs-as-code": "6bc901f2",
305+
"component-a": "91c0d4e1",
306+
"component-b": "a44f0cd9"
307+
},
308+
"manifest_sha256": "4c9b7f...",
309+
"suite": "full",
310+
"duration_s": 742
311+
}
312+
]
313+
```
314+
Persisting enables reproduction (attach manifest to a defect), audit (what exactly
315+
passed before a release), gating (choose any known-good tuple), and comparison (diff
316+
manifests to isolate drift) without relying on (rather fragile) links to unique runs in
317+
your CI system.
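
One possible way to produce such records is sketched below. Two things are assumptions of this sketch, not decisions of this document: the manifest hash is computed over a canonical JSON form (sorted keys, no extra whitespace), and records are written as one JSON object per line rather than as the single array shown above, so that appending never rewrites earlier records. Both function names are hypothetical.

```python
import hashlib
import json


def manifest_sha256(manifest):
    """Hash a canonical JSON form so key order and whitespace do not change the digest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_known_good(path, shas, manifest, suite, duration_s, timestamp):
    """Append one record as a single JSON line; earlier lines are never touched."""
    record = {
        "timestamp": timestamp,
        "tuple": dict(shas),
        "manifest_sha256": manifest_sha256(manifest),
        "suite": suite,
        "duration_s": duration_s,
    }
    with open(path, "a", encoding="utf-8") as store:
        store.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Reading the latest known-good tuple is then a matter of parsing the last line of the file, and the full history stays available for audit and comparison.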
318+
319+
---
320+
## Operating It
321+
**Curating the fast subset:** Tests should fail quickly when public seams change. Keep
322+
the list explicit (e.g. //integration/subset:pr_fast). Remove redundant tests and
323+
quarantine flaky ones; review periodically (monthly or after significant interface
324+
churn) to preserve signal.
325+
326+
**Handling failures:** For a failing pull request subset: inspect manifest + log;
327+
reproduce locally with a script consuming the manifest. For a failing coordinated set:
328+
treat all related pull requests as atomic. For a failing post-merge full suite: bisect
329+
between the last known-good tuple and current manifest (script permutations if multiple
330+
repositories changed) to narrow cause. Distinguish real regressions from test fragility.
331+
332+
**Trade-offs and choices:** Manifests + SHAs avoid tag noise and keep validation close
333+
to heads. Two tiers (subset + full) offer a clear mental model; add more only with
334+
evidence. A central orchestration repository centralises caching, secrets, and audit
335+
history.
336+
337+
**Practical notes:** Cache builds to stabilise subset runtime. Hash manifests (e.g.
338+
SHA-256) for concise references. Expose an endpoint or badge showing the latest known
339+
good. Generate overrides; do not hand-edit ephemeral files. Optionally lint the subset
340+
target for allowed directories.
341+
342+
**Avoiding pitfalls:** Diff-based dynamic test selection often misses schema or contract
343+
drift. Ad-hoc manual edits to integration config reduce reproducibility. Merge ordering
344+
as coordination defers detection to the last merge.
345+
346+
**Signs it is working:** Interface breakage is caught pre-merge. Coordinated change sets
347+
show unified status. Multi-repository regressions are localised rapidly using stored
348+
manifests.
349+
350+
---
351+
## Releases and Bazel Registry
352+
353+
Bazel modules should be released only once they are verified, which in this setup is
354+
equivalent to being included in the known-good store. This does not imply that all
355+
verified versions need to end up in a release. That's still up to the module
356+
maintainers.
357+
358+
However in some cases pre-releases are even mandatory: when two modules are verified
359+
together (multi repo PR) and one depends on the other, the PR cannot be merged without
360+
internally releasing the dependent module, and setting the appropriate dependency in the
361+
other.
362+
363+
---
364+
## Summary
365+
By expressing the integrated system as explicit manifests, curating a fast integration
366+
subset for pull requests, and running a deeper post-merge suite, you move discovery of
367+
cross-repository breakage earlier while keeping costs predictable. Each successful run
368+
leaves a reproducible record, making release selection and debugging straightforward.
369+
The approach lets a distributed codebase behave operationally like a single one.
370+
371+
*Further reading:* Continuous Integration (Fowler), Continuous Delivery (Humble &
372+
Farley), trunk-based development resources.
