Conversation

Contributor

@janekmi janekmi commented Dec 18, 2025

DAOS-17893 reports a crash in DAVv2 that occurred while freeing an allocation: the provided offset was not the beginning of a memory block, yet the memory block at hand was not a run. The allocator itself could check for this kind of discrepancy and report it before the process is terminated by a SIGFPE signal. However, the higher in the stack the issue is caught, the more information we can potentially recover from the crash.

One such place, notorious for being involved in this kind of incident, is the VOS garbage collector (e.g. DAOS-18049). This is possibly not because it is buggier than any other piece of DAOS, but rather because it is where we enumerate large chunks of the VOS metadata in order to free the requested objects and all their descendants.

Hence, this PR introduces a few asserts into the GC code. Wherever it is possible to validate that the offset the GC is about to free actually points to the object we expect to live there, we assert that it does and, if it does not, dump its contents for further investigation.
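
As a rough illustration of the idea (not the code in this PR), the check-then-dump pattern could look like the sketch below. The structure, the magic value, and all `example_*` names are hypothetical stand-ins rather than VOS definitions, and the sketch returns a status where real GC code would assert.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical magic value and object header; not the actual VOS layout. */
#define EXAMPLE_OBJ_MAGIC 0xCAFEF00Du

struct example_obj {
	uint32_t eo_magic;
	uint32_t eo_payload;
};

/* Hex-dump `size` bytes starting at `ptr` to stderr, 16 bytes per row. */
static void
example_dump_memory(const uint8_t *ptr, size_t size)
{
	for (size_t i = 0; i < size; i++) {
		int end_of_row = (i % 16 == 15) || (i + 1 == size);

		fprintf(stderr, "%02x%c", ptr[i], end_of_row ? '\n' : ' ');
	}
}

/*
 * Return 1 when `ptr` carries the expected magic and may be freed.
 * Otherwise dump the bytes found there and return 0; real code would
 * assert at this point instead of returning.
 */
static int
example_validate_before_free(const void *ptr)
{
	const struct example_obj *obj = ptr;

	if (obj->eo_magic == EXAMPLE_OBJ_MAGIC)
		return 1;

	fprintf(stderr, "unexpected magic 0x%x (expected 0x%x)\n",
		(unsigned int)obj->eo_magic, EXAMPLE_OBJ_MAGIC);
	example_dump_memory(ptr, sizeof(*obj));
	return 0;
}
```

Dumping before failing matters here: once the process is gone, the bytes at the suspicious offset are the main clue to what actually lived there.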

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@janekmi janekmi requested review from a team as code owners December 18, 2025 16:35
@janekmi janekmi requested review from NiuYawei and sherintg December 18, 2025 16:36
@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17293/1/execution/node/279/log

@github-actions

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-17893

@janekmi janekmi force-pushed the janekmi/DAOS-17893-GC-asserts branch from 6bbd198 to 0824861 Compare December 18, 2025 18:24
@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17293/2/execution/node/301/log

@janekmi janekmi force-pushed the janekmi/DAOS-17893-GC-asserts branch from 0824861 to 1508a5e Compare December 18, 2025 20:05
@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17293/3/execution/node/301/log

assert + memory dump for debug.

Signed-off-by: Jan Michalski <[email protected]>
@janekmi janekmi force-pushed the janekmi/DAOS-17893-GC-asserts branch from 1508a5e to 3ad502c Compare December 18, 2025 20:17
@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17293/4/execution/node/302/log

NiuYawei previously approved these changes Dec 19, 2025
@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17293/5/testReport/

@daosbuild3
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17293/5/testReport/

- add d_log_memory_ut to utest.yaml
- change D_EMIT to D_FATAL so it is easily visible
- move d_log_memory() call before d_alt_assert() call

Signed-off-by: Jan Michalski <[email protected]>
- fix DTX_ACT_BLOB_MAGIC assert
- remove the ILOG assert (left a comment to avoid repeating the mistake)

Signed-off-by: Jan Michalski <[email protected]>
@daosbuild3
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17293/6/testReport/

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17293/6/testReport/

@daosbuild3
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17293/6/testReport/

- fix the magic assert for committed DBDs

Signed-off-by: Jan Michalski <[email protected]>
@janekmi janekmi requested review from Nasf-Fan and NiuYawei December 24, 2025 12:18
* Since the key's structure does not have a magic value and the ilog root (which has
* a magic value) is already destroyed at this stage, there is no way to verify the pointer
* actually points to valid data.
*/
Contributor

Why was the assert removed? Was it incorrect?

Contributor Author

Yes. It was incorrect. As I have written in the comment, at this stage the key's ILOG is already destroyed, so we cannot use it. A pity.

@janekmi janekmi requested a review from NiuYawei December 29, 2025 13:22
Contributor

@Nasf-Fan Nasf-Fan left a comment

Hence, this PR introduces a few asserts into the GC code. Wherever it is possible to validate that the offset the GC is about to free actually points to the object we expect to live there, we assert that it does and, if it does not, dump its contents for further investigation.

The commit message is out of date. Please update it the next time you refresh the patch. Thanks.

D_FATAL("Assertion '%s' failed: " fmt, #cond, ##__VA_ARGS__); \
d_log_memory((uint8_t *)ptr, size); \
if (d_alt_assert != NULL) \
d_alt_assert(0, #cond, __FILE__, __LINE__); \
Contributor

"else assert(0);" ?
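
One way to read this suggestion is the minimal sketch below: log the failed condition, dump the memory, then call the optional hook if one is installed and fall back to a plain `assert(0)` otherwise. Every `example_*` / `EXAMPLE_*` name is a hypothetical stand-in; the sketch does not reproduce the actual `D_FATAL`, `d_log_memory`, or `d_alt_assert` definitions, and it uses `fprintf` in place of the logging macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Optional hook a test harness may install (hypothetical stand-in). */
static void (*example_alt_assert)(int cond, const char *expr, const char *file, int line);

/* Counts failures seen by the recording hook below. */
static int example_failures;

/* Example hook: record the failure instead of terminating the process. */
static void
example_record_failure(int cond, const char *expr, const char *file, int line)
{
	(void)cond;
	(void)expr;
	(void)file;
	(void)line;
	example_failures++;
}

/* Hex-dump `size` bytes starting at `ptr` to stderr on a single line. */
static void
example_log_memory(const uint8_t *ptr, size_t size)
{
	for (size_t i = 0; i < size; i++)
		fprintf(stderr, "%02x ", ptr[i]);
	fprintf(stderr, "\n");
}

/*
 * On failure: report the condition, dump the memory, then call the hook
 * if one is installed or fall back to assert(0) when none is.
 */
#define EXAMPLE_ASSERT_MEM(cond, ptr, size)                                   \
	do {                                                                  \
		if (!(cond)) {                                                \
			fprintf(stderr, "Assertion '%s' failed\n", #cond);    \
			example_log_memory((const uint8_t *)(ptr), (size));   \
			if (example_alt_assert != NULL)                       \
				example_alt_assert(0, #cond, __FILE__,        \
						   __LINE__);                 \
			else                                                  \
				assert(0);                                    \
		}                                                             \
	} while (0)
```

With `example_alt_assert = example_record_failure;` installed, a failing check only increments `example_failures`. Without a hook, the `else` branch ensures the failure still stops the process, unless `NDEBUG` compiles `assert` out.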

@janekmi janekmi requested a review from a team December 31, 2025 14:42
5 participants