@kzscisoft
Collaborator

Add System Health Alerts

Issue: #904

Python Version(s) Tested: 3.13.5

Operating System(s): Ubuntu 25.10

Documentation PR: Issue on Docs repo.

📝 Summary

Adds functionality to prevent run loss after system health failure.

🔄 Changes

Adds pre-defined alerts which trigger when system resources run low:

  • Available memory falls below 5%
  • Available disk space falls below 5%
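The two conditions above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `disk_available_percent`, `is_low` and `LOW_THRESHOLD_PERCENT` are hypothetical names, and the memory figure would come from psutil as in the diff.

```python
import shutil

LOW_THRESHOLD_PERCENT = 5.0  # threshold quoted in the PR description


def disk_available_percent(path: str = ".") -> float:
    """Percentage of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def is_low(available_percent: float, threshold: float = LOW_THRESHOLD_PERCENT) -> bool:
    """True when the available percentage has fallen below the threshold."""
    return available_percent < threshold
```

Note the strict comparison: a reading of exactly 5% does not count as low here, matching the "falls below" wording.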

✔️ Checklist

  • Unit and integration tests passing.
  • Pre-commit hooks passing.
  • Quality checks passing.
  • Updated the documentation.

@kzscisoft requested a review from wk9874 on January 9, 2026, 13:45
@kzscisoft added the enhancement (New feature or request) and python (Pull requests that update python code) labels on Jan 9, 2026
@kzscisoft linked an issue on Jan 9, 2026 that may be closed by this pull request

@property
def memory_available_percent(self) -> float:
    return 100 - typing.cast("float", psutil.virtual_memory().percent)

Collaborator

Have you tested that these work on Windows?
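For context on the portability question: psutil documents `virtual_memory()` as cross-platform and defines `percent` as `(total - available) / total * 100`, so subtracting it from 100 should give the available share on any supported OS. The sketch below shows that arithmetic against a stand-in object so it runs without psutil; the free function `memory_available_percent` here is illustrative, not the property from the diff, and it does not replace testing on Windows.

```python
from types import SimpleNamespace


def memory_available_percent(virtual_memory) -> float:
    """Available memory as a percentage, given a psutil-style result object."""
    return 100.0 - float(virtual_memory.percent)


# Stand-in for the object returned by psutil.virtual_memory(); with psutil
# installed one would pass psutil.virtual_memory() instead.
fake_reading = SimpleNamespace(percent=96.5)
available = memory_available_percent(fake_reading)
```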

name="low_available_virtual_memory",
metric=f"{RESOURCES_METRIC_PREFIX}/memory.virtual.available.percentage",
threshold=5,
aggregation="at least one",

Collaborator

What does this do?
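One plausible reading of the `aggregation` setting, sketched locally: the alert looks at the metric values inside the time window and combines them before comparing against the threshold. The function below is an assumption about the semantics for an "is below" rule, not Simvue's server-side implementation, and `alert_fires` is a hypothetical name.

```python
from statistics import mean
from typing import Sequence


def alert_fires(
    values: Sequence[float],
    threshold: float,
    aggregation: str = "at least one",
) -> bool:
    """Evaluate an 'is below' rule over the values seen in one window."""
    if not values:
        return False  # nothing recorded in the window, nothing to trigger on
    if aggregation == "at least one":
        return any(value < threshold for value in values)
    if aggregation == "all":
        return all(value < threshold for value in values)
    if aggregation == "average":
        return mean(values) < threshold
    if aggregation == "sum":
        return sum(values) < threshold
    raise ValueError(f"Unknown aggregation: {aggregation!r}")
```

Under this reading, "at least one" is the most sensitive choice: a single reading below 5% within the window is enough to trigger.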

retention_period: str | None = None,
timeout: int | None = 180,
visibility: typing.Literal["public", "tenant"] | list[str] | None = None,
terminate_on_low_system_health: bool = True,

Collaborator

I would default this to False, personally.

aggregation="at least one",
window=2,
rule="is below",
trigger_abort=terminate_on_alert,

Collaborator

Could also add an email notification option?


def to_dict(self) -> dict[str, float]:
    """Create metrics dictionary for sending to a Simvue server."""
    _metrics: dict[str, float] = {

Collaborator

We need to rethink how these metrics are named; the resources metrics page now looks very confusing:

Image


attempts: int = 0

while run._status == "terminated" and attemps < 5:

Collaborator

Typo: 'attemps' should be 'attempts'. As written, this raises a NameError as soon as the status check is true.
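A bounded polling loop with the counter spelled consistently might look like the sketch below. It is a generic illustration under assumptions: `wait_for_status` and its `get_status` callable are hypothetical, and the loop condition is not necessarily the one the test in this PR intends.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    expected: str,
    max_attempts: int = 5,
    delay: float = 0.0,
) -> bool:
    """Poll get_status until it returns `expected` or attempts run out."""
    attempts = 0
    status = get_status()
    while status != expected and attempts < max_attempts:
        time.sleep(delay)
        status = get_status()
        attempts += 1
    return status == expected
```

Returning a boolean lets the calling test assert on the outcome instead of silently falling through when the retry budget is exhausted.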

import random
import datetime
import simvue

Collaborator

You need to add unit tests:

  • Check that the metrics appear automatically
  • Check that if you add a process that spikes RAM usage, or create a large tempfile, the available memory / disk space metrics change appropriately
  • Check that the options for the alert (terminate, and email if you decide to add that) are added to the alert correctly (i.e. get the alert definition back once it's created and check it matches)
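For the tempfile check, a helper along these lines could supply the before/after readings. It is a sketch using only the standard library; `free_bytes_around_write` is a hypothetical name, and a real test would compare the values the resource metrics report rather than calling `shutil.disk_usage` directly.

```python
import os
import shutil
import tempfile
from typing import Optional


def free_bytes_around_write(
    size_bytes: int, directory: Optional[str] = None
) -> tuple[int, int]:
    """Return free bytes before and after writing a temp file of size_bytes."""
    target = directory or tempfile.gettempdir()
    before = shutil.disk_usage(target).free
    with tempfile.NamedTemporaryFile(dir=target) as handle:
        handle.write(b"\0" * size_bytes)
        handle.flush()
        os.fsync(handle.fileno())  # push the data to disk before measuring
        after = shutil.disk_usage(target).free
    # The temp file is deleted when the `with` block exits.
    return before, after
```

A test built on this would typically write a file large enough to dominate background noise and assert that the second reading is lower; other processes writing to the same filesystem can make small differences unreliable.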

Development

Successfully merging this pull request may close these issues.

Define Alerts to Terminate Runs if System Unhealthy

3 participants