[Bug]: Custodian cannot be run safely when the executable is on a separate node #398

@Andrew-S-Rosen


What happened?

For full context of this issue, refer to the summary in #396.

Custodian can end up running on one node of an allocation (the master node) while the VASP processes are launched on other nodes in the same allocation. This is common, for instance, when requesting a single large Slurm allocation and running many concurrent VASP processes within it. Custodian currently cannot handle this setup: the Custodian process on the master node seemingly does not have permission to kill a VASP process on another node, so it falls back to a killall command, which kills everything, including perfectly healthy jobs. Custodian does, however, have permission to kill the parent process that launches the VASP executable (typically an srun or mpirun call), and those parent processes are in fact what the killall indiscriminately kills.

#396 solves this for VASP, but essentially the same problem exists for the other codes that Custodian supports. Once #396 is merged, the same fix should be easy to implement for those codes.
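
To make the distinction concrete, here is a minimal sketch of stopping only the launcher process that was started for one job, rather than calling killall. It is not the code from #396 and not Custodian's API: `terminate_launcher` is a hypothetical helper, and the sleeping Python child processes are stand-ins for `srun`/`mpirun` launchers. It also assumes the launcher passes the termination on to the remote tasks it started, which depends on the launcher and the site configuration.

```python
import subprocess
import sys


def terminate_launcher(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Stop only the launcher we started (hypothetical helper).

    Sends a polite termination request first and escalates to a hard
    kill if the process has not exited within `timeout` seconds.
    """
    if proc.poll() is None:  # process is still running
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-ins for two concurrent launcher processes in one allocation.
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    failing_job = subprocess.Popen(cmd)
    healthy_job = subprocess.Popen(cmd)

    # Only the job that needs to be stopped is touched.
    terminate_launcher(failing_job)
    print("healthy job still running:", healthy_job.poll() is None)

    # Clean up the stand-in process.
    terminate_launcher(healthy_job)
```

Because the handle returned by `subprocess.Popen` identifies exactly one launcher, the other jobs in the allocation are left alone, which is the behavior the killall fallback lacks.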

Version

2025.8.13

Which OS?

  • MacOS
  • Windows
  • Linux

Log output
