Conversation

erjel commented Nov 8, 2024

Hi,

first, thanks for open-sourcing submitit! It greatly simplifies setting up distributed jobs on our in-house Slurm cluster.

Some context:

While requeuing 32-node jobs, I got a message from our infrastructure team that my jobs were causing nodes to get stuck in the Slurm "drained" state. Even worse: the requeuing itself worked fine, so my job drained the first 32 nodes and restarted on 32 different nodes (which it probably would have drained as well). This would eventually have brought down our entire GPU partition.

According to the infrastructure team, nodes go into the "drained" state when job processes keep running on the cluster after the final SIGKILL has been sent (i.e. when the timeout is reached or when the job is requeued). As far as I understand, the timing of the SIGKILL depends on the individual Slurm settings. On our cluster there is a very liberal 120 s delay between the timeout and the SIGKILL (apparently the default is about 30 s).

For debugging, we had a look at the processes running on one of the nodes (4 GPUs per node).
Normally there are multiple processes running:

  • 4x srun (each starting the main Python script)
  • 4x main Python script (one per GPU; each spawning X child processes, where X is the number of workers of its PyTorch dataloader)
  • 4x X dataloader processes feeding training samples from the CPU to the GPUs

Once a timeout is reached:

  • all srun processes stop
  • 1x main Python script stops
  • X dataloader processes stop

The remaining Python scripts keep running (and even continue to use GPU resources) until the SIGKILL is sent. While we could not reproducibly trigger the drained state on the fly, we decided that I should not rely on SIGKILL to bring the remaining job steps down.

The solution that worked on our cluster:

After a timeout or a requeue, a SIGTERM is sent to all job steps. While we don't want to call sys.exit(-1) on rank != 0 nodes as soon as the SIGUSR2 is sent (in order to give the rank 0 job step time to create a checkpoint first), we can prepare the job steps for the SIGTERM that follows. SIGTERM is triggered by Slurm after scontrol requeue <jobid> and is thus a strong indication that rank 0 is done with checkpointing, so we can safely call sys.exit() in the remaining job steps. A rough sketch of the idea is shown below.
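For illustration only, here is a minimal sketch of that idea (not the actual diff in this pull request; the handler name is made up): every job step registers a SIGTERM handler that simply exits, so the non-rank-0 steps shut down cleanly once the requeue happens instead of lingering until the SIGKILL.

import signal
import sys

def _exit_on_sigterm(signum, frame):
    # SIGTERM arrives after `scontrol requeue <jobid>`, i.e. after rank 0
    # has had its chance to checkpoint on SIGUSR2, so it is safe to leave now.
    sys.exit(0)

# Register the handler early in every job step (all ranks).
signal.signal(signal.SIGTERM, _exit_on_sigterm)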

With the proposed changes we observed that the jobs exit gracefully well before the 120 s SIGKILL limit. Additionally, we noticed that the time spent in the Slurm "completing" state is notably reduced.

Even though cluster setups can differ wildly, I decided to create a pull request to give feedback to the community (and I hope that the workaround and the information are useful for others).

@facebook-github-bot added the CLA Signed label Nov 8, 2024
erjel (Author) commented Mar 12, 2025

ping @jrapin

Any thoughts on this?

baldassarreFe (Contributor) commented Apr 7, 2025

Seems like an application issue that should be solved in the application and not in submitit.

What multiprocessing start method do you use for the dataloader processes? Do your jobs still fail to exit if you set this before doing any CUDA-related operation?

import multiprocessing
multiprocessing.set_start_method("forkserver", force=True)

Lengthier explanation: the dataloader processes should not hold any GPU resource, not even an unused CUDA context, or they might fail to die when receiving a signal. The default multiprocessing start method is fork, which is problematic because the child processes inherit the CUDA context. The alternative start method forkserver avoids this. If you check the output of nvidia-smi, you should see that only the "main" Python script holds any GPU resource.
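For completeness, here is a hypothetical, self-contained example of that suggestion (assuming a standard PyTorch DataLoader; the dataset and sizes are made up): the start method is set before anything touches CUDA, so the worker processes never inherit a CUDA context and can be terminated cleanly.

import multiprocessing

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Must happen before any CUDA call and before the DataLoader spawns workers.
    multiprocessing.set_start_method("forkserver", force=True)

    dataset = TensorDataset(torch.randn(1024, 16))
    loader = DataLoader(dataset, batch_size=32, num_workers=4)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for (batch,) in loader:
        batch = batch.to(device)  # only the main process touches the GPU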
