Conversation

erjel commented Nov 8, 2024

Hi,

first, thanks for open-sourcing submitit! It greatly simplifies setting up distributed jobs on our in-house Slurm cluster.

Some context:

While requeuing 32-node jobs, I got a message from our infrastructure team that my jobs were causing nodes to get stuck in the Slurm "drained" state. Even worse: the requeuing itself worked fine, so my job drained the first 32 nodes and restarted on 32 different nodes (which it probably would have drained as well). This would eventually have brought down our entire GPU partition.

According to the infrastructure team, nodes go into the "drained" state when job processes keep running on the cluster after the final SIGKILL has been sent (i.e. when the timeout is reached or when the job is requeued). As far as I understand, the timing of the SIGKILL depends on the individual Slurm settings. On our cluster there is a very liberal 120 s delay between the timeout and the SIGKILL (apparently the default is about 30 s).

For debugging, we had a look at the processes running on one of the nodes (4 GPUs per node).
Normally there are multiple processes running:

  • 4x srun (each starting the main Python script)
  • 4x main Python script (one per GPU; each spawning X child processes, where X is the number of workers of its PyTorch dataloader)
  • 4x X dataloader processes feeding training samples from the CPU to the GPUs

Once a timeout is reached:

  • all srun processes stop
  • 1x main Python script stops
  • X dataloader processes stop

The remaining Python scripts keep running (and even continue to use GPU resources) until the SIGKILL is sent. While we could not reproducibly trigger the drained state on the fly, we decided that I should not rely on SIGKILL to bring the remaining job steps down.

The solution that worked on our cluster:

After a timeout or a requeue, a SIGTERM is sent to all job steps. While we don't want to call sys.exit(-1) on rank != 0 nodes as soon as the SIGUSR2 is sent (in order to give the rank 0 job step time to create a checkpoint first), we can prepare the job steps for the SIGTERM that follows. SIGTERM is triggered by Slurm after scontrol requeue <jobid> and is thus a strong indication that rank 0 is done with checkpointing, so we can safely call sys.exit() in the remaining job steps. A rough sketch of the idea is shown below.
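For illustration only, here is a minimal sketch of that idea (not the actual diff in this pull request; the handler name is made up): every job step registers a SIGTERM handler that simply exits, so the non-rank-0 steps shut down cleanly once the requeue happens instead of lingering until the SIGKILL.

import signal
import sys

def _exit_on_sigterm(signum, frame):
    # SIGTERM arrives after `scontrol requeue <jobid>`, i.e. after rank 0
    # has had its chance to checkpoint on SIGUSR2, so it is safe to leave now.
    sys.exit(0)

# Register the handler early in every job step (all ranks).
signal.signal(signal.SIGTERM, _exit_on_sigterm)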

With the proposed changes we observed that the jobs exit gracefully well before the 120 s SIGKILL limit. Additionally, we noticed that the time spent in the Slurm "completing" state is notably reduced.

Even though cluster setups can differ wildly, I decided to create a pull request to give feedback to the community (and I hope that the workaround and the information are useful for others).

@facebook-github-bot added the CLA Signed label Nov 8, 2024
erjel (Author) commented Mar 12, 2025

ping @jrapin

Any thoughts on this?

baldassarreFe (Contributor) commented Apr 7, 2025

Seems like an application issue that should be solved in the application and not in submitit.

What multiprocessing start method do you use for the dataloader processes? Do your jobs still fail to exit if you set this before doing any CUDA-related operation?

import multiprocessing
multiprocessing.set_start_method("forkserver", force=True)

Lengthier explanation: the dataloader processes should not hold any GPU resource, not even an unused CUDA context, or they might fail to die when receiving a signal. The default multiprocessing start method is fork, which is problematic because the child processes inherit the CUDA context. The alternative start method forkserver avoids this. If you check the output of nvidia-smi, you should see that only the "main" Python script holds any GPU resource.
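For completeness, here is a hypothetical, self-contained example of that suggestion (assuming a standard PyTorch DataLoader; the dataset and sizes are made up): the start method is set before anything touches CUDA, so the worker processes never inherit a CUDA context and can be terminated cleanly.

import multiprocessing

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Must happen before any CUDA call and before the DataLoader spawns workers.
    multiprocessing.set_start_method("forkserver", force=True)

    dataset = TensorDataset(torch.randn(1024, 16))
    loader = DataLoader(dataset, batch_size=32, num_workers=4)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for (batch,) in loader:
        batch = batch.to(device)  # only the main process touches the GPU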
