Gracefully exit rank != 0 job steps on slurm cluster #1780
Hi,

first, thanks for open sourcing submitit! It simplifies the setup of distributed cluster jobs on our in-house slurm cluster a lot.

Some context:
While requeuing 32-node jobs, I got a message from our infrastructure team that my jobs were causing nodes to get stuck in the slurm "drained" state. Even worse: the requeuing itself worked fine, so my job brought down the first 32 nodes and restarted on 32 different nodes (which it probably would have "drained" as well). This would eventually have brought down our entire GPU partition.
According to the infrastructure team, nodes go into the "drained" state when jobs keep running on the cluster after the final `SIGKILL` was sent (i.e. when the timeout is reached or when the job is requeued). As far as I understand, the timing of the `SIGKILL` depends on the individual slurm settings. On our cluster there is a very liberal 120 sec delay between timeout and signal (apparently the default is about 30 sec); a sketch of how to check this setting is included right below.
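Side note on the timing: I believe the grace period before the final `SIGKILL` is governed by slurm's `KillWait` setting (which defaults to 30 sec), but treat that as an assumption on my side and double-check with your admins, since clusters can configure this differently. A quick way to inspect it:

```python
# Hedged sketch: read slurm's kill grace period from the cluster configuration.
# Assumes `scontrol` is on PATH and that `KillWait` is the relevant knob on
# your setup (an assumption on my side).
import subprocess


def slurm_kill_wait() -> str:
    config = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in config.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "KillWait":
            return value.strip()  # e.g. "30 sec"
    return "unknown"


if __name__ == "__main__":
    print("KillWait:", slurm_kill_wait())
```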
For debugging, we had a look at the running processes on one of the nodes (4 GPUs each). Normally there are multiple processes running on such a node. Once a timeout is reached, the remaining python scripts keep running (and even continue utilizing GPU resources) until `SIGKILL` is sent. While we could not reproducibly trigger the drained state on-the-fly, we decided that I should not rely on `SIGKILL` to bring my remaining job steps down.

The solution which worked on our cluster:
After a timeout or a requeue, a `SIGTERM` is sent to all job steps. While we don't want to call `sys.exit(-1)` on rank != 0 nodes as soon as `SIGUSR2` is sent (i.e. in order to give the rank 0 job step time to create a checkpoint first), we can prepare the job steps for the following `SIGTERM`. `SIGTERM` is triggered by slurm after `scontrol requeue <jobid>` and is thus a strong indication that rank 0 is done with checkpointing, so we can safely call `sys.exit()` in the remaining job steps (a minimal sketch of the idea is at the end of this description).

With the proposed changes we could observe that the jobs gracefully exit well before the 120 sec `SIGKILL` time limit. Additionally, we noticed that the time spent in the slurm "completing" state is notably reduced.

Even though cluster setups can be wildly different, I decided to create a pull request in order to give feedback to the community (and I hope that the workaround and the information are useful for others).
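For reference, here is a minimal sketch of the idea, not the exact diff in this PR: rank != 0 job steps install a `SIGTERM` handler that exits gracefully, while rank 0 is left untouched so it can still checkpoint on `SIGUSR2`. Reading the rank from `SLURM_PROCID` and the function names are illustrative assumptions; with submitit you would typically get the rank from its job environment instead.

```python
# Minimal sketch (illustrative, not the exact PR code): make rank != 0 job
# steps exit cleanly on SIGTERM instead of lingering until SIGKILL.
import os
import signal
import sys
import time


def _exit_on_sigterm(signum, frame):
    # SIGTERM arrives after `scontrol requeue <jobid>` (or at timeout), i.e.
    # after rank 0 had its chance to checkpoint, so it is safe to leave now.
    rank = os.environ.get("SLURM_PROCID", "?")
    print(f"[rank {rank}] received SIGTERM, exiting gracefully", flush=True)
    sys.exit(0)


def main():
    rank = int(os.environ.get("SLURM_PROCID", "0"))  # assumption: rank via env var
    if rank != 0:
        signal.signal(signal.SIGTERM, _exit_on_sigterm)

    # Placeholder for the actual work; rank 0 additionally handles SIGUSR2
    # (checkpoint + requeue), e.g. via submitit's checkpointing mechanism.
    while True:
        time.sleep(10)


if __name__ == "__main__":
    main()
```

The key point, as described above, is the timing: nothing exits on `SIGUSR2`, so rank 0 can still create its checkpoint; the exit only happens on the `SIGTERM` that slurm sends afterwards.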