Add slurm workers for calibration end-to-end test #3461

Open
wants to merge 1 commit into
base: main

nefrathenrici
Member

@nefrathenrici nefrathenrici commented Dec 2, 2024

This PR updates the calibration end-to-end test to use Distributed.jl and the updated ClimaCalibrate with task-based parallelism. The two main files changed are calibration/model_interface.jl and calibration/test/e2e_test.jl.

ClimaCalibrate v0.0.6 has three changes relevant to this PR:

  • Requires only forward_model instead of set_up_forward_model and run_forward_model
  • addprocs(SlurmManager(n)) can be used to acquire Slurm workers
  • Adds WorkerBackend to distribute forward_model runs across Julia workers
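
For readers unfamiliar with Distributed.jl, the worker pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the ClimaCalibrate implementation: `toy_forward_model` is a hypothetical stand-in for a real forward model, and local workers from `addprocs(2)` are assumed in place of Slurm workers so the snippet runs without a cluster.

```julia
using Distributed

# Assumption: local workers stand in for Slurm workers. On a cluster, the
# `addprocs(SlurmManager(n))` call mentioned above would replace this line.
addprocs(2)

# Hypothetical stand-in for a forward model. `@everywhere` defines it on
# the main process and on every worker so that any worker can run it.
@everywhere toy_forward_model(iteration, member) = 100 * iteration + member

iteration = 1
members = 1:4

# `pmap` hands each (iteration, member) pair to the next free worker and
# returns the results in the same order as `members`.
results = pmap(toy_forward_model, fill(iteration, length(members)), members)

# Release the workers once all ensemble members have finished.
rmprocs(workers())
```

Because the workers persist between calls, the same pool can be reused for each calibration iteration rather than submitting a new job per ensemble member.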

Other changes:

  • Removed the Calibration test GHA, since it does not have a clear purpose
  • Removed prior.toml and moved its contents into the test script
  • Added *.out to the gitignore because the worker output goes to .out log files

@nefrathenrici nefrathenrici force-pushed the ne/slurm_workers branch 3 times, most recently from 8957ff2 to 8cbe910 Compare December 2, 2024 18:41
@szy21
Member

szy21 commented Dec 3, 2024

I'm not sure if I'm the best person to review this. And I think Charlie is out? Maybe some other people who are more familiar with calibration can take a look?

Member

@Sbozzolo Sbozzolo left a comment

Can you add some docs (inline comments would be fine) to give some context to those who read this file and are not familiar with ClusterManagers?

calibration/test/Project.toml (outdated, resolved)
@nefrathenrici nefrathenrici force-pushed the ne/slurm_workers branch 5 times, most recently from 33a96d3 to a20a84c Compare December 18, 2024 00:54
Member

@Sbozzolo Sbozzolo left a comment

Is there documentation about the WorkerBackend?

@nefrathenrici
Member Author

nefrathenrici commented Dec 19, 2024

> Is there documentation about the WorkerBackend?

There is no documentation on the WorkerBackend at the moment. I am planning on updating the ClimaCalibrate as part of this PR, but I can add some more information in this current PR as well.

Successfully merging this pull request may close these issues.

Add example calibration demonstrating persistent Slurm workers
3 participants