Project: Building and testing AF3 cluster deployment process
Account: Test account (dan.maclean.tsl@gmail.com)
Production Target: kamounlab account deployment
This guide documents a complete dry run of deploying an AlphaFold 3 cluster on Google Cloud Platform. The goal is to learn the process, identify issues, and create a tutorial for the actual production deployment.
GCP provides cloud computing resources (servers, storage, networking) that you rent on-demand. Instead of buying physical servers, you configure virtual resources that can scale up or down.
Critical concept: By default, new GCP projects have very low or zero quotas for expensive resources like GPUs.
You cannot create GPU VMs until quotas are approved; approval typically takes 2-5 business days.
Plan ahead: Request quotas early, then continue with other setup while waiting.
Infrastructure-as-code tool. You describe what resources you want (VMs, networks, storage) in configuration files, and Terraform creates them. This makes deployments repeatable and documentable.
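To make "infrastructure-as-code" concrete, here is a minimal, hypothetical Terraform resource definition (the resource and names are placeholders; the toolkit generates far more elaborate configs than this):

```hcl
# Hypothetical example: a VPC network declared as code.
# Terraform reads this file and creates (or updates) the matching GCP resource,
# so the same network can be recreated identically in any project.
resource "google_compute_network" "af3_vpc" {
  name                    = "af3-network" # placeholder name
  auto_create_subnetworks = false         # the toolkit defines its own subnets
}
```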
Google's tool for building HPC clusters on GCP. The gcluster command converts YAML configuration into Terraform code specifically designed for scientific computing workloads.
The deployment process has distinct phases:
Phase 1: Account setup
- Create/configure GCP project
- Set up billing and budget alerts
- Enable required APIs
- Request quotas (then wait 2-5 days)
Phase 2: Local environment
- Install gcloud CLI, Terraform, Git, Go
- Authenticate and configure tools
Phase 3: Cluster configuration
- Clone Cluster Toolkit
- Customize deployment settings
- Match configuration to quota requests
- Validate configuration
Phase 4: Terraform planning
- Generate Terraform configs
- Review what would be created
- Verify configuration is valid
Phase 5: Deployment
- Create network infrastructure
- Build custom VM image
- Deploy cluster
Phase 6: Testing
- Upload data and run test jobs
We complete Phases 1-4 in this dry run without spending credits or needing quota approval.
Document: tutorial_budget_creation.md
What you do:
- Create budget alerts in GCP console
- Set spending limits and notification thresholds
Why first: Prevents unexpected charges before doing anything else.
Document: tutorial_project_creation_and_billing.md
What you do:
- Create new GCP project
- Record project ID (critical - used everywhere)
- Link billing account
- Verify billing is active
Key learning: Must manually switch to new project after creation.
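For reference, the console steps above have `gcloud` equivalents. The sketch below prints them rather than running them, so it is safe to execute anywhere; the billing account ID is a placeholder, and the project ID is the dry-run one from this guide:

```shell
# gcloud equivalents of the console steps (printed for illustration, not run)
project="af3-cluster-dryrun"
echo "gcloud projects create $project"
echo "gcloud config set project $project"   # the manual 'switch to new project' step
echo "gcloud billing projects link $project --billing-account=XXXXXX-XXXXXX-XXXXXX"
```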
Document: tutorial_enabling_apis.md
What you do:
- Enable 10 required APIs via web console
- Each API must be enabled individually
- Wait for each to complete
Tricky parts:
- API names in console differ from gcloud names
- IAM search returns multiple results - select correct one
- Stackdriver API needed for toolkit validation
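The same APIs can be enabled from the CLI, where `gcloud` uses service IDs rather than the console display names (this is exactly why the names differ). A sketch that prints the enable commands for a subset of the ten services; the toolkit's exact list may differ, and "Stackdriver" corresponds to the logging/monitoring service IDs:

```shell
# Service IDs as used by gcloud (console display names differ).
# Illustrative subset only; the toolkit needs 10 APIs in total.
apis=(
  compute.googleapis.com
  storage.googleapis.com
  iam.googleapis.com
  logging.googleapis.com
  monitoring.googleapis.com
)
for api in "${apis[@]}"; do
  echo "gcloud services enable $api"   # printed, not run, for illustration
done
```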
Document: tutorial_quota_requests.md
What you do:
- Navigate to quota management
- Request A100 GPU quota (4 GPUs for dry run)
- Request A2 CPU quota
- Provide justification for use
Critical concepts:
- Must request BEFORE deployment - cannot create VMs without quota
- Takes 2-5 business days for approval
- RAM is bundled with machine types (no separate quota)
- Our choice: 80GB A100 GPUs, skip MSA computation
This is a blocking step - proceed with other tutorials while waiting.
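Because RAM and vCPUs are bundled with the machine type, the quota numbers follow directly from the node count. A sketch of the arithmetic, assuming a2-ultragpu-1g's published shape of 1x A100 80GB and 12 vCPUs per node (verify against current GCP machine-type docs before submitting):

```shell
# Quota arithmetic for 4 nodes of a2-ultragpu-1g (assumed: 1 GPU, 12 vCPU per node)
nodes=4
gpus_per_node=1
vcpus_per_node=12
gpu_quota=$((nodes * gpus_per_node))    # A100 80GB GPU quota to request
vcpu_quota=$((nodes * vcpus_per_node))  # A2 vCPU quota to request
echo "A100 80GB GPUs: $gpu_quota"
echo "A2 vCPUs:       $vcpu_quota"
```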
Document: tutorial_gcloud_installation.md
What you do:
- Install gcloud CLI via Homebrew
- Authenticate with Google account
- Configure default project and region
- Install Terraform, Git, Go
Why needed: Tools for managing GCP from your local machine.
Document: tutorial_application_credentials.md
What you do:
- Clone Cluster Toolkit repository
- Build the gcluster binary with `make`
- Set up Application Default Credentials (ADC)
Tricky part: ADC setup requires two terminals and careful scope selection in browser.
Document: tutorial_cluster_configuration.md
What you do:
- Edit `af3-slurm-deployment.yaml`
- Set project ID, region, and bucket names
- Configure GPU count (4 to match quota)
- Disable datapipeline (using precomputed MSAs)
- Skip database bucket (no template search)
Key validation: Configuration matches quota requests.
Document: tutorial_terraform_planning.md
What you do:
- Run `gcluster create` to generate Terraform configs
- Run `terraform init` to download providers
- Run `terraform plan` to see what would be created
What you learn:
- 22 network resources would be created
- Configuration is valid and ready for deployment
- No actual resources created during planning
This proves the configuration works without spending money or needing quotas.
Document: tutorial_bucket_and_upload.md
What you do:
- Create GCS bucket for AF3 model weights
- Upload weights from HPC storage
- Verify upload succeeded
- Test bucket access
Prerequisites:
- Quota approval is not needed for the bucket itself (create it anytime; only cluster deployment waits on quota)
- AF3 model weights accessible (from HPC isilon storage)
Key commands:
# Create bucket
gsutil mb -p af3-cluster-dryrun -l us-central1 gs://af3-dryrun-weights/
# Upload weights
gsutil cp /path/to/af3.bin gs://af3-dryrun-weights/
# Verify
gsutil ls -lh gs://af3-dryrun-weights/
Important: This step is required before deployment but can be done independently of quota approval.
Use case: Structure prediction with precomputed MSAs
Decisions made:
- Region: us-central1 (good GPU availability)
- GPUs: 4x A100 80GB (a2-ultragpu-1g)
- Skip datapipeline: 0 CPU nodes (have precomputed MSAs)
- Skip databases: No template search, MSAs only
- Buckets: Model weights only (af3-dryrun-weights)
Cost implications: ~$12-16/hour when 4 GPUs running, $0 when idle (autoscaling).
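The hourly figure is simply the per-GPU rate times the GPU count; a sketch assuming roughly $3-4 per A100 80GB per hour (rates vary by region and change over time, so check current pricing):

```shell
# Rough cost envelope for 4 running GPU nodes (assumed $3-4/GPU-hour)
gpus=4
low_rate=3
high_rate=4
low=$((gpus * low_rate))
high=$((gpus * high_rate))
echo "Running: \$${low}-\$${high}/hour"
echo "Idle:    \$0/hour (autoscaled to zero nodes)"
```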
Completed:
- ✅ Budget and alerts configured
- ✅ Project created (af3-cluster-dryrun)
- ✅ Billing linked
- ✅ 10 APIs enabled
- ✅ Quota requests submitted (4 A100 GPUs, A2 CPUs)
- ✅ Local tools installed (gcloud, terraform, git, go)
- ✅ Cluster Toolkit built
- ✅ ADC configured
- ✅ Cluster configured
- ✅ Terraform configs generated and validated
Waiting on:
- ⏳ Quota approval (2-5 business days)
Ready for deployment when quotas approved.
When quotas are approved and you run deployment:
- Environment (~5 minutes):
  - Creates VPC network
  - Creates subnet and firewall rules
  - Sets up NAT and routing
- Image Build (~30 minutes):
  - Builds custom VM image with Slurm and Apptainer
  - Installs AlphaFold 3 container
  - Only done once, reused for all VMs
- Cluster (~10 minutes):
  - Creates login node (SSH access, job submission)
  - Creates controller node (Slurm scheduler)
  - Configures autoscaling (0-4 GPU nodes)
  - Mounts GCS buckets
- Total deployment time: ~45 minutes
- Cluster ready: Submit AF3 jobs via Slurm
- `cluster-toolkit/examples/science/af3-slurm/af3-slurm-deployment.yaml` - Edited configuration
- `cluster-toolkit/examples/science/af3-slurm/af3-slurm-deployment.yaml.original` - Backup of original
- `cluster-toolkit/af3-slurm/` - Generated Terraform configs (not deployed)
- `CLAUDE.md` - Project overview and objectives
- `01_account_setup_plan.md` - Initial planning document
- `tutorial_*.md` - Step-by-step guides (see list above)
When ready to deploy to production account:
- Use these tutorials as the guide
- Adjust configuration:
- Different project ID
- Larger quota requests (based on scale needed)
- Production budget alerts ($200,000)
- Production bucket names
- Request quotas early - weeks before planned deployment
- Test with small job before scaling up
- Have research computing team review configuration
Critical steps that weren't obvious:
- Project Picker not clearly labeled in console
- Must manually switch to new project after creation
- APIs must be enabled even after project setup
- Stackdriver API needed for toolkit but not in initial list
- ADC requires all scopes selected (even when they look selected)
- Quotas are blocking - must request before deployment
Things that worked well:
- Budget creation is straightforward with guided dialogue
- gcloud CLI authentication is smooth
- Terraform planning validates configuration without cost
- Cluster Toolkit automates complex Slurm setup
Time investment:
- Account setup: ~2 hours active work
- Waiting for quotas: 2-5 business days
- Configuration and planning: ~1 hour
- Total active time: ~3 hours
- Total calendar time: 3-7 days (including quota wait)
During quota wait:
- Review AlphaFold 3 documentation
- Prepare model weights for upload
- Prepare test input data
- Review Slurm job submission process
When quotas approved:
- Create GCS bucket: `gsutil mb -p af3-cluster-dryrun -l us-central1 gs://af3-dryrun-weights/`
- Upload AF3 model weights to bucket
- Deploy: `./gcluster deploy af3-slurm`
- Test with single prediction
- Document deployment process
Last Updated: 2025-10-28
Status: Configuration complete, awaiting quota approval