TeamMacLean/google_cloud_compute_setup_documentation

AlphaFold 3 on Google Cloud - Dry Run Guide

Project: Building and testing AF3 cluster deployment process
Account: Test account (dan.maclean.tsl@gmail.com)
Production Target: kamounlab account deployment


Purpose

This guide documents a complete dry run of deploying an AlphaFold 3 cluster on Google Cloud Platform. The goal is to learn the process, identify issues, and create a tutorial for the actual production deployment.


Key Concepts

What is Google Cloud Platform (GCP)?

GCP provides cloud computing resources (servers, storage, networking) that you rent on-demand. Instead of buying physical servers, you configure virtual resources that can scale up or down.

Why Quotas Matter

Critical concept: By default, new GCP projects have very low or zero quotas for expensive resources like GPUs.

You cannot create GPU VMs until quotas are approved, and approval typically takes 2-5 business days.

Plan ahead: Request quotas early, then continue with other setup while waiting.

What is Terraform?

Terraform is an infrastructure-as-code tool. You describe the resources you want (VMs, networks, storage) in configuration files, and Terraform creates them. This makes deployments repeatable and documentable.

What is the Cluster Toolkit?

The Cluster Toolkit is Google's tool for building HPC clusters on GCP. Its gcluster command converts a YAML configuration into Terraform code designed for scientific computing workloads.


Workflow Overview

The deployment process has distinct phases:

Phase 1: Account Setup (No Quotas Needed)

  • Create/configure GCP project
  • Set up billing and budget alerts
  • Enable required APIs
  • Request quotas (then wait 2-5 days)

Phase 2: Local Environment Setup (No Quotas Needed)

  • Install gcloud CLI, Terraform, Git, Go
  • Authenticate and configure tools
  • Clone Cluster Toolkit

Phase 3: Configuration (No Quotas Needed)

  • Customize deployment settings
  • Match configuration to quota requests
  • Validate configuration

Phase 4: Planning (No Quotas Needed)

  • Generate Terraform configs
  • Review what would be created
  • Verify configuration is valid

Phase 5: Deployment (Requires Approved Quotas)

  • Create network infrastructure
  • Build custom VM image
  • Deploy cluster
  • Upload data and run test jobs

We complete Phases 1-4 in this dry run without spending credits or needing quota approval.


Tutorial Documents (In Order)

1. Budget and Project Setup

Document: tutorial_budget_creation.md

What you do:

  • Create budget alerts in GCP console
  • Set spending limits and notification thresholds

Why first: Prevents unexpected charges before doing anything else.


2. Project Creation and Billing

Document: tutorial_project_creation_and_billing.md

What you do:

  • Create new GCP project
  • Record project ID (critical - used everywhere)
  • Link billing account
  • Verify billing is active

Key learning: Must manually switch to new project after creation.


3. Enable Required APIs

Document: tutorial_enabling_apis.md

What you do:

  • Enable 10 required APIs via web console
  • Each API must be enabled individually
  • Wait for each to complete

Tricky parts:

  • API names in console differ from gcloud names
  • IAM search returns multiple results - select correct one
  • Stackdriver API needed for toolkit validation
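
Once the APIs are enabled in the console, the result can be cross-checked from the command line. The sketch below assumes the gcloud CLI is already installed and authenticated (covered in step 5). `missing_apis` is a hypothetical helper, and the three service names are only examples of gcloud-style names, not the full list of 10.

```shell
# missing_apis: hypothetical helper. Reads enabled service names (one per line)
# on stdin and prints each required service (given as arguments) that is absent.
missing_apis() {
  enabled="$(cat)"
  for api in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string
    if ! printf '%s\n' "$enabled" | grep -qxF "$api"; then
      printf '%s\n' "$api"
    fi
  done
}

# Example usage (requires gcloud; uncomment to run against the active project):
# gcloud services list --enabled --format="value(config.name)" \
#   | missing_apis compute.googleapis.com iam.googleapis.com stackdriver.googleapis.com
```

Empty output means every listed service is enabled. Any name that is printed still needs enabling.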

4. Request Resource Quotas

Document: tutorial_quota_requests.md

What you do:

  • Navigate to quota management
  • Request A100 GPU quota (4 GPUs for dry run)
  • Request A2 CPU quota
  • Provide justification for use

Critical concepts:

  • Must request BEFORE deployment - cannot create VMs without quota
  • Takes 2-5 business days for approval
  • RAM is bundled with machine types (no separate quota)
  • Our choice: 80GB A100 GPUs, skip MSA computation

This is a blocking step - proceed with other tutorials while waiting.
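
While waiting, approval can be polled from the command line instead of the console. This is a minimal sketch: `has_quota` is a hypothetical helper, the metric names `NVIDIA_A100_80GB_GPUS` and `A2_CPUS` are assumptions to confirm against your own region's output, and the figure of 48 CPUs assumes 12 vCPUs per a2-ultragpu-1g node across 4 nodes.

```shell
# has_quota: hypothetical helper. Reads "METRIC LIMIT" lines on stdin and
# succeeds only if METRIC ($1) is present with a limit of at least NEEDED ($2).
has_quota() {
  awk -v metric="$1" -v needed="$2" '
    $1 == metric { found = 1; if ($2 + 0 >= needed + 0) ok = 1 }
    END { exit ((found && ok) ? 0 : 1) }
  '
}

# Example usage (requires gcloud). Metric names are assumptions; confirm them
# against the actual output for your region:
# gcloud compute regions describe us-central1 --flatten="quotas[]" \
#   --format="value(quotas.metric,quotas.limit)" > quotas.txt
# has_quota NVIDIA_A100_80GB_GPUS 4 < quotas.txt && echo "GPU quota approved"
# has_quota A2_CPUS 48 < quotas.txt && echo "A2 CPU quota approved"
```

The helper fails both when the limit is too low and when the metric is missing, so a mistyped metric name never reads as an approval.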


5. Install Local Tools

Document: tutorial_gcloud_installation.md

What you do:

  • Install gcloud CLI via Homebrew
  • Authenticate with Google account
  • Configure default project and region
  • Install Terraform, Git, Go

Why needed: Tools for managing GCP from your local machine.
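
A quick way to confirm the four tools are installed before moving on is sketched below. `missing_tools` is a hypothetical helper and only checks that each command is on PATH, not its version.

```shell
# missing_tools: hypothetical helper. Prints each named command that is not
# found on PATH; prints nothing when every tool is installed.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing_tools gcloud terraform git go
```

With everything installed, the default project and region for this dry run can then be set with `gcloud config set project af3-cluster-dryrun` and `gcloud config set compute/region us-central1`.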


6. Build Cluster Toolkit and Set Credentials

Document: tutorial_application_credentials.md

What you do:

  • Clone Cluster Toolkit repository
  • Build gcluster binary with make
  • Set up Application Default Credentials (ADC)

Tricky part: ADC setup requires two terminals and careful scope selection in browser.
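
The steps above can be sketched as two small functions. The function names are hypothetical, and the repository URL is assumed to be the current public home of the Cluster Toolkit. Nothing runs until the functions are called.

```shell
# build_toolkit: hypothetical wrapper around the clone-and-build steps.
# Requires git, make, and Go on PATH, plus network access.
build_toolkit() {
  git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git || return 1
  cd cluster-toolkit || return 1
  make || return 1
  # Confirms the binary was built and is runnable
  ./gcluster --version
}

# setup_adc: hypothetical wrapper. Opens a browser to create Application
# Default Credentials; select every requested scope on the consent screen.
setup_adc() {
  gcloud auth application-default login
}

# Run the steps explicitly when ready:
# build_toolkit && setup_adc
```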


7. Configure the Cluster

Document: tutorial_cluster_configuration.md

What you do:

  • Edit af3-slurm-deployment.yaml
  • Set project ID, region, bucket names
  • Configure GPU count (4 to match quota)
  • Disable datapipeline (using precomputed MSAs)
  • Skip database bucket (no template search)

Key validation: Configuration matches quota requests.
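
Before editing the deployment file, it helps to keep a pristine copy, which is how the .original backup listed under File Reference is used. The following is a sketch of one way to do that safely. `backup_once` is a hypothetical helper, not part of the toolkit.

```shell
# backup_once: hypothetical helper. Copies FILE to FILE.original unless that
# backup already exists, so the pristine copy is never overwritten.
backup_once() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi
  if [ -e "$file.original" ]; then
    echo "backup already exists: $file.original"
  else
    cp -p "$file" "$file.original" && echo "created $file.original"
  fi
}

# Example (run from the cluster-toolkit directory):
# backup_once examples/science/af3-slurm/af3-slurm-deployment.yaml
```

After editing, `diff -u` between the .original file and the edited file shows exactly which settings changed, which makes the comparison against the quota request easier.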


8. Generate and Review Terraform Plan

Document: tutorial_terraform_planning.md

What you do:

  • Run gcluster create to generate terraform configs
  • Run terraform init to download providers
  • Run terraform plan to see what would be created

What you learn:

  • 22 network resources would be created
  • Configuration is valid and ready for deployment
  • No actual resources created during planning

This proves the configuration works without spending money or needing quotas.
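
The three commands might be chained as in the commented usage below, with a small helper that pulls the resource count out of the plan summary. This is a sketch under assumptions: `plan_adds` is a hypothetical helper, and the blueprint file name af3-slurm.yaml and the environment group directory are guesses about the toolkit layout that should be checked locally.

```shell
# plan_adds: hypothetical helper. Reads `terraform plan -no-color` output on
# stdin and prints the number of resources Terraform would add.
plan_adds() {
  sed -n 's/^Plan: \([0-9][0-9]*\) to add.*/\1/p'
}

# Example usage, run from the cluster-toolkit directory. The blueprint file
# name and the "environment" group directory are assumptions:
# ./gcluster create examples/science/af3-slurm/af3-slurm.yaml \
#   -d examples/science/af3-slurm/af3-slurm-deployment.yaml
# terraform -chdir=af3-slurm/environment init
# terraform -chdir=af3-slurm/environment plan -no-color | plan_adds
```

If the helper prints nothing, the plan output contained no "Plan:" summary line, for example because Terraform reported no changes or stopped with an error.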


9. Create Bucket and Upload Data

Document: tutorial_bucket_and_upload.md

What you do:

  • Create GCS bucket for AF3 model weights
  • Upload weights from HPC storage
  • Verify upload succeeded
  • Test bucket access

Prerequisites:

  • Quota approval is not needed for this step (the bucket can be created at any time; only deployment waits on quota)
  • AF3 model weights accessible (from HPC isilon storage)

Key commands:

# Create bucket
gsutil mb -p af3-cluster-dryrun -l us-central1 gs://af3-dryrun-weights/

# Upload weights
gsutil cp /path/to/af3.bin gs://af3-dryrun-weights/

# Verify
gsutil ls -lh gs://af3-dryrun-weights/

Important: This step is required before deployment but can be done independently of quota approval.


Our Configuration Choices

Use case: Structure prediction with precomputed MSAs

Decisions made:

  • Region: us-central1 (good GPU availability)
  • GPUs: 4x A100 80GB (a2-ultragpu-1g)
  • Skip datapipeline: 0 CPU nodes (have precomputed MSAs)
  • Skip databases: No template search, MSAs only
  • Buckets: Model weights only (af3-dryrun-weights)

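Cost implications: ~$12-16/hour while all 4 GPU nodes are running. GPU nodes cost $0 when idle because autoscaling scales them down to zero; the login and controller nodes created at deployment keep running.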
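
To turn the hourly range into a figure for budget alerts, multiply it by the expected run time. The sketch below uses a hypothetical `estimate_cost` helper and an illustrative 8-hour run. The duration is an assumption, not a measured job length.

```shell
# estimate_cost: hypothetical helper. Multiplies an hourly rate (USD) by a
# number of hours and prints the total with two decimal places.
estimate_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Low and high estimates for an 8-hour run with all 4 GPU nodes busy,
# using the ~$12-16/hour range from this guide.
echo "low:  \$$(estimate_cost 12 8)"
echo "high: \$$(estimate_cost 16 8)"
```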


Current Status

Completed:

  • ✅ Budget and alerts configured
  • ✅ Project created (af3-cluster-dryrun)
  • ✅ Billing linked
  • ✅ 10 APIs enabled
  • ✅ Quota requests submitted (4 A100 GPUs, A2 CPUs)
  • ✅ Local tools installed (gcloud, terraform, git, go)
  • ✅ Cluster Toolkit built
  • ✅ ADC configured
  • ✅ Cluster configured
  • ✅ Terraform configs generated and validated

Waiting on:

  • ⏳ Quota approval (2-5 business days)

Ready for deployment when quotas approved.


What Happens at Deployment

When quotas are approved and you run deployment:

  1. Environment (~5 minutes):

    • Creates VPC network
    • Creates subnet and firewall rules
    • Sets up NAT and routing
  2. Image Build (~30 minutes):

    • Builds custom VM image with Slurm and Apptainer
    • Installs AlphaFold 3 container
    • Only done once, reused for all VMs
  3. Cluster (~10 minutes):

    • Creates login node (SSH access, job submission)
    • Creates controller node (Slurm scheduler)
    • Configures autoscaling (0-4 GPU nodes)
    • Mounts GCS buckets
Total deployment time: ~45 minutes (5 + 30 + 10).

Once deployment finishes, the cluster is ready and AF3 jobs can be submitted via Slurm.


File Reference

Configuration Files

  • cluster-toolkit/examples/science/af3-slurm/af3-slurm-deployment.yaml - Edited configuration
  • cluster-toolkit/examples/science/af3-slurm/af3-slurm-deployment.yaml.original - Backup of original

Generated Files

  • cluster-toolkit/af3-slurm/ - Generated terraform configs (not deployed)

Documentation

  • CLAUDE.md - Project overview and objectives
  • 01_account_setup_plan.md - Initial planning document
  • tutorial_*.md - Step-by-step guides (see list above)

For Production Deployment (kamounlab)

When ready to deploy to production account:

  1. Use these tutorials as the guide
  2. Adjust configuration:
    • Different project ID
    • Larger quota requests (based on scale needed)
    • Production budget alerts ($200,000)
    • Production bucket names
  3. Request quotas early - weeks before planned deployment
  4. Test with small job before scaling up
  5. Have research computing team review configuration

Key Learnings

Critical steps that weren't obvious:

  • Project Picker not clearly labeled in console
  • Must manually switch to new project after creation
  • APIs must be enabled even after project setup
  • Stackdriver API needed for toolkit but not in initial list
  • ADC requires all scopes selected (even when they look selected)
  • Quotas are blocking - must request before deployment

Things that worked well:

  • Budget creation is straightforward with guided dialogue
  • gcloud CLI authentication is smooth
  • Terraform planning validates configuration without cost
  • Cluster Toolkit automates complex Slurm setup

Time investment:

  • Account setup: ~2 hours active work
  • Waiting for quotas: 2-5 business days
  • Configuration and planning: ~1 hour
  • Total active time: ~3 hours
  • Total calendar time: 3-7 days (including quota wait)

Next Steps

During quota wait:

  • Review AlphaFold 3 documentation
  • Prepare model weights for upload
  • Prepare test input data
  • Review Slurm job submission process

When quotas approved:

  • Create GCS bucket: gsutil mb -p af3-cluster-dryrun -l us-central1 gs://af3-dryrun-weights/
  • Upload AF3 model weights to bucket
  • Deploy: ./gcluster deploy af3-slurm
  • Test with single prediction
  • Document deployment process

Last Updated: 2025-10-28
Status: Configuration complete, awaiting quota approval
