AlphaFold | NCI at Frederick

AlphaFold is an AI system developed by Google DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment.

Documentation

AlphaFold home page on DeepMind
AlphaFold Protein Structure Database
AppDB link

Slurm scripts

In the original implementation, AlphaFold first runs multithreaded sequence searches (the MSA stage) on up to 8 CPUs and then runs model inference on the GPU. With this workflow, the allocated GPU(s) sit idle during the first stage and the CPUs are largely idle during the second, which wastes compute resources. Several national labs have examined the issue, and two new scripts were written to separate the two stages so that each job requests only the resources it uses.

Split analysis

Create the scripts alphafold_msa.sh and alphafold_predict.sh with the contents:

#!/bin/bash
#SBATCH --job-name=alphafold_msa
#SBATCH --ntasks=8
#SBATCH --time=8:00:00
#SBATCH --mem=128g
#SBATCH --partition=norm
module load alphafold
run_alphafold_msa.py \
  --fasta_paths=/scratch/cluster_scratch/onealdw/alphafold/pi3k.fa \
  --output_dir=/scratch/cluster_scratch/onealdw/alphafold/output \
  --model_preset=multimer \
  --num_multimer_predictions_per_model=2 \
  --db_preset=full \
  --data_dir=/mnt/alphafold/2.3.2 \
  --pdb_seqres_database_path=/mnt/alphafold/2.3.2/pdb_seqres/pdb_seqres.txt \
  --uniref30_database_path=/mnt/alphafold/2.3.2/uniref30/UniRef30_2021_06/UniRef30_2021_06 \
  --uniprot_database_path=/mnt/alphafold/2.3.2/uniprot/uniprot.fasta \
  --uniref90_database_path=/mnt/alphafold/2.3.2/uniref90/uniref90.fasta \
  --mgnify_database_path=/mnt/alphafold/2.3.2/mgnify/mgy_clusters_2022_05.fa \
  --template_mmcif_dir=/mnt/alphafold/2.3.2/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=/mnt/alphafold/2.3.2/pdb_mmcif/obsolete.dat \
  --bfd_database_path=/mnt/alphafold/2.3.2/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --max_template_date=2020-05-14 \
  --run_relax=false \
  --use_gpu_relax=false

and

#!/bin/bash
#SBATCH --job-name=alphafold_predict
#SBATCH --ntasks=8
#SBATCH --time=8:00:00
#SBATCH --mem=128g
#SBATCH --partition=gpu
#SBATCH --gres=gpu:v100:1
module load alphafold
run_alphafold_predict.py \
  --fasta_paths=/scratch/cluster_scratch/onealdw/alphafold/pi3k.fa \
  --output_dir=/scratch/cluster_scratch/onealdw/alphafold/output \
  --use_precomputed_msas=true \
  --model_preset=multimer \
  --num_multimer_predictions_per_model=2 \
  --db_preset=full \
  --data_dir=/mnt/alphafold/2.3.2 \
  --pdb_seqres_database_path=/mnt/alphafold/2.3.2/pdb_seqres/pdb_seqres.txt \
  --uniref30_database_path=/mnt/alphafold/2.3.2/uniref30/UniRef30_2021_06/UniRef30_2021_06 \
  --uniprot_database_path=/mnt/alphafold/2.3.2/uniprot/uniprot.fasta \
  --uniref90_database_path=/mnt/alphafold/2.3.2/uniref90/uniref90.fasta \
  --mgnify_database_path=/mnt/alphafold/2.3.2/mgnify/mgy_clusters_2022_05.fa \
  --template_mmcif_dir=/mnt/alphafold/2.3.2/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=/mnt/alphafold/2.3.2/pdb_mmcif/obsolete.dat \
  --bfd_database_path=/mnt/alphafold/2.3.2/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --max_template_date=2020-05-14 \
  --run_relax=true \
  --use_gpu_relax=true
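
Both scripts repeat the same nine database options. As a sketch only (this is not part of the installed alphafold module), the shared options could be generated from a single base directory so that a version change is made in one place. The function name alphafold_db_flags and the variable AF_DATA are hypothetical; the paths are copied from the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: base directory of the AlphaFold databases,
# taken from the paths used in the scripts above.
AF_DATA=/mnt/alphafold/2.3.2

# Print the database options shared by both scripts, one per line.
# An optional first argument overrides the base directory.
alphafold_db_flags() {
  local d=${1:-$AF_DATA}
  printf '%s\n' \
    "--data_dir=$d" \
    "--pdb_seqres_database_path=$d/pdb_seqres/pdb_seqres.txt" \
    "--uniref30_database_path=$d/uniref30/UniRef30_2021_06/UniRef30_2021_06" \
    "--uniprot_database_path=$d/uniprot/uniprot.fasta" \
    "--uniref90_database_path=$d/uniref90/uniref90.fasta" \
    "--mgnify_database_path=$d/mgnify/mgy_clusters_2022_05.fa" \
    "--template_mmcif_dir=$d/pdb_mmcif/mmcif_files" \
    "--obsolete_pdbs_path=$d/pdb_mmcif/obsolete.dat" \
    "--bfd_database_path=$d/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
}

# Example use inside a batch script (left commented out here):
# run_alphafold_msa.py $(alphafold_db_flags) --fasta_paths=... --output_dir=...
```

The unquoted command substitution relies on the paths containing no spaces, which holds for the paths shown here.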

The two scripts must be submitted so that the second does not start until the first has completed. Submit the first script with sbatch alphafold_msa.sh. The command returns a message such as
Submitted batch job 40164997
Using this job number in place of ########, submit the second script with sbatch --dependency=afterok:######## alphafold_predict.sh

The first script will be queued immediately and the second will be started only after the first finishes successfully.
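
Copying the job number by hand can be avoided. The following is a minimal sketch, assuming the sbatch on this cluster supports the standard Slurm option --parsable, which prints only the job ID (optionally followed by a semicolon and the cluster name). The function name submit_split is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: submit the MSA job, then submit the prediction job
# so that it starts only if the MSA job finishes successfully.
submit_split() {
  local msa_script=$1 predict_script=$2
  local msa_id

  # --parsable makes sbatch print "<jobid>" or "<jobid>;<cluster>"
  msa_id=$(sbatch --parsable "$msa_script") || return 1

  # Drop the optional ";<cluster>" suffix, keeping only the job ID
  msa_id=${msa_id%%;*}

  # afterok: run the second job only after the first exits with status 0
  sbatch --dependency=afterok:"$msa_id" "$predict_script"
}

# Example use (left commented out here):
# submit_split alphafold_msa.sh alphafold_predict.sh
```

If the first submission fails, the function returns without submitting the second job.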

Original single-script analysis

Create the script alphafold.sh with the contents:

#!/bin/bash
#SBATCH --job-name=alphafold
#SBATCH --ntasks=8
#SBATCH --time=8:00:00
#SBATCH --mem=128g
#SBATCH --partition=gpu
#SBATCH --gres=gpu:v100:1
module load alphafold
run_alphafold.py \
  --fasta_paths=/scratch/cluster_scratch/onealdw/alphafold/pi3k.fa \
  --output_dir=/scratch/cluster_scratch/onealdw/alphafold/output \
  --model_preset=multimer \
  --num_multimer_predictions_per_model=2 \
  --data_dir=/mnt/alphafold/2.3.2 \
  --pdb_seqres_database_path=/mnt/alphafold/2.3.2/pdb_seqres/pdb_seqres.txt \
  --uniref30_database_path=/mnt/alphafold/2.3.2/uniref30/UniRef30_2021_06/UniRef30_2021_06 \
  --uniprot_database_path=/mnt/alphafold/2.3.2/uniprot/uniprot.fasta \
  --uniref90_database_path=/mnt/alphafold/2.3.2/uniref90/uniref90.fasta \
  --mgnify_database_path=/mnt/alphafold/2.3.2/mgnify/mgy_clusters_2022_05.fa \
  --template_mmcif_dir=/mnt/alphafold/2.3.2/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=/mnt/alphafold/2.3.2/pdb_mmcif/obsolete.dat \
  --bfd_database_path=/mnt/alphafold/2.3.2/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --max_template_date=2020-05-14 \
  --use_gpu_relax=true

The script can then be submitted with the command sbatch alphafold.sh