
Alphafold and Colabfold

Alphafold is a novel program for protein structure prediction that uses a neural network running on GPUs to predict protein structures, with accuracy comparable to that of laborious manual structure simulations. Colabfold uses Alphafold, but replaces the time consuming database searches with much faster, though less accurate, alternatives.

Alphafold

Alphafold consists of two major steps. The first is the genetic database search for the amino acid sequence defined by the input fasta file; it runs entirely on CPUs, is very I/O intensive, does not parallelize well, and does not utilize GPUs. The second step uses the pre-trained neural network, coupled with a molecular dynamics refinement, to produce the 3D protein structure in the form of a PDB file. This step uses GPUs for the neural network inference, and optionally for the molecular dynamics relaxation. Therefore the scarce GPU resource is utilized by only a part of the workflow.

To make things worse, the database search step runs very slowly when the genetic databases are located on network mounted storage, which is the most commonly used storage on the CHPC clusters. We investigated the performance of the database search on all CHPC network file systems, and none provides acceptable results - a small protein sequence search took 8-16 hours to complete, depending on the file system used. The best alternative is to create a RAM disk on the node where the Alphafold simulation runs, and copy the small databases and the indices of the large databases onto this RAM disk for fast access. This brings the database search in the aforementioned example down to 40 minutes.

We have created a script that creates the RAM disk and copies the databases to it, located at /uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232.sh. Note that the databases on the RAM disk occupy ~25 GB, so request the memory for the job accordingly. The file copy is fairly fast, taking about a minute. Also make sure that the databases are removed from the RAM disk at the end of the job. Below we provide shell commands and SLURM scripts that do this.
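
For reference, the script copies the databases into /tmp/$SLURM_JOBID (the same location used in the examples below), so a quick check of what is on the RAM disk, and the cleanup at the end of the job, can look like this:

# list the databases copied to the RAM disk and check their total size
ls /tmp/$SLURM_JOBID
du -sh /tmp/$SLURM_JOBID
# remove the RAM disk databases when the job is done
rm -rf /tmp/$SLURM_JOBID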

Also, since some databases are on the RAM disk and some on the network mounted storage, Alphafold must be run with options that reflect the database locations. Since we package the Alphafold distribution in a Singularity container, the Alphafold launch command gets even more complicated; this is all shown in the examples below. To simplify things, we have created several wrapper scripts that shorten the launch command.

The second, neural network / MD part takes 12 minutes in our example on a 1080ti GPU, while it would take 3.5 hours on 16 CPU cores of the notchpeak-shared-short partition.

As the above timings show, out of the 52 minutes the job ran, the GPU was utilized for only 12 minutes. For this reason, we have modified the Alphafold source to run the CPU intensive and GPU intensive parts as separate jobs. The first job does the database search on CPUs only, utilizing the protein sequence databases on the RAM disk. The second job runs the GPU intensive neural network part, which needs neither many CPUs nor the RAM disk.
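
The same two-job pattern can also be set up by hand with a SLURM job dependency; a minimal sketch, using placeholder script names (the complete CHPC scripts are shown further below):

# submit the CPU database search step and capture its job ID
JOBID=$(sbatch --parsable alphafold_cpu_step.slr)
# submit the GPU step so it starts only after the CPU step finishes successfully
sbatch -d afterok:$JOBID alphafold_gpu_step.slr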

Colabfold (see below) is a reasonable alternative; it uses a different database search engine that is less detailed but much faster.

Running Alphafold interactively

When learning to use Alphafold, or when setting up a new type of simulation, we recommend using the notchpeak-shared-short interactive queue, as that leads to quicker turnaround if errors are encountered. Once you have done that, create SLURM scripts as shown in the next section, which allow you to run both the CPU and GPU parts via a single job submission.

First we submit an interactive job on the notchpeak cluster, asking for 16 CPUs and 128 GB of memory to run the database search. The database search is memory intensive, and we also need extra memory for the RAM disk databases in order to get better performance.

salloc -N 1 -n 16 --mem=128G -p notchpeak-shared-short -A notchpeak-shared-short -t 8:00:00

Then we load the Alphafold module and copy the databases to /tmp, which is a RAM disk:

ml alphafold/2.3.2
/uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232.sh

Now we are ready to run Alphafold. This can be done either with the run_alphafold.sh command, which requires explicit listing of the database location parameters, or with the run_alphafold_full.sh command, which already defines the database locations, so only the additional runtime parameters need to be listed. These include the user supplied parameters such as the FASTA input file name, which we define through the bash shell environment variable FASTA_FILE, and the output directory, defined by the OUTPUT_DIR variable:

export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_out_dir
export SCRDB=/scratch/general/vast/app-repo/alphafold
export TMPDB=/tmp/$SLURM_JOBID
# run_alphafold.sh is an alias defined in the modulefile, which requires listing the appropriate database paths
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1

Notice the use of the --run_feature=1 option, which tells the program to run only the database search and save a file called features.pkl containing the database search results for each fasta file. Once this file is written, the first step is finished and we can end this job.
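
To verify that the first step completed, check for the features.pkl file; Alphafold writes it into a subdirectory of the output directory named after the fasta target, so something like the following should list it:

ls -lh $OUTPUT_DIR/*/features.pkl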

For the reduced databases, which are useful for larger sequences or multimers, use the run_alphafold_red.sh command, which points to these reduced databases.

Second, we submit an interactive job on the notchpeak cluster asking for 4 CPUs and one GPU to run the GPU intensive part. This part does not need many CPUs and uses less memory, though for larger sequences one may need to ask for more than the 16 GB that is the default for 4 CPUs on notchpeak-shared-short.

salloc -N 1 -n 4 -p notchpeak-shared-short -A notchpeak-shared-short -t 8:00:00 --gres=gpu:1080ti
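
If a larger sequence needs more than the default memory, add the --mem option to the salloc command, for example:

salloc -N 1 -n 4 --mem=32G -p notchpeak-shared-short -A notchpeak-shared-short -t 8:00:00 --gres=gpu:1080ti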

Run the second part of Alphafold as:

export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_out_dir
export SCRDB=/scratch/general/vast/app-repo/alphafold
export TMPDB=/scratch/general/vast/app-repo/alphafold
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01

Notice that we use --use_gpu_relax to run the molecular dynamics (MD) relaxation on a GPU. We have noticed that for small structures the CPU relaxation is faster, while for larger structures the GPU is faster. Since we ask for a smaller CPU count to run the GPU intensive part, we choose to run the MD on the GPU. To see all the runtime options, run run_alphafold.sh --help.
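
If you instead want to run the relaxation on the CPU (e.g. for a small structure), the GPU relaxation can be turned off; --use_gpu_relax is a boolean flag, so a command along these lines should work:

run_alphafold_full.sh --use_gpu_relax=false --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01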

The launch command above is for a monomer; for a multimer, some of the database parameters are different. Notice that we are also using the reduced databases:

/uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232_reduced.sh
run_alphafold_red.sh --fasta_paths=$FASTA_FILE --max_template_date=2022-06-27 --output_dir=$OUTPUT_DIR --use_gpu_relax --model_preset=multimer --db_preset=reduced_dbs --run_feature=1

Note that the command above only does the CPU intensive part; the GPU intensive part needs to be run as well, with:

run_alphafold_red.sh --fasta_paths=$FASTA_FILE --max_template_date=2022-06-27 --output_dir=$OUTPUT_DIR --use_gpu_relax --model_preset=multimer --db_preset=reduced_dbs

Running Alphafold in a job script

We provide sample SLURM scripts that in essence do the steps outlined above, at /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_232.slr for the first step, and /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_232_2.slr for the second step. Note the "232" in the script names, which denotes the Alphafold version. Because of the explicit launch of Alphafold from the container, necessitated by the RAM disk databases, we need to explicitly call the appropriate container version.

The databases need about 25 GB worth of RAM on the RAM disk, so make sure to add this RAM to the amount requested with the #SBATCH --mem option.
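
A convenient way to start is to copy the two sample scripts to your own directory and edit them there (fasta file name, memory, GPU type, etc.):

cp /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_232*.slr .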

The first step SLURM script, run_alphafold_chpc_232.slr, then looks like:

#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -p notchpeak-shared-short
#SBATCH -A notchpeak-shared-short
#SBATCH --mem=128G

# this script runs the first, CPU intensive, part of AlphaFold
ml purge
ml alphafold/2.3.2

# put the name of the fasta file here
export FASTA_FILE="t1050.fasta"
export OUTPUT_DIR="out"
# copy some of the databases to the RAM disk
/uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232.sh

SCRDB=/scratch/general/vast/app-repo/alphafold
TMPDB=/tmp/$SLURM_JOBID

sbatch -d afterok:$SLURM_JOBID run_alphafold_chpc_232_2.slr

# run_alphafold.sh is an alias defined in the modulefile, which requires listing the appropriate database paths
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1

rm -rf $TMPDB

Again, we are asking only for CPUs (as many as practical; the database search utilizes more CPUs to a certain extent). We are also submitting the second step from this job with the -d afterok:$SLURM_JOBID option, which adds a dependency so that the second job starts only after this first, CPU intensive step finishes successfully.

The second, GPU intensive step, run_alphafold_chpc_232_2.slr, runs on a few CPUs, needs less memory and does not put the databases on the RAM disk, since they are not used in this step - their paths just need to be fed to the run_alphafold.sh command, which checks that these databases exist. Notice also that we are not setting the FASTA_FILE and OUTPUT_DIR environment variables; they are exported in the first job and, since sbatch propagates the submission environment by default, they are passed to the second job.

#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -n 4
#SBATCH -N 1
#SBATCH -p notchpeak-shared-short
#SBATCH -A notchpeak-shared-short
#SBATCH --gres=gpu:t4:1
#SBATCH --mem=32G

# this script runs the second, GPU intensive, part of AlphaFold
ml purge
ml alphafold/2.3.2

# FASTA_FILE and OUTPUT_DIR are brought from the previous job

# no use of databases so no need to create them in /tmp
SCRDB=/scratch/general/vast/app-repo/alphafold
TMPDB=/scratch/general/vast/app-repo/alphafold

# run_alphafold.sh is an alias defined in the modulefile, which requires listing the appropriate database paths
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01

When running your own jobs, you may need to change the SLURM account (-A), partition (-p), memory (--mem) or GPU type (--gres=gpu), depending on the size of the job and the accounts, partitions and GPUs you have access to.
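
For example, for a GPU step run under your own allocation, the relevant #SBATCH lines might look like the following (the account, partition and GPU type below are placeholders; substitute the ones you have access to):

#SBATCH -A my-account
#SBATCH -p my-gpu-partition
#SBATCH --gres=gpu:a100:1
#SBATCH --mem=64G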

Once the two job scripts are ready, submit the first one with the sbatch command. The second job gets submitted automatically from the first job:

sbatch run_alphafold_chpc_232.slr

For the multimer with reduced databases, example scripts are at /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_multimer_232.slr and /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_multimer_232_2.slr. Note mainly the different databases used on the command line, and the call to a different script, /uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232_reduced.sh, to copy the databases to the RAM disk.
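
Note that for a multimer prediction the input is a single fasta file containing one entry per chain, for example (hypothetical chain names, truncated placeholder sequences):

>chainA first protein of the complex
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF
>chainB second protein of the complex
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLV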

Colabfold

Colabfold is an adaptation of Alphafold designed to run in the Google Colab cloud, which includes a modified database search resulting in faster performance. Since running Jupyter notebooks on the Google Colab cloud infrastructure may be impractical for our users, we have set up an adaptation of Colabfold, called localcolabfold, which allows Colabfold to run locally, e.g. on an HPC cluster.

The database search is done on a shared remote server, which means that with increased usage this remote server may become a bottleneck. For that reason, please be judicious when submitting Colabfold jobs. Once we reach a point of high use, we may need to look into setting up a dedicated local server for the database searches.

To run Colabfold, we load the module and run the command:

ml colabfold
export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_output_dir
colabfold_batch --amber --templates --num-recycle 3 --use-gpu-relax $FASTA_FILE $OUTPUT_DIR

These commands can either be typed after one starts an interactive GPU job, as shown in the Alphafold interactive example above, or put into a SLURM script, replacing the Alphafold module and commands in the SLURM script shown above. The SLURM script would then look like this:

#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -p notchpeak-shared-short
#SBATCH -A notchpeak-shared-short
#SBATCH --gres=gpu:1080ti:1

ml colabfold
export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_output_dir
colabfold_batch --amber --templates --num-recycle 3 --use-gpu-relax $FASTA_FILE $OUTPUT_DIR
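
Save the script above into a file (e.g. run_colabfold.slr, the name is arbitrary) and submit it with:

sbatch run_colabfold.slr
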
Last Updated: 12/19/23