ProteinBench: A Holistic Evaluation of Protein Foundation Models

Fei Ye, Zaixiang Zheng^∗, Dongyu Xue^∗, Yuning Shen^∗, Lihao Wang^∗
Yiming Ma, Yan Wang, Xinyou Wang, Xiangxin Zhou and Quanquan Gu
ByteDance Research
{yefei.joyce,quanquan.gu}@bytedance.com

Project page: https://proteinbench.github.io/
Equal contribution. Corresponding author.

Abstract

Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we release the evaluation dataset, code, and a public leaderboard for further analysis, along with a general modular toolkit. We intend for ProteinBench to be a living benchmark for establishing a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.

1 Introduction

Proteins are fundamental molecules playing pivotal roles in a vast array of biological processes, from enzymatic catalysis and signal transduction to structural support and immune response. Their functions are determined by their amino acid sequences, often mediated through folding into specific three-dimensional structures. Understanding the complex interplay between protein sequence, structure, and function is crucial for advancing science and engineering spanning pharmaceuticals, agriculture, specialty chemicals, and biofuels (Kuhlman & Bradley, 2019).

In recent years, there has been a surge in the development of protein foundation models¹ aimed at understanding fundamental biological processes by capturing the intricate mechanisms of proteins (Jumper et al., 2021; Abramson et al., 2024; Lin et al., 2023; Watson et al., 2023b; Ingraham et al., 2023; Krishna et al., 2024; Shin et al., 2021; Madani et al., 2023; Alley et al., 2019; Wang et al., 2024b; Hayes et al., 2024; Hie et al., 2024). These models, leveraging advanced deep-learning and generative AI techniques, have demonstrated remarkable capabilities and mark a significant shift from traditional, task-specific approaches to more generalizable frameworks capable of learning complex patterns and relationships within vast protein datasets. For instance, AlphaFold3 (Abramson et al., 2024), which is based on diffusion models, has achieved unprecedented accuracy in full-atom structure prediction for all biomolecules, while others like the ESM series (Rives et al., 2021; Hsu et al., 2022; Lin et al., 2023; Verkuil et al., 2022; Hayes et al., 2024) and DPLM (Wang et al., 2024b) have shown impressive representation capability in protein language modeling, benefiting diverse downstream tasks. Furthermore, these foundation models are not limited to single modalities. Multi-modal models that jointly consider sequence, structure, and function are emerging, offering a comprehensive understanding of protein behavior (Hayes et al., 2024; Liu et al., 2023). One important aspect of understanding this sequence-structure-function relationship is protein conformational dynamics. Recent work has extended protein structure prediction to several conformation prediction tasks and introduced generative AI to model the conformational distribution of proteins (Jing et al., 2023; Zheng et al., 2024; Jing et al., 2024; Wang et al., 2024c; Lu et al., 2024).

¹ In this study, we broaden the definition of protein foundation models to include any generative models aimed at addressing foundational problems of protein sciences.

However, the rapid progress of protein foundation models has also led to an urgent need for a unified framework to holistically evaluate their performance across a diverse set of tasks, datasets, and metrics, as shown in Appendix A. The current landscape of protein foundation models is characterized by non-unified modeling approaches and by task-specific or model-specific evaluation criteria. This heterogeneity in evaluation methods makes it challenging to draw meaningful comparisons between different models and to fully understand their relative strengths and limitations.

Through systematic evaluations of datasets spanning diverse biological domains, with a particular emphasis on protein design and conformational dynamics, we aim to provide a comprehensive analysis of model architecture and performance on protein foundation models. This approach allows us to dissect the impact of various model components and data characteristics on different aspects of protein modeling. Comparing the capabilities of these models on standardized benchmarks is crucial for guiding future research directions, informing model selection for practical applications, and driving the advancement of the field as a whole.

In this study, as shown in Figure 1, we present ProteinBench, the first benchmark designed to provide a comprehensive evaluation of protein foundation models through four key components:

Figure 1: Comprehensive overview of fundamental protein modeling tasks in ProteinBench. ProteinBench incorporates a spectrum of protein modeling challenges. Tasks marked with blue stars highlight domains where standardized performance benchmarks were previously unavailable.

(1) A taxonomic classification of tasks encompassing the main generative challenges in the protein domain. ProteinBench covers a wide range of generative tasks, including protein design (spanning structure design, sequence design, structure-sequence co-design, and an application-specific task of antibody design), three-dimensional structure prediction, and conformational dynamics. These tasks, addressing different protein modalities, enable a nuanced analysis of the interplay between model architecture and modality characteristics on performance. We utilize diverse and carefully curated datasets to capture the complexity and diversity of the protein universe, ensuring a thorough evaluation of model capabilities.

(2) A multi-metric evaluation approach assessing performance across four key dimensions: quality, novelty, diversity, and robustness. Current evaluations of protein generative models often suffer from non-unified metrics and incomplete assessments, typically focusing on only one or two aspects. However, protein scientific problems encompass a complex and systematic array of challenges. Downstream tasks in protein modeling and design involve intricate interplays between sequence, structure, and function. ProteinBench addresses this limitation by providing a comprehensive measurement of a model’s ability to capture the mechanisms of the protein universe. We evaluate models based on four critical dimensions: quality, novelty, diversity, and robustness. This multi-faceted approach offers a more holistic view of model performance and capabilities.

(3) In-depth analyses from various user objectives, providing a holistic view of model performance. Recognizing that different users may have varying objectives when applying protein foundation models, we conduct in-depth analyses from multiple perspectives. For instance, in protein design, some users may prioritize models that fit natural evolutionary distributions, while others may seek models capable of generating novel proteins outside the training set distribution. By analyzing model capabilities from these different objectives, ProteinBench provides insights that are beneficial for a wide range of practical applications.

(4) Leaderboard and code framework. To facilitate fair comparisons and support the development of new methods, we provide a unified experimental framework. This includes a public leaderboard and open-source code, enabling researchers to easily benchmark their models against existing ones and contribute to the ongoing advancement of the field.

By incorporating these four components, ProteinBench aims to establish a standardized, comprehensive, and user-centric evaluation framework for protein foundation models. This approach not only illuminates the current state-of-the-art but also guides future research directions and accelerates progress in the field of protein modeling and design.

2 Background and task definition

In this section, we provide a concise overview of the tasks addressed by various protein foundation models as shown in Table 1, with a particular focus on two key generative tasks: protein design and conformational dynamics. These two areas are further divided into eight subtasks.

Table 1: Overview of ProteinBench, which summarizes the dimensions, metrics, and methods used in ProteinBench. Methods highlighted in ‘italics’ have not yet been evaluated in ProteinBench but will be assessed in the future.

Protein Design
Tasks	Dimension	Metrics	Methods
Inverse Folding	Sequence recovery	AAR	ProteinMPNN, ESM-IF1, LM-Design, ESM3, PiFold, CarbonDesign
	Refoldability	scTM (AF2)
	Stability	pLDDT (AF2)
Backbone Design	Quality	scTM, scRMSD (ProteinMPNN & ESMFold)	RFdiffusion, FrameFlow, Chroma, FrameDiff, FoldFlow, Genie, FoldingDiff, Proteus
	Novelty	Max. TM score to PDB database (Foldseek)
	Diversity	Pairwise TM, Max Cluster (Foldseek)
Sequence Design	Quality	pLDDT (AF2)	ProGen2, EvoDiff, DPLM, ESM3
	Novelty	Max. TM score to PDB database (Foldseek)
	Diversity	Pairwise TM, Max Cluster (Foldseek)
Struct-seq Co-design	Quality	scTM, scRMSD (ESMFold)	ProteinGenerator, Protpardelle, Multiflow, ESM3, CarbonNovo
	Novelty	Max. TM score to PDB database (Foldseek)
	Diversity	Pairwise TM, Max Cluster (Foldseek)
Motif Scaffolding	Quality	Motif RMSD, Scaffold RMSD	FrameFlow, RFdiffusion, TDS, EvoDiff, DPLM, ESM3
Antibody Design	Accuracy	AAR, RMSD, TM-score	HERN, MEAN, dyMEAN, DiffAb, AbDPO
	Functionality	Binding Energy (Rosetta)
	Specificity	Seq Similarity, PHR
	Rationality	CN-Score, Clashes, Seq Naturalness, Total Energy (Rosetta), scRMSD (IgFold)
Protein Conformation Prediction
Tasks	Dimension	Metrics	Methods
Single state (folding)	Accuracy	TM score, RMSD, GDT, lDDT	AlphaFold2, OpenFold, ESMFold, RoseTTAFold2, EigenFold
	Quality	CA clash/break rate, Peptide bond break rate
Multiple state Prediction	Accuracy	Ensemble TM score/RMSD	EigenFold, MSA-subsampling, Str2Str, AlphaFlow/ESMFlow, ConfDiff
	Diversity	Pairwise RMSD/TM
	Quality	CA clash/break rate, Peptide bond break rate
Distribution Prediction	Accuracy	Flexibility accuracy, Distributional similarity, Ensemble observables
	Diversity	Pairwise RMSD, RMSF
	Quality	CA clash/break rate, Peptide bond break rate

For each task, we focus on the following aspects, with detailed information provided in the appendix:

[Task Definition] A clear and concise description of the task, including its objectives and relevance to protein science. Specification of the input data format and expected output for each task.
[Evaluation Metrics] Description of the metrics used to assess model performance, including quality, novelty, diversity, and robustness measures.
[Datasets] Overview of the datasets used for each task, including their size, diversity, and any pre-processing steps applied.

2.1 Protein Design

2.1.1 Inverse folding

[Task Definition] The objective is to predict an optimal amino acid sequence for a given target protein structure, considering factors such as stability, refoldability, and potential functionality.
[Evaluation Metrics] Performance in protein sequence design is assessed using multiple complementary metrics: (1) Sequence Recovery: This metric compares the designed sequences to natural sequences with similar structures. It quantifies how well the design method can recapitulate evolutionarily conserved sequence patterns associated with specific structural motifs. (2) Refoldability: This measure evaluates the structural similarity between the target backbone and the predicted structure of the designed sequence. The prediction is performed using AlphaFold2 (Jumper et al., 2021). Similarity is quantified using self-consistent template modeling score (scTM) (Trippe et al., 2022) and self-consistent root-mean-square deviation (scRMSD), providing insight into how well the designed sequence would fold into the intended structure. (3) Stability: This is assessed using the predicted local distance difference test (pLDDT) calculated by AlphaFold2. The pLDDT score serves as a proxy for the predicted stability of the designed protein, following Dauparas et al. (2022).
[Datasets] Evaluations were conducted on different datasets targeting two distinct objectives of structure-based sequence design: (1) Capturing the native evolutionary distribution: we evaluated two independent datasets containing newly released PDB structures: CASP15 (cas, 2022) and CAMEO (Robin et al., 2021). We collected new structures from the ongoing CAMEO assessment between January and July 2024, resulting in a total of 332 complex structures. Additionally, 32 protein structures were collected from CASP15, which includes only protein entities, excluding nucleic acids or ligands. (2) De novo protein design: RFdiffusion (Watson et al., 2023a) was used to generate backbones of varying lengths, specifically 100, 200, 300, 400, and 500 residues. For each length, 10 different structures were randomly sampled. A sampling temperature of 0.1 was used for all methods. The designability of the designed sequences was evaluated using AlphaFold2, with the scTM score and pLDDT metrics serving as the primary assessment criteria. Existing benchmarks for inverse folding, such as PDB-Struct (Wang et al., 2023) and ProteinInvBench (Gao et al., 2024), provide standardized protein structure sets for evaluating inverse folding methods. While these benchmarks have significantly contributed to the field’s advancement, there is a growing need for more comprehensive evaluation frameworks. These expanded evaluations should align more closely with diverse user objectives in protein design, encompassing aspects like accuracy in capturing natural evolutionary distributions and robustness in de novo backbone-based sequence design.
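
As an illustration of the sequence recovery metric (AAR in Table 1), the following minimal sketch computes the fraction of positions at which a designed sequence matches the native one. It assumes the two sequences are already aligned position by position (same backbone, same length); the function name and the toy sequences are hypothetical and are not taken from the ProteinBench code.

```python
def amino_acid_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue equals the native residue.

    Assumption: both sequences correspond to the same backbone, so they have
    equal length and need no alignment.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Toy example (hypothetical sequences): 8 of 10 positions are recovered.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
print(f"AAR = {amino_acid_recovery(designed, native):.2f}")  # AAR = 0.80
```

In practice the per-protein values would be aggregated over the evaluation set; Table 2 reports the median over repeated experiments.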

2.1.2 Protein backbone design

[Task Definition] Protein backbone design focuses on creating new protein folds to achieve de novo design objectives. This task is essential for expanding the repertoire of protein structures beyond those found in nature, with significant applications in fields such as drug discovery, biomaterials, and therapeutics.
[Evaluation Metrics] The evaluation of backbone design encompasses multiple criteria to assess the quality, novelty, and diversity of generated structures. Structural quality is primarily measured using self-consistent TM-score and RMSD, which quantify the backbone’s refoldability using ProteinMPNN (Dauparas et al., 2022) and ESMFold (Lin et al., 2023). Equally important is novelty, which gauges the method’s capacity to explore new structural space beyond known protein folds. Novelty is evaluated as the maximum TM-score obtained when comparing designed structures to existing entries in the RCSB Protein Data Bank (PDB) (Berman et al., 2000); this comparison is performed using Foldseek (van Kempen et al., 2022), a rapid structural alignment tool. Diversity is evaluated using (a) pairwise TM-scores among the designed structures and (b) the number of distinct structural clusters identified within the set of designed backbones, also determined using Foldseek (van Kempen et al., 2022). These diversity metrics help quantify the range of unique structures the design method can produce, ensuring that it is not simply recreating known folds but generating a varied repertoire of protein backbones.
[Datasets] The primary objective of generative tasks is to accurately map the general distribution of the training set. For protein structure generation, high-resolution structures from the Protein Data Bank (PDB) are commonly used. To gain insight into this data distribution, we randomly sampled 100 native single-chain structures from the RCSB database as references. To ensure diversity, we iteratively removed structures with the highest TM-score compared to others, until we arrived at a final set of 100 distinct structures. This approach provides a representative snapshot of the single-chain structural distribution within the PDB, serving as a benchmark for evaluating the performance of generative models in capturing the true distribution of protein structures.
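
The iterative removal step described above can be sketched as a greedy filter. This is only an illustration under stated assumptions: the pairwise TM-scores are taken as a precomputed symmetric matrix (e.g., from an external structural aligner), the starting pool is assumed to be larger than the target size, and the tie-breaking rule is our own choice rather than something specified in the text.

```python
def greedy_diverse_subset(tm, target_size):
    """Return indices of a diverse subset chosen by greedy redundancy removal.

    tm: symmetric matrix (list of lists) of pairwise TM-scores, assumed to be
        precomputed with an external structural aligner.
    target_size: number of structures to keep (at least 1).
    """
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    keep = list(range(len(tm)))
    while len(keep) > target_size:
        def redundancy(i):
            others = [tm[i][j] for j in keep if j != i]
            # Primary key: highest TM-score to any other kept structure.
            # Secondary key (tie-break, an assumption): total similarity.
            return (max(others), sum(others))
        # Drop the structure that is most redundant with the rest.
        keep.remove(max(keep, key=redundancy))
    return keep


# Toy 4-structure example (hypothetical scores): structures 0 and 1 are
# near-duplicates, so one of them is removed first.
tm = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.5],
    [0.2, 0.3, 0.5, 1.0],
]
print(greedy_diverse_subset(tm, target_size=2))  # [0, 3]
```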

2.1.3 Protein sequence design

[Task Definition] The aim of this task is to generate amino acid sequences of desired properties, such as quality, diversity and novelty. Besides sequence-based evaluation, the structural characteristics of the generated sequences are also important.
[Evaluation Metrics] For sequence naturalness, we use perplexity from an autoregressive protein language model (ProGen2) to quantify whether the patterns of generated sequences lie within the natural sequence distribution. For structure-based evaluation, we use a single-sequence folding model, i.e., ESMFold, to predict the structures of the generated sequences. We then measure structural quality by pLDDT, as a proxy for the structural stability of the sequence, using the predicted structures from AlphaFold2, as well as structural diversity and novelty using the same protocol as in backbone design.
[Datasets] UniRef50 is a commonly used dataset for training protein sequence generative models and language models.
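
To make the perplexity-based naturalness score concrete, the sketch below uses the standard definition of perplexity as the exponential of the negative mean per-residue log-likelihood. It assumes the per-residue natural-log probabilities have already been obtained from an autoregressive language model; how they are extracted from ProGen2 is not shown, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-residue natural-log probabilities.

    token_log_probs: list of log p(residue_t | residues_<t), assumed to come
    from an autoregressive protein language model.
    """
    if not token_log_probs:
        raise ValueError("need at least one residue")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that is uniform over the 20 amino acids has
# perplexity 20 regardless of sequence length (up to floating-point error).
uniform = [math.log(1 / 20)] * 50
print(round(perplexity(uniform), 6))
```

Lower perplexity indicates that the scoring model finds the sequence more probable, which is the sense in which it serves as a proxy for naturalness.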

2.1.4 Structure and sequence co-design

[Task Definition] Protein structure-sequence co-design involves simultaneously optimizing both the backbone structure and amino acid sequence of a protein to achieve desired properties or functions. This task is more complex than sequence design or structure design alone, as it explores a larger solution space.
[Evaluation Metrics] Evaluation metrics are derived from those used for both sequence and structure design: structure quality assessments, sequence-structure compatibility, and the novelty of both sequence and structure compared to known proteins.
[Datasets] High-resolution protein structures from the Protein Data Bank (PDB) are the commonly used dataset for this task, with careful consideration given to removing redundancy.

2.1.5 Motif scaffolding

[Task Definition] Motif scaffolding involves designing a protein structure that incorporates a specific functional motif or binding site. The goal is to create a stable protein framework (scaffold) that presents the desired motif in the correct geometry for its function.
[Evaluation Metrics] Following Yim et al. (2024), key metrics include the structural accuracy of the motif within the designed scaffold (typically measured by RMSD), overall protein stability, and retention of the motif’s functional properties. Experimental validation through binding assays or enzymatic activity tests is often crucial.
[Datasets] Datasets typically include libraries of known functional motifs (e.g., catalytic sites, binding interfaces) and diverse scaffold structures that can potentially accommodate these motifs. The Protein Data Bank is a primary source, but curated datasets of functional sites like the Catalytic Site Atlas are also valuable.
[Related benchmarks] Enzyme Design Challenge provides relevant test cases. However, given the specificity of motif scaffolding tasks, benchmarks often need to be tailored to the particular class of motifs or functions being targeted. Currently, there exists no comprehensive benchmark for this task in the field. A widely used benchmark containing 17 (25) motif-scaffolding problems was used in RFdiffusion (Watson et al., 2023b).

2.1.6 Antibody Design

[Task Definition] The goal of antibody design is to generate antibodies that can specifically bind to a given antigen. Since the Complementarity-Determining Regions (CDRs) of antibodies are highly variable and primarily responsible for antigen binding, antibody design could be simplified to the design of CDR regions and further reduced to the design of the third CDR in heavy chain (CDR-H3). Given the crucial role that protein structure plays in interactions, antibody design usually involves the simultaneous design of the sequence and the structure when binding to the antigen.
[Evaluation Metrics] As a highly goal-oriented functional protein design task, the evaluation of antibody design is straightforward, namely the Functionality (binding capability to the target antigen) and Specificity of the designed antibody. Additionally, the Rationality of the designed antibody’s sequence and structure needs to be evaluated for filtering out invalid designs. Existing studies also evaluate the Accuracy of designed antibodies by measuring their similarity to natural antibodies, as natural ones are confirmed to be effective. However, using accuracy as an evaluation metric is inadequate in many cases, which we will demonstrate in detail in Section 3.1.6.
[Datasets] The Structural Antibody Database (SAbDab; Dunbar et al., 2013) is a commonly used dataset for antibody design. It contains structural data of antibody-antigen complexes, but the data size is limited and it contains numerous redundancies.

2.2 Protein Conformation Prediction

2.2.1 Protein Folding: single-state prediction

[Task Definition] Protein folding is the task of predicting the folded structure of a protein from its sequence. Folding models, such as AlphaFold2, have played a pivotal role in the recent development of models for protein conformation prediction (Jing et al., 2024; Wang et al., 2024c). Therefore, we recognize the necessity of including protein folding in this benchmark, viewing it as a specific instance of protein conformation prediction for a single conformational state.
[Evaluation Metrics] The accuracy of a predicted structure is evaluated by comparing it with its reference structure deposited in the PDB using RMSD, TM-score, global distance test (GDT), and local distance difference test (lDDT). We also evaluate the quality of predicted structures by measuring the rate of clashing alpha carbons (CA-clash), disconnected neighboring alpha carbons (CA-break), and disconnected peptide bonds (PepBond-break) in predicted structures. See Appendix B.2.2 for details.
[Datasets] Most of the folding models compared in this benchmark were established prior to 2022. We use CAMEO2022 from Jing et al. (2023) for evaluation, which consists of 183 short-to-mid-length single protein chains (<750 amino acids) from the targets of CAMEO between Aug 1 and Oct 31, 2022.
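
The CA-clash and CA-break rates can be illustrated with a short geometric check on alpha-carbon coordinates. The sketch below is not the benchmark’s implementation: the distance cutoffs (3.0 Å for clashes, 4.2 Å for breaks) and the normalization are placeholder assumptions, since the exact definitions are given in the appendix rather than in this section.

```python
import math


def ca_quality(ca, clash_cutoff=3.0, break_cutoff=4.2):
    """Return (clash_rate, break_rate) for a chain of CA coordinates.

    ca: list of (x, y, z) tuples in Angstrom, one per residue, in chain order.
    clash_cutoff / break_cutoff: placeholder thresholds (assumptions).
    """
    n = len(ca)
    if n < 2:
        raise ValueError("need at least two residues")
    # CA-break: consecutive CA atoms that are too far apart to be bonded neighbors.
    breaks = sum(math.dist(ca[i], ca[i + 1]) > break_cutoff for i in range(n - 1))
    # CA-clash: residues with a non-adjacent CA atom closer than the cutoff.
    clashing = set()
    for i in range(n):
        for j in range(i + 2, n):
            if math.dist(ca[i], ca[j]) < clash_cutoff:
                clashing.update((i, j))
    return len(clashing) / n, breaks / (n - 1)


# Toy 5-residue chain (hypothetical coordinates): residue 3 folds back onto
# residue 0 (one clashing pair) and the chain breaks between residues 3 and 4.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (1.0, 2.0, 0.0), (1.0, 2.0, 6.0)]
clash_rate, break_rate = ca_quality(ca)
print(clash_rate, break_rate)  # 0.4 0.25
```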

2.2.2 Multiple-state prediction

[Task Definition] As an extension of the single-state prediction task, multiple-state prediction aims to accurately predict (by sampling) two or more distinct conformational states of a protein that have been observed under different conditions (e.g., ligand binding) or through molecular dynamics simulations. The ability to predict these “alternative” conformations in addition to the folded structure could provide insights into conformational changes and protein functions.
[Evaluation Metrics] We evaluate this task based on accuracy, diversity, and quality. The accuracy of predicting a state is determined by the best structural similarity of the samples to the reference structure, measured by TM-score or RMSD. The overall accuracy of multiple-state prediction is assessed by “ensemble accuracy”, which is the average accuracy across all reference states (TMens or RMSDens where “ens” stands for ensemble), similar to Jing et al. (2023). For sample structural diversity, we measure the pairwise TM-score (or RMSD) among the samples. Finally, we assess the structural quality of generated samples, similar to single-state prediction, using CA-clash, CA-break, and PepBond-break.
[Datasets] We benchmark the models on two public datasets from previous works: (1) apo-holo, which contains 91 proteins, each with a pair of experimental structures (apo or unbound, and holo or bound) related to ligand-binding-induced conformational changes (Saldaño et al., 2022; Jing et al., 2023); (2) BPTI (Bovine Pancreatic Trypsin Inhibitor), a 58-amino-acid protein, where a previous long-time MD simulation revealed five clusters of distinct conformations (Shaw et al., 2010).
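
The ensemble accuracy described in the evaluation metrics (best sample per reference state, then averaged over states) can be written down directly. The sketch below assumes that the sample-versus-reference scores have already been computed by an external tool and are supplied as a matrix with one row per sample and one column per reference state; the function names and toy values are hypothetical.

```python
def ensemble_tm(tm):
    """TMens: for each reference state take the best (highest) TM-score over
    all samples, then average across reference states."""
    n_ref = len(tm[0])
    return sum(max(row[r] for row in tm) for r in range(n_ref)) / n_ref


def ensemble_rmsd(rmsd):
    """RMSDens: same idea, but the best sample is the one with the lowest RMSD."""
    n_ref = len(rmsd[0])
    return sum(min(row[r] for row in rmsd) for r in range(n_ref)) / n_ref


# Toy example: 3 sampled conformations scored against 2 reference states
# (e.g., apo and holo). Values are illustrative only.
tm = [[0.95, 0.70], [0.80, 0.85], [0.60, 0.90]]
rmsd = [[1.0, 4.0], [2.5, 2.0], [5.0, 1.5]]
print(round(ensemble_tm(tm), 3))  # 0.925
print(ensemble_rmsd(rmsd))        # 1.25
```

Because the best sample is chosen independently for each state, a model scores well only if its samples cover every reference state, not just the dominant one.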

2.2.3 Distribution prediction

[Task Definition] In contrast to multiple-state prediction, where the main goal is to recover specific conformational states, distribution prediction focuses on generating a sample distribution that resembles a target distribution—such as the empirical distribution sampled from molecular dynamics (MD). This task further bridges the gap between protein conformation prediction models and current MD-based approaches for studying protein dynamics and thermodynamic properties.
[Evaluation Metrics] In addition to the quality and diversity criteria from the previous sections, we follow Jing et al. (2024) and include three categories of metrics to compare the ensemble of model-generated samples with the reference samples from MD simulations: flexibility assesses whether the model can distinguish more “flexible” regions or proteins from less “flexible” ones, measured by the Pearson correlation $r$ of region/protein diversity (e.g., pairwise RMSD); distributional accuracy directly compares the conformational distributions of model-generated samples with the reference MD conformations through the Wasserstein distance or the cosine similarity of the first principal components; and ensemble observables focus on function-related observables, such as transient contacts between residues due to dynamics, and compare the sample ensemble with the reference ensemble from MD. See Appendix B.2.2 for detailed descriptions of the metrics.
[Datasets] We evaluate performance using the ATLAS dataset (Vander Meersche et al., 2024), a recent database of MD simulation results for diverse proteins. To avoid data leakage for models trained on portions of the ATLAS dataset, we follow Jing et al. (2024) and benchmark on 82 proteins whose PDB entries were deposited after May 1, 2019 and are not part of the training or validation set.
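
As a rough illustration of the flexibility metric, the sketch below computes a per-residue root-mean-square fluctuation (RMSF) profile for an ensemble and then the Pearson correlation $r$ between two such profiles (model samples versus MD reference). Using RMSF as the per-residue flexibility measure is an assumption made here for simplicity, and the conformations are assumed to be superposed beforehand; neither function is taken from the benchmark code.

```python
import math


def rmsf(frames):
    """Per-residue RMSF for an ensemble of pre-superposed conformations.

    frames: list of conformations, each a list of (x, y, z) per residue.
    """
    n_frames, n_res = len(frames), len(frames[0])
    profile = []
    for i in range(n_res):
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        msd = sum(math.dist(f[i], mean) ** 2 for f in frames) / n_frames
        profile.append(math.sqrt(msd))
    return profile


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy ensemble: residue 0 moves by 1 Angstrom around its mean, residue 1 is rigid.
frames = [[(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)], [(2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]]
print(rmsf(frames))  # [1.0, 0.0]

# Hypothetical per-residue profiles for a model ensemble and an MD reference.
model_profile = [0.5, 1.2, 2.0, 0.8]
md_profile = [0.6, 1.0, 2.4, 0.7]
print(pearson_r(model_profile, md_profile))
```

A value of $r$ close to 1 would indicate that the model ranks flexible and rigid regions in the same order as the MD reference, even if the absolute fluctuation magnitudes differ.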

3 ProteinBench

In this section, we present ProteinBench, a holistic evaluation framework for protein foundation models. By systematically evaluating protein foundation models on the following tasks, we aim to provide a comprehensive understanding of their capabilities and limitations. This approach allows for a nuanced comparison of different model architectures and strategies, highlighting areas of strength and identifying opportunities for improvement. All data used in this benchmark are publicly available, ensuring reproducibility and facilitating wider participation in the research community.

3.1 Protein Design

In this section, we present a comprehensive evaluation of various protein foundation models across fundamental protein design tasks, including single-modal approaches (structure-based sequence design, structure design, and sequence design), multi-modal structure-sequence co-design, and the application-specific task of antibody design. This holistic assessment allows us to examine the versatility and effectiveness of different modeling approaches across a spectrum of protein engineering challenges. Notably, for backbone design, sequence design, co-design, and motif scaffolding, the quality, novelty, and diversity metrics are calculated using the same method. By utilizing common evaluation metrics across tasks, we enable cross-task comparisons, with the aim of identifying the strengths and limitations of each modeling approach and helping to uncover potential synergies between different protein modalities for future research.

3.1.1 Inverse folding

In this section, we evaluate the performance of various inverse-folding models for structure-based sequence design, focusing on two distinct objectives: natural evolutionary fitness (in-distribution proteins) and de novo designed backbone-based sequence design. The latter represents an out-of-distribution problem that tests the robustness of the methods, as these structures typically contain noise that differs from that of the high-resolution structures deposited in the PDB. The results are presented in Table 2.

Our analysis of native distribution fitness reveals that language model-based methods, such as LM-Design (Zheng et al., 2023) in our investigation, effectively capture the natural evolutionary distribution, achieving high sequence recovery rates for native protein structure-based sequence design. This suggests that these models learn and replicate the intricate patterns of amino acid selection that have emerged through evolutionary processes. However, LM-Design’s performance decreases when applied to de novo backbone-based sequence design. Conversely, ProteinMPNN (Dauparas et al., 2022), a method specifically developed for de novo design and trained using coordinates perturbed with 0.2Å of added noise, consistently demonstrates superior performance in de novo design tasks. However, ProteinMPNN’s performance declines when evaluated on the objective of fitting the native evolutionary distribution. This finding has significant implications for the field, suggesting that no single model currently excels across all protein design objectives. The choice of model should be carefully aligned with the intended applications.

Although ESM-IF1 (Hsu et al., 2022) was trained on the largest dataset of native sequences and structures from the AlphaFoldDB (Varadi et al., 2022), is based on GVP (Jing et al., 2020) and Transformer architectures, and incorporated 0.1Å noise during training (similar to ProteinMPNN), it showed suboptimal performance in de novo backbone sequence design. Further investigation into the effects of larger noise additions or alternative model architectures on ESM-IF1’s performance could prove insightful. It is worth noting that we did not include functional mutation prediction tasks in this study, an area where ESM-IF1 has demonstrated impressive results, as these have been extensively studied in other benchmarks, such as ProteinGym (Notin et al., 2024). ESM3 (Hayes et al., 2024), the recently released multi-modal protein language model, exhibits performance comparable to ESM-IF1, with notable advantages for specific sequence lengths (100, 300, and 400 residues). Certain inverse folding methods, such as PiFold (Gao et al., 2022) and CarbonDesign (Ren et al., 2024), are currently not featured in ProteinBench. We plan to add their results in the near future.

Table 2: Performance of structure-based sequence design models on inverse folding tasks. The reported results are the median of repeated experiments. ’N/A’ stands for not applicable. ESM-IF1 and ESM3 use all native structures and sequences for model training; therefore, they are not measured in the evolution distribution fitting objective.

	Fitting Evolution Distribution		De novo backbones based sequence design
	CASP	CAMEO	length 100		length 200		length 300		length 400		length 500
	AAR ↑	AAR ↑	scTM ↑	pLDDT ↑	scTM ↑	pLDDT ↑	scTM ↑	pLDDT ↑	scTM ↑	pLDDT ↑	scTM ↑	pLDDT ↑
ProteinMPNN	0.450	0.468	0.962	94.14	0.945	89.34	0.962	90.28	0.875	83.76	0.568	67.09
ESM-IF1	N/A	N/A	0.810	88.83	0.635	69.67	0.336	74.36	0.449	64.59	0.462	58.97
LM-Design	0.516	0.570	0.834	78.45	0.373	58.41	0.481	69.86	0.565	59.87	0.397	56.35
ESM3	N/A	N/A	0.942	86.60	0.486	60.69	0.632	70.78	0.564	62.63	0.452	59.37

3.1.2 Structure design

In this section, we evaluate the performance of protein foundation models for backbone design. The results are presented in Table 3. Our analysis focuses on the quality, novelty, and diversity of the generated structures across various chain lengths. Based on the quality metrics of scTM-score and scRMSD, RFdiffusion (Watson et al., 2023b) demonstrates exceptional performance in backbone design for chain lengths ranging from 50 to 300 amino acids. FrameFlow (Yim et al., 2023) achieves the second-best performance in this range. However, we observe a significant performance decrease across all models for longer chains (500 amino acids), with scTM scores dropping by up to roughly a third relative to length 300 (for example, from 0.92 to 0.61 for FrameFlow and from 0.96 to 0.79 for RFdiffusion). This decline suggests that developing methods for long-chain backbone design remains an important challenge for future research. Novelty is an equally important metric, as it gauges a method’s capacity to explore new structural space beyond known protein folds. Under moderate quality constraints (scTM score >0.5), FoldFlow (Bose et al., 2023) and Genie (Lin & AlQuraishi, 2023) exhibit good performance in generating novel structures. When we increase the quality threshold (scTM score >0.8), Chroma (Ingraham et al., 2023) generally shows the best performance across chain lengths from 50 to 500 amino acids. In terms of structural diversity, Chroma shows commendable performance across the tested chain lengths. It is important to note that for this evaluation, we used the released FoldFlow model, which was trained on a smaller training set with shorter sequences. This limitation may lead to an unfair comparison of the model architecture with other methods trained on the entire PDB database, particularly for longer chain lengths. We will soon update our evaluations to include more methods, such as Foldingdiff (Wu et al., 2024a) and Proteous (Wang et al., 2024a).

Table 3: Performance of backbone design models evaluated at various lengths ranging from 50 to 500. The reported results are the median of repeated experiments. We highlight the best performance in bold and underline the second-best. For the novelty and diversity metrics, we only highlight results whose corresponding scTM score is higher than 0.5. ’N/A’ stands for not applicable.

	length 50					length 100
	Quality		Novelty	Diversity		Quality		Novelty	Diversity
	scTM ↑	scRMSD ↓	Max TM ↓	pairwise TM ↓	Max Clust. ↑	scTM ↑	scRMSD ↓	Max TM ↓	pairwise TM ↓	Max Clust.↑
Native PDBs	0.91 $\pm$ 0.11	0.74 $\pm$ 1.45	N/A	0.29 $\pm$ 0.03	0.66	0.96 $\pm$ 0.10	0.67 $\pm$ 1.61	N/A	0.30 $\pm$ 0.02	0.77
RFdiffusion	0.95 $\pm$ 0.12	0.45 $\pm$ 1.71	0.65 $\pm$ 0.16	0.58 $\pm$ 0.05	0.67	0.98 $\pm$ 0.12	0.48 $\pm$ 0.56	0.76 $\pm$ 0.01	0.41 $\pm$ 0.03	0.32
FrameFlow	0.91 $\pm$ 0.09	0.58 $\pm$ 0.51	0.75 $\pm$ 0.01	0.68 $\pm$ 0.10	0.39	0.94 $\pm$ 0.08	0.70 $\pm$ 0.70	0.72 $\pm$ 0.01	0.55 $\pm$ 0.08	0.49
Chroma	0.85 $\pm$ 0.15	1.05 $\pm$ 1.49	0.59 $\pm$ 0.08	0.29 $\pm$ 0.01	0.48	0.89 $\pm$ 0.13	1.27 $\pm$ 1.85	0.70 $\pm$ 0.01	0.35 $\pm$ 0.03	0.59
FrameDiff(latest)	0.85 $\pm$ 0.13	1.00 $\pm$ 1.27	0.67 $\pm$ 0.01	0.35 $\pm$ 0.02	0.64	0.90 $\pm$ 0.08	1.23 $\pm$ 1.02	0.71 $\pm$ 0.08	0.52 $\pm$ 0.05	0.11
FoldFlow1(sfm)	0.90 $\pm$ 0.10	0.67 $\pm$ 0.88	0.68 $\pm$ 0.03	0.63 $\pm$ 0.07	0.48	0.87 $\pm$ 0.11	1.34 $\pm$ 1.42	0.65 $\pm$ 0.01	0.49 $\pm$ 0.08	0.83
FoldFlow1(base)	0.79 $\pm$ 0.14	1.19 $\pm$ 1.27	0.66 $\pm$ 0.02	0.53 $\pm$ 0.08	0.76	0.81 $\pm$ 0.15	1.70 $\pm$ 1.95	0.62 $\pm$ 0.01	0.48 $\pm$ 0.07	0.83
FoldFlow1(ot)	0.83 $\pm$ 0.16	1.10 $\pm$ 1.53	0.65 $\pm$ 0.02	0.53 $\pm$ 0.08	0.77	0.83 $\pm$ 0.15	1.60 $\pm$ 1.95	0.64 $\pm$ 0.01	0.48 $\pm$ 0.06	0.81
Genie	0.57 $\pm$ 0.15	3.12 $\pm$ 2.07	0.57 $\pm$ 0.03	0.32 $\pm$ 0.02	0.90	0.69 $\pm$ 0.17	3.38 $\pm$ 3.04	0.59 $\pm$ 0.01	0.31 $\pm$ 0.02	0.96
	length 300					length 500
	Quality		Novelty	Diversity		Quality		Novelty	Diversity
	scTM ↑	scRMSD ↓	Max TM ↓	pairwise TM ↓	Max Clust. ↑	scTM ↑	scRMSD ↓	Max TM ↓	pairwise TM ↓	Max Clust.↑
Native PDBs	0.97 $\pm$ 0.10	0.82 $\pm$ 2.67	N/A	0.28 $\pm$ 0.02	0.77	0.97 $\pm$ 0.17	1.07 $\pm$ 5.96	N/A	0.29 $\pm$ 0.03	0.8
RFdiffusion	0.96 $\pm$ 0.15	1.03 $\pm$ 3.14	0.64 $\pm$ 0.01	0.36 $\pm$ 0.03	0.65	0.79 $\pm$ 0.19	5.60 $\pm$ 5.66	0.62 $\pm$ 0.004	0.33 $\pm$ 0.02	0.89
FrameFlow	0.92 $\pm$ 0.15	1.95 $\pm$ 2.76	0.65 $\pm$ 0.01	0.43 $\pm$ 0.07	0.88	0.61 $\pm$ 0.19	7.92 $\pm$ 4.08	0.61 $\pm$ 0.01	0.40 $\pm$ 0.06	0.92
Chroma	0.87 $\pm$ 0.13	2.47 $\pm$ 3.63	0.66 $\pm$ 0.01	0.36 $\pm$ 0.04	0.67	0.72 $\pm$ 0.18	6.71 $\pm$ 5.76	0.60 $\pm$ 0.01	0.29 $\pm$ 0.01	0.99
FrameDiff(latest)	0.87 $\pm$ 0.12	2.73 $\pm$ 2.69	0.69 $\pm$ 0.00	0.48 $\pm$ 0.04	0.21	0.63 $\pm$ 0.24	9.52 $\pm$ 18.19	0.58 $\pm$ 0.03	0.40 $\pm$ 0.06	0.52
FoldFlow1(sfm)	0.45 $\pm$ 0.11	9.04 $\pm$ 2.52	0.54 $\pm$ 0.01	0.39 $\pm$ 0.04	1.00	0.37 $\pm$ 0.06	13.04 $\pm$ 1.71	0.53 $\pm$ 0.01	0.37 $\pm$ 0.03	1.00
FoldFlow1(base)	0.43 $\pm$ 0.09	9.56 $\pm$ 2.42	0.54 $\pm$ 0.01	0.39 $\pm$ 0.05	0.98	0.35 $\pm$ 0.05	13.20 $\pm$ 2.29	0.52 $\pm$ 0.01	0.39 $\pm$ 0.05	1.00
FoldFlow1(ot)	0.54 $\pm$ 0.12	8.21 $\pm$ 2.38	0.58 $\pm$ 0.00	0.41 $\pm$ 0.06	0.94	0.37 $\pm$ 0.06	12.48 $\pm$ 2.00	0.51 $\pm$ 0.01	0.35 $\pm$ 0.03	1.00
Genie	0.27 $\pm$ 0.02	20.37 $\pm$ 1.70	0.30 $\pm$ 0.01	0.23 $\pm$ 0.01	1.00	0.25 $\pm$ 0.01	26.08 $\pm$ 1.58	0.22 $\pm$ 0.002	0.23 $\pm$ 0.004	1.00

3.1.3 Sequence design

Table 4: Performance of protein sequence generative models/language models on sequence generation tasks. The reported results are the average of repeated experiments with the standard deviation. The pLDDT score is the output of AlphaFold2. Max TM is an abbreviation for the maximum TM-score to the PDB database. ’N/A’ stands for not applicable. We highlight the best performance in bold.

	length 100					length 200
	Quality		Diversity		Novelty	Quality		Diversity		Novelty
	ppl ↓	pLDDT ↑	pairwise TM ↓	Max Clust. ↑	Max TM ↓	ppl ↓	pLDDT↑	pairwise TM ↓	Max Clust. ↑	Max TM ↓
Native Seqs		68.46 $\pm$ 16.50	0.55 $\pm$ 0.19	0.75	N/A		61.91 $\pm$ 11.62	0.49 $\pm$ 0.10	0.78	N/A
Progen 2 (700M)	8.28 $\pm$ 3.87	64.00 $\pm$ 21.26	0.42 $\pm$ 0.10	0.94	0.64 $\pm$ 0.08	5.68 $\pm$ 3.64	69.91 $\pm$ 9.23	0.40 $\pm$ 0.13	0.91	0.69 $\pm$ 0.05
EvoDiff	16.89 $\pm$ 1.04	50.20 $\pm$ 10.27	0.43 $\pm$ 0.05	0.98	0.69 $\pm$ 0.03	17.28 $\pm$ 1.64	50.66 $\pm$ 16.38	0.36 $\pm$ 0.04	1.00	0.71 $\pm$ 0.02
DPLM (650M)	6.21 $\pm$ 3.10	85.38 $\pm$ 14.20	0.50 $\pm$ 0.20	0.80	0.74 $\pm$ 0.10	4.61 $\pm$ 2.63	93.54 $\pm$ 3.73	0.54 $\pm$ 0.24	0.70	0.91 $\pm$ 0.004
ESM3 (1.4B)	14.79 $\pm$ 2.90	54.26 $\pm$ 15.35	0.45 $\pm$ 0.15	0.90	0.68 $\pm$ 0.07	12.96 $\pm$ 2.38	58.45 $\pm$ 9.40	0.35 $\pm$ 0.07	1.00	0.80 $\pm$ 0.01
	length 300					length 500
	Quality		Diversity		Novelty	Quality		Diversity		Novelty
	ppl ↓	pLDDT ↑	pairwise TM ↓	Max Clust. ↑	Max TM ↓	ppl ↓	pLDDT↑	pairwise TM ↓	Max Clust. ↑	Max TM ↓
Native Seqs		61.49 $\pm$ 14.47	0.51 $\pm$ 0.13	0.85	N/A		62.95 $\pm$ 12.60	0.51 $\pm$ 0.11	0.78	N/A
Progen 2 (700M)	6.25 $\pm$ 4.02	65.69 $\pm$ 20.93	0.42 $\pm$ 0.16	0.93	0.66 $\pm$ 0.06	4.27 $\pm$ 3.60	61.45 $\pm$ 20.17	0.32 $\pm$ 0.11	0.95	0.68 $\pm$ 0.08
EvoDiff	17.13 $\pm$ 2.00	45.14 $\pm$ 9.95	0.31 $\pm$ 0.03	1.00	0.68 $\pm$ 0.02	16.51 $\pm$ 3.82	43.14 $\pm$ 5.16	0.31 $\pm$ 0.03	1.00	0.69 $\pm$ 0.02
DPLM (650M)	3.47 $\pm$ 1.44	93.07 $\pm$ 5.77	0.57 $\pm$ 0.25	0.63	0.91 $\pm$ 0.01	3.33 $\pm$ 1.8	87.73 $\pm$ 11.61	0.43 $\pm$ 0.18	0.85	0.85 $\pm$ 0.04
ESM3 (1.4B)	14.59 $\pm$ 2.97	48.08 $\pm$ 13.34	0.32 $\pm$ 0.03	1.00	0.75 $\pm$ 0.02	11.10 $\pm$ 2.26	52.17 $\pm$ 10.52	0.30 $\pm$ 0.05	1.00	0.54 $\pm$ 0.03

In this section, we assess the performance of various protein sequence generative models based on the quality, diversity, and novelty of their generated sequences across different chain lengths. The evaluation metrics include AlphaFold2 (AF2) predicted pLDDT scores for structural plausibility (quality), pairwise TM-score and maximum cluster values for structural diversity, and maximum TM-score to PDB structures for structural novelty. We choose representative methods with distinct modeling foundations for evaluation. Among the methods evaluated, ProGen2 (Nijkamp et al., 2023) is an autoregressive protein language model (AR-LM), while EvoDiff (Alamdari et al., 2023) is designed as an order-agnostic autoregressive diffusion model (OADM). DPLM (Wang et al., 2024b) and ESM3 (Hayes et al., 2024) share a probabilistic foundation as absorbing discrete diffusion models or generative masked language models. Notably, ESM3 is a multimodal model that advances beyond other sequence-only methods by jointly learning protein sequences, structures, and functions through tokenization. For each model and sequence length, we sample 50 sequences to evaluate their performance.
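
To make the diversity and novelty columns concrete, the sketch below aggregates precomputed TM-scores. It assumes that pairwise TM is the mean TM-score over all unordered pairs of generated samples and that Max TM is the per-sample best match against a reference set, averaged over samples; both definitions and all function names are assumptions made for illustration, and the clustering-based Max Clust. metric is not covered.

```python
from itertools import combinations
from statistics import mean


def mean_pairwise_tm(tm: list[list[float]]) -> float:
    """Diversity proxy: average TM-score over all unordered pairs of samples.

    `tm` is a symmetric matrix of TM-scores among generated structures.
    Lower values indicate a more diverse sample set.
    """
    n = len(tm)
    if n < 2:
        raise ValueError("need at least two samples")
    return mean(tm[i][j] for i, j in combinations(range(n), 2))


def mean_max_tm_to_reference(tm_to_ref: list[list[float]]) -> float:
    """Novelty proxy: best match of each sample in the reference set, averaged.

    `tm_to_ref[i][k]` is the TM-score of sample i against reference structure k.
    Lower values indicate structures further from known folds.
    """
    return mean(max(row) for row in tm_to_ref)


# Toy matrices (hypothetical values, not benchmark data).
among_samples = [
    [1.0, 0.4, 0.6],
    [0.4, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]
to_reference = [[0.3, 0.7], [0.5, 0.2]]
print(mean_pairwise_tm(among_samples))
print(mean_max_tm_to_reference(to_reference))
```

In practice the TM-scores themselves would come from a structure aligner applied to predicted structures of the generated sequences.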

As shown in Table 4, DPLM consistently shows the highest quality scores, indicating superior accuracy in sequence generation. However, it has relatively lower diversity metrics, suggesting less variation in its generated sequences. EvoDiff, while demonstrating lower pLDDT scores, excels in diversity, particularly in producing highly diverse sequence clusters. Surprisingly, ESM3, a multimodal protein LM, displays lower pLDDT in sequence generation while maintaining competitive diversity, especially in generating novel sequences. ProGen2 strikes a balance between quality and diversity, offering moderate pLDDT scores and satisfactory diversity and novelty. This model is effective for generating sequences that are both diverse and close to known structures, depending on specific application needs. Regarding different chain lengths, all the models generally exhibit consistent trends in their performance metrics. As the chain length increases, there is a slight decline in the quality of sequences generated by some models, particularly EvoDiff and ESM3. This indicates a challenge in maintaining high sequence quality as the chain length grows. Among them, DPLM demonstrates robust performance across all lengths, maintaining high pLDDT even for longer sequences. Overall, DPLM is good at generating highly structured protein sequences, while EvoDiff and ESM3 are preferable for better diversity and novelty, with ProGen2 offering a balanced performance across metrics.

3.1.4 Structure and sequence co-design

Table 5: Performance of protein co-design models on structure-sequence co-generation tasks. The reported results are the average of repeated experiments with the standard deviation. We highlight the best performance in bold.
^∗: We have tried our best to reproduce all models according to the instructions in their respective codebases, using publicly available model weights. However, some results may differ from those reported in the original studies. We welcome any feedback and corrections to help us make timely updates in the future.

	length 100				length 200
	Quality		Diversity	Novelty	Quality		Diversity	Novelty
	scTM ↑	scRMSD ↓	Max Clust. ↑	Max TM ↓	scTM ↑	scRMSD ↓	Max Clust. ↑	Max TM ↓
Native PDBs	0.91 $\pm$ 0.11	2.98 $\pm$ 3.49	0.75	N/A	0.88 $\pm$ 0.09	3.24 $\pm$ 3.77	0.77	N/A
ProteinGenerator	0.91 $\pm$ 0.08	3.75 $\pm$ 3.39	0.24	0.73	0.88 $\pm$ 0.09	6.24 $\pm$ 4.10	0.25	0.72
ProtPardelle*	0.56 $\pm$ 0.12	12.9 $\pm$ 1.88	0.57	0.66	0.64 $\pm$ 0.11	13.67 $\pm$ 2.80	0.10	0.69
Multiflow	0.96 $\pm$ 0.04	1.10 $\pm$ 0.71	0.33	0.71	0.95 $\pm$ 0.04	1.61 $\pm$ 1.73	0.42	0.71
ESM3*	0.72 $\pm$ 0.19	13.80 $\pm$ 10.51	0.64	0.41	0.63 $\pm$ 0.20	21.18 $\pm$ 16.19	0.63	0.61
	length 300				length 500
	Quality		Diversity	Novelty	Quality		Diversity	Novelty
	scTM ↑	scRMSD ↓	Max Clust. ↑	Max TM ↓	scTM ↑	scRMSD ↓	Max Clust. ↑	Max TM ↓
Native PDBs	0.92 $\pm$ 0.12	3.94 $\pm$ 4.95	0.75	N/A	0.90 $\pm$ 0.14	9.64 $\pm$ 7.05	0.80	N/A
ProteinGenerator	0.81 $\pm$ 0.14	9.26 $\pm$ 4.13	0.22	0.71	0.69 $\pm$ 0.17	17.00 $\pm$ 5.52	0.18	0.73
ProtPardelle*	0.69 $\pm$ 0.08	14.91 $\pm$ 3.45	0.04	0.72	0.44 $\pm$ 0.12	43.15 $\pm$ 9.86	0.60	0.69
Multiflow	0.96 $\pm$ 0.06	2.14 $\pm$ 3.24	0.58	0.71	0.95 $\pm$ 0.07	2.71 $\pm$ 3.65	0.62	0.71
ESM3*	0.59 $\pm$ 0.21	25.5 $\pm$ 20.68	0.52	0.73	0.64 $\pm$ 0.20	26.72 $\pm$ 21.08	0.46	0.78

In this section, we examine the performance of protein structure-sequence co-generation, a topic that has recently gained significant interest within the research community. We inspect the performance of ProteinGenerator (Lisanza et al., 2023), ProtPardelle (Chu et al., 2024), Multiflow (Campbell et al., ), and ESM3 (Hayes et al., 2024) at different lengths. The performance is assessed using metrics similar to those applied in backbone generation. Note, however, that quality here refers to structure-sequence compatibility, which measures how well the designed sequence can fold into the corresponding designed structure, using scTM and scRMSD. The key difference is that co-design models are tasked with simultaneously generating both the sequence and the structure, while backbone design models require an additional inverse folding model, such as ProteinMPNN, to design the sequence. Other metrics used for evaluation include diversity (max cluster) and novelty (max TM-score to PDB).
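
The difference between the two evaluation settings described above can be sketched as a single self-consistency routine that differs only in where the candidate sequences come from. This is an illustrative outline: `fold` and `tm_score` are hypothetical callables standing in for a folding model and a structure aligner, and taking the best score over several candidates is an assumption rather than a documented detail of the benchmark.

```python
from typing import Callable, Sequence


def self_consistency(
    designed_structure: object,
    sequences: Sequence[str],
    fold: Callable[[str], object],
    tm_score: Callable[[object, object], float],
) -> float:
    """Best TM-score between the designed structure and the refolded sequences."""
    if not sequences:
        raise ValueError("need at least one candidate sequence")
    return max(tm_score(designed_structure, fold(seq)) for seq in sequences)


# Toy stand-ins so the sketch runs end to end; a real pipeline would call a
# folding model and a structure aligner here.
def toy_fold(seq: str) -> str:
    return seq  # pretend the "structure" is the sequence itself


def toy_tm(a: str, b: str) -> float:
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


# Co-design: the model emits the sequence together with the structure.
co_design_score = self_consistency("ACDE", ["ACDE"], toy_fold, toy_tm)

# Backbone design: sequences come from a separate inverse folding model.
backbone_score = self_consistency("ACDE", ["ACDF", "ACGG"], toy_fold, toy_tm)
```

An scRMSD variant would follow the same pattern with a superposition-based RMSD in place of `tm_score` and `min` in place of `max`.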

As shown in Table 5, ProteinGenerator and Multiflow consistently show strong structure-sequence compatibility across all sequence lengths, with high scTM scores (up to 0.96±0.06) and relatively low scRMSD values, indicating superior structural quality in generated sequences. ProteinGenerator particularly excels at shorter lengths, showing a balanced performance between quality and diversity metrics. Multiflow maintains high performance even as sequence length increases, demonstrating its robustness with consistently high scTM scores and lower scRMSD values, which indicates its capability to generate high-quality structures. ProtPardelle and ESM3, on the other hand, show degraded performance as sequence length increases, as indicated by their low scTM scores and very high scRMSD values, suggesting that they struggle to maintain structure quality for longer sequences. Overall, these findings suggest that while ProteinGenerator and Multiflow are both effective models for generating high-quality protein structures, Multiflow is particularly robust across all tested lengths.

3.1.5 Motif-scaffolding

In this section, we evaluate the performance of various motif-scaffolding methods across different scaffolds used in Watson et al. (2023b) and Yim et al. (2024), focusing on their effectiveness in designing scaffold structures. The primary objective of this evaluation is to compare the efficacy of structure-based and sequence-based approaches in generating designable scaffolds. For purely sequence-based methods, e.g., EvoDiff (Alamdari et al., 2023) and DPLM (Wang et al., 2024b), we use ESMFold to predict the structures of their designed motif-scaffold sequences.

Figure 3 reveals a wide range of performance levels among the tested methods, each exhibiting distinct strengths and weaknesses depending on the specific scaffold context. Notably, structure-based methods such as RFdiffusion (Watson et al., 2023b), TDS (Wu et al., 2024b) and FrameFlow (Yim et al., 2024) consistently perform well across most scenarios, with RFdiffusion showing particular robustness in generating a high number of designable scaffolds. This suggests that structure-based methods are highly effective at capturing the intricate structural details necessary for successful scaffold design. In contrast, sequence-based methods like EvoDiff and DPLM display variable performance, excelling in certain scaffolds that are primarily governed by evolutionary constraints, but underperforming in others with more complex structural motifs. This variability may reflect their limitations in recognizing and adapting to specific structural features.

Interestingly, ESM3 (Hayes et al., 2024), the latest sequence-based method and multimodal language model, capable of perceiving tertiary features through structure tokenization, demonstrates competitive performance in generating designable scaffolds across most cases. Its performance is comparable to that of more advanced structure-based models. This suggests that multimodal language models like ESM3 may effectively integrate structural capabilities within a unified language modeling framework, making them versatile tools for scaffold design. However, ESM3 does not consistently approach structure-based methods across all scenarios, indicating that while multimodal protein language models hold promise, further refinement and optimization are needed to achieve more consistent performance across different structural challenges.

Overall, our findings underscore that no single model currently excels universally across all scaffolds, highlighting the importance of selecting a motif-scaffolding method that aligns with the specific design objectives. Future research should explore the integration of these methods to capitalize on their respective strengths, potentially leading to more robust and versatile scaffold design capabilities.

3.1.6 Antibody design

Table 6: Performance of antibody design methods on 55 antibody-antigen complexes from the RAbD dataset. For methods that can generate multiple antibodies (marked with *), the standard deviations between different antibodies generated against the same antigen are also reported.

	Accuracy			Functionality	Specificity
	AAR ↑	RMSD ↓	TM-score ↑	Binding Energy ↓	SeqSim-outer ↓	SeqSim-inner ↑	PHR ↓
RAbD (natural)	100.00%	0.00	1.00	-15.33	0.26	N/A	45.78%
HERN	33.17%	9.86	0.16	1242.77	0.41	N/A	39.83%
MEAN	33.47%	1.82	0.25	263.90	0.65	N/A	40.74%
dyMEAN	40.95%	2.36*	0.36	889.28	0.58	N/A	42.04%
dyMEAN-FixFR	40.05% $\pm$ 1.06	2.37 $\pm$ 0.03	0.35 $\pm$ 0.01	612.75 $\pm$ 56.03	0.60	0.96	43.75% $\pm$ 2.24
DiffAb	35.04% $\pm$ 8.36	2.53 $\pm$ 0.60	0.37 $\pm$ 0.06	489.42 $\pm$ 499.76	0.37	0.45	40.68% $\pm$ 10.65
AbDPO	31.29% $\pm$ 7.29	2.79 $\pm$ 3.01	0.35 $\pm$ 0.06	116.06 $\pm$ 186.06	0.38	0.60	69.69% $\pm$ 8.49
AbDPO++	36.25% $\pm$ 7.95	2.48 $\pm$ 0.59	0.35 $\pm$ 0.06	223.73 $\pm$ 281.7	0.39	0.54	44.51% $\pm$ 9.55
	Rationality
	CN-score ↑	Clashes-inner ↓	Clashes-outer ↓	SeqNat↑	Total Energy ↓	scRMSD ↓
RAbD (natural)	50.19	0.07	0.00	-1.74	-16.76	1.77
HERN	0.04	0.04	3.25	-1.47	5408.74	9.89
MEAN	1.33	11.65	0.29	-1.83	1077.32	2.77
dyMEAN	1.49	9.15	0.47	-1.79	1642.65	2.11
dyMEAN-FixFR	1.14 $\pm$ 1.71	8.88 $\pm$ 0.55	0.48 $\pm$ 0.12	-1.82 $\pm$ 0.10	1239.29 $\pm$ 113.84	2.48 $\pm$ 0.24
DiffAb	2.02 $\pm$ 2.83	1.84 $\pm$ 1.35	0.19 $\pm$ 0.31	-1.88 $\pm$ 0.47	495.69 $\pm$ 350.96	2.57 $\pm$ 0.77
AbDPO	1.33 $\pm$ 2.31	4.14 $\pm$ 1.84	0.10 $\pm$ 0.24	-1.99 $\pm$ 0.34	270.12 $\pm$ 217.45	2.79 $\pm$ 3.25
AbDPO++	2.34 $\pm$ 3.20	1.66 $\pm$ 1.28	0.08 $\pm$ 0.20	-1.78 $\pm$ 0.43	338.14 $\pm$ 266.48	2.50 $\pm$ 0.75

In this section, we selected five antigen-specific antibody design methods (HERN (Jin et al., 2022), MEAN (Kong et al., 2022), dyMEAN (Kong et al., 2023), DiffAb (Luo et al., 2022), and AbDPO (Zhou et al., 2024)) and two of their variants (dyMEAN-FixFR, implemented according to Appendix B.1.2, and AbDPO++), for a total of seven methods, and evaluated their performance in CDR-H3 generation for the given antigens. All methods were trained on the same dataset with the parameters reported in the corresponding papers and tested on a common set of 55 test cases from the RAbD dataset (Adolf-Bryfogle et al., 2018); see Appendix B.1.3 for details. Notably, dyMEAN-FixFR is not an official variant of dyMEAN; we modified dyMEAN to align its task setting with the other methods and allow it to generate different antibodies for the same antigen. The final evaluation results are shown in Table 6. For each evaluation metric, we highlight the best performance in bold and underline the second-best; the detailed implementation of each metric is described in Appendix B.1.4.

In the Accuracy evaluation, dyMEAN and MEAN achieved the best performance in terms of sequence and structure (highest AAR and lowest RMSD), while DiffAb performed best in TM-score. However, considering multiple evaluation metrics, these methods did not perform as well overall. Additionally, apart from HERN, there were no significant performance differences among the other methods.

In the evaluation of Functionality, all methods produced antibodies with binding energies to the given antigens significantly higher than those of natural antibodies. AbDPO and AbDPO++ achieved the best performance among all methods by aligning on binding energy.

In the Specificity evaluation of antibodies, we mainly observed the sequence similarity between antibodies against different antigens (SeqSim-outer) and the proportion of hydrophobic residues in the generated antibodies (PHR). The former metric indicates whether the method can design antibodies specific to a given antigen, while the latter reflects the potential non-specific binding due to high hydrophobicity.

• In SeqSim-outer, we noted that MEAN and dyMEAN generated highly similar sequences for different antigens (the maximum SeqSim-outer in our test set was 0.79, indicating that all antibody differences came only from length variations). This suggests that their excellent AAR might stem from learning high-frequency patterns in antibody sequences, generating antibodies according to these patterns for different antigens. In contrast, DiffAb and AbDPO performed the best.
• For methods that can generate different antibodies for the same antigen, we also measured the sequence similarity among different antibodies generated for the same antigen (SeqSim-inner). We expect antibodies generated for the same antigen to be more similar. In this aspect, dyMEAN-FixFR and AbDPO performed the best. However, the 0.96 SeqSim-inner of dyMEAN-FixFR indicates that despite introducing randomness during model initialization, the final sequence generation showed almost no differences. Additionally, DiffAb, which performed best in SeqSim-outer, generated less similar antibodies for the same antigen, suggesting possible underfitting in sequence generation. Considering both types of SeqSim, AbDPO achieved the best performance.
• In PHR, HERN and dyMEAN performed the best, but overall, almost all methods performed better than natural antibodies. Only AbDPO generated an excessive number of hydrophobic residues, reducing specificity. However, its variant, AbDPO++, controlled PHR well, matching natural antibodies most closely among all methods.
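
The three specificity metrics discussed above can be sketched as follows. The exact definitions are given in Appendix B.1.4 and are not reproduced here, so this sketch substitutes stand-ins: the hydrophobic residue set is an assumed alphabet, and sequence similarity is approximated with `difflib.SequenceMatcher` from the Python standard library rather than the benchmark's own measure. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

# Assumed hydrophobic alphabet; the benchmark's exact residue set may differ.
HYDROPHOBIC = frozenset("AVILMFWP")


def phr(seq: str) -> float:
    """Proportion of hydrophobic residues in a CDR-H3 sequence."""
    return sum(aa in HYDROPHOBIC for aa in seq) / len(seq)


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] that tolerates different lengths."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def seqsim_inner(designs_by_antigen: dict[str, list[str]]) -> float:
    """Mean pairwise similarity among designs that target the same antigen."""
    return mean(
        similarity(a, b)
        for designs in designs_by_antigen.values()
        for a, b in combinations(designs, 2)
    )


def seqsim_outer(designs_by_antigen: dict[str, list[str]]) -> float:
    """Mean pairwise similarity between designs that target different antigens."""
    return mean(
        similarity(a, b)
        for (_, da), (_, db) in combinations(designs_by_antigen.items(), 2)
        for a in da
        for b in db
    )


# Toy designs (hypothetical sequences, not benchmark outputs).
designs = {"antigen_1": ["ARDYW", "ARDYW"], "antigen_2": ["GGSSK"]}
print(seqsim_inner(designs), seqsim_outer(designs), phr("ARDYW"))
```

Under this reading, a specific and consistent method would show a low outer similarity together with a high, but not near-identical, inner similarity.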

The Rationality evaluation includes three aspects: structural rationality, sequence rationality, and joint structural and sequence rationality.

• In structural rationality, we focused on the score for peptide bond lengths conforming to the natural peptide bond length distribution (CN-score), the number of potential internal clashes in the generated structure (Clashes-inner), and the clashes between the generated structure and other parts (Clashes-outer). It was evident that irrational structures were prevalent in generated antibodies, but overall, diffusion-based methods performed better. AbDPO++ and DiffAb achieved the best performance among all methods. HERN and MEAN/dyMEAN exhibited different tendencies in Clashes-inner/outer, corresponding to our observations of the generated samples. HERN tends to generate large CDR-H3 structures, leading to fewer internal clashes but more clashes with the antigen, whereas MEAN/dyMEAN tend to generate smaller CDR-H3 structures.
• In sequence rationality, we used the inverse perplexity of AntiBERTy (Ruffolo et al., 2021) to represent sequence naturalness, SeqNat, showing that HERN performed the best, possibly due to HERN being the only auto-regressive model. AbDPO++ achieved the second-best performance and was closest to natural antibodies.
• In the joint evaluation of structure and sequence, we mainly focused on the consistency between the generated structure and sequence from two perspectives: physical energy and structure prediction. In terms of physical energy, we calculated the total energy of the generated CDR-H3s (Total Energy), which would be severely affected by the clashes caused by sidechains and thus reflect the irrationality between the generated structure and sequence. In this energy-related metric, AbDPO and AbDPO++ performed best among all methods. From the perspective of structure prediction, we used IgFold (Ruffolo et al., 2023) to predict the structure of the generated sequence, performed a post-optimization with the antigen as the condition, and calculated the CA-RMSD between the predicted structure and the generated structure (scRMSD). dyMEAN and dyMEAN-FixFR performed best in scRMSD. Although these two metrics both reflect the consistency between sequence and structure, they focus on different aspects. Moreover, both energy calculations and structure predictions have inherent errors, so the performance of different methods may not be consistent across these two metrics.
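
The clash counts used under structural rationality above can be illustrated with a simple distance test. This is a rough sketch, not the benchmark's implementation: the 3.0 Å cutoff, the rule for skipping sequence-adjacent atoms, and the function names are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist

Point = tuple[float, float, float]


def count_inner_clashes(
    atoms: list[Point], cutoff: float = 3.0, min_separation: int = 2
) -> int:
    """Count atom pairs within one structure closer than `cutoff` angstroms.

    Pairs fewer than `min_separation` positions apart are skipped, since
    chain neighbours are expected to be close.
    """
    return sum(
        1
        for (i, a), (j, b) in combinations(enumerate(atoms), 2)
        if j - i >= min_separation and dist(a, b) < cutoff
    )


def count_outer_clashes(
    atoms: list[Point], environment: list[Point], cutoff: float = 3.0
) -> int:
    """Count pairs between the generated atoms and the rest of the complex."""
    return sum(1 for a in atoms for b in environment if dist(a, b) < cutoff)


# Toy coordinates in angstroms (hypothetical): the last atom folds back
# onto the start of the chain.
cdr = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (1.0, 0.5, 0.0)]
antigen = [(7.0, 0.0, 0.0), (50.0, 50.0, 50.0)]
print(count_inner_clashes(cdr), count_outer_clashes(cdr, antigen))
```

With such counts, a large loop that extends into the antigen would score few inner clashes but many outer ones, which is the pattern reported for HERN.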

Overall, evaluating antibody design methods encompasses various aspects, and using only a few metrics will seriously mislead researchers’ understanding of model performance. Moreover, we must recognize that no single method outperformed all others across the board, and all methods showed substantial gaps compared to natural antibodies. The discrepancies may come from the severe lack of structured data, causing models to focus on certain sequence patterns or structures. Additionally, most models do not perform atomic-level modeling of antibodies and antigens, preventing accurate interaction modeling. New task paradigms must be developed to overcome current challenges in antibody design. Nonetheless, AbDPO++, by utilizing synthetic data and aligning with various properties, achieved one of the best performances in almost all aspects among all methods, without exhibiting obvious weaknesses.

3.2 Protein Conformation Prediction

In the second part of ProteinBench, we focus on conformation prediction, another class of cross-modality tasks aimed at predicting protein structures (conformations) from their sequences. While the current models are based on a body of work distinct from the design tasks, the ability to predict protein conformations provides insight into a model’s understanding of the physics and dynamics of protein structures. This capability is essential for future protein foundation models to fully understand, predict, and design proteins that embody the key sequence-structure-function relationships.

The development of conformation prediction models is still in its early stages, with only a handful of exploratory methods proposed. A comprehensive comparison between these methods has yet to be conducted. To the best of our knowledge, this is the first benchmark study on current conformation prediction models that includes the major strategies proposed to date: (1) perturbing the sequence input of folding models (Del Alamo et al., 2022; Wayment-Steele et al., 2024); (2) perturbing protein structures through a structural-only diffusion model (Lu et al., 2024); (3) training generative models on large-scale structural data from experiments or simulations (Jing et al., 2023; 2024; Wang et al., 2024c; Zheng et al., 2024); (4) improving the conformational sampling using physical models (Zheng et al., 2024; Wang et al., 2024c).

3.2.1 Protein Folding: single-state prediction

While most folding models, such as AlphaFold2 (Jumper et al., 2021) and ESMFold (Lin et al., 2023), are not generative in nature, we still consider them “protein foundation models” for conformation prediction because (1) they are trained on large amounts of structural and/or sequence data; (2) they have played a fundamental role in understanding sequence-structure relationships; and (3) they are closely related to foundation models like ESM2 (Rives et al., 2021) and AlphaFold3 (Abramson et al., 2024), and serve as pivotal building blocks for conformation prediction models (Jing et al., 2024; Wang et al., 2024c). Therefore, it is essential to benchmark their performance in discussions of protein conformation prediction.

In Table 7, we summarize the results of folding models on CAMEO2022. For AlphaFold2, OpenFold, and EigenFold, we predict five structures and report the best structure based on the model’s internal confidence scores. Consistent with previous reports (Jing et al., 2023), folding models based on Multiple Sequence Alignment (MSA) (AlphaFold2, OpenFold, RoseTTAFold2) outperform folding models based on protein language models (ESMFold, EigenFold). AlphaFold2 and its faithful reproduction, OpenFold, demonstrate the best performance across all accuracy metrics. The quality of the predicted structures is comparable among the models (except for EigenFold), with minimal inter-residue clashes or bond breaks. EigenFold (Jing et al., 2023) is one of the first diffusion generative models to claim both protein folding and conformation prediction capabilities; however, its performance may be limited by several design factors: it is built on OmegaFold (Wu et al., 2022), uses a coarse-grained representation with only alpha carbons, and has a small model size of 572K trainable parameters. In summary, although the MSA search is time-consuming, AlphaFold2 and OpenFold still achieve the best performance in both accuracy and structural quality for protein folding, providing a strong foundation for protein conformation prediction.
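
The reporting protocol described above (select one of five predictions by model confidence, then summarize each metric as mean/median over targets) can be sketched as below. The dictionary keys, function names, and numbers are hypothetical; they only illustrate that selection uses the model's own confidence and not the accuracy against the reference structure.

```python
from statistics import mean, median


def best_by_confidence(predictions: list[dict]) -> dict:
    """Pick the prediction with the highest model-reported confidence."""
    return max(predictions, key=lambda p: p["confidence"])


def summarize(per_target_values: list[float]) -> str:
    """Format a metric as mean/median over targets."""
    return f"{mean(per_target_values):.3f}/{median(per_target_values):.3f}"


# Toy predictions for one target (hypothetical numbers). The third sample has
# the highest TM-score but is not selected, because selection cannot look at
# the ground truth.
predictions = [
    {"confidence": 0.71, "tm": 0.80},
    {"confidence": 0.90, "tm": 0.95},
    {"confidence": 0.65, "tm": 0.97},
]
chosen = best_by_confidence(predictions)

# Toy per-target TM-scores of the chosen predictions across three targets.
print(summarize([chosen["tm"], 0.60, 0.90]))  # 0.817/0.900
```

Reporting both statistics is informative because a few poorly predicted targets can pull the mean well below the median.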

Table 7: Performance of protein folding on the CAMEO2022 dataset. Results are reported as mean/median over 183 proteins. The best performance is highlighted in bold, and the second-best is underlined. “N/A” indicates not applicable. *Unknown amino acids (“X”) in the sequence need to be removed for EigenFold, which may introduce slight differences in metric values.

	Accuracy				Quality
	TM-score ↑	RMSD ↓	GDT-TS ↑	lDDT ↑	CA clash (%) ↓	CA break (%) ↓	PepBond break (%) ↓
AlphaFold2	0.871/0.952	3.21/1.64	0.860/0.921	0.904/0.933	0.3/0.0	0.0/0.0	4.8/4.1
OpenFold	0.870/0.947	3.21/1.59	0.856/0.913	0.899/0.933	0.4/0.0	0.0/0.0	2.0/1.7
RoseTTAFold2	0.859/0.941	3.52/1.75	0.845/0.903	0.892/0.926	0.3/0.0	0.2/0.0	5.5/4.0
ESMFold	0.847/0.929	3.98/2.10	0.826/0.881	0.870/0.907	0.3/0.0	0.0/0.0	4.7/3.4
EigenFold*	0.743/0.823	7.65/3.73	0.703/0.781	0.737/0.810	8.0/4.6	0.5/0.0	N/A

3.2.2 Multiple-state prediction

In this section, we evaluate the performance of predicting multiple conformational states for five conformation prediction models and their variants: EigenFold (Jing et al., 2023), MSA-subsampling (Del Alamo et al., 2022), Str2Str (Lu et al., 2024), AlphaFlow/ESMFlow (Jing et al., 2024), and ConfDiff (Wang et al., 2024c). Here we highlight the key differences among these methods. MSA-subsampling perturbs OpenFold’s model input by reducing the number of input MSAs (referred to as “depth”), allowing the sampling of different conformations for a protein. Str2Str uses a structure-only diffusion model (i.e., a backbone design model) to generate conformations by perturbing initial folding predictions through a forward-backward diffusion process. The level of perturbation is controlled by the maximum diffusion time, $T_{\text{max}}$, and ensemble outputs are generated by sampling structures at various diffusion times $t\leq T_{\text{max}}$. EigenFold, AlphaFlow/ESMFlow, and ConfDiff take a similar approach by training diffusion or flow models through fine-tuning folding models using structural data from PDB. While AlphaFlow/ESMFlow open all layers of the original model for fine-tuning, EigenFold and ConfDiff only use pretrained representations from folding models and train a lightweight module for score or structural prediction. Additionally, both AlphaFlow/ESMFlow and ConfDiff provide versions further fine-tuned on a recent MD conformation dataset (Vander Meersche et al., 2024), indicated by the “-MD” suffix.

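
The idea behind MSA-subsampling can be illustrated with a small helper that reduces an alignment to a target depth. This is a sketch under stated assumptions: the helper is hypothetical and is not OpenFold's interface, and keeping the query row while drawing the remaining rows uniformly at random is an assumed sampling rule.

```python
import random
from typing import Optional


def subsample_msa(msa: list[str], depth: int, seed: Optional[int] = None) -> list[str]:
    """Keep the query (first row) and draw up to depth-1 other rows at random."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not msa:
        raise ValueError("the MSA must contain at least the query sequence")
    query, others = msa[0], msa[1:]
    k = min(depth - 1, len(others))
    rng = random.Random(seed)
    # Sampling without replacement, so no aligned row is repeated.
    return [query] + rng.sample(others, k)


# Toy alignment (hypothetical rows): one query and 100 homologs.
msa = ["QUERY"] + [f"HOMOLOG_{i}" for i in range(100)]
shallow = subsample_msa(msa, depth=32, seed=0)
print(len(shallow), shallow[0])
```

Repeating the draw with different seeds yields different shallow alignments, and each one is passed to the folding model to obtain one conformation; smaller depths weaken the evolutionary signal and tend to increase the spread of predicted structures.
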
ConfDiff (Wang et al., 2024c) introduced two guidance techniques to improve conformational sampling: (1) classifier-free guidance, which combines a sequence-conditioned conformation model with an unconditional (structure-only) model to explore conformational space (ConfDiff-ClsFree), and (2) energy/force guidance, which directs sampling toward regions with lower potential energy (ConfDiff-Energy/Force) through auxiliary prediction modules for intermediate energy/force guidance. However, such physical prediction modules are dataset-specific and are only available for the BPTI dataset.
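The MSA-subsampling perturbation described above can be illustrated with a minimal sketch. It assumes the MSA is a list of aligned sequences with the query first, and that rows are drawn uniformly at random; the function name `subsample_msa` and the sampling scheme are illustrative assumptions, not the procedure used in the benchmarked implementation.

```python
import random

def subsample_msa(msa, depth, seed=None):
    """Return a reduced MSA containing at most `depth` sequences.

    `msa` is a list of aligned sequences whose first entry is the query.
    The query is always kept; the remaining rows are drawn uniformly at
    random without replacement (an assumption of this sketch).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    rng = random.Random(seed)
    query, others = msa[0], msa[1:]
    k = min(depth - 1, len(others))
    return [query] + rng.sample(others, k)
```

Calling the function repeatedly with different seeds gives different shallow MSAs, each of which would be passed to the folding model to obtain one conformation.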

Table 8: Performance on the multiple-state prediction of BPTI. Accuracy metrics (RMSDens, RMSD Cluster 3) are reported as the mean and standard deviation from 20 bootstrap samples with replacement, at different sample sizes ($N=10\sim 1000$). Diversity and Quality scores are evaluated based on 1,000 conformations for each model. The best performance is highlighted in bold, and the second-best is underlined. “N/A” indicates not applicable due to model resolution. RMSD is measured in Å.

	RMSDens ↓ (N=10)	RMSDens ↓ (N=100)	RMSDens ↓ (N=500)	RMSDens ↓ (N=1000)	RMSD Cluster 3 ↓ (N=10)	RMSD Cluster 3 ↓ (N=100)	RMSD Cluster 3 ↓ (N=500)	RMSD Cluster 3 ↓ (N=1000)	Diversity: Pairwise RMSD	Quality: clash% ↓	Quality: break% ↓	Quality: PepBond break% ↓
EigenFold	1.56±0.02	1.50±0.01	1.47±0.01	1.46±0.00	2.54±0.03	2.48±0.01	2.46±0.01	2.46±0.01	0.85	1.4	4.3	N/A
MSA-depth256	1.57±0.01	1.54±0.01	1.52±0.00	1.52±0.01	2.51±0.02	2.47±0.02	2.45±0.02	2.45±0.02	0.20	0.0	0.0	9.2
MSA-depth64	1.60±0.02	1.54±0.01	1.51±0.01	1.50±0.01	2.48±0.03	2.40±0.04	2.35±0.04	2.33±0.03	0.55	0.0	0.0	7.9
MSA-depth32	1.67±0.05	1.53±0.04	1.45±0.04	1.41±0.02	2.39±0.15	2.21±0.15	1.93±0.15	1.87±0.06	2.14	0.6	0.0	10.6
Str2Str-ODE ($T_{\text{max}}=0.15$)	2.36±0.10	2.19±0.06	2.10±0.02	2.08±0.01	3.03±0.17	2.68±0.12	2.60±0.05	2.56±0.02	1.86	0.0	0.0	13.9
Str2Str-SDE ($T_{\text{max}}=0.15$)	2.83±0.23	2.48±0.11	2.28±0.04	2.25±0.03	3.42±0.32	2.92±0.28	2.52±0.14	2.48±0.13	3.60	0.3	0.0	16.0
AlphaFlow-PDB	1.53±0.02	1.45±0.01	1.42±0.01	1.41±0.01	2.48±0.04	2.43±0.02	2.41±0.02	2.40±0.01	0.86	0.0	0.0	13.2
AlphaFlow-MD	1.74±0.09	1.51±0.04	1.45±0.02	1.43±0.02	2.44±0.06	2.32±0.06	2.28±0.04	2.24±0.00	1.26	0.0	0.1	26.2
ESMFlow-PDB	1.61±0.04	1.49±0.02	1.44±0.01	1.42±0.01	2.47±0.05	2.41±0.03	2.37±0.03	2.35±0.01	0.74	0.0	0.0	6.0
ESMFlow-MD	1.66±0.07	1.50±0.04	1.41±0.03	1.40±0.02	2.49±0.06	2.29±0.09	2.20±0.04	2.18±0.03	1.17	0.0	0.0	14.3
ConfDiff-Open-ClsFree	1.65±0.05	1.48±0.05	1.41±0.04	1.37±0.03	2.56±0.05	2.30±0.23	2.16±0.20	2.03±0.13	1.77	0.5	0.0	5.5
ConfDiff-Open-MD	1.64±0.04	1.50±0.02	1.44±0.02	1.42±0.02	2.49±0.08	2.39±0.05	2.32±0.03	2.31±0.02	1.37	0.2	0.0	4.6
ConfDiff-ESM-ClsFree	1.58±0.05	1.45±0.02	1.41±0.01	1.39±0.00	2.50±0.05	2.39±0.03	2.35±0.03	2.33±0.02	1.52	0.5	0.0	7.5
ConfDiff-ESM-MD	1.61±0.03	1.47±0.02	1.42±0.01	1.40±0.01	2.45±0.08	2.32±0.06	2.26±0.04	2.24±0.01	1.42	0.1	0.0	5.0
ConfDiff-ESM-Energy	1.63±0.06	1.47±0.01	1.43±0.01	1.42±0.01	2.55±0.07	2.43±0.04	2.41±0.02	2.40±0.01	1.26	0.1	0.0	7.5
ConfDiff-ESM-Force	1.58±0.06	1.44±0.03	1.37±0.02	1.36±0.01	2.45±0.09	2.33±0.07	2.23±0.06	2.22±0.06	1.76	0.1	0.0	8.9

In Table 8, we summarize the results on predicting the five structural clusters of BPTI. Specifically, the ensemble accuracy over the five clusters (RMSDens) and the accuracy on Cluster 3 (RMSD Cluster 3), the most difficult cluster to sample, are evaluated by bootstrapping at different sample sizes. ConfDiff models with classifier-free guidance demonstrated the best overall accuracy (RMSDens) across most sample sizes, highlighting the greater effectiveness of this guidance on BPTI compared to fine-tuning on MD conformation data. Consistent with Wang et al. (2024c), ConfDiff-ESM-Force achieved the highest ensemble accuracy, suggesting that incorporating physical information can improve the sampling of high-accuracy conformations. For the task of sampling Cluster 3, MSA subsampling, despite being a simple approach, generated the conformations most likely to capture this remote state. As the MSA depth decreases, sample diversity increases, allowing the model to sample closer to Cluster 3. In comparison, Str2Str models do not perform well on this task, potentially because the structure-only approach does not ensure that the perturbed structure remains faithful to the provided sequence. EigenFold shows low diversity, which may limit its effectiveness in sampling diverse conformations. AlphaFlow/ESMFlow also demonstrated competitive performance. Fine-tuning on the MD dataset provided higher diversity and improved accuracy in sampling Cluster 3 compared to the PDB-trained base models. However, we also observed a more pronounced decline in quality after fine-tuning on MD conformation data, specifically an increased rate of peptide bond breaking between residues.
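The bootstrapped accuracy metrics can be sketched as follows. The sketch assumes that RMSDens is the mean, over the reference clusters, of the lowest RMSD achieved by any sampled conformation; the text does not spell out this formula, so it is an assumption, as are the function names and the use of the sample standard deviation.

```python
import random
import statistics

def rmsd_ens(rmsd_rows):
    """Mean over clusters of the best (lowest) RMSD among the given samples.

    `rmsd_rows[i][c]` is the RMSD (in Å) of sample i to reference cluster c.
    """
    n_clusters = len(rmsd_rows[0])
    best = [min(row[c] for row in rmsd_rows) for c in range(n_clusters)]
    return sum(best) / n_clusters

def bootstrap_metric(rmsd_rows, sample_size, n_boot=20, seed=0):
    """Mean and standard deviation of rmsd_ens over bootstrap resamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        # Draw `sample_size` conformations with replacement.
        resample = rng.choices(rmsd_rows, k=sample_size)
        values.append(rmsd_ens(resample))
    return statistics.mean(values), statistics.stdev(values)
```

Under the same assumption, the single-cluster metric (RMSD Cluster 3) would replace `rmsd_ens` with the minimum over one column of `rmsd_rows`.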

Table 9: Performance on the conformation prediction task for the apo-holo dataset. apo/holo-TM represents the maximum TM-score of the samples relative to the reference apo/holo structure. Twenty conformations were sampled for each protein, and the results are reported as mean/median across 91 proteins. The best performance is highlighted in bold, and the second-best is underlined. “N/A” indicates not applicable due to model resolution.

	Accuracy: apo-TM ↑	Accuracy: holo-TM ↑	Accuracy: TMens ↑	Diversity: Pairwise TM	Quality: CA clash % ↓	Quality: CA break % ↓	Quality: PepBond break % ↓
apo model	1.000/1.000	0.790/0.821	0.895/0.910	N/A	N/A	N/A	N/A
EigenFold	0.831/0.862	0.864/0.900	0.847/0.874	0.907/0.958	3.6/1.2	0.3/0.1	N/A
MSA-depth256	0.845/0.882	0.889/0.936	0.867/0.894	0.978/0.993	0.2/0.0	0.0/0.0	4.6/4.0
MSA-depth64	0.844/0.877	0.883/0.927	0.863/0.906	0.950/0.980	0.2/0.0	0.0/0.0	5.7/5.0
MSA-depth32	0.824/0.865	0.857/0.905	0.841/0.882	0.864/0.924	0.2/0.0	0.0/0.0	8.9/7.3
Str2Str-ODE ($T_{\text{max}}=0.1$)	0.762/0.791	0.778/0.816	0.770/0.794	0.954/0.956	0.2/0.0	0.0/0.0	14.0/12.4
Str2Str-ODE ($T_{\text{max}}=0.3$)	0.766/0.797	0.781/0.818	0.774/0.797	0.872/0.871	0.2/0.0	0.0/0.0	14.7/12.9
Str2Str-SDE ($T_{\text{max}}=0.1$)	0.682/0.703	0.693/0.717	0.688/0.712	0.760/0.748	0.2/0.1	1.5/1.5	22.6/21.3
Str2Str-SDE ($T_{\text{max}}=0.3$)	0.680/0.685	0.689/0.718	0.684/0.697	0.639/0.604	0.2/0.1	1.4/1.4	21.1/19.6
AlphaFlow-PDB	0.855/0.896	0.891/0.942	0.873/0.900	0.924/0.955	0.3/0.0	0.0/0.0	6.6/6.7
AlphaFlow-MD	0.857/0.888	0.863/0.913	0.860/0.892	0.894/0.918	0.2/0.0	0.0/0.0	20.8/20.6
ESMFlow-PDB	0.849/0.878	0.882/0.924	0.866/0.900	0.935/0.952	0.3/0.1	0.0/0.0	4.8/4.7
ESMFlow-MD	0.851/0.882	0.864/0.908	0.858/0.890	0.897/0.922	0.1/0.0	0.0/0.0	10.9/10.9
ConfDiff-Open-ClsFree	0.838/0.886	0.879/0.927	0.859/0.885	0.870/0.898	0.8/0.6	0.0/0.0	5.8/5.6
ConfDiff-Open-MD	0.839/0.881	0.874/0.918	0.857/0.890	0.863/0.892	0.4/0.2	0.0/0.0	6.8/6.8
ConfDiff-ESM-ClsFree	0.837/0.883	0.864/0.907	0.850/0.887	0.846/0.869	0.7/0.6	0.0/0.0	4.6/4.5
ConfDiff-ESM-MD	0.836/0.877	0.862/0.908	0.849/0.892	0.846/0.875	0.3/0.2	0.0/0.0	4.1/4.0

We next move on to a larger dataset, apo-holo, which includes 91 proteins with ligand-binding-induced conformational changes (Table 9). In this task, models are required to predict both the unbound (apo) and bound (holo) structures. Interestingly, we found that the best-performing models are those that most closely resemble folding models (e.g., MSA-depth256, AlphaFlow-PDB). Despite using a small perturbation level ($T_{\text{max}}=0.1$), Str2Str does not accurately predict either the apo or holo structures. Strategies to improve sample diversity, such as decreasing MSA depth, fine-tuning on MD conformation data, or using classifier-free guidance, generally do not improve (and sometimes even harm) the TMens score. Additionally, we included a baseline that always predicts the perfect apo structure (the apo model), which achieved a higher TMens score than all current models. These findings suggest that a strong folding model improves sampling quality, but current models still struggle to outperform a “perfect” apo predictor: the apo-holo conformational change remains difficult for existing protein conformation models.
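As a worked illustration of the accuracy columns in Table 9, the sketch below computes apo-TM and holo-TM as the maximum TM-score over the samples, following the table caption. It further assumes that TMens is the mean of the two per-protein scores; this is an inference from the mean values in Table 9 (for example, 1.000 and 0.790 give 0.895 for the apo model), not a definition stated in the text. The function name is hypothetical.

```python
def apo_holo_scores(tm_to_apo, tm_to_holo):
    """Return (apo-TM, holo-TM, TMens) for one protein.

    `tm_to_apo[k]` and `tm_to_holo[k]` are the TM-scores of sample k against
    the apo and holo reference structures. TMens is taken as the mean of the
    two best scores, which is an assumption of this sketch.
    """
    apo_tm = max(tm_to_apo)
    holo_tm = max(tm_to_holo)
    return apo_tm, holo_tm, (apo_tm + holo_tm) / 2
```

A model that samples only near the apo state scores well on apo-TM but is capped on TMens by its holo-TM, which is why improving diversity alone does not necessarily raise this metric.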

In the multiple-state prediction task, we observed that strategies such as MSA subsampling, classifier-free guidance, force guidance, and training on MD conformation data improved sample diversity and accuracy on the BPTI dataset. However, most of these strategies failed to improve the dual-state prediction task for apo-holo conformational changes.

3.2.3 Distribution prediction

In this final task, we benchmark models on the ATLAS test set, which includes 82 proteins, and focus on each model’s ability to recover the conformational distribution observed in classic protein molecular dynamics simulations. The results are summarized in Table 10. For comparison, we include reference performances of (1) i.i.d. samples (MD iid) from MD-generated structures and (2) 250 consecutive samples, corresponding to 2.5 ns of simulation time (MD 2.5 ns).

Overall, generative models trained to sample protein conformations from sequence (AlphaFlow/ESMFlow, ConfDiff) perform considerably better than perturbation-based methods (MSA subsampling and Str2Str) across nearly all accuracy metrics of flexibility prediction, distributional accuracy, and ensemble observables. In our experiments, adjusting perturbation levels for Str2Str (the maximum forward time $T_{\text{max}}$) and MSA-subsampling (the MSA depth) did not improve distribution-related metrics, suggesting that perturbation alone may not be sufficient to accurately capture the sample distribution. We identified two factors that consistently improve the performance of AlphaFlow/ESMFlow and ConfDiff: (1) choosing a strong base folding model (e.g., AlphaFold or OpenFold), which, while potentially reducing sample diversity, improves distribution-related metrics; and (2) fine-tuning on MD conformation data, which further enhances the models’ ability to predict the target distribution. The latter again emphasizes the importance of aligning the model’s distribution to the target distribution through supervised methods, rather than relying solely on conformation exploration strategies (e.g., classifier-free guidance), to accurately predict distributions. In addition, the results align with previous tasks regarding the trade-offs between diversity, prediction performance, and sample quality: for example, while fine-tuning on MD conformation data improves sample diversity and prediction performance for AlphaFlow/ESMFlow, it also significantly increases the level of peptide bond breaking in these models.

While current conformation prediction models have shown promising signs in capturing dynamics-related features and approximating conformation distributions, it is important to note that there remains a clear gap between the performance of these models and that of classic MD simulations, even at short simulation times (e.g., 2.5 ns). Performance comparable to i.i.d. sampling from MD conformations has yet to be achieved.
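The flexibility metrics reported in Table 10 are Pearson correlations between predicted and MD-derived fluctuation profiles. A minimal sketch of the RMSF variant is given below, assuming that all conformations have already been superimposed on a common reference and that only C-alpha coordinates are used; both are assumptions of this illustration, and the function names are hypothetical.

```python
import math

def rmsf(ensemble):
    """Per-residue root mean square fluctuation for a pre-aligned ensemble.

    `ensemble[k][i]` is the (x, y, z) position of residue i in conformation k.
    """
    n_conf, n_res = len(ensemble), len(ensemble[0])
    out = []
    for i in range(n_res):
        mean = [sum(conf[i][d] for conf in ensemble) / n_conf for d in range(3)]
        msd = sum(
            sum((conf[i][d] - mean[d]) ** 2 for d in range(3))
            for conf in ensemble
        ) / n_conf
        out.append(math.sqrt(msd))
    return out

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

The per-target score for one protein would then be `pearson_r(rmsf(predicted), rmsf(md))`, with the reported value being the median over the test proteins as stated in the Table 10 caption.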

Table 10: Performance on distribution prediction for the ATLAS test set. A total of 250 structures were sampled for each protein, and the median values across 82 proteins are reported. The best performance is highlighted in bold, and the second-best is underlined. *These metrics require all-atom or backbone predictions; therefore, EigenFold and Str2Str do not have sufficient resolution for evaluation (indicated as “N/A”).

	Diversity: Pairwise RMSD	Diversity: *RMSF	Flexibility (Pearson r): Pairwise RMSD ↑	Flexibility (Pearson r): *Global RMSF ↑	Flexibility (Pearson r): *Per target RMSF ↑	Distributional accuracy: *RMWD ↓	Distributional accuracy: MD PCA W2 ↓	Distributional accuracy: Joint PCA W2 ↓	Distributional accuracy: PC sim > 0.5 % ↑
MD iid	2.76	1.63	0.96	0.97	0.99	0.71	0.76	0.70	93.9
MD 2.5 ns	1.54	0.98	0.89		0.85	2.21	1.57	1.93	36.6
EigenFold	5.96	N/A	-0.04	N/A	N/A	N/A	2.35	7.96	12.2
MSA-depth256	0.84	0.53	0.25	0.34	0.59	3.63	1.83	2.90	29.3
MSA-depth64	2.03	1.51	0.24	0.30	0.57	4.00	1.87	3.32	18.3
MSA-depth32	5.71	7.96	0.07	0.17	0.53	6.12	2.50	5.67	17.1
Str2Str-ODE (t=0.1)	1.66	N/A	0.13	N/A	N/A	N/A	2.12	4.42	6.1
Str2Str-ODE (t=0.3)	3.15	N/A	0.12	N/A	N/A	N/A	2.23	4.75	9.8
Str2Str-SDE (t=0.1)	4.74	N/A	0.10	N/A	N/A	N/A	2.54	8.84	9.8
Str2Str-SDE (t=0.3)	7.54	N/A	0.00	N/A	N/A	N/A	3.29	12.28	7.3
AlphaFlow-PDB	2.58	1.20	0.27	0.46	0.81	2.96	1.66	2.60	37.8
AlphaFlow-MD	2.88	1.63	0.53	0.66	0.85	2.68	1.53	2.28	39.0
ESMFlow-PDB	3.00	1.68	0.14	0.27	0.71	4.20	1.77	3.54	28.0
ESMFlow-MD	3.34	2.13	0.19	0.30	0.76	3.63	1.54	3.15	25.6
ConfDiff-Open-ClsFree	3.68	2.12	0.40	0.54	0.83	2.92	1.50	2.54	46.3
ConfDiff-Open-PDB	2.90	1.43	0.38	0.51	0.82	2.97	1.57	2.51	34.1
ConfDiff-Open-MD	3.43	2.21	0.59	0.67	0.85	2.76	1.44	2.25	35.4
ConfDiff-ESM-ClsFree	4.04	2.84	0.31	0.43	0.82	3.82	1.72	3.06	37.8
ConfDiff-ESM-PDB	3.42	2.06	0.29	0.40	0.80	3.67	1.70	3.17	34.1
ConfDiff-ESM-MD	3.91	2.79	0.35	0.48	0.82	3.67	1.66	2.89	39.0

	Ensemble observables: Weak contacts J ↑	Ensemble observables: Transient contacts J ↑	Ensemble observables: *Exposed residue J ↑	Ensemble observables: *Exposed MI matrix ρ ↑	Quality: CA break % ↓	Quality: CA clash % ↓	Quality: PepBond break % ↓
MD iid	0.90	0.80	0.93	0.56	0.0	0.1	3.4
MD 2.5 ns	0.62	0.45	0.64	0.24	0.0	0.1	3.4
EigenFold	0.36	0.18	N/A	N/A	0.7	9.6	N/A
MSA-depth256	0.30	0.28	0.33	0.06	0.0	0.2	5.9
MSA-depth64	0.38	0.27	0.38	0.12	0.0	0.2	8.4
MSA-depth32	0.39	0.24	0.36	0.15	0.1	0.5	13.0
Str2Str-ODE (t=0.1)	0.42	0.17	N/A	N/A	0.0	0.1	13.7
Str2Str-ODE (t=0.3)	0.41	0.17	N/A	N/A	0.0	0.1	14.8
Str2Str-SDE (t=0.1)	0.40	0.13	N/A	N/A	1.6	0.2	23.0
Str2Str-SDE (t=0.3)	0.35	0.13	N/A	N/A	1.5	0.2	21.4
AlphaFlow-PDB	0.44	0.33	0.42	0.18	0.0	0.2	6.6
AlphaFlow-MD	0.57	0.38	0.50	0.24	0.0	0.2	21.7
ESMFlow-PDB	0.42	0.29	0.41	0.16	0.0	0.6	5.4
ESMFlow-MD	0.51	0.33	0.47	0.21	0.0	0.3	10.9
ConfDiff-Open-PDB	0.47	0.34	0.43	0.18	0.0	0.9	5.7
ConfDiff-Open-ClsFree	0.54	0.33	0.47	0.21	0.0	1.2	5.7
ConfDiff-Open-MD	0.59	0.36	0.50	0.24	0.0	0.8	6.3
ConfDiff-ESM-PDB	0.48	0.31	0.42	0.18	0.0	1.6	3.9
ConfDiff-ESM-ClsFree	0.54	0.31	0.47	0.18	0.0	1.8	4.3
ConfDiff-ESM-MD	0.56	0.34	0.48	0.23	0.0	1.5	4.0

4 Conclusions and Future Work

In summary, we present the first comprehensive study evaluating the capabilities of various protein foundation models across eight distinct tasks, with a particular focus on protein design and conformation dynamics. We have developed a unified, multi-metric evaluation framework, which is essential for unbiased assessment of protein foundation models from multiple facets. Based on the performance results, we provide insights and considerations for the development and effective use of protein foundation models, offering guidance for future research. We highlight the key observations from our holistic evaluation as follows.

4.1 Key observations

Valid evaluation of protein foundation models necessitates the use of correct and comprehensive evaluation metrics. The emergence of advanced folding models, exemplified by AlphaFold2 and ESMFold, has opened up valuable opportunities for accurately assessing the quality, stability, and precision in protein generative tasks. However, it is crucial to acknowledge that, due to their current limitations in complex structure prediction, certain tasks may still lack sufficiently accurate evaluation methods. For example, in antibody design, researchers have at times been misled by accuracy-related reconstruction metrics such as Amino Acid Recovery (AAR) and Root Mean Square Deviation (RMSD), resulting in overly optimistic conclusions. Here, we tackle this challenge by introducing a combined evaluation approach that integrates reconstruction and physical rationality metrics. Moreover, given the inherent complexity of protein science problems, a multifaceted evaluation strategy is imperative. ProteinBench aims to capture various facets of protein structure and function, fostering a more holistic understanding of the performance of foundation models in protein-related tasks. Furthermore, metrics alone are insufficient. In the development of generative models for proteins, the primary objective is to accurately fit the distribution of the training data. Our evaluation approach therefore extends beyond simple comparisons of metric values: we also measure the same metrics on the training data (which encompasses native proteins, antibodies, and molecular dynamics conformations of various lengths). This provides a high-resolution gold reference for protein generative targets, allowing for a more contextually rich evaluation framework.

No single model currently excels across all protein design objectives. The choice of model should be carefully aligned with the intended applications. In the field of protein foundation models, two primary approaches have emerged: language models and geometric models. Each approach has its strengths and limitations, which are reflected in their performance on ProteinBench. We found that language models perform well in capturing natural evolutionary distributions. This is evidenced by their high accuracy in native sequence recovery (inverse folding) and high quality in scaffolding evolutionarily conserved motifs. However, language models show limited robustness when designing sequences for de novo backbones, and limited ability to generate novel sequences in sequence-based protein design. In contrast, structure-based models exhibit greater robustness and tolerance to structural noise in de novo design tasks, and show greater potential for creating proteins with new folds or functions. These findings underscore the importance of carefully considering specific design objectives when selecting a model.

While generative models extended from classic folding models have shown ability to sample protein conformations, challenges remain in both multiple-state prediction and distribution prediction. Protein conformation prediction is a new but crucial assessment of the multi-modal capabilities and physical understanding of protein foundation models. While strategies proposed in current models may benefit certain tasks, they often provide limited improvement in others. For example, although fine-tuning models using the MD conformation dataset showed promising results on the ATLAS benchmark, little to no improvement was observed in the multi-state prediction of apo-holo conformations. Additionally, the common trade-off between diversity and quality in current models underscores the importance of consistent evaluation across the dimensions of accuracy, diversity, and quality in protein conformation prediction tasks.

4.2 Limitations and Future Work

We acknowledge several limitations and opportunities for enhancement in our current benchmark: (1) The selection of foundation models may not be exhaustive. Future iterations should incorporate additional foundation models to provide a more comprehensive comparison. (2) Inconsistencies in training data across models currently hinder direct comparisons of different model architectures. Future work could address this by standardizing datasets, allowing for more accurate comparisons of architectural performance. (3) The benchmark could be expanded to include a wider range of tasks, further broadening its scope and utility. We are committed to continually refining and expanding ProteinBench. Our vision is for it to evolve into a dynamic, growing benchmark that accelerates progress in the field of protein modeling and design.

Acknowledgements

This benchmark represents a collaborative effort from our research group, with each member contributing significantly from their respective areas of expertise. The diverse insights and analyses provided by each contributor have been instrumental in shaping this comprehensive work. Q. Gu conceived and oversees the project. F. Ye coordinated the experiments and analysis, while also conducting model evaluations for inverse folding, backbone design, and a portion of the sequence design tasks. Z. Zheng was responsible for model evaluations and results analysis in sequence design, co-design, and motif scaffolding. D. Xue carried out model evaluation and metrics analysis for antibody design. Y. Shen and L. Wang conducted model evaluation and analysis for single and multiple-state prediction as well as conformation distribution prediction. F. Ye initiated and drove the writing of the paper with contributions from all other authors. We are grateful for the dedication and expertise demonstrated by each team member. Their collective efforts have been crucial in developing this multifaceted benchmark.

Appendix A Overview of protein foundation model benchmarks

In this section, we provide an overview of existing benchmarks for protein foundation models. Table 11 illustrates the current landscape of these benchmarks, revealing significant limitations in their scope and applicability. The majority of existing benchmarks are narrowly focused, primarily addressing task-specific evaluations rather than offering a holistic assessment of protein foundation models.

The benchmarks we examined can be broadly categorized into two main groups: those focused on protein design tasks and those evaluating protein conformational dynamics. Within the protein design category, we observe that while inverse folding is well-represented across multiple benchmarks, other crucial aspects such as backbone design, sequence design, and structure-sequence co-design are often overlooked. Similarly, in the realm of protein conformational dynamics, only a few benchmarks address critical tasks like single-state and multiple-state prediction.

Notably, our proposed ProteinBench stands out by offering the most comprehensive coverage across various tasks. It encompasses a wide range of evaluations, including inverse folding, backbone design, sequence design, structure-sequence co-design, and antibody design in the protein design domain, as well as single-state folding, and multiple-state prediction in the conformational dynamics domain.

Table 11: A comparison of benchmarks for protein fundamental tasks.

Benchmark

Protein Design

Protein Conformation Prediction

Inverse

Folding

Backbone

Design

Sequence

Design

Struc-Seq

Codesign

Motif

scaffolding

Antibody

Design

Folding

(single-state)

Multiple State

Prediction

Distribution

Prediction

PDB-Struct (Wang et al., 2023)

✓

✘

Proteininvbench (Gao et al., 2024)

✓

✘

RFDiffusion (Watson et al., 2023b)

✘

✓

✘

CASP (cas, 2022)

✘

✓

✘

CAMEO (Robin et al., 2021)

✘

✓

✘

ProteinBench

✓

Appendix B Additional Details on Benchmarking Evaluations

B.1 Antibody Design

In this section, we describe the evaluation of antibody design methods in detail, including the overall evaluation concept, our variant of dyMEAN, the datasets used for training and testing, and the implementation of all evaluation metrics.

B.1.1 Evaluation Concept

As mentioned in the main text, antibody design can ultimately be simplified to the design of CDR-H3. Therefore, in this study, we evaluate the performance of different antibody design methods by evaluating the CDR-H3 sequences generated by these methods. Given the primary objective of this study is to assess the relative performance of various design models rather than the in vivo/vitro functionality of the antibodies they generate, we opted to directly evaluate the designed antibodies using their predicted structures. This approach is grounded in several considerations: firstly, it ensures a clear focus on evaluating the design methodology itself, independent of experimental constraints. Secondly, the significant time and resources required for extensive experimental validations, as well as the limitations of methods that can accurately simulate the real binding structure of antibodies, render in vivo/vitro assessments impractical. Direct evaluation of the designed structures presents a feasible and efficient strategy that aligns with the study’s goals and resource constraints while still providing valuable theoretical benchmarks for subsequent experimental investigations.

For methods capable of generating multiple antibodies for the same antigen, we generated 64 CDR-H3 sequences per antigen using each method and calculated the average performance across these different generated samples. Additionally, we also calculated the standard deviation of the performance among different samples generated for a single antigen.

B.1.2 Variant of dyMEAN

Unlike other methods, which are designed to accept the true structure of the antibody-antigen complex and generate the missing CDR-H3 region, dyMEAN is set up to accept only the structure of the antigen and the sequence of the non-CDR-H3 regions of the antibody. Therefore, the model needs to generate the CDR-H3 region and also predict the overall structure of the antibody and the binding pose between the antibody and antigen. Incorrect pose estimation can severely affect the interactions between CDR-H3 and the antigen, making a direct comparison between dyMEAN and other methods unfair. To compare dyMEAN with other methods more fairly, we modified dyMEAN by providing the true structure of the non-CDR-H3 regions of the antibody and the binding pose, aligning its setting with that of the other methods. We refer to this variant as dyMEAN-FixFR. In dyMEAN-FixFR, we also used Rosetta (Alford et al., 2017) to repack the side chains, consistent with other methods, to avoid the influence of the side chains generated by dyMEAN on the evaluation results. Additionally, we introduced some randomness in the initialization of the structure, which allows dyMEAN-FixFR to generate multiple different antibodies for the same antigen.

B.1.3 Dataset

To retrain all the methods for a fair comparison, we use antibody-antigen complex structures from the SAbDab dataset under the IMGT scheme (Lefranc et al., 2009) as the training data. We collected antibody-antigen complexes that have both heavy and light chains and a protein antigen, and discarded duplicates with the same CDR-L3 and CDR-H3 sequences. The remaining complexes were clustered with MMseqs2 (Steinegger & Söding, 2017) based on their CDR-H3 sequences, using 40% sequence similarity as the threshold. Finally, we selected the clusters that do not contain complexes in the RAbD dataset and split the complexes into training and validation sets with a ratio of 9:1 (1786 and 193 complexes, respectively).

The test set contains 55 antibody-antigen complexes extracted from the RAbD dataset, which originally contains 60 complexes. We restrict the evaluation to antibodies that contain both light and heavy chains and whose antigen contains at least one protein chain. In practice, 2ghw and 3uzq lack light chains, while 3h3b lacks heavy chains. 5d96 is excluded because of incorrect chain ID information in rabd_summary.jsonl (https://github.com/THUNLP-MT/MEAN/blob/main/summaries/rabd_summary.jsonl), where heavy chain J and light chain I do not bind to antigen chain A. 4etq is excluded because HERN reported an error when running on this complex.

B.1.4 Implementation of Evaluation Metrics

[Accuracy]:

• AAR: For the calculation of AAR (Amino Acid Recovery Rate), similar to existing work, we calculated the number of residues in the generated CDR-H3 sequences that match the natural antibody.

• RMSD: In the calculation of RMSD (Root Mean Square Deviation), we measured the RMSD of the generated and natural antibodies in the CA coordinates of the CDR-H3 region. For methods other than dyMEAN, since their task setting provides the true binding pose of the antibody FR region and antigen, there is no need to align the generated structure with the natural structure when calculating RMSD. For dyMEAN, we aligned the 2 FR residues at each end of CDR-H3 with the corresponding residues in the natural antibody, applied the obtained transformation to CDR-H3, and then calculated the RMSD.

• TM-score: We calculated the TM-score only for the CDR-H3 region. To this end, we saved the generated CDR-H3 part as a .pdb file and used TMalign (Zhang & Skolnick, 2005) to calculate the TM-score between the generated CDR-H3 and the natural CDR-H3.
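The first two accuracy metrics can be sketched in a few lines. The sketch assumes that the generated and natural CDR-H3 have the same length, that AAR is reported as the matched fraction of positions, and that coordinates are already in the same frame (the setting described above for methods other than dyMEAN, so no superposition is performed). Function names are hypothetical.

```python
import math

def aar(generated, native):
    """Fraction of positions where the generated CDR-H3 matches the native one."""
    if len(generated) != len(native):
        raise ValueError("this sketch assumes equal-length CDR-H3 sequences")
    matches = sum(g == n for g, n in zip(generated, native))
    return matches / len(native)

def ca_rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) CA coordinates.

    No superposition is applied; both structures must share a frame.
    """
    sq = sum(
        (a[d] - b[d]) ** 2
        for a, b in zip(coords_a, coords_b)
        for d in range(3)
    )
    return math.sqrt(sq / len(coords_a))
```

For dyMEAN, the generated loop would first be transformed using the superposition of the flanking FR residues, after which `ca_rmsd` applies unchanged.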

[Functionality]:

• Binding Energy: The calculation of binding energy requires the all-atom structure of the protein, while most methods generate only backbone atoms. Therefore, we first used Rosetta to pack the missing side-chain atoms. We then optimized the side chains in the CDR-H3 region using Rosetta minimization while keeping the backbone fixed, so that the CDR-H3 generated by the model reaches a minimum-energy state in the binding environment with the antigen. During minimization, we set the number of steps to 100. Using more steps and repeats lowered the energy further, but the reduction was much smaller than the energy differences between methods, while the run time increased significantly. After minimization, we calculated the energy on the all-atom structure. Finally, we used the InterfaceAnalyzer in Rosetta to calculate the binding energy between CDR-H3 and the antigen.

[Specificity]:

• SeqSim: SeqSim is defined as the average similarity between any sequence pairs among the generated sequences. First, we introduce the definition and implementation of similarity. The similarity between two sequences is defined as the percentage of matched amino acids over the aligned length after alignment (thus, this metric is affected by the length gap between the two sequences). Given that our goal is to calculate the number of matches rather than the matching score and that the two ends of CDR-H3 are fixed to FR3 and FR4, we need an alignment method that: (1) assigns a score of 1 for matches, and 0 for gaps and mismatches; (2) does not introduce gaps at the two ends of CDR-H3. We used the PairwiseAligner in Biopython (Cock et al., 2009) for sequence alignment, setting match_score to 1, all other scores to 0, and the end_gap_score to -inf so that the alignment process meets our requirements. For methods that generate only one antibody per antigen, we directly calculate the average SeqSim among the 55 generated CDR-H3 sequences as SeqSim-outer. For methods that generate multiple antibodies, we calculate the average SeqSim between two sets of sequences generated for two antigens as SeqSim-outer and also calculate the average SeqSim within each set as SeqSim-inner. The formulas for calculating SeqSim-outer and SeqSim-inner are as follows:

$\text{SeqSim-outer}=\frac{1}{N(N-1)M^{2}}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N}\sum_{x=1}^{M}\sum_{y=1}^{M}\text{SeqSim}(s_{i}^{x},s_{j}^{y}),$ (1)

$\text{SeqSim-inner}=\frac{1}{NM(M-1)}\sum_{i=1}^{N}\sum_{x=1}^{M}\sum_{\substack{y=1\\ y\neq x}}^{M}\text{SeqSim}(s_{i}^{x},s_{i}^{y}),$ (2)

where $N$ denotes the number of antigens in the test set ($N=55$ in this study), $M$ denotes the number of samples generated for each antigen ($M=64$ in this study), and $s_{i}^{x}$ represents the $x$-th CDR-H3 sequence generated for the $i$-th antigen.
•

PHR: PHR is the proportion of hydrophobic residues in the generated CDR-H3 sequences. Although both PHR and SeqSim are used to represent the specificity of antibody design methods, they focus on different aspects. Thus, the same method may exhibit different tendencies in these two metrics (SeqSim can be understood as an evaluation of the method’s specificity, while PHR is an evaluation of the specificity of the generated antibodies. When SeqSim performs poorly, the performance of PHR is of limited significance). For example, AbDPO achieves high SeqSim-outer but does not perform well in PHR. This indicates that AbDPO can specifically design antibodies for different antigens, but these antibodies contain many hydrophobic residues, leading to potential nonspecific interactions with multiple proteins.
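The SeqSim definition and Equations (1) and (2) above can be illustrated with a minimal sketch. It is not the authors' code: the benchmark uses Biopython's PairwiseAligner, whereas the hypothetical `seq_sim` below is a pure-Python stand-in that scores 1 per match and 0 otherwise and forbids gaps at either end. The tie-break among equally scoring alignments (shortest aligned length) is an assumption of this sketch, as are the function names `seqsim_outer` and `seqsim_inner`.

```python
from itertools import permutations, product


def seq_sim(a, b):
    """Matches / aligned length for an alignment with no gaps at either end.

    Stand-in for the Biopython-based alignment described in the text.
    Assumption: among alignments with the most matches, take the shortest.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    # best[i][j] = (matches, -columns) for prefixes a[:i], b[:j]; None if unreachable.
    # Row 0 and column 0 stay None (except the origin), which forbids leading gaps.
    best = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = []
            diag = best[i - 1][j - 1]
            if diag is not None:
                candidates.append((diag[0] + (a[i - 1] == b[j - 1]), diag[1] - 1))
            # The final column must pair two residues, which forbids trailing gaps.
            if not (i == n and j == m):
                for prev in (best[i - 1][j], best[i][j - 1]):
                    if prev is not None:
                        candidates.append((prev[0], prev[1] - 1))
            best[i][j] = max(candidates) if candidates else None
    if best[n][m] is None:
        raise ValueError("no alignment without end gaps exists")
    matches, neg_columns = best[n][m]
    return matches / -neg_columns


def seqsim_outer(sets):
    """Eq. (1): mean SeqSim over sequence pairs drawn from two different antigens."""
    total, count = 0.0, 0
    for i, j in permutations(range(len(sets)), 2):
        for s, t in product(sets[i], sets[j]):
            total += seq_sim(s, t)
            count += 1
    return total / count


def seqsim_inner(sets):
    """Eq. (2): mean SeqSim over distinct sequence pairs within each antigen's set."""
    total, count = 0.0, 0
    for seqs in sets:
        for x, y in permutations(range(len(seqs)), 2):
            total += seq_sim(seqs[x], seqs[y])
            count += 1
    return total / count
```

Here `sets` is a list with one entry per antigen, each entry being the list of CDR-H3 sequences generated for that antigen. When every antigen has the same number of samples $M$, the pair counts reduce to the normalizers $N(N-1)M^{2}$ and $NM(M-1)$ in the equations.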

[Rationality]:

•

CN-Score: To evaluate how consistent the peptide bond lengths of generated antibodies are with those of natural antibodies, we fit a Kernel Density Estimation (KDE) function to the peptide bond lengths found within the CDR-H3 regions of natural antibodies. The density of a generated peptide bond length under this KDE, the CN-Score, is used to represent the consistency. For generated peptide bonds shorter than the minimum natural peptide bond length or longer than the maximum, the density is defined as 0. The final CN-Score for a generated antibody is the average density over the lengths of all its peptide bonds. Note that peptide bond lengths vary very little, so the distribution of natural peptide bond lengths is very narrow. When a generated peptide bond length deviates even slightly from the average length (1.3310 Å), its density under the KDE decreases sharply, which explains why all methods show a significant difference in CN-Score compared to natural antibodies.
•

Clashes: Although atomic clashes within proteins mainly occur between side chains, most methods do not generate the side chains of residues. Packing methods that complete the side chains can always find a side-chain conformation with the fewest clashes through extensive searching. We therefore evaluate the potential clash level in the generated structures rather than the specific number of clashes. To do this, we calculate the CA-CA distance between pairs of residues; when the CA-CA distance between two residues not connected by a covalent bond is less than the minimum CA-CA distance commonly found between covalently bonded residues (3.6574 Å, derived from the CA-CA distance statistics in the CDR-H3 region of the RAbD dataset), we consider these two residues to have a potential clash. We then count the residue pairs with distances below this threshold to measure the clash level in the generated structures. Clashes-inner measures the clash level within the generated CDR-H3 structure, while Clashes-outer measures the clash level between the generated CDR-H3 structure and other components, including the antigen, the heavy chain FR region, and the light chain of the antibody.
•

SeqNat: To measure how close the designed CDR-H3 sequences are to natural sequences, we used the pLL predicted by the AntiBERTy model. We input the entire heavy chain sequence into the model, so AntiBERTy makes its predictions based on the entire heavy chain of the antibody. Unlike the standard AntiBERTy procedure, which calculates pLL over the entire input sequence, we calculate pLL only within the CDR-H3 region.
•

Total Energy: Before calculating the total energy, we performed the same energy optimization process on the designed CDR-H3 regions as described in the Functionality section. We then used Rosetta’s full atom score function with the default weights from REF15 (Alford et al., 2017) to calculate the total energy of each residue in the CDR-H3 region. The Total Energy of the CDR-H3 region is defined as the sum of the total energy of all its residues.
•

scRMSD: In this metric, we used IgFold to predict the structure of the generated sequences. IgFold predicts the structure based on the sequence pair of the antibody’s light and heavy chains (although the region we evaluate only exists in the heavy chain, and IgFold also supports single-chain input, we found that inputting two chains results in higher accuracy). The real structure of the non-CDR-H3 regions of the antibody was also provided as a template to obtain the initial predicted structure. We then used the Kabsch algorithm to align the non-CDR-H3 regions of the heavy chain with the real structure and applied the resulting transformation to the predicted CDR-H3 structure. This aligns the predicted CDR-H3 structure to its original complex. At this point, the CA-RMSD between the predicted CDR-H3 structure and the real structure in the RAbD dataset is 1.95 Å. The structure predicted by IgFold does not depend on the antigen, and since the antibody undergoes conformational changes in the binding interface after binding with the antigen, we used Rosetta to relax the predicted CDR-H3 in the presence of the antigen. The relaxation involves changes in both the backbone and side-chain structures. Specifically, we repeated relaxation runs five times for each structure predicted by IgFold, with 200 steps each time, and selected the structure with the lowest energy as the final predicted structure. At this stage, the CA-RMSD with the real structure decreased to 1.77 Å. We then calculated the RMSD between the CA coordinates of the predicted structure and the backbone CA coordinates generated by the model, which is referred to as scRMSD.
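As an illustration of the Clashes metric described above, the following is a minimal sketch rather than the benchmark's implementation. The function name `count_potential_clashes` and its input layout are hypothetical. It assumes CA coordinates in Å, that consecutive CDR-H3 residues are the only covalently bonded pairs within the loop, and that the caller has already removed from `context_ca` the two framework residues covalently bonded to the CDR-H3 termini.

```python
import math

# Threshold quoted in the text: minimum CA-CA distance between bonded residues.
CA_CLASH_THRESHOLD = 3.6574


def count_potential_clashes(cdr_ca, context_ca=(), threshold=CA_CLASH_THRESHOLD):
    """Return (clashes_inner, clashes_outer) as counts of residue pairs.

    cdr_ca:     CA coordinates of the generated CDR-H3, in sequence order.
    context_ca: CA coordinates of antigen, heavy-chain FR and light chain
                (assumed to exclude the residues bonded to the CDR-H3 termini).
    """
    inner = 0
    for i in range(len(cdr_ca)):
        # Start at i + 2 to skip the covalently bonded neighbour i + 1.
        for j in range(i + 2, len(cdr_ca)):
            if math.dist(cdr_ca[i], cdr_ca[j]) < threshold:
                inner += 1
    outer = sum(
        1
        for c in cdr_ca
        for o in context_ca
        if math.dist(c, o) < threshold
    )
    return inner, outer
```

Counting pairs rather than atoms mirrors the text, which measures the clash level as the number of residue pairs below the threshold.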

B.2 Protein Conformation Prediction

In this section, we provide further details on the datasets, evaluation metrics, and model implementations used in the benchmark for Protein Conformation Prediction.

B.2.1 Datasets

•

CAMEO2022 consists of 183 single protein chains collected from CAMEO targets between August and October 2022, with sequence lengths of less than 750 amino acids, following Jing et al. (2023). Protein sequences and structures were extracted from the mmCIF files available at the RCSB Protein Data Bank (https://www.rcsb.org/, Berman et al. (2000)) using customized scripts. One of the proteins (PDB ID: 8AHP, chain A) has since been superseded by a new PDB entry 8QCW and we have replaced this chain with the updated record.
•

Apo-holo consists of 91 single-chain proteins curated by Saldaño et al. (2022). The protein sequences and the structures of both apo and holo conformations were extracted using the same pipeline as in CAMEO2022. Following Jing et al. (2023), we use the sequences of the apo structures as the primary sequence for model inference.
•

BPTI is a 58-amino-acid protein whose dynamics have been extensively studied through long-timescale MD simulations (Shaw et al., 2010). We use the structures of the cluster centers identified in the MD study as the reference structures for evaluation.
•

ATLAS is a recently published dataset containing triplicated 100 ns MD simulations for 1,390 diverse single-chain proteins (Vander Meersche et al., 2024). In this work, we use a subset of 82 proteins whose PDB entries were deposited after May 1, 2019, following Jing et al. (2024). “Protein-only” trajectories were downloaded from the ATLAS database (https://www.dsimb.inserm.fr/ATLAS/index.html) for evaluation.

B.2.2 Evaluation Metrics

[Accuracy]

We evaluate accuracy by comparing the generated conformations with reference structures. Specifically, TM-score, RMSD, and GDT-TS are calculated using the TMscore program (Zhang & Skolnick, 2004), obtained from https://zhanggroup.org/TM-score/. We use the -seq option to align sequences before structural alignment. lDDT scores are calculated using the original implementation (Mariani et al., 2013), available from https://swissmodel.expasy.org/lddt/downloads/.

In multiple-state prediction, we evaluate the accuracy of predicting a specific state by the best accuracy among the generated samples. For example, apo-TM is the highest TM-score between the samples (N=20 in the benchmark) and the reference apo structure. Following Jing et al. (2023), we evaluate the overall ability to predict multiple states using “ensemble accuracy”, which is the average of the accuracy for each state, measured by TM-score or RMSD.
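The best-of-N and ensemble accuracy computation can be sketched as follows. This is an illustrative sketch only: the function name `ensemble_accuracy` and the dictionary layout are assumptions, and the per-sample scores are taken as already computed against each reference state.

```python
def ensemble_accuracy(per_state_scores, higher_is_better=True):
    """Best-of-N accuracy per state and its average over states.

    per_state_scores maps a state name (e.g. "apo", "holo") to the scores of
    all generated samples against that state's reference structure.
    Use higher_is_better=True for TM-score and False for RMSD.
    """
    pick = max if higher_is_better else min
    best = {state: pick(scores) for state, scores in per_state_scores.items()}
    ensemble = sum(best.values()) / len(best)
    return best, ensemble


# Toy example with three samples instead of the 20 used in the benchmark.
tm_scores = {"apo": [0.80, 0.92, 0.85], "holo": [0.70, 0.78, 0.74]}
best_tm, ensemble_tm = ensemble_accuracy(tm_scores)
```

Note that the best sample may differ between states, so the ensemble value rewards a sample set that covers both conformations rather than a single structure that is close to both.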

For the accuracy metrics in the distribution prediction task (flexibility, distributional accuracy, and ensemble observables), we follow the implementation of Jing et al. (2024) (https://github.com/bjing2016/alphaflow/blob/master/scripts/analyze_ensembles.py). Below is an overview of these metrics. Flexibility (called “Predicting flexibility” in Jing et al. (2024)) reflects how accurately the model predicts the diversity of proteins (pairwise RMSD) or atoms (RMSF). This is measured by the Pearson correlation $r$ between the diversity measure of the model-generated samples (sample distribution for short) and that of the reference MD samples (reference distribution for short). Distributional accuracy directly compares the similarity between the sample and reference distributions. RMWD is the root mean Wasserstein distance between the distributions of aligned coordinates, modeled as multivariate Gaussians. In this benchmark, we report only the total distance without the translation and variance decomposition. Additionally, we evaluate the Wasserstein-2 distance of the conformational distribution in the first two PCA dimensions (of aligned coordinates), as well as the cosine similarity between the PCA components of the sample and reference distributions. Lastly, ensemble observables include metrics comparing specific observables (i.e., properties of interest) in the sample and reference distributions, particularly residue-residue contacts (weak or transient) and residue exposures (e.g., surface residues that contact the solvent). Jaccard similarity and Mutual Information (MI) are used to compare these observables.
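To make the flexibility metric concrete, the following sketch computes per-atom RMSF for two ensembles and the Pearson correlation $r$ between them. It is a simplified illustration and not the referenced analysis script: the function names are hypothetical, and the conformations in each ensemble are assumed to be already superimposed on a common frame.

```python
import math


def rmsf(ensemble):
    """Per-atom root mean square fluctuation.

    ensemble: list of conformations, each a list of (x, y, z) tuples with the
    same atom ordering; conformations are assumed to be pre-aligned.
    """
    n_frames = len(ensemble)
    n_atoms = len(ensemble[0])
    values = []
    for a in range(n_atoms):
        mean = [sum(frame[a][k] for frame in ensemble) / n_frames for k in range(3)]
        msd = sum(math.dist(frame[a], mean) ** 2 for frame in ensemble) / n_frames
        values.append(math.sqrt(msd))
    return values


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmsf_correlation(sample_ensemble, reference_ensemble):
    """Pearson r between the RMSF profiles of sampled and reference ensembles."""
    return pearson_r(rmsf(sample_ensemble), rmsf(reference_ensemble))
```

A value of $r$ close to 1 indicates that the model assigns high mobility to the same atoms that fluctuate most in the reference MD samples, independent of the absolute fluctuation scale.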

[Diversity]

Diversity is evaluated by the average pairwise structural similarity among the generated samples for a protein, measured using TM-score or RMSD. To reduce computation time, we randomly sample 100 pairs of structures for this calculation.
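The subsampled pairwise average can be sketched as below. This is an assumption-laden illustration: `pairwise_diversity` and `score_fn` are hypothetical names, `score_fn` stands in for a TM-score or RMSD routine, and sampling distinct pairs without replacement with a fixed seed is a choice of this sketch rather than a documented detail of the benchmark.

```python
import random
from itertools import combinations


def pairwise_diversity(samples, score_fn, max_pairs=100, seed=0):
    """Average score_fn over at most max_pairs randomly chosen sample pairs.

    samples:  generated conformations for one protein (any representation).
    score_fn: callable returning a pairwise score (e.g. TM-score or RMSD).
    """
    pairs = list(combinations(range(len(samples)), 2))
    if len(pairs) > max_pairs:
        # Subsample distinct pairs to bound the number of structure comparisons.
        pairs = random.Random(seed).sample(pairs, max_pairs)
    total = sum(score_fn(samples[i], samples[j]) for i, j in pairs)
    return total / len(pairs)
```

With 20 samples there are 190 distinct pairs, so the cap of 100 roughly halves the number of structure comparisons per protein.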

[Quality]

The quality of the generated conformations is assessed by three types of backbone structural violations: CA-clash %, CA-break %, and PepBond-break %.

•

CA-clash % is the rate of clashes between alpha-carbon atoms. A clash is identified when the distance between a pair of alpha-carbon atoms is less than 3.0 Å, similar to Lu et al. (2024). CA-clash % is calculated as

$\text{CA-clash \%}=\frac{\text{number of residues with clashes}}{\text{sequence length}}\times 100.$
•

CA-break % is the rate at which two connecting residues are too distant from each other, indicating a potential bond break. We identify a break when the CA-CA distance between two connecting residues is greater than 4.2 Å, and CA-break % is calculated as

$\text{CA-break \%}=\frac{\text{number of connecting CA-CA pairs with break}}{\text{sequence length}-1}\times 100.$
•

PepBond-break % specifically evaluates potential peptide bond (C-N) breaks between connecting residues, providing a more rigorous measure of inter-residue disconnection than CA-break %. We use a maximum peptide bond length threshold of 1.4 Å to determine a chain break, as in the Biopython package (https://biopython.org/docs/dev/api/Bio.PDB.internal_coords.html#Bio.PDB.internal_coords.IC_Chain). Similarly, PepBond-break % is calculated as

$\text{PepBond-break \%}=\frac{\text{number of C-N bond breaks}}{\text{sequence length}-1}\times 100.$
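The three violation rates defined above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark code: the function names are hypothetical, coordinates are in Å and given in sequence order for a single chain without missing residues, and every residue pair (including adjacent residues) is checked against the 3.0 Å clash threshold.

```python
import math


def ca_clash_pct(ca, threshold=3.0):
    """Percentage of residues whose CA lies within threshold of another CA."""
    clashing = set()
    for i in range(len(ca)):
        for j in range(i + 1, len(ca)):
            if math.dist(ca[i], ca[j]) < threshold:
                clashing.update((i, j))
    return 100.0 * len(clashing) / len(ca)


def ca_break_pct(ca, threshold=4.2):
    """Percentage of consecutive CA-CA pairs farther apart than threshold."""
    breaks = sum(1 for a, b in zip(ca, ca[1:]) if math.dist(a, b) > threshold)
    return 100.0 * breaks / (len(ca) - 1)


def pepbond_break_pct(c_atoms, n_atoms, threshold=1.4):
    """Percentage of C(i)-N(i+1) peptide bonds longer than threshold."""
    breaks = sum(
        1 for c, n in zip(c_atoms, n_atoms[1:]) if math.dist(c, n) > threshold
    )
    return 100.0 * breaks / (len(c_atoms) - 1)
```

The denominators follow the formulas: CA-clash % is normalized by the sequence length, while both break rates are normalized by the number of consecutive residue pairs, that is, the sequence length minus one.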

B.2.3 Model implementations

•

AlphaFold2 (Jumper et al., 2021): We used the ColabFold implementation (Mirdita et al., 2022) for AlphaFold2 inference, with input MSAs obtained using the ColabFold pipeline. All five models (with pTM) were used to predict five candidate structures, and the structure with the highest pLDDT confidence score was selected for performance evaluation. All models were run with default settings, and no templates were provided.
•

OpenFold (Ahdritz et al., 2022): We used openfold v2.0.0 for inference with their pretrained OpenFold weights (with pTM). The same MSAs as those used for AlphaFold2 were provided as inputs. Since only one checkpoint (finetuning_no_templ_ptm_1) corresponding to the model configuration model_3_ptm is available, we generated three structures using three random seeds and made a total of 5 predictions. The structure with the highest pLDDT score was selected for performance evaluation. Templates were not provided for inference.
•

ESMFold (Lin et al., 2023): We used the public ESM repository for inference with the model esm.pretrained.esmfold_v1(). Since ESMFold predictions are deterministic, we generated only one structure per protein for performance evaluation.
•

RoseTTAFold2 (Baek et al., 2023): We follow their official repository and instructions for inference, with the same MSA as AlphaFold2 and OpenFold. No templates were provided. Only one structure per protein was predicted for performance evaluation.
•

EigenFold (Jing et al., 2023): We followed the official repository, weights, and setup provided by the authors for inference. In the protein folding task, we sampled 5 structures for each protein and selected the one with the highest ELBO estimate for performance evaluation. Because EigenFold cannot predict sequences containing unknown amino acids (labeled ’X’), we removed the ’X’ residues from the input sequences, as done in the original implementation. This adjustment might introduce some performance differences due to inference with slightly different sequences.
•

MSA-subsampling (Del Alamo et al., 2022): We implemented MSA-subsampling using the openfold v2.0.0 package by adjusting the two configuration parameters, max_msa_clusters and max_extra_msa, following Del Alamo et al. (2022). Specifically, we refer to max_extra_msa as the MSA depth and set max_msa_clusters to half that depth, while keeping other OpenFold settings at their default values. The original MSAs were obtained using the same ColabFold pipeline as in AlphaFold2.
•

Str2Str (Lu et al., 2024): We followed the official implementation of Str2Str and used OpenFold-predicted structures as the initial structures. Ensemble results were collected by uniformly sampling from the $t$ values. For BPTI, we used the author-recommended noising schedule with a maximum forward time of $T_{\text{max}}=0.15$ ($t=0.10, 0.15$). For the apo-holo and ATLAS datasets, we experimented with $T_{\text{max}}=0.1$ ($t=0.06, 0.08, 0.10, 0.12, 0.14$) and $T_{\text{max}}=0.3$ ($t=0.06, 0.12, 0.18, 0.24, 0.30$) for both the SDE and ODE models.
•

AlphaFlow/ESMFlow (Jing et al., 2024): We used the official repository and released model weights for inference. The MSAs for AlphaFlow models were obtained through ColabFold’s pipeline.
•

ConfDiff (Wang et al., 2024c): We followed the authors’ implementation and used the released weights for inference. In this benchmark, we used recycle3 representations for both ConfDiff-Open and ConfDiff-ESM models. The energy and force guidance models are dataset-specific and are only available for the BPTI dataset with ESMFold representations.

References

cas (2022) Abstract book of the 15th critical assessment of structure prediction. 2022. URL https://predictioncenter.org/casp15/doc/CASP15Abstracts.pdf.
Abramson et al. (2024) Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, pp. 1–3, 2024.
Adolf-Bryfogle et al. (2018) Jared Adolf-Bryfogle, Oleks Kalyuzhniy, Michael Kubitz, Brian D Weitzner, Xiaozhen Hu, Yumiko Adachi, William R Schief, and Roland L Dunbrack Jr. Rosettaantibodydesign (rabd): A general framework for computational antibody design. PLoS computational biology, 14(4):e1006112, 2018.
Ahdritz et al. (2022) Gustaf Ahdritz, Nazim Bouatta, Christina Floristean, Sachin Kadyan, Qinghui Xia, William Gerecke, Timothy J O’Donnell, Daniel Berenberg, Ian Fisk, Niccolò Zanichelli, Bo Zhang, Arkadiusz Nowaczynski, Bei Wang, Marta M Stepniewska-Dziubinska, Shang Zhang, Adegoke Ojewole, Murat Efe Guney, Stella Biderman, Andrew M Watkins, Stephen Ra, Pablo Ribalta Lorenzo, Lucas Nivon, Brian Weitzner, Yih-En Andrew Ban, Peter K Sorger, Emad Mostaque, Zhao Zhang, Richard Bonneau, and Mohammed AlQuraishi. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv, 2022. doi: 10.1101/2022.11.20.517210. URL https://www.biorxiv.org/content/10.1101/2022.11.20.517210.
Alamdari et al. (2023) Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X Lu, Nicolo Fusi, Ava P Amini, and Kevin K Yang. Protein generation with evolutionary diffusion: sequence is all you need. In Machine Learning for Structural Biology Workshop, NeurIPS 2023, 2023.
Alford et al. (2017) Rebecca F Alford, Andrew Leaver-Fay, Jeliazko R Jeliazkov, Matthew J O’Meara, Frank P DiMaio, Hahnbeom Park, Maxim V Shapovalov, P Douglas Renfrew, Vikram K Mulligan, Kalli Kappel, et al. The rosetta all-atom energy function for macromolecular modeling and design. Journal of chemical theory and computation, 13(6):3031–3048, 2017.
Alley et al. (2019) Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16:1315–1322, 12 2019. ISSN 1548-7091. doi: 10.1038/s41592-019-0598-1.
Baek et al. (2023) Minkyung Baek, Ivan Anishchenko, Ian R Humphreys, Qian Cong, David Baker, and Frank DiMaio. Efficient and accurate prediction of protein structure using rosettafold2. BioRxiv, pp. 2023–05, 2023.
Berman et al. (2000) Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
Bose et al. (2023) Avishek Joey Bose, Tara Akhound-Sadegh, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, and Alexander Tong. Se (3)-stochastic flow matching for protein backbone generation. arXiv preprint arXiv:2310.02391, 2023.
Campbell et al. (2024) Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Forty-first International Conference on Machine Learning, 2024.
Chu et al. (2024) Alexander E Chu, Jinho Kim, Lucy Cheng, Gina El Nesr, Minkai Xu, Richard W Shuai, and Po-Ssu Huang. An all-atom protein generative model. Proceedings of the National Academy of Sciences, 121(27):e2311500121, 2024.
Cock et al. (2009) Peter JA Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422, 2009.
Dauparas et al. (2022) J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer, F. Chan, B. Koepnick, H. Nguyen, A. Kang, B. Sankaran, A. K. Bera, N. P. King, and D. Baker. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378:49–56, 10 2022. ISSN 0036-8075. doi: 10.1126/science.add2187.
Del Alamo et al. (2022) Diego Del Alamo, Davide Sala, Hassane S Mchaourab, and Jens Meiler. Sampling alternative conformational states of transporters and receptors with alphafold2. Elife, 11:e75751, 2022.
Dunbar et al. (2013) James Dunbar, Konrad Krawczyk, Jinwoo Leem, Terry Baker, Angelika Fuchs, Guy Georges, Jiye Shi, and Charlotte M. Deane. SAbDab: the structural antibody database. Nucleic Acids Research, 42(D1):D1140–D1146, 11 2013. ISSN 0305-1048. doi: 10.1093/nar/gkt1043. URL https://doi.org/10.1093/nar/gkt1043.
Gao et al. (2022) Zhangyang Gao, Cheng Tan, Pablo Chacón, and Stan Z Li. Pifold: Toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643, 2022.
Gao et al. (2024) Zhangyang Gao, Cheng Tan, Yijie Zhang, Xingran Chen, Lirong Wu, and Stan Z Li. Proteininvbench: Benchmarking protein inverse folding on diverse tasks, models, and metrics. Advances in Neural Information Processing Systems, 36, 2024.
Hayes et al. (2024) Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, pp. 2024–07, 2024.
Hie et al. (2024) Brian L. Hie, Varun R. Shanker, Duo Xu, Theodora U. J. Bruun, Payton A. Weidenbacher, Shaogeng Tang, Wesley Wu, John E. Pak, and Peter S. Kim. Efficient evolution of human antibodies from general protein language models. Nature Biotechnology, 42:275–283, 2 2024. ISSN 1087-0156. doi: 10.1038/s41587-023-01763-2.
Hsu et al. (2022) Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. Learning inverse folding from millions of predicted structures. Proceedings of the 39th International Conference on Machine Learning, 162, 2022. doi: 10.1101/2022.04.10.487779.
Ingraham et al. (2023) John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, Vincent Frappier, Dana M Lord, Christopher Ng-Thow-Hing, Erik R Van Vlack, et al. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078, 2023.
Jin et al. (2022) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Antibody-antigen docking and design via hierarchical structure refinement. In International Conference on Machine Learning, pp. 10217–10227. PMLR, 2022.
Jing et al. (2020) Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael John Lamarre Townshend, and Ron Dror. Learning from protein structure with geometric vector perceptrons. In International Conference on Learning Representations, 2020.
Jing et al. (2023) Bowen Jing, Ezra Erives, Peter Pao-Huang, Gabriele Corso, Bonnie Berger, and Tommi S Jaakkola. Eigenfold: Generative protein structure prediction with diffusion models. In ICLR 2023-Machine Learning for Drug Discovery workshop, 2023.
Jing et al. (2024) Bowen Jing, Bonnie Berger, and Tommi Jaakkola. Alphafold meets flow matching for generating protein ensembles. In Forty-first International Conference on Machine Learning, 2024.
Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold. Nature, 596:583–589, 8 2021. ISSN 0028-0836. doi: 10.1038/s41586-021-03819-2.
Kong et al. (2022) Xiangzhe Kong, Wenbing Huang, and Yang Liu. Conditional antibody design as 3d equivariant graph translation. In The Eleventh International Conference on Learning Representations. ICLR, 2022.
Kong et al. (2023) Xiangzhe Kong, Wenbing Huang, and Yang Liu. End-to-end full-atom antibody design. In Proceedings of the 40th International Conference on Machine Learning, pp. 17409–17429, 2023.
Krishna et al. (2024) Rohith Krishna, Jue Wang, Woody Ahern, Pascal Sturmfels, Preetham Venkatesh, Indrek Kalvet, Gyu Rie Lee, Felix S. Morey-Burrows, Ivan Anishchenko, Ian R. Humphreys, Ryan McHugh, Dionne Vafeados, Xinting Li, George A. Sutherland, Andrew Hitchcock, C. Neil Hunter, Alex Kang, Evans Brackenbrough, Asim K. Bera, Minkyung Baek, Frank DiMaio, and David Baker. Generalized biomolecular modeling and design with rosettafold all-atom. Science, 384, 4 2024. ISSN 0036-8075. doi: 10.1126/science.adl2528.
Kuhlman & Bradley (2019) Brian Kuhlman and Philip Bradley. Advances in protein structure prediction and design. Nature Reviews Molecular Cell Biology, 20:681–697, 11 2019. ISSN 1471-0072. doi: 10.1038/s41580-019-0163-x.
Lefranc et al. (2009) Marie-Paule Lefranc, Veronique Giudicelli, Chantal Ginestoux, Joumana Jabado-Michaloud, Geraldine Folch, Fatena Bellahcene, Yan Wu, Elodie Gemrot, Xavier Brochet, Jérôme Lane, et al. Imgt®, the international immunogenetics information system®. Nucleic acids research, 37(suppl_1):D1006–D1012, 2009.
Lin & AlQuraishi (2023) Yeqing Lin and Mohammed AlQuraishi. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. arXiv preprint arXiv:2301.12485, 2023.
Lin et al. (2023) Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379:1123–1130, 3 2023. ISSN 0036-8075. doi: 10.1126/science.ade2574.
Lisanza et al. (2023) Sidney Lyayuga Lisanza, Jake Merle Gershon, Sam Tipps, Lucas Arnoldt, Samuel Hendel, Jeremiah Nelson Sims, Xinting Li, and David Baker. Joint generation of protein sequence and structure with rosettafold sequence space diffusion. bioRxiv, pp. 2023–05, 2023.
Liu et al. (2023) Haiyan Liu, Yufeng Liu, and Linghui Chen. Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions. bioRxiv, pp. 2023–11, 2023.
Lu et al. (2024) Jiarui Lu, Bozitao Zhong, Zuobai Zhang, and Jian Tang. Str2str: A score-based framework for zero-shot protein conformation sampling. In The Twelfth International Conference on Learning Representations, 2024.
Luo et al. (2022) Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, and Jianzhu Ma. Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures. Advances in Neural Information Processing Systems, 35:9754–9767, 2022.
Madani et al. (2023) Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos, Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser, and Nikhil Naik. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 41:1099–1106, 8 2023. ISSN 1087-0156. doi: 10.1038/s41587-022-01618-2.
Mariani et al. (2013) Valerio Mariani, Marco Biasini, Alessandro Barbato, and Torsten Schwede. lddt: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21):2722–2728, 2013.
Mirdita et al. (2022) Milot Mirdita, Konstantin Schütze, Yoshitaka Moriwaki, Lim Heo, Sergey Ovchinnikov, and Martin Steinegger. Colabfold: making protein folding accessible to all. Nature methods, 19(6):679–682, 2022.
Nijkamp et al. (2023) Erik Nijkamp, Jeffrey A. Ruffolo, Eli N. Weinstein, Nikhil Naik, and Ali Madani. Progen2: Exploring the boundaries of protein language models. Cell Systems, 14:968–978.e3, 11 2023. ISSN 24054712. doi: 10.1016/j.cels.2023.10.002.
Notin et al. (2024) Pascal Notin, Aaron Kollasch, Daniel Ritter, Lood Van Niekerk, Steffanie Paul, Han Spinner, Nathan Rollins, Ada Shaw, Rose Orenbuch, Ruben Weitzman, et al. Proteingym: Large-scale benchmarks for protein fitness prediction and design. Advances in Neural Information Processing Systems, 36, 2024.
Ren et al. (2024) Milong Ren, Chungong Yu, Dongbo Bu, and Haicang Zhang. Accurate and robust protein sequence design with carbondesign. Nature Machine Intelligence, 6(5):536–547, 2024.
Rives et al. (2021) Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118, 4 2021. ISSN 0027-8424. doi: 10.1073/pnas.2016239118.
Robin et al. (2021) Xavier Robin, Juergen Haas, Rafal Gumienny, Anna Smolinski, Gerardo Tauriello, and Torsten Schwede. Continuous automated model evaluation (cameo)—perspectives on the future of fully automated evaluation of structure prediction methods. Proteins: Structure, Function, and Bioinformatics, 89:1977–1986, 12 2021. ISSN 0887-3585. doi: 10.1002/prot.26213.
Ruffolo et al. (2021) Jeffrey A Ruffolo, Jeffrey J Gray, and Jeremias Sulam. Deciphering antibody affinity maturation with language models and weakly supervised learning. In Machine Learning for Structural Biology Workshop, NeurIPS 2021., 2021.
Ruffolo et al. (2023) Jeffrey A Ruffolo, Lee-Shin Chu, Sai Pooja Mahajan, and Jeffrey J Gray. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nature communications, 14(1):2389, 2023.
Saldaño et al. (2022) Tadeo Saldaño, Nahuel Escobedo, Julia Marchetti, Diego Javier Zea, Juan Mac Donagh, Ana Julia Velez Rueda, Eduardo Gonik, Agustina García Melani, Julieta Novomisky Nechcoff, Martín N Salas, et al. Impact of protein conformational diversity on alphafold predictions. Bioinformatics, 38(10):2742–2748, 2022.
Shaw et al. (2010) David E Shaw, Paul Maragakis, Kresten Lindorff-Larsen, Stefano Piana, Ron O Dror, Michael P Eastwood, Joseph A Bank, John M Jumper, John K Salmon, Yibing Shan, et al. Atomic-level characterization of the structural dynamics of proteins. Science, 330(6002):341–346, 2010.
Shin et al. (2021) Jung-Eun Shin, Adam J. Riesselman, Aaron W. Kollasch, Conor McMahon, Elana Simon, Chris Sander, Aashish Manglik, Andrew C. Kruse, and Debora S. Marks. Protein design and variant prediction using autoregressive generative models. Nature Communications, 12:2403, 4 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-22732-w.
Steinegger & Söding (2017) Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11):1026–1028, 2017.
Trippe et al. (2022) Brian L Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and Tommi Jaakkola. Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119, 2022.
van Kempen et al. (2022) Michel van Kempen, Stephanie S Kim, Charlotte Tumescheit, Milot Mirdita, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. Foldseek: fast and accurate protein structure search. bioRxiv, pp. 2022–02, 2022.
Vander Meersche et al. (2024) Yann Vander Meersche, Gabriel Cretin, Aria Gheeraert, Jean-Christophe Gelly, and Tatiana Galochkina. ATLAS: protein flexibility description from atomistic molecular dynamics simulations. Nucleic Acids Research, 52(D1):D384–D392, 2024.
Varadi et al. (2022) Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1):D439–D444, 2022.
Verkuil et al. (2022) Robert Verkuil, Ori Kabeli, Yilun Du, Basile IM Wicky, Lukas F Milles, Justas Dauparas, David Baker, Sergey Ovchinnikov, Tom Sercu, and Alexander Rives. Language models generalize beyond natural proteins. bioRxiv, pp. 2022–12, 2022.
Wang et al. (2024a) Chentong Wang, Yannan Qu, Zhangzhi Peng, Yukai Wang, Hongli Zhu, Dachuan Chen, and Longxing Cao. Proteus: exploring protein structure generation for enhanced designability and efficiency. bioRxiv, pp. 2024–02, 2024a.
Wang et al. (2023) Chuanrui Wang, Bozitao Zhong, Zuobai Zhang, Narendra Chaudhary, Sanchit Misra, and Jian Tang. PDB-Struct: A comprehensive benchmark for structure-based protein design. arXiv preprint arXiv:2312.00080, 2023.
Wang et al. (2024b) Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. Diffusion language models are versatile protein learners. In International Conference on Machine Learning, 2024b.
Wang et al. (2024c) Yan Wang, Lihao Wang, Yuning Shen, Yiqun Wang, Huizhuo Yuan, Yue Wu, and Quanquan Gu. Protein conformation generation via force-guided SE(3) diffusion models. In Forty-first International Conference on Machine Learning, 2024c.
Watson et al. (2023a) Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Sergey Ovchinnikov, Regina Barzilay, Tommi S. Jaakkola, Frank DiMaio, Minkyung Baek, and David Baker. De novo design of protein structure and function with RFdiffusion. Nature, 620:1089–1100, 8 2023a. ISSN 0028-0836. doi: 10.1038/s41586-023-06415-8.
Watson et al. (2023b) Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 620(7976):1089–1100, 2023b.
Wayment-Steele et al. (2024) Hannah K Wayment-Steele, Adedolapo Ojoawo, Renee Otten, Julia M Apitz, Warintra Pitsawong, Marc Hömberger, Sergey Ovchinnikov, Lucy Colwell, and Dorothee Kern. Predicting multiple conformations via sequence clustering and AlphaFold2. Nature, 625(7996):832–839, 2024.
Wu et al. (2024a) Kevin E Wu, Kevin K Yang, Rianne van den Berg, Sarah Alamdari, James Y Zou, Alex X Lu, and Ava P Amini. Protein structure generation via folding diffusion. Nature Communications, 15(1):1059, 2024a.
Wu et al. (2024b) Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, and John P Cunningham. Practical and asymptotically exact conditional sampling in diffusion models. Advances in Neural Information Processing Systems, 36, 2024b.
Wu et al. (2022) Ruidong Wu, Fan Ding, Rui Wang, Rui Shen, Xiwen Zhang, Shitong Luo, Chenpeng Su, Zuofan Wu, Qi Xie, Bonnie Berger, et al. High-resolution de novo structure prediction from primary sequence. bioRxiv, pp. 2022–07, 2022.
Yim et al. (2023) Jason Yim, Andrew Campbell, Andrew YK Foong, Michael Gastegger, José Jiménez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S Veeling, Regina Barzilay, Tommi Jaakkola, et al. Fast protein backbone generation with SE(3) flow matching. arXiv preprint arXiv:2310.05297, 2023.
Yim et al. (2024) Jason Yim, Andrew Campbell, Emile Mathieu, Andrew YK Foong, Michael Gastegger, José Jiménez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S Veeling, Frank Noé, et al. Improved motif-scaffolding with SE(3) flow matching. arXiv, 2024.
Zhang & Skolnick (2004) Yang Zhang and Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004.
Zhang & Skolnick (2005) Yang Zhang and Jeffrey Skolnick. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research, 33(7):2302–2309, 2005.
Zheng et al. (2024) Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, et al. Predicting equilibrium distributions for molecular systems with deep learning. Nature Machine Intelligence, pp. 1–10, 2024.
Zheng et al. (2023) Zaixiang Zheng, Yifan Deng, Dongyu Xue, Yi Zhou, Fei Ye, and Quanquan Gu. Structure-informed language models are protein designers. In International Conference on Machine Learning, pp. 42317–42338. PMLR, 2023.
Zhou et al. (2024) Xiangxin Zhou, Dongyu Xue, Ruizhe Chen, Zaixiang Zheng, Liang Wang, and Quanquan Gu. Antigen-specific antibody design via direct energy-based preference optimization. In ICML 2024 Workshop AI4Science, 2024.

\begin{align}
\text{SeqSim-outer} &= \frac{1}{N(N-1)M^{2}}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N}\sum_{x=1}^{M}\sum_{y=1}^{M}\text{SeqSim}(s_{i}^{x},s_{j}^{y}), \tag{1}\\
\text{SeqSim-inner} &= \frac{1}{NM(M-1)}\sum_{i=1}^{N}\sum_{x=1}^{M}\sum_{\substack{y=1\\ y\neq x}}^{M}\text{SeqSim}(s_{i}^{x},s_{i}^{y}), \tag{2}
\end{align}
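
As a minimal sketch of how Equations (1) and (2) can be evaluated, the Python snippet below averages a pairwise similarity over N groups of M sequences each, where `groups[i][x]` plays the role of $s_i^x$. The function names, the `groups` layout, and in particular `identity_similarity` (fraction of matching positions between equal-length strings) are assumptions made for illustration; the SeqSim function actually used in the benchmark may differ and can be passed in through the `sim` argument.

```python
from itertools import product
from typing import Callable, Sequence

Similarity = Callable[[str, str], float]


def identity_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for SeqSim: fraction of identical positions.

    Assumes both sequences have the same length (no alignment is performed).
    """
    if len(a) != len(b):
        raise ValueError("identity_similarity expects equal-length sequences")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def _check(groups: Sequence[Sequence[str]]) -> tuple[int, int]:
    """Return (N, M) after checking that every group holds M sequences."""
    n = len(groups)
    m = len(groups[0]) if n else 0
    if n < 2 or m < 2 or any(len(g) != m for g in groups):
        raise ValueError("need N >= 2 groups, each with the same M >= 2 sequences")
    return n, m


def seqsim_outer(groups: Sequence[Sequence[str]],
                 sim: Similarity = identity_similarity) -> float:
    """Eq. (1): mean similarity over all sequence pairs from different groups."""
    n, m = _check(groups)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the inner sum excludes j == i
            for a, b in product(groups[i], groups[j]):
                total += sim(a, b)
    return total / (n * (n - 1) * m * m)


def seqsim_inner(groups: Sequence[Sequence[str]],
                 sim: Similarity = identity_similarity) -> float:
    """Eq. (2): mean similarity over distinct sequence pairs within a group."""
    n, m = _check(groups)
    total = 0.0
    for group in groups:
        for x in range(m):
            for y in range(m):
                if x != y:  # the inner sum excludes y == x
                    total += sim(group[x], group[y])
    return total / (n * m * (m - 1))


if __name__ == "__main__":
    toy_groups = [["AAAA", "AAAT"], ["TTTT", "TTTA"]]
    print("SeqSim-outer:", seqsim_outer(toy_groups))
    print("SeqSim-inner:", seqsim_inner(toy_groups))
```

Both functions loop over ordered pairs, matching the normalizers $N(N-1)M^{2}$ and $NM(M-1)$ in the equations; for a symmetric similarity, summing over unordered pairs and halving the normalizer would give the same value.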