Reevaluating ML in Science: Striking the Balance Between Quality and Quantity
Introduction
Recent advances in machine learning (ML) have brought notable changes to many disciplines, scientific research among them. Yet several challenges must be addressed to properly validate new models, strengthen testing and validation protocols, and ensure these models hold up in practical scenarios. Such challenges include evaluations that are biased or subjective (often unintentionally so), datasets that do not accurately represent real-world scenarios (for example, datasets that are overly simplistic), and flawed procedures for partitioning datasets into training, validation, and test subsets. This article will delve into these issues, particularly within the realm of biology, which is currently being transformed by ML techniques.
Along the way, I will also touch upon the interpretability of ML models, which remains limited yet crucial, as it could shed light on many of the challenges identified above. Additionally, even when a model's capabilities are oversold, this does not mean it lacks utility or has not contributed valuable insights that can advance specific subfields of ML. The rate at which new ML models appear, particularly within my area of expertise, is staggering.
Recent Developments and Questions
With the exponential increase in ML papers focused on scientific applications, I find myself questioning how revolutionary and useful these works truly are. AlphaFold 2 was a genuine breakthrough, and many other ML tools, such as ProteinMPNN, have since emerged, but the sheer volume raises questions about the overall quality of research in the field. How can so many scientists simultaneously develop ML tools that each claim to be "the best"? Are their foundations robust enough? And even if the research is innovative and sound, are the evaluations consistently fair, objective, and balanced? Are the applications to real-world challenges genuinely as groundbreaking as claimed?
Each time I come across a new ML technique for structural biology, I ponder how to evaluate its merit, particularly its performance and usefulness in my research.
Researchers are increasingly leveraging the latest neural network advancements to tackle longstanding challenges across various domains, leading to substantial progress. Nonetheless, it is imperative to ensure that evaluations are unbiased, objective, and balanced, and that datasets and prediction capabilities genuinely reflect the real-world applicability of the ML models.
Social media, preprint archives, and peer-reviewed journals reveal a surge in the application of modern neural network methodologies—such as transformers and diffusion models—to age-old scientific problems. This trend is promising, as it has resulted in significant advancements in multiple fields. Notable examples include the strides made in protein structure prediction by AlphaFold 2, which triumphed at CASP14, and the innovations in protein design spearheaded by D. Baker's ProteinMPNN, whose predicted sequences have undergone extensive experimental validation confirming their efficacy. For further insights into these methodologies, feel free to explore my blog articles:
Evaluation Metrics and Concerns
However, in many instances, new methods are somewhat overstated. A clear example can be found in protein design, where recent studies often measure success by the sequence identity of the generated sequences relative to the native sequence of the input protein structure, a quantity usually called sequence recovery. While this metric may seem logical, a deeper examination reveals that high sequence identity does not guarantee proper protein folding. A single mutation can produce a protein that fails to fold correctly despite a sequence nearly identical to the wild type. Some mutations can even trigger fold switches, yielding high sequence identity yet a completely different structure. Conversely, proteins with entirely different sequences may still adopt very similar folds. Sequence recovery is therefore an acceptable first filter but ultimately a limited success metric in protein design.
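To make the metric concrete, here is a minimal sketch of how sequence recovery is typically computed, assuming the designed and native sequences are already aligned; the function name and the toy sequences are mine, purely for illustration.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed sequence matches the native one.
    Assumes both sequences are aligned and of equal length."""
    if len(designed) != len(native):
        raise ValueError("Sequences must be aligned to the same length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# A design can score 90%+ recovery and still carry the one mutation
# that keeps the protein from folding; the number alone cannot tell.
print(sequence_recovery("MKTAYIAKQR", "MKTAYIAKQL"))  # 0.9
```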
Currently, the only truly meaningful test for a protein design involves practical laboratory work: producing the designed protein and experimentally determining its structure to see how closely it aligns with the intended design, understanding that a perfect match is an unrealistic expectation. Some studies, including those on ProteinMPNN, do include experimental validation, but many preprints and articles overlook this crucial step and focus solely on sequence recovery and related metrics. Likewise, checking that AlphaFold can back-predict the structure used in the design protocol from a designed sequence does not reliably indicate that the design works; at best, it helps flag poor sequences.
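For the structural side of the comparison, a common quantitative proxy is the Cα RMSD between the design target and a back-predicted or experimentally determined model after optimal superposition. Below is a minimal, self-contained sketch using the Kabsch algorithm with NumPy; in practice one would parse coordinates from PDB/mmCIF files, and many groups prefer TM-score, which is less sensitive to local deviations.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """C-alpha RMSD between two (N, 3) coordinate sets after optimal superposition
    (Kabsch algorithm). P could be the design target, Q the model to compare."""
    P = P - P.mean(axis=0)                      # center both structures
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                                 # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))
```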
Potential Issues with Datasets and Evaluations
A significant concern, which I will address in broad terms to avoid singling out specific studies, is that many research works evaluate their models using inadequate datasets. The primary issues I've noticed include datasets that do not accurately reflect real-world applications of the ML model and datasets that overlap with the training data.
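As a rough illustration of the second problem, here is a minimal sketch of a redundancy check that flags test sequences too similar to anything in the training set. It uses Python's standard-library SequenceMatcher as a crude similarity proxy and an arbitrary threshold; a real pipeline would instead cluster at a chosen sequence-identity cutoff with a dedicated tool such as MMseqs2 or CD-HIT.

```python
from difflib import SequenceMatcher

def leaks_into_training(test_seq: str, train_seqs: list[str], threshold: float = 0.4) -> bool:
    """Crude proxy for homology-based leakage: flag a test sequence whose
    similarity to any training sequence reaches the threshold."""
    return any(
        SequenceMatcher(None, test_seq, t).ratio() >= threshold
        for t in train_seqs
    )

train = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSHMSDKIIHLTDDSF"]
test  = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLI", "WLTALLVVAVWLMTTQA"]
clean_test = [s for s in test if not leaks_into_training(s, train)]
print(clean_test)  # keeps only sequences sufficiently dissimilar from the training set
```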
This concern is not rooted in any malicious intent. Training ML models necessitates vast datasets, often too extensive for manual curation, and automated curation processes have inherent limitations. Furthermore, research papers often showcase only those instances that highlight the ML model's positive applicability to a biological issue, neglecting cases that lack biological relevance or are difficult to interpret, or that contradict established knowledge—an example of poor scientific practice.
These issues are not unique to the ML domain but rather represent broader challenges in science: positive results are often prioritized, leading to a scarcity of negative or incorrect outcomes in the literature, despite their significance in preventing wasted resources and time. The "publish-or-perish" mentality encourages the dissemination of primarily positive findings, often embellished to exaggerate novelty and superiority. To explore my thoughts on the challenges facing scientific publishing, see this:
In light of the above observations, I believe it is likely that competitions like CASP (alongside CAMEO, CAPRI for structure prediction, etc.) and studies focused on objectively benchmarking existing methods contribute more to the advancement of the field than most papers announcing new models. I am so convinced of this that I believe ranking high in a competition like CASP or an independent benchmarking study outweighs any paper that, despite presenting presumably exciting results, failed to perform well in these evaluations (though this does not preclude the potential future relevance of the methods discussed).
The Quest for Improvements in Evaluation
A noteworthy point about evaluations, particularly in protein tertiary structure prediction after AlphaFold 2's breakthrough, is that only marginal improvements are now possible, which makes it hard to compare new methods against one another; AlphaFold 2 itself did not face this problem because the bar it had to clear was much lower. The issue is less pronounced in other areas, such as drug design, docking, and conformational dynamics, where predictions remain poor to moderate.
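One consequence is that apparent improvements often sit within the noise of the benchmark. Here is a minimal sketch of the kind of check I would like to see more often: a paired bootstrap over per-target scores (GDT_TS here, with made-up numbers) to ask whether a small average gain is distinguishable from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-target GDT_TS scores for two methods on the same targets.
method_a = np.array([92.1, 88.4, 75.0, 96.3, 81.7, 90.2, 85.5, 79.9])
method_b = np.array([91.5, 89.0, 74.2, 96.8, 80.9, 90.6, 84.8, 80.3])

diffs = method_a - method_b
boot_means = [rng.choice(diffs, size=diffs.size, replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean difference = {diffs.mean():.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
# If the interval straddles zero, the "improvement" cannot be distinguished from noise.
```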
Despite the overselling of certain methods, it is essential to recognize that they can still provide valuable contributions and propel the field forward. As scientists explore alternative solutions to problems—especially through ML—they may propose ideas that seem innovative and promising but ultimately fall short. However, this does not imply that these new developments lack value for specific applications or fail to generate useful insights for future endeavors.
One illustrative example is the use of protein language models for structural biology predictions. The first methods reported performed significantly worse than AlphaFold 2 but ran dramatically faster, which could be a decisive advantage for particular applications:
Probably the most notable (and valuable) language-based model for protein structure prediction is Meta's ESMFold, released shortly thereafter. This neural network, along with a vast database of precomputed protein structure models, created a brief but impactful wave of excitement that I believe represents a substantial contribution to the revolution initiated by DeepMind's AlphaFold 2:
However, while ESMFold is remarkably fast and outperforms earlier protein language models for structure prediction, its results still lag behind those of AlphaFold 2 (additionally, ESMFold has limitations such as the inability to utilize custom templates or predict structures for protein complexes). You can personally assess ESMFold's relatively modest performance (by current standards) in the official evaluations conducted during the 15th edition of CASP:
Still, I do not wish to undermine Meta's model. It is genuinely useful and holds great promise, having paved the way for numerous new tools built on its protein language model. Moreover, considering the internal workings of the network for structure prediction, the model could still evolve substantially in the future. For instance, collaborative research from the Baker lab and Meta demonstrated that language models trained solely on sequences learn enough to design protein structures that go beyond natural proteins, even incorporating motifs not typically found in similar structural contexts in known proteins (and the designs were experimentally validated!):
The Importance of Interpretability in ML Models
A significant limitation of ML, not just in structural biology but across many scientific and engineering disciplines, is the lack of interpretability: these models often function as black boxes. They may perform well on their intended tasks or predictions, yet they offer little insight into how they arrive at their outputs or why they perform as they do.
Ideally, even for models of presumed high accuracy and reliability, we would understand the conditions under which a model succeeds or fails. Preferably, such explanations would be grounded in the fundamental sciences of the domain, such as physics or chemistry, and expose clear relationships between variables, much like traditional modeling does.
In structural biology specifically, while ML models for predicting protein structures demonstrate impressive accuracy, their underlying mechanisms remain poorly understood. It is unclear whether these models have learned something novel about protein structure or, more likely, whether they are drawing on established knowledge that is challenging to quantify and apply analytically. Furthermore, we do not know if these ML techniques are solely exceptional at predicting folded states or if they can also forecast intermediates during folding pathways, alternative conformational states, or structural tendencies of intrinsically disordered regions. My suspicion is that they may not, or at least not with high confidence, as they are inherently designed to predict well-folded structures. Gaining explicit insights into how they achieve high accuracy in predicting these states could not only challenge my assumptions but also assist method developers in identifying and overcoming limitations through enhanced models.
The lack of interpretability in ML models presents several challenges. First, it complicates error diagnosis and correction when models underperform, particularly in extreme extrapolations, such as predicting the structure of a protein that deviates significantly from known structures. Without clarity on how the model formulates its predictions, it becomes difficult to identify solutions and evaluate the reliability of each prediction—though modern ML tools are increasingly integrating metrics for assessing prediction reliability.
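On the reliability point, here is a tiny sketch of how a per-residue confidence track (pLDDT-like values, made up here) can at least flag where a prediction should not be over-interpreted, even if it does not explain why the model struggles there.

```python
import numpy as np

# Hypothetical per-residue confidence scores (pLDDT-like, on a 0-100 scale).
plddt = np.array([95, 93, 90, 62, 48, 41, 55, 88, 91, 94])

confident = plddt >= 70   # a common rule of thumb for "reliable" residues
print(f"{confident.mean():.0%} of residues above the confidence cutoff")
# The low-confidence stretch (residues 4-7 here) is the model telling us,
# indirectly, that this region of the prediction should be treated with care.
```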
Second, limited interpretability constrains our understanding of the fundamental physics, chemistry, or biology that explains a system's behavior, even if we can accurately predict that behavior. This may not matter much for practical applications, but it falls short of the fundamental understanding that science strives for.
Lastly, the absence of interpretability can hinder trust in ML models. If we cannot comprehend the rationale behind a model's predictions, we may hesitate to rely on it in scenarios where accuracy is vital. This is particularly critical in structural biology, where erroneous models could lead to incorrect conclusions regarding the function of biological molecules, obstructing the progress of related studies and developments.
The essence of interpretability in this context is that more transparent ML models could mitigate many of the issues associated with their design, training, and application in real-world settings, potentially identifying challenges before they arise and thus improving the quality of outcomes while maintaining a balance with quantity.
Efforts are underway to enhance the interpretability of ML models, with some researchers focusing specifically on scientific applications. I plan to publish a blog article addressing this topic soon.