Abstract
As pharmaceutical development moves from early-stage in vitro experimentation to later in vivo and subsequent clinical trials, data and knowledge are acquired across multiple time and length scales, from the subcellular to whole patient cohort scale. Realizing the potential of this data for informing decision making in pharmaceutical development requires the individual and combined application of machine learning (ML) and mechanistic multiscale mathematical modeling approaches. Here we outline how these two approaches, both individually and in tandem, can be applied at different stages of the drug discovery and development pipeline to inform decision making in compound development. The importance of discerning between knowledge and data is highlighted in informing the initial use of ML or mechanistic quantitative systems pharmacology (QSP) models. We discuss the application of sensitivity and structural identifiability analyses of QSP models in informing future experimental studies to which ML may be applied, as well as how ML approaches can be used to inform mechanistic model development. Relevant literature studies are highlighted and we close by discussing caveats regarding the application of each approach in an age of constant data acquisition.
Significance Statement
We consider when best to apply machine learning (ML) and mechanistic quantitative systems pharmacology (QSP) approaches in the context of the drug discovery and development pipeline. We discuss the importance of prior knowledge and data available for the system of interest and how this informs the individual and combined application of ML and QSP approaches at each stage of the pipeline.
Introduction
The drug discovery and development pipeline offers great opportunities for combining mechanistic modeling and machine learning (ML) methodologies. Mechanistic mathematical modeling has played an important role in the development of pharmaceutical drugs over the past 50 years. Models in drug development have generally comprised compartmental pharmacokinetic and pharmacodynamic (PKPD) models, linking drug concentration in plasma to pharmacological responses. These models have been the cornerstone of dosing/scheduling decisions within drug development. They have also been expanded to include more realistic descriptions of human physiology [physiologically based pharmacokinetic (PBPK)], which describe drug absorption, distribution, metabolism, and excretion (ADME) within multiple organs in the body. Such models were initially conceived in the 1930s (Paalzow, 1995). Recently the focus has moved to more detailed models of pharmacodynamics that account for compound effects at the cellular and subcellular scales, leading to the advent of quantitative systems pharmacology (QSP). Initiated within industrial settings, the focus of QSP has been to understand the way in which single cell to whole host models can be linked to create multiscale descriptions of drug action that account for basic physiology (via PBPK models) and molecular descriptions of drug action. Such multiscale models allow for variation between individuals to be considered and have led to in silico drug trials (Clancy et al., 2016).
Mathematical modeling of drug development has not only facilitated the understanding of data (e.g., the processes determining the pharmacokinetics of a drug) but also the generation of hypotheses (e.g., “these data could be explained if this drug is a CYP450 inducer”) and experimental design (e.g., proposing a dose range and time points for observations to be made). The result has been that mathematical modeling has had a considerable impact on drug research and development (Milligan et al., 2013; Visser et al., 2013, 2014; Allerheiligen, 2014; Davies et al., 2020; Wu et al., 2021). However, the ability to collect data from the subcellular to whole human scale with increasing speed and automation has led to an explosion in recent years in the availability of data describing biologic systems (Hafner et al., 2016). How researchers can make sense of the available data, given its often complex, multifactorial, and multiscale (both in terms of space and time) nature, has become a growing challenge.
ML has been used to interrogate and visualize complex data sets to gain insight that informs decisions in drug research. ML utilizes algorithms to learn by interrogating data. This learning is then used to identify key descriptors or patterns within large data and image sets, where doing so without computing power would be infeasible given the large number of calculations required.
There has been an increase in combining mechanistic modeling with ML to provide insight into biologic systems. Baker et al. (2018) considered the synergy between mechanistic modeling and ML in a general biologic context, suggesting a cyclic interplay of the two approaches whereby ML finds important patterns and structures in the data and mechanistic approaches attempt to explain them by hypothesis generation. Others (Zhang et al., 2022b) have discussed the application of QSP and ML approaches to drug discovery and development and considered possibilities for how they can be combined. Here we detail the importance of each approach in the context of knowledge and data, provide examples to date of their application in drug discovery, and discuss areas in which both can be used together. We identify when ML and QSP approaches are best applied at the stages of drug discovery and development based on the availability of data and knowledge for their application. We close by discussing the issues and caveats of combining the approaches and opportunities for further exploring the integration of the two areas in the context of each stage of the drug discovery and development pipeline.
Data, Knowledge, and Modeling
A critical element in the application of QSP and ML within drug discovery and development is the distinction between knowledge and data. Knowledge can be defined as a collection of facts or information about a system, in our case a biologic one, which allows a mechanistic description of processes within the system to be formulated. In contrast, data are most often a collection of quantitative values (e.g., rate constants, protein expression levels, time course data) collected by experimentation. Data do not immediately confer knowledge; a system and its data need to be analyzed to gain knowledge. This difference between knowledge and data is important in recognizing the distinction between mechanistic modeling approaches used in QSP versus those in ML, particularly given that prior knowledge is often available to inform a QSP model of the biologic system being considered.
Knowledge and data are acquired during the drug discovery pipeline in a multiscale manner, working from the lower to higher length scales, starting initially at the cellular and subcellular scales during the early stages before moving into the higher organ and whole host scales and subsequent clinical studies. The ability of QSP modeling to infer knowledge by deduction and of ML approaches to infer knowledge by induction means that each can be applied within and at different points in the drug discovery and development pipeline data acquisition process (see Fig. 1).
Fig. 1. QSP and ML modeling methods. QSP methods lead to knowledge by applying first principles to a specific context and allowing the exploration of new emerging behaviors when that context is perturbed (deductive method, left). On the other hand, ML methods lead to knowledge by extracting patterns or rules from a collection of data representative of multiple possible scenarios (individual cases) for the same system (inductive method, right). Therefore, integration between QSP and ML can happen in two possible ways: 1) by extracting patterns from a collection of QSP emerging behaviors with ML methods or 2) by using the rules derived from ML methods as first principles feeding into QSP models.
Mechanistic models primarily rely upon first principles from chemistry, biology, and physiology, scoped to the specific biologic system processes of relevance, with the goal of inferring emerging system behaviors in hypothetical scenarios such as therapeutic intervention, generated by perturbing the system (Fig. 1, left). Examples of such behavior include individual cell models of tissue (e.g., agent based), which allow the effect of molecular scale processes, cell-cell communication and extracellular factors on growth and development at the tissue scale to be evaluated. In this way, a mechanistic modeler can set model parameter values with prior quantitative data either directly or indirectly (e.g., from a related system). Where no such data are available, informed estimates from subject matter experts may be used as an initial starting point. This framework allows the modeler to extend the model to include and test multiple plausible hypotheses at the molecular level (e.g., a feedback loop in a cellular signaling pathway) that earlier experimentation at a cellular, tissue, or in vivo level has indicated may be possible. Depending upon prior data and knowledge of the system, model outcomes can be tested qualitatively and/or quantitatively, the latter not always available for biologic systems due to cost and time limitations. In this way, mechanistic modeling can develop initial predictions of the system without the need for information other than the hypothesized topology of the system. In the context of drug discovery, questions that inform this may include (Morgan et al., 2012; Cook et al., 2014; Friedrich, 2016), for example:
1. Is the proposed drug target sufficiently linked to the disease indication that it is a viable therapeutic strategy?
2. Out of all the molecules developed and tested in discovery, which one has the highest potential to elicit a positive response from patients?
3. Which patients might benefit from this approach?
4. What are the potential toxicities of engaging with this target/molecule?
5. What dose and dosing frequency would this drug candidate require for therapeutic benefit in humans?
6. Is combination therapy better than monotherapy?
7. What aspects of the biology require further data/knowledge collection to be clear in progressing model development?
The mathematical predictions that can answer or inform these questions can be updated as new data and knowledge emerge at each stage of the drug discovery and development pipeline. This enables drug projects to predict from one compound to its backup molecule in discovery, from nonclinical to first in human/patient at candidate selection, or from phase 1 to subsequent phases during clinical development. It should be noted that rather than replacing the data, the simulations guide the design of studies (nonclinical or clinical) to generate the key data that will underpin decisions. Once that data has been generated, it can be used to assess the model performance and to refine/develop it further where needed. Given the granularity and multiscale nature of QSP models, this approach brings some mechanistic insights into the system, a particularly advantageous trait given the many nonlinearities associated with biologic systems, a result of interactions both within the same and across different spatial and time-dependent scales.
In contrast, ML approaches require large and diverse enough amounts of data to be collected from the system in a consistent way, with the goal of inferring general patterns or rules about the system (Fig. 1, right). In this way, ML approaches have been described as “data hungry.” Data must also be of sufficient quality (strong signal-to-noise ratio), consistent (collected by reasonably similar experimental methods and protocols), and within scope (with some degree of diversity in the sampling of cases). ML has been seen as a method for learning about the structure of data, thus providing an opportunity to extract understanding and insights from it. Any application to similar types of data on the same problem means that algorithms can be used in a predictive capacity.
Machine Learning in Drug Discovery and Development
The use of ML has gained considerable traction within the healthcare (Erickson et al., 2017; Garcia-Vidal et al., 2019), biologic (Mahmud et al., 2018), and pharmacologic sciences (Jiménez-Luna et al., 2020) over the past few years, but there remain a number of caveats to its successful application. Most critical here is the quality and quantity of data to ensure that algorithms are able to adequately learn and that there is confidence in the resulting predictions (Chen et al., 2019). Indeed, issues have been highlighted around the use of ML in clinical studies (Liu et al., 2019; Nagendran et al., 2020), leading to recent efforts regarding the creation of guidelines on the use of artificial intelligence (AI) methods in this area (Ibrahim et al., 2021). ML methods have also found a home in particularly high signal-to-noise ratio but laborious applications such as medical imaging where computers are quite capable of automatically segmenting images (Hosny et al., 2018).
Due to the challenges of understanding complex biologic and chemical data, ML has found applications in drug discovery. Algorithms have been able to elucidate core elements in complex structures (e.g., protein-protein interaction cascades and identifying drug targets, protein structure, and binding sites) (Reed et al., 2017). ML has made significant progress (Callaway, 2020; Tunyasuvunakool, 2021) with the seemingly intractable task of predicting the secondary and tertiary 3D structure of proteins based upon their amino acid sequence. Of particular note is the reported use of ML approaches to discover resistance mechanisms by integrating CRISPR gene knockout experiments and biologic network information into knowledge graphs (Gogleva et al., 2022). Here empirical approaches are integrated with prior biologic knowledge. There are a number of reported applications of ML to what many would recognize as classic PKPD analyses (Sale and Sherer, 2015; Brunton et al., 2016). Some of the applications to pharmacokinetics and PKPD modeling (Gobburu and Chen, 1996; Erickson et al., 2017) date quite far back, yet there has been little follow-up until recently (Lu et al., 2021). As well as PKPD, there have also been reported applications of ML to pharmacometric analyses such as the clearance of monoclonal antibodies (Wang et al., 2020).
ML methods in pharmacology also include structure-based property modeling of drugs [e.g., quantitative structure-activity relationship (QSAR) modeling], which uses structural descriptors to predict binding affinities (Jones et al., 2021) and pharmacokinetic properties (Soares et al., 2022). Although ML methods have attracted focus in recent years, traditional statistical modeling approaches such as linear regression still have value for hypothesis generation from large scale data sets (e.g., generating hypotheses for mechanisms of toxicity) (Munoz-Muriedas, 2021). Indeed, those looking to use ML approaches should also be aware that they do not always provide superior solutions to methods that are already available. An illustrative example is a commercial ML parole system using 137 features of an offender’s history that was no better at predicting recidivism than lay reviewers or a simple linear model based on two features (Dressel and Farid, 2018). Another example is within safety pharmacology, where prediction of Torsades de Pointes using a combined mechanistic/ML model was no better than linear regression (Mistry, 2017). These two examples highlight the importance of comparing against simpler approaches to understand the value of ML.
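The kind of comparison described above can be sketched in a few lines. The following is a minimal illustration only, not a method taken from any of the cited studies: it fits a closed-form linear regression and a k-nearest-neighbor predictor (used here as a stand-in for a more flexible ML model) to the same training data and reports the held-out mean squared error of each. The data are synthetic, and all function names and numerical values are assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def knn_predict(train_x, train_y, x, k=1):
    """Average response of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(predictions, observations):
    """Mean squared error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predictions, observations)) / len(observations)

def compare(train_x, train_y, test_x, test_y, k=1):
    """Held-out error of the linear baseline and of the flexible model."""
    intercept, slope = fit_line(train_x, train_y)
    linear = mse([intercept + slope * x for x in test_x], test_y)
    flexible = mse([knn_predict(train_x, train_y, x, k) for x in test_x], test_y)
    return {"linear": linear, "knn": flexible}

# Synthetic, purely illustrative data: a noisy linear exposure-response trend.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(80)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Train on the first 60 points, evaluate both models on the remaining 20.
errors = compare(xs[:60], ys[:60], xs[60:], ys[60:])
print(errors)
```

Reporting the simple baseline alongside the flexible model on the same held-out data makes it clear whether any added complexity is buying predictive value.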
QSP Modeling
The term “systems pharmacology” or “quantitative systems pharmacology” was first coined in 2008 (Jusko and Lauffenburger, 2008; Allerheiligen, 2010) and fully established in 2012 through a white paper jointly written by systems biologists and pharmacologists (https://www.nigms.nih.gov/News/Reports/Documents/SystemsPharmaWPSorger2011.pdf). Focusing on elucidating drug action at the individual cell and subcellular scales in preclinical stages of development, the field has come to encompass not only detailed temporal models of genetic regulatory and associated protein-protein interaction processes at the cell scale but also the integration of such information into higher organ and individual host scale models, allowing for the development of virtual population models (Cheng et al., 2022). Models have primarily been formulated using the theory of nonlinear ordinary differential equations (ODEs) but also other approaches such as stochastic models, including agent-based methods (Cosgrove et al., 2015), have been employed. Both have generally only accounted for temporal phenomena. Spatiotemporal approaches to drug discovery and development have been used to incorporate descriptions of drug diffusion into the surrounding tissue (McGinty and Pontrelli, 2015), although the application of partial differential equation models, accounting for temporal and spatial detail, is ripe for further exploitation in QSP. All of these approaches allow dynamic predictions of the system behavior to be made from which new understanding can be gained and hypotheses tested (e.g., understanding the pharmacokinetic properties required of a drug to gain a therapeutic benefit from binding to a target by an early model-based assessment of target pharmacology) (Chen et al., 2022).
QSP is a highly interdisciplinary field that brings together mathematical modelers, computer scientists, pharmacological scientists, and biologists to tackle specific problems. QSP models seek to integrate information and data obtained at the earlier stages of the drug discovery and development pipeline, typically from in vitro experiments and in vivo animal models. Multiscale QSP models allow the effect of compound dosing at the individual host scale to be examined at the cellular/subcellular level. In turn, the cellular functional outcomes affected by the compound can then be evaluated at the organ and whole host scale. In this way, QSP models have the ability to link across varying spatial and temporal scales, which allows emergent behavior at the lower spatial scales in particular (e.g., multicellular level) to be evaluated at the tissue scale.
The ability to combine early-stage knowledge and data in the drug discovery and development pipeline, along with combining processes across spatial and temporal scales, means that QSP models are well placed for also considering variability not only within individuals but also across populations. In this way, QSP models can be used to run in silico clinical trials for a particular compound, long before any possible actual clinical trials take place.
The QSP approach is also helpful in cases where data or knowledge of the compound is not available for its known or proposed biologic mechanisms of action. A mechanistic model does not need a full data set describing the full system for modeling to be initiated. Models can be formulated that take account of known information, using available data from the same cellular system or ones similar to it to both inform and test the model predictions while utilizing knowledge of any remaining parameters to inform the model. Knowledge and data for building and parameterizing QSP models may take a variety of forms and include the following: 1) models (or reduced versions thereof) can be fitted to available data to obtain estimates of the required model parameters; 2) parameters from previously published mathematical models directly or indirectly related to the system being modeled; 3) estimates of model parameter values utilizing informed upper and lower bounds as to what values the parameters may obtain; or 4) a combination of 1, 2, and 3. In this way, QSP models do not need access to large amounts of experimental data for an initial set of predictions to be made. The initial predictions can be assessed both qualitatively and quantitatively against known experimental data and knowledge gained to date of the system (e.g., if the concentration of a specific entity lies within a given known range). Analysis of the model, either analytical or computational (e.g., sensitivity analysis), can then be used to identify parameters critical to the model outcomes of interest. Where critical parameters are identified for which more knowledge of that parameter is required (by combining the results of structural identifiability and sensitivity analysis), then experimental work can be explored/undertaken to do so. In this way, mechanistic modeling helps both inform our understanding of the biologic system and/or test hypotheses while helping to direct future experimental work.
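As a minimal sketch of the sensitivity analysis step described above, the example below computes normalized local sensitivity coefficients by central finite differences and ranks the parameters by their influence on a model output. The model (an Emax effect driven by one-compartment intravenous bolus kinetics), the parameter names, and every numerical value are hypothetical and chosen only for illustration; a real QSP model would typically be a larger ODE system and may call for global rather than local methods.

```python
import math

def effect_at(t, params):
    """Emax effect driven by one-compartment IV bolus kinetics (illustrative model)."""
    conc = params["dose"] / params["volume"] * math.exp(-params["k_el"] * t)
    return params["emax"] * conc / (params["ec50"] + conc)

def normalized_sensitivity(model, t, params, name, rel_step=1e-4):
    """Central-difference estimate of (dY/dp) * (p / Y) for a single parameter."""
    up = dict(params)
    down = dict(params)
    up[name] = params[name] * (1 + rel_step)
    down[name] = params[name] * (1 - rel_step)
    dy = model(t, up) - model(t, down)
    dp = up[name] - down[name]
    return (dy / dp) * (params[name] / model(t, params))

# Hypothetical parameter values, assumed only for this example.
params = {"dose": 100.0, "volume": 10.0, "k_el": 0.2, "emax": 1.0, "ec50": 2.0}

# Sensitivity of the effect at t = 6 (arbitrary time units) to each parameter.
sensitivities = {
    name: normalized_sensitivity(effect_at, 6.0, params, name) for name in params
}

# Rank parameters by the magnitude of their influence on the output.
ranked = sorted(sensitivities, key=lambda n: abs(sensitivities[n]), reverse=True)
for name in ranked:
    print(f"{name}: {sensitivities[name]:+.3f}")
```

In this toy model the dose and volume sensitivities are equal in magnitude and opposite in sign because the two parameters enter only through their ratio. Patterns of this kind are one reason the text pairs sensitivity analysis with structural identifiability analysis when deciding which parameters merit further experimental work.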
Although the size and complexity of QSP models can vary, simple models can be insightful and explanatory alongside larger scale, more detailed ones (Stein and Looby, 2018; Mistry and Orrell, 2020). This is the case both when data and knowledge are plentiful and when they are sparse. Indeed, simple models capturing gross complexity can provide overall understanding of the system dynamics, helping to improve knowledge (Mistry, 2018). At whichever scale the modeler chooses to work, as models move into the clinic, it is important that they are informed and tested with appropriately sized patient samples to ensure that predictions are meaningful and can be trusted (Riley et al., 2020).
Combining ML and QSP in Drug Discovery and Development
Drug discovery and development offer opportunities for combining ML and QSP. A key consideration here is observing the outcomes and outputs that each approach has that can inform the use of the other one. Particular thought should be given to the richness and type of data available in the context of the overall question at each stage of the pipeline. At each stage of drug discovery and development, there may be the opportunity to either use each method in its own right and/or combine both at the same stage.
As shown in Fig. 1, integration between QSP and ML can happen in two possible ways: 1) by extracting patterns from a collection of QSP emerging behaviors with ML methods or 2) by using the rules derived from ML methods as first principles feeding into QSP models. Although Zhang and colleagues (2022b) have recently identified ways and examples of unifying ML and QSP approaches in drug discovery and development, here we provide details on how the methods can be combined at each stage of drug discovery and development, given the knowledge and data available as a compound moves down the pipeline, and noting recent successes in the literature relevant to each case. In each case “data rich” refers to cases where enough experimental data are available so that meaningful insight can be gained using ML approaches, and thus they may be a good first step. “Data poor” indicates that inadequate or insufficient experimental data are available for ML approaches to be insightful, and thus mechanistic QSP approaches should be considered. References to current examples are provided and an overview of how the combination of each approach applies to the drug discovery development pipeline is shown in Fig. 2.
Fig. 2. A summary of ML, QSP, and ML and QSP combined approaches applied to each stage of the drug discovery and development pipeline.
Discovery/Preclinical Development
ML to QSP (Data Rich)
Here ML methods can be used to interrogate experimental data sources to: 1) identify patterns and relationships upon which mechanistic QSP models can be built (e.g., searching for signals in ‘omics data associated with disease) (Menden et al., 2019; Vamathevan et al., 2019; Kolluri et al., 2022); and 2) parameterize QSP models (e.g., analysis of computational QSAR models to inform PBPK model parameters) (McComb et al., 2022).
QSP to ML (Data Poor)
Here QSP models are created when limited good-quality experimental data are available. In this case, no QSAR models are available to inform the QSP parameters, so models can be informed using parameter values from similar systems or with a known level of uncertainty based on current knowledge. Models initially formulated at the subcellular genetic regulatory/protein-protein interaction level can be integrated with PKPD/PBPK ones to create patient cohort level multiscale predictive models that preempt between-patient variability (e.g., the PBPK tool Simcyp with its detailed knowledge of variations in physiologic parameters for different patient populations to predict variability in pharmacokinetics) (Zhang et al., 2022a). Such Digital Twin style models (Agur et al., 2020) can be used to provide cohort predictions of drug efficacy/toxicity before clinical trials are undertaken, identifying potential issues and which clinical trial data are critical to improving model predictions. ML approaches can then be applied to identify patterns within the simulated virtual patient populations to inform decisions relevant to the design of specific disease/drug response studies (Koch et al., 2013). They may also be applied to identify parameters and thus mechanisms responsible for certain outcomes in mechanistic models, as demonstrated by McGillen and colleagues (2014).
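The QSP to ML workflow outlined above can be sketched on a toy scale as follows. This is an illustrative assumption throughout, not the approach of any cited study: a virtual population is generated by sampling pharmacokinetic parameters from lognormal distributions, each virtual patient is labeled a responder when simulated exposure (AUC after an intravenous dose, dose/clearance) reaches a hypothetical target, and a single-feature threshold rule (a decision stump, standing in for an ML classifier) is searched to find which parameter best explains the simulated outcome.

```python
import math
import random

def simulate_patient(rng, dose=100.0):
    """One virtual patient: sampled PK parameters and a derived response label."""
    clearance = rng.lognormvariate(math.log(5.0), 0.3)  # L/h, hypothetical
    volume = rng.lognormvariate(math.log(50.0), 0.3)    # L, hypothetical
    auc = dose / clearance                              # exposure after an IV dose
    # Hypothetical exposure target used to define response.
    return {"clearance": clearance, "volume": volume, "responder": auc >= 20.0}

def best_stump(patients, features):
    """Search single-feature threshold rules and return the most accurate one."""
    best = (None, None, None, 0.0)  # feature, threshold, direction, accuracy
    for feature in features:
        for threshold in sorted(p[feature] for p in patients):
            for below_is_responder in (True, False):
                # Count patients whose predicted label matches the simulated one.
                hits = sum(
                    ((p[feature] <= threshold) == below_is_responder) == p["responder"]
                    for p in patients
                )
                accuracy = hits / len(patients)
                if accuracy > best[3]:
                    best = (feature, threshold, below_is_responder, accuracy)
    return best

# Fixed seed so the virtual cohort is reproducible.
rng = random.Random(0)
cohort = [simulate_patient(rng) for _ in range(200)]

feature, threshold, below_is_responder, accuracy = best_stump(
    cohort, ["clearance", "volume"]
)
print(feature, round(threshold, 2), below_is_responder, accuracy)
```

Because response in this toy model depends on exposure alone, the rule recovered from the simulated cohort splits on clearance rather than volume, which mirrors on a small scale how pattern finding on model outputs can point back to the mechanism driving an outcome.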
Clinical Trial Phases 1 to 3
ML to QSP (Data Rich)
Here ‘omics data can be interrogated with ML methods during a clinical trial to uncover potential biomarkers of response (Agur et al., 2020). This information can be used to test multiscale QSP model predictions. Others have used ML analysis of high-throughput in vitro drug combination data to predict and prioritize drug combinations (Menden et al., 2019) for clinical investigation using only single drug potency data by incorporating prior mechanistic biologic knowledge. This information can then be applied to develop relevant PKPD/PBPK models of the various combinations.
QSP to ML (Data Poor)
Classically, this is the stage at which mechanistic PKPD/PBPK models have been applied to inform drug dosing and understand compound effects at the whole organ scale. There is a growing trend of applying ML approaches to PKPD/PBPK models to identify subpopulations with varying pharmacokinetics and response to treatment within clinical data sets (McComb et al., 2022). Others are utilizing ML analysis of patient data in response to therapeutics to derive PKPD-like models (Lu et al., 2021). This work demonstrated that the application of ML does not necessarily lead to better results than those captured by the initially formulated PKPD model. Indeed, the analysis detailed in Lu et al. (2021) does not follow the natural flow of data creation in clinical trials; the authors consider phase 3 data when the greatest utility of the modeling (PKPD or ML) would be to inform movement from phase 1 to phase 2 trials.
Life Cycle Management
Although we are not currently aware of any studies looking at the individual or combined use of mechanistic and ML approaches once a pharmaceutical has received regulatory approval, it is probably fair to state that this later stage of big, complex data is ripe for the application of ML approaches such as in pharmacovigilance. Here individual adverse reaction reporting of pharmaceuticals in the community can be analyzed using ML approaches to identify plausible patterns in reporting. Such results can then be used to inform future postapproval PKPD/PBPK modeling to address these findings (e.g., the formulation and parameterization of a model based on an observed drug-drug interaction).
Summary and Conclusions
There is growing interest in applying QSP mechanistic approaches and more recently ML to inform decision making in drug discovery and development, the overall goal being to inform compound development earlier in the drug discovery and development pipeline, thus reducing compound attrition at later stages. The application of both approaches relies greatly on the knowledge and data available to develop a model of the biologic system of interest.
Given the flow of data collection in the drug discovery and development pipeline and the need to assimilate knowledge at different spatial and time scales, it is becoming increasingly clear that the application of QSP and ML approaches should not be considered in isolation. Both approaches can be used to progress compound development. What is important in this marriage is the relevant context for their application and the appropriate sequence of application: the right tool for the right job. For instance, where data may be lacking to make meaningful use of ML approaches, a QSP approach can be applied to begin postulating how the system may behave, thus helping to inform future experimental data collection to improve model predictions. Such data collection may require the application of ML approaches to subsequently inform and test the QSP model. Likewise, when data sources are large and beyond the scale of humans being able to extract meaningful relationships, ML approaches can be particularly helpful at informing relationships in data upon which mechanistic models can be built.
As more data at molecular scales become publicly available, we are moving to a stage where early-stage drug discovery and development can be informed using such information before further experimentation is undertaken. Within this data-rich age we are fortunate that we can use ML approaches to bring understanding to large datasets, or where such understanding is difficult or not meaningful, we can turn to mechanistic-style QSP models to progress research and development. With both tools available and growing understanding and data, barriers to informing earlier-stage drug discovery and development, which in turn inform decision making in later stages, are being reduced.
Although data are becoming more readily available to researchers, the application of both approaches needs to account for the reproducibility issues within the scientific literature (Baker, 2016; Voelkl et al., 2018). The analysis of data and its subsequent interpretation where reproducibility is questionable can lead to incorrect conclusions and subsequent incorrect or misinformed drug development, regardless of whether ML or QSP approaches are applied independently or together. Data collection and provenance need to account for the required scale and relevance to the overall question being asked about the development of the compound. For instance, to meaningfully inform and test a QSP model, multiscale (subcellular to organ scale) longitudinal data may be required to uniquely and accurately inform parameters within the model. As such, the need to conduct reproducible experiments for a given cellular system to improve confidence in model predictions is important, and will become increasingly so as reliance on model predictions for directing future experiments becomes greater. In contrast, to simply begin a study when limited system-specific data are available for informing model development, researchers may be satisfied to use data from different cell systems to make progress. Each case very much depends upon the model application and the model uncertainty with which modelers and their teams are willing to work. Overall, the quality and completeness of data in each case are things that both ML and QSP approaches, individually and collectively, can help with in progressing the development of future pharmaceuticals.
The success of combining ML and QSP models will be assessed on the overall ability of the combined approach to make as much progress as possible on a given problem while minimizing the time and cost of doing so (Androulakis, 2022). For instance, in the case of a project where no or limited experimental data have been collected but mechanistic knowledge of processes is available, a model should be formulated, parameterized, and used to direct any future experimental work. In such a case, it would be counterproductive to initially undertake large-scale experimentation when this option is available. Likewise, when large-scale data sets are available and no causal relationships between processes are known, ML approaches should be used first to then inform QSP model development (Putnins et al., 2022).
Authorship Contributions
Wrote or contributed to the writing of the manuscript: Tindall, Cucurull-Sanchez, Mistry, Yates.
Footnotes
- Received December 15, 2022.
- Accepted July 26, 2023.
This work was undertaken without financial support.
No author has an actual or perceived conflict of interest with the contents of this article.
Abbreviations
- ML: machine learning
- PBPK: physiologically based pharmacokinetic
- PKPD: pharmacokinetic and pharmacodynamic
- QSAR: quantitative structure-activity relationship
- QSP: quantitative systems pharmacology
Copyright © 2023 by The American Society for Pharmacology and Experimental Therapeutics