Improving Compound Selection in Drug Discovery: A Quantitative Approach for Biased Data Modeling

Rohit Singh Raja

ESP Journal of Engineering & Technology Advancements

Volume 5 Issue 1

Year of Publication : 2025

Authors : Rohit Singh Raja

DOI : 10.56472/25832646/JETA-V5I1P111

Citation:

Rohit Singh Raja, 2025. "Improving Compound Selection in Drug Discovery: A Quantitative Approach for Biased Data Modeling", ESP Journal of Engineering & Technology Advancements 5(1): 88-99.

Abstract:

According to the World Health Organization (WHO), cardiovascular disease is the leading cause of death worldwide. Detecting heart disease early is important because the condition is often managed through proactive measures such as lifestyle changes and preventive medication. Left unaddressed, it can lead to further cardiovascular complications, including heart attacks and other life-threatening events that require immediate medical intervention and carry high fatality rates. To address this problem, a large dataset obtained from Kaggle will be used, containing patient information together with a label indicating the presence or absence of underlying heart disease. A binary classification machine learning model will be trained with several optimization techniques to predict the likelihood that new, unseen patients have underlying heart disease. The optimization methods will be compared to identify the best-performing model for this task.
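The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not name the optimization methods, the features, or the model family, so everything below is an assumption made for illustration: a logistic regression classifier, two optimizers (plain gradient descent and gradient descent with momentum), and a small synthetic dataset standing in for the Kaggle data. The function names are hypothetical and are not taken from the paper.

```python
import math
import random


def make_synthetic_patients(n, seed=0):
    """Hypothetical stand-in for the Kaggle data: two numeric features, one binary label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        centre = 1.0 if label else -1.0  # the two classes overlap but are separable on average
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, features):
    # weights[0] is the bias term; the rest pair up with the features.
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], features)))


def gradient(weights, rows):
    """Mean gradient of the logistic (cross-entropy) loss over the rows."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        error = predict_proba(weights, features) - label
        grad[0] += error
        for j, x in enumerate(features):
            grad[j + 1] += error * x
    return [g / len(rows) for g in grad]


def train(rows, optimizer, epochs=200, lr=0.1, momentum=0.9):
    """Fit logistic regression with the named optimizer ('gd' or 'momentum')."""
    weights = [0.0] * (len(rows[0][0]) + 1)
    velocity = [0.0] * len(weights)
    for _ in range(epochs):
        grad = gradient(weights, rows)
        if optimizer == "gd":
            weights = [w - lr * g for w, g in zip(weights, grad)]
        elif optimizer == "momentum":
            velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
            weights = [w + v for w, v in zip(weights, velocity)]
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return weights


def accuracy(weights, rows):
    hits = sum((predict_proba(weights, f) >= 0.5) == bool(y) for f, y in rows)
    return hits / len(rows)


def compare_optimizers(seed=0):
    """Train with each optimizer and report accuracy on held-out rows."""
    data = make_synthetic_patients(400, seed)
    train_rows, test_rows = data[:300], data[300:]
    return {name: accuracy(train(train_rows, name), test_rows)
            for name in ("gd", "momentum")}


if __name__ == "__main__":
    for name, acc in compare_optimizers().items():
        print(f"{name}: held-out accuracy = {acc:.3f}")
```

Scoring every optimizer on the same held-out rows keeps the comparison fair. With a real dataset, the synthetic generator would be replaced by the actual patient records, and accuracy would likely be supplemented by metrics that are less sensitive to class imbalance.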

Keywords:

Driven Drug Discovery, Machine Learning Models, Predictive Modeling, Space Optimization.

ISSN : 2583-2646