Researchers from Google and Stanford University have used machine learning methods -- deep learning and multitask networks -- to discover effective drug treatments for a variety of diseases.
Deep learning, which uses artificial neural networks with many layers, enables scientists to synthesise large amounts of data into predictive models. Multitask networks compensate for limited data within an experiment by allowing data to be shared across different experiments.
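To make the idea of sharing data across experiments concrete, here is a minimal sketch of one common multitask layout: a hidden layer shared by every task, feeding a separate output for each experiment. It is an illustration only, not the researchers' architecture. The class name `MultitaskNet`, the helper `masked_log_loss`, the layer sizes and the random weights are all assumptions made for the example, and no training step is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    # Small random weights; a real model would learn these from data.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultitaskNet:
    """Hypothetical sketch: one shared hidden layer, one logistic output per task."""

    def __init__(self, n_features, n_hidden, n_tasks, seed=0):
        rng = random.Random(seed)
        self.shared = make_matrix(n_hidden, n_features, rng)  # used by every task
        self.heads = make_matrix(n_tasks, n_hidden, rng)      # one weight row per task

    def forward(self, x):
        # Shared representation (ReLU units) computed once per compound.
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                  for row in self.shared]
        # Each task reads the same hidden layer through its own weights.
        return [sigmoid(sum(w * h for w, h in zip(head, hidden)))
                for head in self.heads]


def masked_log_loss(probs, labels):
    """Mean cross-entropy over tasks with a label; None means 'not measured'."""
    terms = [-(math.log(p) if y == 1 else math.log(1.0 - p))
             for p, y in zip(probs, labels) if y is not None]
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical compound with 4 binary structure features, scored on 3 experiments.
net = MultitaskNet(n_features=4, n_hidden=3, n_tasks=3)
probs = net.forward([1, 0, 1, 1])
# The compound was tested in experiments 0 and 2 only; experiment 1 is skipped.
loss = masked_log_loss(probs, [1, None, 0])
```

Because the hidden layer is shared, a label from any one experiment would adjust weights that every other experiment also uses during training. That is how a sparsely tested experiment can benefit from data collected elsewhere.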
"Discovering new treatments for human diseases is an immensely complicated challenge. Even after extensive research to develop a biological understanding of a disease, an effective therapeutic that can improve the quality of life must still be found," the researchers wrote on the Google Research Blog.
"This process often takes years of research, requiring the creation and testing of millions of drug-like compounds in an effort to find just a few viable drug treatment candidates."
The researchers added that high-throughput screening (rapid automated screening of diverse compounds) is expensive and is usually done in sophisticated labs, which means it may not be the most practical solution.
Applying machine learning to virtual screening (a computational counterpart to high-throughput screening) is another approach to drug discovery, but challenges remain, the researchers said. Low hit rates produce imbalanced datasets, and a "paucity" of experimental data leads to overfitting, where a model fits noise in the training data rather than patterns that generalise.
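The imbalance problem can be illustrated with a small sketch. One widely used remedy, not necessarily the one the researchers chose, is to weight each class inversely to its frequency so that the rare active compounds count for more during training. The function name `balanced_class_weights` and the 2-in-100 hit rate below are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical screen: 2 active compounds (1) among 100 tested, 98 inactive (0).
labels = [1] * 2 + [0] * 98
weights = balanced_class_weights(labels)
print(weights[1])  # 25.0
```

With these weights, each active compound contributes 25 times the loss of an unweighted example, while each inactive contributes roughly half, so a model can no longer score well simply by predicting "inactive" for everything.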
"Virtual screening attempts to replace or augment the high-throughput screening process by the use of computational methods. Machine learning methods have frequently been applied to virtual screening by training supervised classifiers to predict interactions between targets and small molecules.
"The overall complexity of the virtual screening problem has limited the impact of machine learning in drug discovery," the researchers wrote in their paper called Massively Multitask Networks for Drug Discovery.
The researchers worked with 259 publicly available datasets on biological processes, which contained 37.8 million data points for 1.6 million compounds.
The datasets were made up of 128 experiments in the PubChem BioAssay database (PCBA), 17 datasets to avoid common pitfalls in virtual screening (MUV), and 102 datasets to evaluate methods to predict interactions between proteins and small molecules (DUD-E).
There were also 12 datasets from the 2014 Tox21 data challenge, run by the National Center for Advancing Translational Sciences. The goal of Tox21 is to crowdsource data analysis from independent researchers to find out how well compounds' interference in biochemical pathways can be predicted using only chemical structure data.
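As a quick check on the figures quoted above, the four dataset groups do add up to the stated total, and the data volume works out to roughly 24 measurements per compound on average. The short calculation below uses only the numbers reported in this article; the variable names are illustrative.

```python
# Dataset counts as reported in the article.
datasets = {"PCBA": 128, "MUV": 17, "DUD-E": 102, "Tox21": 12}
total_datasets = sum(datasets.values())
print(total_datasets)  # 259

# Reported totals: 37.8 million data points for 1.6 million compounds.
data_points = 37.8e6
compounds = 1.6e6
points_per_compound = data_points / compounds
print(round(points_per_compound, 1))  # 23.6
```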
"Because of our large scale, we were able to carefully probe the sensitivity of these models to a variety of changes in model structure and input data," the researchers wrote.