
Google, Stanford use machine learning on 37.8m data points for drug discovery

Rebecca Merrett | March 6, 2015
Deep learning and multitask networks used on 259 datasets.

"We carefully quantified how the amount and diversity of screening data from a variety of diseases with very different biological processes can be used to improve the virtual drug screening predictions.

"Our models are able to utilise data from many different experiments to increase prediction accuracy across many diseases," the researchers wrote.

The learning models were evaluated using the area under the receiver operating characteristic (ROC) curve, a measure of how well a classifier ranks positive examples above negative ones.
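As a rough illustration of that metric, the sketch below computes the area under the ROC curve from its pairwise definition: the fraction of (active, inactive) pairs in which the active compound receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions made for this example; they are not taken from the researchers' paper or code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (active) or 0 (inactive)
    scores: iterable of model scores, higher meaning "more likely active"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # active correctly ranked above inactive
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two inactives followed by two actives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

A value of 1.0 means every active outranks every inactive, while 0.5 is what random scoring would give on average. The pairwise loop is quadratic, so it suits small examples rather than millions of data points.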

"The imbalance present in our datasets means that performance varies widely depending on the particular training/test split. To compensate for this variability, we used stratified K-fold cross-validation; that is, each fold maintains the active/inactive proportion present in the unsplit data," added the researchers.
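The idea behind stratified splitting can be sketched in a few lines. The example below groups example indices by class, shuffles each group, and deals them out to the folds in turn so that every fold ends up with roughly the same active/inactive ratio as the full dataset. The function name `stratified_kfold`, the fixed seed, and the 10-to-90 toy dataset are assumptions for illustration only; the researchers' actual implementation and fold count are not described here.

```python
import random


def stratified_kfold(labels, k, seed=0):
    """Split example indices into k folds that preserve class proportions."""
    rng = random.Random(seed)

    # Group the indices of the examples by their class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    # Shuffle within each class, then deal indices to folds round-robin.
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return [sorted(f) for f in folds]


# Hypothetical imbalanced dataset: 10 actives (1) and 90 inactives (0).
toy_labels = [1] * 10 + [0] * 90
folds = stratified_kfold(toy_labels, k=5)

# Each fold serves once as the test set; the remaining folds form the training set.
for test_idx in folds:
    train_idx = [i for f in folds if f is not test_idx for i in f]
```

With a plain random split, a fold drawn from data this imbalanced could easily contain very few actives or none at all, which is the variability the researchers describe.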

A key finding was that multitask networks allow significantly more accurate predictions than single-task methods. Their predictive capability also improves as more tasks and data are added to the models, and large multitask networks transferred better to tasks not contained in the training data.

The researchers noted in their paper that access to more relevant data is key to being able to build state-of-the-art models.

"Major pharmaceutical companies possess vast private stores of experimental measurements; our work provides a strong argument that increased data sharing could result in benefits for all."

The researchers also wrote that it's "disappointing... that all published applications of deep learning to virtual screening (that we are aware of) use distinct datasets that are not directly comparable", meaning standards for datasets and performance metrics need to be established.

"Another direction for future work is the further study of small molecule featurization. In this work, we use only one possible featurization (ECFP4), but there exist many others," the researchers said.
