Technical model description

This is a technical description of the model "Rat toxicity prediction with Predictive Clustering Trees".

General

Name: Rat toxicity prediction with Predictive Clustering Trees
#endpoints: 28
#training dataset compounds: 899
#descriptors: 379
MLC algorithm: Predictive Clustering Trees
Descriptors: Molecular weight, logP and structural fragments (matched SMARTS patterns)
Applicability domain: Distance-based AD using a centroid
Imputation: Imputation enabled using Ensemble of Classifier Chains

MLC algorithm

Multi-Label-Classification (MLC) algorithms simultaneously predict multiple classes (endpoints) for an instance (compound). Different approaches exist to exploit the inter-correlation of endpoint values. Our framework uses the Mulan library, slightly modified to handle missing values.

The method used for this model is:

Predictive Clustering Trees

Predictive Clustering Trees is a decision tree algorithm that directly supports multi-label classification. The original library (named Clus) was slightly updated and wrapped for Mulan.

Categories

Predictive clustering trees produce clusters of compounds (the leaves of the tree). The compounds in a cluster have a similar toxicity profile and similar feature values. Each predicted compound is assigned to a category.
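To illustrate how a compound ends up in a category, the following is a minimal sketch of walking a decision tree from the root to a leaf. It is not the Clus implementation: the tree shape, the feature names (logP, has_nitro), the thresholds and the category labels are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    # A leaf corresponds to one cluster (category) of training compounds.
    category: str


@dataclass
class Split:
    # An inner node tests one descriptor against a threshold.
    feature: str
    threshold: float
    below: Union["Split", Leaf]  # taken when value <= threshold
    above: Union["Split", Leaf]  # taken when value > threshold


def assign_category(node, compound):
    """Follow the splits until a leaf is reached and return its category."""
    while isinstance(node, Split):
        if compound[node.feature] <= node.threshold:
            node = node.below
        else:
            node = node.above
    return node.category


# Hypothetical tree: feature names, thresholds and labels are assumptions.
tree = Split(
    "logP", 3.0,
    Leaf("category-1"),
    Split("has_nitro", 0.5, Leaf("category-2"), Leaf("category-3")),
)

print(assign_category(tree, {"logP": 2.1, "has_nitro": 1}))
print(assign_category(tree, {"logP": 4.0, "has_nitro": 1}))
```

In a real predictive clustering tree, each leaf would also store the toxicity profile of its training compounds, which is what makes the category informative.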

Descriptors

Descriptors are numerical or nominal attributes that are computed for each compound. The (Q)SAR models use the descriptors as independent variables to make predictions for unseen compounds.

The method used for this model is:

Molecular weight, logP and structural fragments (matched SMARTS patterns)

Molecular weight and logP are computed with the Open Babel library. Structural fragments are computed by matching the compounds with three pre-defined SMARTS lists included in Open Babel (patterns, SMARTS_InteLigand, MACCS).

Applicability domain

The applicability domain (AD) describes the compound feature space that a (Q)SAR model can be applied to. Predictions for compounds that are outside the AD should not be trusted. The AD ensures that a model only interpolates and does not extrapolate.

The method used for this model is:

Distance-based AD using a centroid

This method computes a centroid compound with mean feature values. The predicted compound is considered to be inside the applicability domain if its Euclidean distance to the centroid is at most twice the median distance of the training set compounds to the centroid.
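The centroid rule described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the model's actual code: the function names are made up, the toy training data is invented, and it assumes the features are used as given (whether the model scales or normalizes descriptors first is not stated here).

```python
import math
from statistics import median


def centroid(rows):
    """Mean feature vector of the training compounds."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_ad(training, factor=2.0):
    """Return the centroid and the distance threshold (factor x median distance)."""
    center = centroid(training)
    distances = [euclidean(row, center) for row in training]
    return center, factor * median(distances)


def inside_ad(compound, center, threshold):
    """True if the compound lies within the threshold distance of the centroid."""
    return euclidean(compound, center) <= threshold


# Toy training set with two features per compound (values are invented).
training = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
center, threshold = fit_ad(training)

print(inside_ad([1.5, 1.5], center, threshold))
print(inside_ad([5.0, 5.0], center, threshold))
```

The median makes the threshold robust against a few training compounds that lie far from the centroid, which a mean-based threshold would not be.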

Imputation

Imputation is a technique for filling in the missing values in the dataset before the dataset is used as input for the prediction algorithm that predicts new compounds.

The method used for this model is:

Imputation enabled using Ensemble of Classifier Chains

...

Model confidence

The prediction model provides a confidence value together with each single compound prediction value. High confidence (>66%) means that the algorithm is confident about the prediction (the prediction can still be wrong). Low confidence (<33%) means that the algorithm is very unsure.
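The thresholds above can be applied as a simple banding function. This is a sketch under assumptions: the function name is hypothetical, confidence is assumed to be given as a fraction between 0 and 1, and the label "intermediate" for values between 33% and 66% is an assumption, since the description does not name that range.

```python
def confidence_band(confidence):
    """Map a confidence value in [0, 1] to a coarse band.

    Thresholds follow the description: above 66% is high, below 33% is low.
    The "intermediate" label for everything in between is an assumption.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > 0.66:
        return "high"
    if confidence < 0.33:
        return "low"
    return "intermediate"


for value in (0.9, 0.5, 0.1):
    print(value, confidence_band(value))
```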