Technical model description

This is a technical description of the model "Carcinogenicity prediction with Ensemble of Classifier Chains".


Name: Carcinogenicity prediction with Ensemble of Classifier Chains
Number of training dataset compounds: 1508
MLC algorithm: Ensemble of Classifier Chains
Base classifier: Random Forest
Descriptors: Physico-chemical descriptors and structural fragments (matched SMARTS patterns, min-freq: 10)
Applicability domain: Distance-based AD using a centroid
Imputation: No Imputation

MLC algorithm

Multi-Label-Classification (MLC) algorithms simultaneously predict multiple classes (endpoints) for an instance (compound). Different approaches exist to exploit the inter-correlation of endpoint values. (Our developed framework utilizes the Mulan library, slightly modified for missing values.)

The method used for this model is:

Ensemble of Classifier Chains

An Ensemble of Classifier Chains (ECC) uses several chains of base classifiers, each of which predicts a single endpoint. Each base classifier can use the endpoint values of the preceding classifiers in its chain as additional input features. The classifiers within each chain are arranged in random order. To predict an endpoint of a compound, a consensus approach merges the predictions of the corresponding models from each chain.
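The chaining and consensus steps can be illustrated with a minimal Python sketch. It is not the Mulan implementation used by this model: the nearest-neighbour base learner, the function names, the number of chains and the 0.5 voting threshold are all assumptions chosen for brevity, and missing endpoint values are not handled.

```python
import random

def fit_nn(rows, labels):
    # Hypothetical stand-in base classifier: 1-nearest-neighbour just stores the data.
    return list(zip(rows, labels))

def predict_nn(model, x):
    # Label of the stored row with the smallest squared Euclidean distance to x.
    return min(model, key=lambda rl: sum((a - b) ** 2 for a, b in zip(rl[0], x)))[1]

def train_chain(X, Y, order):
    # One base classifier per endpoint, trained in the given chain order.
    chain = []
    for pos, label in enumerate(order):
        # Input = descriptors + true values of the endpoints earlier in the chain.
        rows = [list(x) + [y[j] for j in order[:pos]] for x, y in zip(X, Y)]
        chain.append(fit_nn(rows, [y[label] for y in Y]))
    return chain

def predict_chain(chain, order, x):
    extended, preds = list(x), {}
    for model, label in zip(chain, order):
        preds[label] = predict_nn(model, extended)
        extended.append(preds[label])  # passed on to the later classifiers
    return preds

def train_ecc(X, Y, n_chains=5, seed=0):
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_chains):
        order = list(range(len(Y[0])))
        rng.shuffle(order)  # random endpoint order for this chain
        ensemble.append((order, train_chain(X, Y, order)))
    return ensemble

def predict_ecc(ensemble, x):
    votes = [0] * len(ensemble[0][0])
    for order, chain in ensemble:
        for label, value in predict_chain(chain, order, x).items():
            votes[label] += value
    # Consensus: an endpoint is predicted active if at least half the chains say so.
    return [1 if v / len(ensemble) >= 0.5 else 0 for v in votes]
```

Here `X` is a list of descriptor vectors and `Y` a list of 0/1 endpoint vectors of equal length; `predict_ecc(train_ecc(X, Y), x)` returns one 0/1 value per endpoint.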

Base classifier

Most MLC algorithms combine standard machine learning classification approaches (for single class predictions). We refer to the classifier approach used as the 'base classifier'.

The method used for this model is:

Random Forest

The Random Forest approach is a bootstrap aggregation of various decision trees. (We utilize the WEKA implementation with default parameters.)
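Bootstrap aggregation itself can be sketched in a few lines of Python. This is only an illustration, not the WEKA implementation: it uses one-split "stumps" with a mean-value threshold as a hypothetical stand-in for full decision trees, and all function names and parameters are assumptions. A real Random Forest grows complete trees and considers a random subset of features at every split.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    # One-split tree on a single feature; threshold = mean feature value (simplification).
    threshold = sum(row[feature] for row in X) / len(X)
    left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
    right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
    majority = Counter(y).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else majority
    right_label = Counter(right).most_common(1)[0][0] if right else majority
    return feature, threshold, left_label, right_label

def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label

def fit_bagged_stumps(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(X) training rows with replacement.
        idx = [rng.randrange(len(X)) for _ in X]
        feature = rng.randrange(len(X[0]))  # random feature for this stump
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    # Aggregation: majority vote over all trees.
    return Counter(predict_stump(s, x) for s in forest).most_common(1)[0][0]
```

Each tree sees a different resampled training set, so individual errors tend to average out in the vote.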


Descriptors

Descriptors are numerical or nominal attributes that are computed for each compound. The descriptors are employed by the (Q)SAR models as independent variables to predict unseen compounds.

The method used for this model is:

Physico-chemical descriptors and structural fragments (matched SMARTS patterns, min-freq: 10)

Physico-chemical descriptors are computed with the libraries CDK and Open Babel. Structural fragments are computed by matching the compounds with three pre-defined SMARTS lists included in Open Babel (patterns, SMARTS_InteLigand, MACCS) with minimum-frequency 10.
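The minimum-frequency filter can be pictured as follows. The sketch assumes that "minimum-frequency 10" means a fragment is kept only if it matches at least 10 training compounds; that reading, the function name and the example fragment names are assumptions, and the SMARTS matching itself (done with Open Babel in the model) is not shown.

```python
def select_frequent_fragments(fingerprints, min_freq=10):
    """Keep fragments that match at least `min_freq` compounds.

    fingerprints: dict mapping a fragment name to a list of 0/1 flags,
    one flag per training compound (1 = fragment matches the compound).
    """
    return {name: flags for name, flags in fingerprints.items()
            if sum(flags) >= min_freq}

# Hypothetical example with 15 compounds and two fragments.
fingerprints = {
    "aromatic_ring": [1] * 12 + [0] * 3,  # matches 12 compounds, kept
    "nitro_group": [1] * 4 + [0] * 11,    # matches 4 compounds, dropped
}
kept = select_frequent_fragments(fingerprints)
```

Dropping rare fragments keeps the descriptor set small and avoids features that are almost constant across the training set.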

Applicability domain

The applicability domain (AD) describes the region of the compound feature space in which a (Q)SAR model can be applied. Predictions for compounds outside the AD should not be trusted. The AD ensures that the model only interpolates and does not extrapolate.

The method used for this model is:

Distance-based AD using a centroid

This method computes a centroid compound with mean feature values. The predicted compound is considered to be inside the applicability domain if its Euclidean distance to the centroid compound is at most twice the median distance of the training set compounds to the centroid.
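The rule above can be written down as a short Python sketch. The function names are hypothetical, the features are assumed to be purely numeric, and any scaling of feature values that the model may apply beforehand is not shown.

```python
import math
from statistics import median

def fit_centroid_ad(train, factor=2.0):
    """Return the centroid and the distance threshold of the AD.

    train: list of numeric feature vectors of the training compounds.
    """
    n = len(train)
    centroid = [sum(column) / n for column in zip(*train)]  # mean feature values
    distances = [math.dist(row, centroid) for row in train]
    return centroid, factor * median(distances)

def inside_ad(x, centroid, threshold):
    # Inside the AD if the Euclidean distance to the centroid is within the threshold.
    return math.dist(x, centroid) <= threshold

# Hypothetical toy data: four training compounds with two features each.
train = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
centroid, threshold = fit_centroid_ad(train)
print(inside_ad([1.5, 1.5], centroid, threshold))  # True: close to the centroid
print(inside_ad([5.0, 5.0], centroid, threshold))  # False: far outside the training data
```

Using the median rather than the mean makes the threshold less sensitive to a few outlying training compounds.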


Imputation

Imputation is a technique that fills in missing values in the dataset before the dataset is passed to the prediction algorithm.

The method used for this model is:

No Imputation

Imputation is disabled.

Model confidence

The prediction model provides a confidence value together with each single compound prediction value. High confidence (>66%) means that the algorithm is confident about the prediction (the prediction can still be wrong). Low confidence (<33%) means that the algorithm is very unsure.
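The two thresholds can be applied as in the following sketch. The function name is hypothetical, the confidence is assumed to be given as a fraction between 0 and 1, and the label "medium" for values between the two thresholds is an assumption, since the description only names the high and low ranges.

```python
def confidence_label(confidence):
    """Map a confidence value in [0, 1] to a coarse label."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > 0.66:
        return "high"
    if confidence < 0.33:
        return "low"
    return "medium"  # assumed label for the range not named in the description
```

For example, `confidence_label(0.8)` yields "high" and `confidence_label(0.2)` yields "low".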