Predicting environmental carcinogens with logistic regression, knn, gradient boosting and molecular fingerprinting

Balancing imbalanced data, exploring accuracy metrics, and an introduction to cheminformatics

Introduction and background

What is Tox 21?

The Toxicology in the 21st Century program, or Tox21, is a unique collaboration between several federal agencies to develop new ways to rapidly test whether substances adversely affect human health. Substances assayed in Tox21 include a diverse range of products such as: commercial chemicals, pesticides, food additives/contaminants, and medical compounds.

Why the p53 protein?

The p53 gene encodes a protein of the same name and is known as a tumor-suppressor protein. The p53 protein is expressed in cells when they undergo DNA damage — which can transform a normal cell into a cancerous one. To counteract the effects, p53 can cause growth arrest, repair DNA, or begin the process of cell death. Therefore, when DNA damage occurs, there is a significant increase in p53 expression. This increase in protein expression is a good indicator of irregular cell health. The Tox21 data was generated by testing cell lines which produce a florescent reporter gene product under the control of p53 cellular machinery. By measuring levels of the reporter gene product against various compounds, researchers were able to determine whether a compound was an agonist (activator) of the p53 pathway or not.

Predicting agonists using molecular fingerprinting

Fingerprinting is a way to represent molecular structure and properties as binary bit strings (0’s and 1's). This representation was initially developed and applied to searching databases for molecules with a specific substructure — but it can also be applied to machine learning. A hash function is a random number generator and is applied to each feature of a molecule, such as the types of bonds and molecules present, which means that they act as seeds to the function.

The following image illustrates how a two molecules can have their similarity assessed (using The Tanimoto index) by generating molecular fingerprints. Each molecule first has the hash function applied to it and then a fingerprint is generated based on features. The fingerprint generator in the image below looks at a certain bond distance radius and the features within that distance.

Types of fingerprints

This project will be using four different types of fingerprints and comparing their predictive strength.

Morgan circular fingerprint —  This fingerprint is part of the Extended-Connectivity Fingerprints (ECFPs) family and are generated using the Morgan algorithm [2,3]. These fingerprints represent molecular structures and the presence of substructures by means of circular atom neighborhoods (bond radius). Another important feature is their ability to determine the absence or presence of molecular functionality, which can further help discriminate when classifying molecules.

Daylight-like fingerprint — This fingerprint generator (using RDKit) produces a fingerprint similar to the fingerprint generated using the Daylight fingerprinting algorithm. An oversimplified explanation: the algorithm hashes bonds along a path within a molecule (topological pathways)[4][5]. The image below shows how bond paths are identified and described.

Atom-pair fingerprint — The atom-pair fingerprint is constructed using — as the name suggests — pairs of atoms (as features) as well as their topological distances. The atom-pair fingerprint is generated as shown below by first identifying heavy atoms and the shortest distance between them. Features of the individual atoms within a pair (like the number of bonds) are encoded . These encoded features are then converted into bit strings and represented as one number. This concatenated string of integers is then passed to the hash function which then assigns it as a 1 or 0.

Topological torsion fingerprints — are 2D structural fin­gerprints that are generated by identifying bond paths of four non-hydrogen atoms. These four-path fragments then have their features, such as the number of π electrons of each atom, atomic type and number of non-hydrogen branches calculated. Each atom is characterized by 8 bits and the 4 atoms have their bits stored in a 32-bit array.

Preparing the data

The Tox21 data is already labeled with active (1) and inactive (0) states so there really isn’t much to do in this step other than load, format column names, and generate mol files (these files are generally classified as data files that contain molecular data information, atom, bonds, coordinates and connectivity information) for each molecule.

In [88]:
# general and data handling
import numpy as np
import pandas as pd
import os
from collections import Counter

# Required RDKit modules
import rdkit as rd
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit import RDConfig
from rdkit.Chem import PandasTools
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
from rdkit.Chem import AllChem as Chem
from rdkit.Chem.rdMolDescriptors import GetAtomPairFingerprint
from rdkit.Chem.AtomPairs import Torsions

# modeling
import sklearn as sk
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import validation_curve
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from imblearn.over_sampling import ADASYN
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Graphing
import matplotlib.pyplot as plt
import seaborn as sns

Loading and formatting

To follow along, edit the path to where the data is on your machine. We load an sdf file which is a structure-data file and contains associated data for compounds. We actually don't need these details as they wont add to classifying p53 agonists, hence we drop them. For example the "formula column" gives us no usable information at all since its the empirical formula and doesn't contain the valuable structural information a SMILES string provides. We do need mol representations of the molecules so we generate those.

In [3]:
sdfFile = os.path.join(RDConfig.RDDataDir, r'C:/Users/gurka/Projects/p53/sr-p53.sdf')
p53 = PandasTools.LoadSDF(sdfFile,smilesName='SMILES',molColName='Molecule', includeFingerprints=True)
p53 = p53.drop(['DSSTox_CID', 'FW', "Formula", 'Molecule'], axis=1)
0 1 NCGC00166288-01 CCN1C(=CC=Cc2sc3ccccc3[n+]2CC)Sc2ccccc21.[I-]
1 1 NCGC00185752-01 COC(=O)CC(O)(CCCC(C)(C)O)C(=O)OC1C(OC)=CC23CCCN2CCc2cc4c(cc2C13)OCO4
2 0 NCGC00094121-01 Cc1cccc(C)c1OCC(C)N.Cl
3 1 NCGC00241107-01 CO.COc1cc(Nc2c(C#N)cnc3cc(OCCCN4CCN(C)CC4)c(OC)cc23)c(Cl)cc1Cl
4 0 NCGC00094586-01 CC1(C)SC2C(NC(=O)C(N)c3ccc(O)cc3)C(=O)N2C1C(=O)O

mol generation and unique records

We should also ensure that there aren't any duplicates in the data. This data set in particular has a total number of 8629 records and only 6826 unique records, meaning that ~20% of the data is duplicated and needs to be dropped.

In [4]:
p53["mol"] = [Chem.MolFromSmiles(x) for x in p53["SMILES"]]
#frame1['MFP2BV'] = [Chem.GetMorganFingerprintAsBitVect(x, 2, nBits=1024) for x in frame1["mol"]]
#frame1["bitstring"] = [x.ToBitString for x in frame1["MFP2BV"]]

#count the nummber of unique rows in the SMILES column

#Compare that with total rows

#we can drop the duplicated ones
p53 = p53.drop_duplicates(['SMILES'])

Fingerprint generation

Here is an example of fingerprint generation using the Morgan algorithm. A fingerprint is generated for each compound in the "mol" column with a radius of 2 and a bit length of 2048. A list with the same parameters is also generated for each mol. A for loop then iterates through the loop to create a numpy array and append a new list denoted by "np" in the name. These lists will become our x variables when running the classifiers.

In [5]:
#create a column for the morgan fingerprints
p53["morg_fp"] = [Chem.GetMorganFingerprintAsBitVect(m, 2, nBits = 2048) for m in p53['mol']]

# generate morgan fingeprints with radius 2 contained in a list
morg_fp = [Chem.GetMorganFingerprintAsBitVect(m, 2, nBits = 2048) for m in p53['mol']]

# convert the RDKit explicit vectors into numpy arrays
morg_fp_np = []
for fp in morg_fp:
  arr = np.zeros((1,))
  DataStructs.ConvertToNumpyArray(fp, arr)
In [6]:
#create a column for the RDKit fingerprints
p53["rd_fp"] = [Chem.RDKFingerprint(m) for m in p53["mol"]]

#The fingerprinting algorithm used is similar to that used in the Daylight fingerprinter
rd_fp = [Chem.RDKFingerprint(m) for m in p53["mol"]]

rd_fp_np = []
for fp in rd_fp:
  arr = np.zeros((1,))
  DataStructs.ConvertToNumpyArray(fp, arr)
In [7]:
#create a column for the Atom-pair fingerprints
p53["AP_fp"] = [Chem.GetHashedAtomPairFingerprintAsBitVect(m) for m in p53["mol"]]

#get atom-par fingerprints
AP_fp = [Chem.GetHashedAtomPairFingerprintAsBitVect(m) for m in p53["mol"]]

#convert the RDKit explicit vectors into numpy arrays
AP_fp_np = []
for fp in AP_fp:
  arr = np.zeros((1,))
  DataStructs.ConvertToNumpyArray(fp, arr)
In [8]:
#create a column for the topological torsion fingerprints
p53["torsion_fp"] = torsion_fp = [Chem.GetHashedTopologicalTorsionFingerprintAsBitVect(m) for m in p53["mol"]]

#get topological torsion fingerprints
torsion_fp = [Chem.GetHashedTopologicalTorsionFingerprintAsBitVect(m) for m in p53["mol"]]

#convert the RDKit explicit vectors into numpy arrays
torsion_fp_np = []
for fp in torsion_fp:
  arr = np.zeros((1,))
  DataStructs.ConvertToNumpyArray(fp, arr)

Setting the x and y variables:

In [9]:
#morgan fingerprints
x_morg = morg_fp_np

#daylight-like fingerprints
x_rd = rd_fp_np

#atom-pair fingerprints
x_AP = AP_fp_np

#topological torsion fingerprints
x_torsion = torsion_fp_np

#classification labels
y = p53.Active.values

Sampling: Are the classes balanced?

When the classification labels (in our case: active = 1; inactive = 0) are skewed in proportion, bias is introduced into the model. For example, if the data set contains 95% active labels then the model will likely classify an inactive molecule as an active one, simply put, the accuracy will reflect the class class distribution. Determining if class imbalance is straight forward — count the distribution of labels. Once we determine how prevalent the minority class is in our data set, we can move forward with balancing it out.

In [9]:
#store the activity labels of all compounds
y = p53.Active.values

#return the count for each unique value (0 and 1)
unique, counts = np.unique(y, return_counts=True)
dict(zip(unique, counts))

#plot the label counts 
count1 = sns.catplot(x="Active", kind="count", palette="ch:.25", data=p53)

#I added the count of class labels as the title here
count1.fig.suptitle('0: 6406, 1: 420')
Text(0.5, 0.98, '0: 6406, 1: 420')

The count plot above shows us that there is a disproportionate distribution of classes in our data, a ratio of 15:1 (inactive to active) meaning approximately ~93.84% of our classes being inactive.

Methods of class balancing

There are two ways to balance the data: over-sampling and under-sampling. Over sampling brings balance by generating synthetic data points using using one of many algorithms. The one we will use today is a variation of the SMOTE (Synthetic Minority Over-Sampling Technique) algorithm. SMOTE works by searching for the k-nearest neighbors for a given point of data and then generates data points along the line between neighbors. Since these points fall on a line between the real data points, there is some correlation, which is why we will use the ADASYN algorithm (Adaptive Synthetic). ADASYN works essentially the same way SMOTE does, however, it adds small randomly generated values to the points to increase variance and simulate more realistic data.

note - I opted not to use under-sampling since it balances the classes by removing data points from the majority class and therefore there is a loss of data, possibly affecting the classifier's ability to discriminate.

In [10]:
x_morg_rsmp, y_morg_rsmp = ADASYN().fit_resample(x_morg, y)
morg_sample_count = sorted(Counter(y_morg_rsmp).items())

x_rd_rsmp, y_rd_rsmp = ADASYN().fit_resample(x_rd, y)
rd_sample_count = sorted(Counter(y_rd_rsmp).items())

x_AP_rsmp, y_AP_rsmp = ADASYN().fit_resample(x_AP, y)
AP_sample_count = sorted(Counter(y_rd_rsmp).items())

x_torsion_rsmp, y_torsion_rsmp = ADASYN().fit_resample(x_torsion, y)
torsion_sample_count = sorted(Counter(y_rd_rsmp).items())

print("Morgan distribution:", morg_sample_count, 
      "Daylight-like distribution:", rd_sample_count, 
      "Atom-pair distribution:", AP_sample_count, 
      "Torsion distribution:", torsion_sample_count)
Morgan distribution: [('0', 6406), ('1', 6414)] Daylight-like distribution: [('0', 6406), ('1', 6385)] Atom-pair distribution: [('0', 6406), ('1', 6385)] Torsion distribution: [('0', 6406), ('1', 6385)]

We can plot one of the re-sampled distributions to visually represent class balance:

In [11]:
y_morg_rsmp_df = pd.DataFrame(y_morg_rsmp, columns=['Active'])
sns.catplot(x="Active", kind="count", palette="ch:.25", data=y_morg_rsmp_df);

Logistic Regression

The logistic regression algorithm classifies a categorical response (outcome) variable between 1 and 0 based on its relationship with predictive features. In contrast, linear regression outputs response variables that are continuous and can be any real number. Most importantly, linear regression does not output probabilities and instead fits the best hyperplane. Therefore logistic regression is the natural choice for a binary classification problem such as ours.

Creating training, testing, and validation test sets

In order to train, test, and then test again we need to create three separate data partitions. The training data will be used to train our model - essentially the model will use this data to learn which fingerprints (the 2048 bit patterns) likely belong to p53 agonists. The test set will be used to evaluate the accuracy of our model, and lastly a validation set will be used as unbiased data set to test the predictive ability of our model. Our data will be split into 85/15 train/test sets and then the training data will be further split into an 85/15 train/validation set.

In [11]:
#Partitioning the data into an 85/15 train/test split
x_morg_train, x_morg_test, y_morg_train, y_morg_test = train_test_split(x_morg_rsmp, y_morg_rsmp, test_size=0.15, random_state=1)
#The training data is further split into an 80/15 train/validtion split 
x_morg_train, x_morg_val, y_morg_train, y_morg_val = train_test_split(x_morg_train, y_morg_train, test_size=0.15, random_state=1)

x_rd_train, x_rd_test, y_rd_train, y_rd_test = train_test_split(x_rd_rsmp, y_rd_rsmp, test_size=0.15, random_state=1)
x_rd_train, x_rd_val, y_rd_train, y_rd_val = train_test_split(x_rd_train, y_rd_train, test_size=0.15, random_state=1)

x_AP_train, x_AP_test, y_AP_train, y_AP_test = train_test_split(x_AP_rsmp, y_AP_rsmp, test_size=0.15, random_state=1)
x_AP_train, x_AP_val, y_AP_train, y_AP_val = train_test_split(x_AP_train, y_AP_train, test_size=0.15, random_state=1)

x_torsion_train, x_torsion_test, y_torsion_train, y_torsion_test = train_test_split(x_torsion_rsmp, y_torsion_rsmp, test_size=0.15, random_state=1)
x_torsion_train, x_torsion_val, y_torsion_train, y_torsion_val = train_test_split(x_torsion_train, y_torsion_train, test_size=0.15, random_state=1)

Cross validation and model fitting

Cross validation is a statistical technique to reduce the risk of over-fitting the data. Over-fitting occurs when the classifier memorizes the data and fits noise and error instead of the underlying relationship between variables. This can occur if the model is overly complex and if there is limited data to train on. Our model doesn't seem to suffer from these two limitations but its good practice to perform cross validation anyway.

The top image illustrate how a model may "memorize" and essentially fit the data too well. The bottom image shows how cross validation creates 10 folds, each of which is split into a test and train set to fit a model, and then averages the error from each fold.

The following lines of code show how to perform logistic regression, obtain cross validations scores for each fold (just one iteration on the training data to estimate scores using the chosen classifier). We then model logistic regression over 1000 iterations while performing k-fold (10) cross validation.

In [37]:
#set our linear regression function
lr = LogisticRegression(solver = 'lbfgs', max_iter = 1000)
#fit the data 
#perform 10 fold crossvalidation to estimate accuracy of model
accuracy_morg = cross_val_score(lr, x_morg_train, y_morg_train, scoring='accuracy', cv = 10)

#get the mean of each fold 
print("Accuracy of Model with Cross Validation is:",accuracy_morg.mean() * 100, "(+/- %0.2f)" % accuracy_morg.std())

#Create our model using 10 fold cross valiation, lbfgs, and 1000 iterations - and fit the data to out training set
lr_morg_ = LogisticRegressionCV(cv=10, solver = 'lbfgs', max_iter = 1000).fit(x_morg_train, y_morg_train)
Accuracy of Model with Cross Validation is: 95.69204220812904 (+/- 0.00)

For the rest of the fingerprints we can simply write a function to model the data:

In [177]:
# A function to train the logistic regression classifier on given x and y training sets
def lrCV_model(x_train, y_train):
    lr = LogisticRegression(solver = 'lbfgs', max_iter = 650)
    model_accuracy = cross_val_score(lr, x_train, y_train, scoring='accuracy', cv = 5)
    accuracy_printout = "Accuracy of Model with Cross Validation is:",model_accuracy.mean() * 100, "(+/- %0.2f)" % model_accuracy.std()
    lr_fit = LogisticRegressionCV(cv=5, solver = 'lbfgs', max_iter = 650, multi_class = "ovr", n_jobs = -1).fit(x_train, y_train)
    return lr_fit
In [ ]:
#train logistic regression models for all fingerprints
lr_morg = lrCV_model(x_morg_train, y_morg_train)
lr_rd = lrCV_model(x_rd_train, y_rd_train)
lr_AP = lrCV_model(x_AP_train, y_AP_train)
lr_torsion = lrCV_model(x_torsion_train, y_torsion_train)

Model Metrics and prediction

Test set predictions and confusion matrix

A confusion matrix is a table which describes the performance of the trained model on test data. There are four quadrants comprising the matrix:

true positives (TP): These are compounds that were predicted positive (active p53 agonist) and actually are positive.

true negatives (TN): These compounds were predicted to be inactive and actually are inactive.

false positives (FP): Cases where the compounds are predicted to be active but are actually inactive (Type I error).

false negatives (FN): Cases where the compounds are predicted to be inactive but are actually active (Type II error)

We can plot the confusion matrix for both test and validation set predictions along with accuracy - which is determined through (TP +TN)/Total cases. The confusion matrix on the left is that of the test set predictions and the right is for the validation set.

In [20]:
#predicting using the test set and creating a confusion matrix
predictions_morg_test = lr_morg.predict(x_morg_test)

#confustion matrix for test set predictions
cm_morg_test = metrics.confusion_matrix(y_morg_test, predictions_morg_test)

[[909  66]
 [  0 948]]

Validation set predictions and confusion matrix

In [21]:
#predicting using the validation set and creating a confusion matrix
predictions_morg_val = lr_morg.predict(x_morg_val)

#confustion matrix for validation set predictions
cm_morg_val = metrics.confusion_matrix(y_morg_val, predictions_morg_val)

[[772  74]
 [  0 789]]

Model accuracy on test set

In [178]:
score_morg_train = lr_morg.score(x_morg_train, y_morg_train)
print('Accuracy of logistic regression classifier on train set:', score_morg_train)
Accuracy of logistic regression classifier on train set: 0.9990282876268625

Model accuracy on test set

In [22]:
score_morg_test = lr_morg.score(x_morg_test, y_morg_test)
print('Accuracy of logistic regression classifier on test set:', score_morg_test)
Accuracy of logistic regression classifier on test set: 0.9656786271450858

Model accuracy on validation set

In [23]:
score_morg_val = lr_morg.score(x_morg_val, y_morg_val)
print('Accuracy of logistic regression classifier on test set:', score_morg_val)
Accuracy of logistic regression classifier on test set: 0.9547400611620795

Classification report

In [24]:
print(classification_report(y_morg_val, predictions_morg_val))
              precision    recall  f1-score   support

           0       1.00      0.91      0.95       846
           1       0.91      1.00      0.96       789

   micro avg       0.95      0.95      0.95      1635
   macro avg       0.96      0.96      0.95      1635
weighted avg       0.96      0.95      0.95      1635

In [30]:
sns.heatmap(cm_morg_test, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score_morg_test)
plt.title(all_sample_title, size = 15);

Validation set confusion matrix plotted

In [26]:
sns.heatmap(cm_morg_val, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score_morg_val)
plt.title(all_sample_title, size = 15);

Daylight-like fingerprint - logistic regression

In [179]:
#predicting using the test set and creating a confusion matrix
predictions_rd_test = lr_rd.predict(x_rd_test)

#confustion matrix for test set predictions
cm_rd_test = metrics.confusion_matrix(y_rd_test, predictions_rd_test)

#predicting using the validation set and creating a confusion matrix
predictions_rd_val = lr_rd.predict(x_rd_val)

#confustion matrix for test validation predictions
cm_rd_val = metrics.confusion_matrix(y_rd_val, predictions_rd_val)


score_rd_train = lr_rd.score(x_rd_train, y_rd_train)
print('Accuracy of logistic regression classifier on train set:', score_rd_train)

score_rd_test = lr_rd.score(x_rd_test, y_rd_test)
print('Accuracy of logistic regression classifier on test set:', score_rd_test)

score_rd_val = lr_rd.score(x_rd_val, y_rd_val)
print('Accuracy of logistic regression classifier on validation set:', score_rd_val)

print(classification_report(y_rd_val, predictions_rd_val))
[[781  53]
 [  0 797]]
Accuracy of logistic regression classifier on train set: 0.9967535980954442
Accuracy of logistic regression classifier on test set: 0.9661281917665451
Accuracy of logistic regression classifier on validation set: 0.967504598405886
              precision    recall  f1-score   support

           0       1.00      0.94      0.97       834
           1       0.94      1.00      0.97       797

   micro avg       0.97      0.97      0.97      1631
   macro avg       0.97      0.97      0.97      1631
weighted avg       0.97      0.97      0.97      1631

Atom-pair fingerprint - logistic regression

In [180]:
#predicting using the test set and creating a confusion matrix
predictions_AP_test = lr_AP.predict(x_AP_test)

#confustion matrix for test set predictions
cm_AP_test = metrics.confusion_matrix(y_AP_test, predictions_AP_test)


#predicting using the validation set and creating a confusion matrix
predictions_AP_val = lr_AP.predict(x_AP_val)

#confustion matrix for test validation predictions
cm_AP_val = metrics.confusion_matrix(y_AP_val, predictions_AP_val)


score_AP_train = lr_AP.score(x_AP_train, y_AP_train)
print('Accuracy of logistic regression classifier on train set:', score_AP_train)

score_AP_test = lr_AP.score(x_AP_test, y_AP_test)
print('Accuracy of logistic regression classifier on test set:', score_AP_test)

score_AP_val = lr_AP.score(x_AP_val, y_AP_val)
print('Accuracy of logistic regression classifier on validation set:', score_AP_val)

print(classification_report(y_AP_val, predictions_AP_val))
[[856  92]
 [  0 981]]
[[747  83]
 [  0 810]]
Accuracy of logistic regression classifier on train set: 0.993434506511678
Accuracy of logistic regression classifier on test set: 0.9523068947641264
Accuracy of logistic regression classifier on validation set: 0.949390243902439
              precision    recall  f1-score   support

           0       1.00      0.90      0.95       830
           1       0.91      1.00      0.95       810

   micro avg       0.95      0.95      0.95      1640
   macro avg       0.95      0.95      0.95      1640
weighted avg       0.95      0.95      0.95      1640

Topological torsion fingerprint - logistic regression

In [182]:
#predicting using the test set and creating a confusion matrix
predictions_torsion_test = lr_torsion.predict(x_torsion_test)

#confustion matrix for test set predictions
cm_torsion_test = metrics.confusion_matrix(y_torsion_test, predictions_torsion_test)


#predicting using the validation set and creating a confusion matrix
predictions_torsion_val = lr_torsion.predict(x_torsion_val)

#confustion matrix for test validation predictions
cm_torsion_val = metrics.confusion_matrix(y_torsion_val, predictions_torsion_val)


score_torsion_train = lr_torsion.score(x_torsion_train, y_torsion_train)
print('Accuracy of logistic regression classifier on train set:', score_torsion_train)

score_torsion_test = lr_torsion.score(x_torsion_test, y_torsion_test)
print('Accuracy of logistic regression classifier on test set:', score_torsion_test)

score_torsion_val = lr_torsion.score(x_torsion_val, y_torsion_val)
print('Accuracy of logistic regression classifier on validation set:', score_torsion_val)

print(classification_report(y_torsion_val, predictions_torsion_val))
[[830 131]
 [ 13 944]]
[[748  95]
 [  6 782]]
Accuracy of logistic regression classifier on train set: 0.9946952473746887
Accuracy of logistic regression classifier on test set: 0.9249217935349322
Accuracy of logistic regression classifier on validation set: 0.9380748007357449
              precision    recall  f1-score   support

           0       0.99      0.89      0.94       843
           1       0.89      0.99      0.94       788

   micro avg       0.94      0.94      0.94      1631
   macro avg       0.94      0.94      0.94      1631
weighted avg       0.94      0.94      0.94      1631

k-nearest neighbor

The k-nearest neighbor algorithm (knn) takes a data point and looks at the k closest data points (k =7 would mean the 7 nearest), this data point is then labeled 0 or 1 according to the majority label of the k closest points. For example, if given a point of data with k=7 and the closest points happen to be five 0's and two 1's then the point would be classified as 0 (inactive).

Knn performs best when working with fewer features, in turn resulting in lower chances of over-fitting. Normally we would have to perform PCA or another form of dimension reduction, however since we're using a single predictive feature we don't need to.

Grid searching and model fitting

Perform a grid search with cross validation to determine the optimal number of neighbors for our data. We will test 9 candidate numbers (3, 5, 7, 9, 11, 13, 15, 17, 19) and perform a 5 fold cross validation. This however may take a while since 9 neighbors x 5-fold validation for each amounts to 45 fits. The k-fold can be reduced so that we can keep a decent range of neighbors to try.

The grid search suggests that the best number of neighbors to use is 3.

Note: set "n_jobs" to -1 - this allows for parallel processing with max CPU usage

In [35]:
knn = KNeighborsClassifier()
grid_params = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19], 'weights': ['distance'], 'metric': ['euclidean']}
knn_cv = GridSearchCV(knn, grid_params, cv=2, verbose = 1, n_jobs = -1)

knn_morg =, y_morg_train)


To test out a range of k neighbors numbers we can fit different fingerprint data with varying k values. Using the Atom-pair training data I tested k = 3, 5, 7, 9 which resulted in following accuracy scores in order:


It seems that 3 and 5 had the same scores, however 3 was likely chosen due to less computation time. Now to finish off, train the models with k=3 and compute their accuracy scores on first the test data and then validation data To test out a range of k neighbors numbers we can fit different fingerprint data with varying k values. Using the Atom-pair training data I tested k = 3, 5, 7, 9 which resulted in following accuracy scores in order:

In [49]:
knn_3n = KNeighborsClassifier(n_neighbors=3)
knn_AP =, y_AP_train)

knn_5n = KNeighborsClassifier(n_neighbors=5)
knn_AP5 =, y_AP_train)

knn_7n = KNeighborsClassifier(n_neighbors=7)
knn_AP7 =, y_AP_train)

knn_9n = KNeighborsClassifier(n_neighbors=9)
knn_AP9 =, y_AP_train)

knn_AP_3n = knn_AP.score(x_AP_test, y_AP_test)
knn_AP_5n = knn_AP5.score(x_AP_test, y_AP_test)
knn_AP_7n = knn_AP7.score(x_AP_test, y_AP_test)
knn_AP_9n = knn_AP9.score(x_AP_test, y_AP_test)

Training the models and writing a function for scoring

In [80]:
knn3 = KNeighborsClassifier(n_neighbors=3)

knn_morg =, y_morg_train)
knn_rd =, y_rd_train)
knn_AP =, y_AP_train)
knn_torsion =, y_torsion_train)

def mod_acc_knn(x_train, y_train, x_test, y_test):
    knn3 = KNeighborsClassifier(n_neighbors=3)
    knn_fp =, y_train)
    score_fp_test_knn = knn_fp.score(x_test, y_test)
    return score_fp_test_knn

knn training data accuracy

In [183]:
knn_morg_acc_train = mod_acc_knn(x_morg_train, y_morg_train, x_morg_train, y_morg_train)
knn_rd_acc_train = mod_acc_knn(x_rd_train, y_rd_train, x_rd_train, y_rd_train)
knn_AP_acc_train = mod_acc_knn(x_AP_train, y_AP_train, x_AP_train, y_AP_train)
knn_torsion_acc_train = mod_acc_knn(x_torsion_train, y_torsion_train, x_torsion_train, y_torsion_train)

knn test data accuracy

In [86]:
knn_morg_acc_test = mod_acc_knn(x_morg_train, y_morg_train, x_morg_test, y_morg_test)
knn_rd_acc_test = mod_acc_knn(x_rd_train, y_rd_train, x_rd_test, y_rd_test)
knn_AP_acc_test = mod_acc_knn(x_AP_train, y_AP_train, x_AP_test, y_AP_test)
knn_torsion_acc_test = mod_acc_knn(x_torsion_train, y_torsion_train, x_torsion_test, y_torsion_test)

knn validation data accuracy

In [158]:
knn_morg_acc_val = mod_acc_knn(x_morg_train, y_morg_train, x_morg_val, y_morg_val)
knn_rd_acc_val = mod_acc_knn(x_rd_train, y_rd_train, x_rd_val, y_rd_val)
knn_AP_acc_val = mod_acc_knn(x_AP_train, y_AP_train, x_AP_val, y_AP_val)
knn_torsion_acc_val = mod_acc_knn(x_torsion_train, y_torsion_train, x_torsion_val, y_torsion_val)

Gradient boosting

So far we have fit a single model to the training data for each classier. Gradient boosting combines many individual models to create a single accurate one - this is known as an ensemble. Boosting is the method used to create an ensemble. It works by fitting the data with an initial model, then a subsequent model is built which works on accurately predicting the labels that the initial model classified incorrectly. Essentially, every subsequent model works to minimize the prediction error of the previous model which leads to an overall decrease in prediction error. Therefore predictions made by the final model are the result of the weighted sum of predictions calculated by all previous models.

Grid searching and model training, and metrics

Perform a grid search to tune parameters as we've done before:

In [145]:
#set search parameters
params_gb = {
    "learning_rate": [1.0],
    "min_samples_split": np.linspace(0.1, 0.5, 12),
    "min_samples_leaf": np.linspace(0.1, 0.5, 12),
    "criterion": ["friedman_mse"],
    "subsample":[0.5, 0.75, 0.95],

#initialize grid seach
gb = GridSearchCV(GradientBoostingClassifier(), params_gb, cv=2, n_jobs=-1, verbose = 2)
#train the model while tuning 
gb_morg =, y_morg_train)

Check out the accuracy on the test and validation data

In [168]:
print(gb_morg.score(x_morg_train, y_morg_train))
print(gb_morg.score(x_morg_test, y_morg_test))
print(gb_morg.score(x_morg_val, y_morg_val))
In [ ]:
#initialize grid seach
gb = GridSearchCV(GradientBoostingClassifier(), params_gb, cv=2, n_jobs=-1, verbose = 2)
#train the model while tuning 
gb_rd =, y_rd_train)
In [167]:
print(gb_rd.score(x_rd_train, y_rd_train))
print(gb_rd.score(x_rd_test, y_rd_test))
print(gb_rd.score(x_rd_val, y_rd_val))
In [ ]:
#initialize grid seach
gb = GridSearchCV(GradientBoostingClassifier(), params_gb, cv=2, n_jobs=-1, verbose = 2)
#train the model while tuning 
gb_AP =, y_AP_train)
In [165]:
print(gb_AP.score(x_AP_train, y_AP_train))
print(gb_AP.score(x_AP_test, y_AP_test))
print(gb_AP.score(x_AP_val, y_AP_val))
In [ ]:
#initialize grid seach
gb = GridSearchCV(GradientBoostingClassifier(), params_gb, cv=2, n_jobs=-1, verbose = 2)
#train the model while tuning 
gb_torsion =, y_torsion_train)
In [166]:
print(gb_torsion_3.score(x_torsion_train, y_torsion_train))
print(gb_torsion_3.score(x_torsion_test, y_torsion_test))
print(gb_torsion_3.score(x_torsion_val, y_torsion_val))

AUC values and F1-scores


The Area Under Curve (AUC) value is another measure to evaluate the performance of a model on a binary classification task and is one of the most widely used metrics for evaluation. Two properties of a given model are needed to calculate the AUC - sensitivity and specificity. Sensitivity is the proportion of correctly classified positive points of data compared to all data points classified as positive. Sensitivity is the proportion of data points that were correctly classified as negative compared to all data points classified as negative.

  1. Sensitivity of a model: True positive / (False Negative + True positive).
  2. Specificity of a model: True negative / (False positive + True negative).

When the true positive rate is plotted against the false positive rate, the AUC of this curve can be calculated and essentially scores the ability of the model to distinguish between the two classes. A higher AUC means that molecules classified as p53 agonists are indeed more likely to be agonists.

In [ ]:
def auc_calc(x_val, y_val, model):
    # Probability predictions using validation data
    prob = model.predict_proba(x_val)
    # Correct probability predictions
    prob = prob[:, 1]
    # Calculate area under the curve with validation data labels and correct predicted probabilities
    auc = metrics.roc_auc_score(y_val, prob)
    # Return the AUC score with 6 figures
    return 'AUC: %.6f' % auc

Morgan fingerprint auc values for all three models

In [211]:
morg_lr_auc = auc_calc(x_morg_val, y_morg_val, lr_morg)
print("AUC using logistic regression: ", morg_lr_auc)
morg_knn_auc = auc_calc(x_morg_val, y_morg_val, knn_morg)
print("AUC using k-nearest naighbor: ", morg_knn_auc)
morg_gb_auc = auc_calc(x_morg_val, y_morg_val, gb_morg)
print("AUC using gradient boosting: ", morg_gb_auc)
AUC using logistic regression:  AUC: 0.973971
AUC using k-nearest naighbor:  AUC: 0.957206
AUC using gradient boosting:  AUC: 0.965234

Daylight-like fingerprint auc values for all three models

In [218]:
rd_lr_auc = auc_calc(x_rd_val, y_rd_val, lr_rd)
print("AUC using logistic regression: ", rd_lr_auc)
rd_knn_auc = auc_calc(x_rd_val, y_rd_val, knn_rd)
print("AUC using k-nearest neighbor: ", rd_knn_auc)
rd_gb_auc = auc_calc(x_rd_val, y_rd_val, gb_rd)
print("AUC using gradient boosting: ", rd_gb_auc)
AUC using logistic regression:  AUC: 0.980311
AUC using k-nearest naighbor:  AUC: 0.957979
AUC using gradient boosting:  AUC: 0.986917

Atom-pair fingerprint auc values for all three models

In [222]:
AP_lr_auc = auc_calc(x_AP_val, y_AP_val, lr_AP)
print("AUC using logistic regression: ", AP_lr_auc)
AP_knn_auc = auc_calc(x_AP_val, y_AP_val, knn_AP)
print("AUC using k-nearest naighbor: ", AP_knn_auc)
AP_gb_auc = auc_calc(x_AP_val, y_AP_val, gb_AP)
print("AUC using gradient boosting: ", AP_gb_auc)
AUC using logistic regression:  AUC: 0.963440
AUC using k-nearest naighbor:  AUC: 0.584849
AUC using gradient boosting:  AUC: 0.988373

Topologial torsion fingerprint auc values for all three models

In [224]:
torsion_lr_auc = auc_calc(x_torsion_val, y_torsion_val, lr_torsion)
print("AUC using logistic regression: ", torsion_lr_auc)
torsion_knn_auc = auc_calc(x_torsion_val, y_torsion_val, knn_torsion)
print("AUC using k-nearest naighbor: ", torsion_knn_auc)
torsion_gb_auc = auc_calc(x_torsion_val, y_torsion_val, gb_torsion)
print("AUC using gradient boosting: ", torsion_gb_auc)
AUC using logistic regression:  AUC: 0.930011
AUC using k-nearest naighbor:  AUC: 0.931734
AUC using gradient boosting:  AUC: 0.716465


The f1-score of a model is calculated using recall (calculated like sensitivity) and precision, which is the rate of:

  • true positive / (false positive + true positive)

The F1-score can then be found using:

A F1 Score finds a balance between a models recall (fewer predicted points turning out to be false negatives) and precision (fewer predicted points turning out to be false positives).

In [ ]:
def f1score(x_val, y_val, model):
    #predictions made on validation data
    predictions = model.predict(x_val)
    #generating the classification report
    clrpt = classification_report(y_val, predictions, output_dict = True)
    #accessing the weighted average entery in dictionairy
    d = clrpt["weighted avg"]
    #calling the f1_score
    f1 = d['f1-score']
    return 'f1-score %.6f' % f1

Morgan fingerprint f-scores for all three models using validation data

In [287]:
morg_lr_f1 = f1score(x_morg_val, y_morg_val, lr_morg)
morg_knn_f1 = f1score(x_morg_val, y_morg_val, knn_morg)
morg_gb_f1 = f1score(x_morg_val, y_morg_val, gb_morg)
f1-score 0.952264
f1-score 0.402606
f1-score 0.911232

Daylight-like fingerprint f-scores for all three models using validation data

In [288]:
rd_lr_f1 = f1score(x_rd_val, y_rd_val, lr_rd)
rd_knn_f1 = f1score(x_rd_val, y_rd_val, knn_rd)
rd_gb_f1 = f1score(x_rd_val, y_rd_val, gb_rd)
f1-score 0.967494
f1-score 0.469449
f1-score 0.960150

Atom-pair fingerprint f-scores for all three models using validation data

In [289]:
AP_lr_f1 = f1score(x_AP_val, y_AP_val, lr_AP)
AP_knn_f1 = f1score(x_AP_val, y_AP_val, knn_AP)
AP_gb_f1 = f1score(x_AP_val, y_AP_val, gb_AP)
f1-score 0.949292
f1-score 0.569587
f1-score 0.964024

Topological torsion fingerprint f-scores for all three models using validation data

In [290]:
torsion_lr_f1 = f1score(x_torsion_val, y_torsion_val, lr_torsion)
torsion_knn_f1 = f1score(x_torsion_val, y_torsion_val, knn_torsion)
torsion_gb_f1 = f1score(x_torsion_val, y_torsion_val, gb_torsion)
f1-score 0.938004
f1-score 0.852951
f1-score 0.651235

Results and conclusion

The table below shows the predictive power of each classifier when using the four fingerprints. Metrics included are accuracy scores are shown for training, test, and validation data sets, AUC scores for validation data, and the f1 scores of the validation data. The highest scores are highlighted in green for each fingerprint and classifier.

Which is the best classifier?

When looking at validation accuracy, logistic regression has the consistently highest scores. Next, when evaluating AUC - arguably the most important metric - gradient boosting produced the highest score but logistic regression had the highest average scores. Lastly, the highest individual and average f1-scores belong to logistic regression, suggesting that it produces models with the greatest balance between precision and robustness.

winner: Logistic regression

Which is the best fingerprint?

Literature suggests that although topological torsion may not be as popular as the Daylight-like and Morgan fingerprints, they show higher performance in comparison to other finger­prints [8]. The results show that the Daylight-like fingerprint had the highest AUC score for both logistic regression and knn - the Atom-pair fingerprint scored only marginally higher when using gradient boosting. However when looking at gradient boosting, Atom-pair fingerprints outscored the Daylight-like fingerprints in validation accuracy, AUC, and F1-scores.

winner: Daylight-like fingerprint

Possible issues

Sometimes accidental collisions between patterns can occur in fingerprints due to similar molecular substructures. This happens when the same bit is set by multiple patterns (hashing a specific feature of two similar structures as the same bit). For example, the topological torsion fingerprint produced eight identical fingerprints, this can be avoided by increasing the bit length (from 2048 to 4096).