Generating reasonable SMIRKS patterns

In this notebook we will demonstrate how chemper’s SMIRKSifier works to generate SMIRKS patterns for a list of molecules and assigned clustering. In other words, the goal is to generate a list of SMIRKS patterns which will maintain the clustering of molecular fragments the user specifies. For example, imagine you have determined the force field parameters for all bonds in your molecules set. You could group the bonds based on those which should be assigned the same force constant and equilibrium bond length. The goal of chemper’s tools is to generate a hierarchical list of SMIRKS that will maintain your clustering.

chemper’s ClusterGraph can create a single SMIRKS pattern for a group of molecular fragments. These SMIRKS patterns are fully specified using all possible SMIRKS decorators for each atom. The SMIRKSifier takes advantage of the ClusterGraph functionality and then removes unnecessary SMIRKS decorators so the final list of patterns is as generic as possible.

In this example, we assume that we want to group bonds based on their bond order so all single bonds should be in one group, all double in another group, and so on. The steps shown below are as follows:

  1. Create chemper molecules for a list of SMILES

  2. Classify the bonds in each molecule as single, double, aromatic, triple. Then group those bonds (based on atom indices) into each of those categories.

  3. Use chemper.optimize_smirks.smirksify to automatically create a hierarchical SMIRKS pattern list, then run it to remove unecessary decorators.

[1]:
from chemper.mol_toolkits import mol_toolkit
from chemper.chemper_utils import create_tuples_for_clusters
from chemper.smirksify import SMIRKSifier, print_smirks

1. Create a list of Molecules

Here we chose a list of SMILES strings and then use chemper.mol_toolkits to create a list of molecule objects.

[2]:
smiles_list = ["CCCCC", "c1ccccc1", "C1=CNC=C1", "CC=CC", "C(=O)OC", "C1C=CCCC1"]
molecules = [mol_toolkit.MolFromSmiles(s) for s in smiles_list]

2. Classify bonds

In this section we classify bonds based on the categories

  • single

  • aromatic

  • double

  • triple

This is done with the utility function create_tuples_for_clusters which creates a list atom indices (as tuples), for each molecule [ ('label', [ [(tuple of atoms for each molecule), ...] ...]) ].

In a bond we have two indexed atoms in a SMIRKS (1 and 2) because you need two atoms in order to identify a bond. For example, in the first molecule above, there is a single bond between atoms 0 and 1. The cluster_list would specify that bond with a tuple (0,1). There is a list of tuples for each molecule associated with each label.

In this example, there are six molecules. As an illustration of how the cluster_list is stuctured, consider the aromatic bonds, at cluster_list[1]. Only molecules 1 and 2 have aromatic bonds. The bonds in these molecules are specified by tuples showing the atom indices for each of the aromatic bonds below. The other four molecules have zero aromatic bonds so the associated lists are empty as no bonds need to be specified.

Moving forward

Obviously in the long run we don’t want to start with SMIRKS pattern, however, you could imagine identifying the equilibrium bond length and force constant for a variety of bonds. You could then cluster those bonds based on which parameters they should be assigned. You could give chemper these clusters of bonds as well.

[3]:
smirks_labels = [('sing', '[*:1]-[*:2]'),
                 ('aromatic', '[*:1]:[*:2]'),
                 ('double', '[*:1]=[*:2]'),
                ('triple', '[*:1]#[*:2]'),
                ]
cluster_list = create_tuples_for_clusters(smirks_labels, molecules)
[4]:
cluster_list[1]
[4]:
('aromatic',
 [[],
  [(0, 1), (1, 2), (4, 5), (0, 5), (2, 3), (3, 4)],
  [(1, 2), (0, 1), (0, 4), (2, 3), (3, 4)],
  [],
  [],
  []])

3. Generate SMIRKS and remove unnecessary SMIRKS decorators

The goal in this step is to create a generic, hierarchical list of SMIRKS patterns which will maintain the clustering of bonds we specified above.

First we will create a SMIRKSifier object. This takes your molecules and the list of classified bonds and automatically creates SMIRKS patterns using ALL possible decorators. As you can see this process leads to highly specific patterns which would not be practical assuming you want your clustering to be applied to molecules outside your training set.

There area also two optional arguments for the SMIRKSifier:

  • max_layers: this is the maximum number of atoms away from the indexed atoms you want included in your automatically generated SMIRKS patterns.

    • Internally, SMIRKSifier starts with 0 layers and only adds atoms if necessary

    • the default is max_layers = 5

  • verbose: if True (the default) the SMIRKSifier prints out information while matching the automatically generated SMIRKS with the initially assigned clusters.

3a. Create the SMIRKSifier printing out initial SMIRKS patterns

[5]:
bond_smirksifier = SMIRKSifier(molecules, cluster_list)

 Label                | SMIRKS
================================================================================
 zz_sing              | [#6!rAH1X3x0,#6!rAH2X4x0,#6!rAH3X4x0,#6AH1X3r6x2,#6AH2X4r6x2,#6H1X3ar5x2,#6H1X3ar6x2,#7H1X3ar5x2,#8!rAH0X2x0;+0:1]-[#1!rH0X1x0,#6!rH1X3x0,#6!rH2X4x0,#6!rH3X4x0,#6H1X3r6x2,#6H2X4r6x2,#8!rH0X2x0;+0;A:2]
--------------------------------------------------------------------------------
 zz_aromatic          | [#6r5,#6r6,#7r5;+0;H1;X3;a;x2:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]
--------------------------------------------------------------------------------
 zz_double            | [#6!rx0,#6r6x2;+0;A;H1;X3:1]=[#6!rH1X3x0,#6H1X3r6x2,#8!rH0X1x0;+0;A:2]
--------------------------------------------------------------------------------

3b. Start removing decorators

The SMIRKSifier.reduce function attempts to remove a single decorator from a randomly chosen SMIRKS pattern during each iteration, it has two argument:

  • max_its(optional, default=1000): Number of iterations to remove

    • currently it does run for this many we are working on determining if there is a way to determine if we are done before the number of iterations is reached

  • verbose (optional, default=do not change the setting): This will temporarily change the SMIRKSifier’s verboseness, so you could make a long run more quiet.

This run returns the current set of SMIRKS patterns at the end of the simulation. You can use the internal SMIRKSifier.print_smirks function to print these in a semi-nicely formatted way.

[6]:
smirks10 = bond_smirksifier.reduce(max_its=10)
Iteration:  0
Attempting to change SMIRKS #2
[#6!rx0,#6r6x2;+0;A;H1;X3:1]=[#6!rH1X3x0,#6H1X3r6x2,#8!rH0X1x0;+0;A:2]  -->  [#6x0,#6x2r6;+0;A;H1;X3:1]=[#6,#6x2r6H1X3,#8H0x0X1;+0;A:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  1
Attempting to change SMIRKS #1
[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]  -->  [#6r5,#6r6,#7r5;a;X3;+0;H1:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  2
Attempting to change SMIRKS #1
[#6r5,#6r6,#7r5;a;X3;+0;H1:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]  -->  [*;a;X3;+0;H1:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  3
Attempting to change SMIRKS #1
[*;a;X3;+0;H1:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]  -->  [*;a;+0;H1:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  4
Attempting to change SMIRKS #0
[#6!rAH1X3x0,#6!rAH2X4x0,#6!rAH3X4x0,#6AH1X3r6x2,#6AH2X4r6x2,#6H1X3ar5x2,#6H1X3ar6x2,#7H1X3ar5x2,#8!rAH0X2x0;+0:1]-[#1!rH0X1x0,#6!rH1X3x0,#6!rH2X4x0,#6!rH3X4x0,#6H1X3r6x2,#6H2X4r6x2,#8!rH0X2x0;+0;A:2]  -->  [#6x0X3AH1,#6x0X4AH2,#6x0AX4H3,#6x2X3AH1r6,#6x2X4AH2r6,#6aX3x2r5H1,#6aX3r6x2H1,#7aX3x2r5H1,#8H0x0AX2;+0:1]-[#1H0x0X1,#6x0X3H1,#6x0X4H2,#6x0X4H3,#6x2X3r6H1,#6x2X4H2r6,#8H0x0X2;A:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  5
Attempting to change SMIRKS #1
[*;a;+0;H1:1]:;@[#6r5,#6r6,#7r5;+0;H1;X3;a;x2:2]  -->  [*;a;+0;H1:1]:;@[#6r5,#6r6,#7r5:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  6
Attempting to change SMIRKS #1
[*;a;+0;H1:1]:;@[#6r5,#6r6,#7r5:2]  -->  [*;a;+0;H1:1]:;@[#6r5,#6,#7r5:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  7
Attempting to change SMIRKS #0
[#6x0X3AH1,#6x0X4AH2,#6x0AX4H3,#6x2X3AH1r6,#6x2X4AH2r6,#6aX3x2r5H1,#6aX3r6x2H1,#7aX3x2r5H1,#8H0x0AX2;+0:1]-[#1H0x0X1,#6x0X3H1,#6x0X4H2,#6x0X4H3,#6x2X3r6H1,#6x2X4H2r6,#8H0x0X2;A:2]  -->  [#6x0AH1X3,#6x0AH2X4,#6x0AX4H3,#6x2AH1X3r6,#6x2AH2X4r6,#6aH1x2r5X3,#6aH1r6x2X3,#7aH1x2r5X3,#8H0x0AX2;+0:1]-[#1H0x0X1,#6x0X3H1,#6x0X4H2,#6x0X4H3,#6,#6x2X4H2r6,#8H0x0X2;A:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  8
Attempting to change SMIRKS #2
[#6x0,#6x2r6;+0;A;H1;X3:1]=[#6,#6x2r6H1X3,#8H0x0X1;+0;A:2]  -->  [#6x0,#6x2r6;+0;A;H1;X3:1]=[#6,#6x2H1X3,#8H0x0X1;+0;A:2]
Accepted!
------------------------------------------------------------------------------------------
Iteration:  9
Attempting to change SMIRKS #0
[#6x0AH1X3,#6x0AH2X4,#6x0AX4H3,#6x2AH1X3r6,#6x2AH2X4r6,#6aH1x2r5X3,#6aH1r6x2X3,#7aH1x2r5X3,#8H0x0AX2;+0:1]-[#1H0x0X1,#6x0X3H1,#6x0X4H2,#6x0X4H3,#6,#6x2X4H2r6,#8H0x0X2;A:2]  -->  [#6x0X3AH1,#6x0X4AH2,#6x0AX4H3,#6x2X3AH1r6,#6x2X4AH2r6,#6aX3x2r5H1,#6aX3r6x2H1,#7aX3x2r5H1,#8H0x0AX2;+0:1]-[#1H0x0X1,#6x0X3H1,#6x0X4H2,#6x0X4H3,#6,#8H0x0X2;A:2]
Accepted!
------------------------------------------------------------------------------------------

 Label                | SMIRKS
================================================================================
 zz_sing              | [#6x0X3AH1,#6x0X4AH2,#6x0AX4H3,#6x2X3AH1r6,#6x2X4AH2r6,#6aX3x2r5H1,#6aX3r6x2H1,#7aX3x2r5H1,#8H0x0AX2;+0:1]-[#1H0x0X1,#6x0X3H1,#6x0X4H2,#6x0X4H3,#6,#8H0x0X2;A:2]
--------------------------------------------------------------------------------
 zz_aromatic          | [*;a;+0;H1:1]:;@[#6r5,#6,#7r5:2]
--------------------------------------------------------------------------------
 zz_double            | [#6x0,#6x2r6;+0;A;H1;X3:1]=[#6,#6x2H1X3,#8H0x0X1;+0;A:2]
--------------------------------------------------------------------------------

[7]:
print_smirks(smirks10)

 Label                | SMIRKS
================================================================================
 zz_sing              | [#6x0X3AH1,#6x0X4AH2,#6x0AX4H3,#6x2X3AH1r6,#6x2X4AH2r6,#6aX3x2r5H1,#6aX3r6x2H1,#7aX3x2r5H1,#8H0x0AX2;+0:1]-[#1H0x0X1,#6x0X3H1,#6x0X4H2,#6x0X4H3,#6,#8H0x0X2;A:2]
--------------------------------------------------------------------------------
 zz_aromatic          | [*;a;+0;H1:1]:;@[#6r5,#6,#7r5:2]
--------------------------------------------------------------------------------
 zz_double            | [#6x0,#6x2r6;+0;A;H1;X3:1]=[#6,#6x2H1X3,#8H0x0X1;+0;A:2]
--------------------------------------------------------------------------------

3c. Continue removing decorators

Now we will continue trying to reduce the SMIRKS. Note, in this case we set verbose to False and just print the final SMIRKS since 3,000 is a lot of steps.

[8]:
smirks3k = bond_smirksifier.reduce(max_its=3000, verbose=False)
print_smirks(smirks3k)

 Label                | SMIRKS
================================================================================
 zz_sing              | [*:1]~[*:2]
--------------------------------------------------------------------------------
 zz_aromatic          | [*:1]:[*:2]
--------------------------------------------------------------------------------
 zz_double            | [*:1]=[*:2]
--------------------------------------------------------------------------------

4. What have we learned for the future

Is there a systematic way to remove decorators that doesn’t introduce too much human wizardary?

Right now the removal of decorators is stochastic, so you don’t guarentee the same SMIRKS will be created for the same clustering of atoms every time. This is because it is possible to have multiple combinations of decorators that which maintain the same clustering.

We could consider looking for differences in the ClusterGraphs and start by removing any decorators that are in common for all SMIRKS since those are clearly not distinguishing features. However, it seems unlikely that a systematic removal wouldn’t be biased by the choices of the human who chose the order for checking the removal of the decorators.

[ ]: