Make SMIRKS from clustered fragments

This notebook will showcase how ChemPer’s ClusterGraph creates SMIRKS patterns from a group of user-specified molecular fragments.
For example, imagine we wanted to create a SMIRKS pattern for an angle type that appears in many molecules. ClusterGraph collects the SMIRKS decorators from every molecule and stores them in a highly specific SMIRKS pattern.

The ultimate goal for ChemPer is to create a hierarchical list of SMIRKS patterns that retains fragment clustering. We could use this tool to generate SMIRKS patterns for the SMIRNOFF force field format, allowing us to create data-driven, direct chemical perception.

For example, if your initial clusters had 4 types of carbon-carbon bonds (single, aromatic, double, and triple), you would expect the final SMIRKS patterns to reflect those four categories.

The first step here is to store possible decorators for atoms and bonds in a given cluster. In this notebook we will use example SMIRKS patterns as a way of identifying groups of molecular fragments. Then we will use ClusterGraph to create highly specific SMIRKS for these same fragments.

[1]:
# import statements
from chemper.mol_toolkits import mol_toolkit
from chemper.graphs.cluster_graph import ClusterGraph
from chemper.chemper_utils import create_tuples_for_clusters

create_tuples_for_clusters

This is a utility function inside ChemPer which extracts atom indices which match a specific SMIRKS pattern.

Help on function create_tuples_for_clusters in module chemper.chemper_utils:

For example, let's assume you wanted to find all of the atoms that match this SMIRKS list:

* “any”, '[*:1]~[*:2]'
* “single”, '[*:1]-[*:2]'

In this case, the “any” pattern would match all bonds, but the “single” pattern, which comes later in the list, would then claim all single bonds. If you were typing ethene (C=C), you would expect the double bond between carbon atoms 0 and 1 to match “any” and all C-H bonds to match “single”.

The output in this case would be:

[ ('any', [[ (0, 1) ]] ),
  ('single', [[ (0, 2), (0, 3), (1,4), (1,5) ]] )
]
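
The "last match wins" behaviour described above can be sketched without any cheminformatics toolkit. This is only a toy stand-in, not ChemPer's implementation: the bond list for ethene is written out by hand, and the `label_tests` predicates and the `assign_last_match` helper are hypothetical names that replace real SMIRKS matching with a check on bond order.

```python
# Toy stand-in for hierarchical typing: the LAST matching label wins.
# Ethene: carbons 0 and 1, hydrogens 2-5. Each bond is (atom_i, atom_j, order).
ethene_bonds = [(0, 1, 2), (0, 2, 1), (0, 3, 1), (1, 4, 1), (1, 5, 1)]

# Hypothetical predicates standing in for the two SMIRKS patterns.
label_tests = [
    ("any", lambda order: True),           # like '[*:1]~[*:2]'
    ("single", lambda order: order == 1),  # like '[*:1]-[*:2]'
]


def assign_last_match(bonds, tests):
    """Give each bond the label of the last test it passes."""
    clusters = {label: [] for label, _ in tests}
    for i, j, order in bonds:
        winner = None
        for label, test in tests:
            if test(order):
                winner = label  # later labels overwrite earlier ones
        if winner is not None:
            clusters[winner].append((i, j))
    # One inner list per molecule; here there is a single molecule.
    return [(label, [clusters[label]]) for label, _ in tests]


result = assign_last_match(ethene_bonds, label_tests)
print(result)
```

The printed structure has the same shape as the output above: a list of (label, per-molecule list of atom index tuples) pairs.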

Clustering from other SMIRKS

This example attempts to show how ClusterGraph creates a SMIRKS for already clustered sub-graphs.

Here, we will consider two types of angles around tetrahedral carbon atoms. In this hierarchical list c1 would match ANY angle around a tetrahedral carbon (indicated with the connectivity X4 on atom :2). Then c2 would match angles where both outer atoms are hydrogens, just H-C-H angles, meaning those angles would be assigned c2 and NOT c1.

We will use the utility function create_tuples_for_clusters (described above) to identify atoms in each example molecule that match each of these angle types.

[2]:
smirks_list = [
    ("c1", "[*:1]~[#6X4:2]-[*:3]"),
    ("c2", "[#1:1]-[#6X4:2]-[#1:3]"),
]
for label, smirks in smirks_list:
    print(label,'\t',smirks)
c1       [*:1]~[#6X4:2]-[*:3]
c2       [#1:1]-[#6X4:2]-[#1:3]

Start with a single molecule

For the first example, we will start with just one molecule (ethane) and extract the clusters of atoms matching each angle type.

Ethane has a total of 12 sets of angles, all of which can be categorized by the two SMIRKS patterns c1 or c2:

* 6 with the form H-C-C - type c1
* 6 with the form H-C-H - type c2

First we need to extract the atoms for each of these categories. We use tuples of atom indices to represent these two clusters, which are identified using the create_tuples_for_clusters utility function.

[3]:
mol = mol_toolkit.MolFromSmiles('CC')
atom_index_list = create_tuples_for_clusters(smirks_list, [mol])
for label, mol_list in atom_index_list:
    print(label)
    for mol_idx, atom_list in enumerate(mol_list):
        print('\tmolecule ', mol_idx)
        for atoms in atom_list:
            print('\t\t', atoms)
c1
        molecule  0
                 (1, 0, 3)
                 (0, 1, 7)
                 (0, 1, 6)
                 (1, 0, 4)
                 (1, 0, 2)
                 (0, 1, 5)
c2
        molecule  0
                 (5, 1, 7)
                 (5, 1, 6)
                 (6, 1, 7)
                 (3, 0, 4)
                 (2, 0, 4)
                 (2, 0, 3)
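
The split into 6 + 6 angles can be checked by hand. The sketch below is an illustration only: it assumes the atom ordering seen in the output above (carbons 0 and 1, hydrogens 2-4 on carbon 0 and 5-7 on carbon 1) and enumerates every pair of neighbors around each carbon. The `neighbors` and `element` dictionaries are written out manually rather than taken from a toolkit.

```python
from itertools import combinations

# Hand-written ethane connectivity, assuming the atom order shown above.
neighbors = {0: [1, 2, 3, 4], 1: [0, 5, 6, 7]}
element = {0: "C", 1: "C", 2: "H", 3: "H", 4: "H", 5: "H", 6: "H", 7: "H"}

angles = {"c1": [], "c2": []}
for center, nbrs in neighbors.items():
    # Every unordered pair of neighbors defines one angle around the center.
    for a, b in combinations(nbrs, 2):
        both_h = element[a] == "H" and element[b] == "H"
        angles["c2" if both_h else "c1"].append((a, center, b))

for label, found in angles.items():
    print(label, len(found), found)
```

Each tetrahedral carbon contributes 4-choose-2 = 6 angles, three of which are H-C-H, which reproduces the counts listed for ethane.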

Next, we will look at the ClusterGraph for the set of atoms matching the angle type c1 ([*:1]~[#6X4:2]-[*:3]). ClusterGraph works by storing only the unique combinations of atom decorators. That means that even though we are using six sets of atoms, there is only one set of decorators for each atom in the SMIRKS pattern.

[6]:
c1_atoms = atom_index_list[0][1]
graph = ClusterGraph([mol], c1_atoms)
print(graph.as_smirks())
[#6AH3X4x0!r+0:1]-;!@[#6AH3X4x0!r+0:2]-;!@[#1AH0X1x0!r+0:3]
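
To see why six atom tuples collapse to one decorator set per position, here is a minimal sketch that uses Python sets. It is not how ClusterGraph is implemented internally; the `decorators` dictionary is a hand-written assumption that assigns each ethane atom the decorator string seen in the output above, and bonds are ignored.

```python
# Assumed decorator string for each ethane atom (carbons 0-1, hydrogens 2-7).
decorators = {0: "#6AH3X4x0!r+0", 1: "#6AH3X4x0!r+0"}
decorators.update({h: "#1AH0X1x0!r+0" for h in range(2, 8)})

# The six c1 tuples printed earlier for ethane.
c1_tuples = [(1, 0, 3), (0, 1, 7), (0, 1, 6), (1, 0, 4), (1, 0, 2), (0, 1, 5)]

# One set per SMIRKS position; a set keeps only unique decorator strings.
unique = [set(), set(), set()]
for atoms in c1_tuples:
    for position, atom_index in enumerate(atoms):
        unique[position].add(decorators[atom_index])

atom_smirks = [
    "[" + ",".join(sorted(options)) + ":" + str(position + 1) + "]"
    for position, options in enumerate(unique)
]
print(atom_smirks)
```

With more varied molecules each set would hold several strings, which is where the comma-separated OR lists in later sections come from.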

Adding Layers

Similar to the ChemPerGraphs described in the single_mol_smirks example, we can add atoms outside those indexed in ClusterGraph. This is done with the keyword layers. The specified number of layers is the number of bonds away from an indexed atom that should be included in the SMIRKS. As with ChemPerGraphs, you can also pass layers="all" to include all atoms in a molecule in the SMIRKS pattern. For ethane, this would result in the same SMIRKS as specifying 1 layer:

[7]:
print("layers = 1")
graph = ClusterGraph([mol], c1_atoms, layers=1)
print(graph.as_smirks())
print('-'*80)
print("layers='all'")
graph = ClusterGraph([mol], c1_atoms, layers='all')
print(graph.as_smirks())
layers = 1
[#6AH3X4x0!r+0:1](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#6AH3X4x0!r+0:2](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0:3]
--------------------------------------------------------------------------------
layers='all'
[#6AH3X4x0!r+0:1](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#6AH3X4x0!r+0:2](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0:3]
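
The idea of a layer as "atoms within n bonds of an indexed atom" can be sketched as a breadth-first expansion over the molecular graph. This is an illustrative assumption about the concept, not ChemPer's code: `atoms_within_layers` is a hypothetical helper and the ethane adjacency is written by hand.

```python
def atoms_within_layers(neighbors, indexed, layers):
    """Return the atoms reachable within `layers` bonds of any indexed atom.

    `layers` is a non-negative integer or the string 'all'.
    """
    selected = set(indexed)
    frontier = set(indexed)
    depth = 0
    while frontier and (layers == "all" or depth < layers):
        # Step one bond outward, keeping only atoms not seen before.
        frontier = {n for atom in frontier for n in neighbors[atom]} - selected
        selected |= frontier
        depth += 1
    return selected


# Hand-written ethane adjacency (carbons 0-1, hydrogens 2-7).
ethane = {0: [1, 2, 3, 4], 1: [0, 5, 6, 7],
          2: [0], 3: [0], 4: [0], 5: [1], 6: [1], 7: [1]}

for layers in (0, 1, "all"):
    print(layers, sorted(atoms_within_layers(ethane, (1, 0, 3), layers)))
```

Because every atom in ethane is at most one bond from an indexed carbon, one layer already reaches the whole molecule, which is consistent with the two identical SMIRKS above.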

Multiple molecules

Now that you have the general idea, let's consider a more complex case. Let's create a ClusterGraph for both labels in the smirks_list from above for the hydrocarbons listed below.

First we need to create the molecules and use create_tuples_for_clusters to group the angles by category.

[8]:
smiles = ['CC', 'CCC', 'C1CC1', 'CCCC', 'CC(C)C', 'C1CCC1', 'CCCCC']
mols = [mol_toolkit.MolFromSmiles(s) for s in smiles]
atom_index_list = create_tuples_for_clusters(smirks_list, mols)
for label, mol_list in atom_index_list:
    print(label)
    for mol_idx, atom_list in enumerate(mol_list):
        print('\tmolecule ', mol_idx)
        for atoms in atom_list:
            print('\t\t', atoms)
c1
        molecule  0
                 (1, 0, 3)
                 (0, 1, 7)
                 (0, 1, 6)
                 (1, 0, 4)
                 (1, 0, 2)
                 (0, 1, 5)
        molecule  1
                 (1, 0, 3)
                 (1, 2, 8)
                 (1, 0, 5)
                 (1, 0, 4)
                 (2, 1, 7)
                 (0, 1, 2)
                 (1, 2, 9)
                 (0, 1, 7)
                 (1, 2, 10)
                 (0, 1, 6)
                 (2, 1, 6)
        molecule  2
                 (2, 0, 4)
                 (1, 2, 8)
                 (2, 0, 3)
                 (1, 0, 3)
                 (0, 2, 8)
                 (1, 2, 7)
                 (1, 0, 2)
                 (2, 1, 5)
                 (0, 2, 7)
                 (0, 1, 2)
                 (0, 1, 6)
                 (0, 2, 1)
                 (2, 1, 6)
                 (0, 1, 5)
                 (1, 0, 4)
        molecule  3
                 (2, 1, 7)
                 (0, 1, 8)
                 (0, 1, 7)
                 (0, 1, 2)
                 (1, 2, 9)
                 (2, 3, 12)
                 (1, 2, 3)
                 (1, 2, 10)
                 (1, 0, 6)
                 (1, 0, 4)
                 (3, 2, 10)
                 (1, 0, 5)
                 (2, 1, 8)
                 (2, 3, 11)
                 (2, 3, 13)
                 (3, 2, 9)
        molecule  4
                 (2, 1, 7)
                 (1, 2, 8)
                 (0, 1, 7)
                 (0, 1, 2)
                 (1, 2, 9)
                 (0, 1, 3)
                 (1, 2, 10)
                 (1, 0, 6)
                 (3, 1, 7)
                 (2, 1, 3)
                 (1, 0, 4)
                 (1, 3, 13)
                 (1, 0, 5)
                 (1, 3, 12)
                 (1, 3, 11)
        molecule  5
                 (1, 0, 3)
                 (1, 2, 8)
                 (0, 1, 7)
                 (2, 1, 7)
                 (0, 1, 2)
                 (1, 2, 9)
                 (3, 0, 4)
                 (1, 2, 3)
                 (3, 0, 5)
                 (1, 0, 4)
                 (2, 1, 6)
                 (0, 1, 6)
                 (1, 0, 5)
                 (2, 3, 10)
                 (2, 3, 11)
                 (0, 3, 11)
                 (3, 2, 8)
                 (0, 3, 2)
                 (0, 3, 10)
                 (3, 2, 9)
        molecule  6
                 (0, 1, 8)
                 (0, 1, 2)
                 (0, 1, 9)
                 (2, 3, 12)
                 (1, 2, 3)
                 (1, 2, 10)
                 (4, 3, 13)
                 (1, 0, 6)
                 (1, 2, 11)
                 (1, 0, 7)
                 (3, 4, 16)
                 (3, 2, 10)
                 (1, 0, 5)
                 (2, 1, 8)
                 (3, 2, 11)
                 (2, 1, 9)
                 (2, 3, 13)
                 (3, 4, 14)
                 (2, 3, 4)
                 (4, 3, 12)
                 (3, 4, 15)
c2
        molecule  0
                 (5, 1, 7)
                 (5, 1, 6)
                 (6, 1, 7)
                 (3, 0, 4)
                 (2, 0, 4)
                 (2, 0, 3)
        molecule  1
                 (8, 2, 9)
                 (6, 1, 7)
                 (3, 0, 4)
                 (3, 0, 5)
                 (9, 2, 10)
                 (4, 0, 5)
                 (8, 2, 10)
        molecule  2
                 (5, 1, 6)
                 (3, 0, 4)
                 (7, 2, 8)
        molecule  3
                 (11, 3, 13)
                 (11, 3, 12)
                 (9, 2, 10)
                 (7, 1, 8)
                 (5, 0, 6)
                 (4, 0, 6)
                 (12, 3, 13)
                 (4, 0, 5)
        molecule  4
                 (11, 3, 13)
                 (11, 3, 12)
                 (9, 2, 10)
                 (12, 3, 13)
                 (8, 2, 9)
                 (5, 0, 6)
                 (4, 0, 6)
                 (4, 0, 5)
                 (8, 2, 10)
        molecule  5
                 (6, 1, 7)
                 (8, 2, 9)
                 (10, 3, 11)
                 (4, 0, 5)
        molecule  6
                 (8, 1, 9)
                 (12, 3, 13)
                 (14, 4, 15)
                 (5, 0, 6)
                 (15, 4, 16)
                 (6, 0, 7)
                 (14, 4, 16)
                 (5, 0, 7)
                 (10, 2, 11)
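
As a quick sanity check on the listing above, the number of angles of each type can be predicted from the hydrogen count on each carbon. This sketch assumes every carbon is tetrahedral (so it has 6 angles in total) and uses hydrogen counts written out by hand for each SMILES; `hydrogens_per_carbon` is not produced by ChemPer.

```python
from math import comb

# Hydrogens on each carbon, in SMILES order (hand-written assumption).
hydrogens_per_carbon = {
    "CC": [3, 3],
    "CCC": [3, 2, 3],
    "C1CC1": [2, 2, 2],
    "CCCC": [3, 2, 2, 3],
    "CC(C)C": [3, 1, 3, 3],
    "C1CCC1": [2, 2, 2, 2],
    "CCCCC": [3, 2, 2, 2, 3],
}

counts = {}
for smi, h_counts in hydrogens_per_carbon.items():
    # H-C-H angles: choose 2 of the hydrogens on each carbon.
    c2 = sum(comb(h, 2) for h in h_counts)
    # A tetrahedral center has comb(4, 2) = 6 angles; the rest are c1.
    c1 = 6 * len(h_counts) - c2
    counts[smi] = (c1, c2)
    print(smi, "c1:", c1, "c2:", c2)
```

The predicted counts can be compared against the number of tuples printed per molecule for each label.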

Now let's make a ClusterGraph object for both c1 and c2. In these patterns you will see lists of decorators on each atom. In the SMIRKS language ',' stands for ‘OR’. So in the case of "[#6AH1X4x0!r+0,#6AH2X4x0!r+0:1]" either decorator set ("#6AH1X4x0!r+0" or "#6AH2X4x0!r+0") could match atom :1.

[9]:
c1_graph = ClusterGraph(mols, atom_index_list[0][1])
print('c1\n'+'-'*50)
print(c1_graph.as_smirks())
c2_graph = ClusterGraph(mols, atom_index_list[1][1])
print()
print('c2\n'+'-'*50)
print(c2_graph.as_smirks())
c1
--------------------------------------------------
[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:1]-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2]-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:3]

c2
--------------------------------------------------
[#1AH0X1x0!r+0:1]-;!@[#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2]-;!@[#1AH0X1x0!r+0:3]

Identifying common decorators

You might notice that some SMIRKS decorators in each atom list are very similar. For example, all of our atoms are neutral so they all have the decorator "+0" to indicate a formal charge of zero.

We can take advantage of these commonalities and group decorators together using the SMIRKS ";" symbol for ANDing decorators. For example, in "[#6,#7;+0:1]" the atom is either carbon (#6) or (,) nitrogen (#7) and (;) it has a zero formal charge (+0).

In the ChemPer graph language you can group like decorators using the keyword compress. In that case we get these SMIRKS patterns for c1 and c2 instead:

[10]:
print('c1\n'+'-'*50)
print(c1_graph.as_smirks(compress=True))
print()
print('c2\n'+'-'*50)
print(c2_graph.as_smirks(compress=True))
c1
--------------------------------------------------
[*!rH1x0,*!rH2x0,*!rH3x0,*H2r3x2,*H2r4x2;#6;+0;A;X4:1]-[*!rH1x0,*!rH2x0,*!rH3x0,*H2r3x2,*H2r4x2;#6;+0;A;X4:2]-[#1!rH0X1x0,#6!rH2X4x0,#6!rH3X4x0,#6H2X4r3x2,#6H2X4r4x2;+0;A:3]

c2
--------------------------------------------------
[#1AH0X1x0!r+0:1]-;!@[*!rH2x0,*!rH3x0,*H2r3x2,*H2r4x2;#6;+0;A;X4:2]-;!@[#1AH0X1x0!r+0:3]
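
The grouping of shared decorators can be illustrated with plain set operations. The sketch below is a simplified toy, not ChemPer's compress implementation: the `TOKEN` regular expression only covers the decorator types that appear in this notebook, `compress_atom` is a hypothetical helper, and the ordering rules (atomic number first, then alphabetical) are assumptions chosen to resemble the output above. It is not meant to handle an atom with only a single decorator set.

```python
import re

# Splits strings such as '#6AH2X4x2r3+0' into '#6', 'A', 'H2', 'X4', 'x2', 'r3', '+0'.
TOKEN = re.compile(r"#\d+|!r|[AaHXxr]\d*|[+-]\d+")


def compress_atom(decorator_strings, map_index):
    """AND together decorators shared by every option, OR the remainders."""
    token_sets = [set(TOKEN.findall(s)) for s in decorator_strings]
    common = set.intersection(*token_sets)
    options = set()
    for tokens in token_sets:
        rest = sorted(tokens - common)
        element = [t for t in rest if t.startswith("#")]
        others = [t for t in rest if not t.startswith("#")]
        # Use '*' as the base when the atomic number is shared by all options.
        options.add((element[0] if element else "*") + "".join(others))
    parts = [",".join(sorted(options))] + sorted(common)
    return "[" + ";".join(parts) + ":" + str(map_index) + "]"


c2_center = ["#6AH2X4x0!r+0", "#6AH2X4x2r3+0", "#6AH2X4x2r4+0", "#6AH3X4x0!r+0"]
compressed = compress_atom(c2_center, 2)
print(compressed)
```

For the central atom of c2, the decorators #6, +0, A and X4 are shared by all four options, so they move behind ';' while the hydrogen count and ring decorators remain in the OR list.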

Adding layers

As shown above for a single molecule, we can also add layers to ClusterGraphs built from multiple molecules.

[11]:
for l in [1,2,3]:
    print('layers = ', l)
    c1_graph = ClusterGraph(mols, atom_index_list[0][1], layers=l)
    print('c1\n'+'-'*50)
    print(c1_graph.as_smirks())
    c2_graph = ClusterGraph(mols, atom_index_list[1][1], layers=l)
    print()
    print('c2\n'+'-'*50)
    print(c2_graph.as_smirks())
    print('\n', '='*80, '\n')
layers =  1
c1
--------------------------------------------------
[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:1](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0])-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:3](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0]

c2
--------------------------------------------------
[#1AH0X1x0!r+0:1]-;!@[#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0])(-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0])-;!@[#1AH0X1x0!r+0:3]

 ================================================================================

layers =  2
c1
--------------------------------------------------
[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:1](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:3](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0]

c2
--------------------------------------------------
[#1AH0X1x0!r+0:1]-;!@[#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-[#1AH0X1x0!r+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0])-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0:3]

 ================================================================================

layers =  3
c1
--------------------------------------------------
[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:1](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:3](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0]

c2
--------------------------------------------------
[#1AH0X1x0!r+0:1]-;!@[#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2](-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-[#1AH0X1x0!r+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH3X4x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0,#6AH3X4x0!r+0](-;!@[#1AH0X1x0!r+0])(-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0])-;!@[#1AH0X1x0!r+0:3]

 ================================================================================

Where do you go from here

As you see above, the ClusterGraph SMIRKS are significantly more complicated and specific than the input SMIRKS. For example, the input SMIRKS for c1 is [*:1]~[#6X4:2]-[*:3], however ClusterGraph creates this monstrosity:

[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:1]-[#6AH1X4x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:2]-[#1AH0X1x0!r+0,#6AH2X4x0!r+0,#6AH2X4x2r3+0,#6AH2X4x2r4+0,#6AH3X4x0!r+0:3]

Although this pattern becomes a bit less complex with the compression:

[*!rH1x0,*!rH2x0,*!rH3x0,*H2r3x2,*H2r4x2;#6;+0;A;X4:1]-[*!rH1x0,*!rH2x0,*!rH3x0,*H2r3x2,*H2r4x2;#6;+0;A;X4:2]-[#1!rH0X1x0,#6!rH2X4x0,#6!rH3X4x0,#6H2X4r3x2,#6H2X4r4x2;+0;A:3]

Our goal is to generate a hierarchical list of SMIRKS that could recover the same chemistry in a different list of molecules. In order to do this, we would want to generate the SMIRKS patterns for different clusters and then remove unnecessary decorators.

For this purpose, we created the SMIRKSifier. For details on this topic, see the notebook smirksifying_clusters in this example folder.
