rdkit

Open-source cheminformatics and machine learning toolkit for drug discovery, molecular manipulation, and chemical property calculation. RDKit handles SMILES, molecular fingerprints, substructure searching, 3D conformer generation, pharmacophore modeling, and QSAR. Use when working with chemical structures, drug-like properties, molecular similarity, virtual screening, or computational chemistry workflows.

What this skill does


# RDKit - Cheminformatics and Drug Discovery

RDKit is a widely used open-source toolkit for cheminformatics. It provides tools for molecular manipulation, descriptor calculation, fingerprinting, substructure searching, and 3D molecular modeling, and it is used extensively in pharmaceutical companies for drug discovery and virtual screening.

## When to Use

- Reading and writing chemical file formats (SMILES, SDF/MOL, PDB), and reading MOL2.
- Calculating molecular descriptors and drug-like properties (Lipinski's Rule of Five).
- Generating molecular fingerprints for similarity searching.
- Substructure searching and chemical pattern matching (SMARTS).
- 3D conformer generation and molecular alignment.
- Virtual screening of compound libraries.
- Pharmacophore modeling and shape similarity.
- QSAR (Quantitative Structure-Activity Relationship) modeling.
- Reaction enumeration and retrosynthesis.
- Visualizing chemical structures in 2D and 3D.
- Building machine learning models for molecular property prediction.
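
As an illustration of one item from the list above, substructure searching with SMARTS might look like the following sketch. The three-compound `library` dictionary and the carboxylic acid pattern are assumptions chosen for the example, not part of the original skill.

```python
from rdkit import Chem

# SMARTS for a carboxylic acid: trivalent carbon, carbonyl oxygen, hydroxyl oxygen.
pattern = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")

# Hypothetical mini library (name -> SMILES), for illustration only.
library = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
}

hits = {}
for name, smi in library.items():
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # skip unparseable entries
    if mol.HasSubstructMatch(pattern):
        # Each match is a tuple of atom indices, ordered like the SMARTS atoms.
        hits[name] = mol.GetSubstructMatches(pattern)

for name, matches in hits.items():
    print(name, matches)
```

The ester oxygen in aspirin carries no hydrogen, so only its free acid group should match this pattern.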

## Reference Documentation

**Official docs**: https://www.rdkit.org/docs/  
**RDKit Book**: https://www.rdkit.org/docs/RDKit_Book.html  
**GitHub**: https://github.com/rdkit/rdkit  
**Search patterns**: `rdkit.Chem`, `rdkit.Chem.Descriptors`, `rdkit.Chem.AllChem`, `rdkit.DataStructs`

## Core Principles

### Molecular Representation
RDKit represents molecules as graphs where atoms are nodes and bonds are edges. The core object is `Mol`, which can be created from SMILES, SDF files, or built programmatically.
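
A minimal sketch of walking that graph, using ethanol as an arbitrary example molecule:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol, chosen only for illustration

# Atoms are the nodes of the molecular graph.
for atom in mol.GetAtoms():
    neighbors = [n.GetIdx() for n in atom.GetNeighbors()]
    print(atom.GetIdx(), atom.GetSymbol(), "degree:", atom.GetDegree(), "neighbors:", neighbors)

# Bonds are the edges; each one connects two atom indices.
edges = [
    (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
    for b in mol.GetBonds()
]
print(edges)
```

Hydrogens are implicit here, so the graph contains only the three heavy atoms.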

### SMILES (Simplified Molecular Input Line Entry System)
A text-based notation for chemical structures. Example: `CCO` is ethanol, `c1ccccc1` is benzene. RDKit can parse and generate SMILES strings.
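
Because one structure can be written as many different SMILES strings, a round trip through `Mol` is a simple way to normalize them. The sketch below uses a few hand-picked spellings of ethanol and benzene:

```python
from rdkit import Chem

# Three spellings of ethanol and two of benzene (aromatic and Kekulé forms).
ethanol_variants = ["CCO", "OCC", "C(O)C"]
benzene_variants = ["c1ccccc1", "C1=CC=CC=C1"]

ethanol_canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in ethanol_variants}
benzene_canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in benzene_variants}

print(ethanol_canonical)  # {'CCO'}
print(benzene_canonical)  # {'c1ccccc1'}
```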

### Fingerprints for Similarity
Molecular fingerprints are fixed-length vectors (most often bit vectors, with count-based variants also available) that encode structural features. They enable fast similarity searching and clustering of large compound libraries.
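
A minimal sketch of fingerprint-based similarity, assuming a Morgan fingerprint with radius 2 and 2048 bits (common defaults, not values prescribed by this document) and a recent RDKit release that provides `rdFingerprintGenerator`:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Morgan (circular) fingerprint generator; radius and size are illustrative choices.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
salicylic_acid = Chem.MolFromSmiles("OC1=CC=CC=C1C(=O)O")

fp_aspirin = generator.GetFingerprint(aspirin)
fp_salicylic = generator.GetFingerprint(salicylic_acid)

# Tanimoto similarity: shared on-bits divided by the union of on-bits.
similarity = DataStructs.TanimotoSimilarity(fp_aspirin, fp_salicylic)
self_similarity = DataStructs.TanimotoSimilarity(fp_aspirin, fp_aspirin)
print(f"aspirin vs salicylic acid: {similarity:.2f}")
```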

### Lazy Evaluation
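Some RDKit operations are lazy: file suppliers such as `SDMolSupplier` parse a record only when it is accessed, and certain computed properties are cached on the molecule after first use. This helps keep memory use manageable when working with large libraries.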
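
One place where on-demand behavior is visible is the SDF supplier. The sketch below writes a small temporary file (the file name and contents are made up for the example) and then pulls out a single record by index; as far as the documented behavior goes, records are parsed when accessed rather than all at once.

```python
import os
import tempfile

from rdkit import Chem

# Write a tiny three-record SDF to a temporary directory (illustrative data only).
sdf_path = os.path.join(tempfile.mkdtemp(), "demo.sdf")
writer = Chem.SDWriter(sdf_path)
for smi in ["CCO", "c1ccccc1", "CC(=O)O"]:
    writer.write(Chem.MolFromSmiles(smi))
writer.close()

# The supplier supports len() and random access by record index.
suppl = Chem.SDMolSupplier(sdf_path)
n_records = len(suppl)

# Fetch only the second record instead of iterating over the whole file.
second = suppl[1]
second_smiles = Chem.MolToSmiles(second)
print(n_records, second_smiles)
```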

## Quick Reference

### Installation

```bash
# Via conda (recommended)
conda install -c conda-forge rdkit

# Via pip
pip install rdkit

# For visualization
pip install rdkit pillow
```

### Standard Imports

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Draw, Lipinski
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import numpy as np
import pandas as pd
```

### Basic Pattern - SMILES to Molecule

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# 1. Create molecule from SMILES
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # Aspirin
mol = Chem.MolFromSmiles(smiles)

# 2. Check if molecule is valid
if mol is None:
    print("Invalid SMILES")
else:
    print(f"Molecular formula: {rdMolDescriptors.CalcMolFormula(mol)}")
    print(f"Molecular weight: {Descriptors.MolWt(mol):.2f}")
```

### Basic Pattern - Calculate Properties

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")

# Calculate drug-like properties
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Lipinski.NumHDonors(mol)
hba = Lipinski.NumHAcceptors(mol)

print(f"MW: {mw:.2f}, LogP: {logp:.2f}, HBD: {hbd}, HBA: {hba}")

# Check Lipinski's Rule of Five
lipinski_pass = (mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10)
print(f"Lipinski compliant: {lipinski_pass}")
```

## Critical Rules

### ✅ DO

- **Always Validate Molecules** - Check `mol is not None` after parsing SMILES/files to catch invalid structures.
- **Use Canonical SMILES** - Use `Chem.MolToSmiles(mol)` to get canonical (standardized) SMILES for comparison.
- **Sanitize Molecules** - RDKit auto-sanitizes by default (valence checking, aromaticity). Keep it enabled unless you have a specific reason.
- **Generate 3D Coordinates** - Use `AllChem.EmbedMolecule()` before 3D operations like alignment or docking.
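- **Use Fingerprints for Large Libraries** - For similarity searching across millions of compounds, precompute fingerprints; comparing bit vectors is far cheaper than graph-based comparison such as substructure or maximum-common-substructure matching.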
- **Specify Random Seeds** - For reproducible conformer generation, always set `randomSeed`.
- **Handle Stereochemistry** - Use `Chem.AssignStereochemistry()` to properly assign R/S and E/Z labels.
- **Batch Processing** - Use generators or chunking for processing millions of molecules to avoid memory issues.
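
Two of the rules above, fixed random seeds and explicit stereochemistry assignment, can be sketched as follows. The helper name `embed_coords`, the seed value, and the example molecules are assumptions made for illustration.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem


def embed_coords(smiles, seed):
    """Hypothetical helper: embed one conformer with a fixed seed, return its coordinates."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    conf_id = AllChem.EmbedMolecule(mol, randomSeed=seed)
    if conf_id == -1:
        raise ValueError(f"Embedding failed for {smiles}")
    return mol.GetConformer(conf_id).GetPositions()


# Same seed, same input: the two embeddings should coincide.
coords_a = embed_coords("CCO", seed=42)
coords_b = embed_coords("CCO", seed=42)
reproducible = bool(np.allclose(coords_a, coords_b))
print("Reproducible embedding:", reproducible)

# Assign CIP labels to a molecule with one stereocenter (an alanine enantiomer).
ala = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
Chem.AssignStereochemistry(ala, cleanIt=True, force=True)
cip_labels = {
    atom.GetIdx(): atom.GetProp("_CIPCode")
    for atom in ala.GetAtoms()
    if atom.HasProp("_CIPCode")
}
print("CIP labels by atom index:", cip_labels)
```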

### ❌ DON'T

- **Don't Ignore Invalid Molecules** - Always handle the case when `MolFromSmiles()` returns `None`.
- **Don't Compare SMILES Strings Directly** - Two different SMILES can represent the same molecule. Use canonical SMILES or InChI.
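- **Don't Skip Kekulization** - Default sanitization kekulizes the molecule and perceives aromaticity for you; if you parse with `sanitize=False`, call `Chem.SanitizeMol()` before relying on aromaticity or bond orders.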
- **Don't Use Descriptors for Similarity** - Use fingerprints (much faster and more appropriate).
- **Don't Forget Hydrogens** - Add explicit hydrogens with `Chem.AddHs()` when needed for 3D operations.
- **Don't Overuse 3D Minimization** - Energy minimization is slow; only use when necessary (docking, visualization).
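
The hydrogen and minimization points above can be sketched together. MMFF is used here as one possible force field and the iteration cap is an arbitrary choice; neither is prescribed by this document.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")
heavy_atom_count = mol.GetNumAtoms()  # hydrogens are implicit after SMILES parsing

# Add explicit hydrogens before any 3D work.
mol_h = Chem.AddHs(mol)
all_atom_count = mol_h.GetNumAtoms()

conf_id = AllChem.EmbedMolecule(mol_h, randomSeed=42)
if conf_id == -1:
    raise RuntimeError("Embedding failed")

# Optional and comparatively slow: run only when refined geometry is needed.
# Returns 0 if converged, 1 if more iterations are needed, -1 if setup failed.
status = AllChem.MMFFOptimizeMolecule(mol_h, maxIters=200)

# Strip hydrogens again if downstream code expects heavy atoms only.
mol_3d = Chem.RemoveHs(mol_h)
print(heavy_atom_count, all_atom_count, status, mol_3d.GetNumConformers())
```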

## Anti-Patterns (NEVER)

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# ❌ BAD: Not checking if molecule is valid
smiles = "INVALID_SMILES"
mol = Chem.MolFromSmiles(smiles)  # Returns None
mw = Descriptors.MolWt(mol)  # Raises an exception: mol is None

# ✅ GOOD: Always validate
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    mw = Descriptors.MolWt(mol)
else:
    print("Invalid SMILES")

# ❌ BAD: Comparing SMILES strings directly
smiles1 = "CC(C)C"  # isobutane
smiles2 = "C(C)(C)C"  # same molecule, different SMILES
if smiles1 == smiles2:  # False, but same molecule!
    print("Same")

# ✅ GOOD: Use canonical SMILES
mol1 = Chem.MolFromSmiles(smiles1)
mol2 = Chem.MolFromSmiles(smiles2)
can1 = Chem.MolToSmiles(mol1)
can2 = Chem.MolToSmiles(mol2)
if can1 == can2:  # True
    print("Same molecule")

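# ❌ BAD: 3D operations without 3D coordinates
ref_mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMolecule(ref_mol, randomSeed=42)  # reference conformer
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.AlignMol(mol, ref_mol)  # Fails! mol has no conformer

# ✅ GOOD: Generate 3D coordinates first
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMolecule(mol, randomSeed=7)
rmsd = AllChem.AlignMol(mol, ref_mol)  # atoms are matched by index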
```

## Molecular I/O and Conversion

### SMILES Parsing

```python
from rdkit import Chem

# Parse SMILES
mol = Chem.MolFromSmiles("CCO")

# Parse SMILES with sanitization control
mol = Chem.MolFromSmiles("CCO", sanitize=True)  # Default

# Generate canonical SMILES
canonical = Chem.MolToSmiles(mol)

# Generate isomeric SMILES (includes stereochemistry; the default in recent releases)
iso_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)

# Generate SMILES without stereochemistry
non_iso = Chem.MolToSmiles(mol, isomericSmiles=False)

# Handle invalid SMILES
smiles_list = ["CCO", "INVALID", "c1ccccc1"]
mols = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        mols.append(mol)
    else:
        print(f"Failed to parse: {smi}")
```

### Reading SDF Files

```python
from rdkit import Chem

# Read single molecule from file
mol = Chem.MolFromMolFile("molecule.mol")

# Read multiple molecules from SDF
suppl = Chem.SDMolSupplier("compounds.sdf")

# Iterate through molecules
for mol in suppl:
    if mol is None:
        continue
    
    smiles = Chem.MolToSmiles(mol)
    print(f"SMILES: {smiles}")
    
    # Access the record title ("_Name"); SDF data fields are read the same way
    if mol.HasProp("_Name"):
        name = mol.GetProp("_Name")
        print(f"Name: {name}")

# Read with removeHs=False to keep explicit hydrogens
suppl = Chem.SDMolSupplier("compounds.sdf", removeHs=False)
```

### Writing SDF Files

```python
from rdkit import Chem

# Write single molecule
mol = Chem.MolFromSmiles("CCO")
writer = Chem.SDWriter("output.sdf")
writer.write(mol)
writer.close()

# Write multiple molecules
mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
writer = Chem.SDWriter("output.sdf")
for mol in mols:
    if mol is not None:
        writer.write(mol)
writer.close()

# Add properties to molecules
mol = Chem.MolFromSmiles("CCO")
```