rdkit

Open-source cheminformatics and machine learning toolkit for drug discovery, molecular manipulation, and chemical property calculation. RDKit handles SMILES, molecular fingerprints, substructure searching, 3D conformer generation, pharmacophore modeling, and QSAR. Use when working with chemical structures, drug-like properties, molecular similarity, virtual screening, or computational chemistry workflows.


# RDKit - Cheminformatics and Drug Discovery

RDKit is the industry-standard open-source toolkit for cheminformatics. It provides comprehensive tools for molecular manipulation, descriptor calculation, fingerprinting, substructure searching, and 3D molecular modeling. RDKit is used extensively in pharmaceutical companies for drug discovery and virtual screening.

## When to Use

- Reading and writing chemical file formats (SMILES, SDF/MOL, PDB; MOL2 is read-only).
- Calculating molecular descriptors and drug-like properties (Lipinski's Rule of Five).
- Generating molecular fingerprints for similarity searching.
- Substructure searching and chemical pattern matching (SMARTS).
- 3D conformer generation and molecular alignment.
- Virtual screening of compound libraries.
- Pharmacophore modeling and shape similarity.
- QSAR (Quantitative Structure-Activity Relationship) modeling.
- Reaction enumeration and retrosynthesis.
- Visualizing chemical structures in 2D and 3D.
- Building machine learning models for molecular property prediction.

## Reference Documentation

**Official docs**: https://www.rdkit.org/docs/  
**RDKit Book**: https://www.rdkit.org/docs/RDKit_Book.html  
**GitHub**: https://github.com/rdkit/rdkit  
**Search patterns**: `rdkit.Chem`, `rdkit.Chem.Descriptors`, `rdkit.Chem.AllChem`, `rdkit.DataStructs`

## Core Principles

### Molecular Representation
RDKit represents molecules as graphs where atoms are nodes and bonds are edges. The core object is `Mol`, which can be created from SMILES, SDF files, or built programmatically.
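
The graph view can be made concrete by walking a molecule's atoms and bonds. This is a minimal sketch, assuming ethanol as the example molecule; the variable names `atoms` and `bonds` are illustrative only.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol; hydrogens are implicit

# Nodes: one entry per heavy atom (index, element symbol, number of neighbors)
atoms = [(a.GetIdx(), a.GetSymbol(), a.GetDegree()) for a in mol.GetAtoms()]

# Edges: one entry per bond (begin atom index, end atom index, bond type)
bonds = [
    (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondType())
    for b in mol.GetBonds()
]

for idx, symbol, degree in atoms:
    print(f"atom {idx}: {symbol}, degree {degree}")
for begin, end, bond_type in bonds:
    print(f"bond {begin}-{end}: {bond_type}")
```

Atom indices follow the order in which atoms appear in the input SMILES, so they are stable for a given input string but not across different SMILES for the same molecule.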

### SMILES (Simplified Molecular Input Line Entry System)
A text-based notation for chemical structures. Example: `CCO` is ethanol, `c1ccccc1` is benzene. RDKit can parse and generate SMILES strings.

### Fingerprints for Similarity
Molecular fingerprints are fixed-length vectors, most often bit vectors, that encode structural features. They enable fast similarity searching and clustering of large compound libraries.
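
As a rough illustration, two molecules can be compared by Tanimoto similarity of their Morgan fingerprints. This sketch assumes a radius of 2 and 2048 bits, which are common choices rather than values prescribed by this guide; salicylic acid is an arbitrary comparison molecule.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Assumed settings: Morgan radius 2, folded to 2048 bits
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
salicylic_acid = Chem.MolFromSmiles("OC(=O)C1=CC=CC=C1O")

fp_aspirin = gen.GetFingerprint(aspirin)
fp_salicylic = gen.GetFingerprint(salicylic_acid)

# Tanimoto = shared on-bits / total distinct on-bits, in the range [0, 1]
similarity = DataStructs.TanimotoSimilarity(fp_aspirin, fp_salicylic)
self_similarity = DataStructs.TanimotoSimilarity(fp_aspirin, fp_aspirin)

print(f"aspirin vs salicylic acid: {similarity:.2f}")
print(f"aspirin vs itself: {self_similarity:.2f}")
```

Because the fingerprints are computed once and then compared with cheap bit operations, the same pattern scales to screening a query against a precomputed library.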

### Lazy Evaluation
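Several RDKit components work lazily: molecule suppliers such as `SDMolSupplier` parse a record only when it is requested rather than loading the whole file up front. Iterating over a supplier, instead of collecting every molecule into a list, keeps memory use bounded on large libraries.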
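
The same idea can be applied in user code with Python generators. The sketch below is one possible pattern, not an RDKit API: `iter_mols` and `chunked` are hypothetical helper names, and the in-memory `smiles_source` list stands in for a large file or database cursor.

```python
from itertools import islice

from rdkit import Chem


def iter_mols(smiles_iter):
    """Lazily yield (smiles, mol) pairs, skipping unparsable SMILES."""
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            yield smi, mol


def chunked(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Stand-in for a large source; only one chunk is held in memory at a time
smiles_source = ["CCO", "c1ccccc1", "not_a_smiles", "CC(=O)O", "CCN"]

chunk_sizes = []
for chunk in chunked(iter_mols(smiles_source), size=2):
    chunk_sizes.append(len(chunk))
    print([smi for smi, _ in chunk])
```

Each molecule is parsed only when its chunk is requested, so peak memory depends on the chunk size rather than on the size of the library.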

## Quick Reference

### Installation

```bash
# Via conda (recommended)
conda install -c conda-forge rdkit

# Via pip
pip install rdkit

# For visualization
pip install rdkit pillow
```

### Standard Imports

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Draw, Lipinski
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import numpy as np
import pandas as pd
```

### Basic Pattern - SMILES to Molecule

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# 1. Create molecule from SMILES
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # Aspirin
mol = Chem.MolFromSmiles(smiles)

# 2. Check if molecule is valid (MolFromSmiles returns None on failure)
if mol is None:
    print("Invalid SMILES")
else:
    print(f"Molecular formula: {rdMolDescriptors.CalcMolFormula(mol)}")
    print(f"Molecular weight: {Descriptors.MolWt(mol):.2f}")
```

### Basic Pattern - Calculate Properties

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")

# Calculate drug-like properties
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Lipinski.NumHDonors(mol)
hba = Lipinski.NumHAcceptors(mol)

print(f"MW: {mw:.2f}, LogP: {logp:.2f}, HBD: {hbd}, HBA: {hba}")

# Check Lipinski's Rule of Five
lipinski_pass = (mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10)
print(f"Lipinski compliant: {lipinski_pass}")
```

## Critical Rules

### ✅ DO

- **Always Validate Molecules** - Check `mol is not None` after parsing SMILES/files to catch invalid structures.
- **Use Canonical SMILES** - Use `Chem.MolToSmiles(mol)` to get canonical (standardized) SMILES for comparison.
- **Sanitize Molecules** - RDKit auto-sanitizes by default (valence checking, aromaticity). Keep it enabled unless you have a specific reason.
- **Generate 3D Coordinates** - Use `AllChem.EmbedMolecule()` before 3D operations like alignment or docking.
- **Use Fingerprints for Large Libraries** - For similarity searching across millions of compounds, comparing precomputed fingerprints is far faster than graph-based comparison such as substructure or maximum-common-substructure matching.
- **Specify Random Seeds** - For reproducible conformer generation, always set `randomSeed`.
- **Handle Stereochemistry** - Use `Chem.AssignStereochemistry()` to properly assign R/S and E/Z labels.
- **Batch Processing** - Use generators or chunking for processing millions of molecules to avoid memory issues.
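
The 3D and reproducibility rules above can be sketched together as follows. This is an illustrative example, assuming ethanol and an arbitrary seed of 42; `embed_3d` is a hypothetical helper name.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def embed_3d(smiles, seed=42):
    """Return a molecule with explicit hydrogens and one embedded conformer."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    mol = Chem.AddHs(mol)  # explicit hydrogens give more realistic geometry
    conf_id = AllChem.EmbedMolecule(mol, randomSeed=seed)
    if conf_id == -1:  # EmbedMolecule returns -1 when embedding fails
        raise RuntimeError(f"Embedding failed for: {smiles}")
    return mol


mol_a = embed_3d("CCO")
mol_b = embed_3d("CCO")

# With the same seed, both runs should produce the same coordinates
pos_a = mol_a.GetConformer().GetPositions()
pos_b = mol_b.GetConformer().GetPositions()
max_diff = abs(pos_a - pos_b).max()
print(f"atoms: {mol_a.GetNumAtoms()}, max coordinate difference: {max_diff:.6f}")
```

Without `randomSeed`, repeated runs may give different conformers, which makes downstream results such as alignment RMSD harder to reproduce.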

### ❌ DON'T

- **Don't Ignore Invalid Molecules** - Always handle the case when `MolFromSmiles()` returns `None`.
- **Don't Compare SMILES Strings Directly** - Two different SMILES can represent the same molecule. Use canonical SMILES or InChI.
- **Don't Skip Kekulization** - Kekulization runs as part of sanitization. If you parse with `sanitize=False`, call `Chem.SanitizeMol()` before relying on aromaticity or bond orders.
- **Don't Use Descriptors for Similarity** - Use fingerprints (much faster and more appropriate).
- **Don't Forget Hydrogens** - Add explicit hydrogens with `Chem.AddHs()` when needed for 3D operations.
- **Don't Overuse 3D Minimization** - Energy minimization is slow; only use when necessary (docking, visualization).
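
When sanitization has been turned off at parse time, it can be run explicitly and its outcome inspected. The sketch below assumes `catchErrors=True`, which makes `Chem.SanitizeMol()` return the failing step instead of raising; `parse_then_sanitize` is a hypothetical helper name, and `c1cccc1` is a deliberately un-kekulizable example.

```python
from rdkit import Chem


def parse_then_sanitize(smiles):
    """Parse without sanitization, then sanitize and report the failed step."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:  # syntax error in the SMILES itself
        return None, None
    # With catchErrors=True the first failing operation is returned;
    # SANITIZE_NONE means every step succeeded
    failed_step = Chem.SanitizeMol(mol, catchErrors=True)
    return mol, failed_step


_, benzene_status = parse_then_sanitize("c1ccccc1")
_, bad_ring_status = parse_then_sanitize("c1cccc1")  # five aromatic carbons

for name, status in [("benzene", benzene_status), ("c1cccc1", bad_ring_status)]:
    ok = status == Chem.SanitizeFlags.SANITIZE_NONE
    print(f"{name}: {'sanitized' if ok else f'failed at {status}'}")
```

Checking the returned flag keeps a batch job running past problem structures while still recording why each one was rejected.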

## Anti-Patterns (NEVER)

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# ❌ BAD: Not checking if molecule is valid
smiles = "INVALID_SMILES"
mol = Chem.MolFromSmiles(smiles)  # Returns None
mw = Descriptors.MolWt(mol)  # Raises an exception because mol is None

# ✅ GOOD: Always validate
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    mw = Descriptors.MolWt(mol)
else:
    print("Invalid SMILES")

# ❌ BAD: Comparing SMILES strings directly
smiles1 = "CC(C)C"  # isobutane
smiles2 = "C(C)(C)C"  # same molecule, different SMILES
if smiles1 == smiles2:  # False, but same molecule!
    print("Same")

# ✅ GOOD: Use canonical SMILES
mol1 = Chem.MolFromSmiles(smiles1)
mol2 = Chem.MolFromSmiles(smiles2)
can1 = Chem.MolToSmiles(mol1)
can2 = Chem.MolToSmiles(mol2)
if can1 == can2:  # True
    print("Same molecule")

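# ❌ BAD: 3D operations without 3D coordinates
# ref_mol is an example reference with an embedded conformer
ref_mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMolecule(ref_mol, randomSeed=42)
mol = Chem.MolFromSmiles("CCO")
AllChem.AlignMol(mol, ref_mol)  # Fails! mol has no conformer

# ✅ GOOD: Add hydrogens and generate 3D coordinates first
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMolecule(mol, randomSeed=42)
rmsd = AllChem.AlignMol(mol, ref_mol)  # Returns the RMSD of the alignment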
```

## Molecular I/O and Conversion

### SMILES Parsing

```python
from rdkit import Chem

# Parse SMILES
mol = Chem.MolFromSmiles("CCO")

# Parse SMILES with sanitization control
mol = Chem.MolFromSmiles("CCO", sanitize=True)  # Default

# Generate canonical SMILES
canonical = Chem.MolToSmiles(mol)

# Generate isomeric SMILES (includes stereochemistry)
iso_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)

# Generate SMILES without stereochemistry
non_iso = Chem.MolToSmiles(mol, isomericSmiles=False)

# Handle invalid SMILES
smiles_list = ["CCO", "INVALID", "c1ccccc1"]
mols = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        mols.append(mol)
    else:
        print(f"Failed to parse: {smi}")
```

### Reading SDF Files

```python
from rdkit import Chem

# Read single molecule from file
mol = Chem.MolFromMolFile("molecule.mol")

# Read multiple molecules from SDF
suppl = Chem.SDMolSupplier("compounds.sdf")

# Iterate through molecules
for mol in suppl:
    if mol is None:
        continue
    
    smiles = Chem.MolToSmiles(mol)
    print(f"SMILES: {smiles}")
    
    # Access the molecule title (stored as the special _Name property)
    if mol.HasProp("_Name"):
        name = mol.GetProp("_Name")
        print(f"Name: {name}")

# Read with removeHs=False to keep explicit hydrogens
suppl = Chem.SDMolSupplier("compounds.sdf", removeHs=False)
```

### Writing SDF Files

```python
from rdkit import Chem

# Write single molecule
mol = Chem.MolFromSmiles("CCO")
writer = Chem.SDWriter("output.sdf")
writer.write(mol)
writer.close()

# Write multiple molecules
mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
writer = Chem.SDWriter("output.sdf")
for mol in mols:
    if mol is not None:
        writer.write(mol)
writer.close()

# Add properties to molecules
mol = Chem.MolFromSmiles("CCO")
```

Source: https://github.com/tondevrel/scientific-agent-skills/tree/main/skills/rdkit