scikit-bio
Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis.
What this skill does
# scikit-bio
## Overview
scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.
## When to Use This Skill
This skill should be used when the user:
- Works with biological sequences (DNA, RNA, protein)
- Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
- Performs sequence alignments or searches for motifs
- Constructs or analyzes phylogenetic trees
- Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
- Performs ordination analysis (PCoA, CCA, RDA)
- Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
- Analyzes microbiome or community ecology data
- Works with protein embeddings from language models
- Needs to manipulate biological data tables
## Core Capabilities
### 1. Sequence Manipulation
Work with biological sequences using specialized classes for DNA, RNA, and protein data.
**Key operations:**
- Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
- Sequence slicing, concatenation, and searching
- Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
- Find motifs and patterns using regex
- Calculate distances (Hamming, k-mer based)
- Handle sequence quality scores and metadata
**Common patterns:**
```python
import skbio
# Read sequences from file
seq = skbio.DNA.read('input.fasta')
# Sequence operations
rc = seq.reverse_complement()
rna = seq.transcribe()
protein = rna.translate()
# Find motifs
motif_positions = seq.find_with_regex('ATG[ACGT]{3}')
# Check for properties
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()
```
**Important notes:**
- Use `DNA`, `RNA`, `Protein` classes for grammared sequences with validation
- Use `Sequence` class for generic sequences without alphabet restrictions
- Quality scores automatically loaded from FASTQ files into positional metadata
- Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)
### 2. Sequence Alignment
Perform pairwise and multiple sequence alignments using dynamic programming algorithms.
**Key capabilities:**
- Global alignment (Needleman-Wunsch with semi-global variant)
- Local alignment (Smith-Waterman)
- Configurable scoring schemes (match/mismatch, gap penalties, substitution matrices)
- CIGAR string conversion
- Multiple sequence alignment storage and manipulation with `TabularMSA`
**Common patterns:**
```python
from skbio.alignment import local_pairwise_align_ssw, TabularMSA
# Pairwise alignment
alignment = local_pairwise_align_ssw(seq1, seq2)
# Access aligned sequences
msa = alignment.aligned_sequences
# Read multiple alignment from file
msa = TabularMSA.read('alignment.fasta', constructor=skbio.DNA)
# Calculate consensus
consensus = msa.consensus()
```
**Important notes:**
- Use `local_pairwise_align_ssw` for local alignments (faster, SSW-based)
- Use `StripedSmithWaterman` for protein alignments
- Affine gap penalties recommended for biological sequences
- Can convert between scikit-bio, BioPython, and Biotite alignment formats
### 3. Phylogenetic Trees
Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.
**Key capabilities:**
- Tree construction from distance matrices (UPGMA, WPGMA, Neighbor Joining, GME, BME)
- Tree manipulation (pruning, rerooting, traversal)
- Distance calculations (patristic, cophenetic, Robinson-Foulds)
- ASCII visualization
- Newick format I/O
**Common patterns:**
```python
from skbio import TreeNode
from skbio.tree import nj
# Read tree from file
tree = TreeNode.read('tree.nwk')
# Construct tree from distance matrix
tree = nj(distance_matrix)
# Tree operations
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = [node for node in tree.tips()]
lca = tree.lowest_common_ancestor(['taxon1', 'taxon2'])
# Calculate distances
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
cophenetic_matrix = tree.cophenetic_matrix()
# Compare trees
rf_distance = tree.robinson_foulds(other_tree)
```
**Important notes:**
- Use `nj()` for neighbor joining (classic phylogenetic method)
- Use `upgma()` for UPGMA (assumes molecular clock)
- GME and BME are highly scalable for large trees
- Trees can be rooted or unrooted; some metrics require specific rooting
### 4. Diversity Analysis
Calculate alpha and beta diversity metrics for microbial ecology and community analysis.
**Key capabilities:**
- Alpha diversity: richness, Shannon entropy, Simpson index, Faith's PD, Pielou's evenness
- Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
- Phylogenetic diversity metrics (require tree input)
- Rarefaction and subsampling
- Integration with ordination and statistical tests
**Common patterns:**
```python
from skbio.diversity import alpha_diversity, beta_diversity
import skbio
# Alpha diversity
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
tree=tree, otu_ids=feature_ids)
# Beta diversity
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
ids=sample_ids, tree=tree, otu_ids=feature_ids)
# Get available metrics
from skbio.diversity import get_alpha_diversity_metrics
print(get_alpha_diversity_metrics())
```
**Important notes:**
- Counts must be integers representing abundances, not relative frequencies
- Phylogenetic metrics (Faith's PD, UniFrac) require tree and OTU ID mapping
- Use `partial_beta_diversity()` for computing specific sample pairs only
- Alpha diversity returns Series, beta diversity returns DistanceMatrix
### 5. Ordination Methods
Reduce high-dimensional biological data to visualizable lower-dimensional spaces.
**Key capabilities:**
- PCoA (Principal Coordinate Analysis) from distance matrices
- CA (Correspondence Analysis) for contingency tables
- CCA (Canonical Correspondence Analysis) with environmental constraints
- RDA (Redundancy Analysis) for linear relationships
- Biplot projection for feature interpretation
**Common patterns:**
```python
from skbio.stats.ordination import pcoa, cca
# PCoA from distance matrix
pcoa_results = pcoa(distance_matrix)
pc1 = pcoa_results.samples['PC1']
pc2 = pcoa_results.samples['PC2']
# CCA with environmental variables
cca_results = cca(species_matrix, environmental_matrix)
# Save/load ordination results
pcoa_results.write('ordination.txt')
results = skbio.OrdinationResults.read('ordination.txt')
```
**Important notes:**
- PCoA works with any distance/dissimilarity matrix
- CCA reveals environmental drivers of community composition
- Ordination results include eigenvalues, proportion explained, and sample/feature coordinates
- Results integrate with plotting libraries (matplotlib, seaborn, plotly)
### 6. Statistical Testing
Perform hypothesis tests specific to ecological and biological data.
**Key capabilities:**
- PERMANOVA: test group differences using distance matrices
- ANOSIM: alternative test for group differences
- PERMDISP: test homogeneity of group dispersions
- Mantel test: correlation between distance matrices
- Bioenv: find environmental variables correlated with distances
**Common patterns:**
```python
from skbio.stats.distance import permanova, anosim, mantel
# Test if groups differ significantly
permanova_results = permanova(distance_matrix, grouping, permutations=999)
print(f"p-value: {permanova_results['p-value']}")
# ANOSIM test
anosim_results = anosim(distance_matrix, grouping, permutations=999)
# Mantel test between two distance matrices
mantel_results = mantel(dm1, dm2, method='pearson', permutations=999)
print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}")
```
**Important notes:**Related in General
modeling-omnistudio-epc-catalog
IncludedSalesforce Industries CME EPC product-modeling skill for Product2-based catalog creation. Use when creating EPC products, configuring product attributes, building offer bundles with Product Child Items, or reviewing EPC DataPack JSON metadata for product catalog changes. TRIGGER when: user creates or updates Product2 EPC records, AttributeAssignment payloads, AttributeMetadata/AttributeDefaultValues, Offer bundles, or ProductChildItem relationships. DO NOT TRIGGER when: designing OmniScripts/FlexCards/Integration Procedures (use building-omnistudio-omniscript, building-omnistudio-flexcard, or building-omnistudio-integration-procedure), implementing Apex business logic (use generating-apex), or troubleshooting deployment pipelines (use deploying-metadata).
relationship-science-coach
IncludedUse this skill for direct, practical adult relationship coaching: couples conflict, repair, trust, marriage, dating, flirting, attachment patterns, emotional connection, sex, desire differences, eroticism, kink negotiation, affection, love languages, breakups, and long-term passion. Draw on Gottman, EFT and Hold Me Tight, attachment science, modern sex research, Perel, Nagoski, Kerner, Schnarch, Love and Stosny, and flexible love-language tools. Be concrete and low-hedge. Redirect only for imminent danger, abuse, coercive control, minors, non-consent, self-harm, stalking, or medical/legal/psychiatric decisions.
building-sf-integrations
IncludedSalesforce integration architecture and runtime plumbing with 120-point scoring. Use this skill to set up Named Credentials, External Credentials, External Services, REST/SOAP callout patterns, Platform Events, and Change Data Capture. TRIGGER when: user sets up Named Credentials, External Services, REST/SOAP callouts, Platform Events, CDC, or touches .namedCredential-meta.xml files. DO NOT TRIGGER when: Connected App/OAuth config (use configuring-connected-apps), Apex-only logic (use generating-apex), or data import/export (use handling-sf-data).
venue-templates
IncludedAccess comprehensive LaTeX templates, formatting requirements, and submission guidelines for major scientific publication venues (Nature, Science, PLOS, IEEE, ACM), academic conferences (NeurIPS, ICML, CVPR, CHI), research posters, and grant proposals (NSF, NIH, DOE, DARPA). This skill should be used when preparing manuscripts for journal submission, conference papers, research posters, or grant proposals and need venue-specific formatting requirements and templates.
let-fate-decide
IncludedDraws the 12 Houses of the Zodiac Tarot spread to inject entropy into planning when prompts are vague, ambiguous, or casually delegated. Interprets the spread to guide next steps. Use when the user says 'let fate decide', 'YOLO', 'whatever', 'idk', or other nonchalant phrases, makes Yu-Gi-Oh references, or when you are about to arbitrarily pick between multiple reasonable approaches. Prefer over ask-questions-if-underspecified when the user's tone is casual or playful rather than precision-seeking.
net-ops
IncludedCross-platform network troubleshooting (Windows, macOS, Linux) via local or remote shell. Use for: DNS broken, can't resolve hostnames, nslookup/dig works but apps fail, NRPT, WFP, scutil, /etc/resolver, systemd-resolved, /etc/resolv.conf, NetworkManager, VPN DNS leak residue (ProtonVPN/Mullvad/WireGuard/AnyConnect), AV/firewall blocking DNS or DoH, Tailscale DNS interaction, intermittent connectivity, remote diagnostics over SSH.