geo-database
Access NCBI GEO for gene expression and genomics data. Search and download microarray and RNA-seq datasets (GSE, GSM, GPL) and retrieve SOFT/Matrix files for transcriptomics and expression analysis.
What this skill does
# GEO Database

## Overview

The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

## When to Use This Skill

This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

## Core Capabilities

### 1. Understanding GEO Data Organization

GEO organizes data hierarchically using different accession types:

**Series (GSE):** A complete experiment with a set of related samples
- Example: GSE123456
- Contains experimental design, samples, and overall study information
- Largest organizational unit in GEO
- Current count: 264,928+ series

**Sample (GSM):** A single experimental sample or biological replicate
- Example: GSM987654
- Contains individual sample data, protocols, and metadata
- Linked to platforms and series
- Current count: 8,068,632+ samples

**Platform (GPL):** The microarray or sequencing platform used
- Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
- Describes the technology and probe/feature annotations
- Shared across multiple experiments
- Current count: 27,739+ platforms

**DataSet (GDS):** Curated collections with consistent formatting
- Example: GDS5678
- Experimentally-comparable samples organized by study design
- Processed for differential analysis
- Subset of GEO data (4,348 curated datasets)
- Ideal for quick comparative analyses

**Profiles:** Gene-specific expression data linked to sequence features
- Queryable by gene name or annotation
- Cross-references to Entrez Gene
- Enables gene-centric searches across all studies

### 2. Searching GEO Data

**GEO DataSets Search:**

Search for studies by keywords, organism, or experimental conditions:

```python
from Bio import Entrez

# Configure Entrez (required); replace with your own address
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH Terms] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")
```

**GEO Profiles Search:**

Find gene-specific expression patterns:

```python
# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")
```

**Advanced Search Patterns:**

```python
# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find recent high-throughput studies
# (RNA-seq studies carry the DataSet Type "expression profiling by high throughput sequencing")
search_terms = [
    "expression profiling by high throughput sequencing[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
# (GEO DataSets has no "Disease" field, so the condition is searched across all fields)
search_terms = [
    "Smith[Author]",
    "diabetes"
]
results = advanced_geo_search(search_terms)
```

### 3. Retrieving GEO Data with GEOparse (Recommended)

**GEOparse** is the primary Python library for accessing GEO data:

**Installation:**

```bash
uv pip install GEOparse
```

**Basic Usage:**

```python
import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata (each value is a list of strings)
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
```

**Working with Expression Data:**

```python
import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: Pivot the per-sample tables into a single matrix
if hasattr(gse, 'pivot_samples'):
    expression_df = gse.pivot_samples('VALUE')
    print(expression_df.shape)  # probes (ID_REF) x samples

# Method 2: From individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    # Skip samples that carry no data table
    if getattr(gsm, 'table', None) is not None and not gsm.table.empty:
        # Index by probe ID so that rows align across samples
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']

expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")
```

**Accessing Supplementary Files:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files
for gsm_name, gsm in gse.gsms.items():
    # Metadata keys may be 'supplementary_file' or 'supplementary_file_1', '_2', ...
    file_urls = [
        url
        for key, values in gsm.metadata.items()
        if key.startswith('supplementary_file')
        for url in values
    ]
    if file_urls:
        print(f"Sample {gsm_name}:")
        for file_url in file_urls:
            print(f"  {file_url}")
```

**Filtering and Subsetting Data:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]

treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]

print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
```

### 4. Using NCBI E-utilities for GEO Access

**E-utilities** provide lower-level programmatic access to GEO metadata:

**Basic E-utilities Workflow:**

```python
from Bio import Entrez
import time

# Replace with your own address
Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch full GEO records as plain text"""
    ids = ",".join(id_list)
    # For db="gds", efetch serves text summaries rather than XML,
    # so the response is read as text instead of parsed with Entrez.read
    handle = Entrez.efetch(db=db, id=ids, rettype="summary", retmode="text")
    records = handle.read()
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_resu
```
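
The accession types described under "Understanding GEO Data Organization" (GSE, GSM, GPL, GDS) can be told apart by their prefix before deciding how to fetch a record. The following is a minimal sketch and not part of GEOparse or Biopython: the helper name `classify_geo_accession` and its return labels are hypothetical, and it assumes an accession is one of the four prefixes followed only by digits.

```python
import re

# Hypothetical lookup: GEO accession prefix -> record type named in this document
_ACCESSION_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

# Assumption: a valid accession is a known prefix followed by one or more digits
_ACCESSION_RE = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$", re.IGNORECASE)


def classify_geo_accession(accession):
    """Return (record_type, numeric_id), or None if the string is not recognized."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1).upper()
    number = int(match.group(2))
    return _ACCESSION_TYPES[prefix], number


if __name__ == "__main__":
    # Example accessions taken from the data-organization section, plus one invalid string
    for acc in ["GSE123456", "GSM987654", "GPL570", "GDS5678", "XYZ1"]:
        print(acc, "->", classify_geo_accession(acc))
```

A check like this can guard calls such as `GEOparse.get_GEO(geo=...)` against malformed input without making a network request.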