tooluniverse-statistical-modeling

Included with Lifetime

$97 forever

Statistical modeling — linear/logistic/ordinal/Poisson regression, ANOVA, Kruskal-Wallis, chi-square, Mann-Whitney, Cox survival, spline fits (R `ns()`), odds ratios, Cohen's d, F-statistic, p-value computation. Specializes in clinical-trial AE analysis (SDTM DM/AE), severity ordinal regression, and per-feature stat workflows.

Generalscripts

What this skill does


# Statistical Modeling for Biomedical Data Analysis

## RULE ZERO — Check for pre-computed results FIRST

Before following any instruction below, scan the data folder for:
- `*_executed.ipynb` → read with `tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}'` and cite its cell outputs as the authoritative answer
- Pre-computed result files (CSV/TSV with names like `*results*`, `*deseq*`, `*enrich*`, `*stats*`, `*_simplified.csv`) → read directly and report the requested value
- Canonical analysis scripts (`analysis.R`, `run_*.py`, `find_*.R`, `*.Rmd`) → execute as-is and read the output

Only follow this skill's re-analysis recipe below if **none** of the above exist. Re-running from raw data produces different numbers than the published answer and is much slower (often 5-10× turn count).

---

## PRIMARY SCRIPTS — use these FIRST

These scripts encode the question-specific gotchas in `scripts/` and emit
labelled, parseable output. Prefer them over ad-hoc statsmodels / scipy code.

| Script | When to use it |
|--------|----------------|
| `r_natural_spline_regression.py` | ANY question that mentions R syntax `lm(y ~ ns(x, df = K))`, "natural spline", or asks for spline R²/F/peak prediction CIs. Always shells out to Rscript so `splines::ns()` matches. |
| `spline_model_compare.py` | "Best-fitting model among quadratic, cubic and natural spline" / "max colony area at the optimal x". Fits all three in R, ranks by adj-R²/AIC/BIC, and reports the BEST model's peak (x*, y*) with 95% CI. |
| `logistic_regression_or.py` | Binary or ordinal logistic regression where the answer is an OR (or OR + 95% CI). Handles label encoding, explicit Placebo=0/BCG=1 maps, AND interaction terms (`--interaction A:B` -> creates `A_B = A*B`). Prints OR + CI for every coefficient and a SCALARS block for the requested `--coef-name`. |
| `power_analysis.py` | "Minimum sample size per group", "TTestIndPower", "given Cohen's d, what N for power=0.8". Computes pooled-SD Cohen's d from a CSV (or accepts `--effect-size`), then `TTestIndPower.solve_power`. |
| `expression_anova.py` | Per-gene ANOVA / median LFC across cell types or sample groups (NOT pooled across genes — see warnings below). |
| `prepare_ae_cohort.py` | Clinical-trial AE severity tests (chi-square / ordinal) on SDTM DM/AE files (`encoding='latin1'`, `max(AESEV)` per subject across ALL AEs — no AEPT filter). |
| `stat_tests.py` | Stdlib-only chi-square goodness-of-fit, Fisher's exact, simple OLS. Use when scipy/statsmodels aren't available. |

### Concrete invocations

Natural-spline regression (R^2, overall F-test p, peak Y + 95% CI):

```bash
python skills/tooluniverse-statistical-modeling/scripts/r_natural_spline_regression.py \
  --csv data.csv --y-col Area \
  --ratio-col Ratio --new-x-col Frequency_strain \
  --filter "StrainNumber not in ['1', '98']" \
  --df 4 --workdir /tmp/spline_run
```

Quadratic vs cubic vs natural-spline comparison + best-model peak:

```bash
python skills/tooluniverse-statistical-modeling/scripts/spline_model_compare.py \
  --csv data.csv --y-col Area \
  --ratio-col Ratio --new-x-col Frequency_strain \
  --filter "StrainNumber not in ['1', '98']" \
  --ns-df 4 --workdir /tmp/spline_cmp
```

**Report the peak location (`x*`) in the units of the fitted x-variable, not a derived label.** When the model is fit on a frequency/proportion column (e.g. `Frequency_strain`, a 0–1 value), the answer to "at what ratio/frequency is the maximum" is that fraction (e.g. `0.909`), NOT the colon-ratio it was derived from (e.g. `10:1`). Convert a colon ratio `a:b` to the fraction `a/(a+b)` when the question expects a 0–1 value or the fitted x-column is a fraction.

Ordinal logistic regression with interaction term (e.g. trial AE severity):

```bash
python skills/tooluniverse-statistical-modeling/scripts/logistic_regression_or.py \
  --csv merged.csv --outcome AESEV --outcome-type ordinal --outcome-order "1,2,3,4" \
  --predictors TRTGRP,expect_interact,patients_seen,MHONGO \
  --encode TRTGRP,expect_interact,patients_seen \
  --encode-map "TRTGRP:Placebo=0,BCG=1" \
  --interaction MHONGO:TRTGRP_cat \
  --coef-name TRTGRP_cat
```

Two-sample power analysis from a pilot CSV:

```bash
python skills/tooluniverse-statistical-modeling/scripts/power_analysis.py \
  --csv pilot.csv --value-col MeasuredValue --group-col Group \
  --group-a Treatment --group-b Control \
  --power 0.8 --alpha 0.05
```

---

## Workspace isolation (CRITICAL)

The input data folder for any analysis must remain untouched so re-runs
are reproducible. Scripts that write intermediate files (R drivers,
prepared CSVs, comparison tables) must write to `/tmp/` or to a
`--workdir` you pass in. Both R-based scripts in this skill refuse to
run if `--workdir` resolves to the input CSV's parent directory (or any
ancestor of it).

```bash
# OK
--workdir /tmp/spline_run

# Refused:
--workdir <path-equal-to-or-containing-the-input-csv>/...
```

---

## CRITICAL — Read before writing any code

1. **Clinical trial AE analysis** (regression, chi-square, ANY severity test): Use the bundled script (or the `clinical_trial_ae_severity_test` ToolUniverse tool which wraps it):
   ```bash
   tu run clinical_trial_ae_severity_test '{"dm_file":"DM.csv","ae_file":"AE.csv","test":"chi-square","group_col":"TRTGRP"}'

   # Or directly:
   python skills/tooluniverse-statistical-modeling/scripts/prepare_ae_cohort.py \
     --dm DM.csv --ae AE.csv --test chi-square --group TRTGRP \
     --subgroup "expect_interact=Yes"   # optional
   ```
   The script/tool handles: `encoding='latin1'` for SDTM CSVs, `max(AESEV)` per subject across ALL AEs (no AEPT filtering), inner join with DM, optional subgroup filter, optional ordinal-logistic with covariates.

   **Why no AEPT filter** — AESEV is a protocol-defined severity scale on the AE table. Filtering AE by AEPT (e.g. keeping only `AEPT == "COVID-19"`) drops subjects whose worst severity was recorded under a different AEPT label, drastically changes the contingency table, and can flip the test result. The phrase "COVID-19 severity" describes the OUTCOME, NOT a filter criterion.

   - ❌ WRONG: `ae[ae['AEPT'].str.contains('COVID-19')].groupby('USUBJID')['AESEV'].max()` — filters to COVID-19 events
   - ✅ RIGHT: `ae.groupby('USUBJID')['AESEV'].max()` — uses ALL AE records

2. **Expression ANOVA / fold change with multi-feature data** (gene × sample matrix):
   For "the F-statistic" or "a fold change" as a single value, run per-gene then summarize — NEVER pool `expr.values.ravel()` across all genes.
   - For **F-statistic**: derive a per-sample quantity (like DESeq2 LFC of each gene between two cell types, then ANOVA on those LFCs across groups) OR run on a single target gene.
   - For **median/mean log2 fold change** between two groups: run DESeq2 with `design=~<group>`, extract per-gene `log2FoldChange` (with shrinkage if the pipeline uses it), then take median/mean across genes.

   ❌ WRONG (aggregate): `log2(sum_counts_groupA / sum_counts_groupB)` per sample then summarize — gives ratio of totals, dominated by high-expression genes.
   ✅ RIGHT (per-gene): DESeq2 → `results_df['log2FoldChange'].median()`.

   **Sanity heuristics**: F > 50 for biological ANOVA across a few groups means you aggregated (typical biological F is 0.5–10). |median LFC| > 2 between similar groups means you aggregated (typical |median| < 1).

   Use the bundled script: `python skills/tooluniverse-statistical-modeling/scripts/expression_anova.py` (or the `expression_anova_per_gene` ToolUniverse tool).

3. **Spline models** — R `splines::ns(x, df=K)` ≠ Python `patsy.dmatrix("cr(x, df=K)")`. They produce different design matrices because of internal-knot placement, boundary-knot placement, and basis orthogonalization. For ANY question that references R syntax like `lm(y ~ ns(x, df = 4))`, run R via `Rscript`. Use the bundled wrapper:

   ```bash
   python skills/tooluniverse-statistical-modeling/scripts/r_natural_spline_regressi

Files: 21

Size: 239.3 KB

Complexity: 87/100

Category: General

Source: https://github.com/mims-harvard/tooluniverse/tree/main/plugin/skills/tooluniverse-statistical-modeling

Related in General

modeling-omnistudio-epc-catalog

Included

Salesforce Industries CME EPC product-modeling skill for Product2-based catalog creation. Use when creating EPC products, configuring product attributes, building offer bundles with Product Child Items, or reviewing EPC DataPack JSON metadata for product catalog changes. TRIGGER when: user creates or updates Product2 EPC records, AttributeAssignment payloads, AttributeMetadata/AttributeDefaultValues, Offer bundles, or ProductChildItem relationships. DO NOT TRIGGER when: designing OmniScripts/FlexCards/Integration Procedures (use building-omnistudio-omniscript, building-omnistudio-flexcard, or building-omnistudio-integration-procedure), implementing Apex business logic (use generating-apex), or troubleshooting deployment pipelines (use deploying-metadata).

Generalscripts

relationship-science-coach

Included

Use this skill for direct, practical adult relationship coaching: couples conflict, repair, trust, marriage, dating, flirting, attachment patterns, emotional connection, sex, desire differences, eroticism, kink negotiation, affection, love languages, breakups, and long-term passion. Draw on Gottman, EFT and Hold Me Tight, attachment science, modern sex research, Perel, Nagoski, Kerner, Schnarch, Love and Stosny, and flexible love-language tools. Be concrete and low-hedge. Redirect only for imminent danger, abuse, coercive control, minors, non-consent, self-harm, stalking, or medical/legal/psychiatric decisions.

Generalscripts

building-sf-integrations

Included

Salesforce integration architecture and runtime plumbing with 120-point scoring. Use this skill to set up Named Credentials, External Credentials, External Services, REST/SOAP callout patterns, Platform Events, and Change Data Capture. TRIGGER when: user sets up Named Credentials, External Services, REST/SOAP callouts, Platform Events, CDC, or touches .namedCredential-meta.xml files. DO NOT TRIGGER when: Connected App/OAuth config (use configuring-connected-apps), Apex-only logic (use generating-apex), or data import/export (use handling-sf-data).

Generalscripts

venue-templates

Included

Access comprehensive LaTeX templates, formatting requirements, and submission guidelines for major scientific publication venues (Nature, Science, PLOS, IEEE, ACM), academic conferences (NeurIPS, ICML, CVPR, CHI), research posters, and grant proposals (NSF, NIH, DOE, DARPA). This skill should be used when preparing manuscripts for journal submission, conference papers, research posters, or grant proposals and need venue-specific formatting requirements and templates.

Generalscripts

let-fate-decide

Included

Draws the 12 Houses of the Zodiac Tarot spread to inject entropy into planning when prompts are vague, ambiguous, or casually delegated. Interprets the spread to guide next steps. Use when the user says 'let fate decide', 'YOLO', 'whatever', 'idk', or other nonchalant phrases, makes Yu-Gi-Oh references, or when you are about to arbitrarily pick between multiple reasonable approaches. Prefer over ask-questions-if-underspecified when the user's tone is casual or playful rather than precision-seeking.

Generalscripts

net-ops

Included

Cross-platform network troubleshooting (Windows, macOS, Linux) via local or remote shell. Use for: DNS broken, can't resolve hostnames, nslookup/dig works but apps fail, NRPT, WFP, scutil, /etc/resolver, systemd-resolved, /etc/resolv.conf, NetworkManager, VPN DNS leak residue (ProtonVPN/Mullvad/WireGuard/AnyConnect), AV/firewall blocking DNS or DoH, Tailscale DNS interaction, intermittent connectivity, remote diagnostics over SSH.

Generalscripts