Claude
Skills
Sign in
Back

tooluniverse-statistical-modeling

Included with Lifetime
$97 forever

Statistical modeling — linear/logistic/ordinal/Poisson regression, ANOVA, Kruskal-Wallis, chi-square, Mann-Whitney, Cox survival, spline fits (R `ns()`), odds ratios, Cohen's d, F-statistic, p-value computation. Specializes in clinical-trial AE analysis (SDTM DM/AE), severity ordinal regression, and per-feature stat workflows.

Generalscripts

What this skill does


# Statistical Modeling for Biomedical Data Analysis

## RULE ZERO — Check for pre-computed results FIRST

Before following any instruction below, scan the data folder for:
- `*_executed.ipynb` → read with `tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}'` and cite its cell outputs as the authoritative answer
- Pre-computed result files (CSV/TSV with names like `*results*`, `*deseq*`, `*enrich*`, `*stats*`, `*_simplified.csv`) → read directly and report the requested value
- Canonical analysis scripts (`analysis.R`, `run_*.py`, `find_*.R`, `*.Rmd`) → execute as-is and read the output

Only follow this skill's re-analysis recipe below if **none** of the above exist. Re-running from raw data produces different numbers than the published answer and is much slower (often 5-10× turn count).

---

## PRIMARY SCRIPTS — use these FIRST

These scripts encode the question-specific gotchas in `scripts/` and emit
labelled, parseable output. Prefer them over ad-hoc statsmodels / scipy code.

| Script | When to use it |
|--------|----------------|
| `r_natural_spline_regression.py` | ANY question that mentions R syntax `lm(y ~ ns(x, df = K))`, "natural spline", or asks for spline R²/F/peak prediction CIs. Always shells out to Rscript so `splines::ns()` matches. |
| `spline_model_compare.py` | "Best-fitting model among quadratic, cubic and natural spline" / "max colony area at the optimal x". Fits all three in R, ranks by adj-R²/AIC/BIC, and reports the BEST model's peak (x*, y*) with 95% CI. |
| `logistic_regression_or.py` | Binary or ordinal logistic regression where the answer is an OR (or OR + 95% CI). Handles label encoding, explicit Placebo=0/BCG=1 maps, AND interaction terms (`--interaction A:B` -> creates `A_B = A*B`). Prints OR + CI for every coefficient and a SCALARS block for the requested `--coef-name`. |
| `power_analysis.py` | "Minimum sample size per group", "TTestIndPower", "given Cohen's d, what N for power=0.8". Computes pooled-SD Cohen's d from a CSV (or accepts `--effect-size`), then `TTestIndPower.solve_power`. |
| `expression_anova.py` | Per-gene ANOVA / median LFC across cell types or sample groups (NOT pooled across genes — see warnings below). |
| `prepare_ae_cohort.py` | Clinical-trial AE severity tests (chi-square / ordinal) on SDTM DM/AE files (`encoding='latin1'`, `max(AESEV)` per subject across ALL AEs — no AEPT filter). |
| `stat_tests.py` | Stdlib-only chi-square goodness-of-fit, Fisher's exact, simple OLS. Use when scipy/statsmodels aren't available. |

### Concrete invocations

Natural-spline regression (R^2, overall F-test p, peak Y + 95% CI):

```bash
python skills/tooluniverse-statistical-modeling/scripts/r_natural_spline_regression.py \
  --csv data.csv --y-col Area \
  --ratio-col Ratio --new-x-col Frequency_strain \
  --filter "StrainNumber not in ['1', '98']" \
  --df 4 --workdir /tmp/spline_run
```

Quadratic vs cubic vs natural-spline comparison + best-model peak:

```bash
python skills/tooluniverse-statistical-modeling/scripts/spline_model_compare.py \
  --csv data.csv --y-col Area \
  --ratio-col Ratio --new-x-col Frequency_strain \
  --filter "StrainNumber not in ['1', '98']" \
  --ns-df 4 --workdir /tmp/spline_cmp
```

**Report the peak location (`x*`) in the units of the fitted x-variable, not a derived label.** When the model is fit on a frequency/proportion column (e.g. `Frequency_strain`, a 0–1 value), the answer to "at what ratio/frequency is the maximum" is that fraction (e.g. `0.909`), NOT the colon-ratio it was derived from (e.g. `10:1`). Convert a colon ratio `a:b` to the fraction `a/(a+b)` when the question expects a 0–1 value or the fitted x-column is a fraction.

Ordinal logistic regression with interaction term (e.g. trial AE severity):

```bash
python skills/tooluniverse-statistical-modeling/scripts/logistic_regression_or.py \
  --csv merged.csv --outcome AESEV --outcome-type ordinal --outcome-order "1,2,3,4" \
  --predictors TRTGRP,expect_interact,patients_seen,MHONGO \
  --encode TRTGRP,expect_interact,patients_seen \
  --encode-map "TRTGRP:Placebo=0,BCG=1" \
  --interaction MHONGO:TRTGRP_cat \
  --coef-name TRTGRP_cat
```

Two-sample power analysis from a pilot CSV:

```bash
python skills/tooluniverse-statistical-modeling/scripts/power_analysis.py \
  --csv pilot.csv --value-col MeasuredValue --group-col Group \
  --group-a Treatment --group-b Control \
  --power 0.8 --alpha 0.05
```

---

## Workspace isolation (CRITICAL)

The input data folder for any analysis must remain untouched so re-runs
are reproducible. Scripts that write intermediate files (R drivers,
prepared CSVs, comparison tables) must write to `/tmp/` or to a
`--workdir` you pass in. Both R-based scripts in this skill refuse to
run if `--workdir` resolves to the input CSV's parent directory (or any
ancestor of it).

```bash
# OK
--workdir /tmp/spline_run

# Refused:
--workdir <path-equal-to-or-containing-the-input-csv>/...
```

---

## CRITICAL — Read before writing any code

1. **Clinical trial AE analysis** (regression, chi-square, ANY severity test): Use the bundled script (or the `clinical_trial_ae_severity_test` ToolUniverse tool which wraps it):
   ```bash
   tu run clinical_trial_ae_severity_test '{"dm_file":"DM.csv","ae_file":"AE.csv","test":"chi-square","group_col":"TRTGRP"}'

   # Or directly:
   python skills/tooluniverse-statistical-modeling/scripts/prepare_ae_cohort.py \
     --dm DM.csv --ae AE.csv --test chi-square --group TRTGRP \
     --subgroup "expect_interact=Yes"   # optional
   ```
   The script/tool handles: `encoding='latin1'` for SDTM CSVs, `max(AESEV)` per subject across ALL AEs (no AEPT filtering), inner join with DM, optional subgroup filter, optional ordinal-logistic with covariates.

   **Why no AEPT filter** — AESEV is a protocol-defined severity scale on the AE table. Filtering AE by AEPT (e.g. keeping only `AEPT == "COVID-19"`) drops subjects whose worst severity was recorded under a different AEPT label, drastically changes the contingency table, and can flip the test result. The phrase "COVID-19 severity" describes the OUTCOME, NOT a filter criterion.

   - ❌ WRONG: `ae[ae['AEPT'].str.contains('COVID-19')].groupby('USUBJID')['AESEV'].max()` — filters to COVID-19 events
   - ✅ RIGHT: `ae.groupby('USUBJID')['AESEV'].max()` — uses ALL AE records

2. **Expression ANOVA / fold change with multi-feature data** (gene × sample matrix):
   For "the F-statistic" or "a fold change" as a single value, run per-gene then summarize — NEVER pool `expr.values.ravel()` across all genes.
   - For **F-statistic**: derive a per-sample quantity (like DESeq2 LFC of each gene between two cell types, then ANOVA on those LFCs across groups) OR run on a single target gene.
   - For **median/mean log2 fold change** between two groups: run DESeq2 with `design=~<group>`, extract per-gene `log2FoldChange` (with shrinkage if the pipeline uses it), then take median/mean across genes.

   ❌ WRONG (aggregate): `log2(sum_counts_groupA / sum_counts_groupB)` per sample then summarize — gives ratio of totals, dominated by high-expression genes.
   ✅ RIGHT (per-gene): DESeq2 → `results_df['log2FoldChange'].median()`.

   **Sanity heuristics**: F > 50 for biological ANOVA across a few groups means you aggregated (typical biological F is 0.5–10). |median LFC| > 2 between similar groups means you aggregated (typical |median| < 1).

   Use the bundled script: `python skills/tooluniverse-statistical-modeling/scripts/expression_anova.py` (or the `expression_anova_per_gene` ToolUniverse tool).

3. **Spline models** — R `splines::ns(x, df=K)` ≠ Python `patsy.dmatrix("cr(x, df=K)")`. They produce different design matrices because of internal-knot placement, boundary-knot placement, and basis orthogonalization. For ANY question that references R syntax like `lm(y ~ ns(x, df = 4))`, run R via `Rscript`. Use the bundled wrapper:

   ```bash
   python skills/tooluniverse-statistical-modeling/scripts/r_natural_spline_regressi

Related in General