autoresearch

Autonomous experiment loop for optimization research. Use when the user wants to:

- Optimize a metric through systematic experimentation (ML training loss, test speed, bundle size, build time, etc.)
- Run an automated research loop: try an idea, measure it, keep improvements, revert regressions, repeat
- Set up autoresearch for any codebase with a measurable optimization target

Implements the autoresearch pattern with MAD-based confidence scoring, git branch isolation, and structured experiment logging.

Data & Analytics

What this skill does


# Autoresearch

Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.

## Overview

You are running an autonomous optimization loop. Your job is to systematically improve a measurable metric by making changes, running experiments, and keeping only the improvements. You maintain structured state files so that any session — including a fresh one with no memory — can resume exactly where you left off.

If the user is asking you to do this and you are not currently in mission mode, suggest that they might want to run this inside a mission (`/enter-mission`) for better progress tracking, milestone validation, and multi-session continuity. Don't block on it — just mention it once during setup.

If you are already in mission mode, invoke the mission planning skills first (`mission-planning` and `define-mission-skills`) before diving into this skill's procedure. Use the mission system's planning, decomposition, and worker design to structure the autoresearch work — then combine that guidance with this skill's experiment loop procedure. This skill defines *how* to run experiments; the mission system defines *how to plan, track, and validate* them.

## Setup

Before the loop starts, you need to establish the experiment.

### Step 1: Gather Information

Ask the user (or infer from context) for:
- **Goal**: What are we optimizing? (e.g., "minimize val_bpb", "reduce test runtime", "shrink bundle size")
- **Command**: What to run (e.g., `uv run train.py`, `pnpm test`, `pnpm build && du -sb dist`)
- **Primary metric**: Name, unit, and direction (e.g., `val_bpb`, unitless, lower is better)
- **Files in scope**: Which files may be modified
- **Constraints**: Hard rules (tests must pass, no new deps, etc.)
- **Termination condition**: When to stop. Ask the user — options are:
  - Fixed experiment count (e.g., 20 experiments)
  - Fixed time budget (e.g., 2 hours)
  - Target metric (e.g., val_bpb < 1.0)
  - Run until interrupted (default)
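
The answers gathered above can be held in one small record so later steps do not have to re-ask for them. The sketch below is only an illustration: `ExperimentSetup` and its field names are hypothetical and are not part of the skill or its helper script.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExperimentSetup:
    """Hypothetical container for the answers gathered in Step 1."""
    goal: str                      # e.g. "minimize val_bpb"
    command: str                   # e.g. "uv run train.py"
    metric_name: str               # e.g. "val_bpb"
    direction: str                 # "lower" or "higher"
    unit: str = ""                 # empty string means unitless
    files_in_scope: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    max_experiments: Optional[int] = None   # fixed experiment count
    time_budget_s: Optional[float] = None   # fixed time budget, in seconds
    target_metric: Optional[float] = None   # target value for the primary metric

    def __post_init__(self) -> None:
        if self.direction not in ("lower", "higher"):
            raise ValueError(f"direction must be 'lower' or 'higher', got {self.direction!r}")

    def runs_until_interrupted(self) -> bool:
        """True when no explicit termination condition was given (the default)."""
        return (self.max_experiments is None
                and self.time_budget_s is None
                and self.target_metric is None)


setup = ExperimentSetup(goal="minimize val_bpb", command="uv run train.py",
                        metric_name="val_bpb", direction="lower", target_metric=1.0)
print(setup.runs_until_interrupted())  # prints "False"
```

Validating `direction` at construction time catches a mistyped value before any experiment is logged with it.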

### Step 2: Create Branch and State Files

```bash
git checkout autoresearch/<goal>-<date> 2>/dev/null || git checkout -b autoresearch/<goal>-<date>
```
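
The `<goal>` and `<date>` placeholders need to become a valid branch name. One possible way to build it is sketched below, assuming a lowercase hyphenated slug and an ISO date; neither format is specified by the skill, and `branch_name` is a hypothetical helper.

```python
import re
from datetime import date


def branch_name(goal: str, day: date) -> str:
    """Build 'autoresearch/<goal>-<date>' with a git-friendly goal slug."""
    # Collapse every run of characters outside a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", goal.lower()).strip("-")
    return f"autoresearch/{slug}-{day.isoformat()}"


print(branch_name("Minimize val_bpb", date(2025, 1, 31)))
# prints "autoresearch/minimize-val-bpb-2025-01-31"
```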

Read the source files thoroughly. Understand the workload deeply before writing anything.

Create three files:

#### `autoresearch.md`

The living research document. A fresh agent with no context should be able to read this file and run the loop effectively. Invest time making it excellent.

```markdown
# Autoresearch: <goal>

## Objective
<Specific description of what we're optimizing and the workload.>

## Metrics
- **Primary**: <name> (<unit>, lower/higher is better) — the optimization target
- **Secondary**: <name>, <name>, ... — independent tradeoff monitors

## How to Run
`./autoresearch.sh` — outputs `METRIC name=number` lines.

## Files in Scope
<Every file the agent may modify, with a brief note on what it does.>

## Off Limits
<What must NOT be touched.>

## Constraints
<Hard rules: tests must pass, no new deps, etc.>

## Termination
<When to stop: experiment count, time budget, target metric, or run until interrupted.>

## What's Been Tried
<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>
```

#### `autoresearch.sh`

A Bash script (`set -euo pipefail`) that runs fast pre-checks (catching syntax errors in under 1s), runs the benchmark, and prints structured `METRIC name=value` lines to stdout. Keep the script fast.

For fast, noisy benchmarks (< 5s), run the workload multiple times inside the script and report the median. Slow workloads (ML training, large builds) don't need this.
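
The median-of-repeats idea can be sketched as follows. This is an illustration in Python rather than the Bash script itself; `median_runtime`, the `runtime_s` metric name, and the stand-in workload are assumptions made for the example.

```python
import statistics
import time
from typing import Callable


def median_runtime(workload: Callable[[], object], repeats: int = 5) -> float:
    """Run `workload` `repeats` times and return the median wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; a real script would run the benchmark command instead.
seconds = median_runtime(lambda: sum(range(100_000)), repeats=5)
print(f"METRIC runtime_s={seconds:.6f}")
```

The median is used instead of the mean because a single slow outlier run (a cold cache, a background process) shifts the mean but leaves the median largely unchanged.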

Example:
```bash
#!/bin/bash
set -euo pipefail

# Pre-check: syntax validation
python3 -c "import ast; ast.parse(open('train.py').read())" 2>&1 || { echo "SYNTAX ERROR"; exit 1; }

# Run the workload
output=$(uv run train.py 2>&1)

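# Extract and output metrics (grep -P requires GNU grep)
val_bpb=$(echo "$output" | grep -oP 'val_bpb=\K[0-9.]+' | tail -1) || true
# Fail loudly instead of printing an empty METRIC value
[ -n "$val_bpb" ] || { echo "val_bpb not found in output" >&2; exit 1; }
echo "METRIC val_bpb=$val_bpb"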
```

#### `autoresearch.checks.sh` (optional)

Only create this when the user's constraints require correctness validation (e.g., "tests must pass", "types must check"). Bash script (`set -euo pipefail`) for backpressure checks.

```bash
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
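# Print only the error lines, but still fail the script when typecheck fails
typecheck_out=$(pnpm typecheck 2>&1) || { echo "$typecheck_out" | grep -i error || true; exit 1; }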
```

### Step 3: Initialize JSONL and Commit State Files

Initialize the experiment log:

```bash
python3 autoresearch_helper.py init --jsonl autoresearch.jsonl --name '<goal>' --metric-name '<metric_name>' --direction <lower|higher>
```

Commit all state files (also add `autoresearch.checks.sh` if you created it):

```bash
git add autoresearch.md autoresearch.sh autoresearch.jsonl
git commit -m "autoresearch: initialize experiment session"
```

### Step 4: Run Baseline

Run the benchmark and record the baseline result:

```bash
bash autoresearch.sh
```

Parse the METRIC lines from the output, then log the baseline as a keep:

```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit $(git rev-parse --short=7 HEAD) \
  --metric <baseline_value> \
  --status keep \
  --description "baseline" \
  --asi '{"hypothesis": "baseline measurement"}'
```

This is experiment #1 — it establishes the starting point for all future comparisons.
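
To make the append-only log concrete, here is a minimal sketch of writing and reading JSONL records. The field names simply mirror the CLI flags above; the schema that `autoresearch_helper.py` actually writes is not shown in this document and may differ, and `append_entry`, `read_entries`, and the `abc1234` hash are hypothetical.

```python
import json
import os
import tempfile


def append_entry(path: str, entry: dict) -> None:
    """Append one experiment record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")


def read_entries(path: str) -> list[dict]:
    """Read every record back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Assumed record shape: one key per CLI flag used in the baseline command.
log_path = os.path.join(tempfile.mkdtemp(), "autoresearch.jsonl")
append_entry(log_path, {"commit": "abc1234", "metric": 1.25, "status": "keep",
                        "description": "baseline",
                        "asi": {"hypothesis": "baseline measurement"}})
print(read_entries(log_path)[0]["status"])  # prints "keep"
```

Because each record is one line, a fresh session can rebuild the full history by reading the file top to bottom.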

## The Experiment Loop

**LOOP FOREVER.** Never ask "should I continue?" — the user expects autonomous work. Only stop when:
- The termination condition from setup is met
- The user interrupts
- You detect you're running low on context (see Context Management below)
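
The mechanical part of that stop check can be sketched as below. `should_stop` and its parameters are hypothetical, and the sketch covers only the setup-time termination condition; user interrupts and low context are outside its scope.

```python
from typing import Optional


def should_stop(experiments_run: int, elapsed_s: float, best_metric: float,
                direction: str = "lower",
                max_experiments: Optional[int] = None,
                time_budget_s: Optional[float] = None,
                target_metric: Optional[float] = None) -> bool:
    """Return True once any configured termination condition is met.

    With no condition configured this always returns False, which
    corresponds to "run until interrupted".
    """
    if max_experiments is not None and experiments_run >= max_experiments:
        return True
    if time_budget_s is not None and elapsed_s >= time_budget_s:
        return True
    if target_metric is not None:
        # Strict comparison, matching a target such as "val_bpb < 1.0".
        if direction == "lower" and best_metric < target_metric:
            return True
        if direction == "higher" and best_metric > target_metric:
            return True
    return False


print(should_stop(20, 600.0, 1.2, max_experiments=20))  # prints "True"
```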

### For Each Experiment:

#### 1. Choose What to Try

Read `autoresearch.md` (especially "What's Been Tried") and `autoresearch.ideas.md` (if it exists) to pick the next hypothesis. Think about what the data tells you. The best ideas come from deep understanding, not random variations.

#### 2. Make Changes

Edit the files in scope. Keep changes focused — one hypothesis per experiment.

#### 3. Run the Experiment

Execute the benchmark:

```bash
timeout 600 bash autoresearch.sh
```

Capture the full output. Parse `METRIC name=value` lines from the output.
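
Parsing the `METRIC name=value` lines might look like the sketch below. The function name `parse_metrics` and the choice to let a later line override an earlier one with the same name are assumptions, not documented behavior.

```python
import re

# Matches lines such as "METRIC val_bpb=1.234" or "METRIC time_ms=-3e2".
_METRIC_RE = re.compile(
    r"^METRIC\s+([A-Za-z_][\w.]*)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(output: str) -> dict[str, float]:
    """Collect METRIC lines; a later line for the same name overrides an earlier one."""
    metrics: dict[str, float] = {}
    for line in output.splitlines():
        match = _METRIC_RE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


sample = "step 10 done\nMETRIC val_bpb=1.25\nMETRIC compile_us=300\n"
print(parse_metrics(sample))  # prints "{'val_bpb': 1.25, 'compile_us': 300.0}"
```

Lines that do not start with `METRIC`, or that carry no numeric value, are ignored, so ordinary log output can share the same stream.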

If the run crashes or times out, log it as a crash and revert.

If `autoresearch.checks.sh` exists and the benchmark passed, run it:
```bash
timeout 300 bash autoresearch.checks.sh
```
If checks fail, log as `checks_failed` and revert.
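
The crash, timeout, and checks handling above can be sketched with `subprocess`. The helper names `run_step` and `classify_run` are hypothetical; the 600 s and 300 s defaults mirror the `timeout` values shown in the commands.

```python
import subprocess
import sys


def run_step(cmd: list[str], timeout_s: float) -> tuple[bool, str]:
    """Run a command; return (succeeded, combined stdout and stderr)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    return proc.returncode == 0, proc.stdout + proc.stderr


def classify_run(benchmark_cmd, checks_cmd=None,
                 benchmark_timeout_s=600, checks_timeout_s=300):
    """Return ("crash" | "checks_failed" | "ok", benchmark output)."""
    ok, output = run_step(benchmark_cmd, benchmark_timeout_s)
    if not ok:
        return "crash", output          # non-zero exit or timeout
    if checks_cmd is not None:          # checks run only after a passing benchmark
        checks_ok, _ = run_step(checks_cmd, checks_timeout_s)
        if not checks_ok:
            return "checks_failed", output
    return "ok", output


# Stand-in benchmark; a real call would pass ["bash", "autoresearch.sh"].
status, _ = classify_run([sys.executable, "-c", "print('METRIC val_bpb=1.2')"])
print(status)  # prints "ok"
```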

#### 4. Evaluate Results

Compare the primary metric against the current best (or baseline if no keeps yet) using the helper script:

```bash
python3 autoresearch_helper.py evaluate --jsonl autoresearch.jsonl --metric <value> --direction <lower|higher>
```

This outputs whether to keep or discard, the confidence score, and delta from baseline.
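
The helper's exact confidence formula is not shown in this document. As one plausible reading of "MAD-based confidence scoring", the sketch below divides the size of the improvement by the median absolute deviation (MAD) of earlier measurements, so the score says how many "units of noise" the change is worth. The function names and the minimum of three data points are assumptions.

```python
import statistics
from typing import Optional


def mad(values: list[float]) -> float:
    """Median absolute deviation: median(|x - median(x)|)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def confidence(history: list[float], candidate: float, best: float) -> Optional[float]:
    """Improvement size in units of noise; None when noise cannot be estimated."""
    if len(history) < 3:
        return None              # too few points for a meaningful MAD
    noise = mad(history)
    if noise == 0:
        return None              # identical measurements give no noise estimate
    return abs(candidate - best) / noise


history = [10, 12, 11, 13, 11]   # earlier metric values (lower is better)
print(mad(history))                                 # prints "1"
print(confidence(history, candidate=8, best=10))    # prints "2.0"
```

Under this reading, a score well above 1 suggests the change exceeds typical run-to-run variation, while a score below 1 is hard to distinguish from noise.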

Decision rules:
- **Primary metric improved** -> `keep`
- **Primary metric worse or unchanged** -> `discard`
- **Simpler code for equal performance** -> `keep` (removing code for same perf is a win)
- **Ugly complexity for tiny gain** -> probably `discard`
- Secondary metrics rarely affect the keep/discard decision. Only discard a primary improvement if a secondary metric degraded catastrophically.
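
The first three rules are mechanical and can be sketched as a small function. `decide` is a hypothetical name; the judgment calls (ugly complexity for a tiny gain, a catastrophic secondary-metric regression) are deliberately left out because they cannot be reduced to a comparison.

```python
def decide(candidate: float, best: float, direction: str,
           simpler_code: bool = False) -> str:
    """Return "keep" or "discard" following the primary-metric rules."""
    if direction not in ("lower", "higher"):
        raise ValueError("direction must be 'lower' or 'higher'")
    improved = candidate < best if direction == "lower" else candidate > best
    if improved:
        return "keep"
    if candidate == best and simpler_code:
        return "keep"   # equal performance with less code counts as a win
    return "discard"


print(decide(1.18, 1.20, "lower"))                      # prints "keep"
print(decide(1.20, 1.20, "lower"))                      # prints "discard"
print(decide(1.20, 1.20, "lower", simpler_code=True))   # prints "keep"
```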

#### 5. Record Results

**On keep:**

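Log to JSONL first (so the entry is included in the commit). At this point `HEAD` still points at the previous commit, so the logged hash is the parent of the commit created next: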
```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit $(git rev-parse --short=7 HEAD) \
  --metric <value> \
  --status keep \
  --description "<what was tried>" \
  --asi '{"hypothesis": "<what you tried>"}' \
  --direction <lower|higher>
# Optional: to record secondary metrics, insert this flag above --direction:
#   --metrics '{"compile_us": <value>, "render_us": <value>}'
```

Then commit all changes (including the JSONL entry):
```bash
git add -A
git commit -m "<description>

Result: {\"status\": \"keep\", \"<metric_name>\": <value>}"
```
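
Building the commit message programmatically avoids hand-escaping the JSON trailer. This is a minimal sketch; `commit_message` is a hypothetical helper and the description text is made up for illustration.

```python
import json


def commit_message(description: str, metric_name: str, value: float,
                   status: str = "keep") -> str:
    """Description, blank line, then a one-line JSON 'Result:' trailer."""
    result = json.dumps({"status": status, metric_name: value})
    return f"{description}\n\nResult: {result}"


print(commit_message("fuse attention projections", "val_bpb", 1.18))
# prints:
# fuse attention projections
#
# Result: {"status": "keep", "val_bpb": 1.18}
```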

**On discard/crash/checks_failed:**

Log to JSONL first (before reverting, so the entry is preserved):
```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit "0000000" \
  --metric <value_or_0> \
  --status <discard|crash|checks_failed> \
  --desc
```

Files: 2

Size: 30.2 KB

Complexity: 44/100

Category: Data & Analytics

Source: https://github.com/factory-ai/factory-plugins/tree/main/plugins/autoresearch/skills/autoresearch
