text-metrics
Probability-based text analysis providing GLTR token rank histograms, DetectGPT curvature probes, and Coh-Metrix-inspired cohesion metrics. Designed to compose with ai-check for comprehensive AI writing pattern detection.
# Text-Metrics: Probability-Based Text Analysis
## Overview
A utility skill providing advanced statistical and model-based text analysis for AI detection. Implements three core capabilities grounded in peer-reviewed research:
1. **Token Rank Histograms** (GLTR-style): Analyzes token probability distributions
2. **DetectGPT Curvature Probes**: Measures log-probability curvature through perturbations
3. **Cohesion Metrics** (Coh-Metrix-inspired): Evaluates discourse connectives, lexical diversity, and referential cohesion
## Primary Use Case
This skill is designed to **compose with the ai-check skill** to provide Dimension 5 (Probability-Based) detection features. It can also be used standalone for text analysis research.
## Auto-Invoke Conditions
This skill is typically invoked **by other skills** (particularly ai-check) rather than directly by users. It may be invoked when:
- ai-check requests probability-based metrics
- User explicitly requests "token rank analysis" or "GLTR"
- User mentions "DetectGPT" or "curvature probe"
- User asks for "lexical diversity" or "cohesion metrics"
## Scientific Foundation
### 1. GLTR Token Rank Histograms
**Reference**: Gehrmann, S., Strobelt, H., & Rush, A. M. (2019). "GLTR: Statistical Detection and Visualization of Generated Text." ACL 2019.
**Principle**: LLM-generated text tends to select higher-probability tokens (lower ranks) more consistently than human writing, which exhibits more variability and surprisal.
**Method**:
1. For each token in text, compute its rank given preceding context using a language model
2. Bin tokens by rank into: top-10, top-100, top-1000, rest (the top-N percentages are cumulative, so top-100 includes top-10)
3. LLM text shows higher concentration in top bins
**Typical Patterns**:
- **AI-generated**: Top-10 > 40%, Top-100 > 85%
- **Human-written**: Top-10 ~ 20-30%, Top-100 ~ 70-80%
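The binning step can be sketched without a language model, assuming the per-token ranks are already available. `rank_of` and `rank_histogram` are hypothetical helper names, not functions from `scripts.text_metrics`, and the ranks in the example are made up for illustration.

```python
from statistics import mean, median


def rank_of(probs, token_id):
    """Rank of token_id under a next-token distribution (1 = most likely)."""
    target = probs[token_id]
    # Count how many candidates the model considers strictly more likely.
    return 1 + sum(1 for p in probs if p > target)


def rank_histogram(ranks):
    """Cumulative GLTR-style bins as percentages of all tokens."""
    if not ranks:
        raise ValueError("need at least one token rank")
    n = len(ranks)

    def pct(limit):
        return 100.0 * sum(1 for r in ranks if r <= limit) / n

    return {
        "top10_pct": pct(10),
        "top100_pct": pct(100),
        "top1000_pct": pct(1000),
        "rest_pct": 100.0 - pct(1000),
        "mean_rank": mean(ranks),
        "median_rank": median(ranks),
    }


# Toy input: illustrative ranks, not real model output.
hist = rank_histogram([1, 3, 7, 12, 45, 150, 2, 900, 5000, 8])
print(hist["top10_pct"], hist["top100_pct"])
```

In a real pipeline the ranks would come from a causal language model's next-token distribution at each position; the sketch only covers the counting that follows.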
### 2. DetectGPT Curvature Probes
**Reference**: Mitchell, E., et al. (2023). "DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature." ICML 2023.
**Principle**: Generated text tends to sit near local maxima of the model's log-probability function, while human text generally does not. Random perturbations of generated text therefore tend to have lower probability than the original.
**Method**:
1. Compute log-probability of original text: `log P(x)`
2. Generate random perturbations: `x'₁, x'₂, ..., x'ₙ`
3. Compute mean perturbed log-probability: `mean(log P(x'ᵢ))`
4. Curvature = `log P(x) - mean(log P(x'ᵢ))` (called the perturbation discrepancy in the paper)
5. A positive value suggests generation
**Interpretation**:
- **Curvature > 0.5**: Likely generated
- **Curvature ~ 0**: Ambiguous
- **Curvature < -0.5**: Likely human
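The arithmetic in steps 3 to 5 can be sketched as follows, assuming the log-probabilities have already been computed by a language model. `curvature_score` and `interpret` are hypothetical names, and treating everything between -0.5 and 0.5 as "ambiguous" is an assumption about how the three bands above join up.

```python
from statistics import mean


def curvature_score(original_logprob, perturbed_logprobs):
    """log P(x) minus the mean log-probability of the perturbed texts."""
    if not perturbed_logprobs:
        raise ValueError("need at least one perturbed log-probability")
    return original_logprob - mean(perturbed_logprobs)


def interpret(curvature, threshold=0.5):
    """Map a score to the three bands listed above."""
    if curvature > threshold:
        return "likely generated"
    if curvature < -threshold:
        return "likely human"
    return "ambiguous"  # assumption: the whole middle band is ambiguous


# Illustrative numbers only, not measurements.
score = curvature_score(-2.0, [-3.0, -2.5, -3.5])
print(score, interpret(score))
```

More perturbations reduce the variance of the mean, at the cost of one extra model evaluation per perturbed text.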
### 3. Coh-Metrix Cohesion Metrics
**Reference**: McNamara, D. S., et al. (2014). "Automated Evaluation of Text and Discourse with Coh-Metrix." Cambridge University Press.
**Principle**: Discourse cohesion patterns (connectives, referential overlap, lexical diversity) differ between human and AI writing.
**Metrics**:
- **Connective Density**: Additive, temporal, causal, adversative connectives per 1,000 tokens
- **Lexical Diversity**: Type-token ratio, unique lemmas, hapax legomena
- **Referential Cohesion**: Pronoun rate, unique pronoun types
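A rough sketch of how such surface metrics can be counted, using only the standard library. The connective lists are hypothetical and deliberately tiny, and word forms stand in for lemmas, so the numbers will differ from what `cohesion_bundle` reports.

```python
import re
from collections import Counter

# Hypothetical, deliberately small connective lists; a real lexicon is larger.
CONNECTIVES = {
    "additive": {"and", "also", "moreover"},
    "temporal": {"then", "before", "after"},
    "causal": {"because", "therefore", "so"},
    "adversative": {"but", "however", "although"},
}


def simple_cohesion(text):
    """Surface-level cohesion counts using word forms instead of lemmas."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    per_1000 = 1000.0 / len(tokens)  # scale raw counts to a per-1,000-token rate
    rates = {
        f"{kind}_rate": per_1000 * sum(counts[w] for w in words)
        for kind, words in CONNECTIVES.items()
    }
    return {
        "connectives": rates,
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
    }


sample = "The cat sat because the cat was tired, but then the dog barked."
result = simple_cohesion(sample)
print(result["type_token_ratio"], result["hapax_legomena"])
```

Type-token ratio falls as texts get longer, so comparisons are most meaningful between texts of similar length.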
## Functions
### 1. `token_rank_hist(text, model_name="gpt2")`
Compute GLTR-style token rank histogram.
**Input**:
- `text`: String to analyze
- `model_name`: HuggingFace model identifier (default: "gpt2")
**Output**:
```json
{
  "top10_pct": 45.2,
  "top100_pct": 87.3,
  "top1000_pct": 96.1,
  "rest_pct": 3.9,
  "mean_rank": 182.4,
  "median_rank": 34.0
}
```
**Usage**:
```python
from scripts.text_metrics import token_rank_hist
result = token_rank_hist("Your text here", model_name="gpt2")
print(f"Top-10 concentration: {result['top10_pct']:.1f}%")
```
### 2. `detectgpt_score(text, model_name="gpt2", num_perturbations=10)`
Compute DetectGPT curvature criterion.
**Input**:
- `text`: String to analyze
- `model_name`: HuggingFace model (default: "gpt2")
- `num_perturbations`: Number of random perturbations (default: 10)
**Output**:
```json
{
  "original_logprob": -2.34,
  "mean_perturbed_logprob": -3.12,
  "curvature": 0.78
}
```
**Usage**:
```python
from scripts.text_metrics import detectgpt_score
result = detectgpt_score("Your text here", num_perturbations=20)
if result['curvature'] > 0.5:
    print("Likely AI-generated")
```
### 3. `cohesion_bundle(text)`
Compute Coh-Metrix-inspired cohesion metrics.
**Input**:
- `text`: String to analyze
**Output**:
```json
{
  "connectives": {
    "additive_rate": 12.5,
    "temporal_rate": 8.3,
    "causal_rate": 6.2,
    "adversative_rate": 9.1,
    "total_rate": 36.1
  },
  "lexical_diversity": {
    "type_token_ratio": 0.62,
    "unique_lemmas": 142,
    "hapax_legomena": 78
  },
  "referential_cohesion": {
    "pronoun_rate": 45.2,
    "unique_pronoun_types": 12
  }
}
```
**Usage**:
```python
from scripts.text_metrics import cohesion_bundle
result = cohesion_bundle("Your text here")
diversity = result['lexical_diversity']['type_token_ratio']
print(f"Lexical diversity: {diversity:.2f}")
```
### 4. `full_analysis(text, model_name="gpt2", num_perturbations=10)`
Run all three analyses in one call.
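**Output**: Combined dictionary with the keys `token_ranks`, `curvature`, and `cohesion`, each holding the output of the corresponding function above.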
## Integration with AI-Check
When the ai-check skill invokes text-metrics:
```python
# ai-check calls text-metrics for Dimension 5
token_data = invoke_skill("text-metrics", "token_rank_hist", text)
curvature = invoke_skill("text-metrics", "detectgpt_score", text)
cohesion = invoke_skill("text-metrics", "cohesion_bundle", text)
# Use results in probability dimension scoring
if token_data['top10_pct'] > 40:
    flag_high_probability_concentration()
if curvature['curvature'] > 0.5:
    flag_curvature_anomaly()
```
## Model Selection
### Supported Models (HuggingFace)
- **gpt2** (default): Fast, lightweight, good baseline
- **gpt2-medium**: Better accuracy, slower
- **gpt2-large**: Best accuracy, requires GPU
- **facebook/opt-125m**: Alternative small model
- **facebook/opt-1.3b**: Alternative medium model
**Recommendation**: Use `gpt2` for speed and `gpt2-medium` for a better balance of accuracy and speed.
### GPU Acceleration
Text-metrics benefits significantly from GPU:
- **CPU**: ~2-5 seconds per 500-token document
- **GPU**: ~0.3-0.8 seconds per 500-token document
Install PyTorch with CUDA support for GPU acceleration.
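A small sketch for checking whether a CUDA device is visible before running the model-based functions. `pick_device` is a hypothetical helper; whether the functions in `scripts.text_metrics` accept a device argument or select one automatically is not stated here, so this only shows the check itself.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no GPU path to use.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running model inference on: {device}")
```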
## Performance Characteristics
| Function | Time (CPU) | Time (GPU) | Memory |
|----------|-----------|-----------|---------|
| `token_rank_hist` | 3-5s | 0.5-1s | ~1GB |
| `detectgpt_score` | 5-10s | 1-2s | ~1GB |
| `cohesion_bundle` | <0.1s | <0.1s | <100MB |
| `full_analysis` | 8-15s | 1.5-3s | ~1GB |
Times are approximate, for ~500-token documents with the `gpt2` model.
## Limitations
1. **Model Dependency**: Results depend on choice of language model. Mismatch between evaluation model and generation model affects accuracy.
2. **Short Text**: DetectGPT requires ~100+ tokens for reliable results. Token rank histograms need ~50+ tokens.
3. **Computational Cost**: Model inference is expensive. Cache results when possible.
4. **Perturbation Quality**: This skill uses simple word-replacement perturbations. Production systems should use mask-filling with a dedicated model, such as T5 in the original DetectGPT paper.
5. **Language**: Currently English-only. Multilingual models needed for other languages.
6. **Post-Editing**: Heavily edited AI text may evade detection as curvature flattens.
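Two of these limitations, short text and computational cost, can be handled with a length guard and a cache. The sketch below is one way to do it under stated assumptions: `expensive_analysis` is a hypothetical stand-in for a slow call such as `full_analysis`, and whitespace splitting is only a rough proxy for model tokens.

```python
from functools import lru_cache

MIN_TOKENS_RANK_HIST = 50    # thresholds taken from limitation 2 above
MIN_TOKENS_DETECTGPT = 100


def reliability_warnings(text):
    """Flag analyses whose input is shorter than the suggested minimum."""
    n_tokens = len(text.split())  # whitespace split as a rough token count
    warnings = []
    if n_tokens < MIN_TOKENS_RANK_HIST:
        warnings.append("token_rank_hist: fewer than ~50 tokens")
    if n_tokens < MIN_TOKENS_DETECTGPT:
        warnings.append("detectgpt_score: fewer than ~100 tokens")
    return warnings


def expensive_analysis(text, model_name):
    """Hypothetical stand-in for a slow model call such as full_analysis."""
    expensive_analysis.calls += 1
    return {"model": model_name, "n_chars": len(text)}


expensive_analysis.calls = 0


@lru_cache(maxsize=128)
def cached_analysis(text, model_name="gpt2"):
    # Identical (text, model_name) pairs are served from memory.
    return expensive_analysis(text, model_name)


cached_analysis("same document", "gpt2")
cached_analysis("same document", "gpt2")
print(expensive_analysis.calls)
print(reliability_warnings("too short"))
```

Because `lru_cache` returns the same dictionary object on every hit, callers should treat cached results as read-only.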
## Example Usage
### Standalone Analysis
```python
from scripts.text_metrics import full_analysis
text = """
Your sample text here. Should be at least 100 tokens
for reliable DetectGPT results.
"""
results = full_analysis(text, model_name="gpt2", num_perturbations=20)
print("Token Rank Histogram:")
print(f" Top-10: {results['token_ranks']['top10_pct']:.1f}%")
print(f" Top-100: {results['token_ranks']['top100_pct']:.1f}%")
print("\nDetectGPT Curvature:")
print(f" Curvature: {results['curvature']['curvature']:.2f}")
print("\nCohesion Metrics:")
print(f" Type-Token Ratio: {results['cohesion']['lexical_diversity']['type_token_ratio']:.2f}")
```