Claude
Skills
Sign in
Back

wrds

Included with Lifetime
$97 forever

Use when "query WRDS", "pull SEC filings", "access Compustat/CRSP/ExecuComp/Capital IQ", "Form 4 insider data", "ISS governance/compensation", "TAQ intraday/NBBO", "SDC M&A or new issues", "FISD bonds", "Form D/ADV", "FJC court data", or any WRDS PostgreSQL query or SAS ETL on the WRDS grid (qsub/qsas/SGE).

Generalscripts

What this skill does


## Contents

- [WRDS Login Node Enforcement](#wrds-login-node-enforcement)
- [Query Enforcement](#query-enforcement)
- [SAS ETL Enforcement](#sas-etl-enforcement)
- [Quick Reference: Table Names](#quick-reference-table-names)
- [Connection](#connection)
- [Critical Filters](#critical-filters)
- [Parameterized Queries](#parameterized-queries)
- [Additional Resources](#additional-resources)

## WRDS Login Node Enforcement

### IRON LAW: NEVER RUN COMPUTE ON THE WRDS LOGIN NODE

<EXTREMELY-IMPORTANT>
The WRDS login node is shared infrastructure. Running parsers, bulk file reads, SAS jobs, or any process taking >30 seconds on the login node will get the account flagged.

**ALWAYS** write an SGE submission script and submit via `qsub`. No exceptions.

- `ssh wrds 'cat files.tsv | ./parser > output.tsv'` → **WRONG. Use qsub.**
- `ssh wrds 'nohup ./process &'` → **WRONG. Still the login node. Use qsub.**
- `ssh wrds 'python3 bulk_process.py'` → **WRONG. Use qsub.**
- `qsub -t 1-20 submit.sh` → **CORRECT.**

The login node is for: `qsub`, `qstat`, `qdel`, `scp`, `ls`, `head`, short `psql` queries.

See `references/constraints/wrds-sge-enforcement.md` for the full pattern and existing examples (quorum parser, state-of-incorp parser, SAS pipeline).
</EXTREMELY-IMPORTANT>

**Running compute on the login node is NOT HELPFUL — it gets the user's account flagged, the job killed, and the work lost.** You run on the login node because qsub feels like overhead. The overhead is 5 minutes of script writing. The downside is account suspension and a rerun from scratch.

### Rationalization Table — Login Node & Existing Infrastructure

| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "It's a quick test, just one file" | One file becomes 100K when you forget to change the command | Write the SGE script first, test with `-t 1-1` |
| "nohup makes it background, so it's fine" | nohup is still the login node — same shared CPU | qsub, not nohup |
| "I'll run the real job via qsub later" | You'll forget. The 'test' run is the one that flags the account | qsub from the start |
| "It only takes 30 seconds" | You don't know that until it runs. 173K filings over NFS is not 30 seconds | If in doubt, qsub |
| "The quorum parser ran fine on the login node last time" | It didn't — you got lucky, or it was killed silently | Look at how the quorum parser ACTUALLY runs: submit_quorum.sh |
| "I need a new Go binary for this extraction" | `scan_covers` already has a profile-based framework with SGE, concurrency, and path handling | Add a profile to `scripts/scan_covers/`, don't create a standalone binary |
| "I'll build the path logic myself" | You'll get the `wrds_clean_filings` directory structure wrong | Read `references/edgar.md` — `cik_int.zfill(10)[:6]/{cik_int}/{accession}.txt` |
| "This parser is different enough to need its own binary" | It isn't. `scan_covers` profiles handle header extraction, body parsing, and custom extractors | Add a `profiles_*.go` file. If you need custom logic, use the `Custom` field type |

### Red Flags — STOP Immediately If You're About To:

- **Write `ssh wrds '... | ./binary > output'`** → STOP. That's login-node compute. Write a submit script.
- **Write `ssh wrds 'nohup ... &'`** → STOP. nohup doesn't change the node. Use qsub.
- **Write `ssh wrds 'python3 ...'` for anything that reads >10 files** → STOP. Use qsub.
- **Skip reading `references/edgar.md` before building a new WRDS file parser** → STOP. The path conventions, SGE patterns, and existing parsers are already documented. Read them first.
- **Create a new standalone Go binary for EDGAR extraction** → STOP. `scripts/scan_covers/` is a generic profile-based framework. Add a `profiles_*.go` file, not a new binary. The framework handles SGE sharding, path construction, concurrency, and form-type filtering.
- **Build a new Go/Python parser without checking `scripts/scan_covers/`** → STOP. This framework exists precisely so you don't reinvent extraction infrastructure. Every standalone parser is technical debt that should have been a profile.

### IRON LAW: USE SCAN_COVERS, NOT STANDALONE BINARIES

<EXTREMELY-IMPORTANT>
Before writing ANY new EDGAR filing extractor:

1. **Read `scripts/scan_covers/`** — generic profile-based Go framework with SGE, concurrency, path handling
2. **Add a `profiles_*.go` file** — not a standalone binary. The Profile struct supports pattern-based fields AND custom extractors (set `FullBody: true` for body-text searches like prospectus 485 filings — see `profiles_proxy_advisors.go`)
3. **Read `references/edgar.md`** — path conventions, existing profiles, SGE submission patterns

**Building a standalone parser when `scan_covers` exists is NOT HELPFUL — it reinvents infrastructure that already handles SGE sharding, NFS concurrency, path construction, form-type filtering, and error handling.** You built a 300-line standalone Go binary, ran it on the login node, got the path convention wrong, and spent 5 iterations fixing it. Adding a 60-line profile to `scan_covers` would have worked on the first try.

Every standalone EDGAR parser is technical debt. The `scan_covers` framework exists to eliminate this class of mistake.
</EXTREMELY-IMPORTANT>

# WRDS Data Access

WRDS (Wharton Research Data Services) provides academic research data via PostgreSQL at `wrds-pgdata.wharton.upenn.edu:9737`.

## Query Enforcement

### IRON LAW: NO QUERY WITHOUT FILTER VALIDATION FIRST

Before executing ANY WRDS query, you MUST:
1. **IDENTIFY** what filters are required for this dataset
2. **VALIDATE** the query includes those filters
3. **VERIFY** parameterized queries (never string formatting)
4. **EXECUTE** the query
5. **INSPECT** a sample of results before claiming success

This is not negotiable. Skipping sample inspection is NOT HELPFUL — the user builds analysis on data with undetected quality problems.

### Rationalization Table - STOP If You Think:

| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "I'll add filters later" | You'll forget and pull bad data | Add filters NOW, before execution |
| "User didn't specify filters" | Standard filters are ALWAYS required | Apply Critical Filters section defaults |
| "Just a quick test query" | Test queries with bad filters teach bad patterns | Use production filters even for tests |
| "I'll let the user filter in pandas" | Pulling millions of unnecessary rows wastes time/memory | Filter at database level FIRST |
| "The query worked, so it's correct" | Query success ≠ data quality | INSPECT sample for invalid records |
| "I can use f-strings for simple queries" | SQL injection risk + wrong type handling | ALWAYS use parameterized queries |

### Red Flags - STOP Immediately If You Think:

- "Let me run this query quickly to see what's there" → NO. Check Critical Filters section first.
- "I'll just pull everything and filter later" → NO. Database-level filtering is mandatory.
- "The table name is obvious from the request" → NO. Check Quick Reference section for exact names.
- "I can inspect the data after the user sees it" → NO. Sample inspection BEFORE claiming success.

### Query Validation Checklist

Before EVERY query execution:

**For Compustat queries (comp.funda, comp.fundq):**
- [ ] Includes `indfmt = 'INDL'`
- [ ] Includes `datafmt = 'STD'`
- [ ] Includes `popsrc = 'D'`
- [ ] Includes `consol = 'C'`
- [ ] Uses parameterized queries for variables
- [ ] Date range is explicitly specified

**For CRSP v2 queries (crsp.dsf_v2, crsp.msf_v2):**
- [ ] Post-query filter: `sharetype == 'NS'`
- [ ] Post-query filter: `securitytype == 'EQTY'`
- [ ] Post-query filter: `securitysubtype == 'COM'`
- [ ] Post-query filter: `usincflg == 'Y'`
- [ ] Post-query filter: `issuertype.isin(['ACOR', 'CORP'])`
- [ ] Uses parameterized queries

**For Form 4 queries (tr_insiders.table1):**
- [ ] Transaction type filter specified (acqdisp)
- [ ] Transaction codes specified (trancode)
- [ ] Date range is explicitly specified
-
Files: 112
Size: 7652.1 KB
Complexity: 87/100
Category: General

Related in General