How to Wire the Wayback CDX Server Into Your Own Tooling
This is a complete, end‑to‑end walkthrough. You will learn how to query the CDX API, ingest and model the index, build higher‑level modules (Site Autopsy, Shadow Domain Mapper, Content Lineage), and let your systems reason over the ledger the archive has already collected.
What “Wiring CDX Into Your Tooling” Actually Means
“Wiring CDX” means turning the Wayback Machine’s index into a first‑class data source in your own systems. Instead of manually using the Wayback UI, your stack will:
- Query the CDX API over HTTP
- Receive CDX lines or JSON describing historical captures
- Normalize and store those records in your own database
- Build reusable modules on top of that data
- Expose those modules via services, dashboards, or analysis pipelines
The core CDX endpoint used by the public Wayback Machine looks like:
https://web.archive.org/cdx/search/cdx?url=<target>&[params...]
The CDX Server specification (captured in the archived docs) describes parameters such as: url, matchType, output, fl (field list), filter, collapse, and pagination mechanisms like resumption keys in some implementations.
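As a rough illustration of how those parameters combine into a request, the sketch below assembles a query URL with the standard library. The helper name `build_cdx_url` is hypothetical and not part of any CDX client library; it only covers the parameters named above.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, match_type="domain", output="json",
                  fields=None, filters=None, collapse=None):
    """Assemble a CDX query URL (hypothetical helper, for illustration only)."""
    params = [("url", url), ("matchType", match_type), ("output", output)]
    if fields:
        params.append(("fl", ",".join(fields)))
    for f in filters or []:
        # filter may be repeated; each occurrence narrows the result further
        params.append(("filter", f))
    if collapse:
        params.append(("collapse", collapse))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


query = build_cdx_url(
    "example.com",
    fields=["timestamp", "original"],
    filters=["statuscode:200", "mime:text/html"],
)
print(query)
```

Building the query as a list of pairs rather than a dict keeps repeated keys such as filter intact.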
Step 1 — Learn the API Like a Human First
Before writing code, you want tactile familiarity with the CDX API. You’ll use curl (or any HTTP client) to understand output formats, parameters, and how filters change behavior.
Run this in a shell:
curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=txt&filter=statuscode:200"
You should see space‑separated CDX lines. Fields generally include a URL key, timestamp, original URL, MIME, status code, digest, and length.
Run:
curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mime,statuscode,digest,length"
The first array element is the header row; subsequent arrays are values. This format is ideal for ingestion into your systems.
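A minimal sketch of that header-plus-rows shape, assuming the JSON body has already been downloaded as a string. The sample payload is made up for illustration, and `rows_to_dicts` is a hypothetical helper name.

```python
import json
from typing import Dict, List


def rows_to_dicts(payload: str) -> List[Dict[str, str]]:
    """Turn a CDX JSON body (header row, then value rows) into a list of dicts."""
    data = json.loads(payload)
    if not data:  # an empty list means no captures matched
        return []
    header, *rows = data
    return [dict(zip(header, row)) for row in rows]


# Illustrative payload, not real archive data
SAMPLE = (
    '[["timestamp","original","statuscode"],'
    '["20200101000000","http://example.com/","200"],'
    '["20200201000000","http://example.com/about","200"]]'
)

records = rows_to_dicts(SAMPLE)
print(records[0]["original"])
```

Note that every value arrives as a string, including status codes and lengths, so type conversion is left to the ingestion step.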
Filter on MIME and status:
curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mime,statuscode,digest&filter=mime:text/html&filter=statuscode:200"
Collapse by digest to see only unique content versions:
curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mime,statuscode,digest&collapse=digest"
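collapse=digest drops a capture only when its digest matches the line immediately before it, so the result is a sequence of content changes rather than a globally deduplicated set: effectively your “change spine” for the domain.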
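To make the collapse behavior concrete: as I understand the CDX server documentation, collapsing compares each line with the one before it, so a digest that reappears later is kept. The sketch below mimics that locally with `itertools.groupby`; the rows and digests are invented for illustration.

```python
from itertools import groupby
from typing import Dict, List


def collapse_adjacent(rows: List[Dict[str, str]], key: str = "digest") -> List[Dict[str, str]]:
    """Keep the first row of every run of adjacent rows sharing the same key value."""
    return [next(group) for _, group in groupby(rows, key=lambda r: r[key])]


# Invented captures of a single URL, already sorted by timestamp
captures = [
    {"timestamp": "20200101000000", "digest": "AAA"},
    {"timestamp": "20200201000000", "digest": "AAA"},
    {"timestamp": "20200301000000", "digest": "BBB"},
    {"timestamp": "20200401000000", "digest": "AAA"},
]

spine = collapse_adjacent(captures)
print([c["timestamp"] for c in spine])
```

The fourth capture survives even though its digest matches the first, because the run was interrupted by a different version in between.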
Step 2 — Write Clients for CDX
Next you’ll create small, reusable clients that hide CDX’s parameter details behind simple functions. We’ll do one in Python for analysis pipelines and one in Node/TypeScript for web tooling and services.
Create a file cdx_client.py with:
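import requests
from typing import List, Dict, Any

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def fetch_cdx(
    url: str,
    match_type: str = "domain",
    fields: str = "timestamp,original,mime,statuscode,digest,length",
    filters: List[str] | None = None,
    collapse: str | None = None,
    limit: int = 10000,
    resume_key: str | None = None,
) -> Dict[str, Any]:
    """Fetch one page of CDX rows as dicts, plus a resume key if the server returns one."""
    params: Dict[str, Any] = {
        "url": url,
        "matchType": match_type,
        "output": "json",
        "fl": fields,
        "limit": limit,
        "showResumeKey": "true",  # ask the server to append a resume key to the body
    }
    if filters:
        params["filter"] = list(filters)  # requests repeats the key once per value
    if collapse:
        params["collapse"] = collapse
    if resume_key:
        params["resumeKey"] = resume_key  # supported by some CDX server implementations

    resp = requests.get(CDX_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if not data:  # no captures matched: the body is an empty list
        return {"headers": [], "rows": [], "resume_key": None}

    headers, *rows = data
    # Header lookup kept as a fallback for implementations that expose the key that way
    next_key = resp.headers.get("X-Resume-Key") or None
    # With showResumeKey, the public server ends the body with an empty row
    # followed by a one-element row that holds the key
    if len(rows) >= 2 and rows[-2] == [] and len(rows[-1]) == 1:
        next_key = rows[-1][0]
        rows = rows[:-2]

    normalized = [dict(zip(headers, row)) for row in rows]
    return {
        "headers": headers,
        "rows": normalized,
        "resume_key": next_key,
    }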
To paginate large domains, add:
def fetch_all_cdx(url: str, **kwargs) -> List[Dict[str, Any]]:
    """Follow resume keys until the server stops returning one."""
    all_rows: List[Dict[str, Any]] = []
    resume_key: str | None = None
    while True:
        batch = fetch_cdx(url=url, resume_key=resume_key, **kwargs)
        rows = batch["rows"]
        all_rows.extend(rows)
        resume_key = batch["resume_key"]
        if not resume_key or not rows:
            break
    return all_rows
For browser‑based tools, you generally proxy CDX through your backend to avoid CORS/rate‑limit issues. Create a small Express service:
import express from "express";
import fetch from "node-fetch";

const app = express();
const CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx";

app.get("/api/cdx", async (req, res) => {
  try {
    const url = req.query.url as string;
    if (!url) return res.status(400).json({ error: "url is required" });

    const params = new URLSearchParams({
      url,
      matchType: (req.query.matchType as string) || "domain",
      output: "json",
      fl: (req.query.fl as string) || "timestamp,original,mime,statuscode,digest,length",
      limit: (req.query.limit as string) || "5000",
    });

    // Accept one or many ?filters= values and forward each as a CDX filter
    if (req.query.filters) {
      const filters = Array.isArray(req.query.filters)
        ? (req.query.filters as string[])
        : [req.query.filters as string];
      for (const f of filters) params.append("filter", f);
    }
    if (req.query.collapse) params.append("collapse", req.query.collapse as string);

    const resp = await fetch(`${CDX_ENDPOINT}?${params.toString()}`);
    if (!resp.ok) return res.status(resp.status).json({ error: "CDX upstream error" });

    // First array is the header row; map every other row onto it
    const json = await resp.json();
    const [headers, ...rows] = json;
    const normalized = rows.map((row: string[]) =>
      Object.fromEntries(headers.map((h: string, i: number) => [h, row[i]]))
    );
    res.json({ headers, rows: normalized });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: "Internal error" });
  }
});

app.listen(3000, () => console.log("CDX proxy running on :3000"));
Your front‑end can now call /api/cdx?url=example.com&matchType=domain&collapse=digest.
Step 3 — Turn the Stream Into a First‑Class Dataset
To build durable modules, you need CDX data in a database that supports temporal and domain‑centric queries. A relational model works well; you can always layer search/graph systems on top later.
For PostgreSQL‑style systems, a solid baseline:
CREATE TABLE cdx_capture (
    id BIGSERIAL PRIMARY KEY,
    url_key TEXT NOT NULL,
    original TEXT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    mime TEXT,
    statuscode INT,
    digest TEXT,
    length BIGINT,
    host TEXT,
    domain TEXT,
    path TEXT,
    query TEXT,
    source_label TEXT NOT NULL DEFAULT 'web.archive.org'
);
CREATE INDEX idx_cdx_domain_time ON cdx_capture (domain, timestamp);
CREATE INDEX idx_cdx_host_time ON cdx_capture (host, timestamp);
CREATE INDEX idx_cdx_digest ON cdx_capture (digest);
CREATE INDEX idx_cdx_status ON cdx_capture (statuscode);
-- Without a unique index, ON CONFLICT DO NOTHING never fires and re-ingesting duplicates rows.
-- md5(original) keeps very long URLs within the btree index size limit.
CREATE UNIQUE INDEX idx_cdx_unique_capture ON cdx_capture (md5(original), timestamp, digest);
url_key can store the SURT key if available; domain, host, path, and query are derived from original.
Create cdx_ingest.py:
from urllib.parse import urlparse
from datetime import datetime, timezone
from typing import Dict, Any, List, Optional

import psycopg2  # provides the connection object passed to ingest_rows


def parse_domain(host: str) -> str:
    # Naive: keeps the last two labels, so multi-part suffixes such as co.uk are misread
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def to_ts(ts_str: str) -> datetime:
    # CDX timestamps are UTC in YYYYMMDDhhmmss form; attach the zone for TIMESTAMPTZ
    padded = ts_str.ljust(14, "0")
    return datetime.strptime(padded, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def to_int(value: Any) -> Optional[int]:
    # CDX writes "-" for fields with no value, which int() would reject
    text = str(value or "")
    return int(text) if text.isdigit() else None


def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    original = row.get("original", "")
    parsed = urlparse(original)
    host = parsed.hostname or ""
    return {
        "url_key": row.get("urlkey") or row.get("url_key") or "",
        "original": original,
        "timestamp": to_ts(row["timestamp"]),
        "mime": row.get("mime") or row.get("mimetype"),
        "statuscode": to_int(row.get("statuscode")),
        "digest": row.get("digest"),
        "length": to_int(row.get("length")),
        "host": host,
        "domain": parse_domain(host),
        "path": parsed.path or "/",
        "query": parsed.query or "",
    }


def ingest_rows(conn, rows: List[Dict[str, Any]]):
    with conn.cursor() as cur:
        for raw in rows:
            r = normalize_row(raw)
            cur.execute("""
                INSERT INTO cdx_capture
                    (url_key, original, timestamp, mime, statuscode, digest, length, host, domain, path, query)
                VALUES
                    (%(url_key)s, %(original)s, %(timestamp)s, %(mime)s, %(statuscode)s, %(digest)s, %(length)s,
                     %(host)s, %(domain)s, %(path)s, %(query)s)
                ON CONFLICT DO NOTHING;
            """, r)
    conn.commit()
Wire it together with the client from Step 2 to ingest a full domain’s index in batches, respecting any pagination/resumption behaviors.
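One way to keep batches bounded is to slice the fetched rows into fixed-size chunks and ingest each chunk separately. The helper below is a hypothetical sketch (the name `chunked` and the batch size are assumptions), shown without a database so it runs on its own.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


batches = list(chunked(range(5), 2))
print(batches)
```

With the functions from this walkthrough, the loop would look like `for batch in chunked(rows, 1000): ingest_rows(conn, batch)`, so each batch is committed on its own and a failure late in a large domain does not discard earlier work.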
Step 4 — Modules on Top of the CDX Ledger
With CDX indexed locally, you can now build higher‑level views — the modules that actually feel like tools. These are patterns, not products; you can integrate them into CLIs, services, or SPAs.
Site Autopsy
Step A — Ingest the domain index
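import psycopg2

from cdx_client import fetch_all_cdx
from cdx_ingest import ingest_rows

rows = fetch_all_cdx(
    url="example.com",
    match_type="domain",
    fields="timestamp,original,mime,statuscode,digest,length",
    filters=["statuscode:200"],
)

# Placeholder DSN: point this at the database that holds cdx_capture
conn = psycopg2.connect("dbname=cdx")
ingest_rows(conn, rows)
conn.close()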
Step B — Build a change spine per digest
SELECT DISTINCT ON (digest)
    digest, domain, original, timestamp, mime, statuscode
FROM cdx_capture
WHERE domain = 'example.com'
ORDER BY digest, timestamp;
Step C — Aggregate “version history”
SELECT
    digest,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen,
    COUNT(*) AS capture_count,
    ARRAY_AGG(DISTINCT statuscode) AS statuses
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY digest
ORDER BY first_seen;
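For rows that are still in memory (for example straight from the Python client), roughly the same version-history aggregation can be sketched without a database. The function name and sample rows are illustrative assumptions; it relies on 14-digit CDX timestamps sorting chronologically as plain strings.

```python
from typing import Dict, Iterable, List


def version_history(rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Aggregate captures per digest: first seen, last seen, and capture count."""
    history: Dict[str, Dict[str, object]] = {}
    for r in rows:
        ts = r["timestamp"]
        entry = history.setdefault(
            r["digest"],
            {"digest": r["digest"], "first_seen": ts, "last_seen": ts, "capture_count": 0},
        )
        entry["first_seen"] = min(entry["first_seen"], ts)
        entry["last_seen"] = max(entry["last_seen"], ts)
        entry["capture_count"] += 1
    return sorted(history.values(), key=lambda e: e["first_seen"])


# Invented sample rows
sample = [
    {"timestamp": "20210301000000", "digest": "BBB"},
    {"timestamp": "20200101000000", "digest": "AAA"},
    {"timestamp": "20200601000000", "digest": "AAA"},
]

timeline = version_history(sample)
print(timeline)
```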
Visualize this timeline with your plotting stack of choice (or a front‑end SPA): time on the x‑axis, digest versions as nodes, anomalies highlighted by status or capture density.
Shadow Domain Mapper
Step A — Group by host
SELECT
    host,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen,
    COUNT(*) AS capture_count
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY host
ORDER BY first_seen;
Step B — Flag transient hosts
SELECT * FROM (
    SELECT
        host,
        MIN(timestamp) AS first_seen,
        MAX(timestamp) AS last_seen,
        COUNT(*) AS capture_count
    FROM cdx_capture
    WHERE domain = 'example.com'
    GROUP BY host
) h
WHERE capture_count < 20
ORDER BY capture_count ASC;
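The same flagging logic can be sketched in Python, adding a lifespan in days so that short-lived hosts stand out. The threshold of 20 captures mirrors the query above; the function name and sample rows are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def transient_hosts(rows: Iterable[Dict[str, str]], max_captures: int = 20) -> List[Dict[str, object]]:
    """Return hosts with fewer than max_captures captures, fewest first."""
    seen = defaultdict(list)
    for r in rows:
        seen[r["host"]].append(datetime.strptime(r["timestamp"], "%Y%m%d%H%M%S"))
    flagged = [
        {
            "host": host,
            "capture_count": len(stamps),
            "lifespan_days": (max(stamps) - min(stamps)).days,
        }
        for host, stamps in seen.items()
        if len(stamps) < max_captures
    ]
    return sorted(flagged, key=lambda h: h["capture_count"])


# Invented sample rows
sample_hosts = [
    {"host": "www.example.com", "timestamp": "20200101000000"},
    {"host": "www.example.com", "timestamp": "20200201000000"},
    {"host": "www.example.com", "timestamp": "20200301000000"},
    {"host": "beta.example.com", "timestamp": "20200105000000"},
    {"host": "beta.example.com", "timestamp": "20200115000000"},
]

flagged_hosts = transient_hosts(sample_hosts, max_captures=3)
print(flagged_hosts)
```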
Render this into a timeline or graph view to reveal ephemeral infra, forgotten subdomains, and legacy endpoints.
Content Lineage
Step A — Choose a digest from Site Autopsy output.
Step B — Query across all domains
SELECT
    digest,
    domain,
    host,
    original,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen,
    COUNT(*) AS capture_count
FROM cdx_capture
WHERE digest = $1
GROUP BY digest, domain, host, original
ORDER BY first_seen;
Use this result to build a propagation graph: nodes are domains, edges connect domains where the same digest appears over time.
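One possible reading of that graph, sketched below: order the domains by when they first showed the digest and connect each domain to the next one in time. Chaining consecutive domains is an assumption made for this example; other edge rules (such as linking every later domain back to the earliest) are equally defensible. The rows are invented and assumed to share a single digest.

```python
from typing import Dict, Iterable, List, Tuple


def propagation_edges(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Chain domains in order of their first capture of one digest."""
    first_seen: Dict[str, str] = {}
    for r in rows:
        domain, ts = r["domain"], r["timestamp"]
        if domain not in first_seen or ts < first_seen[domain]:
            first_seen[domain] = ts
    # Sort by first appearance, breaking ties by name for stable output
    ordered = sorted(first_seen, key=lambda d: (first_seen[d], d))
    return list(zip(ordered, ordered[1:]))


# Invented rows that all carry the same digest
lineage_rows = [
    {"domain": "origin.example", "timestamp": "20190101000000"},
    {"domain": "mirror.example", "timestamp": "20200101000000"},
    {"domain": "origin.example", "timestamp": "20210101000000"},
    {"domain": "copy.example", "timestamp": "20220101000000"},
]

edges = propagation_edges(lineage_rows)
print(edges)
```

An identical digest shows that the bytes matched, not who copied from whom, so treat the edge direction as a hypothesis to investigate rather than proof of origin.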
Step A — Filter API‑like captures
-- api.example.com is a host; the derived domain column only holds 'example.com'
SELECT * FROM cdx_capture
WHERE host = 'api.example.com'
  AND mime LIKE 'application/json%'
  AND path LIKE '%/api/%';
Step B — Collapse by digest per endpoint
SELECT
    original,
    digest,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen,
    COUNT(*) AS capture_count
FROM cdx_capture
WHERE host = 'api.example.com'
GROUP BY original, digest
ORDER BY original, first_seen;
Then fetch representative JSON from Wayback replay URLs and diff them (e.g., with deepdiff) to surface breaking changes.
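A minimal sketch of that step, assuming the two JSON bodies have already been downloaded and parsed. The `id_` suffix on the timestamp asks Wayback for the raw capture rather than the rewritten replay page. The top-level key comparison here is a simplified stand-in for deepdiff, not its API, and both helper names are hypothetical.

```python
from typing import Any, Dict, List


def replay_url(timestamp: str, original: str) -> str:
    """Build a Wayback URL that serves the capture body as archived."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def diff_top_level(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare two JSON objects by top-level key only (nested changes show up as 'changed')."""
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Invented API responses from two captures of the same endpoint
v1 = {"id": 1, "name": "widget", "price": 10}
v2 = {"id": 1, "name": "widget", "cost": {"amount": 10}, "price": "10.00"}

report = diff_top_level(v1, v2)
print(replay_url("20200101000000", "http://api.example.com/api/items/1"))
print(report)
```

Removed keys and type changes (here `price` moving from a number to a string) are the usual signs of a breaking change for API consumers.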
Step A — Bucket by month
SELECT
    date_trunc('month', timestamp) AS bucket,
    COUNT(*) AS capture_count
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY bucket
ORDER BY bucket;
Normalize and render as a heatmap or time‑series chart in your UI; spikes and gaps are investigation hooks.
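As a sketch of the normalization step, the function below buckets raw CDX timestamps by month, fills months with no captures with zero so gaps stay visible, and scales every bucket by the busiest month. Dividing by the peak is only one possible normalization, and the function name and sample timestamps are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def monthly_density(timestamps: Iterable[str]) -> Dict[str, float]:
    """Map each YYYYMM between the first and last capture to count / peak count."""
    counts = Counter(ts[:6] for ts in timestamps)
    if not counts:
        return {}
    peak = max(counts.values())
    first, last = min(counts), max(counts)
    year, month = int(first[:4]), int(first[4:6])
    density: Dict[str, float] = {}
    while True:
        key = f"{year:04d}{month:02d}"
        if key > last:
            break
        density[key] = counts.get(key, 0) / peak  # empty months become 0.0
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return density


# Invented timestamps with a gap in February
density = monthly_density(["20200115000000", "20200120000000", "20200301000000"])
print(density)
```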
Step 5 — Let Your Systems Think With CDX
Once the modules exist, you can move beyond visualizations into reasoning patterns – functions that compute stability, suspicion, or narrative drift scores over your CDX‑backed dataset.
Define a stability score as:
-- stability_score = distinct_digests / total_captures
SELECT
    original,
    COUNT(DISTINCT digest) AS digest_count,
    COUNT(*) AS capture_count,
    COUNT(DISTINCT digest)::float / COUNT(*)::float AS stability_score
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY original
ORDER BY stability_score ASC;
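Low score → stable, rarely changing pages. High score → volatile or frequently edited resources. Despite its name, the score rises with volatility, and a URL captured only once always scores 1.0, so filter out low capture counts (for example with HAVING COUNT(*) > 1) before ranking.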
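The same ratio can be computed in Python. This sketch adds a minimum capture count, because a URL with a single capture scores 1.0 by construction and would otherwise look maximally volatile. The `min_captures` parameter, the function name, and the sample rows are assumptions added for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def stability_scores(rows: Iterable[Dict[str, str]], min_captures: int = 2) -> Dict[str, float]:
    """Return distinct_digests / total_captures per URL, skipping thinly captured URLs."""
    digests = defaultdict(set)
    captures = defaultdict(int)
    for r in rows:
        digests[r["original"]].add(r["digest"])
        captures[r["original"]] += 1
    return {
        url: len(digests[url]) / captures[url]
        for url in captures
        if captures[url] >= min_captures
    }


# Invented rows: one page captured four times with two versions, one captured once
score_rows = [
    {"original": "http://example.com/", "digest": "AAA"},
    {"original": "http://example.com/", "digest": "AAA"},
    {"original": "http://example.com/", "digest": "AAA"},
    {"original": "http://example.com/", "digest": "BBB"},
    {"original": "http://example.com/once", "digest": "CCC"},
]

scores = stability_scores(score_rows)
print(scores)
```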
Combine status volatility and lifespan:
SELECT
    host,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen,
    COUNT(*) AS capture_count,
    COUNT(DISTINCT statuscode) AS status_variants
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY host
ORDER BY status_variants DESC, capture_count ASC;
Hosts with many different status codes and short capture histories can be flagged as suspicious or ephemeral infra.
For pages like /about, /policy, or /terms, track how often content changes:
SELECT
    path,
    COUNT(DISTINCT digest) AS versions,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen
FROM cdx_capture
WHERE domain = 'example.com'
  AND path IN ('/about', '/policy', '/terms')
GROUP BY path
ORDER BY versions DESC;
Highly edited “narrative” pages get special attention in governance, legal, or media forensics workflows.
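To see when the editing happened rather than only how often, one option is to count digest changes per year for a single path. The sketch below treats a change as any capture whose digest differs from the capture before it; the function name and sample rows are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def edits_per_year(rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count, per year, captures whose digest differs from the previous capture."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    counts: Counter = Counter()
    previous = None
    for r in ordered:
        # The first capture establishes a baseline and is not counted as an edit
        if previous is not None and r["digest"] != previous:
            counts[r["timestamp"][:4]] += 1
        previous = r["digest"]
    return dict(counts)


# Invented captures of a single /terms page
terms_rows = [
    {"timestamp": "20190601000000", "digest": "AAA"},
    {"timestamp": "20200101000000", "digest": "BBB"},
    {"timestamp": "20200301000000", "digest": "BBB"},
    {"timestamp": "20200901000000", "digest": "CCC"},
    {"timestamp": "20210101000000", "digest": "CCC"},
    {"timestamp": "20210601000000", "digest": "AAA"},
]

edits = edits_per_year(terms_rows)
print(edits)
```

Because captures are sporadic, these counts are a lower bound: edits made and reverted between two captures never appear in the index.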
Step 6 — The Complete Wiring Plan
Here is the condensed path from nothing to a CDX‑native stack that pulls the index, builds modules, and lets your systems reason over the archive’s ledger.
| Stage | Action | Output |
|---|---|---|
| 1. Exploration | Use curl to query CDX with various params. | Hands‑on understanding of fields and filters. |
| 2. Clients | Implement Python and/or Node clients. | Reusable functions to fetch JSON CDX. |
| 3. Storage | Create cdx_capture schema + indexes. | CDX becomes queryable, persistent data. |
| 4. Ingestion | Normalize and load domains into DB. | Historical index mirrored locally. |
| 5. Modules | Build Site Autopsy, Shadow Mapper, etc. | Higher‑level, reusable analysis surfaces. |
| 6. Reasoning | Add scoring and drift detection. | Systems can “think” with CDX signals. |
| 7. Exposure | Expose as APIs, CLIs, or dashboards. | Human + machine access to the ledger. |