Wiring the Wayback CDX Server Into Your Own Tooling
WAYBACK MACHINE // PRACTICAL WIRING GUIDE

How to Wire the Wayback CDX Server Into Your Own Tooling

Full tutorial · From raw CDX → intel modules · Stack‑agnostic, analyst‑grade

This is a complete, end‑to‑end walkthrough. You will learn how to query the CDX API, ingest and model the index, build higher‑level modules (Site Autopsy, Shadow Domain Mapper, Content Lineage), and let your systems reason over the ledger the archive has already collected.

Understand the Target First

What “Wiring CDX Into Your Tooling” Actually Means

“Wiring CDX” means turning the Wayback Machine’s index into a first‑class data source in your own systems. Instead of manually using the Wayback UI, your stack will:

  • Query the CDX API over HTTP
  • Receive CDX lines or JSON describing historical captures
  • Normalize and store those records in your own database
  • Build reusable modules on top of that data
  • Expose those modules via services, dashboards, or analysis pipelines

The core CDX endpoint used by the public Wayback Machine looks like:

https://web.archive.org/cdx/search/cdx?url=<target>&[params...]

The CDX Server documentation (preserved in the archive itself) describes parameters such as url, matchType, output, fl (field list), filter, collapse, and limit, plus pagination through resumption keys (showResumeKey and resumeKey) on servers that support them.
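As a rough illustration of how these parameters combine, the sketch below assembles a query URL with Python's standard library. The helper name build_cdx_url is hypothetical and the parameter values are only examples; each filter condition becomes its own filter= parameter.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_url(target, match_type="domain", fields=None, filters=None, collapse=None):
    """Assemble a CDX query URL (hypothetical helper, for illustration only)."""
    params = [("url", target), ("matchType", match_type), ("output", "json")]
    if fields:
        params.append(("fl", ",".join(fields)))
    for condition in filters or []:
        params.append(("filter", condition))  # one filter= parameter per condition
    if collapse:
        params.append(("collapse", collapse))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

print(build_cdx_url("example.com", fields=["timestamp", "original"], filters=["statuscode:200"]))
```

The commas and colons come out percent-encoded, which is still a valid query string.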

Goal of this tutorial
By the end, you should be able to pull the index, store it, build modules on top of it, and run reasoning patterns over the ledger.

CLI + Inspection

Step 1 — Learn the API Like a Human First

Before writing code, you want tactile familiarity with the CDX API. You’ll use curl (or any HTTP client) to understand output formats, parameters, and how filters change behavior.

1.1 Basic domain‑wide query

Run this in a shell:

curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=txt&filter=statuscode:200"

You should see space‑separated CDX lines. By default each line carries seven fields in this order: urlkey, timestamp (14 digits, YYYYMMDDhhmmss), original, mimetype, statuscode, digest, and length.
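To make the text format concrete, here is a minimal parsing sketch. It assumes the default field order (urlkey, timestamp, original, mimetype, statuscode, digest, length); the function name and the sample line are made up for illustration and do not describe a real capture.

```python
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"]

def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict keyed by field name."""
    values = line.strip().split(" ")
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(values)}")
    return dict(zip(CDX_FIELDS, values))

# Made-up sample line, shaped like CDX output but not an actual capture.
sample = "com,example)/ 20200101000000 http://example.com/ text/html 200 AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH 1234"
record = parse_cdx_line(sample)
print(record["timestamp"], record["statuscode"])
```

If you pass a custom fl list in the query, the field order follows that list instead, so CDX_FIELDS would need to match it.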

1.2 Get structured JSON with specific fields

Run:

curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mimetype,statuscode,digest,length"

The first array element is the header row; subsequent arrays are values. This format is ideal for ingestion into your systems.

1.3 Apply filters and collapsing

Filter on MIME and status:

curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mimetype,statuscode,digest&filter=mimetype:text/html&filter=statuscode:200"

Collapse by digest to see only unique content versions:

curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mimetype,statuscode,digest&collapse=digest"

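This gives a deduplicated sequence of distinct contents: effectively your “change spine” for the domain. Note that collapse only merges adjacent rows that share a digest, so a digest that reappears later (for example after a page is reverted) shows up again rather than being removed globally.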
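The adjacent-only behavior of collapsing is easy to reproduce locally. The sketch below is a stand-in for what collapse=digest does on the server, not the server's actual code; the rows are invented sample data.

```python
from itertools import groupby

def collapse_adjacent(rows, field="digest"):
    """Keep the first row of each run of consecutive rows sharing `field`."""
    return [next(group) for _, group in groupby(rows, key=lambda r: r[field])]

rows = [
    {"timestamp": "20200101000000", "digest": "A"},
    {"timestamp": "20200201000000", "digest": "A"},
    {"timestamp": "20200301000000", "digest": "B"},
    {"timestamp": "20200401000000", "digest": "A"},
]
spine = collapse_adjacent(rows)
print([r["digest"] for r in spine])  # ['A', 'B', 'A']
```

Digest A appears twice in the result because the page changed to B and then back; that revert is exactly the kind of event a change spine should preserve.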

Python + Node

Step 2 — Write Clients for CDX

Next you’ll create small, reusable clients that hide CDX’s parameter details behind simple functions. We’ll do one in Python for analysis pipelines and one in Node/TypeScript for web tooling and services.

2.1 Python client (analysis & pipelines)

Create a file cdx_client.py with:

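import requests
from typing import List, Dict, Any

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def fetch_cdx(
    url: str,
    match_type: str = "domain",
    fields: str = "timestamp,original,mimetype,statuscode,digest,length",
    filters: List[str] | None = None,
    collapse: str | None = None,
    limit: int = 10000,
    resume_key: str | None = None,
) -> Dict[str, Any]:
    params: Dict[str, Any] = {
        "url": url,
        "matchType": match_type,
        "output": "json",
        "fl": fields,
        "limit": limit,
        # Ask the server to append a resumption key when more results remain.
        "showResumeKey": "true",
    }
    if filters:
        # requests encodes a list value as repeated filter=... parameters.
        params["filter"] = list(filters)
    if collapse:
        params["collapse"] = collapse
    if resume_key:
        params["resumeKey"] = resume_key  # supported by some CDX server implementations

    resp = requests.get(CDX_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json() if resp.text.strip() else []

    # Assumption: with showResumeKey, the JSON body ends with an empty array
    # followed by a one-element array holding the key.
    next_key = None
    if len(data) >= 2 and data[-2] == [] and len(data[-1]) == 1:
        next_key = data[-1][0]
        data = data[:-2]
    # Fallback for implementations that return the key in a response header.
    next_key = next_key or resp.headers.get("X-Resume-Key") or None

    if not data:
        return {"headers": [], "rows": [], "resume_key": next_key}
    headers, *rows = data
    normalized = [dict(zip(headers, row)) for row in rows]
    return {"headers": headers, "rows": normalized, "resume_key": next_key}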

To paginate large domains, add:

def fetch_all_cdx(url: str, **kwargs) -> List[Dict[str, Any]]:
    all_rows: List[Dict[str, Any]] = []
    resume_key: str | None = None
    while True:
        batch = fetch_cdx(url=url, resume_key=resume_key, **kwargs)
        rows = batch["rows"]
        all_rows.extend(rows)
        resume_key = batch["resume_key"]
        # Stop when the server returns no further key or an empty page.
        if not resume_key or not rows:
            break
    return all_rows

2.2 Node/TypeScript backend proxy (for SPAs)

For browser‑based tools, you generally proxy CDX through your backend so the browser never runs into CORS restrictions and so you can cache responses and throttle requests in one place. Create a small Express service:

import express from "express";
import fetch from "node-fetch";

const app = express();
const CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx";

app.get("/api/cdx", async (req, res) => {
  try {
    const url = req.query.url as string;
    if (!url) return res.status(400).json({ error: "url is required" });

    const params = new URLSearchParams({
      url,
      matchType: (req.query.matchType as string) || "domain",
      output: "json",
      fl: (req.query.fl as string) || "timestamp,original,mimetype,statuscode,digest,length",
      limit: (req.query.limit as string) || "5000",
    });

    // Accept one or many ?filters=... values and forward each as filter=...
    if (req.query.filters) {
      const filters = Array.isArray(req.query.filters)
        ? (req.query.filters as string[])
        : [req.query.filters as string];
      for (const f of filters) params.append("filter", f);
    }
    if (req.query.collapse) params.append("collapse", req.query.collapse as string);

    const resp = await fetch(`${CDX_ENDPOINT}?${params.toString()}`);
    if (!resp.ok) return res.status(resp.status).json({ error: "CDX upstream error" });
    const json = await resp.json();

    // First array is the header row; zip it with each value row.
    const [headers, ...rows] = json;
    const normalized = rows.map((row: string[]) =>
      Object.fromEntries(headers.map((h: string, i: number) => [h, row[i]]))
    );
    res.json({ headers, rows: normalized });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: "Internal error" });
  }
});

app.listen(3000, () => console.log("CDX proxy running on :3000"));

Your front‑end can now call /api/cdx?url=example.com&matchType=domain&collapse=digest.

Schema Design

Step 3 — Turn the Stream Into a First‑Class Dataset

To build durable modules, you need CDX data in a database that supports temporal and domain‑centric queries. A relational model works well; you can always layer search/graph systems on top later.

3.1 Suggested relational schema

For PostgreSQL‑style systems, a solid baseline:

CREATE TABLE cdx_capture (
    id BIGSERIAL PRIMARY KEY,
    url_key TEXT NOT NULL,
    original TEXT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    mime TEXT,
    statuscode INT,
    digest TEXT,
    length BIGINT,
    host TEXT,
    domain TEXT,
    path TEXT,
    query TEXT,
    source_label TEXT NOT NULL DEFAULT 'web.archive.org',
    -- Without a unique constraint, ON CONFLICT DO NOTHING in the ingest step never fires.
    UNIQUE (original, timestamp)
);

CREATE INDEX idx_cdx_domain_time ON cdx_capture (domain, timestamp);
CREATE INDEX idx_cdx_host_time ON cdx_capture (host, timestamp);
CREATE INDEX idx_cdx_digest ON cdx_capture (digest);
CREATE INDEX idx_cdx_status ON cdx_capture (statuscode);

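url_key stores the SURT‑style urlkey when you request it (add urlkey to the fl list; the queries in this guide do not, so the column ends up as an empty string). domain, host, path, and query are derived from original.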
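If you did not fetch urlkey from the server, you can approximate one locally. The sketch below is a deliberately simplified stand-in for SURT canonicalization (real canonicalizers also handle ports, query reordering, session parameters, and more), and simple_urlkey is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def simple_urlkey(url):
    """Approximate a SURT-style key such as com,example)/path (simplified)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com as one key
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/").lower()
    if parts.query:
        key += "?" + parts.query.lower()
    return key

print(simple_urlkey("http://www.Example.com/About?x=1"))  # com,example)/about?x=1
```

Keys produced this way are useful for grouping your own rows, but they should not be assumed to match the server's urlkey byte for byte.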

3.2 Normalizing and ingesting (Python)

Create cdx_ingest.py:

from urllib.parse import urlparse
from datetime import datetime, timezone
from typing import Dict, Any, List
import psycopg2  # conn passed to ingest_rows is a psycopg2 connection

def parse_domain(host: str) -> str:
    # Naive guess: keeps the last two labels, so it is wrong for
    # multi-label suffixes such as example.co.uk.
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def to_ts(ts_str: str) -> datetime:
    # CDX timestamps are UTC (YYYYMMDDhhmmss). Fill truncated values with
    # January 1st, 00:00:00 rather than zeros, which are not a valid month/day.
    padded = ts_str + "00000101000000"[len(ts_str):]
    return datetime.strptime(padded, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def to_int(value: Any) -> int | None:
    # CDX uses "-" for unknown values (for example on revisit records).
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    original = row.get("original", "")
    parsed = urlparse(original)
    host = parsed.hostname or ""
    return {
        "url_key": row.get("urlkey") or row.get("url_key") or "",
        "original": original,
        "timestamp": to_ts(row["timestamp"]),
        "mime": row.get("mime") or row.get("mimetype"),
        "statuscode": to_int(row.get("statuscode")),
        "digest": row.get("digest"),
        "length": to_int(row.get("length")),
        "host": host,
        "domain": parse_domain(host),
        "path": parsed.path or "/",
        "query": parsed.query or "",
    }

def ingest_rows(conn, rows: List[Dict[str, Any]]):
    with conn.cursor() as cur:
        for raw in rows:
            r = normalize_row(raw)
            cur.execute("""
                INSERT INTO cdx_capture
                    (url_key, original, timestamp, mime, statuscode, digest, length, host, domain, path, query)
                VALUES
                    (%(url_key)s, %(original)s, %(timestamp)s, %(mime)s, %(statuscode)s, %(digest)s, %(length)s,
                     %(host)s, %(domain)s, %(path)s, %(query)s)
                ON CONFLICT DO NOTHING;
            """, r)
    conn.commit()

Wire it together with the client from Step 2 to ingest a full domain’s index in batches, respecting any pagination/resumption behaviors.
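One way to keep memory use and transaction size bounded is to feed rows to the ingester in fixed-size chunks. The sketch below is a minimal version of that idea; chunked and ingest_in_batches are hypothetical helpers, and a plain list stands in for the database so the example runs on its own.

```python
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def ingest_in_batches(rows, ingest, size=1000):
    """Call ingest(batch) once per chunk and return the number of rows handled."""
    total = 0
    for batch in chunked(rows, size):
        ingest(batch)
        total += len(batch)
    return total

# Stand-in sink so the sketch runs without a database.
seen = []
count = ingest_in_batches(range(2500), seen.append, size=1000)
print(count, [len(b) for b in seen])  # 2500 [1000, 1000, 500]
```

With a real connection, the sink would be something like lambda batch: ingest_rows(conn, batch), so each chunk commits separately.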

Site Autopsy & Friends

Step 4 — Modules on Top of the CDX Ledger

With CDX indexed locally, you can now build higher‑level views — the modules that actually feel like tools. These are patterns, not products; you can integrate them into CLIs, services, or SPAs.

4.1 Site Autopsy Analyzer
Reconstruct the life and death of a domain.

Step A — Ingest the domain index

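import psycopg2
from cdx_client import fetch_all_cdx
from cdx_ingest import ingest_rows

# Placeholder DSN; point this at the database that holds cdx_capture.
conn = psycopg2.connect("dbname=cdx")

rows = fetch_all_cdx(
    url="example.com",
    match_type="domain",
    fields="timestamp,original,mimetype,statuscode,digest,length",
    # Keeps only successful captures; drop this filter if you also want
    # redirects and errors for the status-based queries in Step 5.
    filters=["statuscode:200"],
)
ingest_rows(conn, rows)
conn.close()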

Step B — Build a change spine per digest

SELECT DISTINCT ON (digest)
digest, domain, original, timestamp, mime, statuscode
FROM cdx_capture
WHERE domain = 'example.com'
ORDER BY digest, timestamp;

Step C — Aggregate “version history”

SELECT
digest,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen,
COUNT(*) AS capture_count,
ARRAY_AGG(DISTINCT statuscode) AS statuses
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY digest
ORDER BY first_seen;

Visualize this timeline with your plotting stack of choice (or a front‑end SPA): time on the x‑axis, digest versions as nodes, anomalies highlighted by status or capture density.

4.2 Shadow Domain Mapper
Enumerate historical hosts and infra surface.

Step A — Group by host

SELECT
host,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen,
COUNT(*) AS capture_count
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY host
ORDER BY first_seen;

Step B — Flag transient hosts

SELECT * FROM (
SELECT
host,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen,
COUNT(*) AS capture_count
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY host
) h
WHERE capture_count < 20
ORDER BY capture_count ASC;

Render this into a timeline or graph view to reveal ephemeral infra, forgotten subdomains, and legacy endpoints.

4.3 Content Lineage Tracker
Follow a content digest as it propagates.

Step A — Choose a digest from Site Autopsy output.

Step B — Query across all domains

SELECT
digest,
domain,
host,
original,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen,
COUNT(*) AS capture_count
FROM cdx_capture
WHERE digest = $1
GROUP BY digest, domain, host, original
ORDER BY first_seen;

Use this result to build a propagation graph: nodes are domains, edges connect domains where the same digest appears over time.
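The propagation graph can be sketched in a few lines of plain Python. This version assumes rows shaped like the query output (digest, domain, timestamp as a 14‑digit string) and directs each edge from the domain where a digest appeared earlier to the one where it appeared later; the function name and sample rows are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

def propagation_edges(rows):
    """Return {(earlier_domain, later_domain): [digests]} for shared digests."""
    first_seen = defaultdict(dict)  # digest -> {domain: earliest timestamp}
    for r in rows:
        seen = first_seen[r["digest"]]
        if r["domain"] not in seen or r["timestamp"] < seen[r["domain"]]:
            seen[r["domain"]] = r["timestamp"]
    edges = defaultdict(list)
    for digest, domains in first_seen.items():
        ordered = sorted(domains, key=domains.get)  # earliest appearance first
        for earlier, later in combinations(ordered, 2):
            edges[(earlier, later)].append(digest)
    return dict(edges)

# Invented sample rows; 14-digit timestamps sort chronologically as strings.
rows = [
    {"digest": "D1", "domain": "example.com", "timestamp": "20200101000000"},
    {"digest": "D1", "domain": "example.org", "timestamp": "20200301000000"},
    {"digest": "D1", "domain": "example.com", "timestamp": "20190101000000"},
    {"digest": "D2", "domain": "example.org", "timestamp": "20200101000000"},
]
print(propagation_edges(rows))
```

The edge direction reflects capture order only; the archive may simply have crawled one domain before the other, so treat it as a lead rather than proof of who copied whom.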

4.4 Historical API Diff Engine
Track JSON/API changes over time.

Step A — Filter API‑like captures

SELECT * FROM cdx_capture
WHERE host = 'api.example.com'
AND mime LIKE 'application/json%'
AND path LIKE '%/api/%';

Step B — Collapse by digest per endpoint

SELECT
original,
digest,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen,
COUNT(*) AS capture_count
FROM cdx_capture
WHERE host = 'api.example.com'
GROUP BY original, digest
ORDER BY original, first_seen;

Then fetch representative JSON from Wayback replay URLs and diff them (e.g., with deepdiff) to surface breaking changes.
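A sketch of the two pieces involved: building a replay URL for a capture and diffing two decoded JSON bodies. The id_ modifier after the timestamp asks Wayback for the original, unrewritten body. The small recursive diff here is a standard-library stand-in for deepdiff, and both helper names are hypothetical.

```python
def replay_url(timestamp, original, raw=True):
    """Build a Wayback replay URL; id_ requests the unrewritten body."""
    modifier = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{original}"

def json_diff(old, new, path=""):
    """Return (path, old_value, new_value) tuples for differing leaves.

    Missing keys are reported as None; lists are compared as whole values.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for key in sorted(set(old) | set(new)):
            changes += json_diff(old.get(key), new.get(key), f"{path}/{key}")
        return changes
    return [] if old == new else [(path or "/", old, new)]

# Invented payloads standing in for two captures of the same endpoint.
v1 = {"id": 1, "user": {"name": "a", "email": "x"}}
v2 = {"id": "1", "user": {"name": "a"}}
print(replay_url("20200101000000", "http://api.example.com/v1/users"))
print(json_diff(v1, v2))  # [('/id', 1, '1'), ('/user/email', 'x', None)]
```

A type change (1 to "1") or a dropped field (email) is typically what breaks API clients, which is why the diff reports leaves rather than whole documents.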

4.5 Archival Density Heatmap
Visualize capture frequency as a signal.

Step A — Bucket by month

SELECT
date_trunc('month', timestamp) AS bucket,
COUNT(*) AS capture_count
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY bucket
ORDER BY bucket;

Normalize and render as a heatmap or time‑series chart in your UI; spikes and gaps are investigation hooks.
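Normalizing here can be as simple as scaling each bucket against the busiest month. The sketch below assumes the query result has been loaded into a dict of bucket label to count; the labels and counts are invented.

```python
def normalize_counts(buckets):
    """Scale capture counts to the range 0..1 relative to the busiest bucket."""
    peak = max(buckets.values(), default=0)
    if peak == 0:
        return {label: 0.0 for label in buckets}
    return {label: count / peak for label, count in buckets.items()}

# Invented monthly counts.
monthly = {"2020-01": 10, "2020-02": 40, "2020-03": 0}
print(normalize_counts(monthly))  # {'2020-01': 0.25, '2020-02': 1.0, '2020-03': 0.0}
```

Months with no captures do not appear in the SQL output at all, so fill them in with zero before normalizing if gaps should be visible in the chart.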

From Data → Signals

Step 5 — Let Your Systems Think With CDX

Once the modules exist, you can move beyond visualizations into reasoning patterns – functions that compute stability, suspicion, or narrative drift scores over your CDX‑backed dataset.

5.1 Stability score per URL

Define a stability score as:

-- stability_score = distinct_digests / total_captures
SELECT
original,
COUNT(DISTINCT digest) AS digest_count,
COUNT(*) AS capture_count,
COUNT(DISTINCT digest)::float / COUNT(*)::float AS stability_score
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY original
ORDER BY stability_score ASC;

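Low score → stable, rarely changing pages. High score → volatile or frequently edited resources. Despite its name, the ratio measures churn, and a URL captured only once always scores 1.0, so require a minimum capture_count before ranking.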
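The same ratio, computed in Python over in-memory rows with a minimum-capture threshold so that single-capture URLs do not dominate the ranking. The min_captures parameter and the sample rows are assumptions added for illustration.

```python
from collections import defaultdict

def stability_scores(rows, min_captures=1):
    """Return {url: distinct_digests / total_captures} for sufficiently captured URLs."""
    digests = defaultdict(set)
    captures = defaultdict(int)
    for r in rows:
        digests[r["original"]].add(r["digest"])
        captures[r["original"]] += 1
    return {
        url: len(digests[url]) / captures[url]
        for url in captures
        if captures[url] >= min_captures
    }

# Invented rows: /a has 4 captures and 2 versions, /b has a single capture.
rows = [
    {"original": "http://example.com/a", "digest": "A"},
    {"original": "http://example.com/a", "digest": "A"},
    {"original": "http://example.com/a", "digest": "A"},
    {"original": "http://example.com/a", "digest": "B"},
    {"original": "http://example.com/b", "digest": "C"},
]
print(stability_scores(rows, min_captures=2))  # {'http://example.com/a': 0.5}
```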

5.2 Suspicion score per host

Combine status volatility and lifespan:

SELECT
host,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen,
COUNT(*) AS capture_count,
COUNT(DISTINCT statuscode) AS status_variants
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY host
ORDER BY status_variants DESC, capture_count ASC;

Hosts with many different status codes and short capture histories can be flagged as suspicious or ephemeral infra.
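To turn those columns into a single number, one option is to divide status variety by lifespan. The formula below is a hypothetical heuristic chosen for illustration, not an established metric; it assumes first_seen and last_seen are 14‑digit CDX timestamps.

```python
from datetime import datetime

def lifespan_days(first_seen, last_seen):
    """Whole days between two 14-digit CDX timestamps."""
    fmt = "%Y%m%d%H%M%S"
    return (datetime.strptime(last_seen, fmt) - datetime.strptime(first_seen, fmt)).days

def suspicion_score(status_variants, first_seen, last_seen):
    """Hypothetical heuristic: more status variety and a shorter life score higher."""
    months = lifespan_days(first_seen, last_seen) / 30
    return status_variants / (1 + months)

# A host seen for 30 days with three different status codes.
print(suspicion_score(3, "20200101000000", "20200131000000"))  # 1.5
# A host seen for ten years with a single status code scores close to zero.
print(suspicion_score(1, "20100101000000", "20200101000000"))
```

Any threshold on this score is a tuning decision; calibrate it against hosts you already know to be benign or short-lived.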

5.3 Narrative drift on key pages

For pages like /about, /policy, or /terms, track how often content changes:

SELECT
path,
COUNT(DISTINCT digest) AS versions,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen
FROM cdx_capture
WHERE domain = 'example.com'
AND path IN ('/about', '/policy', '/terms')
GROUP BY path
ORDER BY versions DESC;

Highly edited “narrative” pages get special attention in governance, legal, or media forensics workflows.

stability scoring · suspicion heuristics · drift analysis · intelligence overlays

From Zero → CDX‑Native

Step 6 — The Complete Wiring Plan

Here is the condensed path from nothing to a CDX‑native stack that pulls the index, builds modules, and lets your systems reason over the archive’s ledger.

Stage | Action | Output
1. Exploration | Use curl to query CDX with various params. | Hands‑on understanding of fields and filters.
2. Clients | Implement Python and/or Node clients. | Reusable functions to fetch JSON CDX.
3. Storage | Create cdx_capture schema + indexes. | CDX becomes queryable, persistent data.
4. Ingestion | Normalize and load domains into DB. | Historical index mirrored locally.
5. Modules | Build Site Autopsy, Shadow Mapper, etc. | Higher‑level, reusable analysis surfaces.
6. Reasoning | Add scoring and drift detection. | Systems can “think” with CDX signals.
7. Exposure | Expose as APIs, CLIs, or dashboards. | Human + machine access to the ledger.

Operational takeaway
The Wayback UI is a convenience. CDX is the evidence. Wiring it into your tooling turns historical captures into a live analysis substrate your systems can reason over.
