WAYBACK MACHINE // PRACTICAL WIRING GUIDE

How to Wire the Wayback CDX Server Into Your Own Tooling

Full tutorial · From raw CDX → intel modules · Stack‑agnostic, analyst‑grade

This is a complete, end‑to‑end walkthrough. You will learn how to query the CDX API, ingest and model the index, build higher‑level modules (Site Autopsy, Shadow Domain Mapper, Content Lineage), and let your systems reason over the ledger the archive has already collected.

0. Mental Model

Understand the Target First

What “Wiring CDX Into Your Tooling” Actually Means

“Wiring CDX” means turning the Wayback Machine’s index into a first‑class data source in your own systems. Instead of manually using the Wayback UI, your stack will:

Query the CDX API over HTTP
Receive CDX lines or JSON describing historical captures
Normalize and store those records in your own database
Build reusable modules on top of that data
Expose those modules via services, dashboards, or analysis pipelines

The core CDX endpoint used by the public Wayback Machine looks like:

https://web.archive.org/cdx/search/cdx?url=<target>&[params...]

The CDX Server specification (captured in the archived docs) describes parameters such as: url, matchType, output, fl (field list), filter, collapse, and pagination mechanisms like resumption keys in some implementations.
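As a minimal sketch of how those parameters combine into one request URL, the helper below assembles a query string with the standard library. The function name build_cdx_url and its argument names are illustrative, not part of the CDX API; the point it shows is that several filters are sent as repeated filter= parameters.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_url(url, match_type="domain", fields=None, filters=None,
                  collapse=None, limit=None):
    """Assemble a CDX query URL (hypothetical helper, not an official client)."""
    params = [("url", url), ("matchType", match_type), ("output", "json")]
    if fields:
        params.append(("fl", ",".join(fields)))
    # Each filter becomes its own filter= parameter
    for f in filters or []:
        params.append(("filter", f))
    if collapse:
        params.append(("collapse", collapse))
    if limit is not None:
        params.append(("limit", str(limit)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

query = build_cdx_url(
    "example.com",
    fields=["timestamp", "original", "statuscode"],
    filters=["statuscode:200", "mimetype:text/html"],
    collapse="digest",
)
print(query)
```

Building the URL from a list of pairs rather than a dict keeps the repeated filter entries intact.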

Goal of this tutorial

By the end, you should be able to: pull the index, store it, build modules, and run reasoning patterns over the ledger.

1. Explore the CDX API Manually

CLI + Inspection

Step 1 — Learn the API Like a Human First

Before writing code, you want tactile familiarity with the CDX API. You’ll use curl (or any HTTP client) to understand output formats, parameters, and how filters change behavior.

1.1 Basic domain‑wide query

Run this in a shell:

              curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=txt&filter=statuscode:200"
            

You should see space‑separated CDX lines. Fields generally include a URL key, timestamp, original URL, MIME, status code, digest, and length.
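To make the field layout concrete, here is a small sketch that splits one text line into a dictionary, assuming the default seven‑field order (urlkey, timestamp, original, mimetype, statuscode, digest, length). The sample line and its digest are invented for illustration, and parse_cdx_line is a hypothetical helper name.

```python
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]

def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict keyed by field name."""
    parts = line.strip().split(" ")
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CDX_FIELDS, parts))

# Made-up sample line; real digests are longer hash strings
sample = "com,example)/ 20130129170939 http://example.com/ text/html 200 SAMPLEDIGEST 1234"
record = parse_cdx_line(sample)
print(record["timestamp"], record["original"], record["statuscode"])
```

If you pass a custom fl list, the field order follows that list instead, so the JSON output in the next step is usually the safer thing to parse.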

1.2 Get structured JSON with specific fields

Run:

              curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mimetype,statuscode,digest,length"
            

The first array element is the header row; subsequent arrays are values. This format is ideal for ingestion into your systems.

1.3 Apply filters and collapsing

Filter on MIME and status:

              curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mimetype,statuscode,digest&filter=mimetype:text/html&filter=statuscode:200"
            

Collapse by digest to see only unique content versions:

              curl "https://web.archive.org/cdx/search/cdx?url=example.com&matchType=domain&output=json&fl=timestamp,original,mimetype,statuscode,digest&collapse=digest"
            

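Note that collapse only merges adjacent rows. Results are sorted by URL key and then timestamp, so consecutive captures of the same URL with an identical digest fold into one row. The result is a sequence of distinct content versions per URL: effectively your “change spine” for the domain. The same digest can still reappear later, or under a different URL.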
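The collapsing behavior can be reproduced locally, which helps when you want to reason about what the server dropped. The sketch below assumes rows are already sorted the way CDX returns them and keeps the first row of every run of consecutive rows that share a digest; collapse_adjacent is an illustrative name and the rows are invented.

```python
from itertools import groupby

def collapse_adjacent(rows, field="digest"):
    """Keep the first row of each run of consecutive rows sharing `field`."""
    return [next(group) for _, group in groupby(rows, key=lambda r: r[field])]

rows = [
    {"timestamp": "20200101000000", "digest": "AAA"},
    {"timestamp": "20200201000000", "digest": "AAA"},  # same as previous: dropped
    {"timestamp": "20200301000000", "digest": "BBB"},
    {"timestamp": "20200401000000", "digest": "AAA"},  # reappears later: kept
]
spine = collapse_adjacent(rows)
for row in spine:
    print(row["timestamp"], row["digest"])
```

The last row survives because only neighbors are compared, so a page that reverts to earlier content shows up as a new entry in the spine.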

2. Fetch Programmatically

Python + Node

Step 2 — Write Clients for CDX

Next you’ll create small, reusable clients that hide CDX’s parameter details behind simple functions. We’ll do one in Python for analysis pipelines and one in Node/TypeScript for web tooling and services.

2.1 Python client (analysis & pipelines)

Create a file cdx_client.py with:

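import requests
from typing import List, Dict, Any
from urllib.parse import unquote_plus

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def fetch_cdx(
    url: str,
    match_type: str = "domain",
    fields: str = "timestamp,original,mimetype,statuscode,digest,length",
    filters: List[str] | None = None,
    collapse: str | None = None,
    limit: int = 10000,
    resume_key: str | None = None,
) -> Dict[str, Any]:
    params: Dict[str, Any] = {
        "url": url,
        "matchType": match_type,
        "output": "json",
        "fl": fields,
        "limit": limit,
        "showResumeKey": "true",  # ask for a resume key when more results exist
    }
    if filters:
        params["filter"] = list(filters)  # requests sends a list as repeated filter= params
    if collapse:
        params["collapse"] = collapse
    if resume_key:
        # supported by some CDX server implementations; the key arrives URL-encoded,
        # so decode it once to keep requests from encoding it a second time
        params["resumeKey"] = unquote_plus(resume_key)

    resp = requests.get(CDX_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if not data:  # no captures matched: the body is an empty JSON array
        return {"headers": [], "rows": [], "resume_key": None}
    headers, *rows = data
    next_key = None
    # With showResumeKey=true the JSON ends with an empty row followed by [key]
    if len(rows) >= 2 and rows[-2] == [] and len(rows[-1]) == 1:
        next_key = rows[-1][0]
        rows = rows[:-2]
    normalized = [dict(zip(headers, row)) for row in rows]
    return {"headers": headers, "rows": normalized, "resume_key": next_key}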
            

To paginate large domains, add:

def fetch_all_cdx(url: str, **kwargs) -> List[Dict[str, Any]]:
    """Follow resume keys until the server stops returning one."""
    all_rows: List[Dict[str, Any]] = []
    resume_key: str | None = None
    while True:
        batch = fetch_cdx(url=url, resume_key=resume_key, **kwargs)
        rows = batch["rows"]
        all_rows.extend(rows)
        resume_key = batch["resume_key"]
        # Stop when there is no next page, or when a page comes back empty
        if not resume_key or not rows:
            break
    return all_rows
            

2.2 Node/TypeScript backend proxy (for SPAs)

For browser‑based tools, you generally proxy CDX through your backend to avoid CORS/rate‑limit issues. Create a small Express service:

              import express from "express";
import fetch from "node-fetch";

const app = express();
const CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx";

app.get("/api/cdx", async (req, res) => {
  try {
    const url = req.query.url as string;
    if (!url) return res.status(400).json({ error: "url is required" });

    const params = new URLSearchParams({
      url,
      matchType: (req.query.matchType as string) || "domain",
      output: "json",
      fl: (req.query.fl as string) || "timestamp,original,mimetype,statuscode,digest,length",
      limit: (req.query.limit as string) || "5000",
    });

    if (req.query.filters) {
      const filters = Array.isArray(req.query.filters)
        ? (req.query.filters as string[])
        : [req.query.filters as string];
      for (const f of filters) params.append("filter", f);
    }
    if (req.query.collapse) params.append("collapse", req.query.collapse as string);

    const resp = await fetch(`${CDX_ENDPOINT}?${params.toString()}`);
    if (!resp.ok) return res.status(resp.status).json({ error: "CDX upstream error" });
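    const json = (await resp.json()) as string[][];
    // An empty array means no captures matched, so there is no header row to unpack
    if (json.length === 0) return res.json({ headers: [], rows: [] });
    const [headers, ...rows] = json;
    const normalized = rows.map((row: string[]) =>
      Object.fromEntries(headers.map((h: string, i: number) => [h, row[i]]))
    );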
    res.json({ headers, rows: normalized });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: "Internal error" });
  }
});

app.listen(3000, () => console.log("CDX proxy running on :3000"));
            

Your front‑end can now call /api/cdx?url=example.com&matchType=domain&collapse=digest.

3. Store & Model CDX Data

Schema Design

Step 3 — Turn the Stream Into a First‑Class Dataset

To build durable modules, you need CDX data in a database that supports temporal and domain‑centric queries. A relational model works well; you can always layer search/graph systems on top later.

3.1 Suggested relational schema

For PostgreSQL‑style systems, a solid baseline:

              CREATE TABLE cdx_capture (
  id            BIGSERIAL PRIMARY KEY,
  url_key       TEXT NOT NULL,
  original      TEXT NOT NULL,
  timestamp     TIMESTAMPTZ NOT NULL,
  mime          TEXT,
  statuscode    INT,
  digest        TEXT,
  length        BIGINT,
  host          TEXT,
  domain        TEXT,
  path          TEXT,
  query         TEXT,
  source_label  TEXT NOT NULL DEFAULT 'web.archive.org'
);

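CREATE INDEX idx_cdx_domain_time ON cdx_capture (domain, timestamp);
CREATE INDEX idx_cdx_host_time   ON cdx_capture (host, timestamp);
CREATE INDEX idx_cdx_digest      ON cdx_capture (digest);
CREATE INDEX idx_cdx_status      ON cdx_capture (statuscode);

-- A unique index gives ON CONFLICT DO NOTHING something to conflict with on re-ingest.
-- Hashing with md5(original) keeps very long URLs under the btree entry size limit.
CREATE UNIQUE INDEX idx_cdx_capture_uniq ON cdx_capture (md5(original), timestamp, digest);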
            

url_key can store the SURT key if available; domain, host, path, and query are derived from original.
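When the urlkey field was not requested, a rough stand‑in can be derived from original. The sketch below is a deliberate simplification and only an assumption about what is good enough for grouping: it reverses the host labels, joins them with commas, and lowercases the result. The archive's own canonicalization applies further normalization that this helper (a hypothetical simple_surt_key) skips, so prefer the server‑supplied urlkey whenever you have it.

```python
from urllib.parse import urlsplit

def simple_surt_key(url):
    """Approximate a SURT-style key: reversed host labels, then path and query."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    key = f"{reversed_host}){parts.path or '/'}"
    if parts.query:
        key += f"?{parts.query}"
    # Lowercasing the whole key is an assumption made for grouping purposes
    return key.lower()

print(simple_surt_key("http://Blog.Example.com/Posts?id=1"))
print(simple_surt_key("https://example.com"))
```

Keys built this way sort hosts of the same domain next to each other, which is the property the domain‑centric queries later rely on.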

3.2 Normalizing and ingesting (Python)

Create cdx_ingest.py:

from urllib.parse import urlparse
from datetime import datetime, timezone
from typing import Dict, Any, List, Optional
import psycopg2

def parse_domain(host: str) -> str:
    # Naive: keeps the last two labels, so "shop.example.co.uk" becomes "co.uk".
    # Swap in a public-suffix-aware parser if you track such domains.
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def to_ts(ts_str: str) -> datetime:
    # Wayback timestamps are UTC (YYYYMMDDhhmmss); store them timezone-aware
    padded = ts_str.ljust(14, "0")
    return datetime.strptime(padded, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def to_int(value: Any) -> Optional[int]:
    # CDX uses "-" for unknown values (e.g. the status of revisit records)
    return int(value) if value and str(value).isdigit() else None

def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    original = row.get("original", "")
    parsed = urlparse(original)
    host = parsed.hostname or ""
    return {
      "url_key": row.get("urlkey") or row.get("url_key") or "",
      "original": original,
      "timestamp": to_ts(row["timestamp"]),
      "mime": row.get("mime") or row.get("mimetype"),
      "statuscode": to_int(row.get("statuscode")),
      "digest": row.get("digest"),
      "length": to_int(row.get("length")),
      "host": host,
      "domain": parse_domain(host),
      "path": parsed.path or "/",
      "query": parsed.query or "",
    }

def ingest_rows(conn, rows: List[Dict[str, Any]]):
    # One transaction per batch: commit only after every row has been inserted
    with conn.cursor() as cur:
        for raw in rows:
            r = normalize_row(raw)
            cur.execute("""
              INSERT INTO cdx_capture
                (url_key, original, timestamp, mime, statuscode, digest, length, host, domain, path, query)
              VALUES
                (%(url_key)s, %(original)s, %(timestamp)s, %(mime)s, %(statuscode)s, %(digest)s, %(length)s,
                 %(host)s, %(domain)s, %(path)s, %(query)s)
              ON CONFLICT DO NOTHING;
            """, r)
    conn.commit()
            

Wire it together with the client from Step 2 to ingest a full domain’s index in batches, respecting any pagination/resumption behaviors.

4. Build Analysis Modules

Site Autopsy & Friends

Step 4 — Modules on Top of the CDX Ledger

With CDX indexed locally, you can now build higher‑level views — the modules that actually feel like tools. These are patterns, not products; you can integrate them into CLIs, services, or SPAs.

4.1 Site Autopsy Analyzer

Reconstruct the life and death of a domain.

Step A — Ingest the domain index

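import psycopg2
from cdx_client import fetch_all_cdx
from cdx_ingest import ingest_rows

# Assumption: adjust the connection string for your own database
conn = psycopg2.connect("dbname=cdx")

rows = fetch_all_cdx(
    url="example.com",
    match_type="domain",
    fields="timestamp,original,mimetype,statuscode,digest,length",
    # No status filter here: Step C and the scoring queries aggregate status codes
)
ingest_rows(conn, rows)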
              

Step B — Build a change spine per digest

                SELECT DISTINCT ON (digest)
  digest, domain, original, timestamp, mime, statuscode
FROM cdx_capture
WHERE domain = 'example.com'
ORDER BY digest, timestamp;
              

Step C — Aggregate “version history”

                SELECT
  digest,
  MIN(timestamp) AS first_seen,
  MAX(timestamp) AS last_seen,
  COUNT(*)       AS capture_count,
  ARRAY_AGG(DISTINCT statuscode) AS statuses
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY digest
ORDER BY first_seen;
              

Visualize this timeline with your plotting stack of choice (or a front‑end SPA): time on the x‑axis, digest versions as nodes, anomalies highlighted by status or capture density.
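If you would rather shape the timeline data in application code than in SQL, the same aggregation is a few lines of Python. This is a sketch under the assumption that each row carries the raw 14‑digit timestamp string and a digest; version_history is an illustrative name and the rows are invented.

```python
from collections import defaultdict

def version_history(rows):
    """Group captures by digest and return versions ordered by first appearance."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["digest"]].append(row["timestamp"])
    history = [
        {
            "digest": digest,
            "first_seen": min(stamps),  # 14-digit timestamps sort lexicographically
            "last_seen": max(stamps),
            "capture_count": len(stamps),
        }
        for digest, stamps in grouped.items()
    ]
    return sorted(history, key=lambda v: v["first_seen"])

rows = [
    {"timestamp": "20200301000000", "digest": "BBB"},
    {"timestamp": "20200101000000", "digest": "AAA"},
    {"timestamp": "20200201000000", "digest": "AAA"},
]
timeline = version_history(rows)
for version in timeline:
    print(version)
```

Each entry maps directly onto one node of the timeline: first_seen positions it on the x‑axis and capture_count can drive its size.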

4.2 Shadow Domain Mapper

Enumerate historical hosts and infra surface.

Step A — Group by host

                SELECT
  host,
  MIN(timestamp) AS first_seen,
  MAX(timestamp) AS last_seen,
  COUNT(*)       AS capture_count
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY host
ORDER BY first_seen;
              

Step B — Flag transient hosts

                SELECT * FROM (
  SELECT
    host,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen,
    COUNT(*)       AS capture_count
  FROM cdx_capture
  WHERE domain = 'example.com'
  GROUP BY host
) h
WHERE capture_count < 20
ORDER BY capture_count ASC;
              

Render this into a timeline or graph view to reveal ephemeral infra, forgotten subdomains, and legacy endpoints.

4.3 Content Lineage Tracker

Follow a content digest as it propagates.

Step A — Choose a digest from Site Autopsy output.

Step B — Query across all domains

                SELECT
  digest,
  domain,
  host,
  original,
  MIN(timestamp) AS first_seen,
  MAX(timestamp) AS last_seen,
  COUNT(*)       AS capture_count
FROM cdx_capture
WHERE digest = $1
GROUP BY digest, domain, host, original
ORDER BY first_seen;
              

Use this result to build a propagation graph: nodes are domains, edges connect domains where the same digest appears over time.
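One simple way to turn the query result into edges is to order domains by when the digest first appeared on each and link every domain to the next one. That chaining rule is an assumption made for this sketch, not the only reasonable graph; propagation_edges is a hypothetical helper and the rows are invented.

```python
def propagation_edges(rows):
    """Return (earlier_domain, later_domain) pairs ordered by first appearance."""
    earliest = {}
    for row in rows:
        domain = row["domain"]
        # A domain can appear in several rows (one per host/URL); keep the earliest
        if domain not in earliest or row["first_seen"] < earliest[domain]:
            earliest[domain] = row["first_seen"]
    ordered = sorted(earliest, key=lambda d: (earliest[d], d))
    return list(zip(ordered, ordered[1:]))

rows = [
    {"domain": "mirror.net", "first_seen": "20200601000000"},
    {"domain": "example.com", "first_seen": "20200101000000"},
    {"domain": "example.com", "first_seen": "20200301000000"},
    {"domain": "copycat.org", "first_seen": "20210101000000"},
]
edges = propagation_edges(rows)
print(edges)
```

Keep in mind that first capture is not first publication: the archive may have crawled the copy before the original.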

4.4 Historical API Diff Engine

Track JSON/API changes over time.

Step A — Filter API‑like captures

                SELECT * FROM cdx_capture
WHERE host = 'api.example.com'  -- domain holds the registered domain (example.com), so match on host
  AND mime LIKE 'application/json%'
  AND path LIKE '%/api/%';
              

Step B — Collapse by digest per endpoint

                SELECT
  original,
  digest,
  MIN(timestamp) AS first_seen,
  MAX(timestamp) AS last_seen,
  COUNT(*)       AS capture_count
FROM cdx_capture
WHERE host = 'api.example.com'  -- domain holds the registered domain (example.com), so match on host
GROUP BY original, digest
ORDER BY original, first_seen;
              

Then fetch representative JSON from Wayback replay URLs and diff them (e.g., with deepdiff) to surface breaking changes.
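A sketch of both halves of that step follows. The replay URL uses the id_ modifier after the timestamp, which asks Wayback for the capture as originally stored rather than the rewritten page. The diff is a small standard‑library stand‑in for deepdiff that only reports added, removed, and changed paths; both helper names are illustrative and the payloads are invented.

```python
def replay_url(timestamp, original):
    """Build a Wayback replay URL that returns the unmodified capture body."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"

def diff_json(old, new, path=""):
    """Recursively list (kind, path) differences between two decoded JSON values."""
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            child = f"{path}/{key}"
            if key not in new:
                changes.append(("removed", child))
            elif key not in old:
                changes.append(("added", child))
            else:
                changes.extend(diff_json(old[key], new[key], child))
    elif old != new:
        changes.append(("changed", path or "/"))
    return changes

old_payload = {"id": 1, "name": "a", "meta": {"v": 1}}
new_payload = {"id": 1, "meta": {"v": 2}, "tags": []}
changes = diff_json(old_payload, new_payload)
print(replay_url("20200101000000", "http://api.example.com/api/v1/users"))
print(changes)
```

Removed keys are the ones most likely to break clients, so they are a reasonable first thing to surface.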

4.5 Archival Density Heatmap

Visualize capture frequency as a signal.

Step A — Bucket by month

                SELECT
  date_trunc('month', timestamp) AS bucket,
  COUNT(*) AS capture_count
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY bucket
ORDER BY bucket;
              

Normalize and render as a heatmap or time‑series chart in your UI; spikes and gaps are investigation hooks.
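The SQL only returns months that have at least one capture, so gaps have to be filled in before plotting. The sketch below assumes the query result has been loaded into a dict keyed by the first day of each month; it scales counts against the busiest month and inserts zero rows for missing months. The helper names and sample counts are illustrative.

```python
from datetime import date

def month_range(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1

def density_series(buckets):
    """Return [(month, count, count / peak)] with zero-filled gaps."""
    if not buckets:
        return []
    peak = max(buckets.values()) or 1  # guard against an all-zero input
    months = month_range(min(buckets), max(buckets))
    return [(m, buckets.get(m, 0), buckets.get(m, 0) / peak) for m in months]

buckets = {date(2020, 1, 1): 10, date(2020, 2, 1): 40, date(2020, 5, 1): 20}
series = density_series(buckets)
gaps = [month for month, count, _ in series if count == 0]
print(series)
print(gaps)
```

The normalized value can feed a color scale directly, and the gaps list gives you the silent months to investigate.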

5. Reason Over the Ledger

From Data → Signals

Step 5 — Let Your Systems Think With CDX

Once the modules exist, you can move beyond visualizations into reasoning patterns – functions that compute stability, suspicion, or narrative drift scores over your CDX‑backed dataset.

5.1 Stability score per URL

Define a stability score as:

              -- stability_score = distinct_digests / total_captures
SELECT
  original,
  COUNT(DISTINCT digest) AS digest_count,
  COUNT(*)               AS capture_count,
  COUNT(DISTINCT digest)::float / COUNT(*)::float AS stability_score
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY original
ORDER BY stability_score ASC;
            

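Scores near 0 → stable, rarely changing pages. Scores near 1 → volatile or frequently edited resources. A URL with a single capture always scores 1.0, so filter on capture_count before treating a high score as a signal.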

5.2 Suspicion score per host

Combine status volatility and lifespan:

              SELECT
  host,
  MIN(timestamp) AS first_seen,
  MAX(timestamp) AS last_seen,
  COUNT(*)       AS capture_count,
  COUNT(DISTINCT statuscode) AS status_variants
FROM cdx_capture
WHERE domain = 'example.com'
GROUP BY host
ORDER BY status_variants DESC, capture_count ASC;
            

Hosts with many different status codes and short capture histories can be flagged as suspicious or ephemeral infra.
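The query above only sorts hosts; if you want a single number to threshold on, one possible heuristic is to divide the status variety by the host's lifespan. The formula and the 30‑day scale below are arbitrary assumptions chosen for illustration, not an established metric, and suspicion_score is a hypothetical name.

```python
from datetime import datetime

def suspicion_score(first_seen, last_seen, status_variants):
    """Higher when a host shows many status codes over a short lifespan."""
    lifespan_days = (last_seen - first_seen).days
    # Assumption: measure lifespan in 30-day units; +1 avoids division by zero
    return status_variants / (1 + lifespan_days / 30)

short_lived = suspicion_score(datetime(2020, 1, 1), datetime(2020, 1, 31), 4)
long_lived = suspicion_score(datetime(2015, 1, 1), datetime(2020, 1, 1), 2)
print(short_lived, long_lived)
```

Treat the output as a ranking aid: a host that was crawled rarely will look short‑lived even if it existed for years.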

5.3 Narrative drift on key pages

For pages like /about, /policy, or /terms, track how often content changes:

              SELECT
  path,
  COUNT(DISTINCT digest) AS versions,
  MIN(timestamp)         AS first_seen,
  MAX(timestamp)         AS last_seen
FROM cdx_capture
WHERE domain = 'example.com'
  AND path IN ('/about', '/policy', '/terms')
GROUP BY path
ORDER BY versions DESC;
            

Highly edited “narrative” pages get special attention in governance, legal, or media forensics workflows.

6. End‑to‑End Checklist

From Zero → CDX‑Native

Step 6 — The Complete Wiring Plan

Here is the condensed path from nothing to a CDX‑native stack that pulls the index, builds modules, and lets your systems reason over the archive’s ledger.

Stage	Action	Output
1. Exploration	Use `curl` to query CDX with various params.	Hands‑on understanding of fields and filters.
2. Clients	Implement Python and/or Node clients.	Reusable functions to fetch JSON CDX.
3. Storage	Create `cdx_capture` schema + indexes.	CDX becomes queryable, persistent data.
4. Ingestion	Normalize and load domains into DB.	Historical index mirrored locally.
5. Modules	Build Site Autopsy, Shadow Mapper, etc.	Higher‑level, reusable analysis surfaces.
6. Reasoning	Add scoring and drift detection.	Systems can “think” with CDX signals.
7. Exposure	Expose as APIs, CLIs, or dashboards.	Human + machine access to the ledger.

Operational takeaway

The Wayback UI is a convenience. CDX is the evidence. Wiring it into your tooling turns historical captures into a live analysis substrate your systems can reason over.
