Mastering Internet Archive Asset Retrieval
A complete lesson built from a live deep‑dive: how to pull every asset from an Internet Archive item using the official metadata API, canonical download URLs, advanced regex patterns, self‑validating code, and powerful browser bookmarklets.
The canonical way to get every asset from an Internet Archive item
Most people approach the Internet Archive through its web UI: a play button, a couple of download links, some thumbnails. That’s just the surface. Behind every item lies a structured, machine‑readable description that tells you exactly what files exist, in what formats, and under what names.
If you want zero guessing and zero missing derivatives, you must rely on the Metadata API, not the UI. We’ll use a concrete example: CNN_20110506_150000_CNN_Newsroom.
1.1 The metadata endpoint
Every Internet Archive item exposes a JSON metadata document at:
https://archive.org/metadata/<IDENTIFIER>
For our example:
https://archive.org/metadata/CNN_20110506_150000_CNN_Newsroom
Open that in a browser, use curl, or fetch it from a script. This is the source of truth: it knows about every file, original or derivative, attached to the item.
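As a rough illustration of "fetch it from a script", here is a minimal sketch using only the Python standard library. The helper names `metadata_url` and `fetch_metadata` are made up for this example, and the sketch assumes the endpoint returns a JSON object as described above.

```python
import json
import urllib.parse
import urllib.request

METADATA_BASE = "https://archive.org/metadata/"


def metadata_url(identifier: str) -> str:
    """Build the metadata endpoint URL for an item identifier."""
    return METADATA_BASE + urllib.parse.quote(identifier, safe="")


def fetch_metadata(identifier: str, timeout: float = 30.0) -> dict:
    """Download and parse the item's JSON metadata document (network call)."""
    with urllib.request.urlopen(metadata_url(identifier), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_metadata("CNN_20110506_150000_CNN_Newsroom")` would perform the actual request; `metadata_url` can be checked offline.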
1.2 The files[] array
Inside the JSON, there’s a key called files:
"files": [
  {
    "name": "CNN_20110506_150000_CNN_Newsroom.mp4",
    "...": "..."
  },
  {
    "name": "CNN_20110506_150000_CNN_Newsroom.ogv",
    "...": "..."
  },
  ...
]
This files[] array lists every file the Archive knows about for the item:
- Original uploads: e.g., MPEG‑2 streams.
- Video derivatives: MP4, OGV, etc.
- Audio‑only derivatives: MP3, OGG.
- Captions / transcripts: SRT, XML, JSON.
- Thumbnails: JPEG previews and contact sheets.
- Metadata files: _meta.xml, _reviews.xml, _files.xml.
The crucial field is name. You never guess filenames; you always read them from files[]. If a file isn’t listed in files[], you don’t invent a URL for it. You trust the metadata and only use the filenames it gives you.
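Reading names out of files[] could look roughly like the sketch below. The function name `list_file_names` and the trimmed `sample_meta` dictionary are illustrative; only the `name` key is taken from the excerpt above.

```python
from typing import Any, Dict, List


def list_file_names(meta: Dict[str, Any]) -> List[str]:
    """Return every non-empty `name` found in the metadata files[] array."""
    files = meta.get("files")
    if not isinstance(files, list):
        raise ValueError("metadata has no files[] array")
    return [str(f["name"]) for f in files if isinstance(f, dict) and f.get("name")]


# Trimmed sample shaped like the excerpt above (other keys omitted).
sample_meta = {
    "files": [
        {"name": "CNN_20110506_150000_CNN_Newsroom.mp4"},
        {"name": "CNN_20110506_150000_CNN_Newsroom.ogv"},
    ]
}
names = list_file_names(sample_meta)
```

Raising on a missing files[] array, rather than returning an empty list, keeps a failed or unexpected response from looking like an item with no files.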
1.3 Constructing official direct download URLs
Every file in files[] becomes a canonical download URL using a single, stable pattern:
https://archive.org/download/<IDENTIFIER>/<FILENAME>
For our example item:
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.mp4
That’s the official, canonical method. No Wayback links, no HTML scraping, no guessed derivatives — just identifier + filename from metadata.
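The pattern can be expressed as a single small function. This is a sketch: `download_url` is a hypothetical helper, and keeping `/` unescaped in filenames mirrors what the bookmarklet later in this lesson does for files stored in subdirectories.

```python
from urllib.parse import quote

DOWNLOAD_BASE = "https://archive.org/download/"


def download_url(identifier: str, filename: str) -> str:
    """Join identifier and filename into a /download URL, percent-encoding both."""
    # safe="/" keeps path separators intact for files stored in subdirectories.
    return DOWNLOAD_BASE + quote(identifier, safe="") + "/" + quote(filename, safe="/")


example = download_url(
    "CNN_20110506_150000_CNN_Newsroom",
    "CNN_20110506_150000_CNN_Newsroom.mp4",
)
```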
All asset URLs for the example item
For the item CNN_20110506_150000_CNN_Newsroom, the full list of direct asset URLs, all derived from metadata, looks like this:
2.1 Video files
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.mp4
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.ogv
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.mpeg2
2.2 Audio‑only derivatives
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.mp3
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.ogg
2.3 Closed captions / transcripts
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.cc5.srt
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.cc5.xml
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.cc5.json
2.4 Thumbnails / preview images
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.thumbs.jpg
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.thumbs_small.jpg
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom.jpg
2.5 Metadata files
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom_meta.xml
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom_reviews.xml
https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/CNN_20110506_150000_CNN_Newsroom_files.xml
2.6 Original metadata JSON
https://archive.org/metadata/CNN_20110506_150000_CNN_Newsroom
Once you have files[], all these URLs are just the same function applied repeatedly.
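To make the "same function applied repeatedly" idea concrete, the sketch below maps a list of filenames to URLs and buckets them by the categories used in this section. The suffix rules are inferred from the listing above and are an assumption, not an Archive-defined taxonomy; `categorize` and `group_urls` are illustrative names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import quote

# Suffix rules inferred from the groupings in this section; first match wins,
# so "_meta.xml" and ".cc5.xml" are checked before broader suffixes.
CATEGORY_SUFFIXES: List[Tuple[str, Tuple[str, ...]]] = [
    ("metadata", ("_meta.xml", "_reviews.xml", "_files.xml")),
    ("captions", (".cc5.srt", ".cc5.xml", ".cc5.json")),
    ("video", (".mp4", ".ogv", ".mpeg2")),
    ("audio", (".mp3", ".ogg")),
    ("thumbnails", (".jpg",)),
]


def categorize(filename: str) -> str:
    """Return the category whose suffix list matches the filename, else 'other'."""
    lower = filename.lower()
    for category, suffixes in CATEGORY_SUFFIXES:
        if lower.endswith(suffixes):
            return category
    return "other"


def group_urls(identifier: str, filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Apply the /download pattern to each filename and group URLs by category."""
    base = "https://archive.org/download/" + quote(identifier, safe="") + "/"
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name in filenames:
        grouped[categorize(name)].append(base + quote(name, safe="/"))
    return dict(grouped)
```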
Inside the Internet Archive’s storage nodes: the weird, the legacy, and the outliers
Once you start following real IA URLs, you stumble across hostnames that look deeply strange: ia801005.us.archive.org/20, ia903507.archive.org/0/items/..., and other “non‑country” clusters. They look geographic, but they’re not. Labels like .us or .eu are internal namespaces, not guarantees of physical location or jurisdiction.
3.1 High‑level node family table
| family_name | hostname_pattern | example_hostname | typical_role | notes |
|---|---|---|---|---|
| ia600_us | ia6ddddd.us.archive.org | ia600209.us.archive.org | older storage cluster | Older items; mixed originals and derivatives. |
| ia800_us | ia8ddddd.us.archive.org | ia801005.us.archive.org | storage & derivatives | Very common; shard directories like /20 appear here. |
| ia900_us | ia9ddddd.us.archive.org | ia902507.us.archive.org | newer storage cluster | Common for newer uploads and derivatives. |
| ia800_root | ia8ddddd.archive.org | ia803408.archive.org | storage / derivatives | Hostnames without .us, more recent naming style. |
| ia900_root | ia9ddddd.archive.org | ia903507.archive.org | storage / derivatives / shards | Often appears with /0/items/ shard roots. |
| ia_geo_legacy | ia[6-8]ddddd.<cc>.archive.org | ia600301.eu.archive.org | legacy namespace | Rare .eu, .ca, etc.; not real geo routing. |
| shard_dir | ia[6-9]ddddd(.us).archive.org/<n> | ia801005.us.archive.org/20 | shard directory exposer | Raw shard directories revealing internal storage layout. |
| items_root | ia[6-9]ddddd(.us).archive.org/[0-1]/items/... | ia903507.archive.org/0/items/IDENTIFIER | shard item root | Backdoor‑like view into shard contents. |
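One way to use the hostname families in this table is a small classifier. The sketch below reads each "d" in the hostname_pattern column as exactly one digit and the `<cc>` label as two letters, which are interpretations rather than documented facts; `node_family` is an illustrative name.

```python
import re
from typing import Optional

# Order matters: the ".us" families are checked before the generic two-letter
# legacy label, which would otherwise also match "us".
FAMILY_PATTERNS = [
    ("ia600_us", re.compile(r"^ia6\d{5}\.us\.archive\.org$")),
    ("ia800_us", re.compile(r"^ia8\d{5}\.us\.archive\.org$")),
    ("ia900_us", re.compile(r"^ia9\d{5}\.us\.archive\.org$")),
    ("ia800_root", re.compile(r"^ia8\d{5}\.archive\.org$")),
    ("ia900_root", re.compile(r"^ia9\d{5}\.archive\.org$")),
    ("ia_geo_legacy", re.compile(r"^ia[6-8]\d{5}\.[a-z]{2}\.archive\.org$")),
]


def node_family(hostname: str) -> Optional[str]:
    """Return the family_name for a storage hostname, or None if unrecognized."""
    host = hostname.strip().lower()
    for family, pattern in FAMILY_PATTERNS:
        if pattern.match(host):
            return family
    return None
```

A `None` result simply means the hostname falls outside the families listed here; it says nothing about whether the host exists.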
3.2 The “oddest” patterns
Some storage URLs are especially strange. A few highlights:
- Ghost nodes: appear in URLs, but often return nothing or 403/503.
- Hybrid namespace mutants: mixing .us and root naming across generations.
- Non‑country country codes: .eu.archive.org, .ca.archive.org as labels.
- Shard directory exposers: paths like /20, /21, /19.
- Items shard roots: /0/items/IDENTIFIER, /1/items/IDENTIFIER.
The same item can surface through UI pages (/details/), storage nodes (ia8...), embeds, and raw shard paths. For archival‑grade workflows, you always normalize back to https://archive.org/download/<IDENTIFIER>/<FILENAME>.
One-click bookmarklet: regex-powered asset extraction directly in your browser
Doing this manually is fine once or twice. But the real power comes from a bookmarklet that runs on any Internet Archive page, detects the identifier, talks to the metadata API, and generates all canonical download URLs — with regex‑based filtering.
4.1 What this bookmarklet does
- Auto-detects the identifier from /details/, /embed/, /metadata/, or ?identifier= using regex.
- Falls back to a prompt if detection fails.
- Fetches /metadata/<IDENTIFIER>.
- Builds all https://archive.org/download/IDENTIFIER/FILENAME URLs.
- Allows regex filters (e.g., only .mp4, only captions, etc.).
- Opens results in a clean popup with a <textarea> for fast copying.
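For scripted use outside the browser, the first and fifth steps in this list (identifier detection and regex filtering) can be approximated in Python. This is a sketch, not a port of the bookmarklet; `extract_identifier` and `filter_names` are illustrative names, and the example filters are the ones the bookmarklet suggests in its prompt.

```python
import re
from typing import List, Optional
from urllib.parse import unquote

IDENTIFIER_PATTERNS = [
    # Path-based forms: /details/, /download/, /embed/, /metadata/
    re.compile(
        r"https?://(?:www\.)?archive\.org/(?:details|download|embed|metadata)/([^/?#]+)",
        re.IGNORECASE,
    ),
    # Query-string form: ?identifier=... (value ends at "&" or "#")
    re.compile(r"[?&]identifier=([^&#]+)", re.IGNORECASE),
]


def extract_identifier(url: str) -> Optional[str]:
    """Return the item identifier found in the URL, or None."""
    for pattern in IDENTIFIER_PATTERNS:
        m = pattern.search(url)
        if m:
            return unquote(m.group(1))
    return None


def filter_names(names: List[str], pattern: Optional[str]) -> List[str]:
    """Keep names matching the regex; an empty or missing pattern keeps all."""
    if not pattern:
        return list(names)
    regex = re.compile(pattern)
    return [n for n in names if regex.search(n)]
```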
4.2 Readable JavaScript (core logic)
(function () {
  // Log helper
  function log() {
    console.log("[IA Bookmarklet]", ...arguments);
  }

  // 1. Extract identifier from various URL patterns
  function extractIdentifierFromUrl(url) {
    var patterns = [
      /https?:\/\/(?:www\.)?archive\.org\/details\/([^\/?#]+)/i,
      /https?:\/\/(?:www\.)?archive\.org\/(?:download|embed)\/([^\/?#]+)/i,
      /https?:\/\/(?:www\.)?archive\.org\/metadata\/([^\/?#]+)/i,
      // Stop at "&" or "#" so only the identifier value is captured
      /[?&]identifier=([^&#]+)/i
    ];
    for (var i = 0; i < patterns.length; i++) {
      var m = url.match(patterns[i]);
      if (m && m[1]) {
        return decodeURIComponent(m[1]);
      }
    }
    return null;
  }
  // 2. Resolve identifier (auto or manual)
  function resolveIdentifier() {
    var url = String(window.location.href || "");
    var id = extractIdentifierFromUrl(url);
    if (id) {
      log("Detected identifier:", id);
      return id;
    }
    var manual = window.prompt(
      "Enter Internet Archive identifier (e.g. CNN_20110506_150000_CNN_Newsroom):",
      ""
    );
    if (!manual) return null;
    manual = manual.trim();
    return manual || null;
  }
  // 3. Ask for optional regex filter
  function askRegexFilter() {
    var hint =
      "Optional regex to filter file names.\n" +
      "Examples:\n" +
      " \\.mp4$ -> only MP4\n" +
      " \\.mp3$ -> only MP3\n" +
      " \\.cc5\\.(srt|xml|json)$ -> only caption files\n" +
      "Leave blank for ALL files.";
    var input = window.prompt(hint, "");
    if (!input) return null;
    try {
      return new RegExp(input);
    } catch (e) {
      alert("Invalid regex: " + e.message);
      return null;
    }
  }
  // 4. Fetch metadata JSON
  function fetchMetadata(identifier) {
    var metaUrl = "https://archive.org/metadata/" + encodeURIComponent(identifier);
    log("Fetching metadata:", metaUrl);
    return fetch(metaUrl, { cache: "no-store" }).then(function (res) {
      if (!res.ok) {
        throw new Error("Metadata request failed with status " + res.status);
      }
      return res.json();
    });
  }
  // 5. Build canonical download URLs from files[]
  function buildDownloadUrls(identifier, files, regexFilter) {
    var base = "https://archive.org/download/" + encodeURIComponent(identifier) + "/";
    var out = [];
    files.forEach(function (f) {
      if (!f || !f.name) return;
      var name = String(f.name);
      if (regexFilter && !regexFilter.test(name)) return;
      // Encode the name but keep "/" so files in subdirectories stay addressable
      out.push(base + encodeURIComponent(name).replace(/%2F/g, "/"));
    });
    return out;
  }
  // 6. Render results in a popup window
  function openResultsWindow(identifier, metaUrl, regexFilter, urls) {
    // No "noopener" feature here: with it, window.open() returns null,
    // so the popup could never be written to.
    var w = window.open("", "_blank");
    if (!w) {
      alert("Popup blocked. Allow popups for this site and try again.");
      return;
    }
    var filterInfo = regexFilter ? regexFilter.toString() : "None (ALL files)";
    var html =
      "<!DOCTYPE html>" +
      "<html><head><meta charset='utf-8'>" +
      "<title>IA URLs - " + identifier + "</title>" +
      "<style>" +
      "body{font-family:system-ui,-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;padding:1rem 1.25rem;" +
      "line-height:1.5;background:#f9fafb;color:#111827;}" +
      "h1{font-size:1.3rem;margin-bottom:0.4rem;}" +
      "code{background:#e5e7eb;padding:0.1em 0.3em;border-radius:3px;font-size:0.95em;}" +
      "textarea{width:100%;height:60vh;margin-top:0.75rem;font-family:SFMono-Regular,Menlo,Monaco,Consolas," +
      "'Liberation Mono','Courier New',monospace;font-size:0.85rem;}" +
      "small{color:#6b7280;}" +
      "</style></head><body>" +
      "<h1>Internet Archive asset URLs</h1>" +
      "<p><strong>Identifier:</strong> <code>" + identifier + "</code></p>" +
      "<p><strong>Metadata:</strong> <a href='" + metaUrl + "' target='_blank' rel='noopener'>" +
      metaUrl + "</a></p>" +
      "<p><strong>Filter:</strong> " + filterInfo + "</p>" +
      "<p><strong>Count:</strong> " + urls.length + " file(s)</p>" +
      "<p>All URLs are direct <code>/download</code> links derived from metadata <code>files[]</code>.</p>" +
      // Join with a real newline so each URL lands on its own line
      "<textarea readonly>" + urls.join("\n") + "</textarea>" +
      "<p><small>Tip: Ctrl+A / Cmd+A inside the box, then copy.</small></p>" +
      "</body></html>";
    w.document.open();
    w.document.write(html);
    w.document.close();
  }
  // 7. Orchestration
  (function run() {
    var identifier = resolveIdentifier();
    if (!identifier) {
      alert("No identifier provided. Aborting.");
      return;
    }
    var regexFilter = askRegexFilter(); // can be null
    fetchMetadata(identifier)
      .then(function (meta) {
        if (!meta || !Array.isArray(meta.files)) {
          throw new Error("Metadata JSON does not contain a valid files[] array.");
        }
        var urls = buildDownloadUrls(identifier, meta.files, regexFilter);
        if (!urls.length) {
          alert("No files matched your filter. Try again without a filter or with a different regex.");
          return;
        }
        openResultsWindow(
          identifier,
          "https://archive.org/metadata/" + encodeURIComponent(identifier),
          regexFilter,
          urls
        );
      })
      .catch(function (err) {
        console.error(err);
        alert("Error processing metadata: " + err.message);
      });
  })();
})();
4.3 Minified bookmarklet version
Create a new bookmark, then paste this into its URL field as a single line. It’s the same logic, compressed. The slash-restoring regex is written as %2[Ff] because browsers percent-decode javascript: URLs, which would otherwise turn %2F into a bare slash and break the script:
javascript:(function(){function l(){console.log("[IA Bookmarklet]",...arguments)}function c(e){var t=[/https?:\/\/(?:www\.)?archive\.org\/details\/([^\/?#]+)/i,/https?:\/\/(?:www\.)?archive\.org\/(?:download|embed)\/([^\/?#]+)/i,/https?:\/\/(?:www\.)?archive\.org\/metadata\/([^\/?#]+)/i,/[?&]identifier=([^&#]+)/i];for(var n=0;n<t.length;n++){var r=e.match(t[n]);if(r&&r[1])return decodeURIComponent(r[1])}return null}function a(){var e=String(window.location.href||""),t=c(e);if(t)return l("Detected identifier:",t),t;var n=window.prompt("Enter Internet Archive identifier (e.g. CNN_20110506_150000_CNN_Newsroom):","");return n?(n=n.trim())||null:null}function i(){var e="Optional regex to filter file names.\nExamples:\n \\.mp4$ -> only MP4\n \\.mp3$ -> only MP3\n \\.cc5\\.(srt|xml|json)$ -> only caption files\nLeave blank for ALL files.",t=window.prompt(e,"");if(!t)return null;try{return new RegExp(t)}catch(n){return alert("Invalid regex: "+n.message),null}}function o(e){var t="https://archive.org/metadata/"+encodeURIComponent(e);return l("Fetching metadata:",t),fetch(t,{cache:"no-store"}).then(function(n){if(!n.ok)throw new Error("Metadata request failed with status "+n.status);return n.json()})}function f(e,t,n){var r="https://archive.org/download/"+encodeURIComponent(e)+"/",d=[];return t.forEach(function(u){if(u&&u.name){var s=String(u.name);(!n||n.test(s))&&d.push(r+encodeURIComponent(s).replace(/%2[Ff]/g,"/"))}}),d}function m(e,t,n,r){var d=window.open("","_blank");if(!d){alert("Popup blocked. Please allow popups for this site and try again.");return}var u=n?n.toString():"None (ALL files)",s="<!DOCTYPE html><html><head><meta charset='utf-8'><title>IA URLs - "+e+"</title><style>body{font-family:system-ui,-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;padding:1rem 1.25rem;line-height:1.5;background:#f9fafb;color:#111827;}h1{font-size:1.3rem;margin-bottom:0.4rem;}code{background:#e5e7eb;padding:0.1em 0.3em;border-radius:3px;font-size:0.95em;}textarea{width:100%;height:60vh;margin-top:0.75rem;font-family:SFMono-Regular,Menlo,Monaco,Consolas,'Liberation Mono','Courier New',monospace;font-size:0.85rem;}small{color:#6b7280;}</style></head><body><h1>Internet Archive asset URLs</h1><p><strong>Identifier:</strong> <code>"+e+"</code></p><p><strong>Metadata:</strong> <a href='"+t+"' target='_blank' rel='noopener'>"+t+"</a></p><p><strong>Filter:</strong> "+u+"</p><p><strong>Count:</strong> "+r.length+" file(s)</p><p>All URLs are direct <code>/download</code> links derived from the metadata <code>files[]</code> array.</p><textarea readonly>"+r.join("\n")+"</textarea><p><small>Tip: Press Ctrl+A / Cmd+A inside the box to select all, then copy.</small></p></body></html>";d.document.open(),d.document.write(s),d.document.close()}(function(){var e=a();if(!e){alert("No identifier provided. Aborting.");return}var t=i();o(e).then(function(n){if(!n||!Array.isArray(n.files))throw new Error("Metadata JSON does not contain a valid files[] array.");var r=f(e,n.files,t);if(!r.length){alert("No files matched your filter. Try again without a filter or with a different regex.");return}m(e,"https://archive.org/metadata/"+encodeURIComponent(e),t,r)}).catch(function(n){console.error(n),alert("Error while processing metadata: "+n.message)})})();})();
Regex catalog and self-validating patterns (pandas-style)
To really own this, you want a catalog of URL and node patterns, plus code that validates those patterns before you trust them. This is where a pandas‑style approach shines: each pattern becomes a row; you mark it as valid or invalid after testing.
5.1 Node pattern regex table
| pattern_name | regex | example_url | role |
|---|---|---|---|
| node_800_us | ^https?://ia8\d{4,5}\.us\.archive\.org(?:/.*)?$ | https://ia801005.us.archive.org/20 | 800-series .us node, shard directories like /20. |
| node_shard_dir | ^https?://ia[6-9]\d{4,5}\.(?:us\.)?archive\.org/\d{1,3}(?:/.*)?$ | https://ia903004.archive.org/19 | Raw shard directory paths revealing internal storage layout. |
| canonical_download | ^https?://(?:www\.)?archive\.org/download/([^/]+)/(.+)$ | https://archive.org/download/foo/file.mp4 | Canonical download endpoint, identifier + filename. |
| backdoor_raw_file | ^https?://ia[6-9]\d{4,5}\.(?:us\.)?archive\.org/\d{1,3}/items/([^/]+)/(.+)$ | https://ia902908.us.archive.org/0/items/foo/foo.mp4 | Raw shard item file path; can be normalized to /download. |
You can treat these patterns like rows in a DataFrame and use them to classify URLs coming from logs, scraped pages, or other sources.
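Classifying a URL against these rows might look like the sketch below. It uses the four regexes from the table verbatim and returns every pattern name that matches, since the rows overlap (a raw item file path also looks like a shard directory path). The function name `classify` is illustrative.

```python
import re
from typing import List

# The four rows of the table above, in table order.
CATALOG = [
    ("node_800_us", r"^https?://ia8\d{4,5}\.us\.archive\.org(?:/.*)?$"),
    ("node_shard_dir", r"^https?://ia[6-9]\d{4,5}\.(?:us\.)?archive\.org/\d{1,3}(?:/.*)?$"),
    ("canonical_download", r"^https?://(?:www\.)?archive\.org/download/([^/]+)/(.+)$"),
    ("backdoor_raw_file", r"^https?://ia[6-9]\d{4,5}\.(?:us\.)?archive\.org/\d{1,3}/items/([^/]+)/(.+)$"),
]
COMPILED = [(name, re.compile(regex)) for name, regex in CATALOG]


def classify(url: str) -> List[str]:
    """Return the names of all catalog patterns that match the URL."""
    return [name for name, pattern in COMPILED if pattern.match(url)]
```

Returning all matches instead of the first keeps the overlap visible; a caller who wants a single label can pick the most specific one.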
5.2 Self-validating pattern catalog in Python
This script defines patterns, compiles them, and checks them against example URLs:
import re
from dataclasses import dataclass, asdict
from typing import Optional, List

import pandas as pd


@dataclass
class NodePattern:
    """One catalog row: the pattern itself plus its validation results."""
    pattern_name: str
    regex: str
    example_url: str
    role: str
    notes: str
    valid_regex: bool = False
    matches_example: bool = False
    compile_error: Optional[str] = None


def build_raw_patterns() -> List[NodePattern]:
    return [
        NodePattern(
            pattern_name="node_800_us",
            regex=r"^https?://ia8\d{4,5}\.us\.archive\.org(?:/.*)?$",
            example_url="https://ia801005.us.archive.org/20",
            role="storage & derivatives (.us namespace)",
            notes="800-series node; shard directory paths like /20 appear here.",
        ),
        # ...add more rows here...
    ]


def validate_patterns(patterns: List[NodePattern]) -> List[NodePattern]:
    for p in patterns:
        # Step 1: does the regex compile at all?
        try:
            compiled = re.compile(p.regex)
            p.valid_regex = True
        except re.error as e:
            p.valid_regex = False
            p.compile_error = str(e)
            p.matches_example = False
            continue
        # Step 2: does it match its own example URL?
        p.matches_example = bool(compiled.match(p.example_url))
    return patterns


def patterns_dataframe(patterns: List[NodePattern]) -> pd.DataFrame:
    return pd.DataFrame([asdict(p) for p in patterns])


if __name__ == "__main__":
    raw_patterns = build_raw_patterns()
    validated = validate_patterns(raw_patterns)
    df = patterns_dataframe(validated)
    print(df[["pattern_name", "regex", "example_url",
              "role", "valid_regex", "matches_example", "compile_error"]])
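The script above only feeds the validator a row that passes. To see the other two outcomes, a row whose regex fails to compile and a row whose regex compiles but misses its example, a standalone sketch without pandas could look like this. The helper `check_pattern` and the two deliberately bad patterns are made up for illustration.

```python
import re
from typing import Optional, Tuple


def check_pattern(regex: str, example: str) -> Tuple[bool, bool, Optional[str]]:
    """Return (valid_regex, matches_example, compile_error) for one catalog row."""
    try:
        compiled = re.compile(regex)
    except re.error as exc:
        return False, False, str(exc)
    return True, compiled.match(example) is not None, None


EXAMPLE = "https://ia801005.us.archive.org/20"

# Passes both checks.
good = check_pattern(r"^https?://ia8\d{4,5}\.us\.archive\.org(?:/.*)?$", EXAMPLE)
# Unbalanced group: fails to compile, so the error text is recorded.
broken = check_pattern(r"^https?://ia8\d{4,5}\.us\.archive\.org(?:/.*$", EXAMPLE)
# Compiles, but describes a different node family than the example URL.
mismatch = check_pattern(r"^https?://ia9\d{4,5}\.archive\.org$", EXAMPLE)
```

Rows like `broken` and `mismatch` are the ones to fix or drop before the catalog is used to classify real URLs.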
Normalizing any IA URL back to canonical download (with validation)
Many URLs can point to the same file: raw node URLs, embed URLs, details pages, query‑based links. A robust workflow normalizes them back to:
https://archive.org/download/<IDENTIFIER>/<FILENAME>
6.1 Normalization rule examples
| rule_name | input_pattern | example_in | example_out |
|---|---|---|---|
| from_download | ^https?://(?:www\.)?archive\.org/download/([^/]+)/(.+)$ | https://archive.org/download/foo/file.mp4 | https://archive.org/download/foo/file.mp4 |
| from_raw_node_file | ^https?://ia[6-9]\d{4,5}\.(?:us\.)?archive\.org/\d{1,3}/items/([^/]+)/(.+)$ | https://ia902908.us.archive.org/0/items/foo/foo.mp4 | https://archive.org/download/foo/foo.mp4 |
| from_details | ^https?://(?:www\.)?archive\.org/details/([^/?#]+) | https://archive.org/details/CNN_20110506_150000_CNN_Newsroom | https://archive.org/download/CNN_20110506_150000_CNN_Newsroom/ |
6.2 Self-validating normalization rules (Python)
import re
from dataclasses import dataclass
from typing import Optional, List

import pandas as pd


@dataclass
class NormalizationRule:
    """One normalization rule plus the results of validating it."""
    rule_name: str
    regex: str
    description: str
    example_in: str
    expected_out: str
    valid_regex: bool = False
    matches_example: bool = False
    actual_out: Optional[str] = None
    compile_error: Optional[str] = None


def canonical_from_match(rule: NormalizationRule, m: re.Match) -> Optional[str]:
    # Rules that capture identifier + filename map to a full file URL.
    if rule.rule_name == "from_download":
        identifier, filename = m.group(1), m.group(2)
        return f"https://archive.org/download/{identifier}/{filename}"
    if rule.rule_name == "from_raw_node_file":
        identifier, filename = m.group(1), m.group(2)
        return f"https://archive.org/download/{identifier}/{filename}"
    # Rules that capture only an identifier map to the item's download root.
    if rule.rule_name in ("from_items_root", "from_embed", "from_details", "from_query_param"):
        identifier = m.group(1)
        return f"https://archive.org/download/{identifier}/"
    return None


def build_normalization_rules() -> List[NormalizationRule]:
    return [
        NormalizationRule(
            rule_name="from_download",
            regex=r"^https?://(?:www\.)?archive\.org/download/([^/]+)/(.+)$",
            description="Direct /download URL; already canonical.",
            example_in="https://archive.org/download/foo/file.mp4",
            expected_out="https://archive.org/download/foo/file.mp4",
        ),
        NormalizationRule(
            rule_name="from_raw_node_file",
            regex=r"^https?://ia[6-9]\d{4,5}\.(?:us\.)?archive\.org/\d{1,3}/items/([^/]+)/(.+)$",
            description="Raw node /shard/items/IDENTIFIER/FILENAME -> /download.",
            example_in="https://ia902908.us.archive.org/0/items/foo/foo.mp4",
            expected_out="https://archive.org/download/foo/foo.mp4",
        ),
        # ...add more rules...
    ]


def validate_normalization_rules(rules: List[NormalizationRule]) -> List[NormalizationRule]:
    for r in rules:
        # Step 1: the regex must compile.
        try:
            compiled = re.compile(r.regex)
            r.valid_regex = True
        except re.error as e:
            r.valid_regex = False
            r.compile_error = str(e)
            continue
        # Step 2: it must match its own example input.
        m = compiled.match(r.example_in)
        if not m:
            r.matches_example = False
            r.actual_out = None
            continue
        # Step 3: the rewritten URL must equal the expected output.
        out = canonical_from_match(r, m)
        r.actual_out = out
        r.matches_example = (out == r.expected_out)
    return rules


if __name__ == "__main__":
    rules = build_normalization_rules()
    validated = validate_normalization_rules(rules)
    df_rules = pd.DataFrame([{
        "rule_name": r.rule_name,
        "regex": r.regex,
        "example_in": r.example_in,
        "expected_out": r.expected_out,
        "actual_out": r.actual_out,
        "valid_regex": r.valid_regex,
        "matches_example": r.matches_example,
        "compile_error": r.compile_error,
    } for r in rules])
    print(df_rules)
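Once the rules validate, applying them to an arbitrary URL is a short loop. The sketch below is a standalone, pandas-free illustration using the three rules from the table in 6.1; `normalize` is a hypothetical helper, not part of the script above.

```python
import re
from typing import Optional

# Rules that capture (identifier, filename).
FILE_RULES = [
    re.compile(r"^https?://(?:www\.)?archive\.org/download/([^/]+)/(.+)$"),
    re.compile(r"^https?://ia[6-9]\d{4,5}\.(?:us\.)?archive\.org/\d{1,3}/items/([^/]+)/(.+)$"),
]
# Rules that capture only an identifier.
ITEM_RULES = [
    re.compile(r"^https?://(?:www\.)?archive\.org/details/([^/?#]+)"),
]


def normalize(url: str) -> Optional[str]:
    """Rewrite a recognized IA URL to its /download form; None if unrecognized."""
    for rule in FILE_RULES:
        m = rule.match(url)
        if m:
            return f"https://archive.org/download/{m.group(1)}/{m.group(2)}"
    for rule in ITEM_RULES:
        m = rule.match(url)
        if m:
            return f"https://archive.org/download/{m.group(1)}/"
    return None
```

Returning `None` for unrecognized input, rather than passing the URL through, makes it obvious which URLs the rule set does not cover yet.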
Putting it all together
This whole “mega” lesson started from a simple goal: get every single real asset URL for a specific Internet Archive item, with no guessing and no missing pieces. From there, we dug into:
- The canonical metadata‑driven method using /metadata/IDENTIFIER and files[].
- The exact download pattern /download/IDENTIFIER/FILENAME.
- Strange storage nodes and shard URLs like ia801005.us.archive.org/20.
- A regex‑powered bookmarklet that automates the entire process in your browser.
- A pandas‑style catalog of URL patterns and node families.
- Self‑validating regex and normalization rules that you test before trusting.
The deeper idea beneath all of this: you’re not just clicking a website. You’re treating the Internet Archive like a structured, inspectable system — one that you can map, validate, and automate. Once you see it that way, you stop guessing and start designing workflows you can rely on.