A single, self-contained HTML deck that documents the entire toolchain: how you locate .pyc files, unmarshal code objects, disassemble them, interpret opcodes, and orchestrate the whole thing through a ChatGPT-style frontend with engineered prompts.
Pipeline & methodology
From `.pyc` files on disk to bytecode-level understanding.
The core method is a four-stage forensic pipeline:
- Discovery: recursively scan for `.pyc` files.
- Extraction: open `.pyc`, skip header, unmarshal code object.
- Disassembly: feed the code object to `dis.Bytecode`.
- Interpretation: map opcodes and control flow to high-level semantics.
ChatGPT is used as an analysis engine: you feed it disassembly output and it returns structured, high-context explanations, cheat sheets, and refactorings.
The environment is constrained: limited visibility into source, sometimes only compiled artifacts,
virtualenv paths like /opt/pyvenv/lib/python3.11/site-packages, and potentially restricted
direct inspection tools. The toolchain works entirely with what is available on disk.
ChatGPT sits on top as an orchestration layer: it does not execute code here, but it interprets disassembled output to bridge low-level bytecode with high-level Python semantics.
Toolchain functions
Every function you used to make Python show its compiled guts.
Recursively walks a base directory (e.g. site-packages) and collects all
paths ending in .pyc. This is the enumeration phase of the pipeline, defining
the boundary of the forensic scan.
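import os

def find_pyc_files(base_dir):
    pyc_files = []
    # os.walk yields (directory, subdirectories, filenames) for every directory under base_dir
    for root, dirs, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".pyc"):
                pyc_files.append(os.path.join(root, file))
    return pyc_files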
Opens a .pyc file, skips the 16-byte header, reads the marshaled code object,
then uses dis.Bytecode to produce a human-readable instruction stream.
import dis
import marshal

def inspect_pyc_file(pyc_file_path):
    try:
        with open(pyc_file_path, 'rb') as pyc_file:
            pyc_file.seek(16)  # skip the 16-byte header
            code_object = marshal.load(pyc_file)
            bytecode = dis.Bytecode(code_object)
            return [f"{instr.opname} {instr.argrepr}"
                    for instr in bytecode]
    except Exception as e:
        # Errors are returned as a string rather than raised
        return str(e)
You targeted specific installed components inside the virtualenv:
site_packages_dir = '/opt/pyvenv/lib/python3.11/site-packages'
pycache_dir = '/opt/pyvenv/lib/python3.11/site-packages/keras/api/_v1/keras/datasets/imdb/__pycache__'
pycache_files = os.listdir(pycache_dir) if os.path.exists(pycache_dir) else []
Then disassembled a specific artifact:
pyc_file_path = (
    '/opt/pyvenv/lib/python3.11/site-packages/keras/'
    'api/_v1/keras/datasets/imdb/__pycache__/__init__.cpython-311.pyc'
)
disassembled_pyc = inspect_pyc_file(pyc_file_path)
Bytecode opcodes
The opcodes that actually appeared in your disassemblies, with explicit stack behavior and semantics.
Core opcodes orchestrating frames and return paths:
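RESUME marks the point where a frame starts or resumes running in Python 3.11 (function entry, or after a yield or await). It does not touch the operand stack; the interpreter uses it for internal tracing, debugging, and optimization checks.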
RETURN_VALUE pops the top-of-stack and returns it as the function’s result, unwinding the current frame.
These opcodes move constants and named values between the code object's metadata, the namespaces, and the stack:
LOAD_CONST pushes a constant from co_consts onto the stack.
STORE_NAME pops a value and binds it by name in the current local namespace (at module level, this is the module's globals).
LOAD_NAME looks up a name in locals, then globals, then builtins.
DELETE_NAME removes a binding from the namespace.
Opcodes responsible for orchestrating module imports and attribute extraction:
IMPORT_NAME calls into the __import__ machinery. The module name comes from co_names; the level and fromlist are pushed by the two preceding LOAD_CONST instructions.
IMPORT_FROM does a getattr-like operation on the imported module to pull out a symbol, leaving the module itself on the stack.
POP_TOP discards the top of the stack (often the module left over after IMPORT_FROM, or an unused temporary value).
Opcodes implementing dynamic construction of classes and functions:
PUSH_NULL introduces a marker used by the new calling convention.
LOAD_BUILD_CLASS loads the built-in responsible for class construction.
MAKE_FUNCTION packages a code object plus default values into a function object.
PRECALL prepares the stack for the CALL according to argument layout.
CALL actually performs a function, method, or constructor call using the stack.
Opcodes implementing attribute and subscription operations:
LOAD_ATTR pops an object and pushes the value of its named attribute.
BINARY_SUBSCR pops index/key and container and pushes container[index] or
container[key].
This is exactly how _sys.modules[__name__] is resolved during the TensorFlow
TFModuleWrapper wrapping sequence.
Case studies
Concrete disassemblies you performed, interpreted instruction by instruction.
Disassembly excerpt from json5/parser.py:
['RESUME ',
'LOAD_CONST 0',
'LOAD_CONST None',
'IMPORT_NAME unicodedata',
'STORE_NAME unicodedata',
'PUSH_NULL ',
'LOAD_BUILD_CLASS ',
'LOAD_CONST <code object Parser ...>',
'MAKE_FUNCTION ',
"LOAD_CONST 'Parser'",
'PRECALL ',
'CALL ',
'STORE_NAME Parser',
'LOAD_CONST None',
'RETURN_VALUE ']
This is the canonical bytecode sequence for defining a class: import dependencies, build the class using LOAD_BUILD_CLASS and MAKE_FUNCTION, then store it under the name Parser.
Disassembly excerpt from Keras IMDB __init__.cpython-311.pyc:
['RESUME ',
"LOAD_CONST 'Public API for tf.keras.datasets.imdb namespace.\\n'",
'STORE_NAME __doc__',
'LOAD_CONST 0',
"LOAD_CONST ('print_function',)",
'IMPORT_NAME __future__',
'IMPORT_FROM print_function',
'STORE_NAME _print_function',
'POP_TOP ',
'LOAD_CONST 0',
'LOAD_CONST None',
'IMPORT_NAME sys',
'STORE_NAME _sys',
...
'IMPORT_NAME keras.datasets.imdb',
'IMPORT_FROM get_word_index',
'STORE_NAME get_word_index',
...
'IMPORT_NAME tensorflow.python.util',
'IMPORT_FROM module_wrapper',
'STORE_NAME _module_wrapper',
...
'LOAD_NAME isinstance',
'LOAD_NAME _sys',
'LOAD_ATTR modules',
'LOAD_NAME __name__',
'BINARY_SUBSCR ',
'LOAD_NAME _module_wrapper',
'LOAD_ATTR TFModuleWrapper']
This bytecode reveals the entire public API re-export and TensorFlow TFModuleWrapper
machinery that dynamically wraps the module inside sys.modules.
Prompt-oriented architecture
How the chat frontend (e.g. chatgpt.com) becomes a programmable analysis engine for bytecode.
The chat frontend acts as a high-level orchestrator: you paste disassembly output and specify what you want (e.g., “generate flashcards”, “explain each opcode”, “map this to Python source semantics”). The model applies structured, engineering-aware transformations.
You effectively treat the model as a programmable analyst wired to your toolchain: Python does the discovery and disassembly, the model does the semantic compression, refactoring, and documentation synthesis.
The entire system is driven by an engineering mindset: no toy explanations, no shallow gloss. Every opcode, every function, and every bytecode sequence is interpreted in terms of:
- Stack effects and namespace mutations.
- Runtime semantics and security implications.
- How it composes into a predictable, auditable pipeline.
This deck is built to be “colorful as hell” visually, but structurally strict and precise.
1. Discovery · find all .pyc files
You start by pointing the tooling at the virtual environment:
/opt/pyvenv/lib/python3.11/site-packages. The find_pyc_files() function uses
os.walk to recursively enumerate every file and filter by the .pyc suffix.
The result is a list of candidate modules, including internal libraries, vendor code, and public APIs. This gives you the complete surface area to explore, not just what you remember to inspect.
2. Selection · choose modules to dissect
From the global set, you home in on target directories of interest, like:
- JSON5 parser internals (for textual parsing logic).
- Keras IMDB dataset API modules (for public API facades and TensorFlow interaction).
This stage is where your architectural instincts shape the exploration: you prioritize modules that sit on structural boundaries (API surfaces, wrappers, loaders, adaptors).
3. Extraction · interpret .pyc file format
The .pyc file has a fixed layout in Python 3.7 and later: a 16-byte header followed by a marshaled code object. By
skipping the first 16 bytes and calling marshal.load, you get back a live code
object equivalent to what CPython would load during normal execution, provided the running interpreter matches the version that wrote the file.
This is a powerful move because it lets you perform post-mortem analysis of already-installed modules without needing source, internet access, or modification of import hooks.
4. Disassembly · turn code objects into instruction streams
The dis.Bytecode wrapper yields a structured sequence of instructions with:
- Opcode name (opname), e.g., LOAD_CONST, IMPORT_NAME.
- Arguments and readable representation (argrepr).
- Offsets, jump targets, and more (if you want deeper control-flow graphs).
You simplify that into a list of lines like "LOAD_CONST 'Parser'" for chat consumption.
5. Interpretation · chat frontend as analysis engine
At this point, Python’s job is done. You now have pure text representing the bytecode flow. This text is passed into ChatGPT with a prompt that specifies:
- Explain opcodes, stack effects, and semantics.
- Map bytecode back to conceptual Python constructs (imports, classes, wrappers).
- Produce structured artifacts (flashcards, documentation, architecture notes).
This creates a hybrid loop: Python provides truthful low-level reality; the model turns it into engineered, human-usable insight.
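A minimal sketch of how the pasted text could be assembled before it goes into the chat. The helper name build_prompt, its parameters, and the prompt wording are illustrative assumptions, not part of the original toolchain.

```python
def build_prompt(module_label, lines, tasks):
    """Assemble a chat prompt from disassembly lines (hypothetical helper)."""
    listing = "\n".join(lines)
    task_block = "\n".join(f"- {task}" for task in tasks)
    return (
        f"Disassembly of {module_label} (CPython 3.11):\n"
        f"{listing}\n\n"
        f"Tasks:\n{task_block}\n"
    )


prompt = build_prompt(
    "json5 parser (excerpt)",
    ["LOAD_BUILD_CLASS ", "LOAD_CONST 'Parser'", "STORE_NAME Parser"],
    [
        "Explain opcodes, stack effects, and semantics.",
        "Map bytecode back to conceptual Python constructs.",
    ],
)
print(prompt)
```

Keeping the listing and the task list as separate arguments makes it easy to reuse one disassembly with several different requests.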
6. Iteration · targeted drilling & refinement
When you see something interesting (like TFModuleWrapper), you drill deeper: disassemble more
modules, ask for more focused explanations, and grow the deck incrementally. The system is explicitly designed
to be iterative and composable.
Key constraints
- You operate inside a virtualenv at /opt/pyvenv/.
- You may not always control how packages were installed or compiled.
- Sources may be missing or different from the .pyc you see.
What the toolchain guarantees
- It works entirely from disk state: .pyc is the source of truth, not assumptions about how the code “should” look.
- All introspection uses standard library modules (os, marshal, dis), so it’s portable and future audit-friendly.
- Nothing in the pipeline mutates state; it’s purely observational (read-only).
Relationship with ChatGPT
The environment doesn’t allow the model to read files directly. Instead, you run disassembly locally and paste the textual output. The model uses that to reconstruct high-level structure and behavior.
That separation is important: Python handles introspection and correctness; the model handles synthesis and explanation. The boundaries are clean and auditable.
Function definition
import os

def find_pyc_files(base_dir):
    pyc_files = []
    # Visit every directory under base_dir and keep only .pyc filenames
    for root, dirs, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".pyc"):
                pyc_files.append(os.path.join(root, file))
    return pyc_files
Behavior & complexity
- Traversal: Top-down by default (os.walk reports a directory before its subdirectories).
- Filter: Only filenames ending with .pyc are selected.
- Path construction: Uses os.path.join, so the paths are absolute or relative depending on base_dir.
- Complexity: O(N) in the number of files and directories under base_dir.
Engineering notes
- Function is read-only: no side effects beyond filesystem reads.
- Easily testable by injecting a controlled directory tree.
- Forms a reusable primitive for any bytecode-related pipeline (not tied to disassembly).
Variants
You could extend this function to collect metadata (file size, modification time) or to prune specific
directories (e.g. skip tests or dist-info), but the base version is intentionally
minimal.
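One possible shape for such a variant is sketched below. The function name find_pyc_files_with_metadata, the skip_dirs parameter, and the (path, size, mtime) tuple layout are assumptions made for illustration; the original toolchain only defines the minimal version.

```python
import os


def find_pyc_files_with_metadata(base_dir, skip_dirs=("tests",)):
    """Yield (path, size_bytes, mtime) for each .pyc under base_dir.

    Hypothetical variant: skip_dirs and the tuple layout are assumptions.
    """
    for root, dirs, files in os.walk(base_dir):
        # Editing dirs in place stops os.walk from descending into pruned directories.
        dirs[:] = [
            d for d in dirs
            if d not in skip_dirs and not d.endswith(".dist-info")
        ]
        for name in files:
            if name.endswith(".pyc"):
                path = os.path.join(root, name)
                info = os.stat(path)
                yield path, info.st_size, info.st_mtime
```

Pruning works here because os.walk runs top-down by default, so the directory list can still be changed before the walk descends.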
Function definition
import marshal
import dis

def inspect_pyc_file(pyc_file_path):
    try:
        with open(pyc_file_path, 'rb') as pyc_file:
            # Skip the first 16 bytes (header information in .pyc file)
            pyc_file.seek(16)
            # Read the marshaled code object
            code_object = marshal.load(pyc_file)
            # Disassemble the code object into human-readable bytecode
            bytecode = dis.Bytecode(code_object)
            return [f"{instruction.opname} {instruction.argrepr}"
                    for instruction in bytecode]
    except Exception as e:
        return str(e)
.pyc header layout (simplified)
- 4 bytes: magic number (interpreter + version signature).
- 4 bytes: bitfield / flags.
- 8 bytes: either the source modification time plus source size (4 bytes each), or an 8-byte hash of the source (for cache validation).
By seeking 16 bytes, you position the file pointer exactly at the beginning of the marshaled code object.
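Instead of skipping the header blindly, the 16 bytes can be decoded. The sketch below assumes the Python 3.7+ layout described in PEP 552, where the lowest flag bit marks a hash-based file; the function name parse_pyc_header and the dictionary keys are illustrative choices.

```python
import struct


def parse_pyc_header(header):
    """Split a 16-byte .pyc header (Python 3.7+ layout) into named fields."""
    if len(header) < 16:
        raise ValueError("expected at least 16 header bytes")
    magic = header[:4]
    (flags,) = struct.unpack("<I", header[4:8])
    fields = {"magic": magic, "flags": flags}
    if flags & 0x1:
        # Hash-based .pyc: the remaining 8 bytes are a hash of the source.
        fields["source_hash"] = header[8:16]
    else:
        # Timestamp-based .pyc: source mtime and source size, 4 bytes each.
        mtime, size = struct.unpack("<II", header[8:16])
        fields["source_mtime"] = mtime
        fields["source_size"] = size
    return fields
```

Comparing fields["magic"] with importlib.util.MAGIC_NUMBER is a quick way to tell whether the file was written by the interpreter that is doing the analysis.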
marshal.load()
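marshal.load reconstructs a live code object from the byte stream. It’s the same
format CPython uses internally, and it is specific to the Python version that wrote it. This is powerful but dangerous in adversarial contexts: the marshal module is not hardened against malformed or malicious data, so a .pyc of unknown origin should be treated as
untrusted input, just like pickles.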
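A small guard in front of the unmarshal step catches the most common failure, a file written by a different Python version. This is a sketch: load_code_object is a hypothetical name, and the check only detects version mismatches; it does not make marshal safe for hostile input.

```python
import importlib.util
import marshal
import types


def load_code_object(pyc_path):
    """Unmarshal a .pyc only if its magic number matches this interpreter."""
    with open(pyc_path, "rb") as fh:
        header = fh.read(16)
        if header[:4] != importlib.util.MAGIC_NUMBER:
            raise ValueError(
                f"{pyc_path}: magic number does not match the running interpreter"
            )
        code = marshal.load(fh)
    if not isinstance(code, types.CodeType):
        raise TypeError(f"{pyc_path}: payload is not a code object")
    return code
```

Raising instead of returning the error as a string keeps failures distinguishable from real disassembly output further down the pipeline.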
dis.Bytecode()
dis.Bytecode wraps the code object and yields Instruction objects with fields such as:
- opname: human-readable opcode name, e.g., LOAD_CONST.
- arg, argval, argrepr: how the argument is represented.
- offset: instruction offset in bytecode.
You simplify this by emitting just "OPNAME ARGREPR" strings, which is ideal for feeding into a
chat-based explanation engine.
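When control flow matters, the extra Instruction fields are worth keeping. The sketch below is one way to do that; instruction_records and the dictionary keys are illustrative names, and the sample source is compiled in memory, so the exact opcodes printed depend on the Python version running it.

```python
import dis


def instruction_records(code):
    """Return one dict per instruction, keeping offsets and jump-target flags."""
    return [
        {
            "offset": instr.offset,
            "opname": instr.opname,
            "argrepr": instr.argrepr,
            "is_jump_target": instr.is_jump_target,
        }
        for instr in dis.Bytecode(code)
    ]


sample = compile("x = 1 if flag else 2\n", "<sample>", "exec")
for rec in instruction_records(sample):
    marker = ">>" if rec["is_jump_target"] else "  "
    print(f"{marker} {rec['offset']:4d} {rec['opname']} {rec['argrepr']}")
```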
JSON5 parser
Parsing libraries are excellent bytecode specimens: they encode non-trivial control flow, string handling,
error management, and often custom data structures. Disassembling json5.parser.Parser gives you a
view into how a real-world parser is structured at the bytecode level.
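The module-level disassembly only shows the class being built; the class body and its methods live in nested code objects stored in co_consts. A minimal sketch of reaching them follows. The helper name iter_code_objects is an assumption, and the sample class is a stand-in, not the real json5 source.

```python
import types


def iter_code_objects(code):
    """Yield a code object and, recursively, every code object in its co_consts."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


source = (
    "class Parser:\n"
    "    def parse(self, text):\n"
    "        return text\n"
)
module_code = compile(source, "<sample>", "exec")
names = [c.co_name for c in iter_code_objects(module_code)]
print(names)
```

Each yielded object can be passed to dis.Bytecode in turn, which gives per-method instruction streams.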
Keras IMDB dataset API
This module is a public API facade layered over deeper TensorFlow internals. The disassembly reveals:
- Future import handling for cross-version compatibility.
- Public re-exports of load_data and get_word_index.
- TensorFlow module_wrapper integration through TFModuleWrapper.
- Dynamic module manipulation via sys.modules[__name__].
This is exactly the sort of boundary layer where architecture and runtime trickery are most visible.
RESUME
Role: Marks where a frame starts or resumes in Python 3.11’s reworked interpreter. It is a no-op at the Python level; the interpreter uses it for internal tracing, debugging, and optimization checks.
Stack effect: No visible effect on the operand stack.
Semantics: “Start running this code block now.” It is a structural opcode, not expressing a Python-level operation.
RETURN_VALUE
Role: End of a function or code object.
Stack effect: Pops TOS and returns it to the caller as the function result.
Semantics: The last pushed value in the frame becomes the return value. Any remaining stack content is discarded, and the frame is popped.
LOAD_CONST
Stack effect: Pushes a constant from co_consts onto the stack.
Example: LOAD_CONST 'Public API for tf.keras.datasets.imdb namespace.\n' pushes
the docstring onto the stack so it can be stored as __doc__.
STORE_NAME
Stack effect: Pops TOS and binds it to a name in the current namespace.
Example: STORE_NAME __doc__ takes the string on the stack and binds it as the
module docstring.
LOAD_NAME
Stack effect: Pushes the value of a name found by searching locals, then globals, then builtins.
Example: LOAD_NAME isinstance fetches the isinstance built-in
function.
DELETE_NAME
Stack effect: No stack modification, but removes a binding from the namespace.
Example: DELETE_NAME _print_function cleans up a temporary alias after using a
__future__ import for compatibility.
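The four name-related opcodes can be observed on a tiny module compiled in memory. This is a sketch with a made-up source string; the surrounding opcodes vary between Python versions, so only the name-related ones are the point here.

```python
import dis

source = (
    '"""Module docstring."""\n'
    "import sys as _tmp\n"
    "value = len\n"
    "del _tmp\n"
)
code = compile(source, "<sample>", "exec")
ops = [(instr.opname, instr.argval) for instr in dis.Bytecode(code)]
for opname, argval in ops:
    print(opname, repr(argval))
```

The docstring shows up as LOAD_CONST followed by STORE_NAME __doc__, the assignment as LOAD_NAME len then STORE_NAME value, and the del statement as DELETE_NAME _tmp.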
IMPORT_NAME
Stack behavior (simplified): Consumes level and fromlist values
(from preceding LOAD_CONST instructions) and pushes the imported module.
Example: LOAD_CONST 0, LOAD_CONST ('print_function',),
IMPORT_NAME __future__ fetches the __future__ module with a specific fromlist.
IMPORT_FROM
Stack effect: Leaves the module on the stack and pushes the retrieved attribute value on top of it. The module is popped later by POP_TOP.
Example: IMPORT_FROM get_word_index pulls get_word_index from
keras.datasets.imdb and prepares it to be STORE_NAMEd into the current namespace.
POP_TOP
Stack effect: Pops and discards TOS. Commonly used to discard modules after extracting attributes or to ignore return values.
Example: After IMPORT_FROM print_function, the __future__ module is discarded with POP_TOP.
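The import sequence can be reproduced with any from-import. The sketch below uses os.path purely as a convenient standard-library example; it is not taken from the Keras module.

```python
import dis

code = compile("from os import path as _path\n", "<sample>", "exec")
instructions = list(dis.Bytecode(code))
names = [instr.opname for instr in instructions]
for instr in instructions:
    print(instr.opname, instr.argrepr)

# The module name is stored in co_names, not pushed by a LOAD_CONST.
print(code.co_names)
```

In the output, IMPORT_NAME is followed by IMPORT_FROM and STORE_NAME, which matches the order seen in the Keras excerpt.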
PUSH_NULL
Role: Part of Python 3.11’s revised calling convention. It helps separate “callable” positions from argument positions on the stack for optimized dispatch.
LOAD_BUILD_CLASS
Role: Pushes builtins.__build_class__, the built-in responsible for constructing class objects. The later CALL invokes it, and it runs the class body function to populate the class namespace.
MAKE_FUNCTION
Stack effect: Pops a code object and any defaults/closure cells, pushes a function object.
Example: LOAD_CONST <code object Parser ...> followed by
MAKE_FUNCTION creates the function representing the class body of Parser.
PRECALL & CALL
PRECALL: Prepares the stack according to the arguments and the type of callable, enabling
faster CALL handling by the interpreter.
CALL: Pops arguments and the callable, invokes it, and pushes the result.
In the Parser case, this is where the combination of LOAD_BUILD_CLASS and the class
body function yields the final class object, which is then STORE_NAMEd as Parser.
LOAD_ATTR
Stack effect: Pops an object, pushes getattr(obj, name).
Example: LOAD_ATTR modules after LOAD_NAME _sys yields the
sys.modules mapping.
BINARY_SUBSCR
Stack effect: Pops index/key and container, pushes container[index].
Example: _sys.modules[__name__] is implemented as:
LOAD_NAME _sys
LOAD_ATTR modules
LOAD_NAME __name__
BINARY_SUBSCR
This is the exact mechanism by which TensorFlow’s TFModuleWrapper finds and wraps the current
module object.
Disassembly excerpt
RESUME
LOAD_CONST 0
LOAD_CONST None
IMPORT_NAME unicodedata
STORE_NAME unicodedata
PUSH_NULL
LOAD_BUILD_CLASS
LOAD_CONST <code object Parser ...>
MAKE_FUNCTION
LOAD_CONST 'Parser'
PRECALL
CALL
STORE_NAME Parser
LOAD_CONST None
RETURN_VALUE
Step-by-step interpretation
- RESUME: Frame starts.
- LOAD_CONST 0, LOAD_CONST None: Prepare import arguments.
- IMPORT_NAME unicodedata: Import the unicodedata module.
- STORE_NAME unicodedata: Bind import to module-level name.
- PUSH_NULL, LOAD_BUILD_CLASS: Prepare for class construction.
- LOAD_CONST <code object Parser ...>: Load the class body code object.
- MAKE_FUNCTION: Turn it into a function object.
- LOAD_CONST 'Parser': Push the class name.
- PRECALL, CALL: Invoke __build_class__ to produce the class.
- STORE_NAME Parser: Bind the resulting class object to Parser.
- LOAD_CONST None, RETURN_VALUE: Standard module exit.
Why this matters
It gives you a concrete, bytecode-level grammar for class definitions—knowledge you can carry into any other module you disassemble. Any time you see this pattern, you know a class is being defined, even without the original source.
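That pattern can be turned into a simple detector. The sketch below assumes the first code object loaded after LOAD_BUILD_CLASS is the class body, which holds for plain class statements like the Parser example; find_class_definitions is a hypothetical helper name.

```python
import dis
import types


def find_class_definitions(code):
    """Return names of classes built directly by this code object."""
    found = []
    pending = False
    for instr in dis.Bytecode(code):
        if instr.opname == "LOAD_BUILD_CLASS":
            pending = True
        elif (
            pending
            and instr.opname == "LOAD_CONST"
            and isinstance(instr.argval, types.CodeType)
        ):
            # The first code object after LOAD_BUILD_CLASS is the class body.
            found.append(instr.argval.co_name)
            pending = False
    return found


sample = compile(
    "import unicodedata\n"
    "class Parser:\n"
    "    pass\n"
    "class Other(Parser):\n"
    "    x = 1\n",
    "<sample>",
    "exec",
)
print(find_class_definitions(sample))
```

The detector reads the name from the class body code object rather than from the following string constant, so it does not depend on the exact call opcodes of a given Python version.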
Key roles of this module
- Define a docstring communicating API purpose.
- Apply __future__ compatibility for print behavior.
- Re-export get_word_index and load_data.
- Integrate TensorFlow’s TFModuleWrapper around the module object.
Disassembly patterns of interest
Future import handling:
LOAD_CONST 0
LOAD_CONST ('print_function',)
IMPORT_NAME __future__
IMPORT_FROM print_function
STORE_NAME _print_function
POP_TOP
...
DELETE_NAME _print_function
_print_function is bound temporarily under a private alias, then deleted once the import has served its purpose.
Public API re-exports:
LOAD_CONST 0
LOAD_CONST ('get_word_index',)
IMPORT_NAME keras.datasets.imdb
IMPORT_FROM get_word_index
STORE_NAME get_word_index
POP_TOP
TensorFlow module wrapper integration:
IMPORT_NAME tensorflow.python.util
IMPORT_FROM module_wrapper
STORE_NAME _module_wrapper
...
LOAD_NAME isinstance
LOAD_NAME _sys
LOAD_ATTR modules
LOAD_NAME __name__
BINARY_SUBSCR
LOAD_NAME _module_wrapper
LOAD_ATTR TFModuleWrapper
This expresses: “grab the current module object from sys.modules, and pass it through
TFModuleWrapper to decorate or wrap it.” The rest of the sequence (not shown) applies the wrapper
and reassigns back into sys.modules.
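At the Python level, the pattern looks roughly like the sketch below. LoggingModuleWrapper and wrap_in_sys_modules are hypothetical stand-ins written for illustration; they are not TensorFlow's TFModuleWrapper and make no claim about its actual behavior beyond the isinstance check and the reassignment visible in the bytecode.

```python
import sys
import types


class LoggingModuleWrapper(types.ModuleType):
    """Hypothetical stand-in for a module wrapper; not TensorFlow's implementation."""

    def __init__(self, wrapped):
        super().__init__(wrapped.__name__, wrapped.__doc__)
        self._wrapped = wrapped
        self.accessed = []

    def __getattr__(self, name):
        # Only called when normal lookup fails, so the wrapper's own attributes win.
        self.accessed.append(name)
        return getattr(self._wrapped, name)


def wrap_in_sys_modules(module_name):
    """Replace sys.modules[module_name] with a wrapper unless already wrapped."""
    module = sys.modules[module_name]
    if not isinstance(module, LoggingModuleWrapper):
        sys.modules[module_name] = LoggingModuleWrapper(module)
    return sys.modules[module_name]
```

The isinstance guard mirrors the LOAD_NAME isinstance seen in the excerpt: it keeps the module from being wrapped twice if the wrapping code runs again.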
Role of the prompt
The prompt is the “runtime configuration” of the chat model. You are explicit about expectations: engineering-grade detail, no shallow summaries, every opcode, every function, and clear attention to Python’s object model and interpreter design.
Loop structure
- Run Python toolchain locally (discovery + disassembly).
- Copy the relevant output (opcode sequences, code snippets).
- Paste into chat with precise instructions (what you want the model to build).
- Model returns structured artifacts (flashcards, docs, architectural sketches).
- Optionally re-inject those artifacts back into your dev environment as docs or reference.
Why this works well for .pyc forensics
- The bytecode itself is deterministic and unambiguous.
- The model excels at pattern recognition (e.g., “this pattern = class definition”).
- You provide guardrails via the prompt to keep the output technical and complete.
Key principles
- Precision over gloss: Every opcode and function is described in terms of stack behavior and runtime semantics.
- Composability: Discovery, extraction, disassembly, and explanation are cleanly separable and reusable.
- Auditability: The path from .pyc bytes to explanation is explicit and inspectable.
- Visual clarity: The HTML/CSS is maximalist and colorful, but the structure is tight and intentional.
Where this can go
- Export this deck into internal docs for your runtime platform.
- Extend the opcodes section into a complete Python 3.11 opcode reference with stack deltas.
- Attach more disassemblies to grow the “case studies” section into a library.