Python .pyc Forensics & Bytecode Disassembly – Full Interactive Deck

BYTECODE LAB

Python .pyc forensics · engineered


Python 3.11 .PYC forensics

Disassembly, Opcode Semantics & Chat-Driven Runtime Exploration

A single, self-contained HTML deck that documents the entire toolchain: how you locate .pyc files, unmarshal code objects, disassemble them, interpret opcodes, and orchestrate the whole thing through a ChatGPT-style frontend with engineered prompts.

Python 3.11 bytecode · stack machine

.pyc header parsing · marshal · dis

Toolchain functions · pipeline reasoning

Prompted via chat front end

Pipeline & methodology

From `.pyc` files on disk to bytecode-level understanding.

End-to-end pipeline

overview

The core method is a four-stage forensic pipeline:

Discovery: recursively scan for `.pyc` files.
Extraction: open `.pyc`, skip header, unmarshal code object.
Disassembly: feed code object to dis.Bytecode.
Interpretation: map opcodes and control flow to high-level semantics.

ChatGPT is used as an analysis engine: you feed it disassembly output and it returns structured, high-context explanations, cheat sheets, and refactorings.
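The first three stages can be sketched end to end in a few lines. This is a minimal illustration, not the deck's actual tooling: the helper name disassemble_tree and the throwaway demo module are assumptions made for the example, and it only works on .pyc files written by the same Python version that runs it.

```python
import dis
import marshal
import py_compile
import tempfile
from pathlib import Path

HEADER_SIZE = 16  # .pyc header length on CPython 3.7 and later


def disassemble_tree(base_dir):
    """Map each .pyc path under base_dir to its top-level instruction lines."""
    results = {}
    for pyc_path in sorted(Path(base_dir).rglob("*.pyc")):  # discovery
        with open(pyc_path, "rb") as fh:                     # extraction
            fh.seek(HEADER_SIZE)
            code = marshal.load(fh)
        results[str(pyc_path)] = [                           # disassembly
            f"{ins.opname} {ins.argrepr}".rstrip()
            for ins in dis.get_instructions(code)
        ]
    return results


# Demo on a throwaway module compiled by the running interpreter.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "demo.py"
    src.write_bytes(b"import sys\nx = 1\n")
    py_compile.compile(str(src), cfile=str(Path(tmp) / "demo.pyc"))
    listing = disassemble_tree(tmp)

for path, lines in listing.items():
    print(path)
    print("\n".join(lines))
```

The fourth stage, interpretation, starts from the text this produces.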

Scope · All installed packages / virtualenv

Environment constraints

runtime

The environment is constrained: limited visibility into source, sometimes only compiled artifacts, virtualenv paths like /opt/pyvenv/lib/python3.11/site-packages, and potentially restricted direct inspection tools. The toolchain works entirely with what is available on disk.

ChatGPT sits on top as an orchestration layer: it does not execute code here, but it interprets disassembled output to bridge low-level bytecode with high-level Python semantics.

No source required · works on opaque wheels

Toolchain functions

Every function you used to make Python show its compiled guts.

find_pyc_files()

discovery

Recursively walks a base directory (e.g. site-packages) and collects all paths ending in .pyc. This is the enumeration phase of the pipeline, defining the boundary of the forensic scan.

import os

def find_pyc_files(base_dir):
    pyc_files = []
    # os.walk yields every directory under base_dir, top-down
    for root, dirs, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".pyc"):
                pyc_files.append(os.path.join(root, file))
    return pyc_files

Pattern: Recursive filesystem walk

Input: base_dir path

Output: list[str] .pyc paths

Foundation for bulk disassembly

inspect_pyc_file()

disassembly

Opens a .pyc file, skips the 16-byte header, reads the marshaled code object, then uses dis.Bytecode to produce a human-readable instruction stream.

import dis
import marshal

def inspect_pyc_file(pyc_file_path):
    try:
        with open(pyc_file_path, 'rb') as pyc_file:
            pyc_file.seek(16)  # skip the 16-byte header
            code_object = marshal.load(pyc_file)
            bytecode = dis.Bytecode(code_object)
            return [f"{instr.opname} {instr.argrepr}"
                    for instr in bytecode]
    except Exception as e:
        # On failure the error message is returned as a string, not a list
        return str(e)

Header skip: first 16 bytes

Core libs: marshal, dis

Output: list[str] instructions

Transforms opaque .pyc into readable bytecode

Target selection

keras / json5

You targeted specific installed components inside the virtualenv:

site_packages_dir = '/opt/pyvenv/lib/python3.11/site-packages'
pycache_dir = '/opt/pyvenv/lib/python3.11/site-packages/keras/api/_v1/keras/datasets/imdb/__pycache__'
pycache_files = os.listdir(pycache_dir) if os.path.exists(pycache_dir) else []

Then disassembled a specific artifact:

pyc_file_path = (
    '/opt/pyvenv/lib/python3.11/site-packages/keras/'
    'api/_v1/keras/datasets/imdb/__pycache__/__init__.cpython-311.pyc'
)
disassembled_pyc = inspect_pyc_file(pyc_file_path)

Focus: API facades & namespace wrappers

Bytecode opcodes

The opcodes that actually appeared in your disassemblies, with explicit stack behavior and semantics.

Frame & control

execution

Core opcodes orchestrating frames and return paths:

RESUME RETURN_VALUE

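RESUME is new in Python 3.11 and appears at the start of every code object. It is a no-op as far as the program is concerned: the interpreter uses it for internal tracing, debugging, and optimization checks, and its argument records whether the frame is starting fresh or resuming after a yield or await.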

RETURN_VALUE pops the top-of-stack and returns it as the function’s result, unwinding the current frame.

Entry & exit for frames

Constants & names

data / binding

These opcodes move constants and names between the code object's tables (co_consts, co_names) and the runtime stack and namespaces:

LOAD_CONST STORE_NAME LOAD_NAME DELETE_NAME

LOAD_CONST pushes a constant from co_consts onto the stack.
STORE_NAME pops a value and binds it to a name in the current local namespace (for module-level code, that is the module's global namespace).
LOAD_NAME looks up a name in locals, then globals, then builtins.
DELETE_NAME removes a binding from the namespace.
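The stack behavior described above can be checked against the running interpreter with dis.stack_effect. This is a small sketch; the helper name stack_effect_of is an assumption for the example, and the placeholder argument 0 is only there because opcodes that take an argument require one.

```python
import dis


def stack_effect_of(opname):
    """Return the net stack-depth change of one opcode on this interpreter."""
    opcode = dis.opmap[opname]
    try:
        # Opcodes that take an argument need one; 0 is a placeholder here.
        return dis.stack_effect(opcode, 0)
    except ValueError:
        # Some versions reject an argument for argument-less opcodes.
        return dis.stack_effect(opcode)


for name in ("LOAD_CONST", "STORE_NAME", "LOAD_NAME", "DELETE_NAME", "POP_TOP"):
    print(f"{name:12} {stack_effect_of(name):+d}")
```

A positive number means the opcode pushes, a negative number means it pops, and zero means the stack depth is unchanged.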

Bridge between code object tables & runtime namespace

Import machinery

import system

Opcodes responsible for orchestrating module imports and attribute extraction:

IMPORT_NAME IMPORT_FROM POP_TOP

IMPORT_NAME calls into the __import__ machinery. The module name comes from the code object's co_names table; the two preceding LOAD_CONST instructions push the import level and the fromlist.
IMPORT_FROM does a getattr-like operation on the imported module to pull out a symbol.
POP_TOP discards the top of the stack (often unused imports or temporary values).
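In plain Python, the import opcodes correspond roughly to the calls below. This is an approximation for illustration, using json and dumps as stand-in names; the real IMPORT_FROM has extra fallback logic for submodules that is not shown here.

```python
import json

# Roughly what IMPORT_NAME does for "from json import dumps":
# the level (0 = absolute import) and the fromlist ("dumps",) come from
# the stack, while the module name comes from the code object's co_names.
module = __import__("json", globals(), locals(), ("dumps",), 0)

# Roughly what IMPORT_FROM does: fetch one attribute from that module.
dumps = getattr(module, "dumps")

print(module.__name__, dumps is json.dumps)
```

After the last IMPORT_FROM, the module object is still on the stack, which is why a POP_TOP follows in the disassembly.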

Reveals module wiring & public API facades

Class & function creation

object model

Opcodes implementing dynamic construction of classes and functions:

PUSH_NULL LOAD_BUILD_CLASS MAKE_FUNCTION PRECALL CALL

PUSH_NULL introduces a marker used by the new calling convention.
LOAD_BUILD_CLASS loads the built-in responsible for class construction.
MAKE_FUNCTION packages a code object plus default values into a function object.
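PRECALL prefixes CALL in Python 3.11. It is logically a no-op that exists to make call specialization effective, and it was removed in Python 3.12.
CALL performs the function, method, or constructor call, taking the callable and its arguments from the stack.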

Where your class definitions become live objects

Attributes & subscripts

data access

Opcodes implementing attribute and subscription operations:

LOAD_ATTR BINARY_SUBSCR

LOAD_ATTR pops an object and pushes the value of its named attribute.
BINARY_SUBSCR pops index/key and container and pushes container[index] or container[key].

This is exactly how _sys.modules[__name__] is resolved during the TensorFlow TFModuleWrapper wrapping sequence.

Microscopic view of attribute access patterns

Case studies

Concrete disassemblies you performed, interpreted instruction by instruction.

json5.parser.Parser

class construction

Disassembly excerpt from json5/parser.py:

['RESUME ',
 'LOAD_CONST 0',
 'LOAD_CONST None',
 'IMPORT_NAME unicodedata',
 'STORE_NAME unicodedata',
 'PUSH_NULL ',
 'LOAD_BUILD_CLASS ',
 'LOAD_CONST <code object Parser ...>',
 'MAKE_FUNCTION ',
 "LOAD_CONST 'Parser'",
 'PRECALL ',
 'CALL ',
 'STORE_NAME Parser',
 'LOAD_CONST None',
 'RETURN_VALUE ']

This is the canonical bytecode sequence for defining a class: import dependencies, build the class using LOAD_BUILD_CLASS and MAKE_FUNCTION, then store it under the name Parser.

Direct look at CPython's class-definition machinery

keras.datasets.imdb API wrapper

namespace facade

Disassembly excerpt from Keras IMDB __init__.cpython-311.pyc:

['RESUME ',
 "LOAD_CONST 'Public API for tf.keras.datasets.imdb namespace.\\n'",
 'STORE_NAME __doc__',
 'LOAD_CONST 0',
 "LOAD_CONST ('print_function',)",
 'IMPORT_NAME __future__',
 'IMPORT_FROM print_function',
 'STORE_NAME _print_function',
 'POP_TOP ',
 'LOAD_CONST 0',
 'LOAD_CONST None',
 'IMPORT_NAME sys',
 'STORE_NAME _sys',
 ...
 'IMPORT_NAME keras.datasets.imdb',
 'IMPORT_FROM get_word_index',
 'STORE_NAME get_word_index',
 ...
 'IMPORT_NAME tensorflow.python.util',
 'IMPORT_FROM module_wrapper',
 'STORE_NAME _module_wrapper',
 ...
 'LOAD_NAME isinstance',
 'LOAD_NAME _sys',
 'LOAD_ATTR modules',
 'LOAD_NAME __name__',
 'BINARY_SUBSCR ',
 'LOAD_NAME _module_wrapper',
 'LOAD_ATTR TFModuleWrapper']

This bytecode reveals the entire public API re-export and TensorFlow TFModuleWrapper machinery that dynamically wraps the module inside sys.modules.

Runtime wiring between Keras and TensorFlow internals

Prompt-oriented architecture

How the chat frontend (e.g. chatgpt.com) becomes a programmable analysis engine for bytecode.

Prompt as orchestrator

frontend logic

The chat frontend acts as a high-level orchestrator: you paste disassembly output and specify what you want (e.g., “generate flashcards”, “explain each opcode”, “map this to Python source semantics”). The model applies structured, engineering-aware transformations.

You effectively treat the model as a programmable analyst wired to your toolchain: Python does the discovery and disassembly, the model does the semantic compression, refactoring, and documentation synthesis.

Human + Python + model = composite runtime

Engineering mindset

philosophy

The entire system is driven by an engineering mindset: no toy explanations, no shallow gloss. Every opcode, every function, and every bytecode sequence is interpreted in terms of:

Stack effects and namespace mutations.
Runtime semantics and security implications.
How it composes into a predictable, auditable pipeline.

This deck is built to be “colorful as hell” visually, but structurally strict and precise.

Readable aesthetics · strict correctness

1. Discovery · find all .pyc files

You start by pointing the tooling at the virtual environment: /opt/pyvenv/lib/python3.11/site-packages. The find_pyc_files() function uses os.walk to recursively enumerate every file and filter by the .pyc suffix.

The result is a list of candidate modules, including internal libraries, vendor code, and public APIs. This gives you the complete surface area to explore, not just what you remember to inspect.

2. Selection · choose modules to dissect

From the global set, you home in on target directories of interest, like:

JSON5 parser internals (for textual parsing logic).
Keras IMDB dataset API modules (for public API facades and TensorFlow interaction).

This stage is where your architectural instincts shape the exploration: you prioritize modules that sit on structural boundaries (API surfaces, wrappers, loaders, adaptors).

3. Extraction · interpret .pyc file format

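On CPython 3.7 and later, the .pyc file has a fixed layout: a 16-byte header followed by a marshaled code object. By skipping the first 16 bytes and calling marshal.load, you get back a live code object equivalent to what CPython would load during normal execution, provided the file was written by the same Python version as the interpreter doing the loading.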

This is a powerful move because it lets you perform post-mortem analysis of already-installed modules without needing source, internet access, or modification of import hooks.

4. Disassembly · turn code objects into instruction streams

The dis.Bytecode wrapper yields a structured sequence of instructions with:

Opcode name (opname), e.g., LOAD_CONST, IMPORT_NAME.
Arguments and readable representation (argrepr).
Offsets, jump targets, and more (if you want deeper control-flow graphs).

You simplify that into a list of lines like "LOAD_CONST 'Parser'" for chat consumption.

5. Interpretation · chat frontend as analysis engine

At this point, Python’s job is done. You now have pure text representing the bytecode flow. This text is passed into ChatGPT with a prompt that specifies:

Explain opcodes, stack effects, and semantics.
Map bytecode back to conceptual Python constructs (imports, classes, wrappers).
Produce structured artifacts (flashcards, documentation, architecture notes).

This creates a hybrid loop: Python provides truthful low-level reality; the model turns it into engineered, human-usable insight.
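The hand-off to the chat frontend is just text assembly. A minimal sketch follows; the function name build_analysis_prompt and the prompt wording are assumptions for illustration, not a fixed template, and the resulting string is pasted into the chat by hand.

```python
def build_analysis_prompt(instruction_lines, module_label, tasks):
    """Assemble a plain-text prompt from disassembly lines and a task list."""
    numbered = "\n".join(f"{n}. {task}" for n, task in enumerate(tasks, 1))
    listing = "\n".join(instruction_lines)
    return (
        f"Below is CPython 3.11 bytecode for {module_label}.\n\n"
        f"Tasks:\n{numbered}\n\n"
        f"Disassembly:\n{listing}\n"
    )


prompt = build_analysis_prompt(
    ["LOAD_CONST 0", "LOAD_CONST None", "IMPORT_NAME unicodedata"],
    "json5/parser.py",
    [
        "Explain each opcode and its stack effect.",
        "Map the sequence back to Python constructs.",
    ],
)
print(prompt)
```

Keeping the task list separate from the listing makes it easy to reuse the same disassembly with a different request.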

6. Iteration · targeted drilling & refinement

When you see something interesting (like TFModuleWrapper), you drill deeper: disassemble more modules, ask for more focused explanations, and grow the deck incrementally. The system is explicitly designed to be iterative and composable.

Function definition

import os

def find_pyc_files(base_dir):
    pyc_files = []
    # os.walk yields every directory under base_dir, top-down
    for root, dirs, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".pyc"):
                pyc_files.append(os.path.join(root, file))
    return pyc_files

Behavior & complexity

Traversal: Top-down by default, so each directory is visited before its subdirectories.
Filter: Only filenames ending with .pyc are selected.
Path construction: os.path.join(root, file) yields absolute paths when base_dir is absolute and relative paths otherwise.
Complexity: O(N) in the number of files and directories under base_dir.

Engineering notes

Function has no side effects beyond filesystem reads; its result depends only on the state of the directory tree.
Easily testable by injecting a controlled directory tree.
Forms a reusable primitive for any bytecode-related pipeline (not tied to disassembly).

Variants

You could extend this function to collect metadata (file size, modification time) or to prune specific directories (e.g. skip tests or dist-info), but the base version is intentionally minimal.
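One possible shape for such a variant is sketched below. The name find_pyc_files_with_metadata, the skip_dirs parameter, and the demo directory layout are all assumptions for illustration; the pruning relies on the documented os.walk behavior that editing dirs in place controls which subdirectories are visited.

```python
import os
import tempfile


def find_pyc_files_with_metadata(base_dir, skip_dirs=("tests",)):
    """Yield (path, size_bytes, mtime) for each .pyc file, pruning skip_dirs."""
    for root, dirs, files in os.walk(base_dir):
        # Editing dirs in place stops os.walk from descending into them.
        dirs[:] = [
            d for d in dirs
            if d not in skip_dirs and not d.endswith(".dist-info")
        ]
        for name in files:
            if name.endswith(".pyc"):
                path = os.path.join(root, name)
                info = os.stat(path)
                yield path, info.st_size, info.st_mtime


# Demo: one .pyc in pkg/ and one in pkg/tests/, which should be pruned.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("pkg", os.path.join("pkg", "tests")):
        os.makedirs(os.path.join(tmp, sub))
        with open(os.path.join(tmp, sub, "mod.pyc"), "wb") as fh:
            fh.write(b"\x00" * 16)
    found = list(find_pyc_files_with_metadata(tmp))

print(found)
```

Yielding tuples instead of building a list keeps memory flat on large site-packages trees; wrap the call in list() when you need the full set.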

Function definition

import marshal
import dis

def inspect_pyc_file(pyc_file_path):
    try:
        with open(pyc_file_path, 'rb') as pyc_file:
            # Skip the first 16 bytes (header information in .pyc file)
            pyc_file.seek(16)
            # Read the marshaled code object
            code_object = marshal.load(pyc_file)
            # Disassemble the code object into human-readable bytecode
            bytecode = dis.Bytecode(code_object)
            return [f"{instruction.opname} {instruction.argrepr}"
                    for instruction in bytecode]
    except Exception as e:
        return str(e)

.pyc header layout (simplified)

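4 bytes: magic number (identifies the bytecode version of the interpreter that wrote the file).
4 bytes: bit field of flags (selects timestamp-based or hash-based cache validation).
8 bytes: either a 4-byte source modification time plus a 4-byte source size, or an 8-byte hash of the source.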

By seeking 16 bytes, you position the file pointer exactly at the beginning of the marshaled code object.
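Instead of skipping the header blindly, you can decode it. The sketch below assumes the CPython 3.7+ layout; parse_pyc_header is a hypothetical helper name, and the demo compiles a throwaway file in timestamp mode so the result is predictable.

```python
import importlib.util
import py_compile
import struct
import tempfile
from pathlib import Path


def parse_pyc_header(raw):
    """Decode the 16-byte .pyc header used by CPython 3.7 and later."""
    magic = raw[:4]
    (flags,) = struct.unpack("<I", raw[4:8])
    header = {
        "magic_matches_running_interpreter": magic == importlib.util.MAGIC_NUMBER,
        "hash_based": bool(flags & 0x1),
    }
    if header["hash_based"]:
        header["source_hash"] = raw[8:16]
    else:
        header["mtime"], header["source_size"] = struct.unpack("<II", raw[8:16])
    return header


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "demo.py"
    src.write_bytes(b"x = 1\n")
    pyc = py_compile.compile(
        str(src),
        cfile=str(Path(tmp) / "demo.pyc"),
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    header = parse_pyc_header(Path(pyc).read_bytes()[:16])

print(header)
```

Checking the magic number first is a cheap way to avoid calling marshal.load on a file written by a different Python version.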

marshal.load()

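marshal.load reconstructs a live code object from the byte stream, using the same serialization format CPython uses for its own .pyc cache. The format is specific to the Python version, so the file must be loaded by a matching interpreter. This is powerful but dangerous in adversarial contexts: the marshal module is not designed to be safe against malicious data, so a .pyc of unknown origin should be treated as untrusted input, just like a pickle.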
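The code object you get back is a tree: class bodies and functions are stored as nested code objects inside co_consts, and they can be enumerated without executing anything. A minimal sketch, using compile() as a stand-in for marshal.load so that it runs anywhere; iter_code_objects is a hypothetical helper name.

```python
import types


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


source = "class Parser:\n    def parse(self):\n        return 1\n"
module_code = compile(source, "<demo>", "exec")  # compiled, never executed
names = [c.co_name for c in iter_code_objects(module_code)]
print(names)
```

This matters for inspect_pyc_file, because iterating dis.Bytecode over a module's code object lists only the module-level instructions, not the bodies of the classes and functions it defines.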

dis.Bytecode()

dis.Bytecode wraps the code object and yields Instruction objects with fields such as:

opname: human-readable opcode name, e.g., LOAD_CONST.
arg, argval, argrepr: how the argument is represented.
offset: instruction offset in bytecode.

You simplify this by emitting just "OPNAME ARGREPR" strings, which is ideal for feeding into a chat-based explanation engine.
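To see what is dropped when only opname and argrepr are kept, the sketch below prints the other Instruction fields for a tiny snippet. The snippet and the column layout are illustrative choices; exact opcodes and offsets vary between Python versions.

```python
import dis

code = compile("x = 1\nprint(x)\n", "<demo>", "exec")

# One tuple per instruction: the fields named above, in order.
rows = [
    (ins.offset, ins.opname, ins.arg, ins.argval, ins.argrepr)
    for ins in dis.Bytecode(code)
]

for offset, opname, arg, argval, argrepr in rows:
    print(f"{offset:4} {opname:16} arg={arg!r:6} argval={argval!r:10} {argrepr}")

# The simplified form used for chat consumption.
simplified = [f"{opname} {argrepr}".rstrip() for _, opname, _, _, argrepr in rows]
```

Here arg is the raw integer operand, argval is the resolved Python object (a constant, a name), and argrepr is its printable form.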

Disassembly excerpt

RESUME
LOAD_CONST 0
LOAD_CONST None
IMPORT_NAME unicodedata
STORE_NAME unicodedata
PUSH_NULL
LOAD_BUILD_CLASS
LOAD_CONST <code object Parser ...>
MAKE_FUNCTION
LOAD_CONST 'Parser'
PRECALL
CALL
STORE_NAME Parser
LOAD_CONST None
RETURN_VALUE

Step-by-step interpretation

RESUME: Frame starts.
LOAD_CONST 0, LOAD_CONST None: Push the import level (0, absolute) and the fromlist (None).
IMPORT_NAME unicodedata: Import the unicodedata module.
STORE_NAME unicodedata: Bind import to module-level name.
PUSH_NULL, LOAD_BUILD_CLASS: Prepare for class construction.
LOAD_CONST <code object Parser ...>: Load the class body code object.
MAKE_FUNCTION: Turn it into a function object.
LOAD_CONST 'Parser': Push the class name.
PRECALL, CALL: Invoke __build_class__ to produce the class.
STORE_NAME Parser: Bind the resulting class object to Parser.
LOAD_CONST None, RETURN_VALUE: Standard module exit.

Why this matters

It gives you a concrete, bytecode-level grammar for class definitions—knowledge you can carry into any other module you disassemble. Any time you see this pattern, you know a class is being defined, even without the original source.
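The pattern is regular enough to detect mechanically. A rough sketch under that assumption: find_class_definitions is a hypothetical helper that treats the first code-object constant loaded after each LOAD_BUILD_CLASS as the class body and reports its co_name. It only looks at one code object, so nested classes are not found.

```python
import dis
import types


def find_class_definitions(code):
    """Return names of classes defined directly in a code object."""
    instructions = list(dis.get_instructions(code))
    names = []
    for index, ins in enumerate(instructions):
        if ins.opname != "LOAD_BUILD_CLASS":
            continue
        # The class body is the next code-object constant that gets loaded.
        for later in instructions[index + 1:]:
            is_code = isinstance(later.argval, types.CodeType)
            if later.opname == "LOAD_CONST" and is_code:
                names.append(later.argval.co_name)
                break
    return names


module_code = compile(
    "import unicodedata\nclass Parser:\n    pass\n", "<demo>", "exec"
)
class_names = find_class_definitions(module_code)
print(class_names)
```

The same function can be pointed at a code object loaded from a .pyc file to list its module-level classes without source.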

Key roles of this module

Define a docstring communicating API purpose.
Apply __future__ compatibility for print behavior.
Re-export get_word_index and load_data.
Integrate TensorFlow’s TFModuleWrapper around the module object.

Disassembly patterns of interest

Future import handling:

LOAD_CONST 0
LOAD_CONST ('print_function',)
IMPORT_NAME __future__
IMPORT_FROM print_function
STORE_NAME _print_function
POP_TOP
...
DELETE_NAME _print_function

The imported feature is bound temporarily as _print_function and later deleted, so the helper name does not remain in the module's public namespace.

Public API re-exports:

LOAD_CONST 0
LOAD_CONST ('get_word_index',)
IMPORT_NAME keras.datasets.imdb
IMPORT_FROM get_word_index
STORE_NAME get_word_index
POP_TOP

TensorFlow module wrapper integration:

IMPORT_NAME tensorflow.python.util
IMPORT_FROM module_wrapper
STORE_NAME _module_wrapper
...
LOAD_NAME isinstance
LOAD_NAME _sys
LOAD_ATTR modules
LOAD_NAME __name__
BINARY_SUBSCR
LOAD_NAME _module_wrapper
LOAD_ATTR TFModuleWrapper

This expresses: “grab the current module object from sys.modules, and pass it through TFModuleWrapper to decorate or wrap it.” The rest of the sequence (not shown) applies the wrapper and reassigns back into sys.modules.
