Python .pyc Forensics & Bytecode Disassembly – Full Interactive Deck

Pipeline & methodology

From `.pyc` files on disk to bytecode-level understanding.

End-to-end pipeline
overview

The core method is a four-stage forensic pipeline:

  • Discovery: recursively scan for `.pyc` files.
  • Extraction: open `.pyc`, skip header, unmarshal code object.
  • Disassembly: feed code object to dis.Bytecode.
  • Interpretation: map opcodes and control flow to high-level semantics.
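
The first three stages can be sketched in one pass. This is a minimal illustration, not the deck's exact tooling: `run_pipeline`, `HEADER_SIZE`, and the throwaway `demo.py` module are names invented here, and the sketch assumes the `.pyc` files were written by the same Python version that runs it (3.7 or later).

```python
import dis
import marshal
import os
import py_compile
import tempfile

HEADER_SIZE = 16  # .pyc header length on Python 3.7+ (assumption stated above)


def run_pipeline(base_dir):
    """Discover, extract, and disassemble every .pyc under base_dir."""
    results = {}
    for root, _dirs, files in os.walk(base_dir):        # 1. discovery
        for name in files:
            if not name.endswith(".pyc"):
                continue
            path = os.path.join(root, name)
            with open(path, "rb") as fh:                 # 2. extraction
                fh.seek(HEADER_SIZE)
                code = marshal.load(fh)
            results[path] = [                            # 3. disassembly
                f"{ins.opname} {ins.argrepr}" for ins in dis.Bytecode(code)
            ]
    return results  # 4. interpretation happens downstream, outside Python


# Demo on a throwaway module so the sketch does not depend on any installed package.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as fh:
        fh.write("import sys as _sys\nVALUE = 42\n")
    py_compile.compile(src, cfile=os.path.join(tmp, "demo.pyc"), doraise=True)
    demo_results = run_pipeline(tmp)

for path, instructions in demo_results.items():
    print(os.path.basename(path), len(instructions), "instructions")
```

The returned dictionary maps each `.pyc` path to its instruction list, which is the text you would hand to the interpretation stage.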

ChatGPT is used as an analysis engine: you feed it disassembly output and it returns structured, high-context explanations, cheat sheets, and refactorings.

Environment constraints
runtime

The environment is constrained: limited visibility into source, sometimes only compiled artifacts, virtualenv paths like /opt/pyvenv/lib/python3.11/site-packages, and potentially restricted direct inspection tools. The toolchain works entirely with what is available on disk.

ChatGPT sits on top as an orchestration layer: it does not execute code here, but it interprets disassembled output to bridge low-level bytecode with high-level Python semantics.

Toolchain functions

Every function you used to make Python show its compiled guts.

find_pyc_files()
discovery

Recursively walks a base directory (e.g. site-packages) and collects all paths ending in .pyc. This is the enumeration phase of the pipeline, defining the boundary of the forensic scan.

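import os

def find_pyc_files(base_dir):
    """Return the paths of all .pyc files found under base_dir."""
    pyc_files = []
    # os.walk descends into every subdirectory, including __pycache__ folders.
    for root, _dirs, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".pyc"):
                pyc_files.append(os.path.join(root, file))
    return pyc_files

Pattern: Recursive filesystem walk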
Input: base_dir path
Output: list[str] .pyc paths

inspect_pyc_file()
disassembly

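Opens a .pyc file, skips the 16-byte header (the layout used since Python 3.7), unmarshals the code object that follows, then uses dis.Bytecode to produce a human-readable instruction stream. The marshal format is specific to the Python version, so run this under the same version that wrote the file (3.11 here).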

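import dis
import marshal

def inspect_pyc_file(pyc_file_path):
    """Return the top-level instructions of a .pyc as 'OPNAME argrepr' strings."""
    try:
        with open(pyc_file_path, 'rb') as pyc_file:
            pyc_file.seek(16)  # skip magic number, flags, and source metadata
            code_object = marshal.load(pyc_file)
        bytecode = dis.Bytecode(code_object)
        return [f"{instr.opname} {instr.argrepr}"
                for instr in bytecode]
    except (OSError, EOFError, ValueError, TypeError) as e:
        # Keep the return type consistent with the success case: a list of strings.
        return [f"ERROR {e}"]

Header skip: first 16 bytes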
Core libs: marshal, dis
Output: list[str] instructions
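
Before skipping the header it can be worth reading it. The sketch below parses the 16 bytes according to the PEP 552 layout (magic number, flags, then either source mtime and size or a source hash). `read_pyc_header` and the temporary demo file are illustrative names introduced here, not part of the original toolchain.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile


def read_pyc_header(pyc_path):
    """Parse the 16-byte .pyc header used by Python 3.7+ (PEP 552 layout)."""
    with open(pyc_path, "rb") as fh:
        header = fh.read(16)
    if len(header) < 16:
        raise ValueError(f"{pyc_path}: header is truncated")
    magic = header[:4]
    (flags,) = struct.unpack("<I", header[4:8])
    info = {
        "magic": magic,
        # True only when the file was written by a compatible interpreter.
        "matches_running_interpreter": magic == importlib.util.MAGIC_NUMBER,
        "flags": flags,
    }
    if flags & 0x1:
        # Hash-based .pyc: the last 8 bytes are a hash of the source file.
        info["source_hash"] = header[8:16]
    else:
        # Timestamp-based .pyc: source mtime and size, 32-bit little-endian each.
        info["source_mtime"], info["source_size"] = struct.unpack("<II", header[8:16])
    return info


# Demo on a freshly compiled throwaway file (6 bytes of source).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "wb") as fh:
        fh.write(b"x = 1\n")
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "demo.pyc"), doraise=True)
    header_info = read_pyc_header(pyc)

print(header_info)
```

Checking `matches_running_interpreter` first gives a clearer failure than letting `marshal.load` choke on a file from another Python version.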

Target selection
keras / json5

You targeted specific installed components inside the virtualenv:

site_packages_dir = '/opt/pyvenv/lib/python3.11/site-packages'
# Build the __pycache__ path from site_packages_dir instead of repeating the prefix.
pycache_dir = os.path.join(site_packages_dir, 'keras/api/_v1/keras/datasets/imdb/__pycache__')
pycache_files = os.listdir(pycache_dir) if os.path.isdir(pycache_dir) else []

Then disassembled a specific artifact:

pyc_file_path = (
    '/opt/pyvenv/lib/python3.11/site-packages/keras/'
    'api/_v1/keras/datasets/imdb/__pycache__/__init__.cpython-311.pyc'
)
disassembled_pyc = inspect_pyc_file(pyc_file_path)
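
The `__pycache__/__init__.cpython-311.pyc` naming follows a fixed convention, so the mapping between source and cache paths does not have to be typed by hand. A small sketch using `importlib.util`; it only manipulates path strings and assumes nothing about whether the files exist on your machine.

```python
import importlib.util

# Path taken from the text above; nothing here touches the filesystem.
pyc_file_path = (
    "/opt/pyvenv/lib/python3.11/site-packages/keras/"
    "api/_v1/keras/datasets/imdb/__pycache__/__init__.cpython-311.pyc"
)

# Map the cached file back to the source file it was compiled from.
recovered_source = importlib.util.source_from_cache(pyc_file_path)
print(recovered_source)

# Map a source path to the cache file the *running* interpreter would use.
# The tag in the result (e.g. cpython-311) depends on the Python you run this with.
expected_cache = importlib.util.cache_from_source(recovered_source)
print(expected_cache)
```

If the tag in `expected_cache` differs from the one in `pyc_file_path`, the running interpreter is not the one that wrote the file, and unmarshalling it is likely to fail.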

Bytecode opcodes

The opcodes that actually appeared in your disassemblies, with explicit stack behavior and semantics.

Frame & control
execution

Core opcodes orchestrating frames and return paths:

RESUME RETURN_VALUE

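RESUME, new in Python 3.11, is a no-op as far as program semantics go: the interpreter uses it for internal tracing, debugging, and optimization checks. It marks the point where execution begins when a function is entered and where it picks up again after a yield or await.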

RETURN_VALUE pops the top-of-stack and returns it as the function’s result, unwinding the current frame.
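
To see both opcodes bracket a function body, disassemble any small function. The function `answer` below is a stand-in made up for this demo; the exact instructions between the first and last one vary by Python version, so none are quoted here.

```python
import dis
import sys


def answer():
    # Hypothetical function used only for this demo.
    value = 41 + 1
    return value


frame_ops = [ins.opname for ins in dis.get_instructions(answer)]
print(sys.version_info[:2], frame_ops)
```

On 3.11 and later the list starts with RESUME; the final instruction is the return.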

Constants & names
data / binding

These opcodes move constants and named values between the code object's metadata (co_consts, co_names) and the stack:

LOAD_CONST STORE_NAME LOAD_NAME DELETE_NAME

LOAD_CONST pushes a constant from co_consts onto the stack.
STORE_NAME pops a value and binds it to a name from co_names in the current namespace (the module's globals for top-level code, the class namespace inside a class body).
LOAD_NAME looks up a name in locals, then globals, then builtins.
DELETE_NAME removes a binding from the namespace.
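
A three-line module is enough to trigger the name opcodes. The source string and the name `greeting` are made up for this sketch; the point is that the operands shown by dis are just indexes into `co_consts` and `co_names`.

```python
import dis

module_source = "greeting = 'hello'\nprint(greeting)\ndel greeting\n"
module_code = compile(module_source, "<names-demo>", "exec")

name_ops = [(ins.opname, ins.argrepr) for ins in dis.get_instructions(module_code)]
for opname, argrepr in name_ops:
    print(f"{opname:<14} {argrepr}")

# The tables the operands refer to:
print("co_consts:", module_code.co_consts)
print("co_names: ", module_code.co_names)
```

Module-level code uses the *_NAME family; inside a function body the same statements would compile to STORE_FAST, LOAD_FAST, and DELETE_FAST instead.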

Import machinery
import system

Opcodes responsible for orchestrating module imports and attribute extraction:

IMPORT_NAME IMPORT_FROM POP_TOP

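IMPORT_NAME pops two values pushed by the preceding LOAD_CONST instructions (the level, then the fromlist), takes the module name from co_names, and calls __import__ with them. The resulting module object is pushed onto the stack.
IMPORT_FROM reads the named attribute from the module on top of the stack and pushes it, leaving the module in place for any further IMPORT_FROM.
POP_TOP discards the top of the stack; after a from-import it removes the module object that IMPORT_FROM left behind.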
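
The same sequence can be reproduced without importing anything, because `compile` only generates bytecode. The statement below is an arbitrary example chosen for this sketch.

```python
import dis

import_source = "from os import path as _path\n"
import_code = compile(import_source, "<import-demo>", "exec")

import_ops = [(ins.opname, ins.argrepr) for ins in dis.get_instructions(import_code)]
for opname, argrepr in import_ops:
    print(f"{opname:<12} {argrepr}")

# The fromlist is stored as a tuple constant; the module name lives in co_names.
print("co_consts:", import_code.co_consts)
print("co_names: ", import_code.co_names)
```

Reading the output top to bottom shows the order described above: level and fromlist are pushed, IMPORT_NAME consumes them, IMPORT_FROM and STORE_NAME bind the symbol, and POP_TOP drops the module.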

Class & function creation
object model

Opcodes implementing dynamic construction of classes and functions:

PUSH_NULL LOAD_BUILD_CLASS MAKE_FUNCTION PRECALL CALL

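PUSH_NULL pushes a NULL onto the stack; the 3.11 call sequence uses it to mark a call to a plain callable rather than a bound method.
LOAD_BUILD_CLASS pushes builtins.__build_class__, the function that constructs classes.
MAKE_FUNCTION pops a code object (plus any defaults, annotations, or closure cells selected by its flags) and pushes a new function object.
PRECALL prefixes CALL in Python 3.11. Logically it is a no-op that exists so the interpreter can specialize calls; it was removed in 3.12.
CALL performs the call: it pops the callable and its arguments and pushes the return value.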
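
A two-line class definition shows these opcodes working together. The empty `Parser` class here is a stand-in for illustration, not the json5 class; PRECALL will only appear if you run this on 3.11.

```python
import dis

class_source = "class Parser:\n    pass\n"
class_code = compile(class_source, "<class-demo>", "exec")

class_ops = [ins.opname for ins in dis.get_instructions(class_code)]
print(class_ops)

# Executing the module code runs the call to __build_class__ and binds the result.
namespace = {}
exec(class_code, namespace)
print(namespace["Parser"])
```

The order of PUSH_NULL relative to the load of the callable changed after 3.11, so compare opcode names rather than exact positions when you run this on a newer interpreter.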

Attributes & subscripts
data access

Opcodes implementing attribute and subscription operations:

LOAD_ATTR BINARY_SUBSCR

LOAD_ATTR pops an object and pushes the value of its named attribute.
BINARY_SUBSCR pops index/key and container and pushes container[index] or container[key].

This is exactly how _sys.modules[__name__] is resolved during the TensorFlow TFModuleWrapper wrapping sequence.
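
The expression can be compiled and evaluated on its own to watch the lookup happen. In this sketch the globals passed to `eval` are made up (`__name__` is set to `"sys"` so the lookup has something to find); on Python 3.14 the subscript may show up as BINARY_OP instead of BINARY_SUBSCR.

```python
import dis
import sys

expr_code = compile("_sys.modules[__name__]", "<subscript-demo>", "eval")

subscript_ops = [(ins.opname, ins.argrepr) for ins in dis.get_instructions(expr_code)]
for opname, argrepr in subscript_ops:
    print(f"{opname:<14} {argrepr}")

# Evaluate with hand-built globals: _sys is the real sys module,
# and __name__ is chosen so that sys.modules[__name__] is sys itself.
resolved = eval(expr_code, {"_sys": sys, "__name__": "sys"})
print(resolved is sys)
```

The instruction order mirrors the stack discipline: the container is built first (LOAD_NAME, LOAD_ATTR), the key is pushed second, and the subscript opcode consumes both.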

Case studies

Concrete disassemblies you performed, interpreted instruction by instruction.

json5.parser.Parser
class construction

Disassembly excerpt from the compiled json5/parser.py:

['RESUME ',
 'LOAD_CONST 0',
 'LOAD_CONST None',
 'IMPORT_NAME unicodedata',
 'STORE_NAME unicodedata',
 'PUSH_NULL ',
 'LOAD_BUILD_CLASS ',
 'LOAD_CONST <code object Parser ...>',
 'MAKE_FUNCTION ',
 "LOAD_CONST 'Parser'",
 'PRECALL ',
 'CALL ',
 'STORE_NAME Parser',
 'LOAD_CONST None',
 'RETURN_VALUE ']

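This is the standard module-level sequence for a class definition in Python 3.11: after the import, LOAD_BUILD_CLASS pushes __build_class__, MAKE_FUNCTION turns the class body's code object into a function, and CALL invokes __build_class__(body_function, 'Parser'). STORE_NAME then binds the resulting class to the name Parser. Only the top-level code object is disassembled here; the methods live inside the nested Parser code object.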
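
The `<code object Parser ...>` constant is itself a code object, and dis.Bytecode on the module does not descend into it. A minimal sketch of a recursive walk over `co_consts`; `iter_code_objects` is a helper invented here, and the sample class is a stand-in rather than the real json5 parser.

```python
import dis
import types


def iter_code_objects(code, depth=0):
    """Yield (depth, code) for a code object and everything nested in co_consts."""
    yield depth, code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const, depth + 1)


# Stand-in source: a class with one method, compiled in memory.
sample = compile(
    "class Parser:\n"
    "    def parse(self):\n"
    "        return None\n",
    "<nested-demo>",
    "exec",
)

nested_names = [(depth, code.co_name) for depth, code in iter_code_objects(sample)]
print(nested_names)  # [(0, '<module>'), (1, 'Parser'), (2, 'parse')]

# Each nested code object can be disassembled the same way as the module.
for depth, code in iter_code_objects(sample):
    print("  " * depth + code.co_name, len(list(dis.get_instructions(code))), "instructions")
```

Applied to a code object loaded from a `.pyc`, the same walk reaches every class body and method that the top-level listing only hints at.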

keras.datasets.imdb API wrapper
namespace facade

Disassembly excerpt from Keras IMDB __init__.cpython-311.pyc:

['RESUME ',
 "LOAD_CONST 'Public API for tf.keras.datasets.imdb namespace.\\n'",
 'STORE_NAME __doc__',
 'LOAD_CONST 0',
 "LOAD_CONST ('print_function',)",
 'IMPORT_NAME __future__',
 'IMPORT_FROM print_function',
 'STORE_NAME _print_function',
 'POP_TOP ',
 'LOAD_CONST 0',
 'LOAD_CONST None',
 'IMPORT_NAME sys',
 'STORE_NAME _sys',
 ...
 'IMPORT_NAME keras.datasets.imdb',
 'IMPORT_FROM get_word_index',
 'STORE_NAME get_word_index',
 ...
 'IMPORT_NAME tensorflow.python.util',
 'IMPORT_FROM module_wrapper',
 'STORE_NAME _module_wrapper',
 ...
 'LOAD_NAME isinstance',
 'LOAD_NAME _sys',
 'LOAD_ATTR modules',
 'LOAD_NAME __name__',
 'BINARY_SUBSCR ',
 'LOAD_NAME _module_wrapper',
 'LOAD_ATTR TFModuleWrapper']

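This bytecode exposes the public API re-exports and the opening of the TensorFlow TFModuleWrapper sequence: isinstance is loaded, sys.modules[__name__] is resolved, and TFModuleWrapper is fetched as the class to test against, before the module is wrapped inside sys.modules. The excerpt is elided in places, so it shows the pattern rather than every re-exported symbol.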
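
One way to check an interpretation like this is to write the source you think produced the bytecode, compile it, and compare opcode sequences. The candidate below covers only the first three statements of the excerpt and is a reconstruction made for this sketch, not the actual Keras source.

```python
import dis

# Hypothetical reconstruction of the opening lines of the module.
candidate_source = (
    '"""Public API for tf.keras.datasets.imdb namespace.\n"""\n'
    "from __future__ import print_function as _print_function\n"
    "import sys as _sys\n"
)
candidate_code = compile(candidate_source, "<reconstruction>", "exec")

candidate_ops = [
    f"{ins.opname} {ins.argrepr}" for ins in dis.get_instructions(candidate_code)
]
for line in candidate_ops:
    print(line)

# Reduce to the import and binding opcodes for a version-tolerant comparison.
interesting = {"IMPORT_NAME", "IMPORT_FROM", "STORE_NAME", "POP_TOP"}
candidate_shape = [op.split(" ")[0] for op in candidate_ops if op.split(" ")[0] in interesting]
print(candidate_shape)
```

If the printed sequence lines up with the start of the excerpt, the reconstruction is at least consistent with the bytecode; a mismatch tells you which statement was misread.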

Prompt-oriented architecture

How the chat frontend (e.g. chatgpt.com) becomes a programmable analysis engine for bytecode.

Prompt as orchestrator
frontend logic

The chat frontend acts as a high-level orchestrator: you paste disassembly output and specify what you want (e.g., “generate flashcards”, “explain each opcode”, “map this to Python source semantics”). The model applies structured, engineering-aware transformations.

You effectively treat the model as a programmable analyst wired to your toolchain: Python does the discovery and disassembly, the model does the semantic compression, refactoring, and documentation synthesis.
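
The hand-off from Python to the chat frontend is just text assembly. A sketch of one way to package a disassembly with a task description before pasting it; `build_analysis_prompt` and its layout are assumptions made here, not a format the frontend requires.

```python
def build_analysis_prompt(instructions, task, source_label="unknown.pyc"):
    """Format a list of 'OPNAME argrepr' strings as a paste-ready prompt (hypothetical helper)."""
    numbered = "\n".join(
        f"{index:04d}  {line.rstrip()}" for index, line in enumerate(instructions)
    )
    return (
        f"Task: {task}\n"
        f"Artifact: {source_label}\n"
        "Disassembly (one instruction per line):\n"
        f"{numbered}\n"
    )


prompt = build_analysis_prompt(
    ["RESUME ", "LOAD_CONST 0", "LOAD_CONST None", "IMPORT_NAME unicodedata"],
    task="explain each opcode",
    source_label="json5/parser.py (compiled)",
)
print(prompt)
```

Numbering the instructions gives the model, and you, stable references when asking follow-up questions about a specific line.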

Engineering mindset
philosophy

The entire system is driven by an engineering mindset: no toy explanations, no shallow gloss. Every opcode, every function, and every bytecode sequence is interpreted in terms of:

  • Stack effects and namespace mutations.
  • Runtime semantics and security implications.
  • How it composes into a predictable, auditable pipeline.

This deck is built to be “colorful as hell” visually, but structurally strict and precise.
