
4-atari-hard: Go-Explore (exploration phase) on Montezuma's Revenge + benchmark #132

Closed
dnddnjs wants to merge 1 commit into master from ai/montezuma-go-explore

@dnddnjs dnddnjs commented Jun 8, 2026

Contributor

Go-Explore phase 1 (exploration only) on Montezuma's Revenge — the archive + emulator-restore paradigm, side by side with the PPO+RND row.

Best end-of-episode score: 31,000 at 500M agent steps (~5.5h, Mac Studio M4 Max, 12 explorer processes, no neural network). Single seed. Replay-verified: re-executing the stored 5,336-action trajectory from reset reproduces exactly 31,000.
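
The replay check described above can be sketched as follows. This is a minimal illustration, not the repository's code: `ToyEnv`, `replay_score`, and `verify_demo` are hypothetical names, and the toy environment stands in for a deterministic ALE instance so the example runs without an emulator.

```python
from typing import Sequence


class ToyEnv:
    """Deterministic stand-in for the emulator (hypothetical).

    Pays 100 whenever action 1 directly follows action 0;
    the episode ends after 6 steps.
    """

    def reset(self) -> None:
        self.t = 0
        self.prev = -1

    def step(self, action: int) -> tuple[float, bool]:
        reward = 100.0 if (self.prev == 0 and action == 1) else 0.0
        self.prev = action
        self.t += 1
        return reward, self.t >= 6


def replay_score(env, actions: Sequence[int]) -> float:
    """Re-execute `actions` from reset and return the accumulated raw score."""
    env.reset()
    total = 0.0
    for action in actions:
        reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def verify_demo(env, actions: Sequence[int], archived_score: float) -> bool:
    """A demo passes only if the replayed score matches the archived one exactly."""
    return replay_score(env, actions) == archived_score


demo = [0, 1, 0, 1, 2, 1]
print(verify_demo(ToyEnv(), demo, 200.0))  # archived score matches the replay
print(verify_demo(ToyEnv(), demo, 300.0))  # a fabricated score is rejected
```

The check is only meaningful under the deterministic protocol listed in the notes below; with sticky actions the same action list would not be expected to reproduce the score.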

Protocol notes (also in the README block):

  • Deterministic ALE (no sticky actions, frameskip 4, fixed seed) — required by restore-based exploration, not comparable to the sticky-action RL rows.
  • Score = best end-of-episode trajectory found by search, not an RL policy score; the paper's robustification phase is not run here.
  • Reference: Nature exploration-phase mean without domain knowledge is 24,758 at the same 2B-frame budget (50+ seeds vs our single seed). Rooms found: 24.

W&B (full metrics history + gameplay video): https://wandb.ai/rlcode/rl-atari-hard-go-explore/runs/m6ox4l3m

(Single-seed diagnostic run; merge is a human decision.)

… benchmark

Go-Explore phase 1 (Ecoffet et al. 2019 / Nature 2021), no neural net:
an archive of downscaled-frame cells (11x8, 9 gray levels), emulator
state save/restore to return to frontier cells, repeated random actions
to explore from them. 12 explorer processes over raw gymnasium ALE
(envpool exposes no clone API, hence the separate env_go_explore.py).
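
A rough sketch of the cell representation named above (11x8 downscaled frame, 9 gray levels). The function name `cell_key`, the block-mean downscale, and the floor quantization are assumptions made for illustration; the actual code may use a different resize method or rounding.

```python
import numpy as np


def cell_key(frame: np.ndarray, width: int = 11, height: int = 8,
             levels: int = 9) -> bytes:
    """Map a grayscale uint8 frame of shape (H, W) to a hashable cell key."""
    h, w = frame.shape
    # Integer bin edges that split the frame into height x width blocks.
    rows = np.linspace(0, h, height + 1).astype(int)
    cols = np.linspace(0, w, width + 1).astype(int)
    small = np.empty((height, width), dtype=np.float64)
    for i in range(height):
        for j in range(width):
            block = frame[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            small[i, j] = block.mean()  # area-style downscale (assumption)
    # Quantize intensities 0..255 to integer levels 0..levels-1.
    quantized = np.floor(small / 255.0 * (levels - 1)).astype(np.uint8)
    return quantized.tobytes()


frame = np.zeros((210, 160), dtype=np.uint8)  # Atari frame size
noisy = frame.copy()
noisy[0, 0] = 10  # a one-pixel change stays inside the same cell
print(cell_key(frame) == cell_key(noisy))
```

The coarse key is what lets visually similar frames share one archive entry, so the archive grows with distinct situations and not with raw frames.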

Result: best end-of-episode score 31,000 at 500M agent steps (~5.5h on
a Mac Studio M4 Max), single seed, replay-verified (re-executing the
stored 5,336-action demo from reset reproduces the score exactly).
Deterministic protocol (no sticky actions) -- a trajectory-search
result, not an RL policy score; see the README caveat.

dnddnjs commented Jun 12, 2026

Contributor Author

Superseded by #133 (combined Go-Explore exploration + robustification).

@dnddnjs dnddnjs closed this Jun 12, 2026
dnddnjs pushed a commit that referenced this pull request Jun 12, 2026
…ation + robustification)

Merge the three Montezuma results into a single table with a protocol column,
trim the prose to one note. Restores the exploration row (31,000, replay-verified)
that lived only on the superseded #132 branch.
dnddnjs added a commit that referenced this pull request Jun 12, 2026
…ification) (#133)

* Ignore local CLAUDE.md collaboration notes

* 4-atari-hard: PPO + RND scaffold for Montezuma's Revenge

New chapter for hard-exploration Atari. PPO with Random Network
Distillation (Burda et al., 2018) as the curiosity bonus.

- env.py: ALE/MontezumaRevenge-v5 (and pitfall, private_eye) with the
  standard Atari preprocessing, no FireResetEnv, no LifeLossTerminalEnv
  (uninterrupted episodes so intrinsic returns can chain across deaths).
- 1-ppo-rnd.py: two-value-head ActorCritic, RND target/predictor with
  LeakyReLU, single-frame normalized input clipped to [-5, 5], obs RMS
  seeded by 50 rollouts of a random agent, intrinsic reward scaled by
  running std of discounted intrinsic returns, dual GAE (extrinsic
  episodic + intrinsic non-episodic), predictor updated on 25% of each
  minibatch, combined advantage A = 2*A_ext + 1*A_int.

Not run end-to-end yet. Sanity-checked static shapes and module wiring.

* 4-atari-hard: add envpool count-based exploration

* 4-atari-hard: PPO+RND on Montezuma's Revenge + benchmark

PPO+RND made reproducible and resumable. Shared run plumbing (seed,
metrics.jsonl, periodic/milestone/best checkpoints, resume, final summary)
lives in env.py's RunLogger, keeping the algorithm file focused. 512 parallel
envs crack the first-key bottleneck (128 envs never scored in 50M steps); final mean
per-game return ~3120 @ 65M steps, single seed (M4 Max), above the paper PPO
baseline (2497). Adds a README benchmark row. Count-based exploration is
deferred to a later PR (not yet trained/benchmarked).

* README: link the Montezuma PPO+RND W&B report

* 6-atari-go-explore: Go-Explore Phase 1 (exploration) for Montezuma

Restore-based archive exploration (Ecoffet et al. 2019/2021), no neural
net: 11x8x9 downscaled-frame cells, 1/sqrt(seen+1) selection, repeated
random actions (p=0.95), raw-score accept rule, virtual DONE cell, global
experience log with prev_id chains (demo source for robustification),
12-worker spawn pool over raw gymnasium ALE clone/restore.
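
The selection and accept rules listed above might look roughly like this. `Cell`, `sample_cell`, and `should_accept` are hypothetical names; incrementing `seen` at selection time and breaking score ties by shorter trajectory are assumptions, not details confirmed by this commit message.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Cell:
    score: float
    traj_len: int
    seen: int = 0


def selection_weight(cell: Cell) -> float:
    """Rarely seen cells get a higher weight: 1/sqrt(seen+1)."""
    return 1.0 / math.sqrt(cell.seen + 1)


def sample_cell(archive: dict, rng: random.Random):
    """Pick an archive key with probability proportional to its weight."""
    keys = list(archive)
    weights = [selection_weight(archive[k]) for k in keys]
    key = rng.choices(keys, weights=weights, k=1)[0]
    archive[key].seen += 1  # assumption: count bumps when the cell is chosen
    return key


def should_accept(archive: dict, key, score: float, traj_len: int) -> bool:
    """Raw-score accept rule: new cell, or strictly better raw score."""
    old = archive.get(key)
    if old is None:
        return True
    if score > old.score:
        return True
    # Assumption: on equal score, prefer the shorter trajectory.
    return score == old.score and traj_len < old.traj_len


archive = {"a": Cell(score=0.0, traj_len=10), "b": Cell(100.0, 40, seen=3)}
print(selection_weight(archive["a"]), selection_weight(archive["b"]))
print(sample_cell(archive, random.Random(0)) in archive)
```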

Run contract: --seed/--total-frames/--run-dir/--ckpt-every/--resume;
explog flushed as compressed chunks; checkpoint = archive+log+RNG at
batch boundaries. Smoke: 23k steps/s aggregate, first key at 100k steps.

* 6-atari-go-explore: resolve flushed explog chunks from the resumed run's dir

Cross-run-dir resume (harness relaunches into a fresh run dir) could not
see chunks flushed by the original run; chunk lookup now falls back to
the ancestor run's explog dir and resume fails loudly if any chunk is
unreachable.

* Move Go-Explore into 4-atari-hard alongside PPO+RND

Same hard-exploration domain, two paradigms side by side: 1-ppo-rnd.py
(gradient + intrinsic reward, envpool) and 2-go-explore.py (archive +
emulator restore, raw ALE). Go-Explore keeps its own plumbing in
env_go_explore.py since the two stacks share nothing.

* 4-atari-hard/2-go-explore: use sampling-time captures in the archive walk

A result earlier in the same batch can replace a cell; walking a later
result against the cell's CURRENT score/trajectory stitched actions
executed from the old state onto the new prefix, fabricating scores no
single playthrough achieved. sample() now freezes snapshot/score/
trajectory per pick and the walk uses the capture — matching the
official Go-Explore, which ships these values inside each task.
Caught by publish-time demo replay verification (score mismatch).
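
A small sketch of the fix described above, with hypothetical names (`Cell`, `Pick`, `sample`, `walk_result`). The point it illustrates: the walk builds on the values frozen at pick time, so replacing the archive cell later in the same batch cannot splice actions onto a different prefix.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    snapshot: bytes
    score: float
    trajectory: tuple[int, ...]


@dataclass(frozen=True)
class Pick:
    """Immutable capture of a cell at sampling time."""
    key: str
    snapshot: bytes
    score: float
    trajectory: tuple[int, ...]


def sample(archive: dict, key: str) -> Pick:
    cell = archive[key]
    return Pick(key, cell.snapshot, cell.score, cell.trajectory)


def walk_result(pick: Pick, new_actions, gained: float):
    """Combine the explored suffix with the captured prefix, not the live cell."""
    return pick.score + gained, pick.trajectory + tuple(new_actions)


archive = {"a": Cell(b"s0", 100.0, (1, 2))}
pick = sample(archive, "a")
# An earlier result in the same batch replaces the cell.
archive["a"] = Cell(b"s1", 400.0, (3, 3, 3))
# Correct: 100 + 50 on prefix (1, 2). The buggy walk would have produced
# 400 + 50 on prefix (3, 3, 3), a score no single playthrough achieved.
print(walk_result(pick, [5], 50.0))
```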

* 4-atari-hard: Go-Explore robustification (backward algorithm) — demo extract + GRU PPO

extract_demo.py: pull the best Phase-1 trajectory from the GE checkpoint +
experience log, replay-verify it reproduces the archived score (31,000),
truncate after the last reward, save actions/rewards/periodic ALE states.

env_robustify.py: ReplayResetEnv (episodes restore to a demo point and play
forward under sticky actions; success = raw score >= demo return; lag/success
kills) + ResetManager curriculum (starting points march backward as the agent
matches the demo, forward-cumsum move rule per atari-reset, nudge forward on
collapse).
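
The backward curriculum can be illustrated with a deliberately simplified sketch. It swaps the forward-cumsum move rule for a plain windowed success rate, and every name and threshold here (`BackwardCurriculum`, `step`, `move_threshold`, `collapse_threshold`, `window`) is hypothetical, not taken from env_robustify.py.

```python
class BackwardCurriculum:
    """Simplified stand-in for a reset-point curriculum (assumed design)."""

    def __init__(self, demo_len: int, step: int = 10,
                 move_threshold: float = 0.2,
                 collapse_threshold: float = 0.02, window: int = 20):
        self.step = step
        self.move_threshold = move_threshold
        self.collapse_threshold = collapse_threshold
        self.window = window
        self.max_start = max(demo_len - 1, 0)
        self.start = max(demo_len - step, 0)  # begin near the end of the demo
        self.results: list[bool] = []

    def report(self, success: bool) -> int:
        """Record one episode outcome and return the current start index."""
        self.results.append(bool(success))
        if len(self.results) < self.window:
            return self.start
        rate = sum(self.results) / self.window
        self.results.clear()
        if rate >= self.move_threshold:
            # Agent matches the demo often enough: start earlier.
            self.start = max(self.start - self.step, 0)
        elif rate <= self.collapse_threshold:
            # Success collapsed: nudge the start point forward again.
            self.start = min(self.start + self.step, self.max_start)
        return self.start


cur = BackwardCurriculum(demo_len=100, step=10, window=4)
for _ in range(4):
    cur.report(True)
print(cur.start)  # moved backward from 90 to 80
```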

3-robustify.py: recurrent (GRU) PPO over N restore-capable ALE envs, truncated
BPTT with done-masked state, advantage chains cut at artificial success resets,
periodic from-reset sticky eval -> final.json. Single-machine scaled port of
openai/atari-reset; SIL/multi-demo/autoscale are off-by-default flags. Runs
end-to-end; curriculum logic covered by harness T0.

* 4-atari-hard/3-robustify: eval honors game_over + faithful resume RNG

Preflight audit found the from-reset eval reused ReplayResetEnv with the
training-curriculum kills active, so eval episodes were cut by lag/success-kill
before game_over — value_mean reported a key-but-slower-than-demo policy as ~0.
Disable both kills in evaluate() and cap at the standard 18000-frame Montezuma
episode (4500 agent steps) so eval runs from reset to a real game_over.

Also checkpoint torch + per-env RNG state and restore them on resume (was global
numpy RNG only), so the kill->resume contract is faithful for the deterministic
streams (MPS policy sampling has no bit determinism).
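
A minimal sketch of saving and restoring more than the global numpy RNG, assuming each env owns a `numpy.random.Generator`. The helper names are hypothetical; a torch-based run would additionally store `torch.get_rng_state()` and restore it with `torch.set_rng_state()`, which is omitted here to keep the example dependency-light.

```python
import numpy as np


def capture_rng(env_rngs):
    """Snapshot the global numpy RNG and every per-env generator."""
    return {
        "global_numpy": np.random.get_state(),
        "per_env": [rng.bit_generator.state for rng in env_rngs],
    }


def restore_rng(state, env_rngs) -> None:
    """Put all streams back exactly where the checkpoint left them."""
    np.random.set_state(state["global_numpy"])
    for rng, saved in zip(env_rngs, state["per_env"]):
        rng.bit_generator.state = saved


env_rngs = [np.random.default_rng(seed) for seed in range(3)]
checkpoint = capture_rng(env_rngs)
before = [rng.integers(0, 1000, size=5).tolist() for rng in env_rngs]
restore_rng(checkpoint, env_rngs)
after = [rng.integers(0, 1000, size=5).tolist() for rng in env_rngs]
print(before == after)  # the resumed streams repeat the original draws
```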

* 4-atari-hard/extract_demo: --max-rewards to truncate at the Kth reward

Cutting the demo just after the first reward (--max-rewards 1) yields a short
first-key-only demo (~250 actions vs ~5300), a far shorter horizon for the
robustification backward curriculum to bootstrap on.

* 4-atari-hard/3-robustify: --ent-coef flag for entropy tuning

The first-key robustification curriculum plateaus where as_good_as_demo caps ~0.34:
the policy commits before reliably executing the demo suffix under sticky actions.
Expose the entropy bonus as a flag (default unchanged) to test whether more
exploration breaks the plateau.

* README: Go-Explore robustification — single-machine negative result

Document the backward-algorithm robustification (3-robustify.py): the curriculum
bootstraps with a first-key demo + 128 envs but plateaus ~22% of the way, with no
from-reset score on a single machine. Honest negative result, single seed, no
benchmark row claimed.

* README: consolidate Montezuma into one table (RND + Go-Explore exploration + robustification)

Merge the three Montezuma results into a single table with a protocol column,
trim the prose to one note. Restores the exploration row (31,000, replay-verified)
that lived only on the superseded #132 branch.

---------

Co-authored-by: soyoung park <ssoyyoung.p@gmail.com>