Releases: facebookresearch/ProgramBench

v1.1.0

18 Jun 21:34
ede4bdb

What's Changed

This release fixes several issues with the eval harness. If you are evaluating on ProgramBench, we strongly recommend updating. Most fixes should not require rerunning agents; the one exception is a small loophole described in #45 and #14 (first raised by suche-ux in #14) and fixed by new docker images (#46). Annotating existing agent trajectories should make it easy to flag which instances were affected.

  • Fix(eval): block build-script internet for submissions by @klieret in #41
  • Fix(eval): Ignore flaky and otherwise unsuitable tests by @klieret in #40
  • Fix(eval): evaluate in :task_cleanroom images by @klieret in #42
  • Fix(eval): default to v6 docker images by @klieret in #46

Full Changelog: v1.0.2...v1.1.0

v1.0.2

11 May 16:58
b33e660

This minor release ignores ~30 tests that caused hangs when evaluating incorrect solutions.

Full Changelog: v1.0.1...v1.0.2

v1.0.1

07 May 12:45
1fe64c8

What's Changed

  • Fix: stderr messages can corrupt XML coverage report (#5), thanks for the report @darshanmakwana412
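
To illustrate the class of bug named above, here is a minimal sketch in Python, not the harness's actual code. It assumes a hypothetical child process that prints a small XML report on stdout and a warning on stderr; the names `CHILD`, `run_report`, and `is_valid_xml` are made up for this example. When the two streams are merged, the warning lands in the report text and the XML no longer parses; capturing them separately keeps the report clean.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Hypothetical child process (an assumption for illustration only):
# it writes a warning to stderr and a tiny XML report to stdout.
CHILD = (
    "import sys\n"
    "sys.stderr.write('warning: something noisy\\n')\n"
    "sys.stdout.write('<coverage line-rate=\"0.5\"></coverage>')\n"
)


def run_report(merge_streams: bool) -> str:
    """Run the child and return the text that would be treated as the XML report."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # Merging stderr into stdout mixes diagnostics into the report text.
        stderr=subprocess.STDOUT if merge_streams else subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout


def is_valid_xml(text: str) -> bool:
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("separate streams parse:", is_valid_xml(run_report(merge_streams=False)))
    print("merged streams parse:", is_valid_xml(run_report(merge_streams=True)))
```

Keeping stderr on its own pipe (or writing the report to a file rather than stdout) is one common way to avoid this kind of corruption; whether ProgramBench's fix takes that route is not stated in these notes.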

Full Changelog: v1.0.0...v1.0.1

ProgramBench 🦊

05 May 14:31
2803dcc

How much of SQLite, FFmpeg, or the PHP compiler can Opus 4.7 rebuild from scratch, given just an executable and no starter code or internet access?

Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end.

Read more: https://programbench.com/
