GLM5 excels at long-task engineering, maintaining consistency over 700+ tool calls and 800+ context handoffs across 24+ hours; this lets AI function as a persistent process for complex projects, such as building a full Game Boy Advance emulator, rather than as a series of short conversational interactions.
- E01 Research gained early access to GLM5 to test its long-task capabilities, focusing on a challenge: build a Game Boy Advance (GBA) emulator from scratch in JavaScript, embedded in a 3D-rendered scene, using a single agent without parallelism.
- The task simulates real engineering: research, architecture, implementation, testing, and documentation across multiple sessions; the work stretches past the context window, so meta-rules and loops in the prompt carry it across sessions.
- Challenge input: a system prompt and hardware documentation; the model must scope the work, adjust strategies, switch roles (architect, engineer, designer), and hand off context accurately.
- Two test versions were run: "Easy mode," with gbajs reference code available (GLM5 reimplemented it independently, reaching a working core emulator, ROM loading, and the 3D scene; demo at https://e01.ai/gba), and "Zero reference" mode, with no reference code or web search (ran 24+ hours, completed the CPU instruction set, progress ongoing).
- Prior models failed by looping, forgetting goals, or making erroneous tool calls.
- Success mechanism: the prompt is a meta-loop (work → test → log → advance), persisting state in files (/notes/progress.md, /notes/decisions.md, /notes/blockers.md) so it survives context resets; a hypothetical notes entry is sketched below.
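As a concrete illustration of the notes protocol, a session entry in /notes/progress.md might look like the sketch below; the phase names, session number, and details are hypothetical, not taken from the actual run.

```markdown
## Session 14, Phase 2: CPU core (hypothetical entry)
- Done: THUMB ALU instructions implemented; unit tests passing via `node --test`
- Decision: decode with a lookup table instead of a nested switch (see /notes/decisions.md)
- Blocked: carry/overflow flag behavior for ADC edge cases (logged in /notes/blockers.md)
- Next: load/store addressing modes, then memory-mapped I/O stubs
```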
- Observation 1: GLM5 showed no degradation, with consistent tool calling (700+ calls), strict instruction adherence (800+ switches), and reliable context relay from the notes files.
- Implications: enables goal-driven agents (autonomous planning and execution), parallel multi-agents (one human supervising many), and applications beyond code (e.g., AI for Science: experiment design, research); it also suggests patterns such as long-recurring (iterative workflows) and long-exploring (open-ended exploration).
- Observation 2: challenges include hidden multi-session loops that only a human caught (e.g., brute-forcing the 3D model) and over-diligence with no threshold for pausing; the prompt needs explicit "stop and ask for help" instructions (a sketch of such a rule follows below).
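One way to add the missing pause threshold is an explicit escalation rule in the system prompt; the thresholds and wording below are illustrative, not the rule E01 actually used.

```text
If the same test still fails after 3 distinct fix attempts, or a subtask has
consumed more than 2 sessions with no measurable progress recorded in
/notes/progress.md, stop. Write the symptoms and the fixes you tried to
/notes/blockers.md, then ask the human for help instead of trying another variation.
```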
- Future needs: Observability (visualization/monitoring), intervention (alerts, nudges), evaluation metrics (context relay, progression rate, decay), trust (incremental validation), cost/infra (budgeting, pause/resume), research (relay limits, self-evaluation).
- Prompt design guide: define the goal plus phases with "done" criteria (e.g., CPU core, memory, graphics); conventions (file structure, testing); a notes protocol (session updates); testing gates (unit/integration via Node.js, not the browser; a minimal test sketch follows below); loop breaking (retry logs, time limits); recovery (read notes and files first).
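For the Node.js testing gate, a unit test that runs outside the browser could use Node's built-in test runner (`node --test`); the module path, `createCpu` API, and decoded-instruction shape below are assumptions for illustration, not the emulator's actual interface.

```js
// Hypothetical test-gate sketch: verify that ADDS sets flags correctly on wraparound.
// Run with: node --test
import test from "node:test";
import assert from "node:assert/strict";
// Assumed module and API; the real emulator's structure may differ.
import { createCpu } from "../src/cpu.js";

test("ADDS r0, r1, r2 sets zero and carry flags when the sum wraps to 0", () => {
  const cpu = createCpu();
  cpu.registers[1] = 0xffffffff;
  cpu.registers[2] = 0x00000001;
  // Assumed decoded-instruction shape, used here for illustration only.
  cpu.execute({ op: "ADDS", rd: 0, rn: 1, rm: 2 });
  assert.equal(cpu.registers[0], 0); // 0xFFFFFFFF + 1 wraps to 0 in 32 bits
  assert.equal(cpu.flags.Z, 1);      // result is zero
  assert.equal(cpu.flags.C, 1);      // unsigned carry out of bit 31
});
```

Gating each phase on small tests like this is what keeps errors from compounding across sessions, per the guide above.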
- Mistakes to avoid: vague notes; omitting loop-breaking rules and retry logs (the agent repeats work across sessions); assuming memory persists between sessions; over-specifying code; skipping tests (errors compound).
- The experiment was run via OpenCode/Claude Code.