OpenAI dropped GPT-5.5 on Thursday, and the headline number is an 82.7% score on Terminal-Bench 2.0, which tests whether a model can handle complex command-line workflows that require actual planning and tool coordination. For context, GPT-5.4 scored 75.1%, Anthropic's Opus 4.7 came in at 69.4%, and Google's Gemini 3.1 Pro landed at 68.5%. OpenAI is pulling noticeably ahead on agentic capability, at least by this particular benchmark.
The bigger story here is what GPT-5.5 is being pointed at. It is the new engine behind Codex, OpenAI's coding agent, which now has around 4 million developers using it weekly. But OpenAI is framing this as much broader than coding: the company is talking about "general digital work tasks," scientific hypothesis generation, and autonomous multistep work carried out without human guidance. That last part is the one worth sitting with.
Greg Brockman chose the word "enable" carefully during the press call, and that framing matters. This is not pitched as just a better chatbot; it is infrastructure for agents that can operate a computer independently, and GPT-5.5's 78.7% on OSWorld-Verified, the benchmark that measures exactly that, is the number OpenAI points to. Anthropic's unreleased Mythos model is reportedly complete, so the competitive pressure behind this release is real.