A day where Google reminded everyone they still set the pace—and Alibaba and Cohere quietly showed how the race is changing.
Gemini 3.5 Flash: The Cost-Efficiency Play Everyone Will Copy
Google didn’t just drop a faster model at I/O. They dropped a cheaper one—and framed it as the future of enterprise AI. Gemini 3.5 Flash isn’t about raw performance; it’s about output tokens per second per dollar, a metric that matters more in boardrooms than benchmark tables.
The numbers:
- 4x the output throughput of other frontier models.
- Less than 50% of the cost of comparable models (per Google’s own claims, so take it with a grain of salt, but directionally true).
- 76.2% on Terminal-Bench 2.1, which means it’s not just fast, it’s usefully fast for coding agents.
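To make "output tokens per second per dollar" concrete, here is a minimal sketch of how the two headline ratios combine. The metric definition (throughput divided by price per million output tokens) is one plausible reading, not Google's published formula, and the baseline figures are hypothetical placeholders; only the 4x and 50% ratios come from the claims above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    output_tokens_per_sec: float   # decode throughput
    usd_per_million_output: float  # price per 1M output tokens


def speed_per_dollar(model: ModelProfile) -> float:
    """Output tokens/sec divided by USD per 1M output tokens (assumed definition)."""
    return model.output_tokens_per_sec / model.usd_per_million_output


# Hypothetical baseline: the absolute numbers are placeholders, not real pricing.
baseline = ModelProfile("baseline-frontier", 100.0, 10.0)

# Apply the two claimed ratios: 4x throughput at (at most) half the cost.
flash_like = ModelProfile(
    "flash-like",
    baseline.output_tokens_per_sec * 4,
    baseline.usd_per_million_output * 0.5,
)

advantage = speed_per_dollar(flash_like) / speed_per_dollar(baseline)
print(f"{advantage:.1f}x")  # 8.0x
```

Under this definition the baseline values cancel out, so the two claims together imply roughly an 8x edge on the combined metric, whatever the real prices are.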
The move here is obvious: Google’s betting that the next phase of the AI race isn’t about who has the "smartest" model, but who can deploy it at scale without bankrupting customers. Flash isn’t a science experiment; it’s a product strategy. And the fact that it’s already beating Gemini 3.1 Pro on most benchmarks? That’s just table stakes.
The question they’re not asking: If speed and cost are the new frontier, why are we still obsessing over MMLU scores like they’re the only thing that matters?
Alibaba’s Qwen3.7-Max: The "35-Hour Agent" No One Saw Coming
While Google played the efficiency card, Alibaba went long—literally. Qwen3.7-Max isn’t just another incremental upgrade. It’s a model built for persistent, autonomous operation, claiming to run agentic tasks for up to 35 hours without performance degradation.
The benchmarks:
- 13th globally on LM Arena’s text leaderboard (preview version).
- 16th for vision—not groundbreaking, but competitive.
- Top-tier in China, which is the point. Alibaba isn’t playing for global dominance yet; they’re consolidating the domestic market first.
The real story? The Zhenwu M890 chip—144GB of GPU memory, 3x the power of its predecessor, and a clear signal that Alibaba’s building its own stack, chips to cloud. This isn’t just a model release; it’s vertical integration. China’s "AI factory" just got its assembly line.
For the record: If Qwen3.7-Max can actually sustain agentic workflows for 35 hours without hallucinating itself into a corner, that’s a bigger deal than another 2% bump on MMLU. Long-horizon tasks are where models fail in the real world. Someone’s finally testing for it.
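A back-of-the-envelope sketch of why long horizons are hard: if each agent step succeeds independently with some fixed probability, the chance of a clean run decays geometrically with the number of steps. The independence assumption, the one-step-per-minute pace, and the per-step reliabilities below are all illustrative; none of them come from Alibaba.

```python
def clean_run_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical pacing: one agent step per minute for 35 hours.
STEPS_IN_35_HOURS = 35 * 60  # 2100 steps

for reliability in (0.99, 0.999, 0.9999):
    p = clean_run_probability(reliability, STEPS_IN_35_HOURS)
    print(f"per-step {reliability}: clean 35-hour run {p:.1%}")
```

Even at 99.9% per-step reliability, this toy model gives only about a one-in-eight chance of an error-free 35-hour run, which is why endurance claims deserve their own tests.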
Cohere’s Command A+: The Open-Source MoE That Runs on Two GPUs
Cohere didn’t just open-source a model. They open-sourced a 218B-parameter MoE that fits on two H100s—and somehow made it faster than its predecessor.
The specs:
- 25B active parameters per generation step, out of 218B total (the MoE magic).
- W4A4 quantization (4-bit weights, 4-bit activations) at 375 tokens/sec with a 113ms time to first token (TTFT). That’s 63% faster output than Command A Reasoning.
- Apache 2.0 license, meaning enterprises can actually use this without legal gymnastics.
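The specs above can be sanity-checked with simple arithmetic. This is a rough sketch, not Cohere's sizing method: it counts weight storage only (no KV cache, activations, or quantization scale overhead), assumes the 80 GB H100 variant, and reads "63% faster" as 1.63x the predecessor's throughput.

```python
H100_MEMORY_GB = 80  # assumption: 80 GB H100 variant


def weight_memory_gb(total_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_params_billion * bits_per_weight / 8


def fits_on_gpus(total_params_billion: float, bits_per_weight: int, gpus: int) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(total_params_billion, bits_per_weight) <= gpus * H100_MEMORY_GB


def response_time_s(ttft_ms: float, tokens_per_sec: float, output_tokens: int) -> float:
    """Time to first token plus steady-state decode time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


print(weight_memory_gb(218, 4))   # 109.0 GB at 4-bit
print(weight_memory_gb(218, 16))  # 436.0 GB at 16-bit
print(fits_on_gpus(218, 4, 2))    # True
print(fits_on_gpus(218, 16, 2))   # False

# Implied predecessor throughput if "63% faster" means 1.63x (assumption).
implied_previous_tps = 375 / 1.63
print(round(implied_previous_tps))  # 230

# Rough time for a 1,000-token answer at the quoted figures.
print(f"{response_time_s(113, 375, 1000):.2f} s")
```

The 4-bit weights are what make the two-GPU claim plausible: at 16-bit precision the same model would need several times the memory of two 80 GB cards before serving a single request.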
Why this matters: Cohere just handed sovereign AI shops (governments, banks, defense contractors) a way to run a near-frontier model without touching closed-source APIs. That’s not just a benchmark win. That’s a geopolitical play.
The numbers say one thing. Note what they don’t say: Cohere didn’t lead with traditional leaderboard scores. They led with deployment metrics. That’s the trend. The race isn’t just about who’s "smarter" anymore. It’s about who’s usable.
The Record
Google set the tempo. Alibaba played the long game. Cohere changed the rules. Yesterday wasn’t about who won the benchmarks. It was about who decided which benchmarks matter.
Adding this to the leaderboard: The cost-per-token war is officially on. The agent endurance test just started. And open-source MoE just got its first real enterprise contender.
Filed under: the day the goalposts moved.



