Horizon Alpha Launch – Discord Timeline

until 1754038353 (Unix time: 2025-08-01 08:52:33 UTC)


July 29 2025 – Day 1: The Tease
12:18 PM – Toven announces: “New stealth model available for testing: Horizon Alpha. Try it with code generation and frontend! It’s free to use during this testing period…” Speculation erupts immediately; most hope it is an OpenAI preview (GPT‑5 rumours abound), while a few suggest DeepSeek or Amazon.
3:00 PM – Dailyfocus asks why people want a “ClosedAI” model; wish‑lists for Claude 4 Haiku, DeepSeek R2, and an improved Llama follow.
6:56 PM – Leo tempers expectations, noting the cost of hosting “zenith/summit”‑class models for free.

July 30 2025 – Day 2: The Great Wait
1:05 AM – xiaoqianWX: “not here yet?”
4:20 AM – Thunder begins the mantra: “day 1 of waiting for Horizon Alpha.”
4:32 AM – Toven teases: “y’all better be ready with your benchmarks.”
Community members track past stealth‑model drop times (Quasar, Optimus, Cyber) and convince themselves the delay signals OpenAI involvement.
8:33 AM – Leo notes naming symmetry: “Zenith… Summit… Horizon… sounds like zenith > horizon > summit.”
11:18 AM – Kyle declares the delay alone “proves” it must be an OpenAI model.

July 31 2025 – Day 3: Edge of Madness (Pre‑Launch)
12:09 PM – ja: “it’s so over.”
2:53 PM – Dailyfocus jokes about Amazon interference.
8:24 PM – para opens a “waiting room”; a countdown from 2000 begins.
10:56 PM – P4tr1ckB4t3m4n theorises that the delay reflects cautious GPT‑5 testing; tension peaks.

July 31 2025 – Day 4: The Release
8:18 AM – Toven: “who up rn.”
8:21 AM – “SOOM: REAL SOON … it will drop today.”
10:04 AM – toriset: “it loaded for me.” Horizon Alpha is live.
10:05 AM → mid‑day – First tests: maths benchmark 6/6 (better than o4‑mini), impressive ASCII balloons, but glaring reasoning errors (e.g., “9.11 is larger than 9.9,” mis‑counting letters in strawberrry).
10:15 AM – System prompt reveals “GPT‑4 family optimised for fast, cost‑effective responses,” dashing GPT‑5 hopes.
10:47 AM – Early GPQA‑Diamond score: 36.87%, cementing disappointment.
By evening the consensus is that Horizon Alpha is likely OpenAI’s long‑promised open‑weights model—fast but underwhelming in reasoning.
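
The two reasoning slips reported in the first tests have unambiguous ground truths, which is why they are popular quick checks. A minimal sketch of those ground truths follows; the helper names `larger_decimal` and `count_letter` are illustrative and not taken from the thread.

```python
from decimal import Decimal


def larger_decimal(a: str, b: str) -> str:
    """Return whichever numeric string is larger when compared as an exact decimal."""
    return a if Decimal(a) > Decimal(b) else b


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    # 9.9 is the larger number; the model reportedly answered 9.11.
    print(larger_decimal("9.11", "9.9"))
    # The spelling used in the test above, with the extra "r".
    print(count_letter("strawberrry", "r"))
```

Comparing as `Decimal` instead of as strings or version-style segments avoids the "11 is bigger than 9" trap that appears to be behind the wrong answer.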

August 1 2025 – Day 5: The Switch to “Smart” Horizon
Users wake to a transformed model. Time‑to‑first‑token increases, the API begins showing reasoning tokens, and benchmarks skyrocket: MMLU ~94%, GPQA‑Diamond ~87%, with complex maths and full chess engines succeeding. Leo sums up: “hydrogen bomb vs coughing baby.”
The flip is inconsistent: repeated prompts sometimes return the old “fast‑dumb” model, sometimes the new “slow‑smart” one, indicating live A/B routing. Frustration mounts: “the model switching under our feet is really pissing me off.”

August 1 2025 – Day 5 (Evening): Leaks and Theories
Hugging Face leaks (user “yofo‑happy‑panda”) surface configs for a 120B sparse Mixture‑of‑Experts and a 20B dense OpenAI model. Community consensus crystallises:
• Day 4’s “dumb” Horizon = 20B dense (or 120B MoE with reasoning throttled).
• Day 5’s “smart” Horizon = 120B MoE with full reasoning budget.
OpenRouter users realise they have been unwitting participants in OpenAI’s phased, multi‑model A/B test.


Aftermath and Takeaways

  • Performance whiplash showcased the stark gap between small and large reasoning‑enabled models.
  • Minimal staff intervention (aside from Toven’s teases and Alex Atallah’s “go” signal) let the community uncover capabilities organically.
  • OpenAI’s open‑source ambitions became clear: a dense 20B baseline and a SOTA 120B MoE poised to shake up the public LLM landscape.

The launch began on July 29, 2025, when Toven announced a new stealth model—Horizon Alpha—inviting users to test its code generation and frontend skills, with usage free and prompts logged for feedback. This instantly ignited wild speculation. The dominant hope was that Horizon Alpha was a true OpenAI preview (many openly yearning for GPT-5), while theories flew about links to “zenith” or “summit,” the anonymous high-performers from LMSYS Arena. Some feared another lacklustre Amazon release. Even the model’s delayed access became a possible OpenAI fingerprint, intensifying hype.

During the wait, meme culture and running gags about being “edged” by OpenRouter staff took over. Users kept watch for timezone clues and dissected every staff message or edit. The channel became a microcosm of collective anticipation, with power users prepping benchmarks and test prompts for launch.

When Horizon Alpha finally went live on July 31, the tone immediately shifted. Early user feedback was almost universally negative. The model was incredibly fast but performed poorly. It failed basic logic and maths, confused numbers, and struggled to count letters. Despite its code-gen billing, its coding output was oddly minified and concise (“code-golfing” style), with no markdown or formatting. Users did not hide their disappointment: they reported a “horrendous” 47.98% on GPQA Diamond, and verdicts ranged from “mid” to “ass.”

Yet users confirmed its OpenAI lineage: it used OpenAI’s tokenizer, responded to system prompts in the classic style (including terms like “oververbosity” and “Juice: 0”), and had a general vibe matching the earlier “zenith” and “summit” models, believed to be GPT-5 candidates. Two theories prevailed: either it was the long-awaited OpenAI open-source model (and a huge letdown), or it was a tiny “nano” model.

About a day later, users experienced total whiplash. Suddenly, the model’s behaviour changed completely: responses slowed (higher “time to first token”), and new API metrics tracked “reasoning tokens.” The performance leap was staggering—MMLU jumped to ~94%, GPQA Diamond hit ~87% (rivaling Grok 4), and the model aced maths, complex coding tasks (like chess with full rules), and nuanced reasoning. The difference was described as “hydrogen bomb vs coughing baby.” The mood in the channel flipped from disappointment to awe.

But this “smart” version was inconsistent. Users found that repeated prompts could randomly yield the “dumb” fast model or the “smart” slower one, pointing to active A/B testing or dynamic routing between model configs. This inconsistency led to frustration: “the model switching out from underneath our feet is really pissing me off.”
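
The two signals users relied on, time to first token and the presence of reasoning tokens, are enough to sort repeated responses into the two observed behaviours. The sketch below shows one way to do that. It rests on assumptions: the `Sample` fields, the 2-second threshold, and the sample values are all made up for illustration and are not measurements from the thread or fields of any real API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    """One observed response (hypothetical fields, filled in by hand)."""
    ttft_seconds: float    # time to first token
    reasoning_tokens: int  # reasoning tokens reported for the response


def classify(sample: Sample, ttft_threshold: float = 2.0) -> str:
    """Label a response by latency and reasoning usage (threshold is an assumption)."""
    if sample.reasoning_tokens > 0 or sample.ttft_seconds >= ttft_threshold:
        return "slow-smart"
    return "fast-dumb"


def summarize(samples: list[Sample]) -> Counter:
    """Count how often each behaviour appeared across repeated prompts."""
    return Counter(classify(s) for s in samples)


if __name__ == "__main__":
    # Made-up observations for the same prompt sent four times.
    observed = [
        Sample(0.4, 0),
        Sample(6.1, 812),
        Sample(0.5, 0),
        Sample(4.8, 530),
    ]
    print(summarize(observed))
```

A mix of both labels for an identical prompt is the pattern that suggested live routing between configurations; a single label across every repeat would point to one fixed model instead.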

Leaks soon surfaced on Hugging Face, showing configs for OpenAI’s upcoming open-source models: a sparse 120B Mixture-of-Experts (MoE) and a 20B dense model. The community concluded they were participating in a multi-model, phased test. The first phase (“dumb”) was likely the 20B dense model, or the 120B MoE with reasoning limited; the second phase (“smart”) was the 120B MoE with reasoning enabled, demonstrating near state-of-the-art performance. The shifting behaviour was consistent with deliberate A/B testing to gather comparative data.

Throughout the event, staff interaction was limited: Toven stoked suspense, Alex Atallah gave the launch signal, and the rest was community-driven. The Discord thread became a live experiment that combined detective work, benchmarking, memes, leaks, and rapid feedback, offering the community a chaotic but fascinating preview of OpenAI’s imminent entry into the open-source LLM arena.

Pub: 01 Aug 2025 10:36 UTC
