<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[lance.tech]]></title><description><![CDATA[CEO @ OpenPay; Venture Partner @ Bessemer Venture Partners]]></description><link>https://www.lance.tech</link><image><url>https://substackcdn.com/image/fetch/$s_!QDmE!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2be09467-a5e0-47a2-af6a-0fd13e372215_1026x1026.png</url><title>lance.tech</title><link>https://www.lance.tech</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:13:11 GMT</lastBuildDate><atom:link href="https://www.lance.tech/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Lance Co Ting Keh]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[lancecotingkeh@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[lancecotingkeh@substack.com]]></itunes:email><itunes:name><![CDATA[Lance Co Ting Keh]]></itunes:name></itunes:owner><itunes:author><![CDATA[Lance Co Ting Keh]]></itunes:author><googleplay:owner><![CDATA[lancecotingkeh@substack.com]]></googleplay:owner><googleplay:email><![CDATA[lancecotingkeh@substack.com]]></googleplay:email><googleplay:author><![CDATA[Lance Co Ting Keh]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AI Infra Roadmap: Five frontiers for 2026]]></title><description><![CDATA[The first generation of AI infrastructure companies unlocked the &#8220;brains&#8221; for intelligence. 
The next generation will unleash these engines of intelligence into the real-world.]]></description><link>https://www.lance.tech/p/ai-infra-roadmap-five-frontiers-for</link><guid isPermaLink="false">https://www.lance.tech/p/ai-infra-roadmap-five-frontiers-for</guid><dc:creator><![CDATA[Lance Co Ting Keh]]></dc:creator><pubDate>Mon, 30 Mar 2026 19:56:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4ZtI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Originally published in Bessemer&#8217;s <a href="https://www.bvp.com/atlas/ai-infrastructure-roadmap-five-frontiers-for-2026">Atlas</a>; co-authored by Janelle Teng Wade, Talia Goldberg, David Cowan, Grace Ma, Bhavik Nagda, Brandon Nydick, Bar Weiner</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ZtI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ZtI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 424w, https://substackcdn.com/image/fetch/$s_!4ZtI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 848w, https://substackcdn.com/image/fetch/$s_!4ZtI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 1272w, https://substackcdn.com/image/fetch/$s_!4ZtI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ZtI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png" width="886" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e774e3c9-7989-4436-88ec-8c84cf520640_886x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:886,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:865789,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.lance.tech/i/192651853?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4ZtI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 
424w, https://substackcdn.com/image/fetch/$s_!4ZtI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 848w, https://substackcdn.com/image/fetch/$s_!4ZtI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 1272w, https://substackcdn.com/image/fetch/$s_!4ZtI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe774e3c9-7989-4436-88ec-8c84cf520640_886x816.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The first generation of AI was built for a world where the model was the product, and progress meant bigger weights, more data, and stellar benchmarks. AI infrastructure mirrored this reality, fueling the rise of giants in foundation models, compute capacity, training techniques, and data ops. This was the focus of our <a href="https://www.bvp.com/atlas/roadmap-ai-infrastructure">2024 AI Infrastructure Roadmap</a>, which drove our investments in companies such as <a href="https://anthropic.com/">Anthropic</a>, <a href="https://fal.ai/">Fal AI</a>, <a href="https://supermaven.com/">Supermaven</a> (acquired by <a href="https://cursor.com/">Cursor</a>), and <a href="https://vapi.ai/">VAPI</a> as the AI infrastructure revolution unfolded.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.lance.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading lance.tech! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>But the landscape has changed. Big labs are moving beyond chasing benchmark gains to designing AI that interfaces with the real world, and enterprises are graduating from POCs to production. The infrastructure that got us here, which was optimized for scale and efficiency, won&#8217;t get us to the next phase. What&#8217;s needed now is infrastructure for grounding AI in operational contexts, real-world experience, and continuous learning.</p><p>The stage is being set for a new wave of AI infrastructure tools to enable AI to operate in the real world. We&#8217;ve identified five frontiers that will define this next wave, each addressing a structural limitation that needs to be solved beyond model scaling.</p><h2><strong>Five cutting-edge frontiers for next-gen AI infrastructure</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tFXX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tFXX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tFXX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tFXX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!tFXX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tFXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!tFXX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tFXX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tFXX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!tFXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35d56441-8a48-43f8-a24c-1b7b9f1f87f7_1600x900.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>1. &#8220;Harness&#8221; infrastructure</strong></h3><p>As AI deployments shift from single models to compound systems, infrastructure designed to &#8220;harness&#8221; models becomes more important than ever.</p><p>Take memory and context management. Most enterprise AI systems suffer from organizational amnesia. While basic Retrieval-Augmented Generation (RAG) solved the connection problem between models and data sources, compound AI systems now require more sophisticated memory infrastructure. Enterprises hold vast amounts of historical data and organizational knowledge, from proprietary documents to CRM records, that AI systems must access to avoid hallucinations and stay grounded in company-specific reality.</p><p>Reliable AI deployment depends not just on raw model horsepower, but on orchestrating components like knowledge retrieval, cross-session context management, and planning. As models become commoditized, differentiation shifts to the memory and context layer. What developers once built from scratch &#8212; custom vector databases and retrieval systems &#8212; is now emerging as its own infrastructure category. 
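</p><p>To make the shape of this layer concrete, below is a minimal sketch of a session-scoped memory store with naive keyword recall. It is illustrative only &#8212; every class and method name is hypothetical rather than any particular vendor&#8217;s API &#8212; but it captures the core contract these products expose: write memories during a session, retrieve the relevant ones later, and prepend them to the model&#8217;s context.</p><pre><code class="language-python">
from collections import defaultdict

class SessionMemory:
    """Toy long-term memory: store notes per session, recall them by token overlap."""

    def __init__(self):
        self._notes = defaultdict(list)  # maps a session id to its list of notes

    def add(self, session_id, text):
        self._notes[session_id].append(text)

    def recall(self, session_id, query, k=3):
        query_tokens = set(query.lower().split())
        scored = []
        for note in self._notes[session_id]:
            overlap = len(query_tokens.intersection(note.lower().split()))
            if overlap:
                scored.append((overlap, note))
        scored.sort(reverse=True)
        return [note for _, note in scored[:k]]

# Anything recalled here would be prepended to the prompt of the next model call.
memory = SessionMemory()
memory.add("user-42", "User prefers concise, formal English.")
memory.add("user-42", "User's CRM is Salesforce, hosted in the EU region.")
print(memory.recall("user-42", "which CRM does this customer run?"))
</code></pre><p>Production systems replace the keyword overlap with embeddings and a vector index, but the interface stays roughly this small. 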
Startups and <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">Big Tech alike</a> now offer plug-and-play semantic layers that maintain conversation context, user preferences, and long-term memory across sessions.</p><p>Novel evaluation and observability present another critical infrastructure challenge &#8212; one that didn&#8217;t exist in prior software development paradigms. Consider teams shipping conversational AI agents to production. Traditional monitoring tracks completion rates, latency, error codes, and thumbs up/down feedback. But conversational AI fails differently. When a chatbot gives a confident wrong answer, gradually drifts from the user&#8217;s actual question, or misunderstands the request while producing something plausible, users often don&#8217;t react. No complaint, no thumbs down, no error signal. The conversation looks fine in dashboards, but the AI has quietly failed.</p><p>An estimated <strong><a href="https://arxiv.org/abs/2603.15423">78% of AI failures are invisible</a></strong> &#8212; AI gets something wrong, but no one catches it. Not the user, not traditional monitoring, not even sentiment analysis. These failures cluster into recurring patterns:</p><ul><li><p><strong>The confidence trap</strong> &#8212; AI is confidently wrong, and the user accepts it</p></li><li><p><strong>The drift</strong> &#8212; AI gradually answers a different question than what was asked</p></li><li><p><strong>The silent mismatch</strong> &#8212; AI misunderstands but produces something plausible enough that the user doesn&#8217;t push back</p></li></ul><p>These patterns persist across 93% of cases even with more powerful models, because they stem from interaction dynamics &#8212; how models present outputs and how users communicate intent &#8212; not capability gaps.</p><p>New infrastructure is emerging to address this. Platforms like <a href="http://bigspin.ai/">Bigspin.ai</a> provide not just pre-deployment testing but real-time production monitoring of model outputs against golden datasets and user feedback. We&#8217;re also moving beyond traditional analytics toward semantic metrics: new platforms such as <a href="https://www.braintrust.dev/">Braintrust</a> and <a href="https://judgmentlabs.ai/">Judgment Labs</a>, along with techniques such as LLM-as-a-judge, are emerging for high-quality evals and metrics definition.</p><p><strong>These examples illustrate evolving needs for AI harness infrastructure. For more on environments, runtime, orchestration, protocols, and frameworks, see our <a href="https://www.bvp.com/atlas/roadmap-developer-tooling-for-software-3-0">Software 3.0 roadmap</a>.</strong></p><h3><strong>2. Continual learning systems</strong></h3><p>Today&#8217;s AI models face a fundamental constraint: frozen weights prevent true learning after deployment. While context management strategies like compaction are powerful, and we see many big labs use them for long-running agents, in-context learning enables only surface-level adaptation through rote memorization, not the acquisition of new skills. It also becomes prohibitively expensive as contexts grow, since the KV cache scales linearly with added context. From both technical and economic perspectives, it&#8217;s infeasible to build AI systems that remember everything and continuously improve over years of use.</p><p>This is where continual learning offers a solution.
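</p><p>The cost ceiling on the &#8220;remember everything in context&#8221; approach is easy to quantify. The back-of-envelope below assumes a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values); the exact numbers are illustrative, but the linear growth &#8212; and the point where it stops being economical &#8212; is the constraint described above.</p><pre><code class="language-python">
# Rough KV-cache footprint as context grows (assumed Llama-2-7B-like config, fp16).
n_layers, n_kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2

def kv_cache_gib(context_tokens):
    # Both keys and values are cached at every layer, hence the factor of 2.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:,} tokens of context  ~=  {kv_cache_gib(tokens):.1f} GiB of KV cache")
</code></pre><p>Rather than paying that bill forever, continual learning moves what the system has learned into the weights themselves. 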
It enables AI to accumulate knowledge and skills across tasks over time, maintaining earlier capabilities while acquiring new ones. Unlike traditional models trained once and deployed statically, continual learning systems evolve in production &#8212; getting smarter with each interaction while avoiding catastrophic forgetting. Researchers and practitioners are pursuing this through innovations at both pre-training and post-training stages.</p><p>Architectural approaches fundamentally rethink how models learn:</p><ul><li><p><a href="https://learning-machine.ai/">Learning Machine</a> is building models that continuously learn during inference, as humans do. Through a new architecture and training paradigm, models will master the meta skill of &#8220;how to learn&#8221;, enabling adaptation to individual users and enterprises post-deployment</p></li><li><p>Core Automation is fundamentally rethinking transformer architecture to build systems where memory emerges naturally from novel attention mechanisms</p></li><li><p><a href="https://arxiv.org/pdf/2512.23675">Stanford and Nvidia&#8217;s TTT-E2E</a> uses a sliding-window Transformer that continues learning at test time through next-token prediction on its context &#8211; compressing that context into its weights. During training, the model learns how to better update its own weights at inference, making the approach end-to-end</p></li></ul><p>Near-term, production-ready solutions are also emerging:</p><ul><li><p>Engram&#8217;s <a href="https://scalingintelligence.stanford.edu/pubs/cartridges/">&#8220;cartridges&#8221; methodology</a> stores long contexts in small KV caches trained offline once, then reused across different user requests during inference</p></li><li><p>Sublinear Systems and foundation model labs are racing to address context limitations through novel techniques</p></li></ul><p><strong>The spectrum of approaches we&#8217;ve seen for continual learning ranges from high-risk architectural moonshots that could redefine the field entirely to production-ready techniques that incrementally improve existing transformers. We&#8217;re eager to meet founders across this spectrum.</strong></p><p>Production deployment of continual learning requires new governance primitives that don&#8217;t yet exist in standard ML workflows. Rollback mechanisms enable reversion to stable checkpoints when updates introduce regressions, requiring full lineage tracking of weights, data, and hyperparameters. Isolation techniques allow safe experimentation without affecting core capabilities. Creating benchmarks, beyond needle-in-the-haystack tests, to gauge the performance of continual learning systems versus in-context learning will also be critical.</p><h3><strong>3. Reinforcement learning platforms</strong></h3><p>With data quality fundamentally determining AI capabilities, the old machine learning axiom of &#8220;garbage in, garbage out&#8221; has never been more relevant. Data platforms such as <a href="https://www.mercor.com/">Mercor</a>, <a href="https://www.turing.com/">Turing</a>, and <a href="https://www.micro1.ai/">micro1</a> have been instrumental in the AI revolution&#8217;s first wave by mobilizing human expertise to create high-quality datasets. But we believe that as AI systems evolve from pattern recognition to autonomous decision-making, a critical limitation has emerged: human-generated labeled data is no longer enough to enable production-grade AI. 
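</p><p>The gap is easiest to see in code. A supervised example is a single (input, label) pair, graded immediately; an agentic task is an episode in which the decisive reward may only arrive at the end and has to be credited back to every earlier step. The toy snippet below computes discounted returns for such an episode &#8212; it is purely illustrative and not tied to any specific platform.</p><pre><code class="language-python">
# Toy episode: the agent takes five actions, but only the final step reveals
# whether the whole trajectory succeeded (reward 1.0) or failed (reward 0.0).
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
gamma = 0.99  # discount factor

# The discounted return G_t assigns each earlier step credit for the delayed
# outcome -- exactly the signal a single (input, label) pair cannot carry.
returns, running = [], 0.0
for r in reversed(rewards):
    running = r + gamma * running
    returns.append(running)
returns.reverse()
print([round(g, 3) for g in returns])  # [0.961, 0.97, 0.98, 0.99, 1.0]
</code></pre><p>Flat labeled data carries no such signal. 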
It cannot teach AI systems how to navigate complex, multi-step tasks with delayed consequences and compounding decisions.</p><p>This is where reinforcement learning (RL) becomes essential, as AI must learn through interaction rather than static datasets to ground the AI in &#8220;experience.&#8221; Leveraging an RL stack is now a cornerstone of AI infra tooling to teach agents complex behaviors without the cost and risk of real-world trial and error. Platforms in this emerging stack include:</p><p>Environment building and experience curation</p><p><a href="https://www.bespokelabs.ai/">Bespoke Labs</a>, <a href="https://deeptune.com/">Deeptune</a>, <a href="http://fleet.so/">Fleet</a>, <a href="https://www.habitat.inc/">Habitat</a>, <a href="https://matrices.ai/">Matrices</a>, <a href="https://www.mechanize.work/">Mechanize,</a> <a href="https://openreward.ai/">OpenReward</a>, <a href="https://www.phinity.ai/">Phinity</a>, <a href="https://www.preferencemodel.com/">Preference Model</a>, <a href="https://proximal.ai/">Proximal,</a> <a href="https://www.sepalai.com/">SepalAI</a>, <a href="https://steadyworks.ai/">Steadyworks, </a><a href="https://veris.ai/">Veris</a>, <a href="https://vmax.ai/">VMax</a></p><p>RL-as-a-service</p><p><a href="https://appliedcompute.com/">Applied Compute</a>, <a href="https://cgft.io/">cgft</a>, <a href="https://www.withmetis.ai/">Metis</a>, <a href="https://osmosis.ai/">osmosis, </a><a href="https://trajectory.ai/">Trajectory</a></p><p>Platform infrastructure</p><p><a href="https://www.agilerl.com/">AgileRL</a>, <a href="https://www.hud.ai/">Hud</a>, <a href="https://www.isidor.ai/">Isidor</a>, <a href="https://openpipe.ai/blog/announcing-6-7m-seed-raise">OpenPipe</a>, <a href="https://www.primeintellect.ai/">Prime Intellect</a>, <a href="https://thinkingmachines.ai/tinker/">Tinker</a></p><h3><strong>4. Inference inflection point</strong></h3><p>Model deployment and inference optimization emerged as a critical infrastructure layer in our 2024 roadmap, when vendors like <a href="https://fal.ai/">Fal</a>, <a href="https://www.together.ai/">Together</a>, <a href="https://www.baseten.co/">Baseten</a>, and <a href="https://fireworks.ai/">Fireworks</a> pioneered efficient serving solutions. At that time, capital-intensive model training consumed the majority of compute resources across the AI stack. Today, we&#8217;re witnessing a fundamental shift in the compute center of gravity. As AI agents and applications transition from prototype to production at scale, inference workloads now rival, and in many cases exceed, training in both compute demand and economic importance. As NVIDIA&#8217;s <a href="https://www.youtube.com/watch?v=jw_o0xr8MWU">Jensen Huang stated in his GTC 2026 keynote</a>, &#8220;Finally, AI is able to do productive work, and therefore the inflection point of inference has arrived.&#8221;</p><p><strong>This inflection point reflects a maturing market where the cost and performance of running AI systems continuously matter just as much as the initial investment in building them.</strong></p><p>A new generation of infrastructure startups is addressing this production imperative through specialized optimization across the inference stack. 
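</p><p>One way to see where those gains come from: agentic workloads re-send a largely identical prefix &#8212; system prompt, tool schemas, accumulated conversation &#8212; on every model call, and recomputing that prefix each time is pure waste. The rough estimate below is illustrative only; the token counts are assumptions, and it presumes the prefix stays bit-identical and cached between calls.</p><pre><code class="language-python">
# Illustrative numbers for an agent that calls a model many times per task.
shared_prefix_tokens = 6_000   # system prompt + tool schemas + running history
new_tokens_per_call = 400      # the genuinely new portion of each request
calls_per_task = 25

without_reuse = calls_per_task * (shared_prefix_tokens + new_tokens_per_call)
with_reuse = shared_prefix_tokens + calls_per_task * new_tokens_per_call

print(f"prefill tokens without prefix reuse: {without_reuse:,}")
print(f"prefill tokens with prefix reuse:    {with_reuse:,}")
print(f"redundant prefill avoided:           {1 - with_reuse / without_reuse:.0%}")
</code></pre><p>Caching and routing systems exist precisely to recover that waste. 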
Companies like <a href="https://www.tensormesh.ai/">TensorMesh</a> are leveraging <a href="https://lmcache.ai/">LMCache</a> to eliminate redundant re-computation, <a href="https://www.radixark.ai/">RadixArk</a> is advancing SGLang-based routing and scheduling for multi-turn conversations, and <a href="https://inferact.ai/">Inferact</a> is pushing vLLM performance boundaries for high-throughput serving. <a href="https://gimletlabs.ai/">Gimlet Labs</a> and even hyperscalers like <a href="https://developer.nvidia.com/blog/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator-for-the-nvidia-vera-rubin-platform/">NVIDIA</a> are working on heterogeneous inference innovations purpose-built for complex agentic systems. These innovations translate cutting-edge systems research into measurable production gains: faster response times and lower costs.</p><p>We&#8217;re also seeing innovations in inference for novel deployments, with edge and on-device as one prime example. As AI proliferates all sectors of the economy, from robotics to consumer, AI deployments need to meet users where they are, which isn&#8217;t always cloud-based. We&#8217;re seeing companies such as <a href="https://www.webai.com/">WebAI</a>, <a href="https://femto.ai/">FemtoAI</a>, <a href="https://www.polargrid.ai/">PolarGrid</a>, <a href="https://aizip.ai/">Aizip Mirai</a>, and <a href="https://openinfer.io/">OpenInfer</a> build at the very &#8220;edge&#8221; of what&#8217;s possible for on-device AI deployments in consumer devices. On-device innovations from model vendors such as <a href="https://www.perceptron.inc/">Perceptron</a> are also important for physical AI, and we expect more in the space as we outlined in <a href="https://www.bvp.com/atlas/intelligent-robotics-the-new-era-of-physical-ai">our thinking on intelligent robotics</a>.</p><p><a href="https://www.bvp.com/atlas/defense-tech-roadmap-five-frontiers-for-2026">Edge AI is also critical for industries such as defense</a>, where comms are jammed or denied; companies such as <a href="https://www.turbineone.com/">TurbineOne</a>, <a href="https://www.bvp.com/news/dominion-dynamics-forging-the-future-of-interoperable-attritable-systems-for-arctic-and-allied-defense">Dominion Dynamics</a>, <a href="https://picogrid.com/">Picogrid</a>, and <a href="https://breakerindustries.com/">Breaker</a> are leading the charge on providing the infrastructure tooling for warfighters to harness the power of AI even in the most austere environments.</p><h3><strong>5. World models</strong></h3><p><a href="https://www.bvp.com/atlas/roadmap-ai-infrastructure#1-Innovations-in-scaling-novel-model-architectures-and-specialized-purpose-foundation-models">The model layer is one of the most dynamic and hotly contested layers within the AI infrastructure stack</a>. While LLMs have taken over language intelligence, a new class of models &#8212; world models &#8212; has emerged to deliver intelligence for the physical world.</p><p>As AI moves from our screens to our physical realities, new challenges arise: how does an AI &#8220;brain&#8221; develop intuition for physics and the world if it has no &#8220;body&#8221;? World models offer a solution. At the core, these are AI systems trained on real-world data &#8212; video, sensors, GPS, and more &#8212; that learn to predict how the world evolves given a current situation and action. Rather than describing reality, they simulate it.</p><p>Out of this newer research, three broad architectural paradigms have emerged. 
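</p><p>Whatever the paradigm, the contract is the same: given the current state of the world and an action, predict the next state, so that an agent can rehearse entire trajectories without touching reality. A minimal sketch of that shared interface follows (all names are illustrative, not any vendor&#8217;s API).</p><pre><code class="language-python">
from dataclasses import dataclass

@dataclass
class WorldState:
    observation: object  # raw pixels, an explicit 3D scene, or a latent vector

class WorldModel:
    """Shared contract: predict the next state from the current state and an action."""

    def predict(self, state, action):
        raise NotImplementedError

def imagine_rollout(model, state, policy, horizon=50):
    """Simulate a trajectory entirely inside the model, with no real-world interaction."""
    trajectory = [state]
    for _ in range(horizon):
        state = model.predict(state, policy(state))
        trajectory.append(state)
    return trajectory
</code></pre><p>The paradigms below differ mainly in what they use as that state representation. 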
In practice, companies are also beginning to explore hybrids that combine elements of each:</p><ul><li><p><strong>Video-based world models</strong> from companies such as <a href="https://reka.ai/">Reka</a> and <a href="https://decart.ai/">Decart</a> frame the problem as one of video generation, predicting future frames directly in pixel space. Because they generate outputs step-by-step, they can operate in real time and respond dynamically to new inputs, making them well-suited for interactive environments. Though they still struggle with maintaining physical consistency over longer horizons, they produce visually compelling outputs</p></li><li><p><strong>Explicit 3D representation models</strong> from companies such as <a href="https://www.worldlabs.ai/">World Labs</a> take a different path, constructing persistent 3D scene representations that deliver strong spatial coherence at a lower inference cost. For now, these environments are pre-generated and static, but World Labs has signaled that real-time interactivity is on its roadmap</p></li><li><p><strong>Latent predictive models</strong>, based on Joint Embedding Predictive Architectures (JEPA) pioneered by <a href="https://amilabs.xyz/">AMI Labs</a>, avoid pixel generation altogether by forecasting future states in a compressed latent space. This approach is significantly more compute-efficient and sidesteps many visual failure modes, but comes with reduced interpretability. While each paradigm has seen meaningful progress, important gaps remain &#8212; how these are resolved will shape the path to the broader commercialization of world models</p></li></ul><p>This commercial opportunity for world models is expansive. We recently shared our view of <a href="https://www.bvp.com/atlas/can-world-models-unlock-general-purpose-robotics">world models in robotics</a>, as this sector has been among the most visible early applications. By generating unlimited synthetic training environments, world models solve the data scarcity problem that has bottlenecked physical AI for decades. Autonomous driving is proving this as Waymo and Wayve use world models to simulate rare edge cases that no real-world test program could economically replicate. The same core capability unlocks even more, such as high-stakes simulation in defense, healthcare, industrial operations, and enterprise planning.</p><p>World models are not a vertical-specific kind of tool &#8212; they&#8217;re a new substrate for machine intelligence, analogous to what LLMs did for text-based reasoning. The industries that build on top of them early will have a significant head start on deploying agents that work in the real world. We&#8217;re excited about companies building the architectures and simulators that make world models possible across industries.</p><h2><strong>Building infrastructure for AI to experience and enter the real world</strong></h2><p>While the first generation of AI infrastructure companies built the engines of intelligence &#8212; the models, compute clusters, and training pipelines that proved AI&#8217;s capability &#8212; the next generation must build the nervous system and harnesses that allow AI to sense, remember, adapt, and operate continuously in the real world. These frontiers represent more than incremental improvements to existing infrastructure. 
The companies building in these spaces aren&#8217;t just optimizing latency or reducing costs; they&#8217;re solving the fundamental challenges that separate impressive demos from reliable systems that create enduring value.</p><p>We believe 2026 will be the year when AI infrastructure&#8217;s center of gravity definitively shifts, reimagining what AI-native operations look like for this year and beyond. <strong>We&#8217;re particularly excited to work with founders who are pursuing these endeavors. To get in touch with us, please contact aiinfra@bvp.com.</strong></p><p><strong><sup>Disclaimer</sup></strong><sup>: The information presented here is for general informational and educational purposes only and does not constitute investment advice, a recommendation, or an offer or solicitation to buy or sell any securities or investment products. The information presented is also not intended as advertising material under the Investment Advisers Act. Certain companies discussed may be current or former portfolio companies. BVP may still have a financial interest in these companies. Any discussion of specific companies, securities, or investment strategies should not be considered a recommendation to take any particular action. Past performance is not indicative of future results. All investments involve risk, including possible loss of principal. Market conditions and investment returns can fluctuate significantly. Please visit https://www.bvp.com/legal for more information.</sup></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.lance.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading lance.tech! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[ChipAgents: Agentic AI for Chip Design]]></title><description><![CDATA[Bessemer Venture Partners leads ChipAgents&#8217; $21M Series A to transform chip design with agentic AI.]]></description><link>https://www.lance.tech/p/chipagents-agentic-ai-for-chip-design</link><guid isPermaLink="false">https://www.lance.tech/p/chipagents-agentic-ai-for-chip-design</guid><dc:creator><![CDATA[Lance Co Ting Keh]]></dc:creator><pubDate>Fri, 20 Mar 2026 16:27:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!26RW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!26RW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!26RW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 424w, https://substackcdn.com/image/fetch/$s_!26RW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 848w, https://substackcdn.com/image/fetch/$s_!26RW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 1272w, https://substackcdn.com/image/fetch/$s_!26RW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!26RW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png" width="794" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:794,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:519623,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.lance.tech/i/191598731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!26RW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 424w, https://substackcdn.com/image/fetch/$s_!26RW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 848w, https://substackcdn.com/image/fetch/$s_!26RW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 1272w, https://substackcdn.com/image/fetch/$s_!26RW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d926df-189a-4c3f-8443-9c04bf4a6989_794x620.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><em>Originally published in Bessemer&#8217;s <a href="https://www.bvp.com/news/noda-ai-building-the-future-of-operational-collaborative-autonomy-at-the-frontlines">Atlas</a>; co-authored by Lance Co Ting Keh, Jason Scheller, David Cowan </em></p><p>For years, chip design and verification have remained among the most complex and resource-intensive challenges in engineering. 
Hardware teams spend months navigating fragmented electronic design automation (EDA) toolchains, long simulation runtimes, and manual verification workflows that stretch development cycles and inflate costs. Designing at the Register Transfer Level (RTL), where logic is described in languages like Verilog or VHDL, demands both software-like precision and deep hardware intuition. Each change must be validated across thousands of simulation cases, with engineers writing extensive test benches and debugging waveform outputs line by line. Design and verification can consume anywhere from 60-80% of total chip development time, yet these painstaking efforts are indispensable, forming the foundation for everything from AI accelerators and data-center processors to automotive controllers.</p><p><a href="https://chipagents.ai/">ChipAgents</a> marks a breakthrough in this long-standing bottleneck. The company&#8217;s agentic AI platform reimagines how chips are designed, debugged, and verified, enabling faster design cycles, automated verification, and seamless collaboration between human engineers and AI. Built to feel native within the environments and tools engineers already rely on, ChipAgents fits seamlessly into existing design flows while introducing new levels of intelligence and automation. By transforming how teams generate RTL code, debug complex systems, and validate designs, ChipAgents is unlocking a new era of speed, quality, and creativity in hardware engineering. Early customers report that tasks which once required weeks of manual effort can now be completed in days without compromising correctness or performance.</p><p>This is why Bessemer&#8217;s Lance Co Ting Keh, Jason Scheller, and David Cowan are excited to lead ChipAgents&#8217; $21 million Series A. With oversight from industry veterans like Wally Rhines, Ra&#250;l Camposano, and Jack Harding, the company&#8217;s vision goes beyond innovation. To learn more, we sat down with founder and CEO William Wang for his thoughts about how ChipAgents is defining the future of EDA.</p><h2><strong>Q&amp;A with the founder of ChipAgents</strong></h2><p><strong>Tell us about yourself.</strong></p><p>Before starting ChipAgents.ai, I spent most of my career in academia and currently serve as the Duncan and Suzanne Mellichamp Endowed Chair in Artificial Intelligence and Design at the University of California, Santa Barbara. Over the past nine years at UCSB, I&#8217;ve built the Natural Language Processing research group into a leading center for AI innovation, collaborating with nearly all major technology companies to advance fundamental algorithms in natural language processing, artificial intelligence, and machine learning.</p><p><strong>What inspired you to start ChipAgents?</strong></p><p>My journey toward founding ChipAgents started long before the deep learning revolution. Back in 2011, during my PhD at Carnegie Mellon University, we were experimenting with recurrent neural network language models, well before the deep learning era and even before ImageNet transformed the field.</p><p>My doctoral research at CMU focused on theorem proving and formal methods. We developed an approximate personalized PageRank algorithm that made inference for theorem proving <em>locally groundable</em>, meaning that inference time became independent of the size of the underlying database. 
That early work on scalable, interpretable reasoning planted the seeds for what would eventually become ChipAgents&#8217; mission in verification.</p><p>At UCSB, my group continued pushing the boundaries of reasoning and learning. In 2017, we created DeepPath, the industry&#8217;s first deep reinforcement learning framework for reasoning models, which demonstrated how neural networks can learn to reason through complex knowledge graphs. By 2024, it became clear that these technologies had matured to a point where they could fundamentally transform how we design and optimize chips. That belief, combining decades of AI research with the opportunity to reinvent EDA, was what inspired me to start ChipAgents.</p><p><strong>What exactly does ChipAgents do?</strong></p><p>At ChipAgents, we&#8217;re building intelligent AI agents that transform how chips are designed and verified. Our mission is to bring the power of agentic AI into the core of the EDA process, making chip design faster, more reliable, and far more scalable. Traditionally, verification &#8212; ensuring that a chip design behaves exactly as intended &#8212; has been one of the most complex, time-consuming, and expensive stages of semiconductor development. Modern chips can contain billions of transistors, and verifying every interaction has required massive engineering effort. At ChipAgents, we&#8217;re using AI reasoning agents that can analyze, understand, and even prove design properties automatically, dramatically reducing verification bottlenecks.</p><p>Our technology allows these agents to read design specifications, reason about logical correctness, and generate verification artifacts or proofs, using the same class of language and reasoning models that have revolutionized natural language understanding. This creates a new level of automation, one that goes beyond pattern matching or simulation, toward true semantic understanding of hardware behavior.</p><p>ChipAgents builds on more than a decade of AI research. From my early work at Carnegie Mellon on theorem proving and locally groundable inference, to our development of deep reinforcement learning reasoning models, we&#8217;ve been exploring how AI can perform structured, explainable reasoning. Now, that foundation enables us to apply these methods to one of the most critical challenges in modern computing: making chip verification intelligent, adaptive, and orders of magnitude more efficient.</p><p><strong>Tell us about the momentum you&#8217;re seeing at ChipAgents.</strong></p><p>We&#8217;re seeing incredible momentum &#8212; sales are up more than 50x year-over-year in ARR, and usage has grown over 60x. It&#8217;s a super exciting time to witness how AI is transforming chip design and verification at a fundamental level. We&#8217;re still at the early stages of this revolution, but the acceleration we&#8217;re seeing from customers and partners makes it clear that intelligent agents are redefining what&#8217;s possible in the semiconductor industry.</p><p><strong>How might chip companies redefine their competitive advantage when the barrier isn&#8217;t hardware capability but AI workflow mastery?</strong></p><p>I believe the entire workflow of chip design and verification will change in the era of agentic AI. In the past, engineers spent enormous time implementing specifications, writing test plans, and generating test stimuli by hand.
But as AI becomes a core part of the workflow, competitive advantage will shift from manual implementation to how effectively teams can orchestrate AI agents &#8212; how they write better prompts, configure intelligent workflows, and verify AI-generated collateral with confidence. In other words, success won&#8217;t just depend on who has the most powerful chips, but on who can best collaborate with AI to accelerate innovation, ensure correctness, and continuously optimize the design cycle.</p><p><strong>What is your vision for the future of ChipAgents?</strong></p><p>Our vision is to go beyond point tools in EDA and create a truly end-to-end agentic AI design and verification system, from RTL to GDS. We want to build an integrated workflow where AI agents can understand design intent, reason across abstraction levels, and continuously improve through feedback.</p><p>Ultimately, we aim to close the loop between pre-silicon and post-silicon, using real-world performance and validation data to inform the next generation of design. This creates a virtuous cycle of agentic AI, where the system not only automates design and verification but also learns and optimizes itself over time. That&#8217;s how we see the future of chip design: intelligent, adaptive, and self-improving.</p><p><strong>What is a common misconception about the future of AI? Or something that people are not thinking about, but should be?</strong></p><p>One of the biggest misconceptions about AI is that it&#8217;s here to replace engineers. In reality, AI will transform the nature of engineering work, not eliminate it. There&#8217;s often a psychological barrier when people imagine AI &#8220;taking over&#8221; their roles, but at ChipAgents, our mission is the opposite: to enable engineers to 10x (even 100x productivity in the future), not replace them. We&#8217;re giving engineers the tools to shift from implementation to innovation, to spend less time on tedious, repetitive tasks and more time designing new architectures, exploring new ideas, and creating better chips. The future of AI in chip design isn&#8217;t about automation for its own sake; it&#8217;s about augmenting human creativity and enabling engineers to operate at an entirely new level of abstraction and productivity.</p>]]></content:encoded></item><item><title><![CDATA[DeepSeek R1 ]]></title><description><![CDATA[and Some Implications for Startups]]></description><link>https://www.lance.tech/p/deepseek-r1</link><guid isPermaLink="false">https://www.lance.tech/p/deepseek-r1</guid><dc:creator><![CDATA[Lance Co Ting Keh]]></dc:creator><pubDate>Sun, 26 Jan 2025 06:26:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dcL9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4005061-2e01-4e60-a9a6-6544722c9a23_1260x660.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>How We&#8217;ve Been Training Foundation Models</h2><p>If you wanted to train a state-of-the-art foundation model over the past few years, the playbook was pretty well established. Start with pre-training: grab a massive, largely unlabeled corpus of internet-scale text, throw it into a gigantic transformer&#8212;and let it soak up the statistical patterns of language. 
This stage is computationally expensive but crucial&#8212;it&#8217;s what gives models like GPT-4, Llama, and Claude their broad generalization abilities.</p><p>Once you&#8217;ve got a decently capable language model, the next step is Supervised Fine-Tuning (SFT). Here, you refine the model on high-quality labeled datasets for specific tasks. The idea dates back to GPT-1 (2018), which introduced fine-tuning as a critical step for transfer learning in large-scale transformers. Other key milestones include DecaNLP (2018) and T5 (2019), which further popularized the notion of multi-task learning leading to fine-tuning. ULMFiT (2018) was another early example demonstrating how pre-trained language models could be adapted efficiently to downstream tasks.</p><p>Finally, there&#8217;s RLHF&#8212;Reinforcement Learning from Human Feedback. This phase is what makes modern chatbots feel human-like. RLHF starts with human annotators ranking multiple model outputs, and then the model is trained via reinforcement learning (typically PPO) to generate responses that align better with human preferences.</p><p>This stack&#8212;pre-training &#8594; SFT &#8594; RLHF&#8212;has been the  standard. DeepSeek is now challenging that paradigm.</p><p></p><h2>The DeepSeek Breakthrough: Reinforcement Learning First</h2><p>DeepSeek-R1 flips the script. Instead of fine-tuning on a curated dataset before reinforcement learning, DeepSeek-R1-Zero skips supervised fine-tuning entirely. The model is trained via pure RL from the get-go, incentivizing reasoning capabilities directly. This is a big deal.</p><h3>What&#8217;s Special About R1-Zero?</h3><p>Without any SFT, DeepSeek-R1-Zero still manages to exhibit strong reasoning capabilities, particularly in Chain-of-Thought (CoT) style problem-solving. The model achieves impressive benchmarks purely through RL optimization, with no initial human-labeled data.</p><p>A few notable highlights:</p><ul><li><p><strong>CoT emerges naturally</strong>&#8212;not because it was explicitly trained on examples, but because RL incentivized longer reasoning chains.</p></li><li><p><strong>Reflection as an emergent behavior</strong>&#8212;the model learns to step back, reassess its approach, and self-correct, all through RL.</p></li><li><p><strong>GRPO (Group Relative Policy Optimization)</strong>&#8212;an RL technique that avoids the traditional critic model, instead using a ranking-based approach to improve efficiency.</p></li><li><p><strong>Majority voting and test-time compute</strong>&#8212;DeepSeek-R1-Zero leverages <strong>two key test-time compute strategies</strong> to enhance accuracy and reasoning depth:</p></li></ul><ol><li><p><strong>Extended Generation Length</strong>&#8212;The model autonomously <strong>allocates more compute per response</strong>, generating longer reasoning chains (hundreds to thousands of tokens) as it refines its thought process. This emergent behavior leads to <strong>higher accuracy on complex reasoning tasks</strong>.</p></li><li><p><strong>Majority Voting</strong>&#8212;For benchmarks like AIME 2024, <strong>16 candidate responses were generated per question</strong>, and accuracy was computed based on majority voting. </p></li></ol><p>Benchmark results for R1-Zero show remarkable performance gains across multiple reasoning tasks. On AIME 2024, it achieved a pass@1 score of 71.0%, and when using majority voting, the accuracy jumped to 86.7%, rivaling OpenAI&#8217;s o1-0912. 
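</p><p>Majority voting (often called self-consistency) is simple to implement: sample several candidate responses at non-zero temperature, extract each one&#8217;s final answer, and return the most common answer. The sketch below is illustrative; <code>sample_answer</code> is a hypothetical stand-in for whatever model call and answer-extraction you use.</p><pre><code class="language-python">
from collections import Counter
import random

def majority_vote(question, sample_answer, n_samples=16):
    """Self-consistency: sample n answers and keep the most frequent one."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples  # answer plus its vote share

# Demo with a fake sampler that is right about 60% of the time; with a real
# model, sample_answer would call the API with temperature sampling enabled.
def fake_sampler(question):
    return random.choices(["42", "41", "40"], weights=[6, 2, 2])[0]

print(majority_vote("toy question", fake_sampler))
</code></pre><p>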
The model also demonstrated strong performance on math (MATH-500 at 95.9%) and coding (Codeforces percentile of 60.0%). These results underscore how powerful reinforcement learning alone can be in developing advanced reasoning capabilities without the need for initial supervised fine-tuning.</p><h3>DeepSeek-R1: Adding a Cold Start</h3><p>DeepSeek-R1 improves upon R1-Zero by incorporating a small amount of high-quality &#8220;cold start&#8221; data before RL. This helps stabilize training and improve readability. The pipeline looks like this:</p><ol><li><p><strong>Cold Start Data</strong>&#8212;800K carefully curated high-quality samples.</p><ul><li><p>600K are generated via rejection sampling from an RL-trained checkpoint.</p></li><li><p>200K cover task-specific data (e.g., factual QA, writing, self-cognition, and translation).</p></li><li><p>The rejection sampling process refines outputs to ensure higher reasoning coherence.</p></li></ul></li><li><p><strong>Two RL stages</strong>&#8212;</p><ul><li><p><strong>Stage 1: Reasoning Optimization</strong>&#8212;The first RL stage focuses purely on enhancing reasoning capabilities, driving emergent CoT-style responses and improving problem-solving efficiency.</p></li><li><p><strong>Stage 2: Alignment with Human Preferences</strong>&#8212;Once reasoning optimization stabilizes, a second RL stage fine-tunes outputs for readability and human alignment, reducing incoherence while preserving strong reasoning skills.</p></li></ul></li><li><p><strong>SFT after RL</strong>&#8212;Unlike traditional approaches, DeepSeek uses SFT to refine the RL-learned behaviors rather than kickstarting them.</p></li></ol><h3>Distillation: Making It Even More Efficient</h3><p>One of the most exciting results from DeepSeek-R1 is that <strong>distillation alone, without reinforcement learning, still produces highly capable models</strong>. By using the same 800K high-quality samples curated from RL-trained outputs, DeepSeek was able to train distilled versions of both Llama and Qwen models that demonstrated remarkable reasoning performance.</p><ul><li><p>Llama and Qwen models distilled&#8212;Distillation was successfully applied to both Llama and Qwen, producing models that outperformed their non-distilled counterparts in reasoning tasks. The distilled Qwen-32B and Llama-70B models demonstrated superior pass@1 scores in math, coding, and general reasoning benchmarks.</p></li><li><p><strong>Distillation alone yields strong Chain-of-Thought (CoT) reasoning</strong>&#8212;The distilled models retained CoT capabilities without needing explicit reinforcement learning, highlighting how easily these reasoning skills transfer.</p></li><li><p><strong>Distillation vs RL on Qwen</strong>&#8212;DeepSeek compared training RL directly on Qwen vs distilling R1&#8217;s output into Qwen. The latter consistently outperformed RL-only models, suggesting that knowledge transfer from a stronger model is more effective than RL on smaller models alone.</p></li></ul><h3>Prompting Simplicity</h3><p>A major advantage across all DeepSeek models is <strong>prompting simplicity</strong>. 
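</p><p>Concretely, the template published with R1 asks the model to put its reasoning inside think tags and its final result inside answer tags, so downstream code can separate the two without elaborate scaffolding. Below is a rough sketch of parsing such an output; the tag names follow the R1 paper&#8217;s template, while deployed variants may differ and the parsing helper itself is illustrative.</p><pre><code class="language-python">
import re

# DeepSeek-R1's template wraps reasoning and the final result in dedicated tags,
# so applications can separate "showing the work" from the answer itself.
PATTERN = re.compile(r"&lt;think&gt;(.*?)&lt;/think&gt;\s*&lt;answer&gt;(.*?)&lt;/answer&gt;", re.DOTALL)

def split_think_answer(model_output):
    match = PATTERN.search(model_output)
    if match is None:
        return None, model_output.strip()  # fall back to treating it all as the answer
    reasoning, answer = match.groups()
    return reasoning.strip(), answer.strip()

raw = "&lt;think&gt;2 + 2 is 4; doubling gives 8.&lt;/think&gt;&lt;answer&gt;8&lt;/answer&gt;"
print(split_think_answer(raw))  # ('2 + 2 is 4; doubling gives 8.', '8')
</code></pre><p>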
Unlike traditional models that require extensive prompt engineering, DeepSeek models leverage a straightforward "Think-Answer" format, making them easier to integrate into applications with minimal adaptation.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/ethVO/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4005061-2e01-4e60-a9a6-6544722c9a23_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:337,&quot;title&quot;:&quot;Benchmark Results&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/ethVO/1/" width="730" height="337" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><h2>What This Might Mean for Startups</h2><ul><li><p><strong>Simple prompt</strong>&#8212;DeepSeek&#8217;s "Think-Answer" format significantly reduces the need for complex prompt engineering. Efforts in tuning prompts could instead be pushed up the stack into distilling reasoning models.</p></li><li><p><strong>Ease of training CoT</strong> - With RL-first approaches yielding strong reasoning capabilities, smaller teams may be able to bypass expensive supervised fine-tuning, allowing them to deploy domain-specific reasoning models quickly. The bar for training high-quality reasoning models has dropped significantly.</p></li><li><p><strong>Distillation on a small dataset</strong>&#8212;The ability to fine-tune reasoning models effectively with a relatively small dataset (800K samples) demonstrates that high performance can be achieved without massive data collection. Spending resources building a high-quality data moat seems correct.</p></li><li><p><strong>Test-time compute optimization</strong>&#8212; The trend continues. Product teams should delineate which AI tasks require high-quality, infrequent responses (where test-time compute strategies like extended generation and majority voting can be leveraged) versus high-throughput, latency-sensitive tasks better served by smaller, fine-tuned models.</p></li></ul><p>The research community is moving quickly to validate and expand on these findings, with teams like <a href="https://github.com/huggingface/open-r1">Hugging Face already diving in</a>. Expect rapid iteration, fresh insights, and potential breakthroughs as more teams test and refine DeepSeek&#8217;s approach. Hats off to the DeepSeek team for this work!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.lance.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading lance.tech! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Thoughts on OpenAI's o3 ]]></title><description><![CDATA[and where I'd apply expensive but effective CoT models today]]></description><link>https://www.lance.tech/p/thoughts-on-openais-o3</link><guid isPermaLink="false">https://www.lance.tech/p/thoughts-on-openais-o3</guid><dc:creator><![CDATA[Lance Co Ting Keh]]></dc:creator><pubDate>Sun, 29 Dec 2024 00:11:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bwlh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello World! I&#8217;ve long told myself that I should start writing more publicly, and what better time to do so than to talk briefly about the <a href="https://www.youtube.com/watch?v=SKBG1sqdyIU">latest OpenAI model, o3.</a></p><p></p><h2><strong>The ARC-AGI benchmark</strong></h2><p>Fran&#231;ois Chollet, the creator of Keras, argued in 2019 that task-specific benchmarks are doomed to overfit in some shape or form. Overfitting can arise <em>explicitly</em> from sufficient data coverage in a domain or <em>implicitly</em> from biases in the training or evaluation sets, feature over-reliance, or other algorithmic artifacts. To combat this, he released a benchmark called the Abstraction and Reasoning Corpus (ARC) in his paper <a href="https://arxiv.org/abs/1911.01547">On the Measure of Intelligence</a>. This benchmark is designed to evaluate the general reasoning abilities of AI systems beyond narrow tasks. It consists of abstract pattern-recognition problems that require understanding, generalization, and creativity to solve.</p><p>The <a href="https://github.com/fchollet/ARC-AGI">benchmark</a> emphasizes reasoning over statistical pattern-matching, measuring how well AI systems can approach tasks they haven't been explicitly trained on.</p><p>The following excerpt and figure describe the dataset well: &#8220;ARC-AGI consists of unique training and evaluation tasks. Each task contains input-output examples. The puzzle-like inputs and outputs present a grid where each square can be one of ten colors.
A grid can be any height or width between 1x1 and 30x30&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bwlh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bwlh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 424w, https://substackcdn.com/image/fetch/$s_!bwlh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 848w, https://substackcdn.com/image/fetch/$s_!bwlh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 1272w, https://substackcdn.com/image/fetch/$s_!bwlh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bwlh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png" width="1456" height="733" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bwlh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 424w, https://substackcdn.com/image/fetch/$s_!bwlh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 848w, https://substackcdn.com/image/fetch/$s_!bwlh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 1272w, https://substackcdn.com/image/fetch/$s_!bwlh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63dd27d1-8bf0-46c3-8cde-514709a49822_1510x760.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Source: <a href="https://arcprize.org/arc">https://arcprize.org/arc</a></em></figcaption></figure></div><p>While I don&#8217;t consider ARC-AGI to be the ultimate benchmark for AGI, I have long believed that it remains an exceptionally robust measure of a model&#8217;s general reasoning versatility, one that clearly gets around memorization and demonstrates the ability to solve novel, unstructured problems.</p><p>The community has experimented with a number of approaches to attacking the ARC-AGI benchmark, including both deep learning and &#8220;classical&#8221; statistical learning approaches. Not surprisingly, most of the recent ones were LLM-based. The ARC team published a <a href="https://arxiv.org/abs/2412.04604">technical paper</a> discussing some of the best 2024 approaches. We&#8217;ll cover some of these approaches in another post, but the best results before o3 hovered around ~55% (on the private held-out set), achieved in the work titled <a href="https://arxiv.org/abs/2411.02272">Combining Induction and Transduction for Abstract Reasoning</a> by Li et al. Human performance is reported to be around 85%. </p><p></p><h2><strong>o3 results</strong></h2><p>o3 scored an amazing 75.7% on ARC-AGI&#8217;s semi-private eval set in the low compute mode and 87.5% in the high compute mode. There are a few disclaimers here. o3 was reportedly fine-tuned on 300 of 400 ARC training tasks, and the runs were on the semi-private eval set (presumably because of the inability to run o3 in complete isolation [no internet]). Nevertheless, it still outperforms prior models on other well-known benchmarks:</p><ul><li><p><a href="https://www.swebench.com/">SWE bench</a> - simulates real-world software engineering issues by providing a codebase and an issue description. o3 scored 71.7%, up from <a href="https://aide.dev/blog/sota-bitter-lesson">62.2% by the Aide team</a>, and 49% by Claude 3.5 (sonnet).</p></li><li><p><a href="https://codeforces.com/blog/entry/22260?mobile=false">Codeforces</a> - competitive programming problems (similar to those in the Informatics Olympiad [IOI]). o3 scores an Elo of 2727, up from o1&#8217;s rating of 1891.</p></li><li><p>AIME 2024 - math problems - o3 made just one error (whopping my personal best of three errors in high school!).</p></li><li><p><a href="https://arxiv.org/abs/2311.12022">GPQA</a> - multiple choice in bio, physics and chemistry.
o3 scores 87.7% up from 78% from o1 and Claude.</p></li><li><p><a href="https://epoch.ai/frontiermath">EpochAI&#8217;s FrontierMath</a> - hard math problems. o3 scored 25.2% up from 2%</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1RIp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1RIp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 424w, https://substackcdn.com/image/fetch/$s_!1RIp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 848w, https://substackcdn.com/image/fetch/$s_!1RIp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 1272w, https://substackcdn.com/image/fetch/$s_!1RIp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1RIp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png" width="1456" height="763" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:763,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1RIp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 424w, https://substackcdn.com/image/fetch/$s_!1RIp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 848w, https://substackcdn.com/image/fetch/$s_!1RIp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 1272w, https://substackcdn.com/image/fetch/$s_!1RIp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d26d-5a28-41c3-b008-3096229b80b7_1600x838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Source: <a href="https://www.youtube.com/live/SKBG1sqdyIU?si=Pqx8uANCZHldABM9">OpenAI&#8217;s announcement video</a></em></figcaption></figure></div><p>These are very impressive results across the board, and we have a lot to be excited about. Given what we&#8217;ve seen with o1 and now o3, it seems that Chain of Thought (CoT) models drive the best performance for complex multi-step tasks when the inference cost and latency budget is effectively unlimited. Even Anthropic&#8217;s Claude, which to our knowledge was not trained directly to decompose problems, still encourages the use of <a href="https://www.anthropic.com/news/prompt-improver">CoT via prompting</a>.</p><p></p><h2><strong>CoT models for startups today</strong></h2><p>At OpenPay and in my role as a Venture Partner for Bessemer Venture Partners, I think a lot about how the latest models and frameworks affect startups. The very nature of CoT involves taking a complex query and breaking it down into step-by-step reasoning. The obvious trade-off here is the move towards inference-time compute (as opposed to training) to achieve performance. Even in the &#8220;low compute&#8221; setting, running inference on o3 allegedly costs a whopping ~$20 per query. Models will only get cheaper and faster in time, but until then, what might this mean for some AI use cases today:</p><ul><li><p>Agentic frameworks and architectures - these high-cost CoT models are clearly not usable for most sub-task/tool-calling components of agentic architectures today, especially because smaller models will do just fine for most subtasks. However, high-value tasks that are less frequently called or sit at the top of the stack, such as planning or meta prompting, are prime candidates for these high-performance, high-cost models.</p></li><li><p>Code Generation and Dev tooling - the cost of running o3 is prohibitive for quick and iterative conversation right now, but there are many use cases where escalating to what should be closer to an L7 engineer clears the threshold of value.
I think there can be good product experiences designed around this, especially human-in-the-loop experiences that know how to clarify and re-plan.</p></li><li><p>High-Value Vertical SaaS - We all know models will keep getting better at generalizing and reasoning over time, unlocking more and more use cases. But high-value use cases that are relatively asynchronous, tolerant of error, and amenable to elegant product experiences could be early beneficiaries. Fields like legal, research, and consulting all clear the bar.</p></li><li><p>Data labeling and Post-training - more fields can benefit from training vertical CoT models, which in turn will require human-annotated &#8220;chains-of-thought&#8221; labels (step-by-step reasoning labels, intermediate annotations, etc.) for post-training. Companies can probably be built around the next generation of labeling tools to properly support these complex labeling tasks. This might be extended further to provide the end-to-end post-training loop.</p></li><li><p>Synthetic data - a good friend and mentor, Alex Kvamme, thinks there is a lot to be built here, and I agree. There is opportunity both in generating synthetic data to train CoT models and in distilling CoT models into simpler, task-specific models.</p></li></ul><p>What a time to be alive! 2025 will be a big year for AI and the builders around it. Thanks for reading.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.lance.tech/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading lance.tech! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>