2025 State of AI Engineering Survey: Key Insights from the AI Engineer World Fair

The 2025 State of AI Engineering Survey by Barr Yaron of Amplify Partners offers a comprehensive snapshot of how engineering teams are building, managing, and scaling LLM-powered applications in production. Drawing on responses from 500 practitioners, the survey reveals critical insights about the rapid pace of model and prompt iteration, the persistent challenges around evaluation, and, perhaps unsurprisingly, the fact that a large majority of teams use prompt management tooling. These findings underscore the growing importance of robust observability and management platforms as LLM features become increasingly central to both internal tools and customer-facing products.
The Evolving Role of AI Engineers
The survey reveals that while few practitioners formally hold the title "AI Engineer," the majority of the 500 respondents—primarily those identifying as software or AI engineers—are actively engaged in AI engineering work under various titles.
This reflects the reality that AI engineering is becoming a core competency across engineering teams rather than a specialized role—making accessible tooling even more critical.
LLMs Serve Both Internal and External Use Cases
Half of all respondents use LLMs for both internal tooling and customer-facing features, demonstrating the dual value proposition of large language models in modern organizations.
Models Are Swapped Frequently in Production
50% of respondents regularly update models monthly or more frequently, with 17% updating weekly. The landscape is moving fast, and the latest-and-greatest models change from week to week.
Prompts Are Updated Even More Frequently Than Models
The survey shows even greater dynamism on the prompt side: 70% of teams update their prompts at least monthly, with 10% making changes daily.

These frequent updates make it essential to have a system that can track performance across different models, prompt versions, and user segments, a core value proposition of prompt management tools like PromptLayer.
The Prompt Management Gap
A large majority of teams (69%) use tooling for prompt management. It's clear how critical prompt versioning, testing, and organization have become, yet 31% of teams still rely on ad-hoc solutions or manual processes.

We find that LLM teams go through a journey. They begin with ad-hoc and manual processes. Next they might move to a spreadsheet or Notion doc. Prompt management tooling becomes most valuable after proving out the AI product or adding headcount to the team. To build reliable AI, you need to track and collaborate on prompts.
AI Agents Face Quality Challenges
While there's significant interest in AI agents, fewer than 20% of respondents say agents are "working well" in their organizations—a stark contrast to the success of simpler LLM features.
Agents require more rigor: errors compound easily across multi-step workflows, and your team needs to invest more in agentic evals.
Most Teams Haven't Deployed Agents to Production Yet
The majority of teams are still in the experimental phase with agents, though less than 10% say they'll never use them.
Monitoring Approaches Vary Widely
Teams employ diverse monitoring strategies: 60% use standard application observability tools like DataDog, while 50% rely on offline evaluations, often combining multiple approaches.

At PromptLayer, we believe you need both. Offline evals are great for development-phase iteration, but the ultimate source of truth for prompts is production data.
Human Review Remains the Gold Standard for Evaluation
Despite advances in automated evaluation, human review remains the most popular quality assessment method, supplemented by user data collection and benchmarks.
LLM-as-judge serves to augment and scale human insight, but it's hard to get right.
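To make the technique concrete, here is a minimal LLM-as-judge sketch; the `call_llm` helper and the grading rubric are hypothetical placeholders rather than any specific provider's API.

```python
import json

# Hypothetical grading rubric; adapt the criteria to your own product.
JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 to 5 for accuracy and helpfulness.
Respond only with JSON: {{"score": <int>, "reason": "<short explanation>"}}"""

def judge_answer(question: str, answer: str, call_llm) -> dict:
    """Ask a judge model to grade one answer. `call_llm` is any text-in, text-out client."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judge models sometimes return malformed JSON; flag it for human review.
        return {"score": None, "reason": "unparseable judge output", "raw": raw}
```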
Evaluation Is the Top Pain Point
When asked about their single most painful aspect of AI engineering, evaluation tops the list—a clear signal of where the industry needs better tooling and practices.

Effective evaluation requires not just running tests but managing test sets, tracking results over time, and comparing performance across different prompts and models. Read more about advanced eval strategies.
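As an illustration of that workflow, the sketch below runs a versioned prompt against a CSV test set and logs timestamped results for later comparison; the `run_prompt` callable and the file layout are assumptions, not any particular tool's API.

```python
import csv
import datetime
import json
import pathlib

def run_eval(test_set_path: str, prompt_version: str, run_prompt, results_dir: str = "eval_runs") -> float:
    """Run every case in a test set through one prompt version and log the outcomes."""
    results = []
    with open(test_set_path, newline="") as f:
        for case in csv.DictReader(f):  # expects columns: input, expected
            output = run_prompt(prompt_version, case["input"])
            results.append({
                "input": case["input"],
                "expected": case["expected"],
                "output": output,
                "exact_match": output.strip() == case["expected"].strip(),
            })
    # Timestamped result files make it easy to compare prompt versions over time.
    out_dir = pathlib.Path(results_dir)
    out_dir.mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    (out_dir / f"{prompt_version}-{stamp}.json").write_text(json.dumps(results, indent=2))
    return sum(r["exact_match"] for r in results) / len(results)  # simple accuracy score
```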
OpenAI Dominates Production Deployments
Three of the top five—and five of the top ten—models in production come from OpenAI, showing their continued market leadership in customer-facing applications.
This concentration makes it crucial for teams to avoid vendor lock-in by using abstraction layers that make it easy to switch between providers.
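A minimal sketch of such an abstraction layer is shown below; the provider classes are stubs and the `complete` signature is an assumption for illustration.

```python
from typing import Protocol

class LLMProvider(Protocol):
    def complete(self, prompt: str, **kwargs) -> str: ...

class OpenAIProvider:
    def complete(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError  # wire up the OpenAI client here

class AnthropicProvider:
    def complete(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError  # wire up the Anthropic client here

PROVIDERS: dict[str, LLMProvider] = {
    "openai": OpenAIProvider(),
    "anthropic": AnthropicProvider(),
}

def generate(prompt: str, provider: str = "openai") -> str:
    """Route the prompt to whichever provider is configured, so switching vendors is a config change."""
    return PROVIDERS[provider].complete(prompt)
```

Many teams reach for an existing routing library instead, but the idea is the same: application code calls one entry point, and the provider choice lives in configuration.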

Code and Content Generation Lead Use Cases
The top LLM applications are code generation/intelligence and writing assistant features, reflecting the technology's strengths in these domains.
Teams Deploy Multiple LLM Use Cases
An overwhelming 94% of teams use LLMs for at least two distinct use cases, with 82% supporting three or more applications.
RAG Is the Dominant Customization Technique
While few-shot prompting serves as the baseline, 70% of teams have adopted Retrieval-Augmented Generation (RAG), with fine-tuning also showing surprising popularity.
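For readers newer to the pattern, a stripped-down RAG flow looks roughly like the following; the `embed`, `search`, and `call_llm` callables are placeholders for whatever embedding model, vector store, and LLM client a team actually uses.

```python
def answer_with_rag(question: str, embed, search, call_llm, k: int = 3) -> str:
    """Retrieve the k most relevant chunks, then ground the model's answer in them."""
    query_vector = embed(question)            # embed the user's question
    chunks = search(query_vector, top_k=k)    # nearest-neighbor lookup in the vector store
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)
```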
Fine-Tuning Methods Show Diversity
Among teams using fine-tuning, 40% leverage parameter-efficient methods like LoRA/QLoRA, while others employ DPO, RLHF, and supervised approaches.
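As a rough sketch of what parameter-efficient fine-tuning looks like in practice, assuming the Hugging Face transformers and peft libraries, a LoRA setup can be as small as this; the base model and hyperparameters are illustrative, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM works here
config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling applied to the LoRA update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights will be trained
```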
Text Dominates, But Multimodal Interest Grows
While text remains the primary modality in production, 37% of teams not currently using audio plan to adopt it, indicating a "multimodal production gap."
Vector Databases Are Standard Infrastructure
65% of teams store embeddings in dedicated vector databases, split roughly equally between self-hosted and managed solutions.
Open and Closed Models Expected to Converge
The majority of practitioners expect open-source and closed-source model capabilities to eventually converge, suggesting a more competitive future landscape.
The AI Companion Future
On a lighter note, respondents predict that an average of 26% of Gen Z will have AI girlfriends/boyfriends, a glimpse into potential social applications of the technology... Scary!
The 2025 State of AI Engineering Survey (watch the full talk here) paints a picture of an industry in rapid evolution, where teams are iterating quickly, deploying diverse use cases, and grappling with fundamental challenges around evaluation and management.
The fact that nearly a third of teams lack prompt management tooling, combined with the frequency of prompt and model updates, suggests significant opportunity for platforms that can bring structure and observability to LLM operations.
PromptLayer is an end-to-end prompt engineering workbench for versioning, logging, and evals. Engineers and subject-matter experts team up on the platform to build and scale production-ready AI agents.
Made in NYC 🗽
Sign up for free at www.promptlayer.com 🍰