
Black Box Prompt Engineering: Why Not Knowing How It Works Is Actually the Point

Oct 30, 2025

I recently sat down with Stewart Alsop III on the Crazy Wisdom Podcast to talk about PromptLayer, AI engineering, and why the shift from deterministic to probabilistic systems is fundamentally changing how we build software. Most developers are still struggling to adapt... and I think it's because they're asking the wrong questions.

The Mirage of Reasoning

Stewart opened with a provocative question: "Do you think there's really reasoning going on or is it a mirage of reasoning?"

It's reasoning in the sense that it can trick humans into thinking it's reasoning. I don't think it's reasoning in the biblical sense of what intelligence is, whatever that means for what makes us human. But I think it may be something indistinguishable from the real thing.

And honestly? For building products, that distinction doesn't really matter.

PromptLayer: Infrastructure for the Black Box Era

We describe PromptLayer as an AI engineering workbench. It's the platform people use to build and test their LLM products—basically AI development infrastructure. Whether you're building an agent, a workflow, or just doing something with AI, PromptLayer is the tool to help you do it in a rigorous way.

Our core thesis is simple: the best prompt engineers aren't machine learning engineers. They're the people closest to the domain—the subject matter experts. We're trying to scale their work and distribute it.

This makes sense when you think about it. Even if we had the best AI ever—AGI, whatever you want to call it—there's still taste involved. If you're building an AI therapy app, there's no global solution to the problem of therapy. I live in New York, and there are five therapists on every block, all with different styles. You've got CBT therapists, meditation therapists, acupuncture... The subject matter expert brings the implicit knowledge that can't be captured in training data. That's what the prompter does based on their expertise.

Embracing the Black Box

My background? I'm a hacker, a tinkerer. I never got myself to use a debugger when programming. I'm a print statement debugger. I'm always just trying something, checking something.

And that's the way of thinking that makes me a good prompt engineer.

The people who struggle—and it's harder for them to pick up this new way of building software—are the ones who want to know how it works, what's going on inside. What we tell people is: you don't need to know how the model works. That's why subject matter experts are often better, because they know they're not going to figure out how a neural net works and they don't need to.

It's a black box. The really cool thing is that it's just an input to an output. That's all it is. One of the inputs is which model you choose, and even the model's reasoning is just another dial you set to high, medium, or low.

When you're building things with LLMs, you just have to say there's an input and there's an output. Maybe I'm not always going to get the output I want, but all I control is the inputs, so we're going to test it a lot of times.

A customer once put it to us this way: if you're not getting the right output all of the time, it's kind of a skill issue on the prompt engineer's side. You just have to do better. Maybe you run the prompt twice. There are ways to solve that.

The Eval Problem (And Two Solutions)

Evals are a big part of rigorous AI development, and there's a lot of discourse around whether they actually matter. I think they do, but two distinct modalities are emerging:

1. Rigorous Testing

This is the traditional approach: sanity checks and backtests. You run your prompt over your last thousand production requests and see how much the outputs change. You need this if you're deploying to production. You need these engineering principles to not fly blind.

Rubric-based evals are incredibly popular.

Backtesting is an old-world phrase borrowed from the deterministic era. Your AI application has been live for two weeks and you want to ship an update, so you run the update over the last two weeks' worth of data and see if the outputs change. It's really hard to build sample data by hand, especially for a lot of these use cases. Just use production data.
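To make that concrete, here's a minimal sketch of what a backtest loop can look like, assuming you've exported recent production requests as a JSON file of input/output records. The file name, field names, candidate prompt, and model choice are placeholders, not a prescribed PromptLayer workflow.

```python
# Backtest sketch: re-run a candidate prompt over recent production inputs and
# diff the results against what the live prompt actually produced.
# Assumes a production_requests.json export of {"input": ..., "output": ...}
# records; every name here is a placeholder.
import json
from openai import OpenAI

client = OpenAI()

CANDIDATE_PROMPT = "You are a support agent. Reply in the language of the customer's message."

with open("production_requests.json") as f:
    requests = json.load(f)  # last two weeks of live traffic

changed = 0
for record in requests:
    new_output = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": CANDIDATE_PROMPT},
            {"role": "user", "content": record["input"]},
        ],
    ).choices[0].message.content

    if new_output.strip() != record["output"].strip():
        changed += 1  # flag this request for human (or LLM-judge) review

print(f"{changed}/{len(requests)} outputs changed under the new prompt")
```

The score itself isn't the point; the point is that you're diffing against real traffic instead of hand-built samples.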

2. Sprint-Based Iteration

This is what I've been doing more and more, and I've noticed our customers doing it too. Instead of building the golden eval that gives you a score back, a lot of people just work in sprints.

Here's a real example: a customer support agent that responds in the wrong language because it keys off the user's name instead of the message itself (case study). Okay, we have this problem. Now I'm going to run a sprint to fix this problem.

They bootstrap a dataset: grab the last 10,000 production requests, maybe run an LLM-as-judge over them to flag the failures, and work in one-day or two-week sprints. The eval isn't necessarily a final score; it's an exploration and a batch run of prompts to solve a specific problem.
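As a rough sketch of what that bootstrapping step can look like, here's an LLM-as-judge pass that flags wrong-language replies in a production export and saves them as the sprint's working dataset. The judge prompt, model, field names, and file names are illustrative assumptions, not the customer's actual setup.

```python
# LLM-as-judge sketch: scan recent production requests and keep the ones where
# the reply language doesn't match the customer's message.
# Field names, file names, and the judge prompt are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a support bot. Given the customer's message and the bot's reply, "
    "answer YES if the reply is in the same language as the message, otherwise answer NO."
)

with open("production_requests.json") as f:
    requests = json.load(f)

sprint_dataset = []
for record in requests:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Message: {record['input']}\nReply: {record['output']}"},
        ],
    ).choices[0].message.content

    if verdict.strip().upper().startswith("NO"):
        sprint_dataset.append(record)  # a concrete failure to iterate on this sprint

with open("wrong_language_cases.json", "w") as f:
    json.dump(sprint_dataset, f, ensure_ascii=False, indent=2)

print(f"Collected {len(sprint_dataset)} failing cases for this sprint")
```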

The Vibe Coding Revolution

Vibe coding has changed how we do engineering at PromptLayer, especially with Claude Code and OpenAI's Codex. They're the first coding agents you can really use hands-off.

We have a rule now: if it'll take less than an hour with Claude Code, just do it. It's changed the way we do engineering.

There was an example recently where someone using our agent builder didn't realize you could add conditional edges. I wanted to make that more obvious, so I used Claude Code to make the tooltip bigger. I know how to do that, but I didn't know where in the code it was. These tools are really good at reading the context of the code base first; they're not going in stale like I would.

A front-end engineer who needs to do back-end work? Same thing. These tools aren't going to make you write bad code. But if you write normal code yourself at 3 AM after three beers, you're going to put in bugs too. AI-generated code might be a little harder to steer, but we fully embrace it here. We made everyone use it.

Stewart had a great analogy for working with these tools: treat them like an intern. You hire an intern with this grand vision for them, but they don't know what to do. You give them the first step, say "do this," and when they're done, you talk about the next step.

Context and LLM Idioms

So what separates good prompting from bad? If I look at someone's prompt as a human and have no idea what's going on, maybe the LLM will still understand it, but it's probably not a good way to build. Just like you can write software code that's hard to understand, when I'm writing a prompt I'm trying to make it understandable. I use section headers, and I think that helps the LLM too.
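For what it's worth, here's the kind of structure I mean, sketched as a Python prompt template. The headers, wording, and placeholder fields are just an example, not a canonical format.

```python
# A sketch of "section headers in a prompt." The headers, instructions, and
# placeholder fields are illustrative, not a recommended template.
SUPPORT_PROMPT = """\
# Role
You are a customer support agent for an online store.

# Instructions
- Reply in the language of the customer's message, not the customer's name.
- Keep replies under 150 words.

# Order context
{order_details}

# Customer message
{customer_message}
"""

prompt = SUPPORT_PROMPT.format(
    order_details="Order #1234, shipped yesterday",
    customer_message="¿Dónde está mi paquete?",
)
```

A human can skim that and see what the model is supposed to do, which is exactly the readability test described above.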

There's also this LLM idiom concept I like to think about. If I mention a musician to you, the name gives you more than just a name; you start thinking about their music and the vibe. In the same way, when I talk to the LLM in JSON, or XML, or some other syntax, I'm putting it in a context.

If I want a love poem, I don't want it output inside a code block, because the code block puts the model in code thinking. That's the super non-technical way to explain it, and I don't know if I can explain it in a technical way, but that's the intuition.

Conclusion

We're not replacing human expertise with AI. We're distributing and scaling it. The winners will be the people closest to the problem, armed with the right tools to treat AI as what it is: a powerful black box that turns inputs into outputs.

Stop trying to understand the neural network. Start understanding your domain. The rest is just prompt engineering.


Want to try rigorous AI development for yourself? Sign up for PromptLayer for free—we have hackers, hobbyists, and enterprise teams all using it to build better LLM products.

You can find me on X at @imjaredz.
