The Wrong Question
The AI requirements debate is stuck on the wrong problem.
A perspective on AI, evaluation, and engineering intelligence in regulated systems.

Jordan Kyriakidis, Ph.D.
CEO & Co-Founder, QRA Corp
Opinion Piece
~15 min read
The requirements engineering community has spent the last two years arguing about whether AI can write good requirements. It’s a reasonable question. It’s also the wrong one.
The right question is harder, and less flattering: can your organization even see what’s in its own engineering artifacts — and what’s missing from them?
I lead QRA, a company that builds tooling for requirements engineering in regulated industries — aerospace, defense, medical devices, nuclear. Our customers operate under standards like DO-178C, ISO 26262, and IEC 61508, where a single ambiguous requirement can cascade into certification delays, costly rework, or genuine safety risk.
What follows is what I’ve come to believe. Some of it is contrarian. I think that’s appropriate, because the conventional wisdom on AI in requirements engineering is converging on an answer that is comfortable, marketable, and wrong.
AI Needs Rails. The Question Is Where They Belong.
On this much, everyone agrees: unconstrained AI in regulated engineering would be disastrous. You cannot hand generation to a language model and walk away. AI in this domain absolutely needs rails.
The emerging consensus on how to build those rails goes something like this: constrain the generation. Force outputs through narrow templates. Enforce controlled natural language. Restrict vocabulary to pre-approved terms. Apply rigid syntactic rules. Score against fixed rubrics. Keep the human in the loop. The rails go around the Scribe’s hands.
The instinct is understandable. But it gets the architecture exactly backwards — and in doing so, it neutralizes the very capability it’s trying to harness.
Large language models are fundamentally probabilistic. They work by exploring a vast space of possible completions and selecting from among them. This is not a defect to be engineered around. It is precisely what makes them remarkable. Generation is inherently a creative act, and the statistical nature of LLMs is their superpower — it’s what allows them to synthesize across disparate inputs, surface non-obvious connections, and produce candidate outputs that a deterministic system never would.
When you constrain that process at the point of generation — when you force the model through rigid templates and pre-approved vocabularies — you don’t get safe AI. You get diminished AI. You take away the very capability that justified using a language model in the first place, and what remains is an expensive autocomplete engine that could have been a lookup table. You’ve built the rails, but you’ve also removed the engine.

The question isn’t whether AI needs rails. It’s where the rails belong. And I believe the answer is not at the point of generation, but at the point of judgment.
The Gavel and the Scribe
I think of this as the Gavel and the Scribe.
Let the Scribe be the Scribe. Let it draft, expand, explore, and generate with the full power of what probabilistic inference makes possible. Don’t hobble the creative engine. Instead, pair it with a strong, independent Gavel — an adjudicator that evaluates each output against explicit criteria: completeness, testability, consistency with existing artifacts, traceability, and compliance with applicable standards. When the output doesn’t meet the bar, the Gavel sends it back. The Scribe tries again. And again, if necessary, drawing on its generative power to find a different path to the same destination.
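As a minimal sketch of this loop — every function name and acceptance criterion here is hypothetical, illustrative only, and not QRA's actual check set — the separation looks like this:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    accepted: bool
    reasons: list  # explicit, inspectable criteria that failed

def gavel(candidate: str) -> Verdict:
    """Deterministic adjudicator: judges against explicit criteria, never guesses."""
    reasons = []
    if "shall" not in candidate:
        reasons.append("missing imperative 'shall' (testability)")
    if "TBD" in candidate:
        reasons.append("unresolved placeholder (completeness)")
    return Verdict(accepted=not reasons, reasons=reasons)

def draft_requirement(scribe, prompt: str, max_rounds: int = 3):
    """Unconstrained generation, adjudicated after the fact. Rejected drafts
    go back to the Scribe with the Gavel's reasons attached."""
    feedback = []
    for _ in range(max_rounds):
        candidate = scribe(prompt, feedback)
        verdict = gavel(candidate)
        if verdict.accepted:
            return candidate
        feedback = verdict.reasons
    return None  # escalate to a human when the Scribe can't clear the bar
```

Here `scribe` is any callable wrapping an LLM. Nothing constrains what it produces; acceptance is decided entirely by `gavel`, and the rejection reasons double as the evidence trail.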
The rails are real. They’re just maintained by the Gavel, not welded onto the Scribe.
This is not an absolute. In practice, you will still guide the Scribe — scope its task, provide domain context, shape the output format. That’s good practice, not constraint. The distinction is between guidance and governance. You can brief the Scribe without shackling it. What you should not do is force determinism and rigid reproducibility onto the generative process itself, because that robs the LLM of its power. The center of gravity for control belongs at the point of judgment, not at the point of creation.
In regulated industries, the Gavel must be more than a set of heuristic checks. It needs to be deterministic where determinism is possible. It needs to be evidentiary — able to point to the specific standards, rules, and source artifacts that informed its judgment. It needs to be explanatory — able to articulate why a candidate requirement was accepted or rejected in terms that a certification authority can inspect. The Gavel doesn’t guess. It has a model of what good looks like, and it applies that model consistently.
This separation is also, not coincidentally, how the best human engineering teams already work. The person drafting the requirement and the person reviewing it are rarely the same person, and for good reason. The drafter needs creative latitude. The reviewer needs explicit criteria. Combining those roles compromises both.
What Does the Gavel Judge Against?
The Gavel-and-Scribe model solves the architectural problem. But it exposes a deeper one — because both the Scribe and the Gavel are only as good as the knowledge available to them. The Scribe needs rich, structured context to generate meaningful candidates. The Gavel needs specific, grounded criteria to evaluate them. Where does that knowledge come from?
This is where most AI-for-requirements thinking stops one layer too early. The hardest challenge in AI-assisted requirements engineering is not generation, and it’s not evaluation. It’s comprehension.
Ethan Mollick, who studies AI adoption at Wharton, has been running an extended research program on how organizations actually use AI. His most useful finding, for our purposes, isn’t about model capability — the models are clearly getting better. It’s about what he calls “purpose loss”: organizations using AI to do more of the same, faster, rather than to see things they couldn’t see before.
This maps exactly onto the mistake I see in requirements engineering. Most AI-for-requirements tools promise to generate requirements faster. Faster is fine. But the problem at most of the regulated enterprises I work with isn’t that requirements are slow to write. It’s that the engineering artifacts those requirements should be derived from are incomplete, inconsistent, and poorly understood.
The high-level requirements are spotty. The data dictionaries are incomplete. The glossaries are maintained in someone’s head. The same concept is called one thing in the code, another thing in the requirements document, a third thing in the test procedures. Critical engineering knowledge — the kind that determines whether a candidate requirement is meaningful or nonsensical — lives in the experience of senior engineers who never wrote it down.
Natalia Quintero, who consulted with over a hundred companies on AI adoption, arrived at a finding that I think is the single most important sentence in the current AI discourse: “You can only automate what you can clearly define.”
She meant it as advice for enterprises adopting AI internally. But for requirements engineering in regulated industries, it lands as a diagnosis. The reason AI-generated requirements so often disappoint isn’t that the models are bad. It’s that the inputs — the engineering artifacts, the domain knowledge, the semantic context — aren’t in a state where any system, human or artificial, can reason over them reliably.
This raises a question that I think gets to the heart of the matter: what would it mean to make those artifacts reasonably complete? What kind of representation would you need to build?
Correlations Are Not Models
Before I founded QRA, I spent fifteen years as a theoretical physicist. I mention this not as a credential but because it shapes how I see this problem — specifically, how I think about what “comprehension” requires.
Physicists build models for a living. A model, in the way a physicist uses the word, is a structured, interrogable representation of a system. It makes predictions. It can be tested. It has internal consistency — you can query it and get answers that cohere with each other. And critically, it fails in specific, diagnosable ways. When a model breaks, you learn something. That’s the whole practice: build, test, refine, and know precisely where the model’s boundaries are.
Large language models, despite the name, do not build models in this sense. They find statistical correlations in training data and produce outputs that look like understanding. Gary Marcus has made this point persistently and, I believe, correctly: an LLM trained on millions of chess games will still make illegal moves, because it has correlated game transcripts rather than modeled the rules of chess. An LLM trained on geographic data will place cities in the middle of the Atlantic Ocean, because it has captured word associations rather than built a representation of physical space. As Marcus’s collaborator Ernie Davis puts it: “Models that can’t be used downstream are scarcely worthy of the name.”
This distinction — between correlation and model — is the crux of the problem in AI-assisted requirements engineering.
An LLM trained on aerospace documentation will produce text that sounds like it understands the relationship between a high-level requirement and a code module. But it doesn’t have a model of that relationship. It cannot tell you that airspeed_kts in the code and V_air in the requirements refer to the same quantity unless someone has built the explicit structure that captures that mapping. It cannot identify that a particular code behavior has no corresponding requirement, because it has no representation of the trace graph. It generates plausible language about engineering artifacts without modeling the artifacts themselves.
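To make “explicit structure” concrete, here is a toy sketch of such a mapping — the registry contents, names, and function are hypothetical, an illustration of the idea rather than any real tool’s schema:

```python
# Hypothetical term registry: one canonical concept, its alias in each artifact type.
TERM_REGISTRY = {
    "indicated_airspeed": {
        "code": "airspeed_kts",
        "requirements": "V_air",
        "tests": "IAS",
    },
}

def same_quantity(name_a: str, name_b: str) -> bool:
    """True when two artifact-level names resolve to one canonical concept."""
    for aliases in TERM_REGISTRY.values():
        names = set(aliases.values())
        if name_a in names and name_b in names:
            return True
    return False
```

An LLM’s correlations might guess at this mapping; the registry makes it a queryable fact with a known boundary. A name absent from the registry is a visible gap, not a silent guess.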
Broader enterprise surveys — not aerospace-specific, but directionally consistent with what I see firsthand — suggest this isn’t a niche problem. By some measures, fewer than one in ten enterprises consider their data fully AI-ready.
In regulated engineering, the data problem has a specific shape. The knowledge isn’t just messy — it’s distributed across artifacts that were never designed to talk to each other, described in vocabularies that have drifted over decades of institutional use, and supplemented by tacit knowledge that exists only in the minds of experienced engineers. When a senior avionics engineer reads a high-level requirement and immediately knows which code modules it relates to, which interface definitions are relevant, and which edge cases the requirement doesn’t cover, she is performing a feat of semantic integration that no tool currently replicates — because no tool has access to the knowledge she’s drawing on. She has a model. The LLM has correlations.
This is why I believe the most important layer in AI-assisted requirements engineering is not the generation layer and not even the evaluation layer. It’s the intelligence layer — the explicit, structured model of engineering knowledge that sits beneath both.
I mean something specific by this. An engineering intelligence layer captures the vocabulary of a program — every term, alias, and implicit synonym, reconciled across code, requirements, design documents, and test procedures. It maps the relationships between artifacts: which high-level requirements trace to which code behaviors, where trace gaps exist, and where terminology conflicts create ambiguity. It accumulates the judgments that engineers make during review — which term mappings are correct, which candidate requirements are accepted or rejected, and why — and it compounds that knowledge over time. It is, in the physicist’s sense, a model: queryable, testable, and transparent about where its boundaries are.
To make this concrete: imagine a code module that implements a rate-limiting behavior — capping a control surface deflection rate under certain flight conditions. The code is functional. It does something real. But there is no corresponding requirement anywhere in the requirements document. No high-level requirement traces to it; no low-level requirement describes it. The behavior exists because an engineer added it during development, for good reason, but it was never captured in the formal requirements baseline. Without an intelligence layer that maps code behaviors to the requirements graph, this orphan behavior is invisible. It will not appear in any trace analysis. It will not be covered by verification. It is a gap that no amount of faster requirement generation will close, because the problem isn’t missing text — it’s missing awareness. The intelligence layer is what makes that gap visible.
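Under the same caveat — hypothetical names, a toy trace graph — the orphan-behavior check itself is a simple set difference once the trace graph exists. That is the point: the hard work is building the graph, not querying it.

```python
# Hypothetical trace graph: requirement ID -> code behaviors it covers.
TRACE = {
    "HLR-017": {"read_airspeed"},
    "HLR-023": {"apply_pitch_gain"},
    # Nothing traces to the rate limiter.
}

CODE_BEHAVIORS = {"read_airspeed", "apply_pitch_gain", "limit_deflection_rate"}

def orphan_behaviors(behaviors, trace):
    """Code behaviors no requirement traces to: invisible to trace analysis
    and uncovered by verification until the graph makes them explicit."""
    traced = set().union(*trace.values()) if trace else set()
    return behaviors - traced
```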
When this model exists, the Scribe has something to work with. Generation is no longer a matter of producing plausible-sounding text from a general-purpose model’s correlations. It becomes generation grounded in a specific, structured, evolving representation of what this particular engineering program actually contains. And the Gavel has something to evaluate against — not just generic rules of good requirements writing, but the specific standards, terminology, and relationships that define quality in this context.

When this model doesn’t exist — when the Scribe is generating from spotty inputs and the Gavel is checking against generic rules — you get requirements that are syntactically correct and semantically hollow. They pass template checks. They fail the engineering review. They are, in Davis’s terms, not worthy of the name.
Generation Is the Wrong Pitch
I should be honest about the environment this work is happening in. It is not friendly to AI-powered tooling.
Broad industry surveys paint a sobering picture. Fifty-six percent of CEOs surveyed by PwC report getting “nothing” from their AI adoption efforts. Gartner places generative AI in the “trough of disillusionment.” According to some analyses, 90 to 95% of AI pilots fail to reach production. These numbers come from general enterprise contexts, not regulated engineering specifically — but the RAND Corporation’s analysis of AI project failure cuts closer. Its first identified root cause is misunderstood problem definition, and that resonates directly with what I see in our market.
The decision-makers at the aerospace, defense, and medical device companies I sell into are not early adopters brimming with enthusiasm. They are experienced engineering leaders who have watched two years of AI promises fail to materialize into measurable outcomes. When they hear “AI will generate your requirements,” they hear another vendor pitch in a market already saturated with AI promises that haven’t delivered.
This is why leading with generation is a strategic mistake — even if your generation is genuinely good. The pitch that works is not “AI will write your requirements.” It’s: “Your engineering artifacts are inconsistent and incomplete in ways you can’t currently see. We make those gaps visible and help you close them.” Lead with the diagnosis, not the prescription. The generation follows from the insight and carries evidence.
Tina He, writing about which businesses will survive model commoditization, argues that the defensible layer in the AI era is not the model or the interface but the infrastructure that AI must flow through. She identifies domain-specific infrastructure with defensible data as the archetype for industries where the cost of errors is high enough that “good enough” AI isn’t acceptable. In those domains, the layer that holds domain knowledge, validation logic, and ongoing human judgment becomes essential — and becomes the moat.
That’s the business I’m trying to build. Not a smarter way to write requirements. An intelligence layer for regulated engineering — one that makes AI-assisted generation possible by first making the engineering artifacts legible, structured, and complete enough to reason over.
The Audience With Veto Power
There’s one more dimension that the general AI adoption literature almost entirely ignores, and it’s the one that matters most in my world.
In regulated industries, the final audience for AI-assisted engineering artifacts is not the engineer. It’s the certification authority. The FAA’s Designated Engineering Representative reviewing the output of an AI-assisted requirements tool needs to be satisfied that the artifacts meet the intent of DO-178C — not just that they look correct, but that their provenance is traceable, their derivation is explainable, and the human review process is visible and auditable.
Current aerospace certification standards were not designed for AI-generated outputs. The FAA published its Roadmap for AI Safety Assurance, and EASA is developing guidance, but the regulatory posture is unsettled. Open questions remain around validation of intent, sufficiency of verification, and uncertainty quantification. A recent paper in Frontiers in Aerospace Engineering noted “insufficient practical experience to establish best practices” for certifying AI-enabled systems.
This means that even perfectly generated candidate requirements face a trust and acceptance problem that sits entirely outside the product architecture. The Gavel-and-Scribe model isn’t just good engineering practice — it’s a certification strategy. The separation of generation from evaluation, the evidence trails, the inspectable reasoning at each stage — these exist not just to help the engineer but to satisfy the authority who must ultimately sign off.
A system that produces plausible requirements from a black box will be rejected, regardless of accuracy, because the certification process requires demonstrated understanding, not just correct outputs.
What Comes Next
I don’t think the requirements engineering community has fully reckoned with what generative AI means for the discipline. The conversation is still largely organized around whether AI can write a good requirement — as if the challenge were a more capable typewriter.
The real transformation is less dramatic and more consequential. It’s the slow, difficult work of building structured representations of engineering knowledge that currently lives in documents no one reads, spreadsheets no one maintains, and the heads of engineers who are five years from retirement. It’s the unsexy infrastructure work of reconciling vocabularies, mapping trace relationships, and capturing the judgments that turn raw engineering artifacts into something a machine — or a new team member — can actually reason over.
For decades, the category was tools for writing artifacts — editors, templates, syntax checkers, and now AI copilots that draft faster. That category is maturing. The next one is different. It is systems for making engineering knowledge computable: structured, queryable, traceable representations that sit beneath generation, beneath evaluation, beneath the tools themselves. Whoever builds this layer for regulated engineering — and earns the trust of the organizations and certification authorities that depend on it — will define how AI-assisted engineering actually works. Not the model providers. Not the copilot vendors. The layer that holds the knowledge.
Speed changes how requirements are written. Comprehension determines whether they work.

Related Discussion
From Rules to Intelligence: Rethinking How We Build Complex Systems