
    Neel Somani on The Risks of Code Generation

    By Lakisha Davis, February 7, 2026

    The rise of AI-powered code generation has transformed software development, but the systems producing this code are introducing new risks that traditional quality assurance methods were never designed to handle. As organizations deploy LLM-generated code at scale, they face an important question: How do you verify systems that write themselves?

    Neel Somani argues that the explosion of machine-generated code has created a silent crisis in software reliability. While AI tools promise to accelerate development, they simultaneously undermine the assumptions that have historically made code auditable, debuggable, and trustworthy.

    The problem extends beyond bugs. When code is written by humans, failures can be traced to specific decisions, logic errors, or misunderstood requirements. When code is generated by a language model, the origin becomes unclear. The system that produced it cannot explain why it made particular choices, and developers inheriting that code often lack the context to evaluate whether it handles edge cases correctly.

    For industries in which software failures carry real consequences, from finance to healthcare to infrastructure, this opacity constitutes an unacceptable liability. The question is not whether AI will continue generating code, but whether organizations can develop frameworks to verify it before it becomes load-bearing.

    The Problem with Generated Code

    Code generation tools operate by predicting plausible code from patterns in vast training corpora. They excel at producing code that looks correct, compiles without errors, and passes basic functionality tests. What they cannot reliably do is reason about correctness under all possible inputs, handle rare edge cases, or maintain invariants across complex systems.

    This creates a subtle but dangerous dynamic. The generated code appears to work in common cases, thereby encouraging developers to trust it. But correctness in the average case is not the same as correctness in general. When that code is integrated into production systems, unexpected inputs or unusual state transitions can expose failures that were never caught during development.

    Traditional testing helps, but it is probabilistic. You can test a thousand cases and still miss the one scenario that causes failure. For safety-critical systems, this is not sufficient. What’s needed is a way to make exhaustive claims about behavior, not just distributional ones.
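    As a deliberately simplified illustration (hypothetical code, not drawn from the article), the Python snippet below shows a generated helper that passes a thousand randomly generated tests and still fails on the one input the test generator never produces:

```python
import random

def mean(xs):
    """Hypothetical LLM-generated helper: the average of a list of numbers."""
    return sum(xs) / len(xs)   # silently assumes xs is non-empty

# A thousand random test cases, none of which happens to be the empty list.
random.seed(0)
for _ in range(1000):
    xs = [random.randint(-100, 100) for _ in range(random.randint(1, 50))]
    assert min(xs) <= mean(xs) <= max(xs)

print("1000 random tests passed")
# ...yet mean([]) raises ZeroDivisionError: the one scenario the tests missed.
```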

    Neel Somani frames this as a fundamental tension between productivity and verifiability. “We have so much slop code being produced by LLMs that no one’s checking that carefully,” he notes. “Provable guarantees become a lot more valuable when there’s no close supervision.”

    The concern is not hypothetical. As AI-generated code becomes ubiquitous, organizations are accepting systems they cannot fully audit. Dependencies proliferate, technical debt accumulates, and the capacity to reason about program behavior degrades. The result is software that functions until it doesn’t, with failures that are difficult to predict or prevent.

    Why Traditional Verification Falls Short

    Existing approaches to code quality rely heavily on testing, static analysis, and code review. These methods catch many problems, but they were designed for human-written code with traceable logic and clear intent.

    AI-generated code undermines these assumptions in several ways. First, the volume of generated code exceeds the capacity for human review. Second, the logic may be structurally sound but semantically fragile, handling common cases while failing on rare inputs. Third, the generating model provides no formal guarantees about what the code does or does not do.

    Static analysis tools can identify certain classes of errors, such as type mismatches or memory safety violations, but they cannot prove that a program satisfies high-level specifications. Testing can validate behavior on known inputs, but it cannot certify the absence of bugs across all possible inputs.
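    A short hypothetical Python sketch of that gap: the function below is fully type-annotated and passes a static type checker, yet it quietly violates the specification stated in its docstring.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Intended spec: out[i] is the mean of the last `window` values up to i.
    The types are fine and a static checker is satisfied, but nothing here
    verifies the arithmetic against that specification."""
    out: list[float] = []
    for i in range(len(values)):
        lo = max(0, i - window)   # off by one: keeps window + 1 values once i >= window
        out.append(sum(values[lo:i + 1]) / (i + 1 - lo))
    return out
```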

    The gap between “works most of the time” and “provably correct” widens as systems grow more complex. When code interacts with external services, handles asynchronous state, or processes untrusted input, the number of possible execution paths explodes. Without formal reasoning, verification becomes infeasible.

    This is where traditional software engineering hits its limit. The tools that worked when humans wrote and reviewed every line do not scale to systems where significant portions of the codebase are machine-generated, frequently updated, and deployed without human oversight.

    Autoformalization as a Path Forward

    Autoformalization offers a different approach. Rather than relying on testing or informal reasoning, it translates code into formal mathematical representations that can be verified against explicit specifications. This process makes it possible to prove properties about program behavior, not just test for them.

    The core idea is to express what the code should do in a formal language, and then to mechanically verify that the implementation satisfies those properties. If the proof succeeds, you have a guarantee. If it fails, the verification system produces a counterexample showing exactly where the code violates the specification.
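    A minimal sketch of that loop using the Z3 SMT solver from Python (the clamp function, its bug, and the specification are invented for illustration; this is one way to mechanize the check, not the specific tooling discussed in the article):

```python
from z3 import And, If, Implies, Ints, Not, Solver, sat

x, lo, hi = Ints("x lo hi")

# Hypothetical generated implementation of clamp, written symbolically:
# it handles the lower bound but forgets the upper one.
generated = If(x < lo, lo, x)

# Formal specification: whenever lo <= hi, the result must lie in [lo, hi].
spec = Implies(lo <= hi, And(lo <= generated, generated <= hi))

s = Solver()
s.add(Not(spec))                           # ask the solver for a violating input
if s.check() == sat:
    print("counterexample:", s.model())    # e.g. lo = 0, hi = 0, x = 1
else:
    print("verified: the implementation satisfies the specification")
```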

    This approach is not new. Formal verification has been used in safety-critical domains such as aerospace and cryptography for decades. What has changed is the feasibility of applying it more broadly. Advances in proof assistants, automated theorem proving, and type systems have made formal methods more accessible. Autoformalization tools can now translate informal specifications into verifiable statements with less manual effort.

    Somani’s work emphasizes the practical application of formal methods to AI systems. His project, Symbolic Circuit Distillation, demonstrates how neural mechanisms can be proven formally equivalent to symbolic programs on bounded domains. The same principles apply to generated code.

    “Autoformalization is a way of applying formal methods at scale,” Somani explains. By automating the translation from informal reasoning to formal proof, verification becomes tractable even for large, rapidly evolving codebases. This is especially valuable when the code itself is generated by systems that cannot provide guarantees.

    The process typically involves several steps. First, the desired behavior is specified formally, often using domain-specific constraints or properties that the code must satisfy. The generated code is then translated into a representation amenable to formal analysis, such as a typed functional language or a logical assertion system. Lastly, automated tools verify that the code satisfies the specification.

    If verification succeeds, the code is certified correct within the bounds of the specification. If it fails, the system generates a counterexample that exposes the precise conditions under which the code violates the requirement.
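    The "typed functional language" route can be sketched in Lean 4 on a recent toolchain (the function and the properties below are illustrative assumptions, not taken from Somani's projects): the implementation and its specification sit side by side, and the proofs are checked mechanically.

```lean
-- Implementation (imagine it was emitted by a code generator).
def clampNonneg (x : Int) : Int :=
  if x < 0 then 0 else x

-- Specification, part 1: the result is never negative.
theorem clampNonneg_nonneg (x : Int) : 0 ≤ clampNonneg x := by
  unfold clampNonneg
  split <;> omega

-- Specification, part 2: non-negative inputs pass through unchanged.
theorem clampNonneg_id (x : Int) (h : 0 ≤ x) : clampNonneg x = x := by
  unfold clampNonneg
  split <;> omega
```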

    The Role of Proof Systems

    Proof systems such as Lean, Coq, and Isabelle play a central role in autoformalization. These tools allow developers to express mathematical statements about programs and construct machine-checked proofs that those statements hold. While historically used by specialists, recent efforts have made them more accessible to mainstream developers.

    The advantage of these systems lies in their rigor. Once a proof is accepted, it cannot be wrong unless the system’s foundational axioms are flawed. This eliminates the uncertainty inherent in testing and manual review. For code that handles sensitive data, controls physical systems, or mediates financial transactions, this level of assurance is invaluable.

    Integrating proof systems with code generation workflows remains an open challenge. The process is labor-intensive, requiring careful specification and often manual intervention when automated tools fail. However, progress is accelerating. Tools like Harmonic’s Aristotle automate considerable portions of formalization, reducing the burden on developers.

    Somani highlights the importance of this integration: “These are all well-studied questions prior to autoformalization, but no one wants to use a super sophisticated type system, so I see autoformalization as a way of applying formal methods at scale.”

    Bounded Guarantees and Practical Trade-offs

    One misconception about formal verification is that it requires proving properties globally across all possible inputs. In practice, this is neither necessary nor feasible. What matters is proving correctness on bounded domains where the code will actually be used.

    This mirrors Somani’s argument in his work on mechanistic interpretability. You cannot prove that a neural network will never fail, but you can prove that specific subcircuits behave correctly on specific input distributions. The same logic applies to code. You cannot prove that a program is correct in all conceivable contexts, but you can prove it satisfies requirements within a defined scope.

    For example, a formally verified CUDA kernel might guarantee memory safety for all inputs within a specified range, or a proof might establish that a statistical model cannot internally represent certain prohibited outputs. These are bounded guarantees that compose into larger assurances as systems are built.
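    A sketch of what a bounded claim can look like in practice, again using Z3 from Python (the buffer size, index formula, and bounds are invented for illustration): the property is not claimed for all inputs, only for the domain the code will actually see.

```python
from z3 import And, Implies, Ints, Not, Solver, sat

i, width = Ints("i width")

BUF_CELLS = 2 ** 20                        # hypothetical buffer of 2**20 cells
index = i * width + (width - 1)            # generated code: last cell of row i

# Bounded specification: for 0 <= i < 1024 and 0 < width <= 1024,
# the computed index stays inside the buffer.
in_domain = And(0 <= i, i < 1024, 0 < width, width <= 1024)
spec = Implies(in_domain, And(0 <= index, index < BUF_CELLS))

s = Solver()
s.add(Not(spec))
if s.check() == sat:
    print("counterexample:", s.model())
else:
    print("verified on the bounded domain")
```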

    The trade-off is between generality and certainty. Testing provides probabilistic confidence across broad domains. Formal verification provides absolute confidence within narrow domains. The optimal strategy combines formal methods where correctness is critical with testing that validates behavior more broadly.

    Organizational Implications

    Implementing autoformalization requires changes in how organizations develop software. Specifications must be written explicitly. Verification infrastructure must be integrated into CI/CD pipelines. Developers must learn to interpret proof failures and refine specifications iteratively. For organizations deploying AI-generated code in production, the question is whether the investment in verification is justified by the risk of unverified systems.

    The calculus depends on context. For consumer-facing applications where failures are annoying but not catastrophic, informal testing may suffice. For systems in which failures carry legal, financial, or safety consequences, formal verification provides a competitive advantage.

    Somani’s broader argument is that debuggability (the capacity to localize, intervene on, and certify system behavior) is becoming a necessity as AI systems proliferate. Autoformalization is one tool in a broader toolkit for maintaining control over autonomous systems.

    The Path Forward

    The future of code generation will likely involve hybrid workflows where models produce candidate implementations, formal tools verify their correctness, and developers oversee the process. This division of labor plays to each component’s strengths: AI for exploration and pattern matching, formal methods for certification, and humans for specification and judgment.

    Progress depends on making verification more accessible. If only specialists can use these systems, adoption will remain limited. If automated formalization tools can translate natural-language specifications into formal statements and automatically verify code against them, the barrier to entry drops.

    Somani’s work on autoformalization in mathematics provides a template. By focusing on structured reasoning within formal frameworks, AI becomes a useful collaborator rather than a black box. The same principles apply to software: structure the problem, formalize the requirements, and verify mechanically.

    As AI-generated code becomes ubiquitous, the organizations that invest in verification infrastructure will have an advantage. They will deploy systems with confidence, knowing not just that code works in testing, but that it provably satisfies critical properties. In a world where software failures can have cascading consequences, that certainty is worth the investment.

    For leaders navigating the transition to AI-assisted development, the choice is clear: invest in the tools and processes that make verification tractable, or accept the growing liability of systems no one fully understands.
