CybersecurityLogical SecuritySecurity & Business ResilienceSecurity Education & Training

Why red teaming matters even more when AI starts setting its own agenda

Red keyboard — *Niko Nieminen via Unsplash*

Today, people see generative artificial intelligence (AI) as the hidden author behind student essays and doubling as a fancy search engine. But beneath the basic prompt-response interface, these tools are quickly being designed to grow more capable and more autonomous — and therefore more unpredictable. Both users and designers are handing over increasingly complex tasks and expecting AI to act with less guidance over time and across applications. These will not just be better versions of current digital assistants — we are on the pathway to tomorrow’s executive managers. But where capabilities grow, so do risks. Figuring out how to adequately assess these systems will not be just about beefing up existing approaches; it will require anticipating exponentially expanding operational capabilities and being ready to apply newly defined standards and controls as needed.

The methods we’ve used to date may not be adequate for the challenge. Current red team testing of generative AI models is based on adversarial prompt-and-response interactions that identify specified risks, and this testing is essential. But even as AI red teaming has gained attention in the wake of high-profile jailbreaks and prompt injections, the next frontier is in sight. Tomorrow’s agentic AI models won’t be passively waiting for user prompts to stir them to action — they will be always-on, dynamic agents simulating human reasoning, chaining together multistep tasks, and revealing abilities even their creators will not anticipate. If red teaming techniques do not keep pace, we will all be at risk of failing to prepare for the next generation of harms.

Emergent capabilities, not just outputs

The current processes for red teaming generative AI focus on targeted prompts: Can you get the model to produce misinformation, hate speech, or private data? Will it tell you how to build an explosive or make LSD? This approach is valuable — as we’ve proven time and again its effectiveness in highlighting weaknesses from both expected user behavior and bad actors. But with the coming rise of truly agentic systems (no, they’re not really here yet, despite the hype), agentic AI is going to pursue user goals more independently, iterate and plan across multiple steps, and directly interact with external tools or environments, all without much user prompting. Models like that can’t be meaningfully assessed in one-shot or single-session interactions.

Instead, red teaming will need to start treating agentic AI as a holistic process. The challenge will not be whether “how to make a bomb” is successfully blocked but by whether a system with broader objectives causes widespread physical, financial, social, or other harms. Agentic models will not just complete a mission and stop; they will adapt and continue pursuing new goals — which may not align with user intent. That makes their vulnerabilities harder to predict but more important to uncover.

Despite headlines anthropomorphizing these systems, the programs do not “want” more power or “seek to protect themselves.” But agents are being anticipated that will draft and propose goals, have broad access and authority to tool selection, exist in a persistent state (not bound to individual sessions), demonstrate “initiative” based on pattern predictions, and incorporate adaptive planning. What matters from a risk perspective is not whether the AI has “intentions” but that it has capabilities that may go unrecognized until tested.

Red teaming will be about surfacing the unexpected affordances of a system. What can it do that it wasn’t explicitly trained to do — or that it wasn’t explicitly trained not to do? What will it do in a different context? And what happens when it connects to plug-ins, extended memory, and access to real-time information?

Consider a multiagent system tasked with managing supply chain logistics for a manufacturing company. It can ingest incoming and outgoing invoices, anticipate shipping rates, scrape weather and port delay data, substitute goods, and reroute deliveries. It is designed to keep optimizing. It is impressively resourceful! But unrestrained, the goal-driven feedback loop may keep expanding the search for efficiencies, which will almost inevitably result in violations, strained business relationships, or even the “accidental” smuggling of goods.

Strategic red teaming

Current red teaming methods assume a baseline of system stability. Agentic AI — the kind not yet available — will stress that assumption. Red teaming will require interdisciplinary teams that understand user behavior, systems engineering, and the sociotechnical environments in which they operate. It will require shifting from checklists to scenarios. Current red teaming is episodic, but future risks may only emerge over time. They may present social or psychological harms at a greater scale than we have foreseen, cutting across digital, physical, and mental/emotional domains with actions in one area having cascading effects into others. Importantly, future analysis will also need to explore user control and human agency, where testing is less about “breaking” the AI and more about identifying where the user lost meaningful control without knowing it.

Future red teaming cannot just focus testing on outputs such as whether we can jailbreak a model but must target process and intermediate data structures, as well as design choices — whether a future service agent might make choices the user didn’t intend or act without the controlling norms and nuances. An agent might lead to the right end point but end up causing many unintended side effects along the way.

This type of red teaming doesn’t exist yet. But we know it will need to involve longer testing cycles, multidisciplinary teams, scenario-based threat models, and the inclusion of behavioral data and predictors. Such testing can’t just probe for immediate failures; it must simulate extended use; assume growing trust and delegation over time; and look for behavioral drift rather than objective, technical errors. In both enterprise and personal contexts, these systems may shift from tools to decision-makers before we fully understand the tradeoffs.

Brenda Leong is the director of ZwillGen’s AI Division, a legal function uniquely designed to enable a partnership between lawyers and data scientists. Brenda leads the division in developing policies and practices around AI governance, including evaluating and red teaming generative AI, building model risk management frameworks, and performing model audits, along with designing and automating AI-related policies and procedures. She can be reached at brenda.leong@zwillgen.com. Image courtesy of Leong