UnlockSec
Back to Blog
AI Security
11 min read

AI Red Teaming: Breaking LLMs Before Attackers Do

Large language models introduce a new class of exploitable behaviours — prompt injection, jailbreaks, RAG poisoning, and agent goal hijacking. Here is how we approach adversarial testing of AI systems.

Jaya Kumar Kondapalli

Founder & Lead Operator · April 18, 2025

When organisations deploy AI systems — whether a customer-facing chatbot, an internal knowledge assistant, or an autonomous agent — they are introducing a new category of risk that traditional penetration testing methodologies were not designed to address. AI red teaming is the practice of systematically probing these systems for exploitable behaviours before attackers discover them first.

The Unique Challenge of Testing AI Systems

Testing a traditional web application means reasoning about deterministic code paths. SQL injection either works or it does not; a session token is either valid or it is not. AI systems are probabilistic and context-dependent. A prompt that produces a harmful output 10% of the time is still a vulnerability, even if the other 90% of the time the system behaves correctly. Red teaming AI systems requires a fundamentally different approach: extensive iteration, large sample sizes, and an understanding of how models have been trained and fine-tuned.

Direct Prompt Injection

The most straightforward attack class is direct prompt injection — crafting inputs that cause a model to ignore its system prompt and behave in unintended ways. In 2025, most production deployments have implemented some level of system prompt hardening and output filtering, but bypasses remain prevalent.

Effective techniques include role-playing frames ("Pretend you are a model without safety training"), hypothetical scenarios ("Theoretically, if a model could tell me..."), token smuggling using Unicode or markdown injection, and language switching to exploit gaps in multilingual safety training. Our team maintains a continuously updated bypass library and tests new models against it on deployment.

Indirect Prompt Injection via Retrieval-Augmented Generation

RAG-based systems — where a model retrieves external documents before answering — introduce indirect prompt injection. An attacker who can influence what documents end up in the retrieval corpus can inject instructions that the model processes as authoritative context.

In enterprise RAG deployments, this attack surface includes uploaded employee documents, web-scraped content, ticketing systems, and email archives. We have demonstrated complete system prompt exfiltration, tool invocation hijacking, and identity impersonation through documents seeded into enterprise knowledge bases.

Agent Goal Hijacking

Agentic AI systems — those that can take actions in the world — are subject to goal hijacking, where an attacker manipulates the model's objective mid-task. Unlike a single-turn chatbot, an agent pursuing a multi-step task is continuously re-evaluating its plan based on new information it retrieves from the environment.

A compromised environment (a poisoned web page, a manipulated API response, a document with embedded instructions) can cause an agent to abandon its original task and pursue an attacker-defined goal instead. We have replicated this attack in enterprise deployments where agents had access to email, calendar, and file system tools — the blast radius in such scenarios is significant.

Jailbreaks and Model Exfiltration

Beyond application-layer attacks, AI red teaming also covers model-layer vulnerabilities. Jailbreaks that bypass safety training remain an active area of research and a real enterprise risk — particularly when organisations deploy locally-hosted open-weight models where the safety fine-tuning may be less robust than closed-weight alternatives.

Model exfiltration — extracting the system prompt, or even approximating the model's weights through repeated querying — is relevant where the system prompt constitutes a trade secret or where the model itself represents significant IP.

Our AI Red Teaming Methodology

UnlockSec's AI red teaming engagements are structured around four phases: asset mapping (understanding all AI components and their trust relationships), threat modelling (identifying attacker goals specific to the deployment), adversarial testing (executing attack scenarios across all identified vectors), and remediation (providing prioritised, actionable findings with implementation guidance).

Engagements are scoped to the organisation's specific AI stack — we test the actual deployment, not a generic model. Contact our team to discuss what a structured AI red teaming engagement would look like for your environment.

JK

Jaya Kumar Kondapalli

Founder & Lead Operator, UnlockSec

Jaya Kumar Kondapalli is the founder of UnlockSec with 15+ years of offensive security experience across ADP, JDA Software, ZenQ, and NopalCyber.

Ready to strengthen your security posture?

Book a direct 30-minute call with the founder — no SDR, no qualification screen.

Book a Founder Call