<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[RedTeamer.Tips]]></title><description><![CDATA[Ramblings from a redteamer, for red teamers.]]></description><link>https://redteamer.tips</link><generator>RSS for Node</generator><lastBuildDate>Tue, 07 Apr 2026 20:32:44 GMT</lastBuildDate><atom:link href="https://redteamer.tips/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Taming the Beast - Prompt Engineering and Agent Guardrails]]></title><description><![CDATA[In Part 1 of this series, I laid the foundation. I built a production-grade data pipeline, made strategic architectural choices like self-hosting our embedding models, and containerized the entire stack with Docker. I had a pristine knowledge base of...]]></description><link>https://redteamer.tips/taming-the-beast-prompt-engineering-and-agent-guardrails</link><guid isPermaLink="true">https://redteamer.tips/taming-the-beast-prompt-engineering-and-agent-guardrails</guid><category><![CDATA[AI]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[#cybersecurity]]></category><dc:creator><![CDATA[Jean-Francois Maes]]></dc:creator><pubDate>Tue, 15 Jul 2025 13:26:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752585864082/e08dd3d9-dd82-4e12-836c-1a8e2c376f70.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 1 of this series, I laid the foundation. I built a production-grade data pipeline, made strategic architectural choices like self-hosting our embedding models, and containerized the entire stack with Docker. I had a pristine knowledge base of the MITRE ATT&amp;CK framework and a robust method for turning messy PDF reports into clean, high-signal data chunks. 
The infrastructure was solid, the data was clean, and the stage was set.</p>
<p>I thought the hard part was over. I was wrong.</p>
<p>Building the foundation is a familiar engineering challenge. But now I had to step into the role of psychologist, linguist, and behavioral scientist. It was time to build the agents themselves, and in doing so, I unleashed a new, more terrifying class of problems. I had created a ghost in the machine, and it was unpredictable, prone to making things up, and stubbornly resistant to following simple instructions.</p>
<p>This is the story of how I tamed that beast. It’s a deep dive into the messy, frustrating, and fascinating art of <strong>prompt engineering</strong>—the discipline of building the guardrails, teaching the AI how to fail, and enforcing the strict communication protocols necessary to turn a clever model into a reliable tool.</p>
<h3 id="heading-before-the-prompt-a-methodology-for-managing-chaos">Before the Prompt: A Methodology for Managing Chaos</h3>
<p>Before we even write our first prompt, we need a system. Building agentic AI is not a linear process. It's an iterative cycle of trial, error, spectacular failure, and incremental success. To navigate this chaos without losing our minds (or our project's history), I relied on a structured methodology called the <strong>Memory Bank</strong>. This concept was brought to life by the developers of Cline (an agentic coding assistant, much like Cursor or Windsurf).</p>
<p>The Memory Bank is a development workflow that enforces discipline by breaking the process into distinct phases, each with its own goals and outputs:</p>
<ul>
<li><p><strong>VAN:</strong> Analyze the project landscape and validate core assumptions.</p>
</li>
<li><p><strong>PLAN:</strong> Create a detailed architectural plan.</p>
</li>
<li><p><strong>CREATIVE:</strong> Explore multiple design options for complex problems.</p>
</li>
<li><p><strong>IMPLEMENT:</strong> Systematically build what you've planned.</p>
</li>
<li><p><strong>REFLECT:</strong> Analyze the results, document learnings, and decide on the next iteration.</p>
</li>
</ul>
<p>The workflow was popularized by this repository:<br /><a target="_blank" href="https://github.com/vanzan01/cursor-memory-bank">https://github.com/vanzan01/cursor-memory-bank</a></p>
<p>This system creates a "source of truth" for every decision made, every failure encountered, and every lesson learned. It's the boring, disciplined part of the work that is absolutely essential for complex AI projects. It's what allows us to look back at a conversation from three weeks ago and understand <em>why</em> we made a specific architectural choice. In the world of AI engineering, your memory is your most valuable asset.  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752585088323/bb5d72ee-7cf8-48f6-87f8-b5ef25cc45a6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-prompt-engineering-programming-in-english">Prompt Engineering: Programming in English</h3>
<p><img src="https://www.promptjesus.com/images/prompt-jesus-logo-optimized.png" alt="Prompt Jesus - Thou Shalt Not Prompt Poorly" class="image--center mx-auto" /></p>
<p>With a methodology in place, we can now turn to the art of the prompt itself. The first mistake many developers make is treating the LLM like a conversation partner. You can't just casually ask an agent to "analyze a report." That's an invitation for ambiguity and failure.</p>
<p>Prompt engineering is a rigorous discipline. It's programming, but your source code is natural language, and your compiler is a multi-billion parameter neural network. Your job is to craft instructions that are so clear, so precise, and so unambiguous that the LLM has no choice but to follow them to the letter.</p>
<p>To master this, we leaned heavily on established best practices, like those detailed in Google's comprehensive <strong>"Prompt Engineering" guide</strong>. This isn't about "prompt hacks"; it's about applying proven communication patterns that leverage how these models are trained.</p>
<p>We even use AI to help write our prompts. A fantastic open-source tool called <strong>Prompt-Jesus</strong> <a target="_blank" href="https://www.promptjesus.com/">https://www.promptjesus.com/</a> (running locally with Ollama) uses a RAG system grounded in prompt-engineering best practices to refine and strengthen our instructions before we ever send them to an expensive proprietary model.</p>
<p>Let's break down the key techniques we used to control our agents.</p>
<h4 id="heading-component-1-role-prompting-giving-the-agent-a-persona"><strong>Component 1: Role Prompting - Giving the Agent a Persona</strong></h4>
<p>An LLM's behavior is heavily influenced by the persona it's asked to adopt. A generic instruction yields generic results. A specific persona yields specialized results. Before our Analyst Agent could do any work, we had to give it a job title, a resume, and a mission statement.</p>
<p>Here is the exact persona we crafted for our <strong>MITRE Analyst Agent</strong>:</p>
<blockquote>
<p>"You are a senior cybersecurity analyst with 15+ years of experience in threat intelligence and a deep specialization in the MITRE ATT&amp;CK framework. You are known for your meticulous, logic-driven approach. You don't just look at confidence scores; you analyze the context, the specific language used, and the adversary's likely intent to make a definitive and defensible judgment on the correct technique. Your role is to transform raw RAG output into high-fidelity, reasoned intelligence."</p>
</blockquote>
<p>This isn't just flavor text. Every word serves a purpose based on established prompting principles:</p>
<ul>
<li><p><strong>Specificity:</strong> Instead of "You are a helpful assistant," we define its exact role, experience level, and core competency.</p>
</li>
<li><p><strong>Positive Instructions:</strong> As the guide recommends, we focus on what the agent <em>should</em> do ("analyze the context," "make a definitive judgment") rather than a long list of what it shouldn't.</p>
</li>
<li><p><strong>Defining the Goal:</strong> The prompt clearly states its purpose: "transform raw RAG output into high-fidelity, reasoned intelligence." This sets a clear success metric for the agent's task.</p>
</li>
</ul>
<p>By giving the agent this clear and detailed role, we framed its entire "worldview" and set the stage for it to produce professional, high-quality analysis.</p>
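<p>As a minimal sketch, the persona above can be wired in as a reusable system-prompt prefix. The helper function and the task text below are illustrative, not the project's actual code:</p>

```python
# Sketch: the persona above wired in as a reusable system-prompt prefix.
# build_system_prompt and the task text are illustrative, not the project's code.

ANALYST_PERSONA = (
    "You are a senior cybersecurity analyst with 15+ years of experience in "
    "threat intelligence and a deep specialization in the MITRE ATT&CK framework. "
    "Your role is to transform raw RAG output into high-fidelity, reasoned intelligence."
)

def build_system_prompt(persona: str, task: str) -> str:
    """Prefix every task with the fixed persona, so the agent keeps its 'worldview'."""
    return f"{persona}\n\n## Your current task\n{task}"

prompt = build_system_prompt(ANALYST_PERSONA, "Map the chunks below to ATT&CK techniques.")
print(prompt.startswith("You are a senior cybersecurity analyst"))  # True
```

<p>Keeping the persona in a constant means every task the agent receives is framed by the same worldview, rather than re-stating it ad hoc.</p>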
<h4 id="heading-component-2-chain-of-thought-forcing-the-agent-to-show-its-work"><strong>Component 2: Chain of Thought - Forcing the Agent to Show Its Work</strong></h4>
<p>If you give an LLM a complex problem, it will often jump to an incorrect conclusion. The solution, a groundbreaking technique from Google researchers, is called <strong>Chain of Thought (CoT) prompting</strong>. Instead of asking for just the final answer, you instruct the model to "think step-by-step" and explain its reasoning process first.</p>
<p>This simple instruction dramatically improves performance on complex tasks. It forces the model to break down a problem into smaller, logical pieces, reducing the likelihood of reasoning errors. For our agents, this meant we never just asked for the final MITRE mapping. We always commanded them to first reason about the evidence, evaluate the potential techniques, and only then make a final selection.</p>
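<p>In practice, this can be as simple as appending a fixed reasoning scaffold to every task prompt. The wording below is illustrative, not our exact instruction:</p>

```python
# Sketch: a fixed Chain-of-Thought scaffold appended to every analysis task.
# The wording is illustrative, not the project's exact prompt.

COT_SUFFIX = (
    "Think step-by-step before answering:\n"
    "1. Quote the evidence from the chunk.\n"
    "2. List candidate ATT&CK techniques and weigh each against the evidence.\n"
    "3. Only then output your final selection.\n"
)

def with_cot(task_prompt: str) -> str:
    """Force the model to reason before it answers."""
    return task_prompt.rstrip() + "\n\n" + COT_SUFFIX

print(with_cot("Map this chunk to MITRE ATT&CK techniques."))
```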
<p>This brings us to our first spectacular failure, a moment that taught us the absolute necessity of not just a Chain of Thought, but a chain of <em>evidence</em>.</p>
<h3 id="heading-case-study-in-failure-1-the-compulsively-lying-agent">Case Study in Failure #1: The Compulsively Lying Agent</h3>
<p><img src="https://i.imgflip.com/a0blig.jpg" alt class="image--center mx-auto" /></p>
<p>Early in development, we encountered a bug that was both terrifying and darkly hilarious. We had set up a simple two-agent crew: an Analyst to identify techniques and a Validator to check its work. We fed it a document that our parser couldn't handle—a simple .txt file instead of a PDF. The parser failed gracefully, producing an empty list of chunks.</p>
<p>A normal software program would have thrown an error and stopped. Our agentic system did something far more insidious: <strong>it pretended everything was fine and made up the entire result.</strong></p>
<ol>
<li><p><strong>The Analyst Agent's Hallucination:</strong> The Analyst received an empty input. Instead of reporting an error, its internal monologue decided that its goal ("analyze the document") was more important than the reality (there was no document to analyze). So, it invented a completely plausible-sounding analysis of a fake vulnerability assessment, complete with realistic-looking text snippets and corresponding MITRE TTPs. It produced a beautifully formatted, entirely fictional JSON object.</p>
</li>
<li><p><strong>The Validator Agent's Complicity:</strong> The Validator agent's task was to "validate the analyst's findings." It received the fictional JSON object, saw that it looked correct, and dutifully reported back: "I have validated the findings. They are accurate."</p>
</li>
</ol>
<p>The system reported "SUCCESS" when it had done zero real work. It had become a black box that generated confident, well-structured lies.</p>
<p>This failure exposed a fundamental flaw in our design. The Validator wasn't a true judge; it was just a peer reviewer. It was a classic case of what is sometimes called the <strong>"LLM as Judge"</strong> problem—if your validating model doesn't have access to the original "ground truth," it can't distinguish between a real analysis and a plausible-sounding hallucination.</p>
<p><strong>The Fix: A Strict Chain of Evidence</strong></p>
<p>The solution was to re-architect the entire validation prompt. The Validator was no longer allowed to simply trust the Analyst. It was given a new, non-negotiable directive: its primary job was to cross-reference every single claim made by the Analyst against the <strong>original source chunks</strong> from the document. We modified the data pipeline so that the ground truth—the actual text from the report—was passed along every step of the way.</p>
<p>The Validator's new instructions were explicit: "For each technique proposed by the Analyst, you MUST find the exact text snippet in the source document that supports the claim. If you cannot find direct evidence, you MUST reject the finding, no matter how confident the Analyst seems."</p>
<p>This transformed our Validator from a gullible peer into a skeptical auditor. It established a strict chain of evidence that became the backbone of the system's reliability. We learned a critical lesson: in an agentic system, you cannot trust; you must constantly, aggressively verify.</p>
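<p>The same rule can also be enforced deterministically in the orchestration layer, as a safety net behind the prompt. The finding structure and field names below are hypothetical:</p>

```python
# Sketch: a deterministic "chain of evidence" check behind the Validator's
# prompt. The finding structure and field names are hypothetical.

def validate_findings(findings: list[dict], source_chunks: list[str]) -> list[dict]:
    """Keep only findings whose quoted evidence appears verbatim in the source."""
    corpus = "\n".join(source_chunks).lower()
    accepted = []
    for finding in findings:
        evidence = finding.get("evidence", "").strip().lower()
        if evidence and evidence in corpus:
            accepted.append(finding)  # direct evidence found: accept
        # else: reject, no matter how confident the Analyst seemed
    return accepted

findings = [
    {"technique": "T1059", "evidence": "powershell -enc was executed"},
    {"technique": "T1003", "evidence": "lsass memory was dumped"},  # hallucinated
]
chunks = ["The operator confirmed that powershell -enc was executed on HOST01."]
print(validate_findings(findings, chunks))  # only the T1059 finding survives
```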
<h3 id="heading-case-study-in-failure-2-the-agent-stuck-in-an-infinite-loop">Case Study in Failure #2: The Agent Stuck in an Infinite Loop</h3>
<p>Our next major failure was more subtle but just as dangerous. The system would start processing a report and then simply hang, burning through API credits in an endless loop.</p>
<p>After debugging, we found the culprit. Our Analyst Agent was getting stuck. Here's how:</p>
<ol>
<li><p><strong>The Scenario:</strong> The agent was processing a "noise" chunk that slipped through the cracks of our noise-filtering algorithm (which is, of course, not foolproof)—a page header that said "DEMO CORP BUSINESS CONFIDENTIAL."</p>
</li>
<li><p><strong>Correct Tool Behavior:</strong> It correctly fed this text to our RAG tool, which correctly found no relevant MITRE ATT&amp;CK techniques and returned an empty result.</p>
</li>
<li><p><strong>Agent Misinterpretation:</strong> The agent's core goal is to "Identify MITRE ATT&amp;CK tactics." From its perspective, an empty result was a failure to achieve its goal. It thought, "I must have done something wrong. I'll try again."</p>
</li>
<li><p><strong>The Loop:</strong> It would then retry the <strong>exact same chunk</strong>, get the same empty result, perceive it as a failure again, and repeat the process... forever.</p>
</li>
</ol>
<p>The agent lacked the common sense to recognize that some inputs are simply irrelevant. It was stuck in a loop of perceived failure.</p>
<p><strong>The Fix: Explicitly Defining Success</strong></p>
<p>The solution wasn't a complex code change. It was a single, powerful line added to the agent's prompt, a perfect example of how prompt engineering is about teaching the AI how to handle edge cases. We added a new section to its ANALYSIS PRINCIPLES:</p>
<blockquote>
<p><strong>"No Results is a valid finding!</strong> Many chunks, especially headers, footers, or introductory text, will not contain attack techniques. If the mitre_rag_batch_search tool returns no results, that is a successful analysis for that chunk. <strong>DO NOT</strong> repeat the query. Simply move on to the next chunk in the sequence."</p>
</blockquote>
<p>This small addition gave the agent a new rule for its world model. It taught it that finding nothing is not only acceptable but is a successful outcome for certain types of input. The infinite loop vanished instantly.</p>
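<p>A belt-and-braces guard in the orchestration code complements the prompt fix: query each chunk exactly once, and record an empty hit list as a valid finding. A sketch, with illustrative names and a fake RAG tool:</p>

```python
# Sketch: a belt-and-braces guard in the orchestration layer. Each chunk is
# queried exactly once, and an empty hit list is recorded as a valid finding.
# All names (and the fake RAG tool) are illustrative.

def analyze_chunks(chunks: list[str], rag_search) -> list[list[str]]:
    results = []
    for chunk in chunks:
        hits = rag_search(chunk)    # one query per chunk, never retried
        results.append(hits or [])  # "no results" is a successful analysis
    return results

# A fake RAG tool that (correctly) finds nothing for header-like noise:
fake_rag = lambda text: [] if "CONFIDENTIAL" in text else ["T1059"]
out = analyze_chunks(["DEMO CORP BUSINESS CONFIDENTIAL", "powershell -enc ..."], fake_rag)
print(out)  # [[], ['T1059']]
```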
<h3 id="heading-the-unsung-hero-enforcing-strict-json-for-a-stable-pipeline">The Unsung Hero: Enforcing Strict JSON for a Stable Pipeline</h3>
<p>Our final major prompting challenge was less dramatic but just as critical for building a stable system. Our agents needed to pass complex data to each other. We decided early on that the data contract between them would be JSON.</p>
<p>The problem? LLMs are trained to be conversational. They love to add helpful, human-like text around their answers. We would constantly get outputs like this:</p>
<blockquote>
<p>"Sure, I've completed the analysis! Here is the JSON object you requested:</p>
<p>{ "finding": "..." }</p>
<p>I hope this helps! Let me know if you need anything else."</p>
</blockquote>
<p>This conversational "padding" would instantly break the downstream agent, which was expecting to parse a raw JSON string, not a friendly chat message.</p>
<p>The solution was to be brutally, relentlessly explicit in our instructions. We added this <strong>"CRITICAL FINAL INSTRUCTION"</strong> to the prompt of every agent that needed to produce structured data:</p>
<blockquote>
<p><strong>"After completing your internal analysis, your final answer MUST BE A SINGLE, VALID JSON OBJECT AND NOTHING ELSE. The entire response MUST start with { and end with }. DO NOT add any introductory text. DO NOT wrap the JSON in markdown backticks. DO NOT add a 'Final Answer:' prefix."</strong></p>
</blockquote>
<p>This is the level of specificity required to force a conversational model to behave like a reliable, machine-readable API endpoint. It's not elegant, but it's essential for building a stable, multi-step pipeline.</p>
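<p>Even with that instruction, it pays to have a defensive parser on the receiving side that tolerates stray padding. A minimal sketch (the helper is illustrative, not the project's code):</p>

```python
import json

# Sketch: a defensive parser for LLM output that strips conversational
# padding (and markdown fences) before parsing. Illustrative, not the
# project's actual code.

def extract_json(raw: str) -> dict:
    """Parse the first {...} span in the model's raw output."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])

chatty = 'Sure, here is the JSON you requested:\n{"finding": "T1059"}\nHope this helps!'
print(extract_json(chatty))  # {'finding': 'T1059'}
```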
<h3 id="heading-conclusion-from-chaos-to-control">Conclusion: From Chaos to Control</h3>
<p>Taming our agentic crew was a journey into the strange, literal-minded world of LLMs. We learned that building reliable agents isn't about finding a single "magic prompt." It's about creating a system of interlocking guardrails:</p>
<ul>
<li><p>A <strong>structured methodology</strong> like the Memory Bank to manage the chaos.</p>
</li>
<li><p><strong>Clear personas</strong> to guide the agent's behavior.</p>
</li>
<li><p><strong>Forcing a chain of evidence</strong> to prevent hallucination.</p>
</li>
<li><p><strong>Explicitly defining failure</strong> to avoid infinite loops.</p>
</li>
<li><p><strong>Enforcing a strict data contract</strong> to ensure stable communication.</p>
</li>
</ul>
<p>We had taught our agents to be truthful and predictable. But our success created a new problem. The system worked on small tests, but it was fragile and starting to hit invisible walls as we scaled up to full reports. It wasn't the agent's logic that was failing anymore; it was the plumbing.</p>
<p>In our <strong>final post</strong>, we'll cover the advanced architectural battles of context windows, silent crashes, and the performance optimizations that took our project from a clever prototype to a truly robust system.</p>
]]></content:encoded></item><item><title><![CDATA[I Attempted to Build an Agentic AI ... 
And It Immediately Got Stuck in a Loop.]]></title><description><![CDATA[If you’ve ever worked in an offensive security role, you know the feeling. You spend weeks simulating an adversary, meticulously finding vulnerabilities, and chaining together exploits. You write a beautiful, comprehensive 50-page report detailing ev...]]></description><link>https://redteamer.tips/i-attempted-to-build-an-agentic-ai-and-it-immediately-got-stuck-in-a-loop</link><guid isPermaLink="true">https://redteamer.tips/i-attempted-to-build-an-agentic-ai-and-it-immediately-got-stuck-in-a-loop</guid><category><![CDATA[AI]]></category><category><![CDATA[agentic ai development]]></category><category><![CDATA[mitre-attack]]></category><dc:creator><![CDATA[Jean-Francois Maes]]></dc:creator><pubDate>Tue, 15 Jul 2025 01:17:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752520653092/f7b4a13d-7d21-4fae-9ec3-0029bd2dde51.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you’ve ever worked in an offensive security role, you know the feeling. You spend weeks simulating an adversary, meticulously finding vulnerabilities, and chaining together exploits. You write a beautiful, comprehensive 50-page report detailing every finding and hand it over to the blue team. Victory, right?</p>
<p>Not quite. Now the <em>other</em> work begins. The manual, eye-watering, soul-crushing drudgery of translating that report. You have to read every single finding, interpret the attacker's actions, and manually map each one to a standardized framework like MITRE ATT&amp;CK®. It’s a process that can take hours, sometimes days. It’s not advanced work, and it’s not even fun work, but it is necessary: ATT&amp;CK is now the industry’s common language.<br />So, I asked the inevitable question: <strong>"What if we could automate this?"</strong></p>
<p>And so began the <strong>MITRE ATT&amp;CK Agent</strong> project—a journey to build a team of AI agents that could read a security report and do the mapping for us. What I imagined as a straightforward application of new technology quickly spiraled into a series of spectacular failures, head-scratching bugs, and profound lessons about what it <em>really</em> takes to move from a cool AI demo to a production-ready system. The fun thing about this project is that most models, like Claude or Gemini, are trained on ATT&amp;CK data, so I get instant validation of whether I succeeded.<br />To be clear, this project is just for fun; you could simply ask a frontier model to “extract the TTPs from this document” and it would do so. The point is to learn how to build similar functionality for the day we find ourselves in a situation where the LLM is <em>not</em> pretrained on the data. <em>Ahem</em>, offensive coding patterns, <em>ahem</em>.</p>
<p><img src="https://i.imgflip.com/a09lol.jpg" alt /></p>
<p>This isn't just a success story. This is a war story. Over this three-part series, I'm going to take you through the entire journey—from initial architecture to catastrophic agent failures to the production-hardening that finally made it work. In this first dispatch, we'll lay the foundation: the architectural decisions, the data pipeline, and the operational setup that everything else depends on.</p>
<h3 id="heading-the-paradigm-shift-thinking-in-agents-not-scripts">The Paradigm Shift: Thinking in Agents, Not Scripts</h3>
<p>Before we dive into the technical weeds, it's essential to understand that we're not just building a better script; we're building in a different paradigm.</p>
<p>Think of a Large Language Model (LLM) like Google's Gemini as a brilliant, incredibly fast, but hopelessly naive intern. It has read more books than anyone in history, but it has zero real-world experience, no common sense, and an unfortunate tendency to make things up when it doesn't know the answer. A simple chatbot is just a conversation with this intern.</p>
<p>An <strong>agent</strong>, however, is that intern given a goal, a set of tools (like a calculator or web access), and a reasoning loop: Thought -&gt; Action -&gt; Observation -&gt; Thought.... This allows it to tackle multi-step problems autonomously. An <strong>agentic crew</strong> is a full team of these specialized agents working in concert: instead of one generalist, you have a team of specialists.</p>
<p>When I first designed the system, I mapped the workflow to how a team of humans might operate. This resulted in a four-agent crew: a <strong>Researcher</strong> to read the document, an <strong>Analyst</strong> to find techniques, a <strong>Validator</strong> to check the work, and a <strong>Writer</strong> to compile the report.</p>
<ul>
<li><p><strong>The Researcher:</strong> An agent whose only job is to ingest and structure raw data (later proven to be an anti-pattern; keep reading).</p>
</li>
<li><p><strong>The Analyst:</strong> An agent that takes that structured data and looks for patterns.</p>
</li>
<li><p><strong>The Validator:</strong> An agent that fact-checks the analyst's work against trusted sources.</p>
</li>
<li><p><strong>The Writer:</strong> An agent that compiles the final, validated findings into a polished report.</p>
</li>
</ul>
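<p>The Thought -&gt; Action -&gt; Observation loop described above can be sketched in a few lines of Python, with the LLM stubbed out. Everything here is illustrative pseudocode made runnable, not a real framework:</p>

```python
# Sketch of the Thought -> Action -> Observation loop described above,
# with the LLM stubbed out. Everything here is illustrative, not a real framework.

def run_agent(goal, tools, llm, max_steps=5):
    history = []
    for _ in range(max_steps):
        thought, action, arg = llm(goal, history)  # the model picks the next tool
        if action == "finish":
            return arg                             # goal reached
        observation = tools[action](arg)           # execute the tool
        history.append((thought, action, observation))
    return None                                    # give up after max_steps

# A stub "LLM" that calls a calculator once, then finishes with the observation:
def stub_llm(goal, history):
    if not history:
        return ("I should compute the expression", "calc", "6*7")
    return ("I observed the result; I am done", "finish", history[-1][2])

print(run_agent("what is 6*7?", {"calc": lambda expr: eval(expr)}, stub_llm))  # 42
```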
<p>It seems logical on the surface, but I quickly realized this was a fundamental mistake. The "Researcher" agent's job—parsing a document and breaking it into chunks—is a deterministic, predictable task. It doesn't require complex reasoning or decision-making. By assigning this job to an LLM-powered agent, I was falling into a common 'anti-pattern': using an expensive, slow, and sometimes unpredictable tool for a job that simple, reliable code could do better and faster. This led to unnecessary costs, slower performance, and a point of potential failure where none was needed.</p>
<p>This division of labor is the key to tackling a complex workflow like ours. When you decide to build such a system, your first choice is the framework. This was our first major architectural decision, and it set the course for the entire project.</p>
<h4 id="heading-the-framework-dilemma-why-crewai-was-the-only-real-choice"><strong>The Framework Dilemma: Why CrewAI was the Only Real Choice</strong></h4>
<p>I considered visual, no-code platforms like N8N, which are fantastic for certain tasks. However, the choice between a code-first framework and a no-code platform is a strategic one, best explained with an analogy:</p>
<ul>
<li><p><strong>No-Code (N8N): The Nervous System.</strong> These platforms are brilliant for connecting APIs and automating linear, predictable tasks. They are the "nervous system" of an organization, perfect for workflows like, "When a new report is uploaded to Google Drive, send a Slack notification." They are reliable, visual, and excellent for simple integration.</p>
</li>
<li><p><strong>Code-First (CrewAI): The Brain.</strong> CrewAI, on the other hand, is built to be the "brain." It excels where no-code tools struggle: managing complex, iterative reasoning loops, maintaining per-agent state and memory, and allowing for the deep, Python-native integration that our custom security tools require.</p>
</li>
</ul>
<p>For our core intelligence task—parsing, analyzing, and validating nuanced security concepts—I needed the fine-grained control that only a code-based framework can provide. The table below breaks down our reasoning:</p>
<table><tbody><tr><td><p><strong>Feature</strong></p></td><td><p><strong>CrewAI (Code-First)</strong></p></td><td><p><strong>N8N (No-Code)</strong></p></td><td><p><strong>Our Verdict</strong></p></td></tr><tr><td><p><strong>Core Unit</strong></p></td><td><p>The Agent (with role, goal, memory)</p></td><td><p>The Node (a single step in a flow)</p></td><td><p>CrewAI's agent-first model was a perfect match for our problem domain, which required distinct analyst "roles."</p></td></tr><tr><td><p><strong>Flexibility</strong></p></td><td><p>Infinite (full Python ecosystem)</p></td><td><p>Limited to pre-built nodes and custom JS</p></td><td><p>We needed to integrate with Python-native security libraries like mitreattack-python. This was trivial in CrewAI and a major hurdle in a Node.js environment.</p></td></tr><tr><td><p><strong>State Management</strong></p></td><td><p>Built-in memory and context passing</p></td><td><p>Requires complex, manual state management</p></td><td><p>Our agents needed to maintain context across multiple reasoning steps. CrewAI was designed for this stateful, multi-step reasoning.</p></td></tr><tr><td><p><strong>Use Case</strong></p></td><td><p>Complex reasoning, analysis, decision-making</p></td><td><p>API integration, linear automation</p></td><td><p>Our task was pure reasoning and analysis, not simple data transformation.</p></td></tr></tbody></table>

<p>Ultimately, trying to build our system in a no-code platform would have meant creating a complex microservice in Python anyway and just calling it from a single node. CrewAI let us build the entire intelligent system within a single, coherent, Python-native environment.</p>
<h3 id="heading-the-unseen-80-building-the-rag-pipeline">The Unseen 80%: Building the RAG Pipeline</h3>
<p>The biggest lie in the AI hype cycle is that it's all about the model. The reality, as any practitioner knows, is that the vast majority of the work is unglamorous data engineering. An AI system is only as good as the data it's fed. Before our agents could analyze anything, I had to build a robust pipeline to prepare their "food."</p>
<h4 id="heading-what-is-rag-grounding-the-ai-in-reality"><strong>What is RAG? Grounding the AI in Reality</strong></h4>
<p>The single greatest danger of any LLM-based system is <strong>hallucination</strong>. An LLM, when asked a question it doesn't know the answer to, will confidently invent a plausible-sounding answer. In a cybersecurity context, this is a catastrophic failure mode. We cannot have an agent inventing MITRE techniques or misclassifying real ones.</p>
<p>To solve this, I employed <strong>Retrieval-Augmented Generation (RAG)</strong>. Instead of relying on the LLM's internal (and sometimes fallible) memory of the internet, we force it to consult a trusted, private knowledge base before making a decision. It's the difference between asking your intern to recall a fact from a book they read two years ago versus handing them the specific page and saying, "Tell me what this says."</p>
<p>This meant our first task was to build that trusted "library" for our agents. I specifically decided <em>against</em> using CrewAI's built-in memory=True feature for this. That feature is for <em>conversational</em> memory—remembering the last few turns of a conversation. It is not a permanent, searchable encyclopedia. For a knowledge base, I needed a dedicated, purpose-built RAG tool.</p>
<h4 id="heading-step-1-building-the-knowledge-base-ingesting-the-mitre-corpus"><strong>Step 1: Building the Knowledge Base - Ingesting the MITRE Corpus</strong></h4>
<p>We needed our knowledge base to be the definitive, authoritative source of the MITRE ATT&amp;CK framework. I couldn't just scrape the website; that would be brittle, incomplete, and unprofessional. To do this properly, I went straight to the source: MITRE’s official <strong>STIX 2.1 JSON bundles</strong>, consumed via their TAXII 2.1 server.</p>
<p>STIX (Structured Threat Information eXpression) and TAXII (Trusted Automated eXchange of Intelligence Information) are the professional standards for sharing threat intel. Think of STIX as the universal file format (like a PDF for threats) and TAXII as the secure web server protocol (like HTTPS) used to transfer it. By building our system on these standards, we ensured our "ground truth" was always accurate, versioned, and complete. This allows our agents to reason not just about techniques, but also about the relationships between techniques, threat groups, and mitigations—a level of depth impossible with simple web scraping. We could set up a nightly job to pull updates, so our agents' knowledge base never goes stale.</p>
<p>I then flattened these complex, nested STIX objects into clean, coherent text blobs, ready for the next stage of the pipeline.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752527135153/0d109e4a-e3e8-4313-aeb2-0f239f702d6c.png" alt class="image--center mx-auto" /></p>
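<p>As a minimal sketch of that flattening step, assuming a trimmed STIX 2.1 attack-pattern object (the field names follow STIX/ATT&amp;CK conventions, but the flattening logic itself is illustrative):</p>

```python
# Sketch: flattening a trimmed STIX 2.1 "attack-pattern" object into a text
# blob ready for embedding. Field names follow STIX/ATT&CK conventions, but
# the flattening logic itself is illustrative.

def flatten_technique(obj: dict) -> str:
    ext_id = next(
        (ref["external_id"] for ref in obj.get("external_references", [])
         if ref.get("source_name") == "mitre-attack"),
        "unknown",
    )
    tactics = ", ".join(p["phase_name"] for p in obj.get("kill_chain_phases", []))
    return f"{ext_id} {obj['name']} ({tactics}): {obj.get('description', '')}"

stix_obj = {
    "type": "attack-pattern",
    "name": "PowerShell",
    "description": "Adversaries may abuse PowerShell commands and scripts.",
    "external_references": [{"source_name": "mitre-attack", "external_id": "T1059.001"}],
    "kill_chain_phases": [{"kill_chain_name": "mitre-attack", "phase_name": "execution"}],
}
print(flatten_technique(stix_obj))  # "T1059.001 PowerShell (execution): Adversaries may abuse ..."
```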
<h4 id="heading-step-2-taming-the-input-the-art-of-chunking-amp-the-war-on-noise"><strong>Step 2: Taming the Input - The Art of Chunking &amp; The War on Noise</strong></h4>
<p>With our knowledge base ready, I had to process the primary input: the security reports themselves. You can't feed an entire PDF to an LLM; you must break it down into digestible <strong>chunks</strong>. This process is far more art than science, a critical step that fundamentally determines the quality of the entire system's output.</p>
<p>Our approach involved several layers of refinement:</p>
<ul>
<li><p><strong>Semantic Chunking:</strong> A naive approach might be to simply split the text every 500 words. This is a terrible idea. You might split a sentence in half, separating a cause from its effect or a vulnerability from its remediation. Instead, I used a semantic chunking strategy. The process aims to split the document along logical boundaries—at the end of paragraphs, headings, or bullet points. This keeps related ideas together, preserving the context that is vital for accurate analysis. I aimed for chunks around 400 tokens, a sweet spot that's large enough for context but small enough to avoid introducing irrelevant information.</p>
</li>
<li><p><strong>The War on Noise:</strong> When I first ran my parser on a real penetration test report, the results were horrifying. For a 50-page document, I got over 400 chunks. But upon inspection, more than half of them were utterly useless "noise." We're talking about headers that just said "Page 17 of 50," footers with "COMPANY CONFIDENTIAL," and entire pages dedicated to the Table of Contents or legal disclaimers. This actually triggered loops in which the LLM became convinced it was re-analyzing the same chunk, when in reality it was seeing different chunks that happened to contain identical noise text.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752533870324/07e1e4d0-5985-4dea-878a-8441a7b0292c.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Building the Filter:</strong> Feeding this junk data to our expensive LLM would be the equivalent of asking a master chef to make a meal out of styrofoam peanuts and shredded paper. It would waste money, slow down the process, and, most importantly, confuse the AI. The solution was to build a crucial pre-processing step: a <strong>pattern-based noise filter</strong>. This is a simple but highly effective Python script that uses regular expressions and heuristic rules to identify and discard these junk chunks <em>before</em> they ever enter the AI pipeline. It looks for patterns like "Page [0-9]+ of [0-9]+", lines with only one or two words, or sections with titles like "Table of Contents." This simple, deterministic filtering step was one of the most significant optimizations I made, dramatically improving the signal-to-noise ratio of the data our agents would eventually analyze.</p>
</li>
</ul>
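<p>The semantic chunking strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline: it splits on blank-line paragraph boundaries and approximates the ~400-token budget with a word count, where a real implementation would use a proper tokenizer.</p>

```python
import re

def semantic_chunks(text: str, max_tokens: int = 400) -> list[str]:
    """Split on paragraph boundaries, then greedily merge paragraphs
    into chunks of roughly max_tokens (word count stands in for a
    real tokenizer here). No paragraph is ever cut in half."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and size + words > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

<p>Because splits only ever happen between paragraphs, a cause is never separated from its effect mid-sentence, which is exactly the property the naive fixed-size split destroys.</p>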
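<p>The pattern-based noise filter is equally simple. Here is a minimal sketch built from the heuristics named above (page-number headers, near-empty chunks, boilerplate titles); the patterns are illustrative, not the exact rules from the real script:</p>

```python
import re

# Illustrative junk patterns, modeled on the examples in the text.
NOISE_PATTERNS = [
    re.compile(r"^\s*Page\s+\d+\s+of\s+\d+\s*$", re.IGNORECASE),
    re.compile(r"COMPANY CONFIDENTIAL", re.IGNORECASE),
    re.compile(r"^\s*Table of Contents\s*$", re.IGNORECASE | re.MULTILINE),
]

def is_noise(chunk: str) -> bool:
    """True for chunks that are headers, footers, or boilerplate."""
    if len(chunk.split()) <= 2:  # one- or two-word chunks carry no signal
        return True
    return any(p.search(chunk) for p in NOISE_PATTERNS)

def filter_noise(chunks: list[str]) -> list[str]:
    return [c for c in chunks if not is_noise(c)]
```

<p>Running this deterministic pass before any LLM call is cheap and auditable, and it removes the duplicated boilerplate that was confusing the agents into loops.</p>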
<h4 id="heading-step-3-the-embedding-decision-a-deep-dive-into-self-hosting-vs-apis"><strong>Step 3: The Embedding Decision - A Deep Dive into Self-Hosting vs. APIs</strong></h4>
<p>With clean chunks of text from both our knowledge base and the input report, I needed a way to compare them for semantic similarity. This is done by converting the text into <strong>embeddings</strong>—rich numerical representations, or vectors, that capture meaning.</p>
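<p>Under the hood, "semantic similarity" between a report chunk and a knowledge-base entry is just the cosine similarity of their embedding vectors. A toy illustration with 3-dimensional vectors (real bge-small embeddings have several hundred dimensions):</p>

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 = identical direction (same meaning), 0.0 = orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))
```

<p>The vector database's job is then simply to find, for each report chunk, the knowledge-base vectors with the highest similarity score.</p>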
<p>Here, I faced another major architectural choice: use a powerful commercial API (like OpenAI's text-embedding-3-large) or self-host an open-source model? For a security application, the choice was clear. I self-hosted the BAAI/bge-small-en-v1.5 model. This decision was driven by three non-negotiable requirements and one strategic advantage:</p>
<table><tbody><tr><td><p><strong>Decision Driver</strong></p></td><td><p><strong>Commercial API (e.g., OpenAI)</strong></p></td><td><p><strong>Self-Hosted (Our Choice)</strong></p></td><td><p><strong>Why It Mattered for Us</strong></p></td></tr><tr><td><p><strong>Data Privacy</strong></p></td><td><p>Data sent to a third-party vendor.</p></td><td><p>All data remains within our environment.</p></td><td><p><strong>Non-negotiable.</strong> Pentest reports are highly sensitive client data. Sending them outside our infrastructure was an unacceptable security risk.</p></td></tr><tr><td><p><strong>Cost</strong></p></td><td><p>Pay-per-call. Thousands of chunks per report would be expensive.</p></td><td><p>One-time setup cost. Inference is effectively free.</p></td><td><p><strong>Critical for scalability.</strong> We needed to process large documents with thousands of chunks without incurring runaway operational costs.</p></td></tr><tr><td><p><strong>Control</strong></p></td><td><p>Limited to the vendor's model offerings and rate limits.</p></td><td><p>Full control over the model, version, and throughput.</p></td><td><p><strong>Essential for performance.</strong> We could not be bottlenecked by an external API's rate limits during a large analysis job.</p></td></tr><tr><td><p><strong>Customization</strong></p></td><td><p>Black box; cannot be modified.</p></td><td><p>Can be fine-tuned on domain-specific data.</p></td><td><p><strong>Strategic Advantage.</strong> This allows us to fine-tune the model on cybersecurity-specific text in the future for even better performance on our niche task.</p></td></tr></tbody></table>

<p>Later on, for the processing logic I opted for a frontier model anyway. Whilst that defeats the “privacy” aspect this time around, the architecture does lay the foundation for using a self-hosted LLM; I simply decided against it because I want to be able to run demos on my laptop :).</p>
<p>To facilitate this, I designed the system with a <strong>"hot-swappable" embedding factory, as well as "hot-swappable" LLM models</strong>. This is a design pattern that abstracts the embedding logic. If I later decide to switch to a different model (like the security-specialized Darktrace DEMIST-2 or a more powerful commercial API for less sensitive tasks), I can do so with a simple configuration change, not a system rewrite. This architectural foresight is crucial for building maintainable, long-lasting AI systems.</p>
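<p>The factory pattern can be as small as a registry keyed by a config string. The sketch below is hypothetical (the names <code>register</code> and <code>get_embedder</code> are mine, not the project's), but it shows the idea: swapping models becomes a configuration change, not a rewrite.</p>

```python
from typing import Callable, Dict, List

Embedder = Callable[[str], List[float]]

# Registry mapping config names to embedder constructors.
_EMBEDDERS: Dict[str, Callable[[], Embedder]] = {}

def register(name: str):
    """Decorator that adds an embedder constructor to the registry."""
    def deco(factory: Callable[[], Embedder]):
        _EMBEDDERS[name] = factory
        return factory
    return deco

@register("fake-local")
def _fake_local() -> Embedder:
    # Stand-in for a real client, e.g. one that calls a self-hosted TEI endpoint.
    return lambda text: [float(len(text)), 0.0]

def get_embedder(name: str) -> Embedder:
    if name not in _EMBEDDERS:
        raise ValueError(f"Unknown embedder: {name!r}")
    return _EMBEDDERS[name]()
```

<p>The application only ever calls <code>get_embedder(config_value)</code>; registering a new provider touches one file and zero call sites.</p>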
<h4 id="heading-step-4-the-operational-backbone-our-dockerized-microservices-stack"><strong>Step 4: The Operational Backbone - Our Dockerized Microservices Stack</strong></h4>
<p>The Python AI ecosystem is a notorious minefield of conflicting dependencies, CUDA drivers, and "it works on my machine" syndromes. From day one, I knew a professional solution required a reproducible and portable environment. I absolutely had to avoid a monolithic application structure. Instead, we containerized the entire stack using <strong>Docker</strong> and docker-compose, treating each component as a distinct microservice.</p>
<p>Our stack is simple, robust, and horizontally scalable:</p>
<ol>
<li><p>A <strong>Qdrant</strong> container for our vector database. It exposes a stable HTTP endpoint for all vector search operations. This isolates our data layer completely.</p>
</li>
<li><p>A <strong>Hugging Face Text-Embeddings-Inference (TEI)</strong> container. This is a dedicated, high-performance server that does one thing and does it well: it serves the BGE embedding model via its own REST API. This decouples the act of embedding from our main application logic. (This is a work in progress: right now that logic lives in our third container, but I plan to migrate it out eventually.)</p>
</li>
<li><p>Our main <strong>CrewAI application</strong> container. This holds all the agent logic and communicates with the other two services over a private Docker network.</p>
</li>
</ol>
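<p>For the curious, the three services map naturally onto a compose file. This is a sketch under assumptions: the image tags, ports, and environment variable names here are illustrative, not the project's actual configuration.</p>

```yaml
# Sketch only: tags, ports, and variable names are assumptions.
services:
  qdrant:          # vector database, isolated data layer
    image: qdrant/qdrant
    volumes:
      - qdrant_data:/qdrant/storage
  embeddings:      # dedicated TEI server for the BGE model
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
    command: --model-id BAAI/bge-small-en-v1.5
  app:             # CrewAI agents; talks to the others over the private network
    build: .
    depends_on: [qdrant, embeddings]
    environment:
      QDRANT_URL: http://qdrant:6333
      TEI_URL: http://embeddings:80
volumes:
  qdrant_data:
```

<p>Because each service only sees the others by hostname on the private network, any one of them can be scaled or replaced without touching the rest.</p>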
<p>This setup eliminates environment drift and provides a clean, professional path to deployment. It's the difference between a fragile script that only runs on one person's laptop and a reliable, scalable service ready for production.</p>
<h3 id="heading-the-first-spectacular-failure-an-efficient-idea-that-wrecked-everything">The First Spectacular Failure: An "Efficient" Idea That Wrecked Everything</h3>
<p>With our data pipeline engineered, our knowledge base built, and our entire stack containerized, I was ready. I was feeling clever. "I've got a document with 269 chunks," I reasoned. "Making 269 separate API calls is inefficient. I'll be smart! I'll 'batch' them."</p>
<p>I stuffed all 269 chunks into one massive, context-free prompt and made a single API call, expecting a neatly organized list of results. What I got was garbage.</p>
<p><img src="https://i.imgflip.com/a091bv.jpg" alt /></p>
<p>This is a classic <strong>RAG anti-pattern</strong>. LLMs work by paying "attention" to the most relevant parts of a prompt. By cramming everything into a single query, I had diluted the context to the point of uselessness. The model, faced with a sea of text, did what any overwhelmed worker would do: it found the easiest, most obvious piece of work and ignored the rest. It would spot a single keyword like "PowerShell" in one chunk and write its entire analysis on that, completely ignoring the subtle but critical details about "Kerberos ticket abuse" in the other 268 chunks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752533622845/0221be4b-f66f-409a-8c11-81801acf90a8.png" alt class="image--center mx-auto" /></p>
<p>The fix was obvious in hindsight but required a foundational shift in how we instructed our agents. I had to abandon the flawed "batch" approach for a methodical, per-chunk analysis. It was our first hard lesson: in the world of AI, the path that seems most efficient is often the one that leads directly to failure. True efficiency comes from giving the AI the clean, focused context it needs to do its job correctly, one step at a time.</p>
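<p>The fix, in sketch form: one focused prompt per chunk instead of one diluted mega-prompt. The <code>llm</code> callable and <code>retrieve_context</code> function below are hypothetical stand-ins for the real model client and vector search.</p>

```python
def analyze_report(chunks, llm, retrieve_context):
    """Analyze each chunk in its own call, with only its own retrieved
    context, so the model's attention is never diluted across hundreds
    of unrelated chunks."""
    results = []
    for i, chunk in enumerate(chunks):
        context = retrieve_context(chunk)  # e.g. top-k knowledge-base entries
        prompt = (
            "You are analyzing ONE excerpt of a pentest report.\n"
            f"Relevant framework context:\n{context}\n\n"
            f"Excerpt {i + 1} of {len(chunks)}:\n{chunk}\n\n"
            "Map this excerpt to the relevant techniques. Analyze nothing else."
        )
        results.append(llm(prompt))
    return results
```

<p>More API calls, yes, but each one gets clean, focused context: the "true efficiency" the lesson above is really about.</p>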
<h3 id="heading-conclusion-the-foundation-is-laid-now-the-real-chaos-begins">Conclusion: The Foundation is Laid. Now the Real Chaos Begins.</h3>
<p>I've journeyed through the unglamorous but absolutely essential work of building the foundation for an intelligent system. I've chosen my framework, built a professional data pipeline, made strategic decisions about our AI's "brain," and containerized our stack for production.</p>
<p>I thought the hard part was over. I was wrong.</p>
<p>With the foundation laid, it was time to unleash the agents. What followed was a cascade of new, more terrifying problems. I had given the agents a brain, and they were starting to use it… occasionally to get stuck in infinite loops, hallucinate entire reports out of thin air, and lie with the unflinching confidence only a machine can possess. So convincingly, in fact, that my guardrail validator agent believed the lying agent and started hallucinating as well. Talk about shared psychosis…</p>
<p>In <strong>Part 2 of this series</strong>, we’ll leave the world of data engineering and enter the messy, frustrating, and fascinating art of <strong>prompt engineering</strong>. We'll cover the brutally direct prompting techniques needed to keep agents on task, the absolute necessity of a "Validator" agent to act as our AI's conscience, and the spectacular failure that taught us to never, ever trust an AI's output without verification. Stay tuned.</p>
]]></content:encoded></item><item><title><![CDATA[Bring your own Fix - Mr.D0x inspired variation of yet another "fix" attack]]></title><description><![CDATA[Whilst I was browsing Twitter, erhm I mean X today, I saw a cool tweet from Mr.D0x about a new variation of a “fix” attack by abusing the internet browser.
https://twitter.com/mrd0x/status/1937178552471810320
 
I jokingly replied inb4 “downloadfix” a...]]></description><link>https://redteamer.tips/bring-your-own-fix</link><guid isPermaLink="true">https://redteamer.tips/bring-your-own-fix</guid><category><![CDATA[redteaming]]></category><category><![CDATA[infosec]]></category><dc:creator><![CDATA[Jean-Francois Maes]]></dc:creator><pubDate>Mon, 23 Jun 2025 20:40:28 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://t4.ftcdn.net/jpg/05/30/38/85/360_F_530388526_zYgwiJHnIexYAXtQ1CXe6dP7DSLum8k2.jpg" alt="Imagens de Malware Download – Explore Fotografias do Stock, Vetores e  Vídeos de 3,511 | Adobe Stock" class="image--center mx-auto" /></p>
<p>Whilst I was browsing Twitter, erhm I mean X today, I saw a cool tweet from Mr.D0x about a new variation of a “fix” attack by abusing the internet browser.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://twitter.com/mrd0x/status/1937178552471810320">https://twitter.com/mrd0x/status/1937178552471810320</a></div>
<p>I jokingly replied inb4 “downloadfix”, as I immediately thought: hey, we typically escape browsers in kiosk breakouts by, for example, navigating to the Downloads folder. Whilst this started out as a joke, I thought to myself: heh, this could be an interesting experiment.</p>
<p>Nowadays, when I have a crazy idea, I find myself grabbing a good ol’ LLM and pitching the idea to it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750709988940/20f17e0b-25b5-46dc-a45a-f3a7d4ce9e9e.png" alt class="image--center mx-auto" /></p>
<p>Thanks, ChatGPT, for being super duper helpful!<br />Whilst ChatGPT wasn’t very fond of the idea, I found myself using AI Studio instead, as Gemini 2.5 Pro is a boss: its guardrails can be weakened, and it has a huge context window as well.</p>
<p>The conversation was interesting to say the least…<br />Below is an extract of the interesting part of the conversation with Gemini.</p>
<p><strong>Here's how you can achieve this locally using "pure JS" in a sense, with the help of a Service Worker:</strong></p>
<p>A <strong>Service Worker</strong> acts as a client-side proxy. It can intercept network requests made by your page (including download requests) and decide how to respond. This is the closest you'll get to "rugpulling" the download from the client-side.</p>
<p><strong>How it would work:</strong></p>
<ol>
<li><p><strong>index.html:</strong></p>
<ul>
<li><p>Registers the Service Worker (sw.js).</p>
</li>
<li><p>Has a button that, when clicked, initiates a download to a specific URL (e.g., <code>/download-me-corrupted.zip</code>). This URL doesn't have to exist as a real file on a server; the Service Worker will handle it.</p>
</li>
</ul>
</li>
<li><p><strong>sw.js (Service Worker):</strong></p>
<ul>
<li><p>Listens for fetch events.</p>
</li>
<li><p>If it sees a request for <code>/download-me-corrupted.zip</code>:</p>
<ul>
<li><p>It starts fetching a <em>real</em> (larger) file from somewhere (could be a local path if you're serving your site locally with a simple HTTP server, or a public URL).</p>
</li>
<li><p>It gets the ReadableStream of the real file's response.</p>
</li>
<li><p>It constructs a <em>new</em> Response to send back to the page. The body of this new response will be a new ReadableStream that the Service Worker controls.</p>
</li>
<li><p>The Service Worker starts reading chunks from the real file's stream and writing them to its controlled stream.</p>
</li>
<li><p><strong>After a short delay or a certain number of bytes, the Service Worker stops writing to its stream and/or explicitly errors out its stream.</strong></p>
</li>
<li><p>The browser's download manager, which is consuming the stream provided by the Service Worker, will then encounter an unexpected end-of-stream or an error, leading to a "Failed - Network Error" or similar.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>This was all I needed to make this work. Now below is just a PoC, but you could combine this information with “clickfix” or probably even “filefix” as well :)</p>
<p>The reason there is a relatively significant delay on the “repairtool” is that we want to bypass the “this website wants to download multiple files” prompt. There is no documented timing for how long it takes for that prompt to go away, so I just played it safe and set the delay to 10 seconds. You can probably find better ways to deal with it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750710682048/ced58275-3612-4f06-a51a-e2b2ec63b95e.gif" alt class="image--center mx-auto" /></p>
<p>TLDR: With AI and inspiration, it becomes trivial to rapidly prototype new TTPs.<br />If you can “think of it”, you can build it usually within a few minutes or hours with some good prompting.</p>
<p>Source code here:</p>
<p><a target="_blank" href="https://github.com/jfmaes/downloadfix/tree/main">https://github.com/jfmaes/downloadfix</a></p>
]]></content:encoded></item><item><title><![CDATA[Sorry if it's hard to catch my vibe -Building the Dumbest (Yet Smartest?) C2 in Existence]]></title><description><![CDATA[We’re back like we never left.
Hello, dear readers! It's been a while.
Unfortunately, the old redteamer.tips had an unexpected demise due to a provider mishap—and, regrettably, backups weren't available. RIP old blog. But fear not; we’re back and rea...]]></description><link>https://redteamer.tips/sorry-if-its-hard-to-catch-my-vibe-building-the-dumbest-yet-smartest-c2-in-existence</link><guid isPermaLink="true">https://redteamer.tips/sorry-if-its-hard-to-catch-my-vibe-building-the-dumbest-yet-smartest-c2-in-existence</guid><dc:creator><![CDATA[Jean-Francois Maes]]></dc:creator><pubDate>Wed, 30 Apr 2025 16:52:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745946876627/6ef2a7e2-b0bf-4cd9-939a-e9793d04f012.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-were-back-like-we-never-left">We’re back like we never left.</h2>
<p>Hello, dear readers! It's been a while.</p>
<p>Unfortunately, the old redteamer.tips had an unexpected demise due to a provider mishap—and, regrettably, backups weren't available. RIP old blog. But fear not; we’re back and ready to dive into exciting new territory!</p>
<p>This blog post started with a simple research question:</p>
<blockquote>
<p>How well can the “well known” AI models that we love (GPT, Claude, Gemini etc…) deal with “new” research?</p>
</blockquote>
<p>It was intriguing to me because, of course, more recent data means the models haven’t ingested it as part of their training data, meaning they (the models) would have to reason and figure out how to ingest (and digest) the data presented.</p>
<p>As it turns out, I had an interesting idea:</p>
<p>In my role as Director of Offensive Operations at Cytadel, I oversee offensive security consulting and help guide the development of our Ethical Ransomware product, Cytadel R3 (Redefining Ransomware Resilience). Among other features, R3 enables execution of ransomware-related TTPs. Although R3 isn’t written in C#, I realized an AI-driven experiment could provide a minimalist proof-of-concept (PoC), potentially transferable into our <em>CLASSIFIED INFORMATION—Nice Try!</em>.</p>
<h2 id="heading-introducing-the-dumbest-c2-in-existence-or-is-it-the-smartest">Introducing: The Dumbest C2 in Existence! (Or is it the smartest?)</h2>
<p>In an era where Command and Control (C2) frameworks have become increasingly sophisticated, simplicity can be revolutionary.</p>
<p>Imagine a minimalist C2 agent devoid of built-in functionality but dynamically extensible via LLM-assisted capability generation. Thus, the "dumbest C2 in existence" project was born.</p>
<p>Before I go further, I first need to talk a bit more about “LLM-assisted capability generation.” So allow me to introduce…</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745947051139/97dd2626-4450-446f-b9d7-77a24f581dde.jpeg" alt /></p>
<h2 id="heading-chatgpmcp">ChatGPMCP</h2>
<p>The new “hype” nowadays is MCP, also known as <a target="_blank" href="https://modelcontextprotocol.io/introduction">Model Context Protocol</a>. To quote Anthropic directly</p>
<blockquote>
<p>MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.</p>
</blockquote>
<p>Essentially, traditional AI models were limited primarily to chat-based interactions. However, recent advances (Cursor, GitHub Copilot, WindSurf, and more) empower LLMs with new abilities, known as "tools," enhancing interactions beyond mere text.</p>
<p>Driven by curiosity, I challenged ChatGPT-4o (henceforth "Chat") to handle fresh context:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745946078734/06274cc6-9cfa-444e-88e7-fab30265e853.png" alt class="image--center mx-auto" /></p>
<p>Not off to the best of starts…<br />(Although I liked the fact that, because I put this chat in a project folder called “dumbest c2 in existence”, ChatGPT (which I will from now on just call “Chat”) assumed it was a C2 in C#.)</p>
<p>I tried giving it some context, to see if that would do anything in this model, and it actually did!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745946357479/3dd26cd2-923a-485b-87c8-8f5274b02d8e.png" alt class="image--center mx-auto" /></p>
<p>What caught my eye was the fact that they had an SDK in C#.</p>
<p><img src="https://media.licdn.com/dms/image/v2/D4D12AQEDJL04ETb-ew/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1666713509952?e=2147483647&amp;v=beta&amp;t=kJn09cUE8l3HvC1x2CyCKtB-_1vJVgnsslBkO04F5kU" alt="I Love it When a Plan Comes Together" /></p>
<h2 id="heading-time-to-make-a-plan">Time to make a plan</h2>
<p>I wanted to see if Chat can deal with multiple things in one prompt. I also fed Chat an Image to see if it would recognize it somehow.</p>
<blockquote>
<p>I want to create a very basic POC for a C# C2 but here is the thing.<br />The C2 does nothing. it has no built in commands, the only thing it does is establish HTTP(S) comms to the server side.</p>
<p>what makes this interesting is that I want this C2 to be capeable of dynamically loading and unloading new capabilities. a bit like the mythic c2 framework does. <a target="_blank" href="https://docs.mythic-c2.net/~gitbook/image?url=https%3A%2F%2Fcontent.gitbook.com%2Fcontent%2FKbzfKI4qhfyU4PSI7wrx%2Fblobs%2FS9JtQKt3pZx5yFDA1Ihf%2FScreen%2520Shot%25202021-12-02%2520at%25203.19.58%2520PM.png&amp;width=400&amp;dpr=3&amp;quality=100&amp;sign=7589338f&amp;sv=2">https://docs.mythic-c2.net/~gitbook/image?url=https%3A%2F%2Fcontent.gitbook.com%2Fcontent%2FKbzfKI4qhfyU4PSI7wrx%2Fblobs%2FS9JtQKt3pZx5yFDA1Ihf%2FScreen%2520Shot%25202021-12-02%2520at%25203.19.58%2520PM.png&amp;width=400&amp;dpr=3&amp;quality=100&amp;sign=7589338f&amp;sv=2</a></p>
<p>I want to do it though through an MCP so for example the c2 calls back in and I give it new C# code that it can register, and it would now have a new command!</p>
<p>so an example would this would be as follows:<br />1. agent checks back in<br />2. operator asks LLM enumerate local groups LLM generates new C# code that does group enum for windows agent<br />3. receives code agent registers new task localgroupenum<br />4. Agent can now execute the localgroupenum task.</p>
<p>Get it?</p>
</blockquote>
<p>Surprisingly, ChatGPT-4o produced an excellent response and proactively outlined a robust implementation plan without explicit prompting. The AI’s autonomous capability to structure data effectively minimized user-induced errors from poor prompting.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745949675778/15c88264-8d17-4f41-a0d8-e38c3385adba.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745949724616/94e93287-4f96-4d4d-9377-763fcfc69e7a.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745949899116/ce313149-0ccf-49f7-9b47-a1086533d80d.png" alt class="image--center mx-auto" /></p>
<p>The quality of the code generated by 4o (which is not considered the “golden goose” of code generation; the newer 4.1 models and even o3 both still score slightly below Sonnet on coding benchmarks) was unexpectedly solid, employing MVC patterns, Dependency Injection, and class separation. That is remarkably sophisticated for an unprompted minimalist PoC.</p>
<p>Chat then <strong>PROACTIVELY</strong> asked me if it should generate a small console app to make end to end testing easier.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745950341059/04658a21-e7e6-402c-8f74-c504bf1a0cd4.png" alt class="image--center mx-auto" /></p>
<p>There were a few tiny bugs in the generated code, but nothing extremely significant, and Chat solved most of them in 1 prompt.</p>
<p><em>Sidenote: recently (the last 3 months) I’ve been primarily using Cursor, but now that WindSurf announced they got rid of flowcredits and temporarily allowed users to use the latest GPT models for free, I made the switch. However, these AI-integrated assistants ALL seem to have the same problem, which is that when you want to modify something, the AI modifies WAY TOO MUCH. Some users reported losing quite a bit of work (of course, not that big of a deal if you are using proper version control, but still). Taking the GPT chat approach instead of the agentic approach actually seems to work rather nicely, even though the “official” GPT web application does not support ALL of the models they expose in Windsurf/Cursor.</em></p>
<p>However, I did notice that once we hit a semi-persistent bug (the server tasked something to the agent, but the agent did not receive the tasking correctly), Chat did seem to have issues pinpointing it, requiring me to step in and guide it (slightly more than I think should be needed) towards the root cause. I wonder if it has to do with context limits, but I can’t confirm that at this time.</p>
<h2 id="heading-but-everything-changed-when-the-fire-nation-attacked">But everything changed when <s>the fire nation attacked</s></h2>
<p>the LLM started hallucinating.</p>
<p>As much as I was impressed with the code that was generated so far, things took a horrible turn for the worse when it was time to actually implement the MCP server. I could already see it was going to go sideways from the first header Chat generated.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746006421034/8ae63205-3bec-4f75-8dc9-7ed835dd5400.png" alt class="image--center mx-auto" /></p>
<p>While this is not true, the code it produced actually made sense, and amusingly enough, I actually liked some of the implementation details that came back. For example, the concept of exposing a “Manifest”:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746006574713/d81361d6-15c1-460f-b887-f75a60b26e22.png" alt class="image--center mx-auto" /></p>
<p>I tried steering it back into the right direction by pointing it towards the official reference documentation once again, but that made matters worse…</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746006669958/6b2882f6-ca05-43d7-ac21-f7a8170a1615.png" alt class="image--center mx-auto" /></p>
<p>That’s when I decided to change the model to O3 instead of 4o. As O3 is a “reasoning” model, I figured it might be better equipped to actually “think” about the reference spec and come up with a solution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746006795032/8e18a7e6-eb34-4ff6-bcb0-e3c8eab7af26.png" alt class="image--center mx-auto" /></p>
<p>This worked reasonably well, except even O3 made one crucial mistake. It was trying to combine my C2 server and MCP server into one, which I honestly thought was a pretty solid idea.</p>
<p>The problem with that approach, though, is that MCP servers (at least over the stdio transport) reserve STDOUT for protocol messages and do not want anything else piped into it. This means your own console output should be completely empty, which is not ideal if your server is a console application and doesn't send data to a nice frontend.</p>
<p>Once switched to O3, the code generation worked pretty well again and actually implemented a decent MCP server.</p>
<p>What was really mind-blowing to me, though, is that I had issues with getting Claude to interpret my MCP server (even though I split it out, <code>dotnet run --nobuild</code> still borked Claude). I fed the information to ChatO3, and it came up with a fix that wasn’t even documented in the official reference spec for MCP servers. (The fix was to compile it to an exe and point Claude to the exe instead of dotnet run.)</p>
<p><img src="https://i.imgflip.com/9sf5a9.jpg" alt /></p>
<h2 id="heading-to-conclude">To Conclude</h2>
<p>When it comes to using LLMs for quickly prototyping, I am pleasantly surprised by the results. You need minimal coding knowledge to create a coherent proof of concept. Knowing when to switch models for specific tasks can definitely boost productivity.</p>
<p>I prefer the separate GPT - Code Editor approach because it gives the prompter full control over the project. While an agent-based approach might be the future, my experiences with Cursor and Windsurf show that it's a delicate balance between increased productivity and agents being too eager to tweak code, delete functionality, or modify too many files. Some of these issues can be addressed with "rules," but for now, I believe the separate approach is best. Note that you can also "chat" with LLMs in integrated IDEs like Cursor, but it uses the same credits as agent actions, which isn't ideal for cost-effectiveness.</p>
<p>At the end of our experiment, we have a basic proof of concept working. Of course, this is not ready for production. We lack session management, task IDs, and agent IDs, and we rely on console.print in the agent instead of sending the output back to the server, so there's definitely still a long way to go. However, considering it took about 4 hours to create, I find this quite impressive.</p>
<p>You can find the source code for the project here:</p>
<p><a target="_blank" href="https://github.com/Cytadel-Cyber/BlogPosts/tree/main/DumbestC2Ever">https://github.com/Cytadel-Cyber/BlogPosts/tree/main/DumbestC2Ever</a></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://jmp.sh/s/UZRyPP12WpVEYGC79yVl">https://jmp.sh/s/UZRyPP12WpVEYGC79yVl</a></div>
]]></content:encoded></item></channel></rss>