Guardrails are not an AI policy. They are a technical layer you must test

When a company says “we have AI guardrails”, it usually means one of three different things — and that confusion is exactly where a false sense of safety comes from. An AI policy, legal compliance and technical safeguards are three separate layers. Guardrails are the last one, the most concrete. And they are most often treated like a switch you flip and never test.

Three things the market confuses

An AI policy is a document: what the company allows, what it forbids, who is responsible. Important — but a document blocks nothing. The model never reads it.
Compliance (e.g. with the EU AI Act) is a legal obligation: risk classification, human oversight, documentation. It tells you what you must ensure, not how.
Guardrails are a technical layer in your application’s runtime: code and configuration that actually inspect the model’s input and output before it reaches the user.

A policy says “we don’t disclose personal data”. A guardrail is the thing that actually detects and masks a national ID number in the model’s response — or fails to, if it is misconfigured. The difference between declaring and enforcing lives right here.

What guardrails really are (using AWS Bedrock Guardrails)

To avoid hand-waving, look at a concrete implementation. On AWS we use Amazon Bedrock Guardrails, which provide several independent layers of control:

Content filters — block harmful content (hate, violence and more) at a configurable strength, including a dedicated prompt-attack filter (jailbreaks, prompt injection, prompt leakage).
Denied topics — define subjects the system must not engage with (e.g. advice you are not allowed to give).
Sensitive-information filters — PII detection with the option to block or mask it (predefined types plus your own regex patterns).
Contextual grounding checks — verify that a response stays anchored to its source instead of making things up. This is a technical hallucination check.
Automated reasoning checks — formal, logic-based verification that a response complies with rules you define (generally available since 2025).

Crucially, these safeguards run on input and output, and through the ApplyGuardrail API you can apply them independently of the model — including to models outside Bedrock (self-hosted, third-party). So guardrails are not “one vendor’s feature”; they are a separate layer you design.

Turning guardrails on is the start, not the finish

Here is the part most teams skip. Enabling a filter does not mean it behaves the way you assume.

The prompt-attack filter detects attack-like patterns, but it does not guarantee catching every attack. That kind of detection always has a gap — it is a filter, not an impenetrable shield.
A filter set too strong produces false positives: it blocks legitimate customer queries and breaks the product. Too weak — it lets through what it was meant to stop. You only learn the right threshold on your own data.
Grounding has thresholds you have to tune; PII filters have to be configured for your domain — entity types, a block-or-mask action, custom patterns. Defaults rarely fit a specific case.

This is not us arguing against the vendor — it is the approach AWS itself documents as best practice: run the guardrail in detect mode on representative traffic, start with a high filter strength, find the false positives, tune. Then monitor it in production with metrics (CloudWatch). In other words: a guardrail is something you measure, not merely switch on.

Without tests and evaluation, guardrails are decoration

If a guardrail has to be tuned, then you need to know whether it is tuned well. That means evaluation, not declaration.

A test set: adversarial prompts, jailbreak attempts, PII examples, out-of-scope questions. Without it you don’t know what the guardrail lets through.
A metric: how many real threats it catches versus how many legitimate queries it blocks along the way. “Enabled” is not a metric.
Regression: change the model or the prompt and the whole balance can shift. Guardrails must be re-tested after every meaningful change.

A guardrail with no test set and no metric is a configuration you assume works. In security, assumption is not the state you want to be in.

Guardrails are a process, not a project — they need an owner

The last and most overlooked element: an owner. The model changes, attacks evolve, the inputs change. A guardrail that was good six months ago may now under-block or over-block. Someone has to watch it, react to the metrics and decide when thresholds change.

This is the same requirement the AI Act places on high-risk systems as human oversight (Art. 14): a real person who understands the system’s output and can stop it. Guardrails are the technical side of that same problem. Without an assigned owner, every alert belongs to no one, and the “safety layer” slowly turns into decoration.

How we approach this in practice

In mojApteczka — the production AI system we built — safeguards are not a declaration but something we measure on a validation set. AI extraction of data from a medicine package reaches 100% accuracy on the validation set (n=200), and answers are grounded in sources rather than invented. No magic — the same discipline: define what you expect, test it on data, measure, tune.

In AWS projects we implement guardrails as a native Amazon Bedrock layer, not a bolted-on extra — and treat them exactly the same way: as something to measure, not to declare.

What’s next

If you run (or are building) an AI system and don’t know whether its guardrails actually work — get in touch. Safeguards are also part of AI Act compliance and of hallucination control in RAG on company documents. In every one of those cases the rule is the same: you have to test the technical layer before you call it safety.