Decode EU AI Act Article 13 for Engineers: The Instructions-for-Use Spec You Actually Need

The EU AI Act is forcing a shift that most engineering teams haven’t had to make before:

Before you “add compliance,” you have to read the spec.

Article 13 is often summarized as “be transparent.” That summary hides the real requirement.

Article 13 is an “instructions for use” requirement. It’s telling providers what documentation must exist so deployers can interpret outputs and use the system appropriately — including, when relevant, how deployers can collect and interpret logs.

If you’ve already seen the generic “log your LLM calls” advice, set it aside for now. This post is the deeper version: what Article 13 actually says, and the concrete artifacts an engineering team can ship.

What Article 13 Actually Says (And Why It’s Not Just “Logging”)

Here’s the core obligation, in the text of Article 13(1):

“High-risk AI systems shall be designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system’s output and use it appropriately.”

And then Article 13(2) makes it operational:

“High-risk AI systems shall be accompanied by instructions for use … that include concise, complete, correct and clear information … accessible and comprehensible to deployers.”

In other words: Article 13 is primarily about the documentation pack. If your “documentation” is a slide deck or a marketing page, you’re not doing Article 13 — you’re doing positioning.

The punchline:

If you can’t write honest instructions for use, you don’t understand your system well enough to deploy it safely.

The Article 13 Deliverable: An Instructions-for-Use (IFU) Pack

Article 13(3) lists what must be in the instructions for use. For engineering teams, it’s easiest to translate each item into an artifact you can generate and maintain.

Below is a practical IFU template for LLM-powered systems that fall within the Act’s high-risk category, which is the scope Article 13 applies to.

1) Provider identity and contacts (13(3)(a))

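Artifact: a “system dossier” page with a named owner, listing the provider’s identity and contact details (plus the authorised representative, where applicable), a security contact, and an escalation path.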

2) Intended purpose (13(3)(b)(i))

Artifact: an “intended use” statement you can test against.

For LLMs, write this as:

tasks allowed (e.g., “summarize customer tickets”, “draft responses for human review”)
tasks explicitly not allowed (e.g., “approve loans automatically”, “make hiring decisions”)
required operating constraints (language, jurisdictions, data categories)
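One way to make the statement testable is to keep a machine-readable copy next to the prose version. The sketch below is a minimal illustration under stated assumptions: the IntendedUse class, the task names, and the language codes are all hypothetical, and none of them come from the Act or from any library.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class IntendedUse:
    """Machine-readable intended-use statement (all names are illustrative)."""
    allowed_tasks: FrozenSet[str]
    prohibited_tasks: FrozenSet[str]
    languages: FrozenSet[str]

    def check(self, task: str, language: str) -> Tuple[bool, str]:
        """Return (allowed, reason) for a requested task and language."""
        if task in self.prohibited_tasks:
            return False, f"task '{task}' is explicitly prohibited"
        if task not in self.allowed_tasks:
            return False, f"task '{task}' is outside the intended purpose"
        if language not in self.languages:
            return False, f"language '{language}' is outside operating constraints"
        return True, "within intended use"


# Hypothetical values mirroring the examples in the list above.
INTENDED_USE = IntendedUse(
    allowed_tasks=frozenset({"summarize_ticket", "draft_reply_for_review"}),
    prohibited_tasks=frozenset({"approve_loan", "make_hiring_decision"}),
    languages=frozenset({"en", "de"}),
)
```

A check like this could run in CI against the list of features you expose, and again at runtime before a request reaches the model.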

3) Accuracy metrics + robustness + cybersecurity, plus known circumstances that affect them (13(3)(b)(ii))

Artifact: an evaluation report and a repeatable eval harness.

For LLMs, “accuracy” isn’t a single number. Your IFU should include:

what you measure (e.g., exact match / rubric score / semantic similarity / policy violation rate)
baseline numbers and confidence intervals (where applicable)
known degraders (context length spikes, retrieval failures, provider rate limits, model updates)
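To make “baseline numbers and confidence intervals” concrete, here is a small sketch of an exact-match eval with a Wilson score interval on the pass rate. Exact match and the Wilson interval are just one reasonable choice, not something the Act prescribes; the function names and the predict callable are assumptions for illustration.

```python
import math
from typing import Callable, Iterable, Tuple


def run_exact_match_eval(
    cases: Iterable[Tuple[str, str]], predict: Callable[[str], str]
) -> Tuple[int, int]:
    """Return (passes, total) using exact match after trimming whitespace."""
    passes = total = 0
    for prompt, expected in cases:
        total += 1
        if predict(prompt).strip() == expected.strip():
            passes += 1
    return passes, total


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Reporting the interval alongside the point estimate keeps a small eval set from looking more conclusive than it is.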

4) Known/foreseeable misuse and failure modes that can create risk (13(3)(b)(iii))

Artifact: a failure-mode catalog.

For LLM systems, this is where you document:

hallucination modes (fabricated citations, false certainty)
prompt injection / jailbreak risk
PII leakage risk (inputs and outputs)
unsafe advice risk (domain-specific)

5) Explainability / output interpretation guidance (13(3)(b)(iv) and 13(3)(b)(vii))

Artifact: “How to read the output” guidance for deployers.

For LLMs, that can include:

what the output is (suggestion vs decision)
what confidence signals mean (if you emit them)
required post-processing (format validation, policy checks, human review)
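The “required post-processing” item can be as simple as a format check that runs before anything is shown to a human reviewer. The sketch below assumes the model is asked to return a JSON object; the required keys are hypothetical and would come from your own output contract.

```python
import json
from typing import Any, Dict

# Hypothetical output contract for a ticket-drafting feature.
REQUIRED_KEYS = {"summary", "suggested_reply"}


def validate_output(raw: str) -> Dict[str, Any]:
    """Parse a model response and report format problems instead of raising."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"valid": False, "errors": [f"not valid JSON: {exc.msg}"], "payload": None}
    if not isinstance(payload, dict):
        return {"valid": False, "errors": ["top-level value must be an object"], "payload": None}
    missing = sorted(REQUIRED_KEYS - set(payload))
    errors = [f"missing key: {key}" for key in missing]
    return {"valid": not errors, "errors": errors, "payload": payload}
```

A passing result only says the output is well-formed. It is still a suggestion, and the IFU should say so.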

6) Group-specific performance (when appropriate) (13(3)(b)(v))

Artifact: fairness/performance slice report (only if relevant to your use case).

This is the uncomfortable part, but it’s better than pretending it doesn’t exist: document where performance differs across languages, dialects, user segments, or content types.
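A slice report does not need special tooling to get started. As a rough sketch, assuming each eval result is tagged with a slice label such as a language code (the labels and function name here are illustrative):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_slice(
    results: Iterable[Tuple[str, bool]]
) -> Dict[str, Tuple[int, int, float]]:
    """Map each slice label to (passes, total, pass_rate)."""
    counts = defaultdict(lambda: [0, 0])
    for slice_name, passed in results:
        counts[slice_name][0] += int(passed)
        counts[slice_name][1] += 1
    return {
        name: (passes, total, passes / total)
        for name, (passes, total) in sorted(counts.items())
    }
```

Small slices produce noisy rates, so it helps to publish the counts together with the percentages.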

7) Input data specifications / relevant training/validation/testing dataset info (when appropriate) (13(3)(b)(vi))

Artifact: an input contract.

For LLM apps this often means:

which data types are allowed in prompts (and which are forbidden)
how you handle personal data (allowed categories, redaction rules)
constraints on retrieved context (sources, freshness, access control)
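An input contract is easier to maintain when it is also executable. The following is a minimal sketch, assuming hypothetical source names, a hypothetical 30-day freshness limit, and hypothetical forbidden data categories; your own contract will differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Set

# All three constants are illustrative assumptions, not recommendations.
ALLOWED_SOURCES = {"kb", "ticket_history"}
MAX_AGE = timedelta(days=30)
FORBIDDEN_CATEGORIES = {"health", "biometric"}


@dataclass(frozen=True)
class RetrievedDoc:
    source: str
    fetched_at: datetime  # expected to be timezone-aware


def check_input(data_categories: Set[str], docs: List[RetrievedDoc], now: datetime) -> List[str]:
    """Return a list of contract violations (empty means the input is acceptable)."""
    violations = [
        f"forbidden data category: {category}"
        for category in sorted(data_categories & FORBIDDEN_CATEGORIES)
    ]
    for doc in docs:
        if doc.source not in ALLOWED_SOURCES:
            violations.append(f"source not allowed: {doc.source}")
        if now - doc.fetched_at > MAX_AGE:
            violations.append(f"stale context from: {doc.source}")
    return violations
```

Returning every violation, rather than failing on the first, makes the result more useful as evidence later.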

8) Pre-determined changes (13(3)(c))

Artifact: a “change surface” list.

For LLM systems this should include:

model version changes (provider-side and your pinned versions)
prompt version changes (what can change, who approves)
tool/plugin changes (new tools can change outputs dramatically)
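One lightweight way to track the change surface is to fingerprint the combination of model, prompt version, and tools, so that any change yields a new identifier you can attach to traces and eval reports. This is a sketch under assumptions: the manifest fields and the function name are hypothetical.

```python
import hashlib
import json
from typing import List


def change_fingerprint(model: str, prompt_version: str, tools: List[str]) -> str:
    """Stable SHA-256 fingerprint of the parts of the system that can change."""
    manifest = {
        "model": model,
        "prompt_version": prompt_version,
        "tools": sorted(tools),  # tool order should not affect the fingerprint
    }
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If two runs have different fingerprints, you know something on the change surface moved, even before you know what.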

9) Human oversight measures (13(3)(d))

Artifact: an oversight playbook.

Examples:

where humans must approve before action is taken
escalation paths for violations or unsafe outputs
operational controls (rate limits, feature flags, kill switch)
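The oversight playbook maps naturally onto a small decision gate in front of any action the system can take. The sketch below is illustrative only; the action names, the Decision enum, and the gate function are assumptions rather than a prescribed design.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    BLOCKED = "blocked"


# Hypothetical actions that must never run without a human sign-off.
HIGH_IMPACT_ACTIONS = {"send_email", "issue_refund"}


def gate(action: str, kill_switch_on: bool, approved_by: Optional[str] = None) -> Decision:
    """Decide whether an action may run, needs human approval, or is blocked."""
    if kill_switch_on:
        return Decision.BLOCKED  # the kill switch overrides everything else
    if action in HIGH_IMPACT_ACTIONS and not approved_by:
        return Decision.NEEDS_APPROVAL
    return Decision.EXECUTE
```

Checking the kill switch first means an operator can stop the system even when approvals are already in place.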

10) Compute/resources, lifetime, maintenance and updates (13(3)(e))

Artifact: an operational runbook (SRE-style).

Include:

expected latency envelope and scaling assumptions
dependency update cadence (model/provider SDKs)
rollback strategy and “safe mode”

11) Logging mechanisms (when relevant) (13(3)(f) referencing Article 12)

This line is the one engineering teams should underline:

“Where relevant, a description of the mechanisms … that allows deployers to properly collect, store and interpret the logs…”

Artifact: a logging/trace specification.

This is where observability becomes legally useful — but only as a supporting system for the IFU pack.

So What Should You Log? (The Minimal Evidence Record)

When you do need logs/traces, keep them structured and purposeful. For each meaningful LLM call (or agent step), capture:

1) Identity + correlation

trace_id (unique)
span_id and parent_span_id (if you have multi-step workflows)
request_id (your app’s request correlation id)
organization_id (tenant boundary)
user_id (pseudonymized or tokenized if needed)

2) Time

start_time, end_time
derived duration_ms

3) Model + provider

provider (OpenAI / Anthropic / Bedrock / Azure OpenAI / …)
model (exact model identifier)
optional: region / deployment name (Azure), model ARN (Bedrock), etc.

4) Cost + usage

input_tokens, output_tokens, total_tokens
estimated_cost_usd (or your internal cost allocation)

5) The “what happened” envelope

status (ok / error / blocked)
a compact set of attributes/tags (feature name, environment, app version)
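Pulled together, the fields in groups 1 to 5 fit in a single record type. This is a minimal sketch using the field names from the lists above; the EvidenceRecord class itself, the ID formats, and the sample values are assumptions, not a standard schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EvidenceRecord:
    request_id: str
    organization_id: str
    user_id: str  # pseudonymized or tokenized upstream if needed
    provider: str
    model: str
    start_time: datetime
    end_time: datetime
    input_tokens: int
    output_tokens: int
    status: str = "ok"  # ok / error / blocked
    estimated_cost_usd: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None

    @property
    def duration_ms(self) -> float:
        return (self.end_time - self.start_time).total_seconds() * 1000

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def to_json(self) -> str:
        """Serialize to one JSON line, including the derived fields."""
        data = asdict(self)
        data["start_time"] = self.start_time.isoformat()
        data["end_time"] = self.end_time.isoformat()
        data["duration_ms"] = self.duration_ms
        data["total_tokens"] = self.total_tokens
        return json.dumps(data, sort_keys=True)


# Example with made-up values.
start = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
record = EvidenceRecord(
    request_id="req-123", organization_id="org-a", user_id="u-7f3a",
    provider="example-provider", model="example-model-v1",
    start_time=start, end_time=start + timedelta(milliseconds=250),
    input_tokens=120, output_tokens=30,
    attributes={"feature": "ticket_summary", "environment": "prod"},
)
```

Deriving duration_ms and total_tokens from the stored fields, rather than storing them separately, avoids records that disagree with themselves.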

6) Content capture strategy (explicit)

You need a policy for whether you store:

the full prompt/completion content,
a hashed representation,
a redacted version,
or nothing at all (only metadata).

The critical requirement is: your capture policy must be deliberate and defensible, and it must be consistent with your data protection obligations.
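The four options can be expressed as an explicit mode, so the choice is visible in code review rather than buried in a logger call. The sketch below is a simplified illustration: the CaptureMode enum and capture function are hypothetical, the redaction covers email addresses only, and the hashed mode uses a keyed hash (HMAC) because plain hashes of short or predictable text can often be guessed.

```python
import hashlib
import hmac
import re
from enum import Enum
from typing import Optional


class CaptureMode(Enum):
    FULL = "full"
    HASHED = "hashed"
    REDACTED = "redacted"
    NONE = "none"


# Deliberately narrow example pattern; real redaction needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def capture(text: str, mode: CaptureMode, key: bytes) -> Optional[str]:
    """Return what should be stored for this prompt or completion, if anything."""
    if mode is CaptureMode.FULL:
        return text
    if mode is CaptureMode.HASHED:
        return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()
    if mode is CaptureMode.REDACTED:
        return EMAIL_RE.sub("[EMAIL]", text)
    return None  # CaptureMode.NONE: metadata only
```

Whichever mode you pick, record the mode itself on the trace so a reviewer can tell why content is or is not present.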

The Common Mistake: Treating Article 13 as a Logging Project

If you read Article 13 carefully, you’ll notice something:

it does not say “log everything”
it says “enable deployers to interpret outputs and use the system appropriately”
and it mandates an IFU pack that includes performance limits, failure modes, oversight, and (when relevant) logging mechanisms

Logging still matters (Article 12 requires high-risk systems to support automatic recording of events), but it’s not sufficient. The IFU is the core deliverable.

PII and Policy: “Transparency” Without Safeguards Is a Liability

If personal data can appear in prompts or completions, you need two capabilities:

Detection (know when PII is present)
A policy response (log, warn, block, redact, alert)

This is why “we have logs” isn’t enough. Logs tell you something happened. They don’t tell you whether it violated a policy.

A robust pattern is to detect at the egress boundary, in the HTTP layer, before data leaves your infrastructure for an external provider. That gives you a single enforcement point across all apps and services.
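Here is a rough sketch of the two capabilities working together: detection first, then a policy decision applied before the request is sent. It is intentionally simplified. The two regex detectors, the policy mapping, and the enforce function are assumptions for illustration, and only three of the possible responses (allow, redact, block) are modeled.

```python
import re
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"


# Illustrative detectors only; production PII detection needs far more coverage.
DETECTORS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card_number": re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),
}

# Hypothetical policy: which response applies to which finding.
POLICY: Dict[str, Action] = {"email": Action.REDACT, "card_number": Action.BLOCK}


def enforce(
    text: str, policy: Dict[str, Action] = POLICY
) -> Tuple[Action, Optional[str], List[str]]:
    """Return (action, text_to_send, findings); text_to_send is None when blocked."""
    findings = [name for name, pattern in DETECTORS.items() if pattern.search(text)]
    if any(policy.get(name) is Action.BLOCK for name in findings):
        return Action.BLOCK, None, findings
    outgoing, redacted = text, False
    for name in findings:
        if policy.get(name) is Action.REDACT:
            outgoing = DETECTORS[name].sub(f"[{name.upper()}]", outgoing)
            redacted = True
    return (Action.REDACT if redacted else Action.ALLOW), outgoing, findings
```

The returned findings list is what turns a log entry into a policy record: it says what was detected and which response was applied.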

A Better “This Week” Plan: Write the IFU First

If you want a practical starting point that doesn’t just repeat the generic logging advice:

Draft the IFU using the template above (even if it’s ugly).
For each IFU section, write the one artifact you’ll maintain (eval report, failure mode catalog, oversight playbook, logging spec).
Only then implement logging/tracing/governance features that populate those artifacts automatically.

How GetStackLens Maps to Article 13 (Without Hand-Waving)

GetStackLens splits the problem into two cooperating systems:

StackTrace (observability): every LLM call becomes a structured trace (model, provider, tokens, cost, latency, attributes/tags).
GovernAI (governance): PII detection + policy decisions, correlated back to the trace.

The key is correlation. The trace is the record. Governance outcomes attach to it.

If you also use FlowOps for prompt versioning, you can link a trace to the exact prompt version that ran — which is the difference between “we changed something” and “we know what changed.”

The Honest Reality

Most teams won’t be “done” by August 2026, when the Act’s requirements for most high-risk systems are scheduled to start applying.

But regulators (and customers) can tell the difference between:

a team with no logs, no policy, no plan, and
a team that has structured traces, basic detection, and repeatable exports.

If you do nothing else this week: implement the minimum evidence record, make it queryable, and run an export drill.
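An export drill can be rehearsed with nothing more than a local database. The sketch below uses Python’s built-in sqlite3 module with an in-memory table; the table layout, column names, and sample rows are made up for the example, and timestamps are assumed to be stored as UTC ISO 8601 strings in one consistent format so that string comparison matches time order.

```python
import json
import sqlite3


def export_traces(conn: sqlite3.Connection, organization_id: str,
                  start_iso: str, end_iso: str) -> str:
    """Return matching traces as JSON Lines, ordered by start time."""
    rows = conn.execute(
        "SELECT trace_id, model, status, start_time FROM traces "
        "WHERE organization_id = ? AND start_time >= ? AND start_time < ? "
        "ORDER BY start_time",
        (organization_id, start_iso, end_iso),
    ).fetchall()
    keys = ("trace_id", "model", "status", "start_time")
    return "\n".join(json.dumps(dict(zip(keys, row))) for row in rows)


# Drill with made-up data: one tenant, one month.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traces (trace_id TEXT PRIMARY KEY, organization_id TEXT, "
    "model TEXT, status TEXT, start_time TEXT)"
)
conn.executemany(
    "INSERT INTO traces VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "org_a", "example-model-v1", "ok", "2025-01-01T10:00:00Z"),
        ("t2", "org_a", "example-model-v1", "error", "2025-02-01T10:00:00Z"),
        ("t3", "org_b", "example-model-v1", "ok", "2025-01-02T10:00:00Z"),
    ],
)
export = export_traces(conn, "org_a", "2025-01-01T00:00:00Z", "2025-02-01T00:00:00Z")
```

The point of the drill is less the query than the timing: how long it takes to go from a request like “all traces for this tenant in January” to a file you would be comfortable handing over.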

If you want early access to StackTrace + GovernAI as we ship the rest of the platform, the waitlist is at getstacklens.ai.