Is your site agent-ready? Here's what I did to find out

The web has had two audiences for thirty years — humans and search bots. We know exactly what to give each of them: HTML and UX for humans; sitemaps, structured data, and clean URLs for crawlers. AI agents are a third audience, and they are becoming increasingly important intermediaries between content and the people who want it. I assumed my blog was fine. Running it through isitagentready.com showed how much I had assumed without actually checking.

Diagram showing the three pillars of agent readiness: Discoverability (robots.txt with Content Signals, sitemap.xml), Context (llms.txt), and Structured Data (JSON-LD, canonical, Open Graph)

Start with isitagentready.com

isitagentready.com is a free audit tool. You give it a URL and it checks a list of agent-readiness signals: sitemap, robots.txt, structured data, content signals, and a few others. My blog scored poorly — not because the HTML was broken, but because nothing had been done to make the content machine-friendly beyond the bare minimum that any halfway modern site has.

What makes the tool useful is that it flags concrete, actionable issues. You do not get a vague recommendation to "improve your discoverability" — you get a specific thing that is missing and a pointer to where to learn more. I went through the report top to bottom and fixed what I could. That does not give me a perfect score, or even close to one, because some items — like adding an MCP server — do not make sense for my site. But it tells me the site is in a good place for what it is.

robots.txt with Content Signals

Most developers know robots.txt as the place where you tell crawlers what not to index. The Disallow: /admin/ directive has been a fixture since the mid-nineties. What is new is the Content Signals draft standard, which extends robots.txt with a Content-Signal directive for declaring AI usage preferences explicitly:

User-agent: *
Allow: /
Disallow: /search/

Content-Signal: ai-train=yes, search=yes, ai-input=yes

Sitemap: https://blog.dotnetnerd.dk/sitemap.xml

Three values, each yes or no: ai-train for whether your content may be used to train models, search for standard search indexing, and ai-input for use as retrieval context in AI responses. You can opt out of any of them. I chose yes to all three — my writing is public and I want it to be useful — but the point is that the choice is now yours to declare, not left to each AI company's interpretation of silence. This is a draft RFC, not finalised. But the tooling is already checking for it, and the cost of adding it is one line.

sitemap.xml and llms.txt

A surprising number of sites still do not have a sitemap. Mine did not, because nobody had ever written the code to generate one. The fix was straightforward: add a generation step to the static site generator that emits a sitemap.xml on every build, with all posts and tag pages included and proper lastmod dates derived from each post's actual modification time. Dynamic by design — every new post gets added automatically.
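As a rough illustration, the generation step can be as small as a function that walks the post files and emits the XML. This is a minimal sketch, not my actual generator: the post layout, URL scheme, and `BASE_URL` are assumptions, and tag pages are omitted for brevity.

```python
from datetime import datetime, timezone
from pathlib import Path
from xml.etree import ElementTree as ET

BASE_URL = "https://blog.example.com"  # assumption: your production domain

def build_sitemap(post_files, out_path="sitemap.xml"):
    """Emit a sitemap.xml with one <url> entry per post file."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for post in post_files:
        url = ET.SubElement(urlset, "url")
        # assumption: the post's URL slug is the file name without extension
        ET.SubElement(url, "loc").text = f"{BASE_URL}/{Path(post).stem}/"
        # lastmod derived from the file's actual modification time
        mtime = datetime.fromtimestamp(Path(post).stat().st_mtime, tz=timezone.utc)
        ET.SubElement(url, "lastmod").text = mtime.strftime("%Y-%m-%d")
    ET.ElementTree(urlset).write(out_path, encoding="utf-8", xml_declaration=True)
```

Hooked into the build, this is what makes the sitemap dynamic by design: every new post file shows up in the next build with no manual step.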

llms.txt is a newer convention, proposed by Jeremy Howard, for giving AI agents a structured entry point to a site. Where sitemap.xml is a machine-readable index of URLs, llms.txt is a human-and-machine-readable overview in Markdown: who runs the site, what it is about, a list of all posts with dates and excerpts, and links to any special resources. An agent that reads llms.txt can understand what your site contains without crawling every page individually.
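For a sense of the shape, a minimal llms.txt looks something like this — the post entry below is a made-up example, not a real post:

```markdown
# dotNetNerd's blog

> A personal blog about .NET and web development.

## Posts

- [Is your site agent-ready?](https://blog.dotnetnerd.dk/example-post/) (2025-01-01):
  What I learned auditing my blog against isitagentready.com.
```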

There is also llms-full.txt, which includes the full text of every post. For a blog with 200+ posts that file would be several megabytes. I skipped it — the excerpts in llms.txt combined with clean per-post URLs are sufficient. Overkill is still overkill even when it has a spec behind it.

Canonical links, Open Graph, and JSON-LD

These three are not new ideas, but they are often missing or inconsistently implemented. Canonical links tell search engines and agents which URL is authoritative for a given page — particularly important if your site deploys to multiple domains. My blog has an Azure staging URL and a production domain; without canonicals, two versions of every post were competing with each other in any index that touched both.
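The fix is one tag in the head of every page, always pointing at the production domain regardless of where the build is deployed (the URL below is a placeholder):

```html
<link rel="canonical" href="https://blog.dotnetnerd.dk/example-post/" />
```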

Open Graph tags (og:title, og:description, og:url, og:type) are how platforms and agents understand a page when they cannot render it. They have been around since 2010 and are trivially easy to add. Post pages get og:type="article" and additional article:published_time and article:modified_time metadata; everything else gets og:type="website".
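On a post page that works out to a handful of meta tags; the values below are illustrative placeholders:

```html
<meta property="og:type" content="article" />
<meta property="og:title" content="Is your site agent-ready?" />
<meta property="og:description" content="Auditing a blog for AI agents." />
<meta property="og:url" content="https://blog.dotnetnerd.dk/example-post/" />
<meta property="article:published_time" content="2025-01-01T00:00:00Z" />
<meta property="article:modified_time" content="2025-01-02T00:00:00Z" />
```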

JSON-LD with Schema.org types is the most structured of the three. A BlogPosting schema on every post gives agents machine-readable access to the title, description, author, publication date, modification date, and keywords — without parsing HTML. It is what powers Google's rich results, and it is increasingly what AI agents use to understand content type and attribution. Adding it to a static site generator is maybe thirty lines of code.
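The generated markup is a single script tag per post. A sketch of what the generator emits, with placeholder values:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Is your site agent-ready?",
  "description": "Auditing a blog for AI agents.",
  "author": { "@type": "Person", "name": "Author Name" },
  "datePublished": "2025-01-01",
  "dateModified": "2025-01-02",
  "keywords": "agents, structured data, seo",
  "mainEntityOfPage": "https://blog.dotnetnerd.dk/example-post/"
}
</script>
```

An agent or search engine reads this directly instead of inferring title, author, and dates from the rendered HTML.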

The case for automating all of it

The reason none of this existed on my blog before is that it requires effort without any immediately visible payoff. A sitemap does not change how the site looks. Open Graph tags are invisible unless you share a link. JSON-LD is only useful to machines. It is easy to deprioritise.

What made it practical was adding everything to the static site generator rather than maintaining it by hand. Now, when I publish a new post, the sitemap is updated, the post gets canonical, Open Graph, and JSON-LD markup automatically, and llms.txt includes the new excerpt. The investment was a few hours of generator work. The payoff is permanent — and it compounds as the archive grows.

If you have not audited your own site, isitagentready.com is a good place to start. The issues it finds are usually fixable in an afternoon, and most of them have been best practice for years — the agent-readiness framing just gives a useful reason to finally do them.