Also, there is a technical side to this.
As webmasters and designers, we should remember that the code makes a real difference. Even with the same content, two sites with different markup can differ greatly in AI visibility.
TL;DR
Semantic HTML + JSON-LD schema + open robots.txt + llms.txt + accurate sitemap. That covers about 80% of the technical work. The rest is content depth, freshness, and named authorship — the human side that no code trick can fake.
The basics every page needs
- A clear <title> tag. AI uses this to label the page in citations.
- A meaningful <meta name="description">. Still pulled into snippets and answer cards.
- <html lang="en"> (or your language). AI uses this to route queries.
- <meta charset="UTF-8"> and the viewport tag for mobile.
- A canonical URL: <link rel="canonical" href="...">. This stops duplicate-content confusion.
- HTTPS sitewide. Bots downgrade trust on plain HTTP.
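If you want to spot-check a page against the list above, here is a rough sketch using only Python's standard library. The class and function names (HeadAudit, missing_basics) are made up for this example, and it only checks that the tags exist, not that their contents are any good. HTTPS is not covered because that is a server matter, not markup.

```python
from html.parser import HTMLParser

# The head-level basics from the list above (HTTPS is checked elsewhere).
REQUIRED = {"title", "lang", "description", "charset", "viewport", "canonical"}


class HeadAudit(HTMLParser):
    """Records which of the basic signals appear in an HTML document."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lowercased
        name = (a.get("name") or "").lower()
        rel = (a.get("rel") or "").lower()
        if tag == "title":
            self.found.add("title")
        elif tag == "html" and a.get("lang"):
            self.found.add("lang")
        elif tag == "meta" and a.get("charset"):
            self.found.add("charset")
        elif tag == "meta" and name == "description" and a.get("content"):
            self.found.add("description")
        elif tag == "meta" and name == "viewport":
            self.found.add("viewport")
        elif tag == "link" and rel == "canonical" and a.get("href"):
            self.found.add("canonical")


def missing_basics(html):
    """Return the set of basics that the given HTML string lacks."""
    parser = HeadAudit()
    parser.feed(html)
    return REQUIRED - parser.found
```

Feed it the source of a published page (view source, copy, paste into a string) and it returns the names of whatever is missing.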
Semantic HTML over div soup
AI parses your page by reading the tags. Real semantic markup helps it find the actual content:
- One <h1> per page, then <h2> and <h3> in order.
- <article> for the main content piece on a blog post.
- <main>, <header>, <footer>, <nav>, <aside> for page regions.
- <ul> and <ol> for actual lists, never styled divs pretending to be lists.
- Use <strong> for importance and <em> for emphasis. Plain bold or italic CSS has no semantic meaning to crawlers.
- <time datetime="2026-05-13">May 13, 2026</time> for any date that should be machine-readable.
- <figure> and <figcaption> for images that have a story.
If your Elements stacks output <div> for everything, the AI sees a wall of soup. Semantic tags tell it where the real content lives.
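The heading rule (one <h1>, then levels in order) is easy to get wrong when a page is assembled from many components. A minimal sketch of a checker, again standard library only; HeadingOutline and heading_problems are hypothetical names, and "in order" is interpreted here as "never jump down more than one level at a time".

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collects heading levels (1-6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html):
    """Return a list of human-readable problems with the heading outline."""
    outline = HeadingOutline()
    outline.feed(html)
    levels = outline.levels
    problems = []
    if levels.count(1) != 1:
        problems.append(f"expected exactly one <h1>, found {levels.count(1)}")
    # Going from h2 straight to h4 skips a level; going back up is fine.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            problems.append(f"<h{prev}> is followed by <h{cur}> (skipped a level)")
    return problems
```

An empty list means the outline passes both checks. It says nothing about whether the headings are meaningful, only that their structure is sane.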
Schema markup (JSON-LD)
This is the single biggest win after clean HTML. Drop a <script type="application/ld+json"> block in the <head> of each page with the right schema for that page type:
- Blog posts: Article or BlogPosting with headline, author, datePublished, dateModified, and image.
- FAQs: FAQPage with each Q/A pair. These get cited heavily.
- Tutorials: HowTo with steps.
- Sitewide: Organization or Person schema in the footer or homepage.
- Local services: LocalBusiness with address and hours.
- Products: Product with price, availability, and reviews.
Validate with the Schema Markup Validator (validator.schema.org) or Google’s Rich Results Test before publishing.
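For anyone generating pages from a script, here is one way the blog post case could be produced. This is a sketch, not an official recipe: the function name blog_posting_jsonld and the sample values are invented, and it covers only the five properties named in the list above. Hand-writing the JSON works just as well.

```python
import json


def blog_posting_jsonld(headline, author_name, date_published, date_modified, image_url):
    """Return a <script> block holding BlogPosting JSON-LD for one post."""
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": date_published,  # ISO 8601, e.g. "2026-05-13"
        "dateModified": date_modified,
        "image": image_url,
    }
    # Escape "</" so a stray "</script>" inside a value cannot end the block early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(blog_posting_jsonld(
        "Wheel Throwing for Beginners",        # made-up sample values
        "Jane Doe",
        "2026-05-13",
        "2026-05-13",
        "https://example.com/images/pottery-wheel-class-2026.jpg",
    ))
```

Paste the printed block into the page's <head> and run it through a validator before publishing.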
Open Graph and Twitter cards
When someone pastes your URL into ChatGPT or Claude, these tags often get read first:
<meta property="og:title" content="Page Title">
<meta property="og:description" content="One-sentence summary">
<meta property="og:image" content="https://...">
<meta property="og:url" content="https://...">
<meta property="og:type" content="article">
<meta name="twitter:card" content="summary_large_image">
Robots and crawler control
Your robots.txt at the site root is where you tell bots what they can do. To allow the main AI crawlers:
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
To block any of them, swap Allow: for Disallow:. A few worth knowing:
- GPTBot: OpenAI’s training crawler
- OAI-SearchBot: OpenAI’s search crawler (ChatGPT search); user-triggered page fetches come from ChatGPT-User
- ClaudeBot: Anthropic’s training crawler
- Claude-Web / Claude-User: Anthropic’s live fetch
- PerplexityBot: Perplexity
- Google-Extended: Google’s AI training control (a robots.txt token, separate from regular Googlebot)
- CCBot: Common Crawl, which feeds many models
You can also block specific bots at the Cloudflare level, which is the dashboard work Daniel mentioned in this thread.
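Before uploading a robots.txt, it can help to confirm that it says what you think it says. Python ships a parser for this in urllib.robotparser. The sketch below is only an illustration: the rules in ROBOTS_TXT are a made-up example that blocks GPTBot and allows everything else, and allowed_bots is a hypothetical helper name. Note that it tests the rules as written; whether a given bot actually obeys them is up to the bot.

```python
from urllib.robotparser import RobotFileParser

# Example rules only: block OpenAI's training crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Allow: /
"""


def allowed_bots(robots_txt, bots, url="https://example.com/"):
    """Map each bot name to True/False: may it fetch the given URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    for bot, ok in allowed_bots(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "PerplexityBot"]).items():
        print(f"{bot}: {'allowed' if ok else 'blocked'}")
```

PerplexityBot has no group of its own in this example, so it falls through to the `*` rules.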
llms.txt
A new convention, simple to add. Place a Markdown file at /llms.txt listing your most useful pages:
# Your Site Name
> One-sentence description of what your site is about.
## Main Content
- [Page Title](https://yoursite.com/page): Brief description.
- [Another Page](https://yoursite.com/other): Brief description.
## About
- [About Us](https://yoursite.com/about): Who runs this site.
No solid evidence yet that crawlers fetch this on their own. The real use case is humans pasting your URL into Claude or ChatGPT — the tool often follows the llms.txt link from there.
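If your page list already lives in a script or a spreadsheet export, the file can be generated rather than typed by hand. A minimal sketch, assuming the layout shown above; build_llms_txt and the shape of its sections argument are inventions for this example, not part of any llms.txt tooling.

```python
def build_llms_txt(site_name, summary, sections):
    """Build llms.txt text.

    sections maps a heading to a list of (title, url, description) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, description in pages:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")
    # Exactly one trailing newline at the end of the file.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Your Site Name",
        "One-sentence description of what your site is about.",
        {
            "Main Content": [("Page Title", "https://yoursite.com/page", "Brief description.")],
            "About": [("About Us", "https://yoursite.com/about", "Who runs this site.")],
        },
    ), end="")
```

Save the output as llms.txt at the site root.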
Sitemap and freshness
- An XML sitemap at /sitemap.xml.
- Real <lastmod> dates that reflect when content was actually updated.
- Submit it to both Google Search Console and Bing Webmaster Tools. Copilot and Meta AI use Bing’s index, and ChatGPT has used Bing in the past.
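Bots use the <lastmod> field as a hint for what to recrawl. Refreshing dates without changing content is a known anti-pattern: Google, for one, only uses lastmod when it is consistently and verifiably accurate, so fake dates can get the field ignored and can get pages demoted.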
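One way to keep lastmod honest is to generate the sitemap from real modification dates instead of stamping "today" on every build. A rough sketch with the standard library; build_sitemap is a hypothetical name, the URLs and dates are placeholders, and where the real dates come from (CMS, file timestamps, git history) is left to you.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a datetime.date recording when the content really changed.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = modified.isoformat()  # YYYY-MM-DD
    xml_body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + xml_body


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/", date(2026, 5, 13)),       # placeholder data
        ("https://example.com/about", date(2025, 11, 2)),
    ]))
```

A page that has not changed keeps its old date, which is the whole point.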
Image handling
- Real alt text on every meaningful image. AI reads alt text directly.
- Descriptive file names (pottery-wheel-class-2026.jpg, not IMG_4582.jpg).
- A clear <figure>/<figcaption> pattern when the caption adds context.
Performance basics
Core Web Vitals still count. AI engines treat slow, broken sites as low quality. Fast TTFB, no layout shift, no render-blocking JS above the fold. Elements handles most of this for you if you stick to native stacks.
One pitfall to watch
If you use JavaScript-loaded content — review widgets, dynamic FAQs, lazy-loaded sections that appear after the page renders — AI crawlers usually do not see it. Either render it server-side or accept that those parts are invisible to AI.
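A quick way to test for this pitfall is to look at the raw HTML the server sends (view source, or a plain download with no browser) and check whether the text you care about is in there. The sketch below does that for an HTML string; StaticText and in_static_html are hypothetical names, and it assumes that a crawler which does not run JavaScript sees only text outside <script>, <style>, and <template>.

```python
from html.parser import HTMLParser

SKIPPED = ("script", "style", "template")


class StaticText(HTMLParser):
    """Collects the text present in the raw HTML, ignoring script and style."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def in_static_html(raw_html, phrase):
    """True if phrase appears in the page text without running any JavaScript."""
    parser = StaticText()
    parser.feed(raw_html)
    parser.close()  # flush any buffered trailing text
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()
```

If a review quote or FAQ answer returns False here but shows up in your browser, it is being injected by JavaScript and falls into the invisible category described above.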
Now go make a website!