Web Standards Reference for Technical SEO

Most technical SEO articles cite other technical SEO articles. That works until the conversation gets serious: you need to settle whether a 308 is interchangeable with a 301, whether noindex belongs in robots.txt or in a meta tag, whether trailing slashes change a URL's identity, or whether a structured data property is required or optional. At that point you need to cite the actual specifications, not a blog post written three years ago.

Updated for April 2026, this reference indexes the web standards that technical SEO is built on: HTTP, URI syntax, the robots exclusion protocol, sitemaps, HTML, structured data, language tags, and performance APIs. Each section names the canonical specification, links to it, and explains the practical SEO consequences. Use it as a starting point for technical SEO audit work that benefits from grounded sources, and as a reference when the team needs to settle a technical disagreement with evidence instead of opinion.

Why technical SEO sits on top of web standards

Search engines do not invent the rules they enforce. They build on the same standards every browser, server, and HTTP client uses: RFCs published by the IETF, W3C recommendations, WHATWG living standards, and a small set of industry conventions like schema.org and the Open Graph Protocol. When a technical SEO question gets contested, the fastest way to resolution is the spec, not the SEO blog.

Two reasons this matters in practice:

Disputes go away when both sides cite the same source. Engineering teams trust standards documents. Marketing teams sometimes don't. The team that cites RFC 9110 wins the canonical conversation more often than the team that cites a Moz article.
The spec captures intent, not just behavior. Google's behavior is observed; the spec is normative. Knowing what the standard says lets the team distinguish "Google does this today" from "this is how the web is supposed to work". Those are different claims.

The framing throughout: this is a reference, not a tutorial. We link to the canonical source for every claim and call out where Google's interpretation diverges from the underlying spec.

HTTP, the protocol every SEO decision rides on

HTTP semantics live in RFC 9110, the current core specification. It supersedes the older RFC 7230–7235 series that most older SEO articles still cite. The 9110/9111/9112/9113/9114 cluster covers semantics, caching, HTTP/1.1, HTTP/2, and HTTP/3 respectively.

The authoritative list of every HTTP status code, including new additions, is the HTTP Status Code Registry maintained by IANA. When a question comes up about whether a status code "means" something, this registry is the source of record.

Status codes, caching, and conditional GETs

Practical SEO consequences:

301 vs 302 vs 307 vs 308: RFC 9110 §15.4 defines all of them precisely. 301 and 308 are permanent; 302 and 307 are temporary. The difference between 301/302 and 307/308 is method preservation: 307 and 308 require the client to repeat the request with the same method, while 301 and 302 allow a client to change a POST into a GET on the follow-up request. We covered the SEO-side mechanics in HTTP status codes for SEO and crawlers.
Caching headers: RFC 9111 defines Cache-Control, Age, and Expires. The validators ETag and Last-Modified are defined in RFC 9110 §8.8, not in the caching spec. Most caching SEO problems trace back to misunderstanding how these directives combine.
Conditional GET: If-Modified-Since and If-None-Match matter for crawl efficiency because a 304 Not Modified response lets the crawler skip re-downloading an unchanged body. Googlebot honors them. RFC 9110 §13 covers the semantics.
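
The redirect distinctions in RFC 9110 §15.4 fit in a small lookup table. The following is a minimal sketch in Python; the names REDIRECTS and describe_redirect are illustrative and not taken from any library, and the table encodes only the two properties discussed here.

```python
from http import HTTPStatus

# Redirect semantics per RFC 9110 §15.4.
# permanent: the target has moved for good, so stored references may be updated.
# method_preserved: the follow-up request must reuse the original method.
REDIRECTS = {
    301: {"permanent": True, "method_preserved": False},
    302: {"permanent": False, "method_preserved": False},
    307: {"permanent": False, "method_preserved": True},
    308: {"permanent": True, "method_preserved": True},
}


def describe_redirect(status_code: int) -> str:
    """Return a one-line summary for the four common redirect codes."""
    props = REDIRECTS.get(status_code)
    if props is None:
        return f"{status_code}: not one of 301/302/307/308"
    duration = "permanent" if props["permanent"] else "temporary"
    method = (
        "method preserved"
        if props["method_preserved"]
        else "method may change to GET"
    )
    # HTTPStatus supplies the registered reason phrase for the code.
    return f"{status_code} {HTTPStatus(status_code).phrase}: {duration}, {method}"


if __name__ == "__main__":
    for code in (301, 302, 307, 308):
        print(describe_redirect(code))
```

The table makes the audit question explicit: 301 and 308 agree on permanence and differ only on method handling, which matters for form submissions and API clients rather than for a crawler issuing GET requests.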

Where Google's behavior diverges from the spec

A few places worth knowing:

Google treats 308 as functionally equivalent to 301 for indexing purposes, even though the spec distinguishes them by whether a client may change the request method.
Google may keep an old URL indexed for weeks after a 301. This is implementation behavior, not spec.
Google does not require Cache-Control to be set, but search behavior on cacheable responses is more predictable than on responses sent with Cache-Control: no-store.

The spec is the right baseline. Google's documented behavior layers on top.

URLs, RFC 3986 is the source of record

URI syntax lives in RFC 3986. This is the document that defines what a URL is, what characters are reserved, what percent-encoding does, and how URLs are normalized.

Where this matters for SEO:

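Trailing slashes change the URL's identity: /path and /path/ are different URIs, because none of the normalization steps in RFC 3986 §6.2.2 makes them equivalent. The one exception is the root, where an empty path and / are equivalent for http and https URLs (§6.2.3). Whether the two should resolve to the same content is a server-side decision; whether they're "the same URL" is a spec question.
Case sensitivity: the path component of an HTTP URL is case-sensitive per spec, even though many servers treat it as case-insensitive. /Page and /page are different URIs. Only the scheme and host are case-insensitive (§6.2.2.1).
Query parameter order: RFC 3986 treats the query as a single string and defines no parameter-level equivalence, which is why Google sometimes treats ?a=1&b=2 and ?b=2&a=1 as different URLs and sometimes doesn't.
Percent-encoding equivalence: %2F and / are not equivalent in path components, because / is a reserved delimiter (§2.2). They are sometimes treated as equivalent by Google, but this is implementation behavior. Percent-encoded unreserved characters are equivalent to their literal form, so %7E and ~ identify the same resource (§6.2.2.2).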
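
To make the comparison concrete, here is a sketch of the syntax-based and scheme-based normalization from RFC 3986 §6.2.2 and §6.2.3, written in Python with only the standard library. The function name normalize_url is an illustrative choice. The sketch assumes absolute http or https URLs, normalizes percent-encoding in the path only, and drops userinfo and does not handle IPv6 literals. It deliberately leaves trailing slashes, path case, and query order untouched, because the spec does not treat those as equivalent.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Unreserved characters per RFC 3986 §2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PERCENT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_percent(match: re.Match) -> str:
    """Decode unreserved characters; uppercase the hex digits of the rest."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def _remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' segments in an absolute path (RFC 3986 §5.2.4)."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            if len(output) > 1:  # never pop the leading empty root segment
                output.pop()
            continue
        output.append(segment)
    if segments[-1] in (".", ".."):
        output.append("")  # a trailing dot segment leaves a trailing slash
    return "/".join(output)


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Drop the port when it is the scheme default (§6.2.3).
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = PERCENT_TRIPLET.sub(_normalize_percent, parts.path)
    # An empty path is equivalent to "/" for http and https (§6.2.3).
    path = _remove_dot_segments(path) if path else "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_url("HTTP://Example.COM:80/a/./b/../c"))
    print(normalize_url("http://example.com/path") == normalize_url("http://example.com/path/"))
```

Two URLs that still differ after this normalization are different URIs as far as the spec is concerned. Whether a search engine chooses to fold them together is a separate, observed behavior.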

Settling "are these two URLs the same?" disputes

If your team is debating whether two URLs are "the same," start with RFC 3986. The SEO-side patterns for URL architecture are covered in site taxonomy and URL architecture for large websites.

robots.txt, RFC 9309

The Robots Exclusion Protocol was a de-facto standard for over 25 years before it was formally documented in RFC 9309, published in 2022. Knowing the spec matters because Google's behavior is one implementation among several: Bing, Baidu, and AI crawlers may interpret directives slightly differently.

Key spec points:

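Allow and Disallow are the only rules RFC 9309 defines. There is no Noindex rule in the spec: Noindex: /path in robots.txt was an unofficial rule that Google stopped supporting in 2019.
Wildcards (*) and end-anchors ($) are part of the spec (§2.2.3), although not every crawler implements them.
The most specific match wins, measured as the longest matching path in octets. When an Allow and a Disallow rule are equivalent, Allow should be used (§2.2.2). The order of rules within a group does not decide the outcome.
A robots.txt that returns a 4xx status ("unavailable") may be treated as "no restrictions" per RFC 9309 §2.3.1.3. A robots.txt that is unreachable because of server or network errors, such as a 5xx response, must be treated as a complete disallow per §2.3.1.4.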
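
RFC 9309 §2.2.2 selects the most specific (longest) matching rule, with Allow preferred on a tie, and §2.3.1.3 and §2.3.1.4 separate an unavailable file (4xx) from an unreachable one (5xx). A minimal Python sketch of both follows. The function names and the (kind, pattern) rule format are assumptions made for illustration; the sketch measures pattern length in characters, which equals octets only for ASCII patterns, and it skips the percent-encoding comparison rules of the RFC.

```python
import re


def _compile(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern with * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (kind, pattern) pairs for one user-agent group, kind is 'allow' or 'disallow'."""
    best = None  # (pattern length, is_allow); tuple order makes allow win ties
    for kind, pattern in rules:
        if pattern and _compile(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path is allowed.
    return True if best is None else best[1]


def policy_for_fetch_status(status: int) -> str:
    """Map the robots.txt fetch result to a crawl policy."""
    if 200 <= status <= 299:
        return "parse_rules"
    if 400 <= status <= 499:
        return "allow_all"  # unavailable: crawler may access any resource
    if 500 <= status <= 599:
        return "disallow_all"  # unreachable: assume complete disallow
    return "not_covered_by_this_sketch"  # e.g. redirects, which should be followed


if __name__ == "__main__":
    rules = [
        ("disallow", "/private/"),
        ("allow", "/private/public/"),
        ("disallow", "/*.pdf$"),
    ]
    for path in ("/private/x", "/private/public/x", "/docs/a.pdf", "/docs/a.pdf?x=1"):
        print(path, is_allowed(rules, path))
```

Note that the outcome does not depend on the order in which the rules are listed, only on which matching pattern is longest.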

Why Disallow does not prevent indexation

Where teams get into trouble: putting Disallow: /path in robots.txt to prevent indexation. That's not what Disallow does. It prevents crawling. A page that is not crawled cannot have its noindex read, so the URL may still be indexed on the strength of inbound links. The fix is noindex in the HTML (the robots meta tag) or in an HTTP header (X-Robots-Tag), with crawling left allowed so the directive can be seen.
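
The interaction between crawl blocking and noindex can be written as a four-case decision table. This Python sketch uses a hypothetical function name and assumes the two inputs were gathered elsewhere, for example from a robots.txt matcher and a crawl of the page.

```python
def indexing_control_issue(disallowed_in_robots: bool, has_noindex: bool) -> str:
    """Classify how a URL's crawl and index controls interact."""
    if disallowed_in_robots and has_noindex:
        # The crawler never fetches the page, so the noindex is never read.
        return "conflict: noindex is invisible behind Disallow"
    if disallowed_in_robots:
        return "crawl blocked only: URL may still be indexed from links"
    if has_noindex:
        return "ok: page is crawlable and noindex can be read"
    return "indexable"


if __name__ == "__main__":
    print(indexing_control_issue(True, True))
    print(indexing_control_issue(False, True))
```

Only the third case reliably keeps a URL out of the index, which is why removing the Disallow rule is usually part of the fix.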

The Google-specific extensions and AI-crawler patterns are in llms.txt and AI crawl directives.

Sitemaps, the sitemaps.org protocol

Sitemap XML is defined at sitemaps.org, maintained jointly by Google, Microsoft, and Yahoo. It is not an RFC, but it is the de-facto standard every major search engine implements.

The protocol covers:

The <urlset> and <sitemapindex> schemas
The <lastmod>, <changefreq>, and <priority> optional fields
The 50,000-URL and 50 MB (uncompressed) limits per sitemap file
Image and video extensions (Google-defined namespaces that extend the core protocol)
News sitemaps (separate spec at Google News sitemap docs)

Which sitemap fields actually matter in 2026

Practical points:

<changefreq> is documented in the spec, but Google has stated publicly it ignores the field. <priority> is treated the same way: both are spec-supported but practically irrelevant in 2026.
<lastmod> is the only optional field that carries operational weight. We covered the architecture model in XML sitemap guide for technical SEO.
The robots.txt Sitemap: line is the standard way to advertise a sitemap. It is defined by the sitemaps.org protocol, not by RFC 9309, which mentions it only as an example of other records a crawler may interpret (§2.2.4). Submitting via Search Console is supplementary.
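
As an illustration of the protocol's shape, this Python sketch builds <urlset> documents that carry only <loc> and <lastmod> and splits the input at the 50,000-URL limit. The function name build_sitemaps and the (loc, lastmod) input format are assumptions for the example; the sketch does not check the 50 MB size limit and does not generate the <sitemapindex> file that would reference the parts.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def build_sitemaps(entries, max_urls=MAX_URLS_PER_FILE):
    """entries: list of (loc, lastmod) pairs; returns a list of XML documents as bytes."""
    documents = []
    for start in range(0, len(entries), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc, lastmod in entries[start:start + max_urls]:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            # lastmod uses W3C Datetime, e.g. YYYY-MM-DD
            ET.SubElement(url, "lastmod").text = lastmod
        documents.append(
            ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
        )
    return documents


if __name__ == "__main__":
    pages = [
        ("https://example.com/", "2026-04-01"),
        ("https://example.com/about", "2026-03-15"),
    ]
    print(build_sitemaps(pages)[0].decode("utf-8"))
```

The <changefreq> and <priority> fields are left out on purpose, in line with the point above that <lastmod> is the optional field that matters.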

HTML, the WHATWG living standard

The HTML specification is maintained by WHATWG as a living standard at html.spec.whatwg.org, not by W3C. The W3C HTML5 Recommendation from 2014 is historical; every browser implements the WHATWG version.

For SEO, the parts of the spec that matter most:

<meta> elements: the name="robots" directive, name="description", http-equiv semantics. WHATWG defines what's valid; Google's robots meta directives are documented separately at the Google robots meta tag reference.
<link rel> values: canonical, alternate, prev/next (no longer used by Google as an indexing signal since 2019, but still defined in the HTML Living Standard), preload, prefetch. rel="canonical" is also specified in RFC 6596. The full list is the HTML Living Standard link types section.
Heading semantics: <h1>–<h6> are still the document outline primitives. Multiple <h1> per page is technically valid since HTML5 but practically a bad idea for SEO.
<a> and link semantics: nofollow is defined in the HTML Living Standard link types section. sponsored and ugc are values Google introduced in 2019; they are not part of that list.
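
For audits, the <meta name="robots"> and <link rel="canonical"> values can be pulled out of a page with the standard library alone. The following Python sketch uses html.parser; the class and function names are illustrative, and the parser reads the whole document without checking that the elements sit inside <head>.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect robots meta directives and canonical link targets."""

    def __init__(self):
        super().__init__()
        self.robots = []
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attributes.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            content = attributes.get("content", "")
            self.robots.extend(
                token.strip().lower() for token in content.split(",") if token.strip()
            )
        elif tag == "link":
            # rel is a space-separated set of link types.
            rel_tokens = attributes.get("rel", "").lower().split()
            if "canonical" in rel_tokens and attributes.get("href"):
                self.canonicals.append(attributes["href"])


def extract_head_signals(html: str) -> dict:
    parser = HeadSignals()
    parser.feed(html)
    return {"robots": parser.robots, "canonicals": parser.canonicals}


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="robots" content="noindex, follow">'
        '<link rel="canonical" href="https://example.com/page"></head></html>'
    )
    print(extract_head_signals(sample))
```

More than one entry in canonicals is itself a finding worth reporting, since conflicting canonical declarations leave the choice to the search engine.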

The image element

<img> ships with width, height, alt, loading, srcset, and sizes. All are spec-defined. The MDN reference at MDN HTML img is the practical companion. We tied these to image SEO patterns in image SEO at scale for modern frameworks.

Structured data, schema.org and JSON-LD

Structured data on the web has two layers:

Vocabulary: schema.org, maintained as a community project sponsored by Google, Microsoft, Yahoo, and Yandex. It defines types like Article, Product, FAQPage, Person.
Syntax: JSON-LD as defined by the W3C JSON-LD 1.1 specification. This is the format Google recommends, though Microdata and RDFa are also valid syntaxes for schema.org.

What matters for SEO:

Schema.org is not a spec in the IETF/W3C sense; it is a controlled vocabulary. The vocabulary itself does not mark properties as required. The required and recommended labels come from Google's own structured data documentation, which lists the properties each rich result needs.
The JSON-LD must be valid JSON-LD, which the W3C spec defines precisely. The Schema Markup Validator checks markup against the schema.org vocabulary, and the Rich Results Test checks it against Google's rich result requirements.
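
The two layers can be checked separately in a few lines. This Python sketch parses a JSON-LD block, confirms it declares a schema.org context and a type, and reports which of a caller-supplied list of properties are absent. The function name and the property list in the usage example are hypothetical; the real list for a given rich result has to come from Google's documentation. The sketch assumes a single top-level object rather than an array or an @graph.

```python
import json


def missing_properties(jsonld_text: str, expected: list[str]) -> list[str]:
    """Return the keys that are absent from a single JSON-LD object."""
    data = json.loads(jsonld_text)  # raises ValueError on invalid JSON
    problems = []
    if "schema.org" not in str(data.get("@context", "")):
        problems.append("@context")
    if "@type" not in data:
        problems.append("@type")
    problems.extend(name for name in expected if name not in data)
    return problems


if __name__ == "__main__":
    article = json.dumps(
        {
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": "Web Standards Reference for Technical SEO",
            "inLanguage": "en",
        }
    )
    # The expected list below is illustrative, not Google's requirement list.
    print(missing_properties(article, ["headline", "datePublished", "author"]))
```

A check like this covers presence only. Validating value types and nesting is what the Schema Markup Validator and the Rich Results Test are for.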

We covered the production mechanics in structured data for AI visibility.

Language tags, BCP 47

hreflang values are language tags as defined by IETF BCP 47, which currently consists of two RFCs: RFC 5646 for tag syntax and RFC 4647 for matching. These tags also drive the HTML lang attribute and the JSON-LD inLanguage property.

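The pattern: language-script-region, where script and region are optional. For example:

en: English
en-US: English as used in the United States
zh-Hans-CN: Chinese (Simplified) as used in China
pt-BR: Portuguese (Brazil)
x-default: fallback for the default version. This is a reserved hreflang value defined by the search engines, not a BCP 47 language tag.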

Where teams trip up: using ISO 3166 country codes alone (uk is the language tag for Ukrainian, and the region code for the United Kingdom is GB), inconsistent casing (case is not significant per BCP 47, but tools sometimes care), and using deprecated tags. The IANA Language Subtag Registry is the authoritative list.
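
Because case is not significant in BCP 47 but some tools expect the conventional form, a small normalizer helps keep hreflang values consistent. This Python sketch checks only the language-script-region shape and applies the conventional casing from RFC 5646 (lowercase language, title-case script, uppercase region). The function name is illustrative, variants and extensions are not supported, and a tag that passes the shape check is not necessarily registered with IANA.

```python
import re

# language (2-3 letters), optional script (4 letters), optional region
# (2 letters or 3 digits). A shape check only, not a registry lookup.
TAG_SHAPE = re.compile(
    r"^([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?(?:-([A-Za-z]{2}|[0-9]{3}))?$"
)


def canonical_case(tag: str):
    """Return the tag in conventional casing, or None if the shape is wrong."""
    if tag.lower() == "x-default":
        return "x-default"  # hreflang-specific value, not a BCP 47 tag
    match = TAG_SHAPE.match(tag)
    if match is None:
        return None
    language, script, region = match.groups()
    parts = [language.lower()]
    if script:
        parts.append(script.title())
    if region:
        parts.append(region.upper())
    return "-".join(parts)


if __name__ == "__main__":
    for raw in ("EN-us", "zh-hans-cn", "pt-br", "en_US", "x-default"):
        print(raw, "->", canonical_case(raw))
```

A value such as en_US fails the shape check because the separator must be a hyphen, which is one of the more common hreflang mistakes.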

The full hreflang model is in international SEO and hreflang for modern frameworks.

Open Graph and Twitter Cards

The Open Graph Protocol is a Facebook-originated protocol that became the de-facto standard for social-card metadata across LinkedIn, Slack, Discord, Telegram, Pinterest, and most messaging clients. It's not a spec maintained by a standards body.

Practical points:

og:image should be 1200×630 (1.91:1 aspect ratio) per the Facebook reference. Other ratios get cropped.
og:type, og:title, og:description, og:url, og:image are the minimum useful set.
Twitter Cards are documented at Twitter/X developer docs. They overlap with Open Graph; if both are present, X uses the Twitter-specific tags.

These are not part of the WHATWG HTML spec but are widely implemented as <meta property="og:..."> in <head>.
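
The minimum tag set and the 1.91:1 image ratio can be checked mechanically. This Python sketch takes the og: properties as a dict, which is assumed to have been extracted from the page beforehand, and an optional (width, height) pair for the image. The function name and the 0.02 ratio tolerance are illustrative choices, not values from the Open Graph Protocol or the Facebook reference.

```python
MINIMUM_OG = ("og:type", "og:title", "og:description", "og:url", "og:image")
TARGET_RATIO = 1200 / 630  # roughly 1.91:1


def audit_open_graph(tags: dict, image_size=None, tolerance: float = 0.02) -> list:
    """Return a list of human-readable issues; an empty list means none found."""
    issues = [f"missing {name}" for name in MINIMUM_OG if not tags.get(name)]
    if image_size is not None:
        width, height = image_size
        ratio = width / height
        if abs(ratio - TARGET_RATIO) > tolerance:
            issues.append(f"og:image ratio {ratio:.2f}:1 differs from 1.91:1")
    return issues


if __name__ == "__main__":
    tags = {
        "og:type": "article",
        "og:title": "Web Standards Reference for Technical SEO",
        "og:description": "A reference to the specifications behind technical SEO.",
        "og:image": "https://example.com/card.png",
    }
    print(audit_open_graph(tags, image_size=(800, 800)))
```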

Cookies, RFC 6265

RFC 6265 defines HTTP cookies. The newer draft, RFC 6265bis, updates the spec for SameSite, partitioned cookies, and cookie attributes.

Why this matters for SEO:

Cookie-based personalization can break crawl behavior if the bot receives different content than the user
Cookie-walled content (for example, a forced login or a banner that must be accepted first) can prevent indexing, because the content only renders after a cookie is set and the bot does not go through that step
Consent banners that block content rendering until the user accepts can hurt Core Web Vitals and keep crawlers from seeing the content

The patterns for handling consent and crawler routing are in redirect bot traffic to prerendering.

Performance APIs, W3C Web Performance

The Web Vitals metrics (LCP, INP, CLS, TTFB, FCP) are not RFCs. They are metric definitions published by Google at web.dev/vitals and measured through browser APIs specified by the W3C Web Performance Working Group.

The browser-side APIs that produce the metric values:

Largest Contentful Paint API, W3C Working Draft
Layout Instability API, a WICG draft rather than a W3C Working Draft (drives CLS)
Event Timing API, W3C Working Draft (drives INP)
Navigation Timing, W3C Recommendation (drives TTFB)

Search ranking uses field data from the Chrome User Experience Report, not lab data. The metric definitions are Google's; the browser APIs producing them are W3C. We covered the engineering implications in Core Web Vitals optimization for engineering teams.
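
Field data is assessed at the 75th percentile of page loads. The Python sketch below rates a list of samples against the thresholds Google publishes on web.dev (LCP 2.5 s and 4 s, INP 200 ms and 500 ms, CLS 0.1 and 0.25). The function names are illustrative, and the nearest-rank percentile is an assumption made for simplicity; the Chrome User Experience Report may aggregate differently.

```python
import math

# (good upper bound, needs-improvement upper bound); LCP and INP in milliseconds.
THRESHOLDS = {
    "LCP": (2500, 4000),
    "INP": (200, 500),
    "CLS": (0.1, 0.25),
}


def percentile_75(samples):
    """Nearest-rank 75th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def rate(metric: str, samples) -> str:
    good, needs_improvement = THRESHOLDS[metric]
    value = percentile_75(samples)
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"


if __name__ == "__main__":
    print(rate("LCP", [1000, 2000, 3000, 4000]))
    print(rate("CLS", [0.01, 0.02, 0.05, 0.3]))
```

The example shows why averages mislead: a page can have a fast median LCP and still fail the assessment if the slowest quarter of loads is slow enough.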

Where standards intersect with Google's interpretations

A few patterns worth keeping in mind across all of these:

A spec defines what is valid; Google defines what produces a ranking signal. These are different. Valid hreflang doesn't guarantee Google honors it; valid schema doesn't guarantee a rich result.
When the spec and Google disagree, Google's behavior is what affects search. The spec is the right place to start the conversation; the behavior is the right place to end it.
Other search engines may follow different paths. Bing, DuckDuckGo, Yandex, and AI engines like Perplexity each implement standards with their own gaps. A standards-grounded approach handles all of them better than a Google-only approach.
Standards change. RFC 9110 superseded RFC 7230–7235 in 2022. RFC 9309 formalized the robots protocol after 25 years of de-facto status. Citing the current spec matters.

How to use standards in technical SEO disputes

When the team disagrees on a technical SEO decision, the fastest path to resolution is usually:

Identify which standard governs the question (HTTP? URI? HTML? schema.org?)
Cite the relevant section of the canonical document
Identify whether Google's documented behavior aligns or diverges
If they align, the answer is the standard. If they diverge, document the divergence and decide which to follow per goal.

This sounds bureaucratic. In practice it ends most disagreements within an hour because everyone is now arguing about the same evidence. We use this exact pattern on every audit handoff. A team that grounds its technical SEO in standards has fewer "well, I read somewhere..." conversations and ships faster.

Conclusion

Technical SEO is built on a small number of well-defined standards: RFC 9110 for HTTP, RFC 3986 for URIs, RFC 9309 for robots, sitemaps.org for XML sitemaps, WHATWG for HTML, schema.org for structured data vocabulary, JSON-LD for syntax, and BCP 47 for language tags. Knowing where to find them turns most technical SEO disputes from opinion exchanges into evidence-based decisions.

The standards are not the whole story. Google's behavior layers on top, and so do other engines'. But the standards are where every conversation should start, and citing them by name and section number is the cheapest credibility move available in technical SEO work.

External Technical References

RFC 9110, HTTP Semantics

The current core specification for HTTP semantics, including status code definitions, validators, and conditional requests.

IANA HTTP Status Code Registry

The authoritative registry of every HTTP status code with its defining specification.

Frequently Asked Questions

Which RFC defines HTTP status codes for SEO?

RFC 9110 defines HTTP semantics including the full status code system. It superseded RFC 7230 to 7235 in 2022. The IANA HTTP Status Code Registry maintains the authoritative list of registered codes including additions like 308 Permanent Redirect.

Is robots.txt actually a standard?

Yes, since 2022. RFC 9309 formalized the Robots Exclusion Protocol after more than 25 years as a de-facto standard. Allow and Disallow are the only rules the spec defines. Noindex in robots.txt was an unofficial rule that Google stopped supporting in 2019; it is not part of RFC 9309.

Why cite RFCs in SEO discussions instead of Google docs?

RFCs and WHATWG specs define what is valid. Google docs define what produces ranking signals. Both are useful, but the spec is the right starting point: engineering teams accept it as authoritative, and other search engines may diverge from Google in ways the spec captures clearly.

Is schema.org an official web standard?

Schema.org is a community vocabulary, not an IETF or W3C standard. It is sponsored by Google, Microsoft, Yahoo, and Yandex. The JSON-LD syntax used to express schema.org is a W3C Recommendation. Both layers matter for valid structured data.