Sitemap.xml: should links end with "index.html"?

Hi, the sitemap.xml that RapidWeaver generates automatically puts “index.html” at the end of every listed link. That didn’t seem correct to me, as I want Google to index the pages without it, so I generate a sitemap.xml with a website that creates it for me and then copy the content into the sitemap.xml file.

Am I doing it right?
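
For reference, the same result can be had without an external generator by rewriting the sitemap locally. This is only a minimal sketch of that alternative, not what RapidWeaver or any generator site does: the function name `strip_index_html` is made up for illustration, and it assumes a standard sitemap that uses the sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def strip_index_html(sitemap_xml: str) -> str:
    """Return the sitemap with a trailing 'index.html' removed from every <loc>.

    Hypothetical helper for illustration only.
    """
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", NS)
    root = ET.fromstring(sitemap_xml)
    for loc in root.iter(f"{{{NS}}}loc"):
        url = (loc.text or "").strip()
        if url.endswith("/index.html"):
            # Drop the filename but keep the trailing slash.
            loc.text = url[: -len("index.html")]
    return ET.tostring(root, encoding="unicode")


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://website.ext/contact/index.html</loc></url>"
    "</urlset>"
)
cleaned = strip_index_html(sample)
```

Here `website.ext` is just a placeholder domain. Whether stripping the filename is the right call is exactly the question above.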

Hi @santi,

Listing the URLs with the filename and extension is correct.

You want spiders to index the content of your pages, not the folder they reside in on the server. Note that Google will always look for an html, php or htm file regardless of what the sitemap file tells it to do, but other spiders may not. To be sure and compliant, you should specify the full path and filename.

Cheers,
Erwin

But what if the URL is canonicalized? I have canonical tags in my pages so that those “index.html” URLs don’t get indexed instead of the real ones.

I have read this on a website:

"Your XML Sitemap should only contain URLs you wish for search engines to index. If a URL is canonicalized, this is an explicit statement to search engines that you do NOT wish for the URL to be indexed, and instead wish for the canonical URL to consolidate indexing signals.

As such, including a canonicalized URL in a sitemap provides conflicting information to search engines, and may impact what they consider is the canonical URL, which may in turn mean that unintended URLs get indexed"
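
To make the conflict described in that quote concrete, here is a rough sketch that lists sitemap entries whose page declares a different canonical URL. Everything in it is an assumption for illustration: the function names are made up, `website.ext` is a placeholder, and the `canonicals` mapping (sitemap URL to the `href` of that page's canonical tag) is supplied by hand rather than fetched from the live pages.

```python
import xml.etree.ElementTree as ET

# Tag prefix for elements in a standard sitemaps.org sitemap.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(sitemap_xml: str) -> list[str]:
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def conflicting_entries(sitemap_xml: str, canonicals: dict[str, str]) -> list[str]:
    """Return sitemap URLs whose page points to a different canonical URL.

    URLs missing from `canonicals` are treated as self-canonical.
    """
    return [
        url
        for url in sitemap_locs(sitemap_xml)
        if canonicals.get(url, url) != url
    ]


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://website.ext/contact/index.html</loc></url>"
    "<url><loc>https://website.ext/about/</loc></url>"
    "</urlset>"
)
# Hand-written stand-in for what the canonical tags on each page say.
canonicals = {
    "https://website.ext/contact/index.html": "https://website.ext/contact/",
    "https://website.ext/about/": "https://website.ext/about/",
}
conflicts = conflicting_entries(sample, canonicals)
```

With this sample data, the only entry flagged is the `contact/index.html` one, because its canonical tag points at the folder URL instead.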

Are you sure you’re reading that correctly?

That merely states which links you should and should not include in sitemap.xml, not whether you should add the index file to the URL or not.

Specifically for Google, it really doesn’t matter unless your URLs end in something other than index.htm(l) or index.php, which is bad practice anyway (but can sometimes be necessary). Google will look for any index.php, index.html or index.htm file in any directory that you include in the sitemap (and, secretly, in any directory; they’ll just not display those in search results).

But the search indexing world is bigger than just Google. You’ll find that every day, a few hundred spiders will crawl your website. Some of those spiders will only index full URLs, and so will only index a page if you include the file at the end of the path in the URLs in your sitemap.

So, in short:

If your contact page is website.ext/contact/, Google will find the index.php file inside and index it, but some search engines might not.

If your contact page is website.ext/contact.php, Google won’t index it unless you include the full path in your sitemap.xml.

If your contact page is website.ext/contact/contact.php the same applies. If there happens to be an index.html, index.htm, or index.php file in that same directory (perhaps left over from earlier publications), Google will index that instead if you don’t specify the full path in sitemap.xml.
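
The three cases above can be sketched as a small helper that turns a page URL into the explicit form for sitemap.xml. This is only an illustration under stated assumptions: the function name `with_explicit_filename` is made up, the `https` scheme is assumed, and the index filename is a parameter because the helper cannot know which index file actually sits in a folder.

```python
from urllib.parse import urlsplit, urlunsplit


def with_explicit_filename(url: str, index_name: str = "index.html") -> str:
    """Append the index filename to folder-style URLs; leave file URLs alone.

    Hypothetical helper: only URLs ending in '/' (or with no path at all)
    are treated as folders.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Folder URL: name the index file explicitly.
        path += index_name
    return urlunsplit(parts._replace(path=path))


# Case 1: folder URL, the index file is spelled out.
case1 = with_explicit_filename("https://website.ext/contact/", index_name="index.php")
# Case 2: already a full path to a file, left unchanged.
case2 = with_explicit_filename("https://website.ext/contact.php")
# Case 3: a named file inside a folder, also left unchanged.
case3 = with_explicit_filename("https://website.ext/contact/contact.php")
```

Each resulting URL would then go into its own `<loc>` entry in sitemap.xml.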

Best practice remains to include the filenames in sitemap.xml, although I do admit to not always doing that either (if I know there won’t be conflicts and the client only prioritises Google-oriented SEO).

Cheers,
Erwin