Sitemaps for SEO: Strategy, Hygiene and Implementation
A sitemap tells search engines what to crawl on your website. Done well, it accelerates indexing, signals content priority, and protects crawl budget. Done badly — or ignored entirely — it can quietly suppress the visibility of pages you’ve worked hard to build.
This guide covers the full picture: what sitemaps are, why sitemap hygiene matters more than most guides admit, how to handle international SEO through hreflang in sitemaps, and how to create, submit and maintain them across the major platforms. It’s written for business owners and marketing managers who want practical answers, not just definitions.
What is a Sitemap and Why Does it Matter?
A sitemap is a file that lists the URLs on your website and provides structured information about each one. Its primary job is discovery: it tells search engine crawlers where your pages are, when they were last updated, and how frequently they tend to change. For websites where internal linking is strong and page count is modest, Google will often find everything on its own. But even in those cases, a sitemap acts as a safety net and confirmation signal.
The more important shift in how sitemaps function today is their role in priority signalling. When Googlebot has a fixed crawl budget to allocate across your site, your sitemap helps it decide where to focus. If your sitemap is cluttered with redirected URLs, pages returning 404 errors, or pages with canonical tags pointing elsewhere, you are actively directing crawlers toward dead ends. That wastes budget that should be spent on pages generating revenue or traffic.
“Sitemaps are often treated as a one-time setup task, but they need ongoing attention,” says Ciaran Connolly, founder of ProfileTree. “We regularly audit client sitemaps as part of our technical SEO work and find pages that have been redirected or removed still sitting in the sitemap months later. That kind of dead weight compounds over time.”
The practical case for investing in your sitemap is straightforward: a clean, current sitemap reduces indexing lag, helps new content get found faster, and gives you a clearer signal in Google Search Console when something goes wrong.
Types of Sitemaps
There are four sitemap types worth understanding, though most websites will only need one or two of them.
- XML sitemaps are the standard format for search engines. They list URLs in a structured file with optional metadata including last modification date, change frequency, and priority weighting. Every website that cares about SEO should have one. XML sitemaps must be UTF-8 encoded, kept under 50MB uncompressed, and limited to 50,000 URLs per file. If your site exceeds either limit, you use a sitemap index file that points to multiple child sitemaps.
- HTML sitemaps are designed for human visitors, not crawlers. They present an accessible, navigable list of your site’s pages — useful on large websites where footer navigation or category menus don’t surface everything. They have diminishing SEO value as a standalone tool, but can still contribute to internal linking depth on complex sites.
- Image and video sitemaps allow you to give search engines additional metadata about multimedia content: file URLs, captions, titles, licensing information and, for videos, thumbnail URLs and play page URLs. Without these, images and videos can be indexed, but Google has less context to work with. For websites built around video content, a dedicated video sitemap is worth the setup time. ProfileTree’s video marketing services team recommends this as a baseline for any client site with a substantial video library.
- News sitemaps are only relevant if you publish news content and want to appear in Google News. They require specific markup and are subject to Google’s publisher guidelines.
| Sitemap Type | Primary Audience | When to Use |
|---|---|---|
| XML | Search engines | All websites — non-negotiable |
| HTML | Human visitors | Large sites with deep page structures |
| Image/Video | Search engines | Sites with significant media content |
| News | Search engines | Google News publishers only |
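As a rough illustration of the XML format described above, the following Python sketch builds a minimal sitemap using only the standard library. The function name `build_sitemap` and the example URLs are assumptions made for illustration; a CMS plugin would normally generate this file for you.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit for XML sitemaps


def build_sitemap(entries):
    """Return a UTF-8 encoded XML sitemap for (url, lastmod) pairs.

    lastmod is a date string such as "2026-06-01", or None to omit it.
    """
    if len(entries) > MAX_URLS_PER_FILE:
        raise ValueError("too many URLs for one file; use a sitemap index")

    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        if lastmod:
            ET.SubElement(node, "lastmod").text = lastmod

    # Passing an encoding makes tostring() return bytes with an XML declaration.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    pages = [
        ("https://example.com/", "2026-06-01"),
        ("https://example.com/services/web-design/", None),
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

The sketch only checks the URL count; the 50MB size limit would need a separate check on the length of the returned bytes.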
Sitemap Hygiene: Quality Over Quantity
This is the section most guides skip, and it’s where most sitemap problems originate.
The common misconception is that including more URLs in your sitemap is better. In practice, the opposite is true. A sitemap should only contain URLs that are 200-status, canonical, and indexed by choice. Every URL that falls outside that definition is dead weight — and dead weight costs you crawl budget.
Here is what should never appear in your sitemap:
- Redirected URLs (301/302): If a URL redirects to another page, the redirect destination is the canonical version. The redirect source has no business being in your sitemap. When crawlers follow a sitemap URL and hit a redirect, they note the inconsistency. Enough of these and you have a site that signals poor maintenance hygiene.
- 404 pages: Any URL returning a 404 that appears in your sitemap is a direct signal that your sitemap is out of date. Google Search Console will flag these explicitly, but the underlying issue is that someone removed pages without updating the sitemap.
- Non-canonical URLs: If a page has a canonical tag pointing to a different URL, include only the canonical version in your sitemap. Listing both creates a contradiction: you’re telling Google “this is the master copy” via the canonical tag, and simultaneously “this URL matters” via the sitemap. Pick one.
- Paginated URLs beyond page one: Unless you are explicitly targeting paginated pages for indexing (rare), only the root category or archive URL belongs in your sitemap.
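- Session IDs and tracking parameters: These create URL variants that are technically duplicate content. If these are being generated dynamically, fix the sitemap configuration so it outputs only the clean URL, and make sure the canonical tag on each variant points to that clean version. (Search Console’s old URL Parameters tool was retired in 2022, so this can no longer be managed there.)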
The practical test for any URL before including it in your sitemap is simple: does Google indexing this page serve a business or content purpose? If not, leave it out.
A cleaner sitemap also makes Search Console data more actionable. When every URL in your sitemap is one you genuinely want indexed, error rates become meaningful signals rather than noise.
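The exclusion rules above can be sketched as a simple filter. This is a minimal example, assuming you already have crawl data for each page; the dictionary keys (`url`, `status`, `canonical`, `noindex`) and the `NOISE_PARAMS` list are hypothetical and would need adapting to whatever your crawler or CMS exports.

```python
from urllib.parse import parse_qsl, urlsplit

# Illustrative list of parameters treated as tracking or session noise.
NOISE_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                "gclid", "fbclid", "sessionid", "sid"}


def sitemap_exclusion_reason(page):
    """Return why a page should be left out of the sitemap, or None to keep it."""
    if page["status"] != 200:
        return f"status {page['status']}"  # covers redirects and 404s
    if page.get("noindex"):
        return "noindex"
    canonical = page.get("canonical")
    if canonical and canonical != page["url"]:
        return "canonical points elsewhere"
    query = urlsplit(page["url"]).query
    params = {key.lower() for key, _ in parse_qsl(query)}
    if params & NOISE_PARAMS:
        return "tracking or session parameter"
    return None


def clean_urls(pages):
    """Keep only the URLs that pass every hygiene rule."""
    return [p["url"] for p in pages if sitemap_exclusion_reason(p) is None]
```

Returning a reason rather than a plain true/false makes it easier to report why each URL was dropped during an audit.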
International SEO and Hreflang in Sitemaps
For businesses in Northern Ireland, the Republic of Ireland, and the UK, international SEO is a practical concern, not an edge case. A company based in Belfast serving customers in Dublin and London is operating across two jurisdictions with distinct search audiences. Getting your hreflang implementation right is what tells Google which version of a page to serve to which audience.
Hreflang can be implemented in three ways: in the HTTP header (for PDFs and non-HTML files), in the <head> of each page, or in your XML sitemap. The sitemap method is often the cleanest approach for large sites because it centralises the implementation and avoids the need to modify every page template.
Here is what a sitemap-based hreflang entry looks like in practice:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.co.uk/services/web-design/</loc>
    <xhtml:link rel="alternate" hreflang="en-gb" href="https://example.co.uk/services/web-design/"/>
    <xhtml:link rel="alternate" hreflang="en-ie" href="https://example.ie/services/web-design/"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.co.uk/services/web-design/"/>
  </url>
</urlset>
The relationship is bidirectional: if page A points to page B as its Irish equivalent, page B must point back to page A as its UK equivalent. Missing reciprocal tags are the most common hreflang error and one of the hardest to catch manually.
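Because missing return links are hard to spot by eye, a small script can help. The sketch below assumes you have already extracted, for each page, the set of alternate URLs it declares; the function name and the input shape are illustrative, not part of any standard tool.

```python
def missing_return_links(alternates):
    """Find hreflang annotations that are not reciprocated.

    `alternates` maps each page URL to the set of alternate URLs it declares.
    A page listing itself is expected and is ignored here.
    Returns sorted (source, target) pairs where target does not link back.
    """
    missing = []
    for source, targets in alternates.items():
        for target in targets:
            if target == source:
                continue
            # A target with no entry at all also counts as missing.
            if source not in alternates.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


if __name__ == "__main__":
    uk = "https://example.co.uk/services/web-design/"
    ie = "https://example.ie/services/web-design/"
    # The UK page points to the Irish page, but not the other way round.
    print(missing_return_links({uk: {uk, ie}, ie: {ie}}))
```

Each pair in the result is a page that declares an alternate without receiving a link back, which is the reciprocity failure described above.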
A few points specific to UK and Irish businesses:
Use en-gb for UK content (the GB region code covers the whole United Kingdom, including Northern Ireland), en-ie for Irish content targeting the Republic, and x-default as the fallback for users who don’t match a specific locale. Do not use en alone as your primary tag if you are targeting geographically specific audiences; it is too broad to be useful.
If your site operates a single domain targeting both markets (common for Northern Ireland businesses), hreflang becomes particularly important. Without it, Google may serve the wrong page variation to users in Dublin versus Belfast, suppressing performance in one or both markets.
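One way to keep these annotations consistent is to generate every `<url>` entry from a single mapping of locale to URL, so each version automatically carries the full set of alternates. The following is a sketch under that assumption; the function name `hreflang_urlset` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def hreflang_urlset(cluster, x_default):
    """Build sitemap XML for one group of equivalent pages.

    `cluster` maps an hreflang code (e.g. "en-gb") to that version's URL.
    Every version gets its own <url> entry listing all alternates.
    """
    # Register prefixes so the output uses xmlns="..." and xmlns:xhtml="...".
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)

    links = list(cluster.items()) + [("x-default", x_default)]
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in cluster.values():
        node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
        for code, href in links:
            ET.SubElement(
                node,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": code, "href": href},
            )
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    cluster = {
        "en-gb": "https://example.co.uk/services/web-design/",
        "en-ie": "https://example.ie/services/web-design/",
    }
    print(hreflang_urlset(cluster, cluster["en-gb"]).decode("utf-8"))
```

Generating both entries from the same mapping means a return link cannot be forgotten, which is the main attraction of the sitemap method on larger sites.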
ProfileTree’s digital strategy team routinely includes hreflang audits as part of international SEO reviews for clients operating across the island of Ireland.
How to Create and Submit Your Sitemap
The method you use to create your sitemap depends on your CMS. Most modern platforms handle this automatically once configured correctly.
WordPress: The most widely used approach is through a plugin. Yoast SEO generates a sitemap index automatically at yoursite.com/sitemap_index.xml, breaking content by type — posts, pages, categories, and custom post types. Rank Math works similarly and gives you more granular control over which post types are included. The key configuration decision is which content types to exclude. Password-protected pages, archive pages with thin content, and tag archives are usually better left out.
Check your sitemap at yourdomain.com/sitemap_index.xml once the plugin is active. If it doesn’t resolve, check that your WordPress permalink settings are not set to “Plain” — sitemaps require pretty permalinks.
Shopify: Shopify generates a sitemap automatically at yourstore.com/sitemap.xml. You cannot customise it directly, but you can control which pages appear by managing their visibility in the Shopify admin. If you have pages you don’t want indexed (duplicate product variants, for example), use the noindex meta tag rather than removing them from Shopify’s navigation.
Headless CMS and JavaScript frameworks (Next.js, Nuxt): This is the area most sitemap guides document poorly. In a Next.js application, you can generate a dynamic sitemap using the next-sitemap package or by building a custom API route that queries your CMS for all published URLs. The critical requirement is that the sitemap reflects only published, live content, not draft or preview URLs that may exist in your CMS but should not be indexed.
For Nuxt, the @nuxtjs/sitemap module handles most cases, but again, ensure your sitemap generation excludes any URLs that shouldn’t be publicly indexed.
Submitting to Google Search Console:
- Sign in to Search Console and select your property
- Go to Sitemaps in the left navigation
- Enter the sitemap URL (e.g. sitemap_index.xml) in the “Add a new sitemap” field
- Click Submit
Google does not re-read your sitemap on every crawl, and its old sitemap “ping” endpoint was retired in 2023. To signal a significant update, keep lastmod values accurate and resubmit the sitemap through Search Console after major content changes.
Submitting to Bing Webmaster Tools:
- Sign in to Bing Webmaster Tools with your Microsoft account
- Select your site
- Go to Sitemaps from the left menu
- Enter your sitemap URL and click Submit
Submitting to Bing also covers Yahoo results, since both run on the same index.
Advanced Sitemap Management for Large Sites
Once a site exceeds 50,000 URLs or 50MB uncompressed per sitemap file, you need a sitemap index. A sitemap index is itself an XML file that does not list individual URLs — it lists other sitemaps.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-pages.xml</loc>
<lastmod>2026-06-01</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2026-06-01</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-blog.xml</loc>
<lastmod>2026-06-01</lastmod>
</sitemap>
</sitemapindex>
The practical benefit of splitting sitemaps by content type is diagnostic clarity. When you can see in Search Console that 94% of your product sitemap pages are indexed but only 61% of your blog sitemap pages are, you know immediately where the problem sits. A single monolithic sitemap gives you no such signal.
For large e-commerce sites, a common structure is: one sitemap for core service or category pages; one for product pages (broken further if needed); one for blog or editorial content; one for images. The index file ties them together and is the only URL you need to submit.
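The split-by-type structure can be sketched in a few lines of Python. This is a minimal example, not a production generator: the function names, the `sitemap-<type>.xml` naming pattern, and the `base_url` parameter are assumptions chosen to match the index example above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def plan_child_sitemaps(urls_by_type, base_url, limit=MAX_URLS_PER_FILE):
    """Split each content type into files of at most `limit` URLs.

    Returns a dict mapping child sitemap URL -> list of page URLs.
    """
    plan = {}
    for content_type, urls in urls_by_type.items():
        chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
        for number, chunk in enumerate(chunks, start=1):
            # Only number the files when a type needs more than one.
            suffix = "" if len(chunks) == 1 else f"-{number}"
            plan[f"{base_url}/sitemap-{content_type}{suffix}.xml"] = chunk
    return plan


def build_sitemap_index(child_urls, lastmod):
    """Return the index file (bytes) that lists every child sitemap."""
    index = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for child in child_urls:
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = child
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)
```

For simplicity the sketch applies one lastmod to every child; a real generator would more likely record the latest change date within each child file.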
Dynamic sitemaps for large sites also benefit from a generation schedule. Rather than regenerating on every page save (which can cause performance issues), most enterprise implementations regenerate sitemaps on a scheduled basis — hourly or daily, depending on publication frequency. The lastmod date in the sitemap then accurately reflects when content was last changed, which is a genuine signal to crawlers rather than a placeholder.
Auditing and Troubleshooting Common Errors in Search Console
Search Console’s Sitemaps report shows you submitted sitemaps, the number of URLs detected versus indexed, and any errors. The gap between detected and indexed is your starting point for any audit.
“Sitemap could not be read”: Three common causes. First, the sitemap URL is returning a non-200 status code: check that the URL is correct and publicly accessible. Second, the XML is malformed: a missing closing tag or invalid character breaks the entire file. Validate your sitemap against the schema published at sitemaps.org or with an online XML validator. Third, the Content-Type header is wrong: your sitemap must be served as application/xml or text/xml, not text/html.
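The three checks can be combined into one diagnostic function. The sketch below assumes you have already fetched the sitemap with a tool of your choice and have its status code, Content-Type header, and raw body to hand; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

ACCEPTED_TYPES = {"application/xml", "text/xml"}


def diagnose_sitemap_response(status, content_type, body):
    """Return a list of likely reasons the sitemap cannot be read."""
    problems = []

    if status != 200:
        problems.append(f"HTTP status is {status}, expected 200")

    # Drop any "; charset=..." suffix before comparing the media type.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in ACCEPTED_TYPES:
        problems.append(f"Content-Type is {media_type!r}, expected an XML type")

    try:
        ET.fromstring(body)
    except ET.ParseError as error:
        problems.append(f"XML is malformed: {error}")

    return problems
```

An empty list means none of the three common causes applies. The parse step only confirms the file is well-formed XML, not that it follows the sitemap schema.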
“Sitemap contains URLs which are blocked by robots.txt”: This means your robots.txt file has a Disallow rule covering URLs that appear in your sitemap. Fix this either by removing those URLs from the sitemap or by adjusting your robots.txt rules. Having a URL in your sitemap and also blocked in robots.txt sends a contradictory signal, and most crawlers will ignore the URL.
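To find the conflicting URLs before Search Console reports them, you can compare your sitemap URLs against your robots.txt rules. This sketch uses Python's standard-library `urllib.robotparser`; the function name and the `Googlebot` default are assumptions. Python's parser does simple prefix matching and does not implement Google's wildcard extensions, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(robots_txt, sitemap_urls, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for `user_agent`.

    `robots_txt` is the text content of the robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls
            if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /checkout/\n"
    urls = [
        "https://example.com/services/",
        "https://example.com/checkout/basket/",
    ]
    print(blocked_sitemap_urls(robots, urls))
```

Any URL returned here should either come out of the sitemap or have its Disallow rule reviewed.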
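High “Discovered - currently not indexed” count: This is different from a sitemap error but surfaces in the same workflow. It means Google knows the URLs exist, typically from the sitemap or from links, but has not yet crawled them, usually because crawling was deferred to avoid overloading the site or because the pages were judged a low priority. If URLs instead show as “Crawled - currently not indexed”, Google has fetched them and chosen not to index them, which is usually a content quality judgement. In both cases the sitemap is rarely the problem; crawl demand and content quality are.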
“URL is unknown to Google”: This is a crawler orphaning issue — the URL exists on your site but Google has no path to it through internal links. Being in the sitemap helps with discovery, but the longer-term fix is ensuring the page is reachable through your internal link structure. Sitemaps supplement internal linking; they don’t replace it. ProfileTree’s website development team checks crawl accessibility as a standard part of every build and website management review.
Keeping your sitemap current: Set a quarterly review cycle at a minimum. After any significant site restructure, redirect campaign, or content deletion, update the sitemap before the next crawl cycle picks up the changes. If you’re managing a large WordPress site, check that your SEO plugin is still regenerating the sitemap correctly after any major plugin updates or hosting migrations.
Conclusion
A sitemap is one of the most straightforward technical SEO tools available, but it does require active management to deliver its full value. Build it clean, keep only indexable canonical URLs, update it when content changes, and audit it quarterly. For UK and Irish businesses working across multiple regions, adding hreflang to your sitemap is one of the highest-value configuration changes you can make with relatively little development effort.
If your sitemap is returning errors, your indexed page count is declining, or you’re planning a significant site restructure, get in touch with the ProfileTree team for a technical SEO review.
FAQs
Do I need a sitemap if my site is small?
Yes. Even on a 10-page website, a sitemap gives search engines a direct signal about your content without relying solely on internal linking. It takes minutes to set up and the downside risk is zero.
What is the difference between an XML sitemap and a robots.txt file?
A sitemap is a map for discovery — it tells crawlers where your pages are. A robots.txt file is a set of access rules — it tells crawlers which areas they are allowed or not allowed to access. They serve different purposes and both need to be configured correctly. Having a URL in your sitemap but blocked in robots.txt is a contradiction that will cause indexing problems.
Why is Google Search Console showing “Sitemap could not be read”?
The three most common causes are: the sitemap URL is returning a non-200 status code; the XML file contains a syntax error; or the file is being served with the wrong Content-Type header. Validate your sitemap using an online XML validator and check that the URL returns a 200 status in a browser.
Can a sitemap help get my pages indexed faster?
It can reduce discovery lag, particularly for new pages that don’t yet have internal links pointing to them. But it does not guarantee indexing. Content quality, site authority, and internal link structure are the primary indexing factors.
Is there a limit to how many URLs I can include in a single sitemap?
Yes. The limit is 50,000 URLs and 50MB uncompressed per sitemap file. Sites exceeding either limit need a sitemap index file that points to multiple child sitemaps.
Does my sitemap need to be in the root directory?
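Not strictly, but it is the safest choice. yourdomain.com/sitemap.xml is the conventional location, and under the sitemap protocol a file covers only URLs at or below its own directory by default, so root placement lets one file cover the whole site. You can place it elsewhere and reference it in your robots.txt file using the Sitemap: directive, but root placement is simpler and reduces the chance of it being missed.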
How often should I update my sitemap?
Whenever content changes significantly. Most CMS plugins handle this dynamically. For static or manually maintained sitemaps, review quarterly at a minimum, and immediately after any redirect campaign, content deletion, or site restructure.
What pages should I exclude from my sitemap?
Any page that returns a non-200 status code, any page with a canonical tag pointing elsewhere, any page you have marked noindex, any redirect source URLs, and any thin or duplicate content pages you don’t want indexed.