Generating an XML sitemap

The following Bash script crawls your site with wget and generates a valid XML sitemap, which Google and other search engines can use to index your site:

#!/bin/bash

# generate an XML sitemap for this site

# script adapted from: http://www.lostsaloon.com/technology/how-to-create-an-xml-sitemap-using-wget-and-shell-script/

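# load site settings (blog_domain and blog_url) if a config file exists
[ -f .site_config ] && . .site_config

# base URL to crawl, built from the settings above
sitedomain="https://${blog_domain}${blog_url}"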

mv sitemap.xml sitemap_prev.xml 2>/dev/null

echo "Generating sitemap.xml, please wait.."

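# crawl the site without downloading anything, logging each visited URL
wget --spider --recursive --level=inf --no-verbose --output-file=linklist.txt "$sitedomain"

# extract the URL that follows "URL:" on each log line, then de-duplicate
grep -i URL linklist.txt | awk -F 'URL:' '{print $2}' | awk '{print $1}' \
  | sort -u | sed '/^$/d' > sortedurls.txt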

header='<?xml version="1.0" encoding="UTF-8"?><urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'

echo "$header" > sitemap.xml

# keep only directory-style and HTML page URLs; skip images, CSS, scripts, etc.
while IFS= read -r p; do
  case "$p" in
    */ | *.html | *.htm)
      echo "<url><loc>$p</loc></url>" >> sitemap.xml
      ;;
    *)
      ;;
  esac
done < sortedurls.txt

echo "</urlset>" >> sitemap.xml

rm linklist.txt sortedurls.txt &>/dev/null

[ ! -f sitemap.xml ] && exit 1

rm sitemap_prev.xml 2>/dev/null

exit 0
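
The script writes each URL into its <loc> element unchanged. The sitemap protocol requires URLs to be entity-escaped, so a URL containing a character such as & would produce invalid XML. The following is a minimal sketch of how the filtering loop could escape URLs before writing them. The function names xml_escape and urls_to_entries are hypothetical and not part of the script above, and the example.com URLs are placeholders.

```shell
#!/bin/bash

# Hypothetical helper: escape the five XML special characters read from stdin.
# The ampersand is handled first so that later entities are not double-escaped.
xml_escape() {
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g' \
      -e "s/'/\&apos;/g"
}

# Hypothetical helper: read URLs on stdin, keep only directory-style and
# HTML page URLs (the same filter as the script above), and print one
# escaped <url> entry per line.
urls_to_entries() {
  local url
  while IFS= read -r url; do
    case "$url" in
      */ | *.html | *.htm)
        printf '<url><loc>%s</loc></url>\n' "$(printf '%s\n' "$url" | xml_escape)"
        ;;
    esac
  done
}

# Example run with placeholder URLs: the stylesheet is skipped and the
# ampersand in the last URL is written as &amp;
printf '%s\n' \
  'https://example.com/' \
  'https://example.com/style.css' \
  'https://example.com/?a=1&b=/' | urls_to_entries
```

In the script itself, the while loop could then be replaced by something like "urls_to_entries < sortedurls.txt >> sitemap.xml". Using sed for the escaping avoids differences between Bash versions in how & is treated inside parameter-expansion replacements.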