You’re a busy person. You’ve got a big website and a small team (or no team at all), so some of the more advanced SEO tasks can get ignored. Website crawling is one of those things that’s easy to let fall by the wayside.

This is a mistake.

Crawling your website uncovers technical problems that affect how both humans and search engines interact with your pages. A crawl will diagnose, or help prevent, all sorts of issues that tank user experience and SEO:

  1. Duplicate content
  2. Broken pages
  3. Broken links
  4. Bad redirects
  5. Insecure pages
  6. Non-indexable pages

Start Your Site Crawl Now

Duplicate Content

Duplicate content is something everyone doing SEO is concerned with. And for good reason: duplicate pages often don’t rank highly, and can get left out of search results altogether. Even unique pages hosted on domains that have a lot of duplicate content can struggle to rank.

Duplicate content warning in SERP

When the concept of "duplicate content" first came out, the big focus was on plagiarized, scraped and syndicated content.

However, you can end up with duplicate content on your website via:

  • CMS issues
  • Multilingual sites
  • WWW resolve
  • Migration from HTTP to HTTPS

But these are pretty technical problems. What’s a non-techie to do?

Crawl your site, that’s what.

Site Crawl analyzes the pages on your website and checks their content against each other, flagging text that’s similar.
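Under the hood, a similarity check like this can be sketched with Python's standard library. This is just an illustration of the idea, not Site Crawl's actual algorithm; the page texts and threshold are made up:

```python
from difflib import SequenceMatcher

def similarity(text_a: str, text_b: str) -> float:
    """Return a 0-1 similarity ratio between two page texts."""
    return SequenceMatcher(None, text_a, text_b).ratio()

page_a = "Buy red widgets online. Free shipping on all widgets."
page_b = "Buy blue widgets online. Free shipping on all widgets."

# Pages above a chosen threshold (e.g. 0.9) get flagged as near-duplicates.
print(similarity(page_a, page_b))
```

A real crawler compares every pair of crawled pages (usually on normalized body text, not raw HTML), but the flagging logic is the same: score each pair and surface the ones above a threshold.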

Site Crawl page content issues

It also takes a look at important on-page elements that Google uses as indicators of duplicate content, like title tags and meta descriptions.

Site Crawl duplicate titles

Canonical and Hreflang

Canonical URLs and canonical tags help search engines find the original version of duplicate pages. Hreflang tags tell search engines which version of a page to serve based on the user's language and region.

These tags are an important part of avoiding duplicate content on your website.

If you’ve got a big site with lots of similar/duplicate pages, like an ecommerce site, you’ve probably got a lot of these tags. Checking these tags manually doesn’t make much sense unless you have an almost concerning amount of time on your hands. The good news is that with a crawler you can find every instance of a canonical tag, as well as instances of canonical tags that...

  • Conflict with your XML sitemap
  • Don’t load properly
  • Differ from your Open Graph entry
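To get a feel for what a crawler extracts here, below is a minimal Python sketch that pulls canonical and hreflang tags out of a page's HTML. The URLs are placeholders, and a real crawler would also cross-check the results against your XML sitemap and Open Graph tags:

```python
from html.parser import HTMLParser

class LinkTagExtractor(HTMLParser):
    """Collect canonical and hreflang <link> tags from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.canonical = None
        self.hreflang = {}   # language code -> alternate URL

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") == "canonical":
            self.canonical = a.get("href")
        elif a.get("rel") == "alternate" and "hreflang" in a:
            self.hreflang[a["hreflang"]] = a.get("href")

html_doc = """
<head>
  <link rel="canonical" href="https://example.com/widgets">
  <link rel="alternate" hreflang="en" href="https://example.com/widgets">
  <link rel="alternate" hreflang="fr" href="https://example.com/fr/widgets">
</head>
"""

parser = LinkTagExtractor()
parser.feed(html_doc)
print(parser.canonical)   # the canonical URL the page declares
print(parser.hreflang)    # alternate versions keyed by language
```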

Site Crawl canonical tags

Broken Pages and Links

As you can imagine, broken pages and links are not good for anyone. Sending people to nonexistent or otherwise inaccessible pages will cause users to flee your website. Plus, too many pages that return error codes will have a serious impact on your domain’s authority and trustworthiness.

Checking internal links is super important because these links move not only users from page to page, but link juice as well. These broken links represent a double whammy of reduced user experience and poor SEO.

Crawling your website is just about the only reliable way to check all of your pages and links for errors. Do you really want to visit every page and click every link?

I thought not.

Crawlers work by accessing pages via your links. They'll also check external links, but won't actually crawl those domains. So by design, an SEO crawler verifies both your internal and external links.

Site Crawl checks the HTTP status code for each URL it encounters. It will then show you each URL that returns an error code that blocks users from accessing that page:

  • 4xx client errors
  • 5xx server errors
  • 3xx redirect errors
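Bucketing status codes this way is a simple range check. Here's a Python sketch of the kind of classification a crawl report does; the bucket names are made up for illustration:

```python
def classify_status(code: int) -> str:
    """Bucket an HTTP status code the way a crawl report typically does."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"       # followed, then checked for chains/loops
    if 400 <= code < 500:
        return "client error"   # e.g. 404 Not Found, 410 Gone
    if 500 <= code < 600:
        return "server error"   # e.g. 500, 503
    return "other"

for code in (200, 301, 404, 503):
    print(code, classify_status(code))
```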

Site Crawl HTTP error section

Redirect Errors

While redirects are signaled by HTTP status codes, they're considered a beast of their own. That's because returning a 3xx HTTP status isn't a problem in itself. SEO problems with redirects arise when:

  • A redirect points at another redirect (redirect chain)
  • Two redirects point at each other (redirect loop)
  • A redirect points at a URL that returns an error code (broken redirect)

These redirect errors result in increased load times (chains) and dead links (broken redirects). Most browsers won’t even let a user enter a redirect loop, displaying an error page instead.
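Given a map of which URL redirects where, detecting all three problems is a simple traversal. This Python sketch uses dictionaries as stand-ins for real HTTP requests; the function name, paths and hop limit are illustrative:

```python
def trace_redirect(start: str, redirects: dict, statuses: dict, max_hops: int = 10):
    """Follow a redirect map and report chains, loops, and broken targets.

    `redirects` maps a URL to its redirect target; `statuses` maps final
    URLs to their HTTP status codes (a stand-in for real HTTP requests).
    """
    seen, url = [start], start
    while url in redirects:
        url = redirects[url]
        if url in seen:
            return "redirect loop", seen
        seen.append(url)
        if len(seen) > max_hops:
            return "redirect chain too long", seen
    if statuses.get(url, 200) >= 400:
        return "broken redirect", seen
    if len(seen) > 2:
        return "redirect chain", seen
    return "ok", seen

redirects = {"/old": "/older", "/older": "/new", "/a": "/b", "/b": "/a"}
statuses = {"/new": 200}
print(trace_redirect("/old", redirects, statuses))  # /old -> /older -> /new is a chain
print(trace_redirect("/a", redirects, statuses))    # /a and /b loop forever
```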

HTTP Assets On HTTPS Pages

Using HTTPS URLs is a really, really good idea. It’s more secure for your users and for you, and Google uses it as a ranking signal. So it’s a good thing you migrated over to HTTPS. But did you make sure all your images, CSS and JavaScript files moved too?

Serving HTTP assets on secure pages triggers browser security warnings every time a user accesses the page, which is incredibly annoying. Plus, your site won’t be fully secure. Google won’t like any of that.

Use Site Crawl to make sure you didn’t miss any of those pesky little files when you migrated, or to find the ones you did. When it comes to HTTP within HTTPS, even the littlest file can cause a huge headache.
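Finding HTTP assets on an HTTPS page boils down to scanning sub-resource URLs. Here's a minimal Python sketch, assuming we only care about a few common tags; a real checker would also cover stylesheet `rel` values, `srcset`, inline CSS and more:

```python
from html.parser import HTMLParser

# Tags and attributes that load sub-resources; an http:// URL in any of
# these on an https:// page is mixed content.
ASSET_ATTRS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}

class MixedContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        attr = ASSET_ATTRS.get(tag)
        if not attr:
            return
        url = dict(attrs).get(attr, "")
        if url.startswith("http://"):
            self.insecure.append(url)

page = """
<img src="http://example.com/logo.png">
<script src="https://example.com/app.js"></script>
"""
finder = MixedContentFinder()
finder.feed(page)
print(finder.insecure)  # only the http:// asset is flagged
```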

Non-Indexable Pages

The two main ways to control how Google crawls and indexes pages are robots.txt files and meta robots tags. There are lots of reasons you’d want to make a page, folder or site non-indexable:

  • You want to avoid duplicate and thin content problems
  • You don’t want search engines wasting crawl budget on useless pages
  • You’ve got particular pages or file types you don’t want to be crawled

However, getting a little carried away with the Disallow directive (or messing up a wildcard), or misusing meta robots tags, is one of the main causes of organic traffic declines.

And, unfortunately, getting even one character wrong here can cause whole sections of your site to fall out of Google’s index.

Fortunately, your SEO crawler will access and read your robots.txt file before crawling your site. So Site Crawl knows right away what pages Google won’t be able to access. And when the bot lands on a page, it checks for the "NoIndex" attribute in the meta robots tag.
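You can see the robots.txt side of this with Python's standard library. The robots.txt content below is hypothetical, and `parse()` takes the file's lines directly, so no network request is needed for this sketch:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt blocking the cart and internal search.
robots_txt = """
User-agent: *
Disallow: /cart/
Disallow: /search
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# A crawler asks this question for every URL before fetching it.
for path in ("/products/widget", "/cart/checkout", "/search?q=widgets"):
    print(path, rp.can_fetch("*", "https://example.com" + path))
```

Note how `Disallow: /search` is a prefix match: one missing character (say, writing `Disallow: /` by mistake) would block the entire site, which is exactly the kind of slip a crawl report surfaces.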

Site Crawl non-indexable pages

It also checks for the "NoFollow" meta robots attribute. The “NoFollow” attribute tells bots not to follow any of the links on the page. So even if the page is indexable, it won’t pass any link juice or connect crawlers to the rest of the site.

These non-indexable pages technically aren’t errors. Remember, there are reasons to NoIndex a page. But you should definitely check the Indexing section of your crawl report. If the URLs listed there don’t make sense, you need to check your robots.txt file and any meta robots tags you have.

WooRank Is Here To Help

Crawling is one of those SEO things that many people might not consider, particularly if they’re not dedicated marketers. However, it’s a super necessary step to discover problems that are preventing you from ranking or to prevent those issues from arising in the first place.

Many crawlers are intimidating, just creating a list of URLs with their corresponding attributes and leaving the analysis up to you. That’s one of the reasons WooRank created Site Crawl - it does the analysis for you and alerts you to anything that needs your attention. However, whether you use Site Crawl or not, you should still be regularly crawling your site to prevent small mistakes from becoming big problems for your website.