LIBERTAS
Search Engine Crawlers

What is a Search Engine Crawler?

Search engine crawlers, also called spiders, robots or just bots, are programs or scripts that systematically and automatically browse pages on the web. The purpose of this automated browsing is typically to read the pages the crawler visits in order to add them to the search engine’s index.

Search engines, such as Google, use web crawlers to read web pages and store a list of the words found on the page and where those words are located. They also collect usability data such as speed and HTTP error statuses encountered.

This data is stored in the search engines’ index — essentially huge databases of web pages.

When you perform a search on Google, you are actually searching Google’s index, not the actual web. Google then displays the indexed pages relevant to the query and provides links to the actual pages.
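
As a rough illustration of what "a list of the words found on the page and where those words are located" could look like, here is a minimal inverted-index sketch. It is not how Google's index is built; the page texts, URLs and function names are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {url: [positions]} for every page it appears on."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(url, []).append(position)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)

# Hypothetical crawled pages (URL -> extracted text).
pages = {
    "https://example.com/a": "Crawlers read web pages",
    "https://example.com/b": "Search engines index web pages",
}
index = build_index(pages)
print(search(index, "web pages"))  # both URLs match
print(search(index, "crawlers"))   # only the first URL matches
```

The search runs against the stored index only, never against the live pages, which mirrors the point that a query searches the index rather than the web itself.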

Because the modern web contains many different types of content, and search engines offer ways to search specifically for each type, the biggest search engines run crawlers dedicated to specific types of pages or files. These content types include:

  • General web content
  • Images
  • Video
  • News
  • Ads
  • Mobile

Each type of crawler has a different user-agent. See what each user-agent is crawling for in our robots.txt guide.
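
Because each crawler identifies itself with its own user-agent, rules can be targeted at one crawler without affecting the others. The sketch below uses Python's standard urllib.robotparser to show the idea. The robots.txt content and the URLs are hypothetical, and Python's matching logic is only an approximation of how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /photos/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/photos/cat.jpg"
for agent in ("Googlebot", "Googlebot-Image"):
    # The image crawler is blocked from /photos/, the general crawler is not.
    print(agent, "may fetch", url, "->", parser.can_fetch(agent, url))
```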

How Do Search Engine Crawlers Work?

On a practical level, "crawling" happens when a crawler receives a URL to check, fetches the page and then stores it on a local computer. You can do this yourself by going to a page, right clicking and then clicking “Save As…”

Crawlers receive their URLs either by checking a domain’s sitemap or by following the links they find on other pages.
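
The fetch, store and follow-links cycle can be sketched in a few lines. This is a simplified illustration, not a description of any real crawler: the fetch function is passed in as a parameter, and a tiny in-memory dictionary (FAKE_WEB, a made-up name) stands in for the web so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: fetch a URL, store the page, queue its links."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)   # URLs already queued, so none is fetched twice
    stored = {}             # URL -> HTML, standing in for the local copy
    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:    # the page could not be fetched
            continue
        stored[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored

# A tiny in-memory "web" (hypothetical pages).
FAKE_WEB = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/missing">Old post</a>',
}

crawled = crawl(["https://example.com/"], FAKE_WEB.get)
print(sorted(crawled))
```

Starting from a single seed URL, the sketch discovers the other two pages purely by following links, and it skips the one link that leads nowhere.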

Sitemaps play an important role in this step as they provide crawlers with a nice, organized list of URLs to access. They also provide details that impact how Google decides to crawl each page.
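
To make the "organized list of URLs" concrete, the sketch below reads a small sitemap with Python's standard XML parser and pulls out each URL along with its optional last-modified date. The sitemap content, including the date, is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, for illustration only.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog</loc>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def read_sitemap(xml_text):
    """Return a list of (loc, lastmod) pairs; lastmod is None if absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod))
    return entries

for loc, lastmod in read_sitemap(SITEMAP_XML):
    print(loc, lastmod)
```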

What is Crawl Budget?

Of course, even Google has limited resources (no matter how high that limit is). Therefore, Googlebot works with what’s known as a "crawl budget". Crawl budget is simply the number of URLs on a website that Google wants to and can crawl.

There are two ingredients that go into Google’s crawl budget for a website:

  • Crawl rate limit: Google doesn’t want to impact a website’s user experience while crawling it, so it limits the number of pages its crawler can fetch at once.

  • Crawl demand: To put it simply, this is Google’s desire to crawl your site. Google isn’t interested in crawling URLs that don’t look like they add value to users (URL parameters, faceted navigation, session identifiers, etc.). So even if Googlebot doesn’t reach its crawl rate limit, it won’t waste its own resources crawling these pages.
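
To get a feel for the kinds of URLs mentioned under crawl demand, the following sketch flags URLs that carry filter or session parameters. It says nothing about how Google actually evaluates URLs. The parameter names in LOW_VALUE_PARAMS are hypothetical, and a real site would substitute its own.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names; real sites use their own.
LOW_VALUE_PARAMS = {"sessionid", "sort", "color", "size"}

def looks_low_value(url):
    """Return True if the URL carries any parameter from LOW_VALUE_PARAMS."""
    query = urlsplit(url).query
    params = parse_qs(query, keep_blank_values=True)
    return any(name.lower() in LOW_VALUE_PARAMS for name in params)

urls = [
    "https://example.com/shoes",
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/cart?sessionid=abc123",
]
for url in urls:
    print(url, "->", "low value" if looks_low_value(url) else "ok")
```

A check like this can help you spot which of your own URL patterns are worth keeping out of a crawler's path.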

The good news is that crawl rate limit and crawl demand can change depending on what Google finds on your website. These factors impact your site’s crawl budget:

  • Site speed: Google doesn’t like to wait, so fast pages will entice it to crawl more pages. Plus, speed is a sign of a healthy website, so Google will be able to put more resources into the crawl.

  • Error pages: If a server responds to many of Google’s requests with error codes, the site looks like it has a lot of problems, which discourages Google from crawling its pages.

  • Popularity: The more popular Google thinks your page is, the more often it will crawl it in order to keep it up to date in its index.

  • Freshness: It’s no secret that Google likes fresh (new, up to date) content. Publishing new content will tell Google that your website has new pages to crawl on a regular basis. Fresher content means more crawls.

Alternate URLs, such as AMP or hreflang versions of a page, may also be crawled by Google, as may embedded resources like JavaScript and CSS. All of these count against your crawl budget.

What is Search Indexing?

Once a page has been crawled, Google needs to extract information about the page to store in its index. Search engines use various algorithms and heuristics to determine which words in the page content are important and relevant. Adding semantic markup like Schema.org will help search engines better understand your page.
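
As one possible way to add Schema.org markup, the sketch below builds a JSON-LD snippet for an article. The helper name and the sample values are invented for the example, and the properties you need depend on your content type.

```python
import json

def article_json_ld(headline, author_name, date_published):
    """Build a JSON-LD <script> snippet describing an article."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": date_published,
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'

# Hypothetical values, for illustration only.
snippet = article_json_ld("Search Engine Crawlers", "Jane Doe", "2024-01-15")
print(snippet)
```

The resulting script tag would go in the page's HTML, where a search engine can read the structured description alongside the visible content.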

Once a page has been fetched, stored and parsed, the information extracted from it is saved in the search engine’s index. When someone uses a query in a search, the information in the index is used to determine the pages relevant to that query.

How to Optimize Google’s Crawl

In order to rank in search results, a page must first be indexed. In order to be indexed a page must first be crawled. Therefore, crawlability (or lack thereof) has a huge impact on SEO.

You can’t directly control what pages Google’s crawlers decide to crawl, but you can give them clues as to which pages would be best for them to crawl and which ones they should ignore.

There are three main ways to help control when, where and how Google crawls your pages. They aren’t absolute (Google has a mind of its own), but they will help ensure that your most important pages are found by crawlers.

The role of Robots.txt

Before a crawler fetches pages from a site, the very first thing it does is open the site’s robots.txt file. This makes the robots.txt file the first opportunity to point crawlers away from low-value URLs.

You can use the robots.txt disallow directive to keep crawlers away from pages you don’t necessarily care about appearing in search results:

  • Thank you or order confirmation pages
  • Duplicate content
  • Site search result pages
  • Out of stock or other error pages
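
A set of disallow rules for pages like these might look like the robots.txt content embedded in the sketch below, which also checks the rules with Python's standard urllib.robotparser. The paths are placeholders, and Python's parser only approximates how a given search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thank-you
Disallow: /order-confirmation
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path):
    """Return True if a crawler may fetch the given path on example.com."""
    return parser.can_fetch("Googlebot", "https://example.com" + path)

for path in ("/thank-you", "/search?q=shoes", "/products/shoes", "/static/site.css"):
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

Each Disallow line matches by prefix, so a rule such as /search also blocks any other path that starts with the same characters. It is worth testing rules against real URLs before publishing them.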

Do not use your robots.txt file to disallow embedded resources like JavaScript or CSS files. Crawlers do have to spend crawl budget on these URLs, but Google needs to be able to fully render a page in order to understand it correctly.

Blocking CSS and JS files will result in inaccurate or incomplete crawling and indexing. It causes Google to see a page differently from how humans see it, which could even result in reduced rankings.

The role of XML sitemaps

Read the guide to XML sitemaps to learn more about how they impact crawling.

XML sitemaps are like the opposite of the robots.txt file. They tell search engines what pages they should crawl. And while Google isn’t obligated to crawl all URLs in a sitemap (unlike robots.txt, which is obligatory), you can use the information included about pages to help Google crawl pages more intelligently.

Your sitemap also helps make sure Google can find the pages on your site, which makes it a vital tool if your internal linking structure isn’t very strong.
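
If your site doesn't already produce a sitemap, generating a basic one is straightforward. The sketch below writes the required loc element and the optional lastmod element for each page. The function name and the sample entries are made up for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages, for illustration only.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/blog", None),
])
print(sitemap_xml)
```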

Using nofollow tags

Remember that crawlers move from page to page by following links. However, you can add the rel="nofollow" attribute to a link to tell crawlers not to follow it. When a search engine encounters a nofollow link, it will ignore it.

You can nofollow a link two ways:

  • Meta tag: If you don’t want search engines to crawl any link on a page, add the content="nofollow" attribute to the robots meta tag. The tag looks like this:

    <meta name="robots" content="nofollow">

  • Anchor tags: If you want a granular approach to nofollow links, add the rel="nofollow" attribute to the actual link tag, like this:

    <a href="https://www.example.com" rel="nofollow">anchor text</a>

This way crawlers won’t follow that link, but they can still follow other links on the page.
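
To see both forms of nofollow from a crawler's point of view, the sketch below extracts only the links that a nofollow-respecting crawler would follow. It is an illustration built on Python's standard HTML parser, not a description of how any search engine processes pages, and the class and function names are made up.

```python
from html.parser import HTMLParser

class FollowableLinks(HTMLParser):
    """Collect hrefs, honoring both page-level and link-level nofollow."""
    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by <meta name="robots" content="nofollow">
        self.candidates = []        # links without rel="nofollow"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = [d.strip().lower() for d in content.split(",")]
            if "nofollow" in directives:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            rel_values = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel_values:
                self.candidates.append(attrs["href"])

def followable_links(html):
    """Return the links to follow; none at all if the page is nofollow."""
    parser = FollowableLinks()
    parser.feed(html)
    parser.close()
    return [] if parser.page_nofollow else parser.candidates

# Hypothetical pages, for illustration only.
granular = (
    '<a href="https://www.example.com/a" rel="nofollow">A</a>'
    '<a href="https://www.example.com/b">B</a>'
)
whole_page = (
    '<meta name="robots" content="nofollow">'
    '<a href="https://www.example.com/b">B</a>'
)
print(followable_links(granular))    # only the link without rel="nofollow"
print(followable_links(whole_page))  # no links at all
```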

Using rel="nofollow" on a link won’t pass link juice to the destination page, but that link still counts against the amount of link juice available to pass to each link on the page.

In both instances (meta tag or anchor tag), the destination URL could still be crawled and indexed if another link is pointing at that page. So if you want to keep crawlers away from a page, disallow it via robots.txt rather than relying on nofollow for internal links.

You might be wondering how using the "noindex" attribute in the meta robots tag affects crawling. In short, it doesn’t. Google will still crawl a page with the noindex attribute and follow all the dofollow links on the page. It just won’t store the page and its data in the index.

Finding Crawl Errors

Crawl errors occur when Google tries to fetch a page but is unable to access a URL for some reason. Crawl errors can occur on a site-wide level (DNS, server downtime or robots.txt issues), or on a page level (timeout, soft 404, not found, etc.).

The index coverage report in Google Search Console lists the pages Google has trouble crawling, along with the problem that prevents Google from properly indexing each of them.

