How A Web Crawler Works – Back To The Basics
The world wide web is full of information. If you want to know something, you can probably find the information online. But how can you find the answer you want, when the web contains trillions of pages? How do you know where to look?
Fortunately, we have search engines to do the searching for us. But how do search engines know where to look? How can search engines recommend a few pages out of the trillions that exist? The answer lies with web crawlers.
Web crawlers are computer programs that scan the web, ‘reading’ everything they find. Crawlers are also known as spiders, bots and automatic indexers. These crawlers scan web pages to see what words they contain, and where those words are used. The crawler turns its findings into a giant index. The index is basically a big list of words and the web pages that feature them. So when you ask a search engine for pages about hippos, the search engine checks its index and gives you a list of pages that mention hippos. Crawlers scan the web regularly so they always have an up-to-date index of the web.
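To make the idea of an index concrete, here is a minimal sketch in Python. It assumes the pages have already been downloaded and are held in memory as plain text; the URLs, the page text and the function names `build_index` and `search` are made up for illustration, and real search engine indexes are far more sophisticated.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Lowercase the text and split it into simple alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, word):
    """Return the pages that mention the word, sorted for stable output."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical pages standing in for content a crawler has already read.
pages = {
    "example.com/zoo": "Hippos and zebras live at the zoo",
    "example.com/rivers": "Hippos spend most of the day in rivers",
    "example.com/parachutes": "We sell parachutes",
}

index = build_index(pages)
print(search(index, "hippos"))  # ['example.com/rivers', 'example.com/zoo']
```

Looking up a word in the index is much faster than re-reading every page, which is why the crawler does the reading ahead of time.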
The SEO Implications Of Web Crawlers
Now that you know how web crawlers work, you can see that their behaviour has implications for how you optimize your website.
For example, you can see that, if you sell parachutes, it’s important that you write about parachutes on your website. If you don’t write about parachutes, search engines will never suggest your website to people searching for parachutes.
It’s also important to note that web crawlers don’t just pay attention to what words they find – they also record where the words are found. So the crawler knows that words contained in headings, metadata and the first few sentences are likely to be more important in the context of the page, and that keywords in prime locations suggest that the page is really ‘about’ those keywords.
So if you want search engines to know that parachutes are a big deal on your website, mention them in your headings, metadata and opening sentences.
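As a rough illustration of why location matters, the sketch below scores a keyword more highly when it appears in a prominent place. The weights, the section names and the `keyword_score` function are assumptions invented for this example; search engines do not publish the values they actually use.

```python
import re

# Hypothetical weights: the real values used by search engines are not public.
LOCATION_WEIGHTS = {"heading": 3.0, "meta": 2.0, "opening": 1.5, "body": 1.0}


def keyword_score(page, keyword):
    """Sum the keyword's occurrences, weighted by where each one appears."""
    keyword = keyword.lower()
    score = 0.0
    for location, text in page.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        score += LOCATION_WEIGHTS.get(location, 1.0) * words.count(keyword)
    return score


# Two made-up pages: one puts the keyword up front, one buries it in the body.
page_a = {
    "heading": "Parachutes for sale",
    "meta": "Buy parachutes online",
    "opening": "Our parachutes are tested",
    "body": "Contact us today",
}
page_b = {
    "heading": "About us",
    "meta": "Company history",
    "opening": "We started in 1990",
    "body": "We also sell parachutes",
}

print(keyword_score(page_a, "parachutes"))  # 6.5
print(keyword_score(page_b, "parachutes"))  # 1.0
```

Both pages mention parachutes, but under these assumed weights the first page looks much more like it is ‘about’ them.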
The fact that web crawlers regularly trawl the web to keep their index up to date also suggests that having fresh content on your website is a good thing.
Not All Content Can Be Found By Crawlers
Crawlers are very simple programs. They begin with a list of links to scan, and then follow the links they find. Sounds simple, right? Well, yes, it is, until you get to complex pages with dynamic content. Think about on-site search results, Flash content, forms, animations and other dynamic resources. There are many reasons why a crawler would not see your website in the same way that your human visitors do.
In fact, many businesses take steps to ensure that web crawlers ‘see’ all of the content available. This is particularly an issue for websites with lots of dynamic content which may only be visible after making a search.
Google Search Console can show you how many of your pages are indexed, which pages were excluded and why, along with any errors or warnings that were encountered when crawling your website.
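The basic crawl loop described in this section (start with a list of links, then follow the links you find) can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine works: the site is a made-up dictionary held in memory, and `crawl` and `get_links` are hypothetical names. A real crawler would download and parse each page instead.

```python
from collections import deque


def crawl(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seeds and following links."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Only queue links we have not come across before.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A toy site stored in memory, mapping each page to the links it contains.
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/parachutes", "/"],
    "/about": ["/"],
    "/products/parachutes": [],
    # Search results only exist after a visitor submits the search form,
    # so no other page links to this one.
    "/search?q=parachutes": ["/products/parachutes"],
}

visited = crawl(["/"], lambda url: SITE.get(url, []))
print(visited)  # ['/', '/products', '/about', '/products/parachutes']
```

Notice that the search results page never shows up in the output. Nothing links to it, so a crawler that only follows links has no way of finding it.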
The Role Of Robots.txt
You can give instructions to web crawlers by putting them in a file called robots.txt. You might want to ask web robots to ignore your website, or to skip certain sections. You might also want to help the robot to access every part of your website – particularly if you have a complex or dynamic website.
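Here is a small example of what such a file might contain, checked with the `urllib.robotparser` module from Python's standard library. The rules, the `ExampleBot` name and the example.com URLs are invented for illustration. The `Sitemap` line is a widely supported way of pointing crawlers at a list of your pages.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: all robots are asked to skip the admin area
# and the on-site search results, and are pointed at a sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a robot called "ExampleBot" may fetch each URL.
print(parser.can_fetch("ExampleBot", "https://www.example.com/products/parachutes"))  # True
print(parser.can_fetch("ExampleBot", "https://www.example.com/admin/settings"))  # False
```

Keep in mind that robots.txt is a request rather than a lock. Well-behaved crawlers follow it, but the file does not stop anyone from visiting those pages.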
Search Engine Indexes
Once the crawler has found information by crawling over the web, the program builds the index. The index is essentially a big list of all the words the crawler has found, as well as their location.
Why Indexing Is Only The Beginning…
In order to give you relevant responses to your search queries, search engines must interpret the links in their index. Search engines use algorithms, which are essentially complex equations, to ‘rate’ the value or quality of the links in their index.
So when you go searching for ‘parachutes’, the search engine will consider hundreds of factors when choosing which websites to offer you.
The factors that search engines consider include:
- when the page was published
- if the page includes text, pictures and video
- the quality of the content
- how well the content matches user queries
- how quickly your website loads
- how many links from other websites point to your content
- how many people have shared your content online
…and many more. Google is widely reported to consider over 200 factors when delivering search results.
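One simple way to picture how several factors combine into a single ranking is a weighted score. The sketch below is purely illustrative: the four factors, their weights, the 0-to-1 signal values and the page URLs are all made up, and real ranking algorithms are not public and are far more complex.

```python
# Hypothetical factors and weights, chosen only for illustration.
WEIGHTS = {"relevance": 0.5, "quality": 0.25, "speed": 0.125, "inbound_links": 0.125}


def score(signals):
    """Combine a page's signals (each between 0 and 1) into one number."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(pages):
    """Order page URLs from highest to lowest weighted score."""
    return sorted(pages, key=lambda url: score(pages[url]), reverse=True)


# Made-up signal values for three made-up pages.
pages = {
    "c.example/blog": {"relevance": 0.2, "quality": 0.4, "speed": 1.0, "inbound_links": 0.1},
    "a.example/parachutes": {"relevance": 0.9, "quality": 0.8, "speed": 0.5, "inbound_links": 0.7},
    "b.example/skydiving": {"relevance": 0.6, "quality": 0.9, "speed": 0.9, "inbound_links": 0.2},
}

print(rank(pages))
# ['a.example/parachutes', 'b.example/skydiving', 'c.example/blog']
```

In this toy example the fastest page ranks last, because under these assumed weights speed counts for much less than how well the content matches the query.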
Getting De-indexed By Google
Google does not want to recommend disreputable websites, so if you engage in a number of spammy practices you may be penalised by having your website de-indexed. What does that mean? It means that your website will no longer feature in Google’s index, and therefore your website will no longer appear in Google search results. As you can imagine, this is a catastrophic scenario for any business that has an online presence.
You can read more about crawlers in our Guide to Search Engine Crawlers.