How A Web Crawler Works – Back To The Basics
The world wide web is full of information. If you want to know something, you can probably find the information online. But how can you find the answer you want, when the web contains trillions of pages? How do you know where to look?
Fortunately, we have search engines to do the searching for us. But how do search engines know where to look? How can search engines recommend a few pages out of the trillions that exist? The answer lies with web crawlers.
What Are Web Crawlers?
Web crawlers are computer programs that scan the web, ‘reading’ everything they find.
They crawl entire websites by following internal links, allowing them to understand how websites are structured, along with the information that they include.
Search engine web crawlers (also known as spiders or search engine bots) scan web pages to understand the content they contain and the topics they cover.
The crawler then stores its findings in a giant index, essentially the biggest library in the world, which the search engine queries based on what a user is searching for.
So when you ask a search engine for pages about hippos, the search engine checks its index and gives you a list of pages it deems to be most relevant.
Search engine crawlers scan the web regularly so they always have an up-to-date index of the web.
Matt Cutts, a former member of Google's search quality team, published a video explaining this process. While it may be slightly outdated, it still gives a good explanation of how a search engine crawler works.
To learn more about how search engine crawlers work, check out Google's guide to How Search Works.
The SEO Implications Of Web Crawlers
Now that you know how web crawlers work, you can see that their behavior has implications for how you optimize your website.
If you want to optimize a page on a pet website around the keyword 'Cocker Spaniel puppies', it’s important that you actually write about Cocker Spaniel puppies within the content. If you don’t include Cocker Spaniel related keywords, search engines may not see your page as relevant for searchers looking for this topic.
Covering the topic properly makes the page relevant to anyone searching for information on Cocker Spaniel puppies, and therefore a good page for search engines to return to searchers.
It’s also important to note that while web crawlers analyze the keywords they find within a web page, they also pay attention to where the keywords are found.
So the crawler is likely to treat keywords appearing in headings, meta tags and the first few sentences as more important in the context of the page. Keywords in these prime locations signal that the page is really ‘about’ them.
So if you want search engines to know that Cocker Spaniels are a big deal on your website, mention them in your headings, meta data and opening sentences.
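To make the idea of 'prime locations' concrete, here is a minimal Python sketch that scores a keyword more highly when it appears in the title, headings or meta description than in ordinary body text. The weights, the `KeywordLocator` class and the `location_score` function are hypothetical and chosen purely for illustration. Real search engines do not publish how they weight keyword placement.

```python
from html.parser import HTMLParser

# Hypothetical weights, for illustration only (not taken from any search engine).
PROMINENT_TAGS = {"title": 5, "h1": 4, "h2": 3}
META_WEIGHT = 3   # keyword found in the meta description
BODY_WEIGHT = 1   # keyword found anywhere else in the text


class KeywordLocator(HTMLParser):
    """Adds up a score for a keyword based on where it appears in the page."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.current = []  # stack of prominent tags we are currently inside
        self.score = 0

    def handle_starttag(self, tag, attrs):
        if tag in PROMINENT_TAGS:
            self.current.append(tag)
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                content = (attributes.get("content") or "").lower()
                self.score += content.count(self.keyword) * META_WEIGHT

    def handle_endtag(self, tag):
        if self.current and self.current[-1] == tag:
            self.current.pop()

    def handle_data(self, data):
        # Text inside a title or heading counts for more than body text.
        weight = PROMINENT_TAGS[self.current[-1]] if self.current else BODY_WEIGHT
        self.score += data.lower().count(self.keyword) * weight


def location_score(html, keyword):
    """Return the location-weighted score of a keyword in an HTML page."""
    parser = KeywordLocator(keyword)
    parser.feed(html)
    parser.close()
    return parser.score
```

Under these assumed weights, a page that mentions 'Cocker Spaniel' in its title, meta description and main heading scores far higher than a page that only mentions it once in a paragraph.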
The fact that web crawlers regularly trawl the web to keep their index up to date also suggests that having fresh content on your website is a good thing.
Making Pages Accessible To Crawlers
Crawlers are very simple programs. They begin with a list of links to scan, and then follow the links they find. Sounds simple, right? Well, yes, it is, until you get to complex pages with dynamic content.
Think about on-site search results, Flash content, forms, animation and other dynamic resources. There are many reasons why a crawler would not see your website in the same way that your human visitors do.
For this reason, many businesses take steps to ensure that web crawlers ‘see’ all of the content available. This is a particular issue for websites with lots of dynamic content that may only be visible after making a search.
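The basic loop described above (start with a list of links, then follow the links you find) can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine's crawler is built. The names `LinkExtractor`, `fetch_over_http` and `crawl` are made up for this example, and a real crawler would also respect robots.txt, rate limits and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in the static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Download a page as text (no retries or error handling in this sketch)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, fetch=fetch_over_http, max_pages=50):
    """Breadth-first crawl: start from the seed URLs and follow discovered links."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}  # url -> raw HTML

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Notice that this sketch only discovers pages that are linked with a plain `<a href>` in the HTML it downloads. Content that appears only after submitting a search form or running a script never shows up in `extractor.links`, which is exactly why a simple crawler can miss it.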
Google Search Console can help you understand how many of your pages are indexed, which pages were excluded and why, along with any errors or warnings that were encountered when crawling your website.
Using Crawlers To Fix Website Issues
Web crawlers are also provided by some SEO tools to help webmasters identify errors that could lead to SEO issues or even prevent pages from being included in the search results.
WooRank's Site Crawl is an SEO crawler which can help you to find crawl errors that might trip up the search engines. Fixing these issues will help to ensure your pages can be easily accessed and included in search engine results.
You can try our SEO crawler by signing up for a free 14 day WooRank trial!
The Role Of Robots.txt
You can give instructions to web crawlers by putting them in a file called robots.txt. You might want to ask web robots to ignore your website (for example, while it's being built), or to skip certain sections.
You might also want to help the robot to access every part of your website – particularly if you have a complex or dynamic website.
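As a rough illustration, the Python snippet below parses a small example robots.txt with the standard library's `urllib.robotparser` and checks which URLs a well-behaved crawler would be allowed to fetch. The rules, the `example.com` URLs, the `ExampleBot` user agent and the `is_allowed` helper are all invented for this example.

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt: skip the staging area and on-site search results,
# and point crawlers at a sitemap listing the pages you do want found.
# (Using "Disallow: /" instead would ask robots to ignore the whole site.)
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
Disallow: /search

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="ExampleBot"):
    """Return True if the rules above let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed("https://example.com/puppies")` is true, while anything under `/staging/` is off limits. Keep in mind that robots.txt is a request, not a lock: reputable crawlers follow it, but it does not technically prevent access.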
Learn more in our Guide to robots.txt
Search Engine Indexes
Once the crawler has found information by crawling over the web, the program builds the index. The index is essentially a big list of all the content the crawler has found, along with where each piece of content is located.
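One common way to organize such a list is an 'inverted index', which maps each word to the pages that contain it. The sketch below is a toy version under that assumption. The function names `build_index` and `search` are made up for this example, and real search engine indexes store far more than words and URLs.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))  # copy, so the index is not modified
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Because the index is organized by word rather than by page, answering a query about hippos means looking up one entry instead of rereading every page the crawler has ever seen.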
Why Indexing Is Only The Beginning…
In order to give you relevant responses to your search queries, search engines must interpret the pages in their index. Search engines use algorithms, which are essentially complex equations, to ‘rate’ the value or quality of the pages in their index.
So when you go searching for ‘Cocker Spaniels’, the search engine will consider hundreds of factors when choosing which web pages to return.
Some of the factors that search engines consider include:
- when the page was published
- if the page includes text, pictures and video
- the quality of the content
- how well the content matches user queries
- how quickly your website loads
- how many links from other websites point to your content
- how many people have shared your content online…
…and many more. There are over 200 factors that Google considers when delivering search results.
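To give a feel for how several factors might be combined into a single ordering, here is a deliberately simplified Python sketch that ranks pages by a weighted sum of scores. The factor names, the weights and the `rank` function are hypothetical. Google does not publish its formula, and the real thing is far more complex than a weighted sum.

```python
# Hypothetical factors and weights, chosen only for illustration.
WEIGHTS = {
    "relevance": 0.5,      # how well the content matches the query
    "quality": 0.3,        # quality of the content
    "speed": 0.1,          # how quickly the page loads
    "inbound_links": 0.1,  # links from other websites
}


def total_score(page):
    """Weighted sum of a page's signals, each assumed to be between 0 and 1."""
    return sum(WEIGHTS[name] * page["signals"].get(name, 0.0) for name in WEIGHTS)


def rank(pages):
    """Return pages ordered from highest to lowest total score."""
    return sorted(pages, key=total_score, reverse=True)
```

Even in this toy model, a page that loads quickly and has plenty of links can still be outranked by a page that matches the query better, because relevance carries the largest assumed weight.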
Getting De-indexed By Google
Google does not want to recommend disreputable websites, particularly those that break its webmaster guidelines. Engaging in shady practices can cause you to wind up being penalised, resulting in part, or all, of your website being de-indexed.
What does that mean? It means that your website will no longer appear in Google’s index, and will therefore be excluded from Google's search results.
As you can imagine, this is a catastrophic scenario for any business that has an online presence, so it's always best to be aware of what is considered to be against the rules in Google's eyes, in order to avoid raising any red flags.
Want To Learn More?
You can read more about web crawlers in our Guide to Search Engine Crawlers.
This blog post was updated 18 May 2020