Last Updated: December 9, 2019
If you asked everyone you know to list their topmost fears, spiders would likely sit comfortably in the top five (after public speaking and death, naturally). Creepy, crawly, and quick, even small spiders can make a grown man jump. But when it comes to the internet, spiders do more than spin webs. Search engines use spiders (also known as web crawlers) to explore the web, not to spin their own. If you have a website, web crawlers have crept onto it at some point, but perhaps surprisingly, this is something for which you should be thankful. Without them, no one could find your website on a search engine.
Turns out, spiders aren’t so bad after all! But how do web crawlers work?
What Is a Web Crawler?
Although you might imagine web crawlers as little robots that live and work on the internet, in reality they’re simply part of a computer program written and used by search engines to update their web content or to index the web content of other websites.
A web crawler copies webpages so that they can be processed later by the search engine, which indexes the downloaded pages. This allows users of the search engine to find webpages quickly. The web crawler also validates links and HTML code, and sometimes it extracts other information from the website.
Web crawlers are known by a variety of names, including spiders, ants, bots, automatic indexers, web scutters, and (in the case of Google’s web crawler) Googlebot. If you want your website to rank highly on Google, you need to ensure that web crawlers can always reach and read your content.
How Do Web Crawlers Work?
- Discovering URLs: How does a search engine discover webpages to crawl? First, the search engine may have already crawled the webpage in the past. Second, the search engine may discover a webpage by following a link from a page it has already crawled. Third, a website owner may ask the search engine to crawl a URL by submitting a sitemap (a file that provides information about the pages on a site). Creating a clear sitemap and crafting an easily navigable website are good ways to encourage search engines to crawl your website.
- Exploring a List of Seeds: Next, the search engine gives its web crawlers a list of web addresses to check out. These URLs are known as seeds. The web crawler visits each URL on the list, identifies all of the links on each page, and adds them to the list of URLs to visit. Using sitemaps and databases of links discovered during previous crawls, web crawlers decide which URLs to visit next. In this way, web crawlers explore the internet via links.
- Adding to the Index: As web crawlers visit the seeds on their lists, they locate and render the content and add it to the index. The index is where the search engine stores all of its knowledge of the internet. Google reports that its index is well over 100,000,000 gigabytes in size! To create a full picture of the internet (which is critical for optimal search results pages), web crawlers try to index as many nooks and crannies of the internet as they can. In addition to text, web crawlers catalog images, videos, and other files.
- Updating the Index: Web crawlers note key signals, such as the content, keywords, and the freshness of the content, to try to understand what a page is about. According to Google, “The software pays special attention to new sites, changes to existing sites, and dead links.” When it locates these items, it updates the search index to ensure it’s up to date.
- Crawling Frequency: Web crawlers are crawling the internet 24/7, but how often are individual pages crawled? According to Google, “Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.” The program takes the perceived importance of your website and the number of changes you’ve made since the last crawl into consideration. It also looks at your website’s crawl demand, or the level of interest Google and its searchers have in your website. If your website is popular, it’s likely that Googlebot will crawl it frequently to ensure your viewers can find your latest content through Google.
- Blocking Web Crawlers: If you choose, you can block web crawlers from indexing your website. For example, using a robots.txt file (discussed in more detail below) with certain rules is like holding a sign up to web crawlers saying, “Do not enter!” Or if your server responds to a request with an HTTP status code relaying that the page doesn’t exist (such as 404), web crawlers won’t index it. In some cases, a webmaster might inadvertently block web crawlers from indexing a page, which is why it’s important to periodically check your website’s crawlability.
- Using Robots.txt Protocols: Webmasters can use the robots.txt protocol to communicate with web crawlers. Well-behaved crawlers check a site’s robots.txt file before crawling its pages. A variety of rules can be included in the file. For example, you can define which pages a bot can crawl, specify which links a bot can follow, or opt out of crawling altogether using robots.txt. Google provides the same customization tools to all webmasters, and doesn’t allow any bribing or grant any special privileges.
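To make the discovery and seed steps above a little more concrete, here is a minimal Python sketch of a crawler following links outward from a seed. It is an illustration only, not how any search engine actually implements crawling. The `PAGES` dictionary is a made-up stand-in for the web so the example runs offline, and `LinkExtractor` and `crawl` are hypothetical names chosen for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch pages over HTTP.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post</a> <a href="/about">About</a>',
    "https://example.com/blog/post-1": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, max_pages=100):
    """Visit pages breadth-first starting from the seeds; return URLs in visit order."""
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, so none is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:     # dead link: nothing to download
            continue
        visited.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    for page in crawl(["https://example.com/"]):
        print(page)
```

Starting from the single seed `https://example.com/`, the sketch reaches all four pages purely by following links, which is the same idea the list above describes at a much larger scale.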
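If you are curious how a polite crawler might read robots.txt rules, Python’s standard library includes a parser for them. The sketch below is only an illustration: the `ROBOTS_TXT` contents, the `ExampleBot` and `BadBot` user agents, and the `build_parser` helper are all made up for this example, and a real file would live at the root of your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    # Any bot may read the blog, but not the /private/ folder.
    print(rules.can_fetch("ExampleBot", "https://example.com/blog/"))
    print(rules.can_fetch("ExampleBot", "https://example.com/private/report.html"))
    # BadBot is told to stay out of the whole site.
    print(rules.can_fetch("BadBot", "https://example.com/blog/"))
```

Keep in mind that robots.txt is a request, not a lock. It only works for crawlers that choose to honor it.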
Web crawlers have an exhausting job when you consider how many webpages exist and how many more are being created, updated, or deleted every day. To make the process more efficient, search engines create crawling policies and techniques.
Web Crawling Policies and Techniques
- To Restrict a Request: If a crawler only wants to find certain media types, it can make an HTTP HEAD request, which returns a resource’s headers without its body, to check that the resource is the needed type before downloading it.
- To Avoid Duplicate Downloads: Web crawlers sometimes modify and standardize URLs so that they can avoid crawling the same resource multiple times.
- To Download All Resources: If a crawler needs to download all of the resources from a given website, a path-ascending crawler can be used. It attempts to ascend to every path in each URL on its list. For a URL ending in /a/b/page.html, for example, it would also try /a/b/, /a/, and /.
- To Download Only Similar Webpages: Focused web crawlers are only interested in downloading webpages that are similar to each other. For example, academic crawlers only search for and download academic papers (they use filters to find PDF, PostScript, and Word files and then use algorithms to determine if the pages are academic or not).
- To Keep the Index Up to Speed: Things move fast on the internet. By the time a web crawler is finished with a long crawl, the pages it downloaded might have been updated or deleted. To keep content up to date, crawlers use equations to determine websites’ freshness and age.
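The HEAD request technique from the list above might look something like this in Python. This is a rough sketch under stated assumptions: `wanted_type` and `head_content_type` are hypothetical helper names, and the default of accepting only PDFs is just an example. A HEAD request asks the server for a resource’s headers without its body, so the crawler can read the `Content-Type` header before deciding whether a download is worthwhile.

```python
import urllib.request


def wanted_type(content_type_header, wanted=("application/pdf",)):
    """Return True if a Content-Type header value names one of the wanted media types."""
    # Drop parameters such as "; charset=utf-8" and ignore letter case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in wanted


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header ('' if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


if __name__ == "__main__":
    # No network call here: just checking header values a server might send.
    print(wanted_type("application/pdf"))
    print(wanted_type("text/html; charset=utf-8"))
```

A crawler could call `head_content_type(url)` first and only download the full resource when `wanted_type` returns `True`.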
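Two of the URL-based techniques in the list above, standardizing URLs to avoid duplicate downloads and path-ascending crawling, can be sketched with Python’s standard URL tools. Treat this as a simplified illustration: `normalize_url` and `ascending_paths` are hypothetical names, the normalization rules shown are only a few of the many a real crawler might apply, and the sketch ignores edge cases such as usernames in URLs and IPv6 addresses.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Standardize a URL so that equivalent spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"  # an empty path means the site root
    # Drop the fragment (#...): it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def ascending_paths(url):
    """Return the URL of every parent folder of `url`, deepest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    urls = []
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return urls


if __name__ == "__main__":
    print(normalize_url("HTTP://Example.COM:80/a/b.html#top"))
    print(ascending_paths("http://example.com/a/b/page.html"))
```

With these helpers, `HTTP://Example.COM:80/a/b.html#top` and `http://example.com/a/b.html` are recognized as the same resource, and a single deep URL yields a handful of parent folders for the crawler to try.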
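As for freshness and age, search engines don’t publish the exact equations they use, but one common way to define the two measures in the crawling literature is simple enough to sketch. In the hypothetical functions below, all times are plain numbers (seconds, for instance), `last_crawled` is when the crawler saved its copy, and `last_modified` is when the live page last changed. For a page that has changed since the crawl, this sketch assumes `last_modified` is the first such change.

```python
def freshness(last_crawled, last_modified):
    """Return 1 if the stored copy still matches the live page, otherwise 0."""
    return 1 if last_modified <= last_crawled else 0


def age(now, last_crawled, last_modified):
    """Return how long the stored copy has been out of date (0 if it is current)."""
    if last_modified <= last_crawled:
        return 0
    return now - last_modified


if __name__ == "__main__":
    # Page crawled at t=100, then changed at t=150; we check at t=200.
    print(freshness(last_crawled=100, last_modified=150))
    print(age(now=200, last_crawled=100, last_modified=150))
```

A crawler that wants a fresh index would try to keep average freshness high and average age low, which usually means revisiting frequently changing pages more often.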
In addition, Google uses several different web crawlers to accomplish a variety of jobs. For example, there’s Googlebot (desktop), Googlebot (mobile), Googlebot Video, Googlebot Image, and Googlebot News.
Reviewing the Crawling of Your Website
If you want to see how often Googlebot visits your website, open Google Search Console and head to the “Crawl” section. You can confirm that Googlebot visits your site, see how often it visits, verify how it sees your site, and even get a list of crawl errors to fix. If you wish, you may ask Googlebot to recrawl your website through Google Search Console as well. And if your load speed is suffering or you’ve noticed a sudden surge in errors, you may be able to fix these issues by altering your crawl rate limit in Google Search Console.
So . . . How Do Web Crawlers Work?
To put it simply, web crawlers explore the web and index the content they find so that the information can be retrieved by a search engine when needed. Most search engines run many crawling programs simultaneously on multiple servers. Due to the vast number of webpages on the internet, the crawling process could go on almost indefinitely, which is why web crawlers follow certain policies to be more selective about the pages they crawl.
Keep in mind that we only know the general answer to the question “How do web crawlers work?” Google won’t reveal all the secrets behind its algorithms, as this could encourage spammers and allow other search engines to steal Google’s secrets.
See? Spiders aren’t so scary after all. A little secretive, perhaps, but perfectly harmless!
If you’re hoping to build a beautiful, effective website that ranks highly on Google, contact 417 Marketing for help. Our team of knowledgeable, creative, and passionate professionals specializes in SEO, web design and maintenance, and Google AdWords, and we have successfully completed over 700 websites since our inception in 2010. Contact us to learn more about what we can do for your company.