Search engines collect information about your website using crawling bots (also called spiders or crawlers), which continuously scan billions of web pages. These bots are programs whose job is to download web pages into a huge database, search them for links to new pages, and then download those as well (and so on).
Google’s simple scanning robot – Googlebot
Google’s crawling bot is called Googlebot. The bot scans web pages and downloads the HTML code from which they are generated, for analysis. Googlebot’s great advantage is that it is fast and therefore requires relatively few resources. Its big problem is that it does not scan all the elements included in the HTML code – the elements that together create the page the visitor actually sees.
Components that Googlebot does not scan include:
- JavaScript – Many websites use JavaScript to create interactive content and to change the design and content of the page.
- CSS – code that controls the design of the page.
- Media – images or videos that are on the page.
- Any code located in an external file – JavaScript and CSS code can be placed in separate external files.
- Flash elements – this technology has almost completely gone out of use.
- Cookies – information stored in the browser using cookies cannot be read by the robot.
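As a hedged illustration of the components listed above (the file names and paths are hypothetical), a typical page’s HTML code might reference them like this:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- CSS located in an external file: controls the design of the page -->
  <link rel="stylesheet" href="/assets/style.css">
  <!-- JavaScript located in an external file: may create or change content -->
  <script src="/assets/menu.js"></script>
</head>
<body>
  <!-- Media: the image itself is a separate file, not part of the HTML -->
  <img src="/images/product.jpg" alt="Product photo">
  <!-- Inline JavaScript that adds content after the page loads -->
  <script>
    document.body.insertAdjacentHTML("beforeend", "<p>Added by JavaScript</p>");
  </script>
</body>
</html>
```

A bot that reads only the raw HTML sees the references to these files, but not the result they produce in the browser.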
Googlebot scans different websites and pages at different frequencies, with the goal of scanning central pages, new pages and frequently updated pages more often.
Scanning an HTML page by a browser bot – Headless Browser
For several years now, Google has been operating an additional bot alongside Googlebot for the purpose of scanning complex web pages, based on a robotic Chrome browser that operates without a human user – a Headless Browser. This complete reading process is called Rendering. The Chrome browser used by this bot is the latest stable version of Chrome, and it is updated every time Chrome is updated (an evergreen version).
The rendering process of the page begins with the scanning of its HTML code by a system called Caffeine.
After the HTML code is read, a network service called the Web Rendering Service is activated, which creates a group of robotic Chrome browsers in the cloud. Each robotic browser reads one of the components that make up the page, so that in the end all the components of the page are collected, including those that the regular Googlebot could not read.
The next step is to run the collected code, such as JavaScript and CSS, until the DOM tree – the skeleton of the page – is complete, just as a normal browser builds it when preparing the page for display to the visitor.
The system knows how to identify errors in reading components in order to retry the read, and it knows how to handle situations where the code on the page itself contains errors.
At this stage, the complete HTML (or, more precisely, the complete DOM) is transferred back to the Caffeine system in order to complete the retrieval of information from the page.
It is important to understand that such a scan requires significantly more resources from Google than the simple Googlebot scan, and therefore the scan rate of the robotic browsers is much lower than the Googlebot scan rate.
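To illustrate why rendering matters, here is a minimal, hypothetical example of a page whose visible content exists only after its JavaScript runs:

```html
<!-- Before rendering: the HTML that Googlebot downloads contains only an empty container -->
<div id="content"></div>

<script>
  // After rendering: the robotic browser executes this script, so the DOM that is
  // sent back for indexing contains the heading and paragraph injected below.
  document.getElementById("content").innerHTML =
    "<h1>Product catalog</h1><p>This text exists only in the rendered DOM.</p>";
</script>
```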
Retrieving information from the HTML page
The web pages that are collected are first filtered to locate “soft” 404 pages, so that no processing time is spent on them. In the next step, data is retrieved from the page.
The first step in retrieving information, and one of the most important steps for Google, is retrieving the Structured Data (Schema) marked by the website owner on the page. The data seen by Google’s bot is broader than what is shown by Google’s Rich Results Test tool, since at this stage all the structured data present on the page is scanned. Because the retrieval is performed on the full DOM tree, Google is also able to read structured data found in external files or generated by JavaScript.
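As a hedged illustration (the values are invented), structured data is commonly marked up on the page as a JSON-LD block that uses the schema.org vocabulary:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How Google crawls web pages",
  "author": { "@type": "Person", "name": "Example Author" },
  "datePublished": "2020-01-01"
}
</script>
```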
The next step in extracting information from the page is called Signal Extraction. This is a basic stage of signal collection, which includes, among other things, the following:
- Identifying the canonical version of the page – is it an original content page or a copy of one? For this purpose, Google performs a mathematical examination (hashing) of the unique content on the page in order to compare it with other content that exists on the web.
Websites that are unfriendly to robots
Since the robots are quite primitive, they like simple websites. Websites based on more advanced technologies run the risk that the robots will not understand them. Technologies to avoid include:
- Flash – Although considerable progress has been made in recent years, today’s robots can read very little of the content that appears inside Flash. Entire sites built as a single Flash file will appear in the search engine as a single page, without most of the site’s content. There is a relatively new collaboration between Adobe, the developer of Flash, and the search giant Google, which is meant to make Flash files easier to read, but it has not yet borne fruit and Flash remains an enemy of the crawling robots.
- Frames – A method that has almost disappeared from the world. The problem with frames is that the general address of the page remains constant while the content changes within the frame. Therefore, it is impossible to go directly to a certain page within the site, and only the main page will appear. If an inner page does appear, it will appear without its outer frame. Either way, the result is not good.
- I-Frames – A newer technology – currently used by giants such as YouTube and Facebook (for embedding files) – but it still creates the same problem: the content changes within the iframe, so the robot cannot see the different contents. Definitely not recommended. If the goal is only an internal frame with scrolling (containing content that is not brought in from a separate page), it is better to use a DIV or SPAN with CSS scrolling defined for it, as sketched after this list.
- Dynamic pages with a Session ID – Some sites still use a Session ID inside the dynamic URL to track users on the site. This makes the robot think it has found a new page that does not yet exist in its database (because the Session ID is new). Pages of this type will eventually disappear from the search results entirely, because they will be treated as different pages with the same content, i.e. duplicate content, and hence unnecessary.
- Requirement to enable cookies – Some websites require the user to enable cookies in order to see the site. The search engine bots do not accept cookies, and therefore will not be able to read pages that require them. It’s not that you shouldn’t use cookies – just don’t make their use mandatory.
- Using JavaScript links only – Robots only recognize HREF links and do not follow JavaScript links. On sites with JavaScript-only links, pages that have no normal link pointing to them will not appear (an example appears after this list).
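For the scrolling case mentioned in the I-Frames item, a minimal sketch of a DIV with CSS-defined scrolling (the class name is arbitrary) might look like this:

```html
<style>
  /* A fixed-height container whose content scrolls, without loading a separate page */
  .scroll-box { height: 200px; overflow-y: scroll; }
</style>

<div class="scroll-box">
  <p>Long content that the robot can read, because it is part of this page's own HTML.</p>
</div>
```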
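And for the last item, here is a hedged sketch of the difference between a normal HREF link, which robots follow, and a JavaScript-only link, which they do not (the URL is a placeholder):

```html
<!-- A normal link: the robot recognizes the HREF and can reach the target page -->
<a href="/products.html">Products</a>

<!-- A JavaScript-only "link": there is no HREF for the robot to follow -->
<span onclick="window.location='/products.html'">Products</span>
```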
Google’s scanning robot – Googlebot
Here is what you should know about Google’s scanning robot – Googlebot:
The frequency of scanning – Googlebot scans different websites and pages at different frequencies. The parameters Googlebot uses to determine which pages to scan more often are the page’s PageRank, the links pointing to the page, and the number of parameters in the URL (if it is a dynamic page – ASP or PHP, for example). Of course there are other factors, but it is difficult to determine what they are.
Site crawl levels
The robots that crawl the web do so at three main levels of detail:
- Scanning for new pages – this scan is performed in order to find new pages that do not yet appear in the search engine’s page database. The robot can “discover” a new page after it is submitted on the search engine’s “Add Site” page, or by following a link to the new page from one of the pages it already has in its database.
- Shallow scan of the important pages – this scan goes over the most important pages on the site (usually the home page), and is done more often.
- Deep crawl – In this crawl, all website pages that appear in the search engine’s database are scanned to detect new pages and changes to the content of existing pages. This scan is done once in a while.
Preventing robots from accessing certain areas of the site
You will often want to prevent search robots from accessing a certain area of the site. A basic example of this is a directory that contains confidential or secret material, or a page that is no longer updated.
There are two main ways to prevent robots from accessing certain areas of the site:
robots.txt file
You will often be interested in preventing the robot of a certain search engine from accessing your website (or part of it), or in blocking the access of all robots to a certain area. For this purpose, the robots.txt file was created.
Please note: Prohibiting a search engine’s access to a certain page will indeed prevent the collection of the page’s content, but sometimes, if there are links to that page on pages the search engines can enter, the page will still appear in the search results – just without the information about it (title, description, etc.). If you want to prevent the page from appearing entirely, use the second method (the robots tag).
The robots.txt file should be located in the main directory of the site (usually it does not exist by default and needs to be created). Each section within the file specifies a type of robot and the restrictions applied to that robot. The file can also contain restrictions applied to all robots.
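As a hedged illustration (the directory names are hypothetical), such a file might contain one section for a specific robot and one for all robots:

```
# Restrictions for Google's robot only
User-agent: Googlebot
Disallow: /internal-search/

# Restrictions for all robots
User-agent: *
Disallow: /private/
```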
Robots Meta Tag
To control the way search robots process certain pages on the site, you can use the robots meta tag. The main things this tag controls are:
- Whether to add the page into the search engine database or not.
- Whether to follow links coming from this page or not.
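A minimal sketch of the tag, placed in the page’s HEAD section and covering both settings above:

```html
<head>
  <!-- noindex: do not add this page to the search engine database -->
  <!-- nofollow: do not follow the links coming from this page -->
  <meta name="robots" content="noindex, nofollow">
</head>
```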