Most of the crawling robots operating today look for a robots.txt file in the root directory of your website (e.g., https://seoisrael.com/robots.txt). This file tells them which parts of the site they should avoid entering.
Please note: blocking a search engine’s access to a page will indeed prevent it from collecting the page’s content, but if pages that the search engines can crawl link to that page, it may still appear in the search results, only without any information about it (title, description, etc.). If you want to prevent the page from appearing at all, use the second method (the robots meta tag).
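For reference, the robots tag mentioned above is a single line placed inside the page’s <head>. A minimal sketch; the noindex value asks all robots not to index the page:

    <!-- Inside the page's <head>: asks all robots not to index this page -->
    <meta name="robots" content="noindex">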
How do you create a robots.txt file and what is it made of?
The file must be created in a plain text editor (such as Notepad), not in a word processor.
In order to better understand how this file is structured, let’s examine a code example:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
Looking at the code, you can see it has two parts:
- User-agent: defines which robot the instructions that follow are addressed to.
- Disallow: specifies which parts of the site the robots named in the User-agent field may not access.
In the case above, we asked all search engines (the * symbol stands for all of them together) to stay out of the cgi-bin directory and the images directory.
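If you want to check how a crawler would interpret such a rule set, here is a minimal sketch using Python’s standard-library urllib.robotparser (the bot name and URLs are placeholders for illustration):

    from urllib.robotparser import RobotFileParser

    # The rules from the example above, supplied as a list of lines
    # so no network request is needed.
    rules = [
        "User-agent: *",
        "Disallow: /cgi-bin/",
        "Disallow: /images/",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # Paths under /images/ are blocked; everything else is allowed.
    print(rp.can_fetch("SomeBot", "https://seoisrael.com/images/logo.png"))  # False
    print(rp.can_fetch("SomeBot", "https://seoisrael.com/about.html"))       # True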
Let’s look at another code example:
    User-agent: *
    Disallow: /
This example will prevent all search engines from accessing the entire site, meaning the search engines will not crawl the site at all.
A more complicated example:
    User-agent: googlebot
    Disallow: /bonbons/
    Disallow: /bonbons.htm

    User-agent: bonboncrawler
    Disallow: /
In this example we gave instructions to two different robots. Googlebot was instructed to stay out of the bonbons directory and the bonbons.htm file. The second record addresses the bonboncrawler robot, which we barred from the site altogether.
And the last example:
    User-agent: googlebot
    Disallow:

    User-agent: *
    Disallow: /
Note that the first Disallow line (for Googlebot) is empty! An empty Disallow means Googlebot may crawl all the pages of the website. The second record bars all robots from crawling the site. This looks like a contradiction, but when a wildcard (*) rule conflicts with a more specific one, the more specific instruction wins.
So the code above actually means: Googlebot may crawl the entire site, while every other robot may not.
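You can verify that precedence rule the same way; another small sketch with urllib.robotparser (the bot name otherbot and the URL are placeholders):

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: googlebot",
        "Disallow:",
        "",
        "User-agent: *",
        "Disallow: /",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # The specific googlebot record (empty Disallow) wins over the wildcard.
    print(rp.can_fetch("googlebot", "https://seoisrael.com/page.html"))  # True
    print(rp.can_fetch("otherbot", "https://seoisrael.com/page.html"))   # False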
robots.txt for forum websites
A robots.txt file can help you keep robots out of the parts of your forum that contain no useful information. Examples of pages you should block are user profile pages, the search page, the new-post page, and the login page. To block access to these pages, you can create a file like this:
    User-agent: *
    Disallow: /forum/post.asp
    Disallow: /forum/user_profile.asp
    Disallow: /forum/search.asp
    Disallow: /forum/password.asp
Should I create a robots.txt file even if it is not needed?
In a video posted on August 19, 2011, Matt Cutts of Google addresses this question. The short answer is this: “Think of it this way: search engine robots in general, and Google in particular, constantly request the robots.txt path on your site. What happens if the file isn’t there? Will the server return a server error response (response code 500) or a page-not-found response (response code 404)? How will the robot treat that response? Is the risk worth the effort involved in uploading the file to the server?”
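If you decide to upload a file just so the server returns a valid response, a minimal robots.txt that allows every robot to crawl everything is simply an empty Disallow rule:

    User-agent: *
    Disallow: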