In general, a robots.txt file is how your website communicates with a search engine crawler (or “bot”), indicating which pages the crawler is allowed to request.
The file is used by web administrators to manage crawler traffic in cases where a high volume of requests could overwhelm a server. It’s also helpful for keeping certain content from being crawled and for blocking resources that are not important. The robots.txt mechanism is not owned by any particular organization or person; it is part of the Robots Exclusion Protocol (REP), a convention originally developed in 1994 to provide directives to web crawlers.
So, how can you use a robots.txt file to help with your own website?
Functionality and format
The most basic robots.txt file can be as short as two lines: a “User-agent” line and a “Disallow” line. The user agent names the web crawler a rule applies to, while the Disallow directive specifies which resources are restricted. More complex files containing multiple rule groups are also possible. When a file includes more than one group, each user agent is paired with its own directives, and the groups are separated by a blank line. As a result, you can disallow certain resources for one bot while allowing them for others.
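For example, a minimal file that asks every crawler to stay out of a /private/ directory (the path is purely illustrative) would look like this:

  User-agent: *
  Disallow: /private/

Here the asterisk is a wildcard that matches any user agent, and the Disallow line restricts everything under /private/.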
In the multiple-rules version of the file, each group has a user-agent line and its own access rules, and the bot processes them sequentially, reading from top to bottom. If comments are needed for readability (or as reminders of why the rules were created), they must be preceded by a pound sign (#). Other important details for file creators: the rules are case sensitive, and a user agent may crawl the entire site by default unless a rule restricts it.
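As a sketch of the multi-group form (the paths here are again illustrative), a file with two groups and comments might look like this:

  # Keep Google's image crawler out of the drafts area
  User-agent: Googlebot-Image
  Disallow: /drafts/

  # All other crawlers may fetch the whole site
  User-agent: *
  Disallow:

An empty Disallow value places no restriction, so the second group leaves the rest of the site open to every other bot.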
Practically any simple text editor that saves ASCII or UTF-8 text files, such as Microsoft Notepad, is adequate for building your own robots.txt. Word processors like Microsoft Word, however, are not ideal, as they save files in proprietary formats and add extra formatting that makes the file unreadable to crawlers.
Regardless of how you make it, you can use a robots.txt validator to ensure that the file is in the proper format and that the syntax is correct. In addition to the ASCII or UTF-8 format rules, the file must be named “robots.txt” and must reside in the root directory of the website; bots will not read files placed anywhere else, such as in a subdirectory. If the webmaster’s intention is to apply rules to a subdomain, a separate file can reside in the root directory of that subdomain.
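For instance, using the reserved example.com domain as a stand-in for your own site, the placement rules work out like this:

  https://www.example.com/robots.txt          (governs www.example.com)
  https://blog.example.com/robots.txt         (governs the blog subdomain)
  https://www.example.com/pages/robots.txt    (ignored; not in the root)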
Importance in search engine optimization
Websites are generally composed of many files, and bigger websites logically have more of them. When unrestricted, bots crawl every page and file associated with your site. Although this seems like a positive characteristic, it can work against your ranking results. Search engines like Google place a limit on their bots known as a crawl budget, which is shaped by a crawl rate limit and crawl demand. Low-value URLs still count against that budget; essentially, they use up crawling resources without contributing any value. With a robots.txt file, webmasters can steer bots toward valuable content by blocking the pages that would otherwise waste the budget.
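As a sketch (the directory names are placeholders for whatever low-value sections your own site has), a crawl-budget-minded file might look like this:

  User-agent: *
  # Internal search results and checkout pages add little ranking value
  Disallow: /search/
  Disallow: /cart/

  # Optional, widely supported hint pointing crawlers at the important URLs
  Sitemap: https://www.example.com/sitemap.xml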
Another reason these files are important in the context of SEO is that search engines do not like duplicate content. There are scenarios where this is impossible to avoid and the same information must appear on different pages. In these cases, webmasters can disallow the duplicate versions in the robots.txt file and avoid being penalized for the copied information.
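For example, if printer-friendly copies of your articles lived under a hypothetical /print/ path, a rule like the following would keep crawlers focused on the primary versions:

  User-agent: *
  Disallow: /print/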