
robots.txt Generator & Tester

Generate robots.txt files with custom crawler rules and sitemaps, or test an existing robots.txt to see if specific URLs are allowed or blocked.


Embed This Calculator

Add this calculator to your website for free. Copy the single line of code below and paste it into your HTML. The calculator auto-resizes to fit your page.

<script src="https://calchammer.com/embed.js" data-calculator="robots-txt-generator" data-category="everyday"></script>
data-theme: "light", "dark", or "auto"
data-values: Pre-fill inputs, e.g. "amount=1000"
data-max-width: Max width, e.g. "600px"
data-border: "true" or "false"
Or use an iframe instead
<iframe src="https://calchammer.com/embed/everyday/robots-txt-generator" width="100%" height="500" style="border:none;border-radius:12px;" title="Robots Txt Calculator"></iframe>


Understanding robots.txt

The robots.txt file is the first file that well-behaved web crawlers check when visiting a site. It lives at the root of your domain and uses the Robots Exclusion Protocol to communicate which parts of your site crawlers should and should not access. While the standard has been in use since 1994, Google formalized its interpretation in a detailed specification and released an open-source parser in 2019. Every website that wants to control how search engines crawl its pages needs a properly configured robots.txt file.

The file uses a straightforward syntax. Each section begins with a User-agent directive specifying which crawler the rules apply to. An asterisk matches all crawlers. Disallow directives list paths that should not be crawled, while Allow directives create exceptions within broader disallow rules. The Sitemap directive points crawlers to your XML sitemap, helping them discover all the pages on your site. Lines beginning with a hash character are comments and are ignored by crawlers.
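Putting these directives together, a minimal file (the paths and sitemap URL here are illustrative) might look like this:

```text
# Comments start with a hash and are ignored by crawlers
User-agent: *
Disallow: /admin/
Allow: /admin/help/
Sitemap: https://www.example.com/sitemap.xml
```

The Allow line carves an exception out of the broader Disallow rule, so /admin/help/ remains crawlable while the rest of /admin/ is blocked.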


How Web Crawlers Use robots.txt

When a search engine crawler first visits your domain, it requests the /robots.txt file before crawling any other page. If the file exists and contains rules for that crawler's user agent, the crawler follows those rules. If the file does not exist or returns a 404, the crawler assumes all pages are allowed. If the file returns a 5xx server error, most crawlers will temporarily stop crawling the site and retry later, treating an unreadable robots.txt as a reason for caution rather than as permission to crawl. Google caches robots.txt files and refreshes them at least once a day.
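The status-code behavior described above can be sketched as a small decision function. This is a simplified illustration of the convention, not any crawler's actual implementation:

```python
def robots_policy(status_code: int) -> str:
    """Map the HTTP status of a /robots.txt fetch to a crawl decision.

    A simplified sketch of common crawler behavior; real crawlers add
    caching, redirect handling, and retry timers on top of this.
    """
    if status_code == 200:
        return "parse"      # file exists: apply its rules
    if 400 <= status_code < 500:
        return "allow-all"  # missing file (e.g. 404): all pages crawlable
    # 5xx or anything unexpected: pause crawling and retry later
    return "back-off"
```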

Common robots.txt Rules

The most common use case is blocking crawlers from admin areas, login pages, internal search results pages, and duplicate content. For example, Disallow: /admin/ keeps crawlers out of your administration panel, and Disallow: /search keeps your site's internal search results pages from being crawled and surfacing as thin, duplicate content. Blocking PDF files, print-friendly pages, or staging environments is another frequent application. It is important to remember that robots.txt controls crawling, not indexing: a page blocked by robots.txt can still appear in search results if other pages link to it.
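A file combining these common rules (the paths are illustrative) could look like:

```text
User-agent: *
Disallow: /admin/    # administration panel
Disallow: /login     # login page
Disallow: /search    # internal search results: thin, duplicate content
Disallow: /print/    # print-friendly duplicates
```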

Crawling vs. Indexing

A common misconception is that robots.txt can prevent a page from appearing in search results. Blocking a page in robots.txt prevents crawlers from accessing its content, but the URL may still appear in search results if external sites link to it. Google will show the URL with a note that the description is not available because the page is blocked from crawling. To truly prevent indexing, use the noindex meta tag or the X-Robots-Tag HTTP header instead. Critically, the page must be crawlable for Google to see the noindex directive, so do not block a page in robots.txt if you want to use noindex on it.
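For an HTML page, the noindex directive mentioned above goes in the document head. Note the comment: this only works if the page itself is crawlable.

```html
<!-- The page must NOT be blocked in robots.txt,
     or Googlebot will never see this tag -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the HTTP response header X-Robots-Tag: noindex.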

Testing Your robots.txt

Before deploying a robots.txt file, always test it to make sure it does not accidentally block important pages. Google Search Console's robots.txt report shows you exactly how Googlebot fetches and interprets your rules. You can also use the tester tab in this tool to paste your robots.txt content and check whether specific URLs are allowed or blocked. Test your most important pages, your sitemap URL, and any pages you specifically want to block to verify the rules are working as intended.
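A quick local check is also possible with Python's standard-library parser. The file content and URLs below are illustrative; note that urllib.robotparser applies rules in file order (first match wins) rather than Google's longest-match precedence, and does not support * or $ wildcards:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /admin/help
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check the pages you care about against the rules
print(parser.can_fetch("*", "https://www.example.com/blog/post"))   # True
print(parser.can_fetch("*", "https://www.example.com/admin/"))      # False
print(parser.can_fetch("*", "https://www.example.com/admin/help"))  # True
```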

Frequently Asked Questions

What is a robots.txt file?

A plain text file at the root of a website that tells crawlers which pages they may or may not access. It follows the Robots Exclusion Protocol using directives like User-agent, Disallow, Allow, and Sitemap.

Where does the robots.txt file go?

It must be at the root of the domain at the exact path /robots.txt. Each subdomain needs its own file. It must be served as text/plain in UTF-8 encoding.

Can robots.txt block all crawlers?

A file containing "User-agent: *" followed by "Disallow: /" blocks all well-behaved crawlers. However, malicious bots may ignore it. For true access restriction, use server-side authentication.
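The complete block-everything file is just two lines:

```text
User-agent: *
Disallow: /
```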

Does Google respect robots.txt?

Yes. Googlebot checks robots.txt before crawling any URL. However, blocked pages may still appear in search results if linked from other sites. Use noindex to prevent indexing.

Can I block specific pages with robots.txt?

Yes. Use "Disallow: /path/" for directories or "Disallow: /page.html" for specific pages. Wildcard patterns such as Disallow: /*.pdf$ can match URLs ending in .pdf (the * and $ operators are supported by major crawlers like Google and Bing, though they are not part of the original standard). Allow directives create exceptions.
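For example, a pattern-based rule set with an exception (paths are illustrative):

```text
User-agent: *
Disallow: /*.pdf$          # any URL ending in .pdf
Disallow: /private/        # the whole /private/ directory
Allow: /private/faq.html   # exception within the blocked directory
```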


Disclaimer: This calculator is for informational and educational purposes only. Results are estimates and should not be considered professional advice. Consult a qualified professional before making decisions based on these calculations. See our full Disclaimer.