
6 Common Robots.txt Problems and How to Fix Them

Learn about the most common robots.txt issues, how they affect your website and your search presence, and how to fix them.

The robots.txt file is a useful and relatively powerful tool for instructing search engine crawlers how to crawl your website.

It's not all-powerful (in Google's own words, it is "not a mechanism for keeping a web page out of Google"), but it can help prevent your site or server from being overloaded by crawler requests.

If you have a crawl block like this in place on your website, you need to be sure it's being used properly.

This is especially important if you use dynamic URLs or other methods that can generate a theoretically infinite number of pages.

This guide takes a look at some of the most common issues with robots.txt, how they affect your website and search presence, and how to fix them if you think you've run into a problem.

But first, let's take a quick look at robots.txt and its alternatives.

What is a robots.txt file?

The robots.txt file uses a plain text file format and is located in the root directory of your website.

It must sit at the top level of your website's directory structure. If you place it in a subdirectory, search engines will simply ignore it.

Despite its power, a robots.txt file is usually a relatively simple document, and a basic one can be created in seconds using an editor such as Notepad.
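For illustration, here is a minimal sketch of a robots.txt file, assuming a hypothetical /private/ directory you want to keep crawlers out of:

User-agent: *
Disallow: /private/

The first line addresses all crawlers; the second keeps them out of that one directory while leaving the rest of the site crawlable.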

There are other ways to achieve some of the same goals for which robots.txt is typically used.

Individual pages can include a robots meta tag within the page code itself.

You can also use the X-Robots-Tag HTTP header to influence how (and whether) your content appears in search results.
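For example, a server might send a response header like the following to keep a page out of search results (the combination of directives here is just one illustrative option):

X-Robots-Tag: noindex, nofollow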

What can a robots.txt file do?

Robots.txt files can produce different results for different content types.

You can prevent web pages from being crawled.

The page may still appear in search results, but it will have no text description. Non-HTML content on the page will not be crawled either.
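For example, a rule like this hypothetical one would stop all crawlers from fetching a /thank-you/ page:

User-agent: *
Disallow: /thank-you/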

Media files can be blocked from appearing in Google search results.

This includes image, video, and audio files.

If the file is publicly accessible, it will still "exist" online and can be viewed and linked to, but that content will not appear in Google searches.
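For example, to keep Google Images away from a hypothetical /images/ directory, you could address Google's image crawler directly:

User-agent: Googlebot-Image
Disallow: /images/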

Resource files, such as unimportant external scripts, can be blocked.

However, this means that if Google crawls a page that requires that resource to load, Googlebot will "see" a version of the page as if the resource did not exist, which may affect indexing.

You cannot completely prevent a web page from appearing in Google's search results using a robots.txt file alone.

To achieve that, you need to use an alternative method, such as adding a noindex meta tag to the head of the page.

How serious are robots.txt errors?

Mistakes in your robots.txt file can have unintended consequences, but it's often not the end of the world.

The good news is that by fixing your robots.txt file, you can usually recover from any errors quickly and fully.

6 Common Robots.txt Errors

  1. The robots.txt file is not in the root directory
  2. Misuse of wildcards
  3. Noindex in robots.txt
  4. Blocked scripts and style sheets
  5. No sitemap URL
  6. Access to development sites

If your site is behaving strangely in search results, your robots.txt file is a good place to look for mistakes, syntax errors, and overreaching rules.

Let's take a closer look at each of the above errors and see how to make sure you have a valid robots.txt file.

1. The robots.txt file is not in the root directory

Search robots can only discover the file if it is in your root folder.

For this reason, there should be nothing but a forward slash between your site's domain and the "robots.txt" filename in its URL, as in https://www.example.com/robots.txt.

If there is a subfolder in that URL, search robots probably won't find your robots.txt file, and your site will likely behave as if there were no robots.txt file at all.

To fix this problem, move your robots.txt file to your root directory.

It's worth noting that this will require root access to your server.

Some content management systems upload files to a "media" (or similar) subdirectory by default, so you may need to work around this to get your robots.txt file in the right place.

2. Misuse of wildcards

The robots.txt file supports two wildcard characters:

The asterisk (*) represents any instances of a valid character, like a Joker in a deck of cards.

The dollar sign ($) denotes the end of a URL, allowing you to apply rules only to the final part of a URL, such as a filetype extension.
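As an illustration, the following hypothetical rules use both wildcards: the first blocks any URL containing a ?sort= parameter, and the second blocks any URL ending in .pdf:

User-agent: *
Disallow: /*?sort=
Disallow: /*.pdf$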

It's sensible to take a minimalist approach to using wildcards, as they have the potential to apply restrictions to a much broader portion of your website than intended.

It's also relatively easy to end up blocking robot access to your entire site with a poorly placed asterisk.
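For instance, a rule like this one, perhaps intended to block a single section, actually matches every URL path on the site and blocks everything:

User-agent: *
Disallow: /*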

To fix a wildcard issue, you'll need to locate the incorrect wildcard and move or delete it so that your robots.txt file works as intended.

3. Noindex in robots.txt file

This is more common on websites that are more than a few years old.

Google stopped obeying noindex rules in robots.txt files as of September 1, 2019.

If your robots.txt file was created before that date, or contains noindex instructions, you will likely see those pages indexed in Google's search results.

The solution to this problem is to implement an alternative "noindex" method.

One option is the robots meta tag, which you can add to the head of any web page you want to prevent Google from indexing.
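For example, placing the following tag in a page's <head> section asks search engines not to index that page:

<meta name="robots" content="noindex">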

4. Blocked scripts and style sheets

It would seem logical to prevent a crawler from accessing external JavaScript files and cascading style sheets (CSS).

However, remember that Googlebot needs access to CSS and JS files to properly "see" HTML and PHP pages.

If your pages are behaving strangely in Google's results, or it looks like Google is not seeing them correctly, check whether you are blocking crawler access to required external files.

A simple solution to this is to remove the line from the robots.txt file that is blocking access.

Or, if you have some files that do need to be blocked, insert an exception that restores access to the necessary CSS and JavaScript.
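As a sketch, assuming your CSS and JavaScript live in hypothetical /assets/css/ and /assets/js/ directories, such an exception could look like this (Google honors the most specific matching rule, so the Allow lines override the broader Disallow):

User-agent: *
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/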

5. No sitemap URL

This is more about SEO than anything else.

You can include your sitemap URL in your robots.txt file.

Since the robots.txt file is the first place Googlebot looks when crawling your website, this gives the crawler a head start on knowing the structure and main pages of your site.

While this isn't strictly an error, since omitting a sitemap shouldn't negatively affect the core functionality and appearance of your website in search results, it's still worth adding your sitemap URL to robots.txt if you want to give your SEO efforts a boost.
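For example, a single extra line in robots.txt is all it takes (the URL here is a hypothetical placeholder):

Sitemap: https://www.example.com/sitemap.xml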

6. Access to development sites

Blocking crawlers from your live website is a no-no, but so is allowing them to crawl and index pages that are still under development.

It is best practice to add a disallow instruction to the robots.txt file of a website under construction so the general public doesn't see it until it's finished.

Likewise, it's crucial to remove that disallow instruction when you launch the completed website.

Forgetting to remove this line from your robots.txt file is one of the most common mistakes among web developers, and it can prevent your entire website from being crawled and indexed properly.

If your development site seems to be receiving real-world traffic, or if your recently launched website isn't performing at all well in search, look for a universal user-agent disallow rule in your robots.txt file:


User-agent: *
Disallow: /


If you see this when you shouldn't (or don't see it when you should), make the necessary changes to your robots.txt file and verify that your website's search appearance updates accordingly.

How to recover from a robots.txt error

If a robots.txt error has had undesirable effects on your website's search appearance, the most important first step is to correct the robots.txt file and check that the new rules have the desired effect.

Some SEO crawling tools can help with this, so you don't have to wait for the search engines to crawl your site again.

When you are confident that your robots.txt file is behaving as intended, you'll want to get your site re-crawled as soon as possible.

Platforms like Google Search Console and Bing Webmaster Tools can help.

Submit an updated sitemap and request that any inappropriately deleted pages be recrawled.

Unfortunately, you're at Googlebot's whim; there's no guarantee how long it might take for any missing pages to reappear in the Google search index.

All you can do is take the right actions to minimize that time as much as possible and keep checking until Googlebot implements the fixed robots.txt.
