robots.txt | Advanced SEO Techniques

Published

Updated

November 4, 2022

You've no doubt heard of /robots.txt.

It's your magic ticket to controlling what Google crawls, and therefore your path to SEO superstardom, right?

Nope. The truth is, if you're designing a site that's intended for the public to see and use, you're better off forgetting robots.txt even exists.

You're much more likely to create problems and damage your site's SEO than anything else.

Really, trust me. There is nothing good to be gained here, unless maybe you're building some private pages for the Department of Defense.

NOTE: You probably shouldn't tamper with the battery in your Tesla either.

Things You Probably Didn't Know

From Google's perspective, /robots.txt really only tells it what parts of your site to avoid looking at.

Why Does robots.txt Exist?

It was originally designed in the early days of the Internet when the web was the domain of universities and scholars who were playing with funky ideas.

Its job was to keep crawlers out of parts of a website that they shouldn't be investigating. Perhaps because:

That area of the site is dynamic, random, and unpredictable, so much so that every time you visit a page, the content might be radically different, or the page might not be there anymore. Indexing it in a search engine serves no purpose.
That area is fragile.

Today, most websites exist to share content with the world. And that's certainly the whole point of Webflow.

Messing with /robots.txt is like licking your Tesla battery... probably a Bad Idea.

Just because you can, doesn't mean you should.

Mistakes People Make

From Google's perspective, /robots.txt tells it what parts of your site to avoid looking at. The META robots tag on a page tells it whether you want that page to be indexed or not.

These have different purposes, and sometimes that lack of understanding bites people in the butt.

Let's suppose you have a page in Google's search index that you don't want to be there anymore. How do you remove it?

Many people will have a shotgun reaction here, and they'll try to block that page everywhere:

They will jump into /robots.txt and tell Google to stop crawling that page.
And they'll also add the META noindex tag to the page.

But the result is probably not what they wanted.

Googlebot will check /robots.txt first, and see that it is told not to crawl that page. So, as requested, it won't...

And, it will never even see the META noindex tag you've added, or update your search engine results.

The result?

Your page will stay in the Google search results... forever.
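Here's a rough sketch of that trap, assuming Python and its standard-library urllib.robotparser module. The /old-page path, the PAGES dictionary, and the crawler_sees_noindex helper are all made up for illustration, and the noindex check is a naive string match, not how Googlebot actually reads a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the page we want removed.
BLOCKED = "User-agent: *\nDisallow: /old-page\n"
# The same file with the block lifted (empty Disallow = allow everything).
OPEN = "User-agent: *\nDisallow:\n"

# Hypothetical page content carrying the META noindex tag.
PAGES = {
    "/old-page": '<html><head><meta name="robots" content="noindex"></head></html>',
}


def crawler_sees_noindex(robots_txt, path, pages, user_agent="Googlebot"):
    """Return True only if a polite crawler fetches the page and finds noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, path):
        # Blocked by robots.txt: the page is never fetched,
        # so the noindex tag is never read.
        return False
    html = pages.get(path, "")
    # Simplified check for illustration only.
    return 'content="noindex"' in html


print(crawler_sees_noindex(BLOCKED, "/old-page", PAGES))
print(crawler_sees_noindex(OPEN, "/old-page", PAGES))
```

With the Disallow rule in place the helper returns False, because the page is never fetched. Lift the block and the noindex tag finally gets seen.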

Configuring robots.txt

For most Webflow sites, simply not having a robots.txt is your best approach. Webflow hosting doesn't support back-end programming, so there is no possibility of wild-and-crazy programming that you want to keep the robots clear of.

You're much better off keeping your robots.txt empty, and using <meta> tags to tell Google what things you want excluded from search results.

If you are determined to have a robots.txt, however, you need to know this.

How to Allow-all in robots.txt

The #1 mistake I see people make is that they misunderstand the robots.txt syntax, and end up blocking all robots from crawling their site.

If you want to allow your entire site to be visited, and indexed or excluded based on your page-level META rules, this is the syntax you want:

User-Agent: *
Disallow:

What this says is:

No matter the robot, do not block anything.

If you make the mistake of putting Disallow: / then you are telling robots that they are not permitted to explore any paths beginning with /. Which means every path. Which means you have just blocked every page on your site from robots.

Yes, you could want that... but most people invest in Webflow because they want great-looking sites that are found by the world.

Don't shut the world out, accidentally.
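If you want to see the difference for yourself, here is a minimal sketch using Python's standard-library urllib.robotparser. The allowed helper, the sample paths, and the Googlebot user-agent string are just illustrative assumptions. It shows how one standards-following parser reads the two files, not how any particular search engine is implemented.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty Disallow: block nothing
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # every path starts with "/"


def allowed(robots_txt, path, user_agent="Googlebot"):
    """Ask the parser whether user_agent may crawl path under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


for path in ("/", "/blog/my-post"):
    print(path, allowed(ALLOW_ALL, path), allowed(BLOCK_ALL, path))
# / True False
# /blog/my-post True False
```

One stray slash is the whole difference between "crawl everything" and "crawl nothing".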

Google syntax

Google also supports an Allow keyword, which is a bit easier to understand. However, it wasn't part of the original robots.txt convention, and some other robots may ignore it.

User-Agent: *
Allow: /

I'd recommend sticking with standards, whenever possible. Google likes them, too.
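As a quick sanity check, this sketch compares the two spellings using Python's standard-library urllib.robotparser, which happens to understand Allow. The verdicts helper and the sample paths are assumptions for illustration, and a matching result here says nothing about robots that don't recognize Allow.

```python
from urllib.robotparser import RobotFileParser

STANDARD = "User-agent: *\nDisallow:\n"   # the empty-Disallow form
WITH_ALLOW = "User-agent: *\nAllow: /\n"  # the Allow form

SAMPLE_PATHS = ["/", "/about", "/blog/my-post"]  # hypothetical paths


def verdicts(robots_txt, paths, user_agent="Googlebot"):
    """Return one crawl verdict (True = allowed) per path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [parser.can_fetch(user_agent, path) for path in paths]


# Both forms should give the same verdicts in a parser that supports Allow.
print(verdicts(STANDARD, SAMPLE_PATHS) == verdicts(WITH_ALLOW, SAMPLE_PATHS))
```

A parser that doesn't know Allow would skip that line entirely, which is the argument for the empty Disallow form.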

More Tools

When in doubt, test.

Google has a built-in robots.txt testing tool that you can use on your Search Console verified properties.

See Google's docs also.
