Ready to start

Unless you’ve been doing very well to stay off the internet for at least the past year, you’ll no doubt have heard about the rise of AI. It feels like it’s come from nowhere, but it’s due to stick around for a rather long time.

Whether you embrace the idea of seemingly semi-sentient computer technology (hello ChatGPT) and the way it can assist us humans, or you’re still, understandably, sceptical, it’s already here and in our lives. Where is it getting its knowledge from? Well, it’s harvesting human-created output on the web.

With Google pretty strict on duplicate content (and so it should be), there is an argument for stopping AI from being able to ‘scrape’ your site to power itself, so that your rankings aren’t at risk of being penalised.

In this article, we’ll explore this concept of blocking AI from ‘scraping’ or ‘crawling’ your site.

What you need to know

What is a web crawler?

A web crawler downloads and indexes content from all over the internet. The best-known example is Googlebot, which crawls publicly available websites and reads the information on them in order to rank them within the search engine results pages.

What is web scraping?

Web scrapers don’t ‘read’ the information on your website; they instead extract it for use elsewhere. An example of this could be price monitoring, where an external site uses price information it has scraped to give you accurate guidance on the best deal.

What is AI content scraping?

AI content scraping combines traditional website scraping methods with more complex algorithms and processes to streamline and automate the collection of data. For example, it can scrape data from dynamic sites and organise the data it collects as it goes. In some cases, it can also bypass ‘anti-bot’ measures that a website may be using.

What are some different AI platforms?

ChatGPT

ChatGPT is a natural language processing tool developed by OpenAI, and has blown minds everywhere by answering questions, writing content, and even coding!

CCBot

CCBot isn’t fun in quite the same way as ChatGPT, but it might be if you’re into web crawling; it’s the Common Crawl project’s crawler, built on Apache Nutch, and works towards its aim of ‘creating a copy of the internet’.

Bard

Bard is Google’s own experimental AI platform, and operates as a chatbot similar to ChatGPT.

Claude

Another powerful AI chatbot comes in the form of Anthropic’s Claude, which has a particular penchant for analysing large texts and documents.

How does AI use your website’s content?

AI platforms need a regular supply of content in order to become more accurate - and your website could be providing fresh fodder. By scraping websites, AI platforms are expanding their pool of content with which they can become even more ‘intelligent’, and can produce more human-like responses to questions and content prompts.

Why are web owners becoming concerned?

This leveraging of your content leaves it open to the risk of duplication, or republication, which could see you being penalised by Google, who hate plagiarism even more than we do. You’ll no doubt have worked hard on researching, writing and optimising your website’s page content and blogs, so to have them gobbled up and churned back out by AI is making many website owners more than a little uncomfortable. You won’t find AI platforms dishing out citations either!

How to block AI web crawlers

If you’re not too happy with AI web crawler bots being able to harvest your content, there are ways you can seek to prevent them from getting to it. Here are the emerging methods:

Adding code to your robots.txt file

This method requires you to block each bot individually by name. By adding a few lines to your robots.txt file, you can disallow individual bots from crawling your site in the future, although it won’t remove anything of yours that they’ve already crawled in the past. Use the directives below to block OpenAI (ChatGPT), CCBot, Bard and Claude via your robots.txt file:

Block OpenAI

User-agent: GPTBot

Disallow: /

Block CCBot (Common Crawl)

User-agent: CCBot

Disallow: /

Block Bard

User-agent: Google-Extended

Disallow: /

Block Claude

User-agent: anthropic-ai

Disallow: /
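
If you’d rather deal with them all in one go, these rules can simply sit together in a single robots.txt file, one group per crawler. A minimal combined sketch (assuming you want to block all four of the bots above and nothing else) might look like this:

# Block OpenAI (ChatGPT)
User-agent: GPTBot
Disallow: /

# Block CCBot (Common Crawl)
User-agent: CCBot
Disallow: /

# Block Bard
User-agent: Google-Extended
Disallow: /

# Block Claude
User-agent: anthropic-ai
Disallow: /

Crawlers that aren’t named in the file, Googlebot included, carry on exactly as before.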

Use CAPTCHA

You know when you have to prove you’re not a robot to do something online? It’s not often we think about what kind of robots this might be keeping out, but AI crawlers are a great example. Gating your content behind a puzzle designed to tell humans and computers apart prevents automated crawlers from reaching it.
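
As a rough illustration of the idea, here’s what gating a page behind Google’s reCAPTCHA v2 checkbox could look like. The site key and the /protected-content path are placeholders, and your server would still need to verify the response token before handing over the content:

<!-- Load Google's reCAPTCHA widget -->
<script src="https://www.google.com/recaptcha/api.js" async defer></script>

<!-- The visitor has to pass the challenge before the form can be submitted -->
<form action="/protected-content" method="POST">
  <div class="g-recaptcha" data-sitekey="your-site-key"></div>
  <input type="submit" value="View content">
</form>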

Implement copyright laws

There is an argument that the way AI scraping software uses your content infringes copyright law, and legal action on those grounds has already been taken in the US.

Block IP ranges

If you’re willing to keep on top of the IP ranges used by platforms like ChatGPT, blocking these at server level can keep their crawlers off your content. OpenAI publishes the latest IP addresses used by its crawlers on its website; a sketch of how you might block them is below.
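
As a rough sketch of how that might look on an nginx server, the deny directive refuses requests from a given range. The ranges shown here are deliberately placeholder documentation addresses - swap in the current list published by the platform before using anything like this:

# Refuse requests from AI crawler IP ranges (placeholders - replace with
# the ranges currently published by OpenAI)
deny 192.0.2.0/24;
deny 198.51.100.0/24;

# Everything else is served as normal
allow all;

These lines sit inside the relevant server or location block; Apache users can achieve much the same thing with Require not ip rules.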

Opt out of OpenAI web crawler’s scraping

Yes, it is possible to opt out of OpenAI’s scraping. Head to their privacy centre and follow the prompts for opting out.

Should you block all bots in your robots.txt?

As effective as this is for the chatbot platforms we’ve mentioned, blocking search engine bots is another kettle of fish. Google’s search crawlers can technically be disallowed in the same way, but doing so would stop your pages from being crawled and indexed at all - and if you block Google’s bots, chances are you won’t rank!

Blocking SEO platforms such as Ahrefs and Semrush can also cut you off from the crucial analysis required to improve your rankings, so be careful when considering putting these on your block list.

Need help with your digital marketing?

AI has definitely thrown something different into the digital marketing mix, and its appearance may well feel overwhelming, what with everything else you need to keep on top of. At 427 Marketing, we’re all over the latest trends in Search Engine Optimisation as well as improving rankings for our clients; this includes keeping abreast of what the AI platforms are up to.

If you’re looking to work with an SEO company that’s moving forward with the pace of SEO best practice, and committed to making your site a success for the user and on the results page, get in touch with us today. We’d love to help!


About Chris Simmons

Chris is our on-page SEO specialist at 427 Marketing, having joined the team in early 2023. He works with our content team to cover the four pillars of SEO: content, on-page SEO, technical SEO and off-page SEO. Prior to joining the 427 Marketing team, Chris spent almost 10 years applying his SEO and content skills across several different industries in marketing agency and in-house roles, including tool hire, auctioneering, healthcare within the NHS and high-end luxury retail, in both B2B and B2C capacities. His passion for writing, content, UX, technical and on-page SEO has expanded our content offerings, helping 427 Marketing provide reliable advice about all things SEO.
