
How to Stop AI From Scraping Your Website

Your website’s content is valuable, and protecting it from unauthorized AI scraping is more important than ever. Large language models like ChatGPT pull information from various sources, including websites that haven’t explicitly granted permission. If you want to block AI tools from accessing your site, you need a combination of technical defenses, legal protections, and content strategy adjustments.

This guide walks you through every method available, from modifying your robots.txt file to enforcing terms of service and implementing CAPTCHA barriers. Whether you’re a business owner, content creator, or developer, these steps help you maintain control over your digital assets and prevent AI models from using your content without consent.

How to Stop ChatGPT, Gemini, and all AI from Scraping Your Website

Stopping ChatGPT and other AI models from scraping your site requires a mix of technical blocks, legal protections, and content strategy tweaks. Use the steps below to prevent unauthorized access and keep your content under your control.


 

[TECHNICAL BLOCKING METHODS]

1. Adjust Robots.txt File – Instruct AI bots to ignore your website.

Add a "User-agent: <bot name>" line followed by "Disallow: /" for each AI bot you want to block.

  • Access Your Website’s Root Directory – Use an FTP client, your hosting provider’s file manager, shell, etc., and navigate to your site’s root folder.
  • Open or Create a robots.txt File – If you don’t have one, create a new text file and name it robots.txt. Yes, it’s as simple as that: create a plain text file in Notepad (or any editor) and save it as robots.txt.
  • Add the Following Lines (we’ll use ChatGPT as an example):
    User-agent: GPTBot

    Disallow: /
  • Save and Upload the File – If you created or modified the file locally, save it back to your root directory.
  • Verify the Changes – Visit https://yourwebsite.com/robots.txt in a browser to confirm the new rules are visible.
  • Test for Compliance – For our example, check OpenAI’s GPTBot documentation or run your file through a robots.txt tester tool to confirm it blocks access properly.
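
For reference, here is a minimal robots.txt sketch extending the example to several AI crawlers whose user-agent strings were publicly documented as of early 2025 (illustrative, not exhaustive; see the download below for a fuller list):

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /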

 

>> DOWNLOAD: robots.txt file that blocks all AI bots as of February 2025 <<

(This file explicitly blocks known AI scrapers and common web crawlers used to build AI training datasets. Some bots, like OpenAI’s GPTBot and Google’s Google-Extended, respect robots.txt, but compliance is voluntary, so additional security measures such as IP blocking or JavaScript obfuscation might be needed; read on for those instructions.)

 

2. Block AI with Meta Tags (HTML <head> code)

Add the following meta tags inside the <head> section of your HTML pages:

  • <meta name="robots" content="noai, noindex, noimageai">
  • <meta name="googlebot" content="noai">
  • <meta name="bingbot" content="noai">
  • <meta name="gptbot" content="noindex">

noai → Tells AI bots not to use your content for training.

noindex → Prevents pages from appearing in search results (you may want to avoid this one or use it sparingly; see the more targeted, page-by-page approach in #14 below).

noimageai → Stops AI from using images for model training.

These steps help block AI bots that respect these directives, but more aggressive scrapers might still bypass them.
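
Put together (and leaving out noindex, per the warning above), the <head> of a protected page might look like this:

    <head>
    <meta charset="UTF-8">
    <meta name="robots" content="noai, noimageai">
    <meta name="googlebot" content="noai">
    <meta name="bingbot" content="noai">
    <title>Your Page Title</title>
    </head>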

 

>> DOWNLOAD: Head Code that blocks all AI bots as of February 2025 <<

(NOTE: Some of these “AI bots” are also the provider’s main SEO bot – e.g. Baidu, Yandex, etc., so implement with care or contact an AI expert for assistance.)

 

3. Block AI with HTTP Headers (Server-Side)

For Apache servers, add this to your .htaccess file:

  • <IfModule mod_headers.c>
    Header set X-Robots-Tag "noai, noimageai"
    </IfModule>

For Nginx servers, add this to your configuration file:

  • add_header X-Robots-Tag "noai, noimageai";

For Express.js (Node.js) applications, modify the response headers:

  • app.use((req, res, next) => {
    res.setHeader("X-Robots-Tag", "noai, noimageai");
    next();
    });

These steps apply the block at the HTTP level before bots access page content, and they generally work even if AI scrapers ignore robots.txt rules. They also signal that your text and images should not be used in model training.
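
To confirm the header is actually being served (the domain below is a placeholder), you can inspect the response with curl:

  • BASH | curl -I https://yourwebsite.com/ | grep -i x-robots-tag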

 

4. IP Blocking – Identify and block known AI bot IP ranges at the server level.

Identify AI Bot IP Ranges:

  • Check each AI vendor’s published crawler documentation for its current IP ranges (OpenAI, for example, publishes GPTBot’s ranges).
  • Review your server’s access logs for requests from AI user-agents and note their source IPs.
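
One quick way to surface those IPs from a standard combined-format access log (the log path is an assumption; adjust for your server):

  • BASH | grep -i "GPTBot" /var/log/nginx/access.log | awk '{print $1}' | sort -u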

Block AI IPs via Apache (.htaccess File):

  • If you’re using an Apache server, add these lines to your .htaccess file:
    • <RequireAll>
      Require all granted
      Require not ip 192.168.1.1
      Require not ip 104.132.0.0/24
      Require not ip 143.198.0.0/16
      Require not ip 34.120.0.0/14
      </RequireAll>
    • NOTE: These ranges are examples only and may not cover everything you need to block; identify the actual bot IP ranges first, as described above.

Block AI IPs on Nginx (nginx.conf or .conf File):

  • For Nginx, add this to your server block:
    • server {
      listen 80;
      server_name yourwebsite.com;
      location / {
      deny 192.168.1.1;
      deny 104.132.0.0/24;
      deny 143.198.0.0/16;
      deny 34.120.0.0/14;
      allow all;
      }
      }
  • NOTE: These ranges are examples only and may not cover everything you need to block; identify the actual bot IP ranges first, as described above.

Block AI IPs Using UFW (Linux Firewall – Ubuntu/Debian):

  • If your server runs UFW (Uncomplicated Firewall), block AI bot IPs with:
    • sudo ufw deny from 192.168.1.1
      sudo ufw deny from 104.132.0.0/24
      sudo ufw reload

Keep IP Blocks Updated:

  • AI companies may change IP addresses. Regularly check bot documentation for updates.
  • Use firewall automation tools to keep blocks current.

 

>> DOWNLOAD: List of the latest IP Addresses and Ranges to Block AI Bots as of February 2025 <<

 

 

Get Marketing Help with AI - Contact Arizona Advertising Co. Today!

 

 

5. Rate Limiting & Captcha (limit excessive requests from unknown bots with CAPTCHAs or request throttling).

Enable Rate Limiting on Your Server For Apache (Using mod_evasive):

  • Install mod_evasive (if it isn’t already installed):
    • BASH | sudo apt-get install libapache2-mod-evasive
  • Configure rate limits in /etc/apache2/mods-available/evasive.conf:
    • DOSHashTableSize 3097
      DOSPageCount 5
      DOSSiteCount 50
      DOSBlockingPeriod 600
  • Restart Apache to apply the changes:
    • BASH | sudo systemctl restart apache2

Enable Rate Limiting on Your Server For Nginx (Using limit_req_zone):

  • Nginx configuration (nginx.conf):
    • http {
      limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;

      server {
      location / {
      limit_req zone=one burst=5;
      }
      }
      }

  • Restart Nginx
    • BASH | sudo systemctl restart nginx

Enable CAPTCHA Challenges Using Cloudflare Turnstile (No User Interaction CAPTCHA):

  • Sign up for Cloudflare and enable Turnstile CAPTCHA.
  • Navigate to Security → Bots and turn on “Managed Challenge”.
  • Apply the challenge to specific pages or high-risk endpoints.

Enable CAPTCHA Using Google reCAPTCHA (w/ PHP Example):

  • Register your site at Google reCAPTCHA.
  • Add this script inside the <head> of your HTML:
    • <script src="https://www.google.com/recaptcha/api.js" async defer></script>
  • Add a CAPTCHA-protected form:
    • <form action="verify.php" method="POST">
      <div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>
      <input type="submit" value="Submit">
      </form>
  • Validate the CAPTCHA response in verify.php:
    • <?php
      $secretKey = "YOUR_SECRET_KEY";
      $response = $_POST["g-recaptcha-response"];
      $remoteIp = $_SERVER["REMOTE_ADDR"];
      $verifyUrl = "https://www.google.com/recaptcha/api/siteverify?secret=$secretKey&response=$response&remoteip=$remoteIp";

      // Ask Google to verify the submitted token
      $verifyResponse = file_get_contents($verifyUrl);
      $responseData = json_decode($verifyResponse);

      if (!$responseData->success) {
      die("CAPTCHA verification failed.");
      }
      echo "Success!";
      ?>

Enable hCaptcha for Bot Protection (w/ PHP Example):

  • Register Your Site at hCaptcha and get your site key and secret key.
  • Add the hCaptcha script inside the <head> section of your HTML:
    • <script src="https://js.hcaptcha.com/1/api.js" async defer></script>
  • Add hCaptcha to Your Form:
    • <form action="verify.php" method="POST">
      <div class="h-captcha" data-sitekey="YOUR_SITE_KEY"></div>
      <input type="submit" value="Submit">
      </form>
  • Validate the hCaptcha response in verify.php:
    • <?php
      $secretKey = "YOUR_SECRET_KEY";
      $response = $_POST["h-captcha-response"];
      $remoteIp = $_SERVER["REMOTE_ADDR"];

      $verifyUrl = "https://hcaptcha.com/siteverify";
      $data = [
      'secret' => $secretKey,
      'response' => $response,
      'remoteip' => $remoteIp
      ];

      $options = [
      'http' => [
      'header' => "Content-Type: application/x-www-form-urlencoded\r\n",
      'method' => 'POST',
      'content' => http_build_query($data),
      ],
      ];

      $context = stream_context_create($options);
      $responseData = json_decode(file_get_contents($verifyUrl, false, $context));

      if (!$responseData->success) {
      die("hCaptcha verification failed.");
      }

      echo "Success!";
      ?>

Monitor and Adjust as Needed:

  • Use server logs (access.log) to identify suspicious traffic.
  • Adjust rate limits to balance security and user experience.
  • Implement higher CAPTCHA sensitivity during traffic spikes.

 

6. Honeypot Traps: Use hidden links to detect and block AI scrapers.

Honeypot traps work by placing hidden links or form fields on your website that humans won’t see or click, but scrapers will. If a bot interacts with them, you can block its IP or take other actions.

How to Set Up a Honeypot Trap:

  • Add a Hidden Honeypot Link by placing this hidden link in your HTML:
    • <a href="/trap-page" class="honeypot">Hidden Link</a>
      <style>.honeypot { display: none; }</style>
  • Humans won’t see it due to display: none;.
  • Bots may still follow it, exposing themselves.

Create a Trap Page (trap-page.html):

  • Log visits to this page to identify scrapers (PHP example):
    • <?php
      $ip = $_SERVER['REMOTE_ADDR'];
      $file = 'honeypot_log.txt';
      file_put_contents($file, "$ip\n", FILE_APPEND);
      ?>
      <html>
      <head><meta name="robots" content="noindex, nofollow"></head>
      <body>
      Nothing to see here.
      </body>
      </html>
  • Logs suspicious IPs in honeypot_log.txt.
  • Prevents indexing so search engines ignore it.

Block Detected Bot IPs:

  • See the IP blocking instructions in step 4 above, or automate it with the sketch below.
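
A small automation sketch (assuming the honeypot_log.txt format above and a UFW firewall):

  • BASH | sort -u honeypot_log.txt | while read -r ip; do sudo ufw deny from "$ip"; done && sudo ufw reload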


[LEGAL & POLICY-BASED APPROACHES]

7. Update Your Terms of Service.

Clearly state that AI scraping is prohibited in your terms of service document for the website.

Example Verbiage:

  • “Unauthorized scraping, data extraction, or use of automated tools (including AI models, bots, and crawlers) to access, store, or repurpose content from this site is strictly prohibited. Any violation may result in legal action, IP bans, and further enforcement measures.”
    • NOTE: This is an example only; consult a lawyer to determine the exact wording you need.

 

8. Issue DMCA Takedown Notices (if necessary).

Issue takedown requests if AI models have already used your content.

How to Issue DMCA Takedown Notices for AI:

  • Identify Unauthorized Use – Find where AI models or platforms are using your content.
  • Gather Evidence – Take screenshots, URLs, and timestamps of infringements.
  • Find the Right Contact – Locate the AI company’s DMCA agent or legal contact (often in their Terms of Service).
  • Draft a DMCA Notice – Include your contact details, the infringing content, proof of ownership, and a removal request.
  • Send the Notice – Email or submit the DMCA request through the company’s designated process.
  • Follow Up – If no action is taken, send a second notice or escalate to a legal representative.
  • Monitor for Reuse – Regularly check if your content appears in AI outputs again.

 

9. Send Cease and Desist Notices.

This is another step to seek legal advice on, but a well-founded cease and desist notice can be an effective deterrent!

 

[CONTENT MODIFICATION STRATEGIES]

10. Serve key content through JavaScript to make direct scraping harder (called “JavaScript Obfuscation”).

We’d call this excessive!

Consider it only if everything else hasn’t worked…

How to Use JavaScript Obfuscation to Make Scraping Harder:

  • Convert Text to JavaScript Variables – Store key content inside JavaScript instead of plain HTML.
  • Use innerHTML to Render Content – Dynamically insert content into the page using JavaScript.
  • Encode Text in Base64 – Convert sensitive content to Base64 and decode it in JavaScript before displaying.
  • Delay Content Loading – Use setTimeout() or fetch() to load content after the initial render, so bots that only read the raw HTML miss it.
  • Randomize Element IDs and Class Names – Change identifiers dynamically to prevent pattern-based scraping.
  • Require User Interaction – Load content only after a click, scroll, or keyboard input.
  • Use CAPTCHA Before Displaying Content – Prevent bots from seeing content until a CAPTCHA is solved.
  • Detect and Block Headless Browsers – Use JavaScript checks to identify automated tools like Puppeteer.
  • Prevent Right-Click and Copying – Use document.oncontextmenu = function() { return false; } to block right-click menus.
  • Minify and Obfuscate JavaScript – Use tools like Obfuscator.io to make JavaScript unreadable to scrapers.

This makes scraping more difficult, but not impossible—combine it with other protections like IP blocking and honeypot traps.
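
As a minimal sketch of the Base64-plus-delay approach (the encoded string and timing are illustrative):

    <div id="protected"></div>
    <script>
    // Content is stored Base64-encoded and rendered after a short delay,
    // so scrapers that read only the raw HTML never see it.
    var encoded = "WW91ciBwcm90ZWN0ZWQgY29udGVudCBoZXJlLg=="; // "Your protected content here."
    setTimeout(function () {
    document.getElementById("protected").innerHTML = atob(encoded);
    }, 500);
    </script>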

 

11. Use authenticated API calls to dynamically load content.

Another excessive step to save for when the others aren’t enough: serve your content from an API endpoint that requires authentication, so anonymous bots receive nothing. A sketch follows.

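A minimal Express sketch of the idea, building on the Node.js example from step 3 (the token handling is an illustrative placeholder; a real app would issue tokens at login):

    const express = require("express");
    const app = express();

    // Tokens a real app would issue at login (illustrative placeholder)
    const validTokens = new Set(["example-session-token"]);

    app.get("/api/content", (req, res) => {
    const auth = req.get("Authorization") || "";
    const token = auth.replace("Bearer ", "");
    if (!validTokens.has(token)) {
    // Anonymous scrapers get a 401 instead of your content
    return res.status(401).json({ error: "Authentication required" });
    }
    res.json({ body: "Full article content here." });
    });

    app.listen(3000);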
 

12. Embed invisible watermarks in your content.

Embed invisible (or transparent) watermarks / unique identifiers to detect scraping.

How to Use Content Watermarking to Detect Scraping:

  • Embed Invisible Text Markers – Add hidden characters, zero-width spaces, or unique phrases within content.
  • Use CSS Hidden Elements – Place text in display: none; sections that only appear in raw HTML.
  • Insert Metadata in Images – Add author information or unique hashes in EXIF metadata of images.
  • Generate Dynamic Content Variants – Serve slightly different text versions to different users to track leaks.
  • Use Steganography for Images – Embed subtle, undetectable marks or pixel-level changes to identify copied content.
  • Add Unique HTML Comments – Insert specific comments in the page source that bots may copy.
  • Use JavaScript-Based Watermarks – Load text dynamically with unique variations per session.
  • Track Watermarked Content Online – Use search engines or AI detection tools to find stolen content.
  • Monitor AI Model Outputs – Test AI-generated content for your hidden markers to detect training use.
  • Log Unauthorized Access – Track visits to specific watermarked sections using analytics tools.

This helps identify stolen content and prove unauthorized usage if needed.
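
One small sketch of the zero-width-space technique in JavaScript (the insertion pattern is illustrative):

    // Insert zero-width spaces at fixed positions as an invisible
    // fingerprint; finding them in copied text suggests scraping.
    function watermark(text) {
    var ZWSP = "\u200B"; // zero-width space, invisible when rendered
    var out = "";
    for (var i = 0; i < text.length; i++) {
    out += text[i];
    if (i % 7 === 3) out += ZWSP; // the positions encode your fingerprint
    }
    return out;
    }

    var marked = watermark("Original sentence worth protecting.");
    console.log(marked.indexOf("\u200B") !== -1); // true: marker present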

 

13. Gate your content from AI (gating content is a common marketing tactic).

Require user logins or subscriptions to access full content (think WSJ.com or New York Times online articles).

How to Use Gated Content to Restrict AI Scraping:

  • Require User Registration – Ask users to create an account before accessing full content.
  • Use Login Authentication – Protect content behind a login system to prevent anonymous access.
  • Limit Guest Access – Show only a content preview to non-logged-in users.
  • Use Session-Based Access – Grant access only after verifying active sessions or tokens.
  • Restrict Content with Paywalls – Require a subscription or payment for full access.
  • Track and Limit Free Users – Allow limited views per user before requiring login.
  • Use CAPTCHA at Login – Prevent bots from creating fake accounts to bypass restrictions.
  • Detect and Block Shared Credentials – Monitor for multiple logins from different locations.
  • Disable Copy-Pasting for Logged-In Users – Prevent direct content extraction using JavaScript.
  • Monitor User Behavior – Flag suspicious activity such as excessive page views or automated access.

This method limits AI access while ensuring genuine users can still engage.
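
Sticking with the Express example, a minimal sketch of a preview gate (loadArticle and the session check are illustrative placeholders for your own implementation):

    app.get("/article/:id", (req, res) => {
    const article = loadArticle(req.params.id); // your own loader (illustrative)
    if (!req.session || !req.session.user) {
    // Logged-out visitors (and bots) see only a short preview
    return res.json({ preview: article.body.slice(0, 300), gated: true });
    }
    res.json({ body: article.body, gated: false });
    });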


[SEO & SEARCH ENGINE DIRECTIVES]

14. One-off search engine & simple SEO directives to block AI.

Two simple SEO directives for blocking AI at a more granular, page-by-page level.

A. Use meta tags in the head of a page (single, one-by-one):

  • Implement the meta robots tag on specific pages: <meta name="robots" content="noai, noindex, noimageai">
    • NOTE: Use 'NOINDEX' sparingly… you could accidentally kill all organic traffic. Consider using the tag in this form instead – <meta name="robots" content="noai, noimageai">

B. Block AI Proxies (other servers or services relaying bot requests anonymously):

Some AI tools use search engine proxies (intermediary servers that allow scraping to be anonymized/masked); monitor and restrict them.

  • How to Block AI Proxies and Search Engine Proxies
    • Analyze Server Logs – Check access logs for unusual traffic patterns or proxy services.
    • Block Known Proxy IPs – Use firewall rules to deny requests from public proxy and VPN providers.
    • Use Reverse DNS Lookup – Identify and restrict traffic from suspicious hostnames linked to AI services.
    • Inspect User-Agent Strings – Detect and block traffic using generic or AI-related user-agents.
    • Check X-Forwarded-For Headers – Identify hidden IPs from proxy traffic and restrict access.
    • Limit Requests Per IP – Apply rate limiting to reduce bulk scraping from proxies.
    • Use JavaScript Challenges – Require JavaScript execution, which some proxy-based scrapers cannot handle.
    • Enable CAPTCHA for Unverified Users – Prevent automated tools from bypassing restrictions.
    • Deny Access to Data Centers – Block traffic from cloud services like AWS, GCP, and Azure where AI scrapers often run.
    • Monitor Search Engine Referrals – Flag traffic coming from unusual search engine queries leading to bulk requests.

This helps reduce AI scraping via proxies while keeping normal user access intact.
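
As a sketch of the user-agent approach on Nginx (the map directive belongs in the http block; the bot list is illustrative, not exhaustive):

    map $http_user_agent $is_ai_bot {
    default 0;
    ~*(GPTBot|ChatGPT-User|CCBot|ClaudeBot|PerplexityBot|Bytespider) 1;
    }

    server {
    listen 80;
    server_name yourwebsite.com;
    location / {
    if ($is_ai_bot) { return 403; }
    }
    }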

 

[COMMUNITY AND ANTI-AI ADVOCACY]

15. Join the NoAI movement.

If you’re really, REALLY sick of the AI takeover, you can support initiatives advocating for AI-era content protections.

 

16. Request exclusions from AI companies and their training.

You can request exclusion from AI training datasets – see the links above for who/where to contact. Or…

 

17. Educate Your Company and Users on how AI scraping affects content creators. 

There are many pros and cons to AI content. That’s why we always keep a human in the mix at our content agency and offer our clients a fully human-only content experience.

 

 

What’s Next? Staying Ahead of AI Scrapers

Protecting your website from AI scrapers is about more than keeping your content safe; it also means staying one step ahead of competitors who aren’t prepared for the AI-driven future. While others scramble to react when their content appears in AI-generated results, you’re already building walls, setting traps, and locking the doors before unauthorized bots ever reach your site.

This list gives you every tool available today, from blocking AI bots at the robots.txt level to embedding invisible watermarks that expose stolen content. While AI companies evolve their scraping techniques, you’re ensuring they can’t use your hard work without a fight.

But here’s the real advantage: most businesses aren’t doing this.

If you’re concerned about safeguarding your content, chances are your competitors are in the same position, yet many of them don’t realize how quietly AI is consuming and repurposing their content. By implementing even a few of these strategies, you gain an edge in protecting your intellectual property while your competitors remain vulnerable.

So what’s the next step?

Advanced detection techniques. Imagine being able to track where your content ends up in AI-generated responses. Stay tuned, because we’re diving into how to monitor AI outputs, detect unauthorized content use, and even push back legally when necessary.

Are you ready to go from defense to offense? You won’t want to miss what’s coming next.
