
Block All AI Bots from Crawling WordPress


As technology advances, more and more text-based AI systems like ChatGPT will crawl websites and scrape your content, collecting your data to train products such as ChatGPT, OpenAI's GPT models, DeepSeek, and thousands of other AI creations.

If you want to stop AI bots from crawling your pages, here is a fairly comprehensive AI block list to keep them from lifting your content.

Why Block AI Crawlers?

Many AI bots crawl websites to collect training data for AI models. Left unchecked, your content may be used for AI training without permission, with no way for you to control it or profit from it. AI crawlers can also hit a site very frequently, consuming significant server resources, slowing down page loads, and in extreme cases overloading the server and degrading the experience for real visitors.

They may also scrape your content and reuse it on other platforms. This can lead to content theft, and search engines may even mistake your original pages for the copies, hurting your SEO rankings. And if your site hosts user comments, forums, private messages, or other sensitive data, crawling can expose user privacy.

Blocking AI Crawlers with robots.txt

The simplest way to block AI bots (AI crawlers) is through the robots.txt file. It is a plain-text rules file that tells search engine crawlers and similar bots which parts of a site they may crawl. By adding rules to it, we can ask bots to stay away from specific pages or from the entire site.

Keep in mind that robots.txt rules are only advisory; bots are not forced to obey them. Well-behaved search engine crawlers will comply, while malicious crawlers may simply ignore the file and keep scraping your content.
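In robots.txt, consecutive User-agent lines form a single group, and the Disallow rule that follows applies to every agent listed in that group; this is exactly how the long list below works. A minimal sketch of the structure (the bot names here are just examples):

# One group: both agents are asked to stay off the entire site
User-agent: GPTBot
User-agent: CCBot
Disallow: /

# A separate group: all other bots keep full access
User-agent: *
Disallow:

An empty Disallow value means nothing is disallowed, so crawlers outside the blocked group remain unaffected.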

Here are the rules to add to your robots.txt file to block AI bots:

#Block AI bots
User-agent: Agent GPT
User-agent: AgentGPT
User-agent: AIBot
User-agent: AI2Bot
User-agent: AISearchBot
User-agent: AlexaTM
User-agent: Alpha AI
User-agent: AlphaAI
User-agent: Amazon Bedrock
User-agent: Amazon Lex
User-agent: Amazonbot
User-agent: Amelia
User-agent: anthropic-ai
User-agent: AnyPicker
User-agent: Applebot
User-agent: AutoGPT
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: Brave Leo AI
User-agent: Bytespider
User-agent: CatBoost
User-agent: CC-Crawler
User-agent: CCBot
User-agent: ChatGPT
User-agent: Chinchilla
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Common Crawl
User-agent: commoncrawl
User-agent: Crawlspace
User-agent: crew AI
User-agent: crewAI
User-agent: DALL-E
User-agent: DataForSeoBot
User-agent: DeepMind
User-agent: DeepSeek
User-agent: DepolarizingGPT
User-agent: DialoGPT
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: FacebookBot
User-agent: Firecrawl
User-agent: Flyriver
User-agent: FriendlyCrawler
User-agent: Gemini
User-agent: Gemma
User-agent: GenAI
User-agent: Google Bard AI
User-agent: Google-CloudVertexBot
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GPT-2
User-agent: GPT-3
User-agent: GPT-4
User-agent: GPTBot
User-agent: GPTZero
User-agent: Grok
User-agent: Hugging Face
User-agent: iaskspider
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: IntelliSeek.ai
User-agent: ISSCyberRiskCrawler
User-agent: Kangaroo
User-agent: LeftWingGPT
User-agent: LLaMA
User-agent: magpie-crawler
User-agent: Meltwater
User-agent: Meta AI
User-agent: Meta Llama
User-agent: Meta.AI
User-agent: Meta-AI
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: MetaAI
User-agent: Mistral
User-agent: OAI-SearchBot
User-agent: OAI SearchBot
User-agent: omgili
User-agent: Open AI
User-agent: OpenAI
User-agent: PanguBot
User-agent: peer39_crawler
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: RightWingGPT
User-agent: Scrapy
User-agent: SearchGPT
User-agent: SemrushBot
User-agent: Sidetrade
User-agent: Stability
User-agent: The Knowledge AI
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: WebChatGPT
User-agent: Webzio
User-agent: Whisper
User-agent: x.AI
User-agent: xAI
User-agent: YouBot
User-agent: Zero GTP
Disallow: /
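
Place this file in your site's root directory so it is served at https://yourdomain.com/robots.txt (yourdomain.com being a placeholder for your own domain). Note that WordPress generates a virtual robots.txt when no physical file exists; uploading a real file to the web root takes precedence over the virtual one. Because each group in the file is independent, the block above does not restrict regular search engines; if you want to make that explicit, you can append a separate group, for example:

# Regular search engines keep full access
User-agent: Googlebot
Allow: /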

Forcibly Blocking AI Bots from Your Site

To block AI bots outright, add the following rules to the main .htaccess file in your site's root directory (on an Apache server). Unlike robots.txt, this is enforced by the server itself, so it effectively denies AI bots access to your site rather than merely asking them to stay away.

#Block AI bots
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Agent\ GPT|AgentGPT|AIBot|AI2Bot|AISearchBot|AlexaTM|Alpha\ AI|AlphaAI|Amazon\ Bedrock|Amazon\ Lex|Amazonbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Amelia|anthropic-ai|AnyPicker|Applebot|AutoGPT|AwarioRssBot|AwarioSmartBot|Brave\ Leo\ AI|Bytespider|CatBoost) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (CC-Crawler|CCBot|ChatGPT|Chinchilla|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Common\ Crawl) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (commoncrawl|Crawlspace|crew\ AI|crewAI|DALL-E|DataForSeoBot|DeepMind|DeepSeek|DepolarizingGPT|DialoGPT|Diffbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DuckAssistBot|FacebookBot|Firecrawl|Flyriver|FriendlyCrawler|Gemini|Gemma|GenAI|Google\ Bard\ AI|Google-CloudVertexBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Google-Extended|GoogleOther|GPT-2|GPT-3|GPT-4|GPTBot|GPTZero|Grok|Hugging\ Face|iaskspider|ICC-Crawler|ImagesiftBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (img2dataset|IntelliSeek\.ai|ISSCyberRiskCrawler|Kangaroo|LeftWingGPT|LLaMA|magpie-crawler|Meltwater|Meta\ AI|Meta\ Llama) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Meta\.AI|Meta-AI|Meta-ExternalAgent|Meta-ExternalFetcher|MetaAI|Mistral|OAI-SearchBot|OAI\ SearchBot|omgili|Open\ AI) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (OpenAI|PanguBot|peer39_crawler|PerplexityBot|PetalBot|RightWingGPT|Scrapy|SearchGPT|SemrushBot|Sidetrade|Stability) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (The\ Knowledge\ AI|Timpibot|VelenPublicWebCrawler|WebChatGPT|Webzio|Whisper|x\.AI|xAI|YouBot|Zero\ GTP) [NC]
RewriteRule (.*) - [F,L]
</IfModule>
#END Block AI bots
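
The rules above depend on mod_rewrite; if that module is unavailable, the <IfModule mod_rewrite.c> guard silently does nothing and no bot is blocked. As an alternative, here is a minimal sketch using mod_setenvif and mod_authz_core (Apache 2.4+; the shortened bot list and the ai_bot variable name are just examples, and it assumes your host's AllowOverride settings permit authorization directives in .htaccess):

#Block AI bots (alternative sketch, Apache 2.4+ without mod_rewrite)
<IfModule mod_setenvif.c>
# Flag requests whose User-Agent matches any listed bot name
SetEnvIfNoCase User-Agent "GPTBot|CCBot|ClaudeBot|Bytespider|PerplexityBot" ai_bot
</IfModule>
<IfModule mod_authz_core.c>
<RequireAll>
# Allow everyone except requests flagged as AI bots
Require all granted
Require not env ai_bot
</RequireAll>
</IfModule>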

Either way, the .htaccess approach blocks much the same AI bots as the robots.txt method, but with a key difference:

  • robots.txt is only a suggestion: it depends on each bot honoring the Disallow rules. Well-behaved bots comply, while malicious crawlers may simply ignore it.
  • The .htaccess method genuinely blocks the listed AI bots, because the server intercepts matching requests directly and refuses them with a 403 Forbidden response.
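
To confirm the server-level block is working, you can send a test request with a blocked User-Agent string, for example curl -A "GPTBot" -I https://yourdomain.com/ (yourdomain.com again being a placeholder), and check that the response is 403 Forbidden, while a request with a normal browser User-Agent still returns 200.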
