 -------------------------------------------
  
This tutorial is designed for Debian OS and LAMP stack users who want to track and/or prohibit bot scraping (or other URL requests) that might harm server performance or cause it to fail. In my case, I have a multi-site WordPress install that includes my tech blog, poetry, and teaching blog. Additionally, I have a separate vhost on the same instance for my DokuWiki and another for a file-share-only Nextcloud. The same instance also runs my business's email server with postgres and dovecot. Needless to say, I don't want this virtual appliance to have downtime. The first thing I did was create a script that scrapes and tallies all the bots and how many requests each has made during the last day:
  
  * [[https://repo.haacksnetworking.org/haacknet/haackingclub/-/blob/main/scripts/apache/bot-scrape-daily.sh?ref_type=heads|Daily Bot Scrape]]
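
The linked script does the real work; as a very rough illustration of the idea, a minimal sketch that tallies requests by user agent from an Apache combined log might look like the following (the log path and format are assumptions, and the actual script above is more thorough):

  # tally the 20 busiest "bot" user agents in the current access log
  # (assumes Apache combined log format, where the user agent is the 6th quote-delimited field)
  awk -F'"' '{print $6}' /var/log/apache2/access.log | grep -i 'bot' | sort | uniq -c | sort -rn | head -20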
  * [[https://fosstodon.org/@jimsalter/114270715367978012|FreeBSD Wiki]]: Bursts up to 10K / min
  
The worst scraping was roughly 12K/min, so setting workers to a maximum of 800 would give me more than 5x what I needed to ensure that my server would not crash during the bot flood itself. The next question, before I set those values, was whether my hardware could handle such an allocation. In my case, the virtual appliance has 16 vCPUs/cores, so assuming 50 threads per core, that's 800 threads at roughly 2-5MB each, or about 4GB of RAM at the high end. Okay, so I configured the mpm_event module (no one should be using prefork anymore) by opening ''/etc/apache2/mods-available/mpm_event.conf'' and changing the defaults as follows:
  
  #Defaults
  ServerLimit             50
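
Before settling on values like these, it's worth sanity-checking the 2-5MB per-thread assumption against what Apache actually uses on your box. A quick, rough check (assuming Debian's ''apache2'' process name) is something like the following; remember that with mpm_event each process carries many threads, so divide the per-process figure by your ThreadsPerChild to estimate per-thread cost:

  # average resident memory per apache2 process (in MB)
  ps -o rss= -C apache2 | awk '{sum+=$1; n++} END {if (n) printf "%d processes, avg %.1f MB RSS each\n", n, sum/n/1024}'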
  
The stock configuration seems to be tailored to let a popular hobbyist run a functioning website with minimal configuration changes. It does not, however, prepare the server to sustain requests at the rates we've all heard about recently. So, I made those changes first and then turned my attention to the PHP-FPM workers, or child processes. I did this because I rely heavily on PHP, with WordPress multi-site, Nextcloud, and DokuWiki all running on this appliance. Looking at ''/etc/php/8.2/fpm/pool.d/www.conf'', I adjusted the servers as follows:
      
  #default
  ignoreregex =
      
It is important to note that this definition stops any URL requests that exceed 20K/min, not just AI bots, hence the "AI-bots" title with a "+" sign. Also, if you prefer a jail that only bans the bots, and not all heinously large floods of URL requests, then adjust your filter to something like this instead:

  [Definition]
  failregex = ^<HOST> - - \[.*\] "GET [^"]*HTTP[^"]*" .*(GPTBot|ClaudeBot|Bytespider|PerplexityBot|CCBot|xAI-Bot|DeepSeekBot|Google-Extended|Anthropic-Web-Crawler|facebookexternalhit|ia_archiver|Applebot|bingbot|Twitterbot|Slackbot|Discordbot)\b.*
  ignoreregex =
  ignorecase = true

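A filter on its own doesn't ban anything until a jail references it. The jail used on this server isn't shown in this section, but a minimal sketch for ''/etc/fail2ban/jail.local'', assuming the filter file is saved as ''apache-aibots.conf'' and using the thresholds discussed on this page (20K requests within one minute, 10-minute ban), would look roughly like this; for the bot-name-only filter above you would want a much lower maxretry, since a handful of matches is already conclusive:

  # hypothetical jail stanza; jail/filter name and log path are assumptions
  [apache-aibots]
  enabled   = true
  port      = http,https
  filter    = apache-aibots
  logpath   = /var/log/apache2/*access.log
  findtime  = 60
  maxretry  = 20000
  bantime   = 600
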
Next, I restarted apache, php-fpm, and fail2ban, and because I'm paranoid, I restarted them once more and checked each service's status to ensure that nothing was misconfigured. All was in order, so it was now time to DDoS my machine from another server of mine. To do this, I used ''ab'' as follows:
  
  ab -n 30000 -c 1000 http://haacksnetworking.org/
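
For reference, ''-n'' sets the total number of requests and ''-c'' the number of concurrent clients, so this fires 30,000 requests with 1,000 of them in flight at a time.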
      
It was overkill to do 30K, but I needed to make sure I sent enough requests to trigger the rule. It worked properly and banned the IP for 10 minutes. In short, if the bots get too frisky in the weeks ahead, I can time them out at whatever rate (URL requests per minute) I choose, for however many minutes I feel is appropriate. Since my server and virtual appliance can handle it, I've set my ceiling at 20K requests in one minute. Your use case, hardware, and personal tolerance of (or bias toward) bot behavior might differ. Maybe your hardware is less robust or more robust, maybe you think the bots deserve a longer or shorter time out, or maybe you want to ban any offending IP indefinitely. For me, I want bots to scrape my wiki, tech blog, poetry, or whatever else they want, and I don't think it's really fair or honest of me to ask them to change their behavior for my public site; if I held that sentiment, I would not have made these resources public. But I also don't want my server to go down because they scrape or request too much, amounting to a de facto DDoS attack. So, this is what I cooked up to be prepared for the weeks ahead, and I hope it helps others who run Debian + LAMP stacks protect their appliances using tools commonly available to Debian users.
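
To confirm that the jail is live and that a test like this actually tripped it, ''fail2ban-client'' can be queried directly (the jail name below is the hypothetical one from the sketch earlier, and the IP is just a placeholder):

  # list jails, then show ban counts and currently banned IPs for one jail
  fail2ban-client status
  fail2ban-client status apache-aibots
  # lift a test ban early if needed
  fail2ban-client set apache-aibots unbanip 203.0.113.10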

 --- //[[alerts@haacksnetworking.org|oemb1905]] 2025/04/12 01:24//