  • managingbots
  • Jonathan Haack
  • Haack's Networking
  • webmaster@haacksnetworking.org

managingbots


This tutorial is designed for Debian OS and LAMP stack users. In my case, I have a multi-site WordPress instance that includes my tech blog, poetry, and teaching blog. Additionally, I have a separate vhost on the same instance that is this very dokuwiki. The first thing I did was create a script that would scrape and tally all the bots and how many requests each made during the last day.
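The script itself isn't reproduced in this revision, but a minimal sketch of such a tally pipeline, assuming the standard Apache combined log format, might look like this (the function name and log path are hypothetical; adjust to your vhosts):

```shell
#!/bin/sh
# Hypothetical sketch: tally requests per user agent from an Apache
# combined-format access log. Log path and function name are assumptions.
tally_bots() {
    logfile="$1"
    # In combined log format, splitting on '"' puts the User-Agent in
    # field 6; count hits per agent, then sort descending by count.
    awk -F'"' '{ agents[$6]++ } END { for (a in agents) print agents[a], a }' \
        "$logfile" | sort -rn
}

# Example: tally_bots /var/log/apache2/access.log
```

Run daily from cron against the rotated log, this gives a quick ranking of which user agents hit the server hardest during the previous day.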

The first report is listed below. I now keep this running daily, with automatic report purging, so I can monitor and fine-tune it as needed.

Considering that my physical host has 48 threads and 384GB of RAM, and that the virtual appliance this script was built for has 16 vCPU cores and 16GB of RAM, I decided the first thing to do was tweak Apache to ensure it could handle a flood long enough to take action. By increasing the maximum workers to 800, with the server able to handle approximately 1.5K - 2.5K requests per second, the server could absorb roughly 95K - 145K requests per minute. This is well above what even the most aggressive bot did to my server, and also well above most reports I've seen from impacted servers. Here are some bird's-eye averages from some of the reports we've all read about:

Okay, so the multi-processing module numbers above are tweaked to roughly 5X the worst attack. The next question, before setting those values, is whether my hardware can handle that. In my case, the virtual appliance has 16 vCPUs/cores, so assuming 50 threads per core (800 threads total) at 2-5MB per thread, that works out to at most roughly 4-5GB of total RAM usage. Okay, so I configured the mpm_event module (no one should be using prefork anymore) by opening /etc/apache2/mods-available/mpm_event.conf and changing the defaults as follows:

#Defaults
#StartServers            2
#MinSpareThreads         25
#MaxSpareThreads         75
#ThreadLimit             64
#ThreadsPerChild         25
#MaxRequestWorkers       150
#MaxConnectionsPerChild  0
#Adjustments
StartServers            4
MinSpareThreads         25
MaxSpareThreads         75
ThreadLimit             64
ThreadsPerChild         25
MaxRequestWorkers       800
MaxConnectionsPerChild  0
#32 (800 workers / 25 threads per child) is the exact ServerLimit; setting to 50 to have some wiggle room
ServerLimit             50
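The thread and memory estimate above can be sanity-checked with quick shell arithmetic (the per-thread figure is the high end of the rough 2-5MB assumption from the text, not a measured value):

```shell
# Back-of-envelope check for the mpm_event settings above.
# ServerLimit * ThreadsPerChild must cover MaxRequestWorkers,
# and peak thread RAM should fit within the appliance's 16GB.
MAX_WORKERS=800
THREADS_PER_CHILD=25
MB_PER_THREAD=5                              # high end of the assumed 2-5MB range

SERVERS_NEEDED=$((MAX_WORKERS / THREADS_PER_CHILD))
PEAK_MB=$((MAX_WORKERS * MB_PER_THREAD))

echo "ServerLimit needed: $SERVERS_NEEDED"   # 32, hence 50 with wiggle room
echo "Peak thread RAM: ~${PEAK_MB}MB"        # ~4000MB, comfortably within 16GB
```

You can also confirm the event MPM is actually the one in use with `apachectl -V | grep -i mpm` or, on Debian, `a2query -M`.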

The stock configuration seems tailored to let a hobbyist run a moderately popular website with minimal configuration changes. Once those values were changed, and since I use WordPress and Nextcloud, which rely heavily on PHP, I also took a look at /etc/php/8.2/fpm/pool.d/www.conf and adjusted the servers (child processes) as follows:

#default
#pm.max_children = 5
#pm.start_servers = 2
#pm.min_spare_servers = 1
#pm.max_spare_servers = 3
#adjusted
pm.max_children = 400
pm.start_servers = 40
pm.min_spare_servers = 20
pm.max_spare_servers = 40
pm.max_requests = 1000
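As with the MPM numbers, it's worth estimating what pm.max_children = 400 could consume at peak. The 60MB-per-child figure below is an assumption for a typical WordPress workload, not a measurement:

```shell
# Back-of-envelope check for the PHP-FPM pool above.
MAX_CHILDREN=400
MB_PER_CHILD=60        # assumed average FPM worker size for WordPress

FPM_PEAK_MB=$((MAX_CHILDREN * MB_PER_CHILD))
echo "Theoretical FPM ceiling: ~${FPM_PEAK_MB}MB"   # ~24000MB, i.e. roughly 24GB
```

This lines up with the ~24GB theoretical ceiling mentioned in the text; in practice the pool rarely gets anywhere near it.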

If everything hit its theoretical ceiling, I would likely need more like 24GB of RAM, but presuming that rarely (if ever) happens, I left the RAM allocation at 16GB. Okay, so this means my web server and my PHP configuration and handler were all optimized to handle at least 20K simultaneous requests, and in fact more than that given the buffers and allowances I chose. Next, what to do about these bots? Well, in my case, I have hardware that can handle the configuration above. So the obvious next conclusion is that anything above 20K URL requests of any form (whether AI bots or regular old DDoS) should be given an iptables ban so my server stays running. And which tool is designed to assess access logs and generate firewall rules accordingly? Enter fail2ban. This tutorial assumes you already have fail2ban installed and minimally configured for core services; if not, please read Fail2Ban first. Alright, let's create a custom jail that stops anything over 20K requests in a minute and times those IPs out for 10 minutes. In /etc/fail2ban/jail.local I entered:

[apache-botflood]
enabled  = true
port     = http,https
filter   = apache-botflood
logpath  = /var/log/apache2/access.log
           /var/log/apache2/access.log.1
maxretry = 20000
findtime = 60
bantime  = 600

Then, I created the regex filter in /etc/fail2ban/filter.d/apache-botflood.conf as follows:

[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD).*HTTP.*"$
ignoreregex =
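Note that this filter deliberately matches every GET/POST/HEAD request line, which is exactly what a volume-based jail wants: maxretry then becomes a requests-per-findtime threshold. The proper way to test a filter is fail2ban-regex (e.g. `fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-botflood.conf`), but you can also sanity-check the pattern's shape against a sample log line with grep (the sample line below is made up):

```shell
# A grep -E approximation of the failregex above, with <HOST> replaced
# by a concrete IPv4 pattern. It should match any GET/POST/HEAD request.
SAMPLE='1.2.3.4 - - [06/Apr/2025:09:58:00 +0000] "GET / HTTP/1.1" 200 123 "-" "SomeBot/1.0"'
echo "$SAMPLE" | grep -qE '^[0-9.]+ .* "(GET|POST|HEAD).*HTTP.*"$' \
    && echo "matched"
```

Once the jail is live, `fail2ban-client status apache-botflood` shows the current match counts and banned IPs.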

After restarting apache2, php-fpm, and fail2ban, rebooting for good measure, and checking the status of all services to ensure there were no errors, it was time to DDoS my machine from another server of mine. To do this, I used ab (ApacheBench) as follows:

ab -n 30000 -c 1000 http://haacksnetworking.org/

It was overkill to do 30K, but I needed to be sure I sent enough to trigger the rule. It worked properly and banned the IP for 10 minutes. In short, if the bots get too frisky, I can time them out at whatever requests-per-minute threshold, and for however long, I feel is appropriate. Since my server and virtual appliance can handle it, I've set my rule to 20K requests in one minute. Your use case, hardware, and personal tolerance of bot behavior might differ: maybe your hardware is less (or more) robust, maybe you think the bots deserve a longer or shorter timeout, or maybe you want to ban any IP that does this indefinitely. For me, I want bots to be able to scrape my wiki, tech blog, poetry, or whatever else they want. And, as far as the technology is concerned, I don't think it's really fair or honest of me to request that they change their behavior for my public site; if I held that sentiment, I would not have made these resources public. With that said, I also don't want my server to go down (there was a suspicious November 24 outage, lol), and I want to help others who use LAMP stacks and common PHP-based instances protect their appliances using common tools available to Debian users. Hopefully this helps!

Happy hacking!

oemb1905 2025/04/06 09:58

computing/managingbots.1743934571.txt.gz · Last modified: 2025/04/06 10:16 by oemb1905