managingbots


This tutorial is designed for Debian OS and LAMP stack users who want to track and/or prohibit bot scraping (or other URL requests) that might harm server performance and/or cause it to fail. In my case, I have a multi-site WordPress that includes my tech blog, poetry, and teaching blog. Additionally, I have a separate vhost on the same instance for my dokuwiki and another for a file-share-only Nextcloud. The same instance also runs my business's email server with postfix and dovecot. Needless to say, I don't want this virtual appliance to have downtime. The first thing I did was create a script that scans the access logs and tallies which bots hit the server and how many requests each made during the last day:
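
The script itself is not reproduced on this page, so here is a minimal sketch of the idea, assuming Apache's combined log format and yesterday's rotated log at /var/log/apache2/access.log.1 (the paths, report location, and bot keywords are illustrative, not the original):

#!/bin/bash
# Hypothetical daily bot tally (illustrative sketch, not the original script).
# Assumes Apache combined log format; the user agent is the last quoted field.
LOG=/var/log/apache2/access.log.1
REPORT=/root/bot-report-$(date +%F).txt

# Pull the user-agent field, keep common crawler signatures, and tally hits
# per agent, highest counts first.
awk -F'"' '{print $6}' "$LOG" \
  | grep -Ei 'bot|crawl|spider|gptbot|claudebot|bytespider' \
  | sort | uniq -c | sort -rn > "$REPORT"

# Purge reports older than 30 days so they do not pile up.
find /root -maxdepth 1 -name 'bot-report-*.txt' -mtime +30 -delete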

The first report I ran is listed below. I keep this running daily now, including report purging, so I can monitor the numbers and fine-tune the setup if they drop or rise significantly in the weeks ahead.
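
To keep the tally running daily, a crontab entry along these lines would do (the script path and run time are assumptions, not from the original page):

# Hypothetical cron entry: tally yesterday's bots shortly after the nightly logrotate.
25 6 * * * /usr/local/sbin/bot-tally.sh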

Considering my physical host has 48 threads and 384GB of RAM, and that the virtual appliance that runs on it was built with 16 vCPU cores and 16GB of RAM, I decided the first thing to do was to tweak apache to ensure it could handle the flood long enough to take action, and then, after that, to confirm that the virtual hardware I allocated to the appliance was sufficient to handle that load. Accordingly, I speculated that if I increased the request workers to 800, and each worker could handle 2–3 requests per second, the server could absorb roughly 1,600–2,400 requests per second, or about 100K–150K requests per minute. This is well above what even the most aggressive bot did to my server, and also well above most of the reports I've seen from impacted servers. Regarding the newsworthy articles, I investigated four of them and, based on their reports, did my best to estimate the worst bursts of URL requests they experienced. I also compared these numbers to known figures on what constitutes a DDoS attack (20K requests per minute or higher). Here are the estimates and associated original articles:

The worst scraping was roughly 12K requests/min, so setting workers to a maximum of 800 would give me more than 5x the headroom I needed to ensure that my server would not crash during the bot flood itself. The next question, before I set those values, was whether my hardware could handle such a large allocation. In my case, the virtual appliance has 16 vCPUs/cores, so assuming 50 threads per core, that is 800 threads; at roughly 2–5MB per thread, that works out to at most about 4–5GB of total RAM usage. Okay, so I configured the mpm_event module (no one should be using prefork anymore) by opening /etc/apache2/mods-available/mpm_event.conf and changing the defaults as follows:

#Defaults
#StartServers            2
#MinSpareThreads         25
#MaxSpareThreads         75
#ThreadLimit             64
#ThreadsPerChild         25
#MaxRequestWorkers       150
#MaxConnectionsPerChild  0
#Adjustments
StartServers            4
MinSpareThreads         25
MaxSpareThreads         75
ThreadLimit             64
ThreadsPerChild         25
MaxRequestWorkers       800
MaxConnectionsPerChild  0
#32 is the exact ServerLimit, setting to 50 to have some wiggle room
ServerLimit             50
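
Before relying on the new limits, it is worth confirming that the event MPM is actually the one loaded and that the syntax parses; on Debian, something along these lines works (the verification commands are my suggestion, not from the original page):

# Confirm the event MPM is loaded and the config parses cleanly, then apply.
apache2ctl -M | grep mpm     # should show: mpm_event_module (shared)
apachectl configtest         # should report: Syntax OK
systemctl restart apache2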

The stock configuration seems tailored to let a popular hobbyist run a functioning website with minimal configuration changes. It does not, however, prepare the server to sustain requests at the rates we've all heard about recently. So, I made those changes first, and then I turned my attention to the PHP-FPM workers, or child processes. I did this because I rely on PHP heavily, with WordPress multi-site, Nextcloud, and Dokuwiki all running on this appliance. Looking at /etc/php/8.2/fpm/pool.d/www.conf, I adjusted the workers as follows:

#default
#pm.max_children = 5
#pm.start_servers = 2
#pm.min_spare_servers = 1
#pm.max_spare_servers = 3
#adjusted
pm.max_children = 400
pm.start_servers = 40
pm.min_spare_servers = 20
pm.max_spare_servers = 40
pm.max_requests = 1000
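
Whether 400 children actually fit in RAM depends on how heavy each PHP-FPM child is on your workload; one rough way to check the average resident size on a running system (the process name php-fpm8.2 is an assumption and may differ on your install) is:

# Average RSS per php-fpm child, in MB (process name assumed to be php-fpm8.2).
ps --no-headers -o rss -C php-fpm8.2 \
  | awk '{sum+=$1; n++} END {if (n) printf "%.0f MB avg across %d children\n", sum/n/1024, n}'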

If everything hit its theoretical ceiling, I might need more like 24GB of RAM, but only if the appliance absorbed 150K hits per minute for an extended period. I considered that unlikely, and it is also easy to change later with virsh edit domain.com should the need arise. At this point, both my web server and my PHP handler were optimized to comfortably handle at least 20K simultaneous requests. This meant that just by configuring my server properly, I could sustain the burst/flood and then take action on it. To me, the obvious tool to take action with is fail2ban, which is designed to watch the logs for abusive request patterns and create firewall rules to address them. Please read Fail2Ban if you are not familiar with how to set up a basic configuration. Take care to adjust dbpurgeage to 30d; otherwise, many of the rules and timeouts you set up will exceed the amount of data available in the database that fail2ban queries. Once that was done and confirmed, I created a custom jail that stops anything over 20K requests in a minute and times those IPs out for 10 minutes. In /etc/fail2ban/jail.local I entered:

[apache-botflood]
enabled  = true
port     = http,https
filter   = apache-botflood
logpath  = /var/log/apache2/access.log
           /var/log/apache2/access.log.1
maxretry = 20000
findtime = 60
bantime  = 600

Then, I created the regex filter in /etc/fail2ban/filter.d/apache-botflood.conf as follows:

[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD).*HTTP.*"$
ignoreregex =
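
Before restarting anything, the filter can be dry-run against the live log with fail2ban-regex to make sure the pattern actually matches your log format:

# Dry-run the filter against the access log; the summary should show matched lines.
fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-botflood.conf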

It is important to note that this definition stops any URL requests that exceed 20K/min, not just AI bots, hence the "AI-bots" plus the "+" sign in the title. Next, I restarted apache, fpm, and fail2ban, and because I'm paranoid, I then checked each service's status to ensure that nothing was misconfigured. All was in order, so it was now time to DDoS my own machine from another server of mine. To do this, I used ab as follows:

ab -n 30000 -c 1000 http://haacksnetworking.org/
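
Once the flood finishes, the jail status and any banned IPs can be inspected with fail2ban-client, and the test ban lifted early if desired (the IP below is an example address, not the real test server):

# Check the jail and currently banned IPs, then lift the test ban if desired.
fail2ban-client status apache-botflood
fail2ban-client set apache-botflood unbanip 203.0.113.10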

It was overkill to do 30K, but I needed to make sure it was enough to trigger the rule. It worked properly and banned the IP for 10 minutes. In short, if the bots get too frisky in the weeks ahead, I can set whatever rate (URL requests per minute) I choose as the trigger and time offenders out (in minutes) for however long I feel is appropriate. Since my server and virtual appliance can handle it, I've set my rule to 20K requests in one minute as the ceiling I tolerate. Your use case, hardware, and personal tolerance of bot behavior might differ. Maybe your hardware is less or more robust, maybe you think the bots deserve a longer or shorter time out, or maybe you want to ban any IP that does this indefinitely. For me, I want bots to scrape my wiki, tech blog, poetry, or whatever else they want, and I don't think it's really fair or honest of me to request that they change their behavior for my public site; if I held that sentiment, I would not have made these resources public. But I also don't want my server to go down because they scrape or request too much, amounting to a de facto DDoS attack. So, this is what I cooked up to ensure I am prepared in the weeks ahead, and I hope it helps others who use Debian + LAMP stacks protect their appliances using common tools available to Debian users.

oemb1905 2025/04/06 19:27
