-------------------------------------------

This tutorial is designed for Debian OS and LAMP stack users who want to track and/or prohibit bot scraping (or other URL requests) that might harm server performance or cause it to fail. In my case, I have a multi-site WordPress that includes my tech blog, poetry, and teaching blog. Additionally, I have a separate vhost on the same instance for my dokuwiki and another for a file-share-only Nextcloud. The same instance also runs my business's email server with postfix and dovecot. Needless to say, I don't want this virtual appliance to have downtime. The first thing I did was create a script that scrapes the access logs and tallies all the bots and how many requests each made during the last day:

  * [[https://repo.haacksnetworking.org/haacknet/haackingclub/-/blob/main/scripts/apache/bot-scrape-daily.sh?ref_type=heads|Daily Bot Scrape]]
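
For a rough sense of what such a tally involves, the sketch below pulls per-bot hit counts straight out of the Apache access log. This is only an illustration, not the linked script; the log path, output path, and bot list are placeholders to adjust for your own vhosts.

  #!/bin/bash
  # Illustrative sketch only -- not the linked bot-scrape-daily.sh.
  # Assumes the stock Debian/Apache log location; adjust paths and bot list.
  LOG="/var/log/apache2/access.log"
  OUT="/root/bot-scrape-$(date +%m-%d-%y).txt"
  for BOT in GPTBot ClaudeBot Bytespider PerplexityBot CCBot bingbot; do
      printf '%s: %s hits\n' "$BOT" "$(grep -c "$BOT" "$LOG")" >> "$OUT"
  done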

The first report I ran is listed below. I keep the script running daily now, including report purging, so I can monitor the numbers and fine tune things if they drop or rise significantly in the weeks ahead.

  * [[https://haacksnetworking.org/bot-scrape-04-04-25.txt|April 4th Report]]
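
To keep this running daily with old reports purged, a pair of cron entries along these lines does the job. The script path, report location, and 30-day retention below are assumptions for illustration, not my exact setup.

  # /etc/cron.d/bot-scrape (illustrative paths and retention)
  30 0 * * * root /usr/local/bin/bot-scrape-daily.sh
  45 0 * * * root find /root -name 'bot-scrape-*.txt' -mtime +30 -delete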

Considering that my physical host has 48 threads and 384GB of RAM, and that the virtual appliance that runs on it (and for which this script was built) has 16 vCPU cores and 16GB of RAM, I decided the first thing to do was tweak apache so it could handle a flood long enough for me to take action, and then make sure the virtual hardware I allocated to the appliance was sufficient for that load. I speculated that if I increased the maximum number of workers to 800, with each worker handling roughly 2-3 requests per second, the server could absorb about 100K-150K requests per minute. This is well above what even the most aggressive bot did to my server, and also well above what most of the impacted servers in the news have reported. Regarding those newsworthy articles, I investigated four of them and, based on their reports, did my best to estimate the worst bursts of URL requests they experienced. I also compared these numbers to commonly cited figures for what constitutes a DDoS attack (20K requests per minute or higher). Here are the estimates and the associated original articles:

  * [[https://www.gamedeveloper.com/business/-this-was-essentially-a-two-week-long-ddos-attack-game-ui-database-slowdown-caused-by-openai-scraping|Game UI Database]]: Bursts up to 12K / min
  * [[https://fosstodon.org/@jimsalter/114270715367978012|freeBSD Wiki]]: Bursts up to 10K / min

The worst scraping in those reports was roughly 12K requests per minute, which meant that capping workers at 800 would give me well over five times the capacity needed to keep the server up during the bot flood itself. The next question, before setting those values, was whether my hardware could handle such a generous allocation. In my case, the virtual appliance has 16 vCPUs/cores, so assuming 50 threads per core (800 threads total) at roughly 2-5MB per thread, that works out to at most about 4-5GB of RAM. Okay, so I configured the mpm_event module (no one should be using prefork anymore) by opening ''/etc/apache2/mods-available/mpm_event.conf'' and changing the defaults as follows:

  #Defaults
  ServerLimit 50
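
For readers who want a fuller picture, an mpm_event configuration sized for roughly 800 workers could look something like this; the exact values are illustrative and merely consistent with the math above, not a copy of my production file:

  # /etc/apache2/mods-available/mpm_event.conf (illustrative values)
  <IfModule mpm_event_module>
          StartServers              4
          ServerLimit              50
          MinSpareThreads          50
          MaxSpareThreads         250
          ThreadLimit              64
          ThreadsPerChild          25
          MaxRequestWorkers       800
          MaxConnectionsPerChild 1000
  </IfModule>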

The stock configuration seems to be tailored to let a popular hobbyist run a functioning website with minimal changes; it does not, however, prepare the server to sustain requests at the rates we've all heard about recently. So I made those changes first, and then turned my attention to the PHP-FPM workers (child processes), since WordPress multi-site, Nextcloud, and Dokuwiki all lean heavily on php on this appliance. Looking at ''/etc/php/8.2/fpm/pool.d/www.conf'', I adjusted the servers as follows:

  #default
  pm.max_requests = 1000
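
As another hedged illustration (not my exact pool file), the relevant ''pm'' settings in ''www.conf'' might end up looking something like this on a box with this much headroom:

  ; /etc/php/8.2/fpm/pool.d/www.conf (illustrative values)
  pm = dynamic
  pm.max_children = 100
  pm.start_servers = 20
  pm.min_spare_servers = 10
  pm.max_spare_servers = 40
  pm.max_requests = 1000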

If everything hit its theoretical ceiling, I might need more like 24GB of RAM, but that would only happen if the server sustained 150K hits per minute for a long stretch. I considered that unlikely, and it is also easy to change later with ''virsh edit domain.com'' should the need arise. At this point, both my web server and my php handler were tuned to handle at least 20K requests per minute with ease. That meant that just by configuring the server properly, I could absorb the burst/flood and then take action on it. To me, the obvious tool for taking that action is fail2ban, which scans logs for abusive patterns and generates firewall rules in response. Please read [[https://wiki.haacksnetworking.org/doku.php?id=computing:fail2ban|Fail2Ban]] if you are not familiar with how to set up a basic configuration. Take care to raise dbpurgeage to 30d, otherwise the rules and timeouts you set up will likely exceed the data points available in the database that fail2ban queries. Once that was done and confirmed, I created a custom jail that stops anything over 20K requests in a minute and times those IPs out for 10 minutes. In ''/etc/fail2ban/jail.local'' I entered:

  [apache-botflood]
  ignoreregex =
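
For context, a complete setup pairs the jail stanza in ''jail.local'' with a matching filter file in ''filter.d''. The sketch below is one way to express the 20K-per-minute / 10-minute-timeout logic; treat the jail name, filter filename, and log path as assumptions to adapt rather than a verbatim copy of my config:

  # /etc/fail2ban/jail.local (sketch: >20K requests in 60s triggers a 10 minute ban)
  [apache-botflood]
  enabled  = true
  port     = http,https
  filter   = apache-botflood
  logpath  = /var/log/apache2/*access.log
  findtime = 60
  maxretry = 20000
  bantime  = 600

  # /etc/fail2ban/filter.d/apache-botflood.conf (sketch: match every request line)
  [Definition]
  failregex = ^<HOST> -.*"(GET|POST|HEAD).*
  ignoreregex =

With ''maxretry'' at 20000 and ''findtime'' at 60, any single IP that crosses 20K hits inside a minute earns the 600-second timeout described above.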

It is important to note that this setup stops any URL requests that exceed 20K/min, not just AI bots, hence the title's reference to AI-bots with a "+" sign. Also, if you prefer a jail that only bans the bots, and not all heinously large floods of requests, then use a filter definition along these lines instead:

  [Definition]
  failregex = ^<HOST> - - \[.*\] "GET [^"]* HTTP[^"]*" .*(GPTBot|ClaudeBot|Bytespider|PerplexityBot|CCBot|xAI-Bot|DeepSeekBot|Google-Extended|Anthropic-Web-Crawler|facebookexternalhit|ia_archiver|Applebot|bingbot|Twitterbot|Slackbot|Discordbot)\b.*
  ignoreregex =
  ignorecase = true
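
If you go that route, the filter needs to live in its own file under ''/etc/fail2ban/filter.d/'' and be referenced from a jail with a much lower threshold, since you are matching bots by user agent rather than raw volume. The filename and the 300-per-minute figure below are placeholders, not settings I am prescribing:

  # /etc/fail2ban/jail.local (sketch for a user-agent based jail)
  [apache-aibots]
  enabled  = true
  port     = http,https
  filter   = apache-aibots
  logpath  = /var/log/apache2/*access.log
  findtime = 60
  maxretry = 300
  bantime  = 600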

Next, I restarted apache, fpm, and fail2ban, and because I'm paranoid I also rebooted the appliance for good measure and checked each service's status to make sure nothing was misconfigured. All was in order, so it was time to DDoS my machine from another server of mine. To do this, I used ''ab'' as follows:

  ab -n 30000 -c 1000 http://haacksnetworking.org/
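
Once the flood ran, a quick way to confirm the jail actually fired (assuming the jail name used above) is to ask fail2ban directly on the target server:

  sudo fail2ban-client status apache-botflood
  # the output lists currently banned IPs; the attacking server should appear there
  sudo tail -n 20 /var/log/fail2ban.log
  # look for a "Ban <ip>" line recorded for the test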

It was overkill to do 30K, but I needed to make sure I sent enough to trigger the rule. It worked properly and banned the IP for 10 minutes. In short, if the bots get too frisky in the weeks ahead, I can throttle them at whatever rate (URL requests per minute) I choose and time them out for however long I feel is appropriate. Since my server and virtual appliance can handle it, I've set my ceiling at 20K requests in one minute. Your use case, hardware, and personal tolerance of (or bias toward) bot behavior might differ. Maybe your hardware is less or more robust, maybe you think the bots deserve a longer or shorter time out, or maybe you want to ban any IP that does this indefinitely. For me, I want bots to scrape my wiki, tech blog, poetry, or whatever else they want, and I don't think it's really fair or honest of me to ask them to change their behavior for my public site. If I held that sentiment, I would not have made these resources public. But I also don't want my server to go down because they scrape or request too much, amounting to a de facto DDoS attack. So, this is what I cooked up to make sure I'm prepared for the weeks ahead, and I hope it helps others who run Debian + LAMP stacks protect their appliances using tools commonly available to Debian users.

 --- //[[alerts@haacksnetworking.org|oemb1905]] 2025/04/12 01:24//