I was scouring the indie web earlier and found a pretty useful list of bots to add to your robots.txt. Since I’m not convinced that robots.txt alone is enough to keep them away, I also figured out a simple way that can potentially block them from accessing your websites entirely.
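
For illustration, here is a minimal sketch of what blocking by name could look like on the server side. This is not necessarily the method from the post: it assumes the server can inspect the `User-Agent` header, and the tokens in `BLOCKED_AGENT_TOKENS` are only examples, not the list referred to above.

```python
from typing import Optional

# Example tokens only; swap in whatever list you actually maintain.
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot", "Bytespider")


def is_blocked_agent(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a blocklisted token."""
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS)


def status_for_request(user_agent: Optional[str]) -> int:
    """Pick an HTTP status: 403 for blocklisted agents, 200 otherwise."""
    return 403 if is_blocked_agent(user_agent) else 200
```

This only catches crawlers that announce themselves honestly in the header; anything that spoofs a browser User-Agent passes straight through.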

  • drkt
    7 days ago

    Again, it must be stressed that this gives a false sense of security. You can only block crawlers that identify themselves, or pull an IP block list of known offenders, which means they have already offended in order to be identified, and they can simply change their IP address.

    You can’t block them, but you can make their life harder. Return 200 OK instead of 404 Not Found, so malicious bots that drive-by probe random URLs like /admin will think they found something. Make honeypots that redirect and loop, filled with bait wordlists and forms that go nowhere. Poison the well. Deliberately serve incorrect, broken or AI-generated data to known bots.
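
A minimal sketch of the redirect-and-loop honeypot idea, under stated assumptions: `TRAP_PREFIX` is a hypothetical path prefix that no real page uses, and `respond` stands in for whatever routing hook your server offers. Every request under the prefix is answered with a redirect to a fresh random path under the same prefix, so a client that follows redirects never reaches real content.

```python
import secrets
from typing import Dict, Tuple

TRAP_PREFIX = "/trap/"  # hypothetical prefix reserved for the honeypot


def next_trap_location() -> str:
    """Return a fresh random URL under TRAP_PREFIX."""
    return TRAP_PREFIX + secrets.token_hex(8)


def respond(path: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers); trap paths get a 302 to another trap path."""
    if path.startswith(TRAP_PREFIX):
        return 302, {"Location": next_trap_location()}
    return 200, {}
```

Many HTTP clients cap the number of redirects they follow, so this wastes a bounded amount of a bot's time per visit rather than trapping it forever.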

    Waste their time, instead of wasting your own time.

    • Net_Runner :~$@lemmy.zipOP
      7 days ago

      Yeah, that’s true. If they’re not using names, then there’s not a whole lot you can do. And blocking IPs is impossible, because they use different IPs constantly.

      But!

      With my post and your suggestions, this is the “something” that’s better than doing absolutely nothing.

      • drkt
        7 days ago

        I suppose it comes down to being offensive or defensive. I don’t think being defensive is worth my time. I’m not paying for bandwidth, and compute time is so cheap it’s irrelevant, so I’m on the offensive. You can do both if you want. There are definitely more ready-to-go defensive solutions than offensive ones (your own article, for example), but I think tinkering and adapting my own solution is fun. It’s like a game of cat and mouse, but they have money to lose and I don’t.

      • drkt
        7 days ago

        Eventually I’m gonna make a proper article about it, but what I’m doing right now boils down to this:

        • Intercept 404
        • Redirect to error-hole.php
        • error-hole.php returns 200 and spits out a bunch of bot-targets
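
The steps above can be sketched roughly as follows. This is an illustration in Python, not the actual error-hole.php, and it skips the redirect by serving the bait directly from the not-found branch. `KNOWN_PATHS`, `BAIT_WORDS`, `bait_page` and `ErrorHoleHandler` are all hypothetical names.

```python
import secrets
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOWN_PATHS = {"/": b"<h1>Home</h1>"}  # hypothetical real pages
BAIT_WORDS = ["admin", "login", "backup", "config"]  # hypothetical wordlist


def bait_page(count: int = 20) -> bytes:
    """Build an HTML page of links to random paths that do not exist."""
    links = []
    for _ in range(count):
        word = secrets.choice(BAIT_WORDS)
        href = f"/{word}-{secrets.token_hex(4)}"
        links.append(f'<a href="{href}">{word}</a>')
    return ("<html><body>" + "\n".join(links) + "</body></html>").encode()


class ErrorHoleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Unknown paths get bait instead of a 404; status is 200 either way.
        body = KNOWN_PATHS.get(self.path)
        if body is None:
            body = bait_page()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ErrorHoleHandler).serve_forever()
```

Because none of the bait links exist either, a bot that follows them lands on another bait page each time.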

        The next iteration of this will include a lot of uncompressed filler data, so hopefully the bots have to download half a gigabyte of data every time they do this. I’m not paying for bandwidth, so it doesn’t matter to me.
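
One way the filler could be produced, as a sketch only: a generator that yields random bytes in chunks, so the server never holds the whole payload in memory. `filler_chunks` is a hypothetical helper, and using `os.urandom` is an assumption; random bytes are picked here because they do not compress, so the client has to transfer close to the full size even if compression is enabled somewhere along the way.

```python
import os
from typing import Iterator


def filler_chunks(total_bytes: int, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield random chunks whose lengths add up to exactly total_bytes."""
    remaining = total_bytes
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield os.urandom(size)
        remaining -= size
```

For the half-gigabyte figure you would call `filler_chunks(512 * 1024 * 1024)` and write each chunk to the response as it is produced.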

        See for yourself: https://drkt.eu/fdhasklfh

        I can see that it works by just looking at my access logs.