Fighting bad bots - Anubis activated

As you may have noticed, we have had a bunch of outages on cgit (cgit.haiku-os.org) and trac (dev.haiku-os.org) over the last year or two.

In the background, we have been fighting a huge influx of bots scraping pages (likely AI-based, judging by where they are coming from). Services like Trac and cgit are easily overwhelmed, so they have historically been the first to fail. I’ve been doing lazy IP range blocks for a while, but the badly behaved AI bots (which ignore robots.txt) have hit critical mass.

Thanks to the awesome idea of @nephele, we have put Anubis in place. When you hit cgit or Trac, Anubis will weigh your browser’s soul (lol, it does sha256 calculations) and allow you passage if you meet the computational requirements.

I can attest that it works fine on WebPositive and every browser I’ve personally thrown at it.

tl;dr you’ll see a cute mascot judging you before redirecting you to the service. After judgement, you’ll get a cookie to remember the event and grant you access for a period of time (24 hours? I need to check) before rejudging you.
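For the curious, here’s a minimal sketch of that kind of sha256 proof-of-work in Python. The challenge string and exact rules below are made up for illustration; Anubis’s real challenge format differs in its details:

```python
import hashlib
import itertools

def solve_challenge(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so that sha256(challenge + nonce) starts with
    `difficulty` zero hex digits -- the same general style of proof-of-work
    a challenge page asks the browser to perform. Sketch only."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

# Difficulty 2 keeps this demo fast (~256 hashes expected).
nonce = solve_challenge("example-challenge", 2)
digest = hashlib.sha256(f"example-challenge{nonce}".encode()).hexdigest()
print(nonce, digest)  # digest starts with "00"
```

A real browser does the same loop in JavaScript, then submits the winning nonce to get its pass cookie.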

tl;dr tl;dr

16 Likes

I’ve dropped the difficulty to 4 to see if the smaller amount of work is still enough to block bots. (we had a few folks with weaker systems complain about how long the computations took)
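Assuming the difficulty works the way these proof-of-work schemes usually do (N leading zero hex digits in the sha256 hash), each extra level multiplies the expected work by 16, so dropping it noticeably helps slower machines:

```python
def expected_hashes(difficulty: int) -> int:
    # Each leading zero hex digit cuts the success probability by 16,
    # so the expected number of sha256 attempts is 16 ** difficulty.
    return 16 ** difficulty

print(expected_hashes(4))  # 65536
print(expected_hashes(5))  # 1048576
```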

1 Like

I knew Anubis. Anubis was a friend of mine.
Little girl, you’re no Anubis.


But seriously, thank you for not making us select all blocks with a traffic light.

8 Likes

lol, yeah. I absolutely hate those… and have a feeling there would be a revolt if I asked everyone to solve one before accessing the sites :hocho:

So far the only impact has been that the “tab” on haiku-os.org showing Trac tickets is broken (under investigation).

I’m happy to report Trac and cgit are much snappier for now.

3 Likes

Looks like this defender against bots is getting in the way of downloading Trac attachments in their original formats. I keep getting HTML files that point back to the bot defender (if that makes sense).

UPDATE: I just tried the same on Falkon and it did the right thing. Must be an issue with Web+.

So, I just tested this, and downloading the correct attachment worked fine on Linux under LibreWolf.

I think what’s key is you have to have the “client is safe / real” cookie in place to access the raw file.

I feel like you’re somehow hitting the raw download link without that cookie in place (either downloading the file without first visiting https://dev.haiku-os.org, or WebPositive isn’t persisting the cookie between tabs).
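To make that failure mode concrete, here’s a toy model of the gateway’s decision. The cookie name and response bodies are hypothetical, not Anubis’s actual ones:

```python
def handle_raw_download(cookies: dict) -> str:
    """Toy model: serve the raw attachment only when the pass cookie from a
    previous successful challenge is present; otherwise return the challenge
    page. This is why a cookie-less download gets HTML instead of the file."""
    if "anubis-pass" in cookies:  # hypothetical cookie name
        return "raw attachment bytes"
    return "<html>challenge page</html>"

print(handle_raw_download({}))                     # the challenge HTML
print(handle_raw_download({"anubis-pass": "ok"}))  # the real file
```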

EDIT: Yeah, given your update, it feels like WebPositive isn’t passing that cookie to the server when you download your attachment. Paging @pulkomandy :slight_smile:

I’m just clicking the download link in Trac. Nothing special beyond that.

The curl backend for webkitlegacy is unmaintained and was removed from upstream a while ago already. I do my best to keep it running, but I won’t be putting time into it, since it is all throwaway work once we switch to webkit2.

At the moment my work on webkit is focused on merging upstream changes and getting our work upstreamed. Do not expect me to find the time and motivation to fix this in a reasonable timeframe…

3 Likes

I’ve spotted a small issue when posting a Trac link in the quote block:

For me the block above looks like:

The link is displayed correctly if posted directly:
https://dev.haiku-os.org/ticket/19527

We’ve not deployed Anubis to Discourse, so this is probably something else?

Or is this the forum software trying to be “smart” and doing previews? We should probably just disable that if possible; this link masquerading is often abused…

I assume so.

Sounds like a great solution to keep away AI bots.

The name is very fitting, it actually made me laugh this morning when I read it. One of my cats is named Anubis. He is very cute when he’s calm, but can get quite aggressive when something or somebody annoys him. :slight_smile:

7 Likes

Aye, it’s really cute!.. :star_struck:

Thanks for this funny solution against bots …

:cowboy_hat_face:

Can’t use the Opera Mini browser to access either cgit or Trac after the Anubis activation. Damn “AI” scrapers ruining things :frowning: .

What platform are you using?

Does the browser allow you to enable JS?

We might be able to short-circuit anubis for currently logged in users, but then we don’t have users for cgit.

Have been using Opera Mini on Android for ages (was using it on Symbian even earlier :smiley:).

That browser’s main appeal is that it has a “data saving” mode (with the “extreme” setting being really good at that), and it accomplishes that via server-side processing of webpages/images. So I guess there’s not much that can be done in this case (short of whitelisting its user agent, which would in the end get abused by bots anyway).

I’ll just use a “regular” browser on a PC from now on, as I understand the need for this thing. Still hating the bots that made it necessary.

I wonder how much data it can even save for cgit and Trac? Maybe you can add an exception on the client. But yeah, this sounds difficult to support.

Edit: I’d like to mention, however, that whitelisting UAs is totally something we can do. Bots aren’t targeting us specifically, but rather OSS as a whole. If we add some exceptions for uncommon UAs, they won’t easily catch on to this. (And if they do, we can just remove the whitelist again.)

1 Like

Instead of whitelisting the user agent, which can be faked by bots, you could also whitelist Opera’s transcoding servers.
They all belong to IP address ranges owned by Opera, and that’s something the bots can’t fake.
Their ASNs are:

  • AS21837
  • AS39832
  • AS136189
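A sketch of what such an IP-range whitelist check could look like. The CIDR blocks below are documentation placeholders, not Opera’s real prefixes; the actual ranges for those ASNs would have to be resolved from a routing database such as whois or RIPEstat:

```python
import ipaddress

# Placeholder prefixes -- substitute the real ranges announced by
# AS21837, AS39832, and AS136189.
OPERA_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),  # hypothetical
    ipaddress.ip_network("203.0.113.0/24"),   # hypothetical
]

def is_whitelisted(client_ip: str) -> bool:
    """Return True if the client IP falls inside any whitelisted prefix,
    letting the gateway skip the challenge for known transcoding servers."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in OPERA_PREFIXES)

print(is_whitelisted("198.51.100.7"))  # True
print(is_whitelisted("192.0.2.1"))     # False
```

Since the check is on source IP rather than the UA header, a scraper can’t spoof its way in without actually routing through those networks.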

3 Likes