Cgit now behind SSO login - AI scraper bot fight report

Hello!

I just wanted to toss out a notification that cgit has been put back into place. You’ll need to log in with your sso.haiku-os.org account (the same one Gerrit and a few other services use) to use it.

tl;dr: The SSO login is the compromise in the fight against the AI bots. You don’t need special permissions, just an account on sso.haiku-os.org. (The eventual plan is to migrate everyone’s identities over to sso.haiku-os.org.)

Here’s a quick history of why:

  1. cgit was getting hammered by AI scrapers, bringing it (and Gerrit + git) down.
    1. After some testing, cgit can handle maybe 20 requests a second before it starts to get overwhelmed. The reason is that cgit processes the git repository in real time. There is a local cgit cache, but with so much history most random AI requests are cache misses. We have 50k+ tags and almost 100,000 commits.
    2. The AI bot requests come from random IP addresses on random ASNs, with random user agents. Blocking IP ranges is no longer viable.
    3. cgit isn’t seeing many commits / improvements as a project. Maybe 3-5 per year. (though, this seems to be improving a bit lately)
  2. We tossed the awesome Anubis in front of it to slow the bots down.
  3. Anubis worked for a while. However, when faced with tens of thousands of requests from AI scrapers coming from random IP addresses all over the internet, it’s easy for 20-30 requests per second to slip through, which is enough to cripple cgit.
  4. Getting tired of fighting cgit, I forked rgit (a Rust-based cgit clone with a RocksDB cache of the git repo) and called it gitore.
    1. Made a bunch of improvements to gitore to make it more cgit-compatible (and squashed bugs that showed up with our huge repo).
    2. It can 100% handle the AI scraper load without Anubis in front of it (though the AI bots ate up tons of bandwidth without Anubis blocking them :frowning: ).
  5. A growing number of people said they weren’t happy with gitore.
    1. 100% valid, gitore had a lot of issues that kept it from being 1:1 feature complete with cgit.
    2. Nobody really stepped up to help out in fixing the bugs / adding features.
  6. With limited time to work on gitore, I ended up putting cgit back behind our SSO portal.
    1. Of 10,000 requests, ~100 are legit ones from logged-in users.
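
To illustrate points 1.2 and 3 above, here is a minimal Python sketch of why per-IP limits stop helping once requests arrive from many different addresses. Only the ~20 requests/second figure comes from the testing described above; the 500-request window, the addresses, and the per-IP limit of 2 are made-up numbers for illustration, and this is not how Anubis or our setup actually works.

```python
from collections import Counter

CGIT_CAPACITY_RPS = 20  # from the post: roughly what cgit handles before struggling


def admitted_per_ip(requests, per_ip_limit):
    """Count requests admitted in a one-second window when each IP has its own budget."""
    seen = Counter()
    admitted = 0
    for ip in requests:
        if seen[ip] < per_ip_limit:
            seen[ip] += 1
            admitted += 1
    return admitted


# Hypothetical one-second window: 500 requests, each from a different address.
scattered = [f"10.0.{i // 256}.{i % 256}" for i in range(500)]
# The same volume from a single address, for comparison.
single = ["192.0.2.1"] * 500

print(admitted_per_ip(scattered, per_ip_limit=2))  # every request passes
print(admitted_per_ip(single, per_ip_limit=2))     # only the first two pass
```

A per-IP budget stops the single noisy client, but the scattered traffic sails straight past the 20 requests/second that cgit can take. Requiring a login shifts the check from "which address" to "which account".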

We can now see who is accessing cgit. If an AI bot (or an abusive user using this post to mess around and break stuff :wink: ) gets hold of a user account on sso.haiku-os.org, it will be pretty easy to track down who it is and nuke them.


Thank you for your very hard work to keep everything up and running while the bots are hitting that hard.
I’m personally not a fan of login walls, but it’s better than continuing to waste the project’s money on bandwidth consumed by scraper bots, and it’s implemented rather nicely without relying on questionable third parties :+1:

As I’ve been running my OpenGrok instance on https://grok.nikisoft.one rather reliably in recent months despite continued pressure from random AI scraper bots, I may try setting up an unofficial cgit instance soon and see how it goes here.
My servers have plenty of unused resources and unmetered bandwidth at 1 Gbit/s each (2 servers), so it can’t hurt to play around a bit.

To note, I am still working on gitore improvements. I still feel like it is a viable path forward since it solves cgit’s performance limitations. However, the following HAS to be addressed:

  • Git tag scanning needs to be fixed (working through this now)
  • recovery from shutdown needs to be fixed (right now, the lengthy repo cache process happens on every startup)
  • clickable line numbers
  • parsing bug numbers (aka #1234) to trac links
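
As a rough idea of what the last item involves, here is a small Python sketch that turns references like #1234 in a commit message into ticket links. It only illustrates the logic (gitore itself is written in Rust), and the ticket URL pattern and the function name are assumptions for the example, not gitore's actual code.

```python
import html
import re

# Assumed Trac ticket URL pattern; adjust to the real tracker location.
TRAC_TICKET_URL = "https://dev.haiku-os.org/ticket/{num}"

# "#" followed by digits, not preceded by a word character or "&",
# so things like "abc#12" and HTML entities are left alone.
BUG_REF = re.compile(r"(?<![\w&])#(\d+)\b")


def link_bug_refs(message: str) -> str:
    """Escape a commit message for HTML and link "#1234"-style references."""
    escaped = html.escape(message)

    def to_link(match: re.Match) -> str:
        num = match.group(1)
        return '<a href="{}">#{}</a>'.format(TRAC_TICKET_URL.format(num=num), num)

    return BUG_REF.sub(to_link, escaped)


print(link_bug_refs("Fix crash on boot, see #1234."))
```

Escaping before substituting keeps the rest of the commit message from being interpreted as HTML, and the lookbehind avoids linking text that merely contains a "#" in the middle of a word.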

Once the above is complete, I’ll probably put gitore back up on a separate subdomain to see if I can win people over :laughing:


Thank you very much for your efforts, Alexander!
I registered on SSO today - that solved my read access :wink:


My cgit instance is now up and running :tada:
https://cgit.nikisoft.one
Spent many hours first updating the server and then setting up cgit.
It’s running on the same server as OpenGrok and shows the same repositories. The countermeasures against the bots are also the same and should be almost invisible to most humans.
I can’t promise that it will keep running when the bots start hammering the server, but I’ll try to keep it online as best I can.
It’s free to use and works without an account, maybe someone finds it useful :slight_smile:
