This evening a standard operating system upgrade has once again turned fatal.
Our infrastructure still depends on a single bare-metal server at Hetzner, which
continues to be our downfall. A (tested) OS upgrade failed, resulting in maui
going MIA. I requested KVM access to attempt repair of maui after it had been
missing for ~15 minutes; however, we were stuck waiting almost two hours for
Hetzner to provide the KVM.
And … we’re back! Still on the “backup” server, but things should be mostly functional now. I think the total forum outage time was something like 50 hours, though.
For me, https://i18n.haiku-os.org/pootle/ is still inaccessible right now. Also, the repository I have set up on this machine seems to be non-operational:
Fetching repository checksum from https://eu.hpkg.haiku-os.org/haiku/master/x86_64/current ...
Validating checksum for Haiku ...
Fetching repository-cache from https://eu.hpkg.haiku-os.org/haiku/master/x86_64/current ...
Validating checksum for Haiku ...
Checksum error:
*** expected '<?xml version="1.0" encoding="UTF-8"?>
*** <Error><Code>AccessDenied'
*** got 'd43ed19f062c557620fafd1df8cb2778241c3e86b3e0587dc60d3dfe7f4f3280'
*** failed! : Bad data
Known issue. The Haiku repository is currently offline. HaikuPorts is functional, though. No data has been lost.
I’m currently waiting on the Haiku, Inc. board of directors to approve the new infrastructure… as soon as that happens things should start improving.
We’re making some changes which should result in a MUCH more stable infrastructure. No more depending on a single dedicated server that’s difficult to access + troubleshoot when things go wrong.
That site only tests whether the webserver is alive and responding to requests, which it is; and the “Bad Gateway” error page apparently still comes back with a 200 status code. So yes, that makes sense.
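For anyone who wants a stricter probe than that status page, here’s a minimal sketch in Python of a content-aware check, assuming plain unauthenticated HTTP. The URL and the expected-content marker are placeholders, not our actual monitoring setup:

```python
# Sketch of a content-aware health check: a bare "did the server answer?"
# probe passes even when the proxy serves an error page with a 200 status.
import urllib.request

URL = "https://discuss.haiku-os.org/"   # placeholder target
MARKER = "Haiku"                        # string expected only on the real page (assumption)

def healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # Require a 2xx status AND the expected content, so a
            # "Bad Gateway" page delivered with code 200 still fails.
            return 200 <= resp.status < 300 and MARKER in body and "Bad Gateway" not in body
    except OSError:   # connection errors and non-2xx responses both land here
        return False

print("healthy" if healthy(URL) else "unhealthy")
```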
Yes, it’s been acting very strangely; @nielx was going to investigate but I don’t know how far he’s gotten.
Please keep me posted if anyone sees any problems at this point. I’m pretty confident we are back at 100%.
I’ve implemented some basic (really generous) HTTP/HTTPS rate limiting due to script kiddies scanning our web services.
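Roughly, the idea is a per-client allowance that refills over time. Here’s a toy token-bucket sketch in Python just to illustrate the concept; the numbers are made up, and the real limits live in the front-end web server configuration rather than in application code:

```python
# Toy token-bucket rate limiter, keyed by client address: each client gets a
# burst allowance that refills at a fixed sustained rate. Scanners hammering
# endpoints burn through the bucket and get throttled; normal browsing never
# comes close to the limit.
import time

RATE = 10.0    # sustained requests/second per client (illustrative)
BURST = 50.0   # burst headroom per client (illustrative)

_buckets = {}  # client -> (tokens remaining, time of last request)

def allow(client, now=None):
    """Return True if the request should be served, False if throttled."""
    now = time.monotonic() if now is None else now
    tokens, last = _buckets.get(client, (BURST, now))
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last seen
    if tokens >= 1.0:
        _buckets[client] = (tokens - 1.0, now)
        return True
    _buckets[client] = (tokens, now)
    return False

# A scanner firing 100 requests/second exhausts its burst allowance quickly:
for i in range(80):
    if not allow("192.0.2.10", now=i * 0.01):
        print(f"request {i} throttled")
        break
```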
We are having some issues with the discuss.haiku-os.org container not reliably starting… I plan on digging into this as time permits.
As for networking issues, I’ve seen one git push succeed but then hang on the client side… this could be a Gerrit issue, though, and not actually our fault.
I’m sorry for all of the downtime this year, we’re going through a lot of growing pains. We’re trying to balance reliable infrastructure we can do maintenance on while not spending a ton of cash. The solution historically was to “not touch it, it’s working”, but that really isn’t a healthy long-term strategy. While the new infrastructure isn’t perfect, it offers us a lot more flexibility in the event of a server getting trashed… we also have instant KVM/ILO access to the server which I feel will help us out a lot.
As we grow, we also have the option now to purchase a second server to do more of an active/passive configuration with the new iSCSI storage attachment… but we opted to not take on the extra cost (yet). At the moment on online.net we’re paying ~89eur/mo. This is a pretty big jump from the 39eur/mo we paid Hetzner, but it gives us more growth options with fewer technical hurdles. Network performance has also greatly improved to areas outside of the EU as a bonus.
On the bright side, through all of these major issues, we didn’t lose any data. Backups are important, kids; stay in school.
… and on the heels of this comment we had a massive slowdown across the board a few minutes ago.
Luckily, it was just a silly mistake on my part. I was doing disk performance tests pre-go-live and left vm.dirty_bytes at 32 MiB (vs. the standard ratio of 30%, or about 9.8 GiB on this machine).
Running a backup with that setting killed our I/O. Things are back to normal, and this shouldn’t happen again.
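For the curious: vm.dirty_bytes and vm.dirty_ratio are mutually exclusive, and dirty_bytes is the point at which processes generating writes are forced into writeback themselves, so a 32 MiB limit throttles any heavy writer (like a backup) almost immediately. A small Python sketch of checking and restoring the knob through the standard Linux procfs interface (needs root; the 30 below matches the figure quoted above, not any particular distro default):

```python
# Sketch: inspect and reset the writeback knobs through /proc (Linux, root).
# vm.dirty_bytes and vm.dirty_ratio are mutually exclusive -- whichever was
# written last is in effect, and the other reads back as 0.

def read_vm(name):
    with open(f"/proc/sys/vm/{name}") as f:
        return int(f.read())

def write_vm(name, value):
    with open(f"/proc/sys/vm/{name}", "w") as f:
        f.write(str(value))

print("dirty_bytes:", read_vm("dirty_bytes"))  # 33554432 (32 MiB) while the test override is active
print("dirty_ratio:", read_vm("dirty_ratio"))  # reads 0 while dirty_bytes is in effect

# Writing dirty_ratio switches back to ratio-based writeback; 30 matches the
# figure mentioned above (the stock default varies between kernels/distros).
write_vm("dirty_ratio", 30)
```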