Partial Outage | Haiku Project

Haiku · November 8, 2018, 1:49am

This evening a standard operating system upgrade has once again turned fatal.

Our infrastructure still depends on a single bare metal server at Hetzner, which continues to be our downfall. This evening a (tested) OS upgrade failed, resulting in maui going MIA. I requested KVM access to attempt a repair after maui had been missing for ~15 minutes; however, we were stuck waiting almost 2 hours for the KVM from Hetzner.

This is a companion discussion topic for the original entry at https://www.haiku-os.org/blog/kallisti5/2018-11-06_partial_outage/

waddlesplash · November 8, 2018, 2:22am

And … we’re back! Still on the “backup” server, but things should be mostly functional now. I think the total forum outage time was something like 50 hours, though.

victordomingos · November 8, 2018, 2:14pm

For me, https://i18n.haiku-os.org/pootle/ is still inaccessible right now. Also, the repository I have set up on this machine seems to be non-functional:

Fetching repository checksum from https://eu.hpkg.haiku-os.org/haiku/master/x86_64/current ...
Validating checksum for Haiku ...
Fetching repository-cache from https://eu.hpkg.haiku-os.org/haiku/master/x86_64/current ...
Validating checksum for Haiku ...
Checksum error:
*** expected '<?xml version="1.0" encoding="UTF-8"?>
*** <Error><Code>AccessDenied'
*** got      'd43ed19f062c557620fafd1df8cb2778241c3e86b3e0587dc60d3dfe7f4f3280'*** failed! : Bad data

Not sure if that’s expected and known.
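
The log above suggests the client fetched an XML error page (AccessDenied) where it expected a checksum, then compared that page against a 64-character hex digest. A minimal sketch of a guard against this, assuming the digest is SHA-256 (the length fits, but the thread does not confirm the algorithm); `looks_like_sha256` and `verify` are hypothetical helpers for illustration, not part of pkgman:

```python
import hashlib
import re

# Assumption: the repository checksum is a 64-character hex SHA-256 digest.
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def looks_like_sha256(text: str) -> bool:
    """Return True if text is exactly one 64-digit hex string."""
    return SHA256_HEX.fullmatch(text.strip().lower()) is not None


def verify(expected_text: str, payload: bytes) -> str:
    """Compare payload against the fetched checksum, rejecting non-digests first."""
    if not looks_like_sha256(expected_text):
        # The "checksum" is something else, e.g. an XML error body from the server.
        return "checksum file is not a digest"
    actual = hashlib.sha256(payload).hexdigest()
    return "ok" if actual == expected_text.strip().lower() else "mismatch"
```

With a check like this, a server-side error would be reported as an unusable checksum file instead of as "Bad data" in the repository cache.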

kallisti5 · November 8, 2018, 2:31pm

Known issue. The haiku repository is currently offline. Haikuports is functional though. No data has been lost.

I’m currently waiting on the Haiku, Inc. board of directors to approve the new infrastructure… as soon as that happens things should start improving.

We’re making some changes which should result in a MUCH more stable infrastructure. No more depending on a single dedicated server that’s difficult to access + troubleshoot when things go wrong.

waddlesplash · November 8, 2018, 8:44pm

Fixed it, thanks for noticing.

humdinger · November 10, 2018, 9:40am

Trac seems to be down, too. Results in “Bad Gateway”.
Strangely, https://downforeveryoneorjustme.com/dev.haiku-os.org says it’s OK…

waddlesplash · November 10, 2018, 6:44pm

That site only tests whether the webserver is alive and responding to requests, which it is; and the “Bad Gateway” appears to return code 200. So yes, that makes sense.

Yes, it’s been acting very strangely; @nielx was going to investigate but I don’t know how far he’s gotten.
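
To illustrate the status-code point above: a checker that only looks at the response code will report "up" for an error page served with code 200. A rough sketch of a probe that also inspects the body, using only the Python standard library; the marker strings and the `classify` and `probe` names are assumptions for illustration, not how downforeveryoneorjustme.com works:

```python
import urllib.error
import urllib.request

# Assumption: error pages can be recognised by these phrases in the body.
ERROR_MARKERS = ("bad gateway", "service unavailable", "gateway timeout")


def classify(status: int, body: str) -> str:
    """Classify a response using both the status code and the body text."""
    if status >= 400:
        return "down"
    lowered = body.lower()
    if any(marker in lowered for marker in ERROR_MARKERS):
        # The web server answered, but it is relaying a backend failure.
        return "degraded"
    return "up"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch url and classify it; network failures count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(65536).decode("utf-8", errors="replace")
            return classify(resp.status, body)
    except urllib.error.HTTPError as err:
        return classify(err.code, "")
    except OSError:
        # Covers URLError, connection resets, and timeouts.
        return "down"
```

A plain substring match is crude: a healthy page that merely mentions "Bad Gateway" (this thread, for example) would be flagged as degraded, so a real check would want a more specific signal.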

Vanne · November 12, 2018, 12:31pm

Yep, and it’s still down. No updates are possible (it has been like this for over a week now).

waddlesplash · November 12, 2018, 3:29pm

Please be patient. We’ve acquired the new permanent server and are working on migrating to it. Should only be a few more days…

kallisti5 · November 13, 2018, 7:57pm

And we’re back up on the new hosting provider. Speeds are greatly improved in all of my testing from the US.

lsitongia · November 14, 2018, 2:59am

Hurray! That must have been a lot of sweat and hard work. Thank you!

bullfrog · November 14, 2018, 6:19am

Sounds like we’ve been slashdotted.

lsitongia · November 14, 2018, 6:39pm

Hi,
Is https://www.haiku-os.org/ related to the outage? It is back up now, but I still cannot receive an account verification message from it.

dragon · November 15, 2018, 8:39am

Is the update server still affected? It gets to 68% while downloading a package, then the connection dies.

bulent · November 15, 2018, 10:13am

It looks like both the Haiku and the Haikuports repositories are offline (November 15th, 11:12 GMT).
Am I correct?

humdinger · November 15, 2018, 12:53pm

I just updated from beta1 to nightly. Servers are up, everything went smoothly.

bulent · November 15, 2018, 1:01pm

Thank you. I confirm that all runs smoothly now. Perhaps it was a temporary glitch.

kallisti5 · November 16, 2018, 2:39pm

Please keep me posted if anyone sees any problems at this point. I’m pretty confident we are back at 100%.
I’ve implemented some basic (really generous) HTTP/HTTPS rate limiting because script kiddies are scanning our web services.
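
For readers wondering what "generous" rate limiting can look like, here is a minimal token-bucket sketch. It is an illustration only and not the configuration used on the Haiku servers (the post does not say how the limit is implemented); the `TokenBucket` class and its parameters are hypothetical:

```python
import time


class TokenBucket:
    """Allow bursts of up to `burst` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate = rate
        self.burst = burst
        self._now = now          # injectable clock, handy for testing
        self._tokens = burst     # start with a full bucket
        self._last = now()

    def allow(self) -> bool:
        """Return True if the request may proceed, consuming one token."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill for the time that has passed, capped at the burst size.
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A generous limit would pair a large `burst` with a moderate `rate`, with one bucket per client address, so normal browsing never hits it while a scanner firing hundreds of requests per second does.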

We are having some issues with the discuss.haiku-os.org container not reliably starting… I plan on digging into this as time permits.

For networking issues, I’ve seen one git push be successful, but hang up on the client… this could be a Gerrit issue though and not our actual fault.

I’m sorry for all of the downtime this year; we’re going through a lot of growing pains. We’re trying to balance having reliable infrastructure we can do maintenance on against not spending a ton of cash. The solution historically was to “not touch it, it’s working”, but that really isn’t a healthy long-term strategy. While the new infrastructure isn’t perfect, it offers us a lot more flexibility in the event of a server getting trashed… we also have instant KVM/iLO access to the server, which I feel will help us out a lot.

As we grow, we also have the option now to purchase a second server to do more of an active/passive configuration with the new iSCSI storage attachment… but we opted to not take on the extra cost (yet). At the moment we’re paying ~89 EUR/mo at online.net. This is a pretty big jump from the 39 EUR/mo we paid Hetzner, but it gives us more growth options with fewer technical hurdles. As a bonus, network performance to areas outside of the EU has also greatly improved.

On the bright side, through all of these major issues, we didn’t lose any data. Backups are important, kids. Stay in school.

kallisti5 · November 16, 2018, 1:41pm

… and on the heels of this comment we had a massive slowdown across the board a few minutes ago.

Luckily, it was just a silly mistake on my part. I was doing disk performance tests pre-go-live and left vm.dirty_bytes at 32 MiB (versus the standard ratio of 30%, or 9.8 GiB).

I was running a backup and killed our I/O. Things are back to normal, and this shouldn’t happen again.
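
For a sense of scale, the numbers in this post can be worked through as follows. On Linux, vm.dirty_bytes is an absolute cap on dirty page cache before writers are forced to flush, while vm.dirty_ratio expresses the same cap as a percentage of available memory; setting one disables the other. The sketch below uses only the figures quoted above (32 MiB, 30%, 9.8 GiB); the implied memory size is derived from them and is not stated in the thread:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def ratio_threshold_bytes(dirtyable_bytes: int, ratio_percent: int) -> int:
    """Dirty-page threshold when the limit is given as a percentage."""
    return dirtyable_bytes * ratio_percent // 100


# Figures quoted in the post.
dirty_bytes_limit = 32 * MIB   # leftover test setting
ratio_limit = 9.8 * GIB        # what the 30% ratio worked out to

# Derived values (assumption: 9.8 GiB is exactly 30% of the dirtyable memory).
implied_memory_gib = ratio_limit / 0.30 / GIB
shrink_factor = ratio_limit / dirty_bytes_limit

print(f"implied dirtyable memory: {implied_memory_gib:.1f} GiB")
print(f"write-back threshold was about {shrink_factor:.0f}x smaller than usual")
```

So the leftover setting cut the write-back threshold by a factor of roughly 300, which would explain why a large sequential writer such as a backup stalled everything else competing for disk I/O.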

korli · November 17, 2018, 8:35am

Are the haikuports builders up? At least the build masters don’t see them as available.
“all builders lost”