Partial Outage | Haiku Project


#1

This evening a standard operating system upgrade has once again turned fatal.

Our infrastructure still depends on a single bare metal server at Hetzner, which continues to be our downfall. This evening a (tested) OS upgrade failed, resulting in maui going MIA. I requested KVM access to attempt a repair of maui after it had been missing for ~15 minutes; however, we were stuck waiting almost 2 hours for the KVM from Hetzner.


This is a companion discussion topic for the original entry at https://www.haiku-os.org/blog/kallisti5/2018-11-06_partial_outage/

#2

And … we’re back! Still on the “backup” server, but, things should be mostly functional now. I think the total forum outage time was something like 50 hours, though.


#3

For me, https://i18n.haiku-os.org/pootle/ is still inaccessible right now. Also, the repository I have set up on this machine does not seem to be working:

Fetching repository checksum from https://eu.hpkg.haiku-os.org/haiku/master/x86_64/current ...
Validating checksum for Haiku ...
Fetching repository-cache from https://eu.hpkg.haiku-os.org/haiku/master/x86_64/current ...
Validating checksum for Haiku ...
Checksum error:
*** expected '<?xml version="1.0" encoding="UTF-8"?>
*** <Error><Code>AccessDenied'
*** got      'd43ed19f062c557620fafd1df8cb2778241c3e86b3e0587dc60d3dfe7f4f3280'*** failed! : Bad data

Not sure if that’s expected and known.
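
The log above shows the "expected" value being an XML `AccessDenied` error page rather than a digest, so the checksum file itself was never served. A minimal sketch of a client-side sanity check follows, assuming the checksum is a SHA-256 hex digest (the 64-character value in the log suggests this, but it is an assumption). The function names are hypothetical and this is not how the Haiku package tools are implemented.

```python
import hashlib
import re

# Assumption: the repository checksum is a SHA-256 digest in hex (64 characters).
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def looks_like_sha256(text: str) -> bool:
    """Return True if the text is exactly one 64-character hex digest."""
    return SHA256_HEX.fullmatch(text.strip().lower()) is not None


def verify_repo_cache(checksum_body: str, cache_bytes: bytes) -> str:
    """Classify the result of comparing a fetched checksum with fetched data."""
    if not looks_like_sha256(checksum_body):
        # The server most likely returned an error page instead of the checksum.
        return "checksum file is not a digest (server error page?)"
    actual = hashlib.sha256(cache_bytes).hexdigest()
    if actual != checksum_body.strip().lower():
        return "checksum mismatch"
    return "ok"


if __name__ == "__main__":
    error_page = '<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>AccessDenied'
    print(verify_repo_cache(error_page, b"repository cache bytes"))
```

Checking the shape of the checksum first separates "the server is refusing requests" from "the download is corrupt", which a plain "Bad data" message does not do.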


#4

Known issue. The haiku repository is currently offline. Haikuports is functional though. No data has been lost.

I’m currently waiting on the Haiku, Inc. board of directors to approve the new infrastructure… as soon as that happens things should start improving.

We’re making some changes which should result in a MUCH more stable infrastructure. No more depending on a single dedicated server that’s difficult to access + troubleshoot when things go wrong.


#5

Fixed it, thanks for noticing.


#6

Trac seems to be down, too. Results in “Bad Gateway”.
Strangely, https://downforeveryoneorjustme.com/dev.haiku-os.org says it’s OK…


#7

That site only tests whether the webserver is alive and responding to requests, which it is; and the “Bad Gateway” page appears to be returned with status code 200. So yes, that makes sense.
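
To illustrate why a status-only check can report "up" here, this is a small sketch of a probe that also looks at the response body. The marker strings and the function names `classify_response` and `probe` are assumptions made for the example; this is not how downforeveryoneorjustme.com works.

```python
import urllib.error
import urllib.request

# Hypothetical phrases that suggest a proxy error page served with a 2xx status.
ERROR_MARKERS = ("bad gateway", "service unavailable", "gateway timeout")


def classify_response(status: int, body: str) -> str:
    """Classify a response using both the status code and the page body."""
    if status >= 500:
        return "down"
    if status >= 400:
        return "error"
    lowered = body.lower()
    if any(marker in lowered for marker in ERROR_MARKERS):
        # The webserver answered, but the page looks like an upstream failure.
        return "degraded"
    return "up"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and classify the result; needs network access."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(65536).decode("utf-8", errors="replace")
            return classify_response(resp.status, body)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return classify_response(exc.code, "")
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return "unreachable"
```

A body check like this is crude: a page that legitimately mentions "Bad Gateway" (this thread, for example) would be flagged too, so a real monitor would more likely request a known page and look for content it expects to find.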

Yes, it’s been acting very strangely; @nielx was going to investigate but I don’t know how far he’s gotten.


#8

Yep, and it’s still down. No updates are possible (it has been like this for over a week now).


#9

Please be patient. We’ve acquired the new permanent server and are working on migrating to it. Should only be a few more days…


#10

And we’re back up on the new hosting provider. Speeds are greatly improved in all of my testing from the US.


#11

Hurray! That must have been a lot of sweat and hard work. Thank you!


#12

Sounds like we’ve been slashdotted.


#13

Hi,
Is https://www.haiku-os.org/ related to the outage? It is back up now, but I still cannot receive an account verification message from it.


#14

Is the update server still affected? It gets to 68% while downloading a package, then the connection dies.


#15

It looks like both the Haiku and the Haikuports repositories are offline (November 15th, 11:12 GMT).
Am I correct?


#16

I just updated from beta1 to nightly. Servers are up, everything went smoothly.


#17

Thank you. I confirm that all runs smoothly now. Perhaps it was a temporary glitch.


#18

Please keep me posted if anyone sees any problems at this point. I’m pretty confident we are back at 100%.
I’ve implemented some basic (really generous) http/https rate limiting due to script-kiddies scanning our web services.
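
For readers unfamiliar with the idea, a generous rate limit is commonly modelled as a token bucket: each client may burst up to a cap, then is held to a steady refill rate. The sketch below is a generic illustration only; the class, the function, and the numbers are assumptions and say nothing about how the limiting is actually configured on the Haiku servers.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allow bursts of up to `burst` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, burst: float, now: Optional[float] = None):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so normal users are never delayed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client address; the limits here are made-up example values.
buckets: Dict[str, TokenBucket] = {}


def allow_request(client_ip: str, rate: float = 20.0, burst: float = 100.0) -> bool:
    """Look up (or create) the client's bucket and try to spend one token."""
    bucket = buckets.setdefault(client_ip, TokenBucket(rate, burst))
    return bucket.allow()
```

With a large burst and a moderate refill rate, ordinary browsing never hits the limit, while a scanner issuing hundreds of requests per second drains its bucket quickly.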

We are having some issues with the discuss.haiku-os.org container not reliably starting… I plan on digging into this as time presents itself.

For networking issues, I’ve seen one git push be successful but hang up on the client… this could be a Gerrit issue, though, and not actually our fault.

I’m sorry for all of the downtime this year, we’re going through a lot of growing pains. We’re trying to balance reliable infrastructure we can do maintenance on while not spending a ton of cash. The solution historically was to “not touch it, it’s working”, but that really isn’t a healthy long-term strategy. While the new infrastructure isn’t perfect, it offers us a lot more flexibility in the event of a server getting trashed… we also have instant KVM/ILO access to the server which I feel will help us out a lot.

As we grow, we also have the option now to purchase a second server to do more of an active/passive configuration with the new iSCSI storage attachment… but we opted to not take on the extra cost (yet). At the moment on online.net we’re paying ~89eur/mo. This is a pretty big jump from the 39eur/mo we paid Hetzner, but it gives us more growth options with fewer technical hurdles. Network performance has also greatly improved to areas outside of the EU as a bonus.

On the bright side, through all of these major issues, we didn’t lose any data 🙂. Backups are important, kids. Stay in school.


#19

… and on the heels of this comment we had a massive slowdown across the board a few minutes ago.

Luckily, it was just a silly mistake on my part. I was doing disk performance tests pre-go-live and left vm.dirty_bytes at 32MiB (vs. the standard ratio of 30%, or 9.8GiB).
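
To put those two figures side by side, here is a small arithmetic sketch. It takes the 30% and 9.8GiB values from the post as given and treats the ratio as a simple percentage of "dirtyable" memory, which is a simplification of how the Linux kernel computes the limit. The helper name is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def dirty_limit_from_ratio(dirtyable_bytes: float, ratio_percent: float) -> float:
    """Dirty-page limit in bytes for a percentage-based setting (simplified)."""
    return dirtyable_bytes * ratio_percent / 100.0


# Figures quoted in the post.
ratio_limit = 9.8 * GIB   # what the 30% ratio worked out to
bytes_limit = 32 * MIB    # the leftover vm.dirty_bytes test value

# Memory basis implied by "30% is 9.8GiB".
implied_memory = ratio_limit / 0.30

print(f"implied dirtyable memory: {implied_memory / GIB:.1f} GiB")
print(f"ratio-based limit is about {ratio_limit / bytes_limit:.0f}x the 32 MiB setting")
```

With the byte limit in place, writers get throttled after only 32MiB of dirty pages instead of several GiB, which fits a large sequential job such as a backup dragging down I/O for everything else on the machine.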

I was running a backup and it killed our I/O. Things are back to normal and this shouldn’t happen again.


#21

Are the haikuports builders up? At least the build masters don’t see them available.
“all builders lost”