Semaphore performance

Hi everyone. I’m experimenting with an Actor Model framework, and have the runtime compiled for OSX, Win10, and Haiku_x64, with platform-specific semaphores (Grand Central Dispatch semaphores, Win32 semaphores, and Haiku Benaphores). Here are some benchmarks, all on the same laptop (MacBookPro11,3):

Win10 => 4.05s release, 68s debug => CreateSemaphore(NULL, initial, LONG_MAX, NULL);
OSX => 8.9s release, 10.5s debug => dispatch_semaphore_create(initial);
Haiku_x64 => 34s (-O3) => create_sem(initial, "name");

The benchmark spawns 8 thread-pool threads plus a timer thread and a load-balancing thread (11 threads in total, counting the main thread). The actors create and dispatch messages (adding them to a locked queue), and the worker threads process the actor messages. The benchmark pushes 10 million messages through. In essence, it stresses semaphore locking/unlocking and thread rescheduling (since each new message unlocks a thread). A semaphore unlock typically forces an OS reschedule as it pops the next waiting thread off the semaphore’s wait queue.
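For context, the core of what is being measured looks something like this (a minimal C++20 sketch of the pattern, not the actual framework code; all names are made up):

```cpp
#include <deque>
#include <mutex>
#include <semaphore>

struct Message { int payload; };

std::mutex                gQueueLock;
std::deque<Message>       gQueue;
std::counting_semaphore<> gPending(0);

// Producers (actors) push a message and signal the semaphore; each
// signal can wake a sleeping worker, which forces a kernel reschedule.
void Dispatch(Message msg) {
	{
		std::lock_guard<std::mutex> lock(gQueueLock);
		gQueue.push_back(msg);
	}
	gPending.release();
}

// Each worker-pool thread sleeps on the semaphore until work arrives.
void WorkerLoop() {
	for (;;) {
		gPending.acquire();
		Message msg;
		{
			std::lock_guard<std::mutex> lock(gQueueLock);
			msg = gQueue.front();
			gQueue.pop_front();
		}
		// ... process msg ...
		(void)msg;
	}
}
```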

Haiku performance wasn’t exactly thrilling in this benchmark. The Benaphore implementation is 4s faster than raw Haiku semaphores (34s vs 38s). OSX initially behaved poorly with native Mach semaphores (38s), but improved dramatically with Grand Central Dispatch semaphores (38s down to 8.9s, roughly a 4x speedup).

Seeing the improvement on OSX from switching semaphores, I wonder if there is a lighter version of semaphores I can use on Haiku? I understand that each alternative implementation makes different compromises (i.e. less sanity checking, less aggressive rescheduling, limited queue functionality, limited debugging facilities, etc.; nothing is free in this world). Having said that, I’m quite impressed with the Win10 performance: more than twice as fast as OSX on the same hardware.

Did you use a normal nightly build for the tests? If so, you might want to try again with a custom build that doesn’t have all the kernel debugging options enabled; we enable a lot of debug code in the nightly builds.

You can change this in the sources in build/config_headers/kernel_debug_config.h: just change the value of the KDEBUG_LEVEL define at the top to 0, then rebuild Haiku.
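That is, assuming the define is near the top of the file as described, the change is just:

```cpp
// In build/config_headers/kernel_debug_config.h, near the top:
#define KDEBUG_LEVEL 0   // 0 disables the extra kernel debug checks
```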

Nightly builds, updated daily (the x64 version; I use the nifty SoftwareUpdater app). I boot from an 8 GB USB flash disk and don’t have room to compile custom Haiku builds. If there is debug code enabled, that’s OK; I can wait for the official beta and rerun the benchmarks. I was just wondering whether there was a commonly known ‘lite’ version of semaphores that some developers might be using. I did notice roughly a 4x improvement with Grand Central Dispatch semaphores (vs Mach), so I hoped there would be something similar in HaikuLand.

I wonder if Grand Central Dispatch semaphores are not “true” semaphores and are somehow handled in userland (i.e. without syscalls), and whether Windows is also using some trick to keep things in userland, which would explain why both are so performant.

Dispatch semaphore source can be found here:


No surprise that it’s a Benaphore. It’s also limited compared to Haiku semaphores (https://github.com/haiku/haiku/blob/master/src/system/kernel/sem.cpp), since it’s team-local, while Haiku semaphores are system-wide (accessible by other teams). Haiku also does more validation. These factors explain the performance difference. I’d love to see what Win10 is doing.
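For readers following along, the classic BeOS-style Benaphore is only a few lines: an atomic counter in userland, with a kernel semaphore behind it that is only touched under contention (a sketch using the Haiku/BeOS API, not the framework’s actual code):

```cpp
#include <OS.h>

class Benaphore {
	int32  fCount;   // lock requests in flight
	sem_id fSem;     // kernel semaphore, created with a count of 0
public:
	Benaphore(const char* name)
		: fCount(0), fSem(create_sem(0, name)) {}
	~Benaphore() { delete_sem(fSem); }

	void Lock() {
		// atomic_add() returns the previous value: if it was 0 we own
		// the lock with no syscall at all (the fast path).
		if (atomic_add(&fCount, 1) > 0)
			acquire_sem(fSem);   // contended: block in the kernel
	}

	void Unlock() {
		// If someone incremented the counter while we held the lock,
		// wake exactly one of them via the kernel semaphore.
		if (atomic_add(&fCount, -1) > 1)
			release_sem(fSem);
	}
};
```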

Also, dispatch semaphores use a lock-free MPSC queue, while Haiku uses a simpler lock-based thread queue. It may be worth investigating this intrusive MPSC queue implementation (http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue) for Haiku’s semaphores to improve performance.
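For reference, the intrusive MPSC queue linked above boils down to roughly this (a sketch using C++ atomics; see the article for the discussion of the momentarily-inconsistent corner case):

```cpp
#include <atomic>

struct Node {
	std::atomic<Node*> next{nullptr};
};

class MpscQueue {
	std::atomic<Node*> fHead;   // producers push here
	Node*              fTail;   // the single consumer pops here
	Node               fStub;   // dummy node so the list is never empty
public:
	MpscQueue() : fHead(&fStub), fTail(&fStub) {}

	// Any number of producers may call Push() concurrently; wait-free.
	void Push(Node* n) {
		n->next.store(nullptr, std::memory_order_relaxed);
		Node* prev = fHead.exchange(n, std::memory_order_acq_rel);
		prev->next.store(n, std::memory_order_release);
	}

	// Only the single consumer may call Pop(). Returns nullptr when the
	// queue is empty, or while a producer is mid-push (caller retries).
	Node* Pop() {
		Node* tail = fTail;
		Node* next = tail->next.load(std::memory_order_acquire);
		if (tail == &fStub) {              // skip the stub node
			if (next == nullptr)
				return nullptr;
			fTail = next;
			tail = next;
			next = next->next.load(std::memory_order_acquire);
		}
		if (next != nullptr) {
			fTail = next;
			return tail;
		}
		if (tail != fHead.load(std::memory_order_acquire))
			return nullptr;                // producer mid-push; retry later
		Push(&fStub);                      // re-insert stub to unlink tail
		next = tail->next.load(std::memory_order_acquire);
		if (next != nullptr) {
			fTail = next;
			return tail;
		}
		return nullptr;
	}
};
```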

Ah, this looks pretty good. Can you create a ticket about it so we don’t forget?

You should be extremely cautious with artificial benchmarks. It’s very easy to end up measuring something nobody cares about, and if the benchmark influences development this can cause the OS to be made worse in pursuit of “solving” a problem nobody actually had. For example with locking primitives you will usually want the primitive to perform best when uncontended, and then strive to avoid most contention in a real system, rather than figuring you’ll have a lot of contention and then optimising for that case at the cost of making uncontended use awful.

That said, yes, the state of the art has come some way since BeOS.

For example, each Benaphore needs at least 8 bytes of storage in the user process, plus a kernel semaphore, which needs dozens of bytes of kernel RAM and is itself always a limited resource. If you have an array of 50 000 items, you obviously won’t feel comfortable allocating 50 000 Benaphores to protect them. So you take one big Benaphore and protect the whole array. This works, of course, but it unnecessarily forbids parallel operations on separate items, so it’s slow.

In contrast on a modern system with a much lighter primitive you can afford to lock each item, so many threads can operate on separate items at once rather than all waiting on a single OS-wide semaphore.

For example in Windows you have WaitOnAddress and in Linux futex(2). In both cases the idea is that the fast path doesn’t touch the operating system at all. There is no “semaphore ID” let alone a kernel data structure to be allocated. Only in the slow path (when the lock is contended) does the userspace need help from the operating system. The clever trick here is that the OS already needs per-thread data structures. It doesn’t matter that there may be millions or even billions of locks on a system, any particular thread can only be stuck waiting for one of them. In each case the OS primitive provides a way to indicate which lock is waited for, and how you’d know to keep waiting. The OS checks it really should wait, notes these two facts in the per-thread structures, and puts the thread to sleep. When the thread with the lock gives it back, it will see it was contended and tell the OS, which wakes up any threads that are asleep waiting for that lock…
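To make that concrete, here is roughly what the futex(2) fast-path/slow-path split looks like for a simple mutex (a sketch after Drepper’s “Futexes Are Tricky”, Linux-only, not production code):

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
static std::atomic<int> gState{0};

static long Futex(std::atomic<int>* addr, int op, int val) {
	// The kernel only sees a plain int; std::atomic<int> shares its
	// representation on the platforms this would run on.
	return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
		nullptr, nullptr, 0);
}

void Lock() {
	int c = 0;
	// Fast path: uncontended CAS 0 -> 1, no kernel involvement at all.
	if (gState.compare_exchange_strong(c, 1, std::memory_order_acquire))
		return;
	// Slow path: mark the lock contended (2) and sleep in the kernel.
	// FUTEX_WAIT re-checks that the word still equals 2 before sleeping,
	// which closes the race against a concurrent Unlock().
	if (c != 2)
		c = gState.exchange(2, std::memory_order_acquire);
	while (c != 0) {
		Futex(&gState, FUTEX_WAIT, 2);
		c = gState.exchange(2, std::memory_order_acquire);
	}
}

void Unlock() {
	// Fast path: if nobody marked the lock contended, no syscall either.
	if (gState.exchange(0, std::memory_order_release) == 2)
		Futex(&gState, FUTEX_WAKE, 1);   // wake one sleeping waiter
}
```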


I read a paper a few years ago about semaphore optimization for high-frequency stock trading. It analyzed the CPU time of a basic queue data structure in a stock-market buy/sell application. A large percentage of the application’s time was spent in the kernel, constantly processing the atomic semaphore locks. The fix reduced kernel time and improved performance by some nice percentage. For high-frequency trading that is a big win, worth millions of dollars. For Haiku, I’m not sure it is noticeable until you start high-frequency stock trading :slight_smile:

You could probably do something using the pthread API, which is where you should go when you don’t need cross-process locking. There you can find quite a lot of lighter primitives. Or, you could roll your own thing using atomic_add and friends, I guess?
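For instance, a process-local counting semaphore can be sketched from pthread primitives in a handful of lines (a sketch of one “lighter” option when cross-team access isn’t needed; untested on Haiku):

```cpp
#include <pthread.h>

class LocalSemaphore {
	pthread_mutex_t fLock;
	pthread_cond_t  fCond;
	int             fCount;
public:
	explicit LocalSemaphore(int count) : fCount(count) {
		pthread_mutex_init(&fLock, NULL);
		pthread_cond_init(&fCond, NULL);
	}
	~LocalSemaphore() {
		pthread_cond_destroy(&fCond);
		pthread_mutex_destroy(&fLock);
	}

	void Acquire() {
		pthread_mutex_lock(&fLock);
		while (fCount == 0)              // loop guards against spurious wakeups
			pthread_cond_wait(&fCond, &fLock);
		fCount--;
		pthread_mutex_unlock(&fLock);
	}

	void Release() {
		pthread_mutex_lock(&fLock);
		fCount++;
		pthread_mutex_unlock(&fLock);
		pthread_cond_signal(&fCond);     // wake one waiter, if any
	}
};
```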


Using POSIX semaphores, I managed to get 53s (as opposed to 33s). So Haiku native semaphores are faster than POSIX semaphores on Haiku (in this benchmark, at least).
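For reference, these are the POSIX calls in question, in their process-private form (assuming that is what was benchmarked):

```cpp
#include <semaphore.h>

void Example() {
	sem_t sem;
	sem_init(&sem, 0 /* not shared between processes */, 1 /* initial count */);
	sem_wait(&sem);      // acquire: blocks while the count is zero
	sem_post(&sem);      // release: may wake one blocked waiter
	sem_destroy(&sem);
}
```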

I’ve been reading more about alternative implementations, and I have concerns about their quality. E.g. there are several Benaphore/futex implementations which don’t handle TimedWait() failures properly, and implementations which break when signals are received or when the semaphore is destroyed.

Haiku (BeOS) semaphores are very robust, and my benchmark always completes successfully (although more slowly), as do Grand Central Dispatch semaphores. Oddly, the Windows semaphores (which were the fastest) exhibited a deadlock (benchmark non-completion) in about 1 run out of 10. I still need to investigate why. Haiku and OSX, although slower, have proven to be 100% reliable, while the Windows result is worrying. I’m inclined to believe there is a race condition I need to investigate, but there are blog posts out there where Windows/Xbox developers stay away from the faster solutions and fall back to robust ones, because they also discovered weird deadlocks which disappeared on slower (more reliable) code paths. That doesn’t mean there is no issue, just that a faster code path may trigger the deadlock sooner.

This seems like putting the cart before the horse. Deadlocks mean your benchmark is broken, and it is likely the results would be significantly different if you fixed it. So you should “investigate why” before trying to divine any meaning from the benchmark’s numbers.

You seem to know a lot about these matters … perhaps you might have time to help improve our implementation? :slight_smile:


The benchmark uses 3 locking primitives: spinlocks, counting semaphores, and spinlocks which transition to semaphores after ‘n’ spins. Under Windows, if I eliminate the raw spinlock (type 1) and replace it with type 2 (semaphores) or type 3 (combined spinlock/semaphore), there are no deadlocks in overnight scripted runs (16s vs 4s, so 4x slower). There seems to be some sort of scheduler inversion if the raw spinlock spins too often (in the scenario where it gets preempted). There are a few reports of people experiencing the same behaviour with spinlocks (on Xbox developer forums). Haiku, Linux, and OSX have never exhibited this scheduler inversion. This is something that needs to be investigated on that platform. There may be a bug, but it’s only reproducible with spinlocks on one platform (not reproducible on the others).
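For anyone curious, the “type 3” hybrid is typically something like the following (a sketch against the Haiku API; kSpinLimit is a made-up tunable and this is not the benchmark’s actual code):

```cpp
#include <OS.h>

class SpinThenBlockLock {
	enum { kSpinLimit = 4000 };   // assumed tuning constant
	int32  fState;                // 0 = free, 1 = held
	int32  fWaiters;              // threads blocked (or about to block)
	sem_id fSem;
public:
	SpinThenBlockLock()
		: fState(0), fWaiters(0), fSem(create_sem(0, "spin-sem")) {}
	~SpinThenBlockLock() { delete_sem(fSem); }

	void Lock() {
		// Phase 1: spin in userland, hoping the holder releases quickly.
		for (int32 i = 0; i < kSpinLimit; i++) {
			if (atomic_test_and_set(&fState, 1, 0) == 0)
				return;
		}
		// Phase 2: register as a waiter and sleep in the kernel.
		// Spurious wakeups simply loop back to the test-and-set.
		atomic_add(&fWaiters, 1);
		while (atomic_test_and_set(&fState, 1, 0) != 0)
			acquire_sem(fSem);
		atomic_add(&fWaiters, -1);
	}

	void Unlock() {
		atomic_set(&fState, 0);
		if (atomic_get(&fWaiters) > 0)
			release_sem(fSem);    // wake one blocked waiter
	}
};
```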