Multitasking question

I’ve put my Haiku install (on bare metal) under stress using lame, which is well known to be a single-threaded application.
I just wanted to monitor the CPU usage, in this case on an i7-7500U with 2 cores + HT (so Haiku sees 4 logical cores), under different load scenarios.
This is with a single lame run:


This is with two lame processes:

This is with three lame processes:

This is with four:

And finally, with five lame processes:

While I can totally understand the behaviour with 1, 4 and 5 processes, are the other ones expected/correct? Or are they sub-optimal (as they seem to be)?

1 Like

The scheduler tries to fulfill many constraints, for example, balancing the load across the different CPU cores and caches. If you have 2 cores with hyperthreading, it is better to use just one hardware thread on each core. But if you do that for a long time, that core will heat up more than the other, and so it will exit its “turbo boost” mode to avoid overheating. So it is a good idea to migrate to another core after a while, until the first one cools down.

So, if you hit a case like this, the threads will be moved from one core to another periodically.
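To make the idea concrete, here is a minimal sketch of that kind of rebalancing heuristic. Everything in it (names, scales, thresholds) is invented for illustration; it is not code from the Haiku scheduler:

```cpp
#include <vector>

struct Core {
	int id;
	int load;	// load average, 0..1000 per mille (hypothetical scale)
};

// If some core is much less loaded than the average, migrate there;
// otherwise stay put so threads are not bounced around for nothing.
int PickMigrationTarget(const std::vector<Core>& cores, int averageLoad)
{
	const Core* best = &cores[0];
	for (const Core& core : cores) {
		if (core.load < best->load)
			best = &core;
	}
	const int kImbalanceThreshold = 200;	// invented tuning constant
	if (averageLoad - best->load > kImbalanceThreshold)
		return best->id;
	return -1;	// -1 means: stay on the current core
}
```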

There are also other things to take into account. For example, when a thread does a disk access, it is put on wait until the reply from the disk is received, and when it becomes ready again, it may be put on another idle CPU. Maybe there is now another task (even one with little CPU use) running on the CPU where it was initially, and it does not cost anything to put it on another idle CPU at that point.

So, there is nothing in these graphs that allows us to say the scheduling is sub-optimal. If you really had only 2 or 3 threads to run, and 4 CPU cores, then, yes, this would be strange. But there are at least a few dozen other things running in the background that introduce “noise” into the system and make it hard to predict what the best solution is. And you wouldn’t see it just from the global CPU activity graph.

Is it optimal? Probably not. But to answer this you have to define what “optimal” is. First, what do you want to achieve? The lowest possible latency for the user? If so, the mouse movements and all things related to “display” will have to run first. Or do you want your lame process to run as fast as possible even if the GUI is completely frozen during that time? That’s also possible: you can set your processes to a very high priority. Or maybe you want to keep one CPU core idle as much as possible when it’s not really needed? You can do that too, by switching the scheduler to its “power saving” mode (available in the ProcessController menu).

And, also, the scheduler has to make decisions very fast. If it spent 2 seconds computing the perfect solution every time it needs to select which thread to run on which CPU, it would not be optimal anymore, even if the solution were the theoretically perfect one. The scheduler is an O(1) algorithm, meaning it does not scan all available CPU cores and all available threads when it wants to decide what to run next. And so, sometimes it has to take a shortcut and pick the “good enough” solution that can be found very quickly, rather than the perfect solution that would take too long to compute.
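For a flavor of what “fast and good enough” can look like, here is a hypothetical sketch: keep the cores in a min-heap ordered by load and take the top entry instead of scanning everything. (The real scheduler’s data structures live in the sources linked later in this thread; this is only an illustration with invented names.)

```cpp
#include <queue>
#include <vector>

struct CoreEntry {
	int id;
	int load;
	bool operator>(const CoreEntry& other) const { return load > other.load; }
};

using CoreHeap = std::priority_queue<CoreEntry, std::vector<CoreEntry>,
	std::greater<CoreEntry>>;

// Pick the least loaded core from the heap top and account for the new
// thread: O(log n) for the heap update, no scan of all cores and threads.
int ChooseCoreFast(CoreHeap& heap, int newThreadLoad)
{
	CoreEntry entry = heap.top();
	heap.pop();
	entry.load += newThreadLoad;
	heap.push(entry);
	return entry.id;
}
```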

11 Likes

Where in the Haiku source code would I look for code that will migrate processes off core 1 onto core 2 “until the first one cools down” ?

Typical modern CPUs have some per-core cache, e.g. 64kB of very fast L1 data cache and 1MB of L2 cache, and although the shared cache (L3) is typically larger, it’s also far slower. So if you re-awaken the process on a different core, the associated caches don’t have that memory cached; they need to ask for it and, if they’re writing to it, they also need to tell the other caches they’ve got it now, making it effectively far slower to run that process for a short while each time this happens. This would also happen if the process had been asleep for long enough for the cache to all be re-used, or if you specifically went out of your way to clear the cache, but that would be weird when running a CPU-intensive process like lame.
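If you want to see that effect in isolation, here is a small self-contained demo (generic C++, nothing Haiku-specific): time a pass over a working set while it is cache-warm, evict it by streaming through a large scratch buffer (roughly what a migrated thread finds on its new core), and time it again. Exact numbers depend on the machine, but the cold pass should be noticeably slower.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Sum the buffer and return the elapsed time in microseconds.
static long long TimedSumMicros(const std::vector<int>& data, long long& sink)
{
	auto start = std::chrono::steady_clock::now();
	long long sum = 0;
	for (int value : data)
		sum += value;
	auto end = std::chrono::steady_clock::now();
	sink += sum;	// keep the compiler from optimizing the loop away
	return std::chrono::duration_cast<std::chrono::microseconds>(
		end - start).count();
}

int main()
{
	std::vector<int> workingSet(256 * 1024, 1);		// ~1 MiB, fits in L2/L3
	std::vector<int> scratch(16 * 1024 * 1024, 1);	// ~64 MiB, evicts it
	long long sink = 0;

	TimedSumMicros(workingSet, sink);					// prime the caches
	long long warm = TimedSumMicros(workingSet, sink);	// cache-warm pass
	TimedSumMicros(scratch, sink);						// thrash the caches
	long long cold = TimedSumMicros(workingSet, sink);	// cache-cold pass

	std::printf("warm: %lld us, cold: %lld us\n", warm, cold);
	return 0;
}
```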

I would say these charts suggest Haiku’s passive affinity works OK (whether deliberately or just by good luck) for the 1x lame case but fails badly for 2x or 3x lame and it’s impossible to know whether it’s also a problem for 4x and 5x lame.

1 Like

The code is in the scheduler:

https://cgit.haiku-os.org/haiku/tree/src/system/kernel/scheduler/low_latency.cpp

There isn’t that much to it; it’s just load average management, trying to keep all cores close to the average if possible. The computation of each core’s load is done by the CPU itself and reported through usage counters. When the CPU slows down, its load percentage increases for the same amount of processing, so the result is that some things get moved out to other cores.
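The arithmetic behind that last sentence, as a hypothetical example (the clock frequencies are just illustrative numbers): at a lower clock, the same work occupies a larger fraction of each interval, so the measured load rises and the balancer starts shifting work away.

```cpp
#include <cstdio>

int main()
{
	const double workCycles = 1.0e9;	// fixed amount of work per interval
	const double intervalSeconds = 1.0;

	// Boosted vs. throttled clock (illustrative values):
	for (double ghz : { 3.5, 2.7 }) {
		double busySeconds = workCycles / (ghz * 1.0e9);
		double loadPercent = 100.0 * busySeconds / intervalSeconds;
		std::printf("%.1f GHz -> %.1f%% load for the same work\n",
			ghz, loadPercent);
	}
	return 0;
}
```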

I just sent a long explanation of why affinity is not the only thing to optimize for. The scheduler takes some other things into account, and as a result it decided to move things to different CPUs.

It is impossible, by looking at just these graphs, to decide if that was reasonable or not. Because, as I said, there are a few dozen other threads running and we don’t know what they are doing.

Also, the L1 and L2 caches are separate per core, but that doesn’t matter for hyperthreading. With just the graphs, it’s very possible that threads are simply moved from one hardware thread to the other on the same CPU core (and yes, before you ask, the scheduler does take CPU topology into account and knows about the cache structure, which you can see in the separate _ChooseCPU and _ChooseCore functions here: https://cgit.haiku-os.org/haiku/tree/src/system/kernel/scheduler/scheduler_thread.cpp)
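In the spirit of that _ChooseCPU/_ChooseCore split, a hypothetical sketch of topology-aware placement (names and structure here are invented, not copied from scheduler_thread.cpp): prefer an idle SMT sibling of the thread’s previous core, where the L1/L2 contents may still be useful, before falling back to any idle CPU.

```cpp
#include <vector>

struct CpuInfo {
	int id;
	int coreId;	// two SMT siblings share the same coreId
	bool idle;
};

int ChooseCpu(const std::vector<CpuInfo>& cpus, int previousCoreId)
{
	// First choice: an idle logical CPU on the same physical core, since
	// the caches there may still hold the thread's working set.
	for (const CpuInfo& cpu : cpus) {
		if (cpu.idle && cpu.coreId == previousCoreId)
			return cpu.id;
	}
	// Otherwise: any idle CPU at all.
	for (const CpuInfo& cpu : cpus) {
		if (cpu.idle)
			return cpu.id;
	}
	return -1;	// nothing idle; the caller would enqueue the thread
}
```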

3 Likes

I didn’t expect my question to raise such a rich debate.

On one hand, I find it reasonable to move a thread to a different CPU core when the first one is heating up too much; this can prevent the fan from spinning up and suits an energy-saving profile well.

On the other hand, keeping threads on the same core (with the advantage of cache hits) will enhance performance, and it’s a wise choice for a performance/gaming profile.

1 Like

As opposed to it being done by a little goblin with an abacus? I can’t tell if you’re confused about what’s actually going on here or if you intended to confuse readers by making it sound like this is something more than it really is. Haiku simply does not, in fact, have the mechanism you described earlier.

In practice workloads like this are always busy, so you go from 100% busy while boosted to 100% busy while waiting for cooldown, there’s no difference.

It’s pretty easy just from the graphs to see that Haiku ping-pongs busy threads with no coherent pattern. We could imagine maybe it’s trying to confuse jungle predators, or get a Netflix comedy show, or any number of ludicrous ideas but it seems reasonable to assume it’s actually just doing this because it screws up natural affinity.

Yes, the stats are likely not of the highest quality - remember back when Haiku could end up believing it was using very little or even negative RAM? Hilarious. But while the smooth graphs could be a bug, the stuff where it leaps all over the place would take almost more effort to fake, even by accident.

1 Like

The scheduler definitely has load tracking. By what rationale are you claiming it doesn’t?

https://cgit.haiku-os.org/haiku/tree/src/system/kernel/scheduler/scheduler_cpu.cpp#n531

Does the scheduler migrate threads between cores too aggressively? I don’t know, I haven’t studied this, and the person who designed this scheduler (pdziepak) knows far more than I do about CPUs and using them efficiently (he was one of the Cloudius/ScyllaDB developers for a while.) It’s possible, of course. But the scheduler does take thread affinity and core topology into account, that’s pretty apparent.

2 Likes

As an aside: all the interactions I can recall having with you, or reading from you, both on this forum (checking your comment history here seems to confirm this) and on HN, have been extremely negative towards Haiku. Why do you bother to come here at all, if you think Haiku is doing everything all wrong? And it’s not just Haiku: last time we interacted I pointed out something OpenBSD also does, and your response included the phrase “I don’t know anyone who uses OpenBSD”. Well, you are on this forum, so I am pretty sure that statement is incorrect. Or you are admitting you don’t really know us!

9 Likes

There will be some IO happening at some point. Or at least a wait for RAM because of a cache miss. And there are other threads waiting to run and the thread is not set to realtime priority, so at some point it will be stopped to make space for other threads no matter what, if there are more ready threads than there are CPU cores.

It is impossible to say there is no coherent pattern from a 10 mile high view with a refresh rate of a few hundred milliseconds when there are hundreds of threads running.

No, because affinity is not the only thing to take into account, and it has made a decision against it because of another factor. I think I have proposed a list of realistic ones:

  • There are a few hundred other threads that are ready, one of them takes the CPU because this is a preemptive system and, unless a thread is set to realtime priority, we end up giving the CPU to other ready threads after a while. Once the thread is kicked out, if another CPU happens to be available, it may be better to move it there than wait for its original CPU to be free again
  • The scheduler does not always make the perfect decision because it also has to make a very fast decision, and that requires some compromises on finding the optimal solution
  • Load balancing between the cores because they leave turbo boost, or whatever other reason may cause them to have lower performance

But sure, maybe it’s jungle predators or a Netflix comedy show. In that case, I would be interested if you could point out the part of the Haiku code that does this, because I could not find it.

Are you accusing me of saying the graphs are fake? Or do you think they are? I’m not sure where you want to go with that. I only said they give a very high-level overview. You are interpreting them as if there were only 1, 2, or 3 threads running in total on the system, in which case, yes, this would be unexpected. But that isn’t the case; we’re dealing with hundreds of small threads that want to run as well. And without knowing what these do, we can’t conclude anything, only guess at what could be happening. I think my guesses are more grounded in reality and knowledge than yours, but they are still only guesses. Their consequence is that I am not very worried about a thread changing CPUs from time to time. So, I don’t really want to spend time investigating this further, as I consider it a non-problem, and I have better things to use my curiosity on.

2 Likes

One real measure would be Haiku vs. other systems. How does this “benchmark” perform on Linux, vs. how does it perform on Haiku, on the same machine? Of course Linux will do better; it’s far more of an optimized system than Haiku in just about all respects, but the question is how different Haiku is. (And also vs. FreeBSD and OpenBSD, too.)

3 Likes

While this may feel like a similar delay to you, the programmer, it’s not for the CPU. When somebody’s use of std::unordered_map results in dependent reads, and thus stalls for a (from the CPU’s point of view) long period that isn’t an OS event, the kernel can’t somehow schedule other work to take place while waiting. With SMT, the other hyperthread does get the whole CPU to itself until the fetch completes, which is nice.
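For anyone who wants to see such a stall in isolation, here is a generic pointer-chasing demo (nothing Haiku-specific): every load depends on the result of the previous one, so the cache misses cannot overlap, and for most of the run the core is just waiting on memory; with SMT, that is exactly the time an idle sibling thread can use.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
	const size_t count = 1 << 22;	// 4M entries, ~32 MiB of indices
	std::vector<size_t> next(count);
	std::iota(next.begin(), next.end(), 0);
	// A random permutation defeats the hardware prefetcher; the cycle
	// through index 0 is long with high probability.
	std::shuffle(next.begin(), next.end(), std::mt19937_64(5489));

	// Walk the chain: each iteration's address depends on the previous
	// load, so every miss stalls the pipeline instead of overlapping.
	size_t index = 0;
	for (size_t i = 0; i < count; i++)
		index = next[index];

	std::printf("final index: %zu\n", index);
	return 0;
}
```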

No, I’m suggesting that the graphs where it’s all over the place are most likely reflecting a real problem. I don’t trust statistics which could easily be statically wrong the way Haiku’s RAM usage stats were, but these effects wouldn’t be caused by this sort of defect.

This is a technical endeavour so unlike if you were fans of a band I don’t like, it’s not merely a matter of taste.

@tialaramex, to be somewhat blunt, everything you have stated so far is conjecture. Do you have any evidence to back up any of your statements? Or data sets from another OS? The explanations from @PulkoMandy and @waddlesplash explain the key points of scheduling in Haiku, and it is obvious that a graph with ms accuracy does not show anything clearly unless the load is steady-state across all cores. You have stated otherwise, but have provided no evidence for that. For example you could share:

  • A graph of the same tests under Linux
  • A graph at a much higher resolution under Haiku with an explanation of where the scheduler moves a task from one place to another
  • An examination of how the scheduler code differs between the two systems by direct example
  • Something else?

Of course you may not feel you need to provide such information and of course you are not obligated to, but so far your argument is looking pretty thin. You’ve made an accusation, do you have any evidence?

5 Likes

From what I remember of mmlr’s scheduler design, this is the intended behavior. A single-threaded task takes one core; multithreaded tasks bounce around between the cores until you run out of cores/threads, and then all (4) go at 100%. If you want to put all the work on as few cores as possible, you enable power saving mode, because that will do the same work more slowly while saving power by not using the other cores (although I don’t know if that part currently works or not).
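A hypothetical side-by-side of the two placement policies described here (illustrative only, not the actual scheduler modes): low latency spreads work to the least loaded core, power saving packs it onto cores that are already awake.

```cpp
#include <vector>

struct Core {
	int id;
	int load;	// 0..1000 per mille (hypothetical scale)
};

// Low-latency style: spread, i.e. pick the least loaded core.
int ChooseSpread(const std::vector<Core>& cores)
{
	size_t best = 0;
	for (size_t i = 1; i < cores.size(); i++) {
		if (cores[i].load < cores[best].load)
			best = i;
	}
	return cores[best].id;
}

// Power-saving style: pack, i.e. prefer the busiest core that still has
// headroom, so the remaining cores can stay in deep sleep.
int ChoosePacked(const std::vector<Core>& cores, int capacity)
{
	int best = -1;
	for (size_t i = 0; i < cores.size(); i++) {
		if (cores[i].load < capacity
			&& (best < 0 || cores[i].load > cores[(size_t)best].load))
			best = (int)i;
	}
	return best >= 0 ? cores[(size_t)best].id : ChooseSpread(cores);
}
```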

1 Like

Enable power saving and rerun the test please, @topolinik; the 2 and 3 core cases should change.

powersaving

1 Like

There are millions of technical endeavors around the world. Why pick on us, exactly? Of course we, a volunteer project with (currently) one paid developer and a “motley crew” of volunteers, will stack up poorly against Linux, which gets millions of dollars spent probably just on optimizing it. Is this supposed to be surprising? Are we supposed to be embarrassed by your sharp comments?

3 Likes

10 posts were split to a new topic: Sinister forces have it in for me/us/Haiku

Unlike @PulkoMandy and @waddlesplash, I actually understand the code we’re talking about. @PulkoMandy’s “computing of the load is done by the CPU itself” is, as I intimated, misleading. It’s true that Haiku’s load calculations are, like all its other calculations, made using the computer; that’s the nature of software. But the CPU isn’t doing anything special here, it’s just doing the arithmetic programmed in the scheduler code. As @jscipione gets to, in the actual design “multithreaded tasks bounce around between each core until you run out of cores/threads”.

It does attempt to track whether cache effects will matter, specifically assuming that after 0.1 seconds the cache effect goes from crucial to negligible. It’d be pretty spooky if across all makes and models of CPU over decades 100_000 microseconds was the right choice, but that sort of thing is probably a heuristic anyway and weird arbitrary choices dominate in software engineering (why do people’s MT19937 implementations choose 5489 as the default seed? It’s just tradition at this point).
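For what it’s worth, the check being described amounts to something like this sketch (the constant matches the 0.1 seconds mentioned above; the names are invented, not quoted from the Haiku sources):

```cpp
#include <cstdint>

const int64_t kCacheExpireMicros = 100000;	// 0.1 s, the heuristic cutoff

// If the thread ran recently, its working set is probably still in the
// old core's caches and waking it there is worth preferring; past the
// cutoff, any idle CPU is assumed to be as good as any other.
bool CacheAffinityStillMatters(int64_t nowMicros, int64_t lastRanMicros)
{
	return nowMicros - lastRanMicros < kCacheExpireMicros;
}
```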

However, as we see, this doesn’t end up making a difference; there’s no way those two lame threads are asleep for a whole 100ms each time they swap. It’s not my job to debug your operating system kernel.

100ms is just the default refresh rate of ActivityMonitor. So with ActivityMonitor we can’t see anything that’s happening faster than that. What are you even talking about?

Are we into personal attack now? I’m not really interested in that.

I will continue pointing to the code, which you apparently understand.

In the case of “computing load by the CPU itself”, I was referring to the MPERF and APERF MSRs, which are read here: https://cgit.haiku-os.org/haiku/tree/src/add-ons/kernel/power/cpufreq/intel_pstates/intel_pstates.cpp and give info on the number of CPU cycles executed, from which you can detect the CPU frequency and whether the CPU was shut down for power saving.
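The technique, roughly sketched (kernel-context pseudocode; ReadMSR() is a hypothetical stand-in for the privileged rdmsr instruction): MPERF ticks at a fixed reference rate while the core is active, APERF counts actual cycles, and the ratio of their deltas over an interval gives the effective frequency relative to the base clock.

```cpp
#include <cstdint>

const uint32_t IA32_MPERF = 0xe7;
const uint32_t IA32_APERF = 0xe8;

uint64_t ReadMSR(uint32_t reg);	// hypothetical wrapper around rdmsr

// Effective frequency as a percentage of the base frequency:
// > 100 means turbo boost, < 100 means throttling or idle halts.
int EffectiveFrequencyPercent(uint64_t& lastMperf, uint64_t& lastAperf)
{
	uint64_t mperf = ReadMSR(IA32_MPERF);
	uint64_t aperf = ReadMSR(IA32_APERF);
	uint64_t mperfDelta = mperf - lastMperf;
	uint64_t aperfDelta = aperf - lastAperf;
	lastMperf = mperf;
	lastAperf = aperf;
	if (mperfDelta == 0)
		return 0;	// the core was halted for the whole interval
	return (int)(aperfDelta * 100 / mperfDelta);
}
```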

I have, of course, oversimplified things in this forum thread to give a simple and understandable response to a user who isn’t necessarily well-versed in CPU internals. If you do understand the code, maybe you can go and read it instead of me having to do the research for you; it will save a lot of time and frustration.

4 Likes

I just couldn’t - every time I launch the test (two lame encoding runs) after selecting Power Saving, some input-related process crashes, so no more keyboard or touchpad (the mouse pointer stops moving). There’s no way to stop anything or even take a screenshot, and during one run even ActivityMonitor froze.
The only working key is the Power button.