Cannot spawn processes after a while

Notes:

  • I may use “team id” and “pid” interchangeably in this conversation
  • I’m not sure which category to post this in; if a moderator could sort me out and explain (in private) a more adequate place, I’d be grateful
  • I didn’t open a Trac issue as there might already be one I didn’t find; if not, please tell me and I’ll open one (and tell me which component to triage it under)

I’m encountering a stability issue with Haiku. I’m trying to track down what’s
happening, so I wrote a program for that. But I’m not sure my monitoring app is
correct, so I’d like someone to give it a look and tell me if I’m wrong.

This didn’t lead to any data loss.

I have Haiku (R1/beta4 hrev56578+95) deployed in a QEMU virtual machine
under Linux
(Linux mothra 5.10.0-23-amd64 #1 SMP Debian 5.10.179-1 (2023-05-12) x86_64 GNU/Linux)

I use it daily to learn the BeOS API by writing programs and sometimes attempting
to compile Haiku’s components. In order to simplify my life, I downloaded and
compiled Byobu, which works OK as long as it’s not too demanding on the Terminal
app (but that’s a problem for a different day). Byobu spawns a couple of processes
per second to refresh its status bar.

About every week or so, Haiku can’t spawn processes or threads anymore. The
Deskbar becomes unresponsive, but <ctrl>+<alt>+<del> (sometimes) allows for a
soft reboot, and all running apps are still usable.

Attempting to run a command from the Terminal app crashes it: it displays a
message about fork failing and exits.

I observed that the team id keeps growing. For the past two months, I have been
writing down the uptime and the last team id from the ps command. Once I reached
the end of my tether, I decided to monitor this growth with a small tool:

collect-ps

Once compiled, you can use it to fetch the last team id:

./collect-ps

or watch the system:

./collect-ps -w &

This will collect the last team id and the max team id every second, and attempt to
spawn a process and a thread.
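
For reviewers, the core of the watcher is roughly the following (a simplified
sketch using Haiku’s Kernel Kit, not the exact collect-ps source; the log
format and names are illustrative):

#include <OS.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body for the spawn probe: does nothing and exits. */
static status_t probe_thread(void*) { return B_OK; }

int main()
{
	for (;;) {
		/* Walk the list of live teams; the last entry returned is the
		   highest team id currently alive. */
		int32 cookie = 0;
		team_info info;
		team_id last = -1;
		while (get_next_team_info(&cookie, &info) == B_OK)
			last = info.team;

		/* Probe process creation with fork(). */
		const char* proc = "process(ko)";
		pid_t pid = fork();
		if (pid == 0)
			_exit(0);
		if (pid > 0) {
			waitpid(pid, NULL, 0);
			proc = "process(ok)";
		} else
			fprintf(stderr, "Failing to spawn child: %s\n", strerror(errno));

		/* Probe thread creation with spawn_thread(). */
		const char* thr = "thread(ko)";
		thread_id tid = spawn_thread(probe_thread, "probe",
			B_NORMAL_PRIORITY, NULL);
		if (tid >= 0) {
			status_t ret;
			resume_thread(tid);
			wait_for_thread(tid, &ret);
			thr = "thread(ok)";
		}

		printf("%" B_PRId32 "\t%s\t%s\n", last, proc, thr);
		snooze(1000000);	/* one second */
	}
	return 0;
}

If the probes should be done differently (for instance with load_image()
instead of fork()), I’m happy to adjust.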

On my last attempt, the KO was reached at PID 25’465’541. This number doesn’t
look like anything meaningful to me. I’m waiting for the next crash ^^;

At that point, running ps in an already-open Terminal didn’t crash it, but output this message:

~/workshop/belab/todo> ps
-bash: fork: Unknown Device Error (-2147432385)
-bash: cannot make pipe for command substitution: Too many open files

My hypothesis is that the kernel reaches an integer ceiling and overflows. But I
might be wrong; maybe another resource id is exhausted.
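
One detail that seems consistent with this hypothesis (my own reading of the
numbers, happy to be corrected): the failure codes in the watch-pid.err excerpt
below sit just above INT32_MIN, which is where B_GENERAL_ERROR_BASE lives if I
read Haiku’s Errors.h correctly, and they climb in regular steps, as if a
wrapped-around 32-bit id were being handed back as a status code:

/* Quick arithmetic on the codes from watch-pid.err. Assumes
   B_GENERAL_ERROR_BASE == INT32_MIN, hence the "Unknown General Error"
   rendering of these values. */
#include <stdint.h>
#include <stdio.h>

int main()
{
	const int32_t codes[] = { -2147483585, -2147483463, -2147483341,
		-2147483219 };
	for (size_t i = 0; i < sizeof(codes) / sizeof(codes[0]); i++) {
		/* Offsets come out as 63, 185, 307, 429: a constant step of
		   122, i.e. a counter still ticking between probes. */
		printf("%d = INT32_MIN + %d\n", (int)codes[i],
			(int)(codes[i] - INT32_MIN));
	}
	return 0;
}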

I’m attaching the logs I collected.

So, in my monitoring tool, I observe a few strange problems:

  1. The team id seems stuck for a while, even though a ps will show higher
    team ids. It’s as if that value were cached at some point, but I don’t have
    a good enough knowledge of the system to tell. And my monitoring tool might
    be erroneous.
  2. Sometimes the team id reverts to a previous value.

One possible explanation I can think of (unverified): since get_next_team_info()
only walks teams that are currently alive, the “last id” my tool samples is the
newest team still running, which could plateau while short-lived teams come and
go, and drop back to it when a newer team exits.

Is this a known behaviour? I didn’t find anything in Trac concerning such an issue.

Please just open a ticket, regardless of there possibly being duplicates or the wrong component.

This is a user forum, and bug reports, especially about kernel functionality, have no place here.

Excerpt from the logs:

watch-pid.log (end of life)

2024-02-28-10:32:04	    25465541	    25465541	process(ok)	thread(ok)
2024-02-28-10:33:04	    25465541	    25465541	process(ok)	thread(ok)
2024-02-28-10:34:04	    25465541	    25465541	process(ok)	thread(ok)
2024-02-28-10:35:04	    27273448	    27273448	process(ok)	thread(ok)
2024-02-28-10:36:04	    27274986	    27274986	process(ok)	thread(ok)
2024-02-28-10:37:04	    27276508	    27276508	process(ok)	thread(ok)
2024-02-28-10:38:04	    27278027	    27278027	process(ok)	thread(ok)
2024-02-28-10:39:04	    27279554	    27279554	process(ok)	thread(ok)
2024-02-28-10:40:04	    27281070	    27281070	process(ok)	thread(ok)
2024-02-28-10:41:04	    27282610	    27282610	process(ok)	thread(ok)
2024-02-28-10:42:04	    27284133	    27284133	process(ok)	thread(ok)
2024-02-28-10:43:04	    27285681	    27285681	process(ok)	thread(ok)
2024-02-28-10:44:04	    27287191	    27287191	process(ok)	thread(ok)
2024-02-28-10:45:04	    27288755	    27288755	process(ok)	thread(ok)
2024-02-28-10:46:04	    27290269	    27290269	process(ok)	thread(ok)
2024-02-28-10:47:04	    27291804	    27291804	process(ok)	thread(ok)
2024-02-28-10:48:04	    27293302	    27293302	process(ok)	thread(ok)
2024-02-28-10:49:04	    27294818	    27294818	process(ok)	thread(ok)
2024-02-28-10:50:04	    25465541	    25465541	process(ok)	thread(ok)
2024-02-28-10:51:04	    25465541	    25465541	process(ok)	thread(ok)
2024-02-28-10:52:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-10:53:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-10:54:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-10:55:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-10:56:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-10:57:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-10:58:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-10:59:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-11:00:04	    25465541	    25465541	process(ko)	thread(ko)
2024-02-28-11:01:04	    25465541	    25465541	process(ko)	thread(ko)

watch-pid.err

Failing to spawn child: Unknown General Error (-2147483585)
Failing to spawn child: Unknown General Error (-2147483463)
Failing to spawn child: Unknown General Error (-2147483341)
Failing to spawn child: Unknown General Error (-2147483219)
Failing to spawn child: Unknown General Error (-2147483097)
Failing to spawn child: Unknown General Error (-2147482975)
Failing to spawn child: Unknown General Error (-2147482853)
Failing to spawn child: Unknown General Error (-2147482730)
Failing to spawn child: Unknown General Error (-2147482324)
Failing to spawn child: Unknown General Error (-2147481964)

watch-pid.log (beginning)

2024-02-28-11:04:35	        2214	        2214	process(ok)	thread(ok)
2024-02-28-11:05:35	        2875	        2875	process(ok)	thread(ok)
2024-02-28-11:06:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:07:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:08:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:09:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:10:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:11:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:12:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:13:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:14:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:15:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:16:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:17:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:18:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:19:35	       24968	       24968	process(ok)	thread(ok)
2024-02-28-11:20:35	       26562	       26562	process(ok)	thread(ok)
2024-02-28-11:21:35	       28066	       28066	process(ok)	thread(ok)
2024-02-28-11:22:35	       29580	       29580	process(ok)	thread(ok)
2024-02-28-11:23:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:24:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:25:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:26:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:27:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:28:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:29:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:30:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:31:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:32:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:33:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:34:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:35:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:36:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:37:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:38:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:39:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:40:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:41:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:42:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:43:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:44:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:45:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:46:35	       65957	       65957	process(ok)	thread(ok)
2024-02-28-11:47:35	       67475	       67475	process(ok)	thread(ok)
2024-02-28-11:48:35	        4282	        4282	process(ok)	thread(ok)
2024-02-28-11:49:35	        4282	        4282	process(ok)	thread(ok)

OK, sorry for the noise. Please delete the thread; I have a copy of the message.

@nephele
In my opinion it does have value to also have it here.

Certainly create a ticket with all that info. Maybe not all forum users are developers or technically aware, but I think it is useful for me and possibly other users to be aware of this behavior. Maybe we could give a hint that way in the future to another person seeing something similar.

In any case, I appreciate the intention and investigation of @LupusMichaelis. I couldn’t have known about this without this post.
