Debug report diagnosis help required

Can someone please decipher the below? I’ve been struggling with this for a couple of days.

It looks to me like a problem allocating space. This is occurring after just a few thread spawns (these threads have been confirmed to close fine - max 6 running in parallel - called ‘CommandThread’ when running. ‘user thread’ are the, now finished, CommandThreads.).

All ideas welcome. Thanks in advance.

Debug information for team /boot/home/git/Paladin/Paladin/Paladin.new (208413):
CPU(s): 6x Intel Core™ i7-8750H
Memory: 4.00 GiB total, 1.26 GiB used
Haiku revision: hrev53655 Dec 27 2019 07:20:24 (x86_64)

Active Threads:
	thread 208413: Paladin.new (main)
	thread 208414: team 208413 debug task 
	thread 208463: w>Paladin: Paladin 
	thread 208465: build thread 
	thread 208466: build thread 
	thread 208469: build thread 
	thread 208470: build thread 
	thread 208471: user thread 
	thread 208472: CommandThread 
	thread 208473: user thread 
	thread 208474: CommandThread 
	thread 208468: build thread 
		state: Debugged

		Frame		IP			Function Name
		-----------------------------------------------
		00000000	0x32ccb833a2	_kern_resize_area + 0xa 
			Disassembly:
				_kern_resize_area:
				0x00000032ccb83398:           4989ca  mov %rcx, %r10
				0x00000032ccb8339b:   48c7c0c6000000  mov $0xc6, %rax
				0x00000032ccb833a2:             0f05  syscall  <--

		0x7f372a394350	0x32ccc28de4	BPrivate::hoardSbrk(long) + 0x234 
		0x7f372a394390	0x32ccc2adea	BPrivate::superblock::makeSuperblock(int, BPrivate::processHeap*) + 0xea 
		0x7f372a3943f0	0x32ccc2af78	BPrivate::threadHeap::malloc(unsigned long) + 0xa8 
		0x7f372a394410	0x32ccc2b78b	malloc + 0x9b 
		0x7f372a394420	0x193e98d9ea0	? 
		0x1c810fd9bea	0x193e9904240	? 
		0x48535441e5894855	0x8948000002b0ec81	? 

		Registers:
			  rip:	0x00000032ccb833a4
			  rsp:	0x00007f372a394308
			  rbp:	0x00007f372a394340
			  rax:	0x0000000000000000
			  rbx:	0x0000000000020000
			  rcx:	0x00000032ccb833a4
			  rdx:	0x0000000000000000
			  rsi:	0x0000000000180000
			  rdi:	0x0000000000a15e2d
			   r8:	0x0000000000000007
			   r9:	0x0000000000000007
			  r10:	0x00000193e98e8bd0
			  r11:	0x0000000000003287
			  r12:	0x0000000000180000
			  r13:	0x0000000000000003
			  r14:	0x0000000000152880
			  r15:	0x00000193e98565f0
			   cs:	0x002b
			   ds:	0x0000
			   es:	0x0000
			   fs:	0x0000
			   gs:	0x0000
			   ss:	0x0023
			  st0:	0
			  st1:	0
			  st2:	0
			  st3:	0
			  st4:	0
			  st5:	0
			  st6:	0
			  st7:	0
			  mm0:	{0, 0, 0, 0}
			  mm1:	{0, 0, 0, 0}
			  mm2:	{0, 0, 0, 0}
			  mm3:	{0, 0, 0, 0}
			  mm4:	{0, 0, 0, 0}
			  mm5:	{0, 0, 0, 0}
			  mm6:	{0, 0, 0, 0}
			  mm7:	{0, 0, 0, 0}
			 xmm0:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm1:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm2:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm3:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm4:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm5:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm6:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm7:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm8:	{0, 0, 0, 0, 0, 0, 0, 0}
			 xmm9:	{0, 0, 0, 0, 0, 0, 0, 0}
			xmm10:	{0, 0, 0, 0, 0, 0, 0, 0}
			xmm11:	{0, 0, 0, 0, 0, 0, 0, 0}
			xmm12:	{0, 0, 0, 0, 0, 0, 0, 0}
			xmm13:	{0, 0, 0, 0, 0, 0, 0, 0}
			xmm14:	{0, 0, 0, 0, 0, 0, 0, 0}
			xmm15:	{0, 0, 0, 0, 0, 0, 0, 0}

Loaded Images:
	ID		Text Base	Text End	Data Base	Data End	Type	Name
	--------------------------------------------------------------------------------
	3666251	0x2b1cebe000	0x2b1d07c000	0x2b1d27c000	0x2b1d297000	lib    	/boot/system/lib/libtracker.so
	3666250	0x32ccb41000	0x32ccc9f000	0x32cce9e000	0x32ccef1000	lib    	/boot/system/lib/libroot.so
	3666259	0x4778fd2000	0x4778fdd000	0x47791dd000	0x47791de000	lib    	/boot/system/lib/libiculx.so.57.1
	3666264	0x951de29000	0x951df24000	0x951e123000	0x951e126000	lib    	/boot/system/lib/libtextencoding.so
	3666263	0x95faf88000	0x95fb051000	0x95fb250000	0x95fb251000	lib    	/boot/system/lib/libzstd.so.1.4.4
	3666254	0xbc92851000	0xbc92867000	0xbc92a66000	0xbc92a67000	lib    	/boot/system/lib/libgcc_s.so.1
	3666248	0xc7142d5000	0xc7145e9000	0xc7147e8000	0xc714816000	lib    	/boot/system/lib/libbe.so
	3666262	0xf46f382000	0xf46f39e000	0xf46f59d000	0xf46f59e000	lib    	/boot/system/lib/libz.so.1.2.11
	3666256	0x104e0c87000	0x104e0f04000	0x104e1103000	0x104e1118000	lib    	/boot/system/lib/libicui18n.so.57.1
	3666255	0x10b32518000	0x10b33d96000	0x10b33f95000	0x10b33f96000	lib    	/boot/system/lib/libicudata.so.57.1
	3666257	0x111ebde0000	0x111ebdec000	0x111ebfeb000	0x111ebfed000	lib    	/boot/system/lib/libicuio.so.57.1
	3666252	0x11441579000	0x1144158e000	0x1144178d000	0x1144178f000	lib    	/boot/system/lib/libtranslation.so
	3666249	0x144bbc8d000	0x144bbcd1000	0x144bbed1000	0x144bbed2000	lib    	/boot/system/lib/libpcre.so.1.2.11
	3666246	0x14707abd000	0x14707ade000	0x14707cdd000	0x14707cdf000	lib    	/boot/system/runtime_loader
	3666261	0x18b7bdcd000	0x18b7bf65000	0x18b7c164000	0x18b7c179000	lib    	/boot/system/lib/libicuuc.so.57.1
	3666253	0x1b2409a4000	0x1b240b06000	0x1b240d06000	0x1b240d1a000	lib    	/boot/system/lib/libstdc++.so.6.0.25
	3666260	0x1c1f4756000	0x1c1f4785000	0x1c1f4985000	0x1c1f49b9000	lib    	/boot/system/lib/libicutu.so.57.1
	3666247	0x1c810f27000	0x1c81102b000	0x1c81122a000	0x1c811238000	app    	/boot/home/git/Paladin/Paladin/Paladin.new
	3666258	0x243cd364000	0x243cd3b9000	0x243cd5b9000	0x243cd5bc000	lib    	/boot/system/lib/libicule.so.57.1
	3666245	0x7fa64cdee000	0x7fa64cdf6000	0x00000000	0x00000000	system 	commpage

Areas:
	ID		Base		End			Size (KiB)	Protection	Locking			Name
	--------------------------------------------------------------------------------

Protection Flags: r - read, w - write, x - execute, s - stack, o - overcommit, c - cloneable, S - shared, k - kernel

Semaphores:
	ID		Count	Last Holder	Name
	------------------------------------------------------------

Interesting. It could have to do with us moving to our new memory allocator (and back? Not sure if we’re still on it)

@waddlesplash , @pulkomandy , any ideas?

It appears that the memory allocator is in the middle of a syscall. Is this reported as a crash? Or does the application just hang?

No, we moved back. The call stack is clearly in hoard here.

The application mostly crashes without being caught by the debugger (and it’s being launched by Debugger from the command line). Debugger occasionally catches this particular error just before the whole of Paladin crashes and shuts down, and that’s where the report is coming from.

The code this is based on is the Generic Thread class from Expander. It never fails if there’s only ever one instance of CommandThread (a subclass of GenericThread), but always fails if there’s more than one instance and more source files to compile than Command Threads. Each thread is being allocated and removed fine though AFAICT.

Main difference in my usage of this code is that I’m running multiple instances in parallel (rather than just a single zip file unpack operation), and creating and closing threads rapidly after each other.
I suspect there’s an errant low level thread call causing this, but its crashing before I can get a handle on it. I’ve tried putting breakpoints in logical locations, but of course that basically means only one thread is doing something, so I never see the problem.

I suspect Paladin is being taken out due to a kill_thread taking out its parent process. Can’t see why though.

I’ve uploaded the (very broken) code to the feature-365 branch here. Please let me know if anything in there jumps out at you:-

Try to run application under strace and see what happened before exit.

2 Likes

The first problem I see is that you have a semaphore leak (delete_sem is commented out). This will be a problem because semaphores are in a system-wide pool and there are a limited number of them. It’s probably better to use a pthread_mutex here (if you actually need that part of the GenericThread class).

Second problem is the lack of error check on fThreadId. If spawn_thread fails, I think some operations will mistake an error code for a thread_id, which is probably not a good idea.

Also since the error is in the memory allocator, maybe running under libroot_debug will give you a more useful backtrace: LD_PRELOAD=libroot_debug.so ./Paladin

2 Likes

Yeah I only just commented that out whilst playing. I agree on the fThreadId checks. I’m now calling Kill() (which has checks) instead of directly using kill_thread in the destructor for this.

I’ll add checks on every use just to be safe, and try the strace and libroot_debug approaches too.

Does anyone know why the original author is using fThreadId in some places, but -fThreadId in others? Seemed odd to me.

Hi! Thanks to your libroot_debug approach I managed to find a major issue around multiple threads clearing away a list that should not have been shared. This is probably the issue I have been managing to catch.

The issue I cannot seem to catch before it kills Paladin is still occurring. Using strace (thanks @X512 !) I managed to get the following back. I’ve hidden all other threads for brevity:-

[217674] port_buffer_size_etc(0x53c2e, 0x8, 0x7fffffffffffffff) = 0xffffffff8000000a (96777 us)
[217674] port_buffer_size_etc(0x53c2e, 0x8, 0x7fffffffffffffff) = 0xffffffff8000000a (75037 us)
[217674] port_buffer_size_etc(0x53c2e, 0x8, 0x7fffffffffffffff) = 0xffffffff8000000a (28220 us)
[217674] port_buffer_size_etc(0x53c2e, 0x8, 0x7fffffffffffffff) = 0xffffffff8000000a (12030 us)
[217674] port_buffer_size_etc(0x53c2e, 0x8, 0x7fffffffffffffff) = 0xffffffff8000000a (78990 us)
[217674] port_buffer_size_etc(0x53c2e, 0x8, 0x7fffffffffffffff) = 0xffffffff8000000a (1498450 us)
[217674] port_buffer_size_etc(0x53c2e, 0x8, 0x7fffffffffffffff) = 0x0 (1746 us)
[217674] read_port_etc(0x53c2e, [0x0], (nil), 0x0, 0x8, 0x0) = 0x0 (1 us)
[217674] port_count(0x53c2e) = 0x0 (0 us)
[217674] write(0x1, 0xffffffffffffffff, 0x73d4f79680, 0x18) = 0xffffffff8000600d (5 us)
[217674] --- Broken pipe (Broken pipe) ---

I’m not too familiar with low level calls, but I am expecting a lot of calls to fetch stderr/stdout - I’m guessing that’s the port_buffer_size_etc calls?

I would also imagine the read_port_etc is something checking the status of the thread. I can only guess that the write is a call to kill the thread? Nothing is sending data to the thread after it’s been started (or shouldn’t be…)

Any advice much welcomed. I’ve never had to deal with this level of execution before.

No. Ports are kernel objects for sending messages (IPC). stderr/stdout is accessed by file IO.

Last write call fails with B_BUSTED_PIPE error.

the “ports” things are indeed for IPC, in your case, possibly BMessage receiving. Something here is checking a port at 0x53c2e until there is some data in it, then reads said data.

stdin, stdout and stderr do not need to be “fetched”. From the program point of view, they are just opaque integers that are passed down to syscalls, and the 0x01 you see as the first argument to write is likely stdout (0 would be stdin, and 2 would be stderr). That call returns an error, so it appears stdout was closed somehow? Maybe review your file descriptor management then?

1 Like

Thank you all for your help. I’ve now figured this issue out. The GenericThread code was making certain assumptions about the subclass’ lifecycle. By ensuring the semaphore could only be deleted under the correct conditions, and registering an exit function to shutdown stdin, I’ve now resolved that issue.

I’ve got a separate issue now :rofl: Always the way! I’ll figure that out though.

Builds are now blindingly fast in Paladin :slight_smile: I’ll check in the changes when all the bugs are ironed out. Amazing some of them didn’t occur earlier. Should make using Paladin a whole lot more reliable now.

6 Likes