Haiku, MBR and formatting SSDs

I just discovered a way to format an SSD properly in Haiku.

For those who don’t know about SSDs, the most basic read/write internal operations work on 4K blocks. This causes two problems.

[Note: this is a simplified version of what really happens inside an SSD and its controller.]

  1. Reads are always really quick no matter how a partition is set up on an SSD. But if a partition is formatted with a sector/block size smaller than 4K, then every write operation requires the SSD controller to read in the affected internal block, modify it with the data to be written, then write out the changed block. If instead the partition is formatted with 4K or 8K sectors, then all writes can be done directly to the affected blocks without any read/modify operations being needed. (This is of course faster.)

However the above only works properly if the start of a partition is in alignment with the 4K blocks of the SSD.

  2. So this brings us to the heart of the matter: the MBR is written to the start of the drive and the partitions are written after it. Testing both Haiku and Windows 7, I found the MBR to be the standard 512 bytes in size. This means all sectors are offset by 512 bytes from alignment with the 4K blocks. This is easy to see when you try to create your first partition: the dialog box clearly shows that the starting offset of the partition is 512 bytes.

In the past, when sector sizes of 256 and 512 bytes were in use, any write would always fall inside a single 4K block, but Haiku uses 1K, 2K, 4K or 8K sectors for its partitions. This means sectors will cross the 4K boundary (1 in 4 for 1K sectors, 1 in 2 for 2K ones, every single time for 4K sectors, and twice each time an 8K sector is written).
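
The ratios above can be checked with a little arithmetic. This is only a sketch under the simplified model used in this post (4K internal blocks, partition starting 512 bytes into the drive); the function names are made up for illustration. Note that an 8K sector crosses one 4K boundary even when aligned, so only the second crossing is caused by the offset.

```python
PAGE = 4096  # internal SSD block size assumed in this post


def boundaries_crossed(start, length, page=PAGE):
    """Count the page boundaries that fall inside [start, start + length)."""
    return (start + length - 1) // page - start // page


def average_crossings(sector_size, offset=512, count=64, page=PAGE):
    """Average boundaries crossed per sector for a partition starting at offset."""
    total = sum(
        boundaries_crossed(offset + i * sector_size, sector_size, page)
        for i in range(count)
    )
    return total / count


for size in (1024, 2048, 4096, 8192):
    # Prints the sector size and the average crossings per sector write.
    print(size, average_crossings(size))
```

With the default 512 byte offset this gives 0.25, 0.5, 1.0 and 2.0 crossings per sector for 1K, 2K, 4K and 8K sectors.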

Why is this so bad? Well, say we format a partition with 4K sectors. If the partition were aligned, every write would be a single write operation. But since the MBR forces all the partitions out of alignment, each write requires a read/modify/write operation on one block, and since there are still 512 bytes left over, the drive needs to do another read/modify/write operation on the next block to write out that final chunk of data. See the problem?

The solution: What we need is some way to write out a 4K MBR. This turns out to be easy to do with the Haiku program “DriveSetup”.

[Warning: The following procedure will wipe out ALL DATA off the drive. Do not do this to your boot drive unless you have another way to boot your system. And remember all data on the drive will be lost.]

First, delete any partitions or mapping that may be already installed on the drive until you have only the single entry for the drive itself in your DriveSetup window.

Select the drive to be partitioned/formatted.

Choose to format the entire drive as “Be File System”.

Choose the sector size as 4K.

Agree to the formatting. Wait until finished.

Now “Initialize” the drive with “Intel partition mapping”.

Now if you select the following “” entry and try to “Create” a partition you will find that the offset is now at 4096 bytes. And since partition sizes are multiples of 1 MByte, all the following partitions are also aligned.
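
A quick way to convince yourself of the last point, as a sketch only: the partition sizes below are made-up examples, and the only assumptions are the 4096 byte starting offset seen in DriveSetup and sizes in whole MBytes.

```python
PAGE = 4096
MIB = 1024 * 1024


def partition_starts(first_offset, sizes_in_mib):
    """Return the start offset of each partition laid out back to back."""
    starts = []
    offset = first_offset
    for size in sizes_in_mib:
        starts.append(offset)
        offset += size * MIB
    return starts


def all_aligned(starts, page=PAGE):
    """True if every start offset is a multiple of the page size."""
    return all(start % page == 0 for start in starts)


sizes = [100, 2048, 512]  # hypothetical partition sizes in MBytes
print(all_aligned(partition_starts(4096, sizes)))  # first partition at 4096 bytes
print(all_aligned(partition_starts(512, sizes)))   # first partition at 512 bytes
```

Because 1 MByte is itself a multiple of 4096 bytes, whatever alignment the first partition has is inherited by every partition after it.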

Good Luck everyone,

Questions?

there is a better solution, i think. MBR is a thing from the past and honestly only windows has that instead of using GPT. So Haiku should use GPT and drop MBR support.

According to GUID Partition Table - Wikipedia is the same as GPT (correct me if I am wrong).

However, “DriveSetup” does give you the option to initialize the drive with a GUID. When I tried it the mapping was 17 KBytes in size! This of course causes the same problem as the smaller MBR table: you still end up crossing the 4K block boundaries during write operations.

There is more to this problem than just changing blindly to another mapping solution.

Since I don’t use GPT, is there some way in Haiku to move the start of a partition to a 4K boundary with the GUID? How hard is it to do if the answer is yes?

Nope that’s quite right.

However the partitioning scheme is largely orthogonal to the partition alignment. In the past the usual practice was to align to (mythical, arbitrarily assigned) cylinder number, but today a modern OS will usually choose 1MiB alignment since drive sizes make the “waste” of a few hundred kilobytes largely unimportant. Haiku doesn’t currently do that.
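
As a rough illustration of what 1MiB alignment means in practice (a sketch, not how any particular OS implements it; `align_up` is just a name picked for this example), the first usable offset after the partition table is simply rounded up to the next 1MiB boundary:

```python
MIB = 1024 * 1024


def align_up(offset, alignment=MIB):
    """Round offset up to the next multiple of alignment."""
    return -(-offset // alignment) * alignment


# End of a 512 byte MBR, end of a 17 KByte GPT, and an already aligned offset.
for end_of_table in (512, 17 * 1024, MIB):
    print(end_of_table, align_up(end_of_table))
```

Since 1MiB is a multiple of 4096, any start chosen this way is also aligned to 4K physical sectors, whichever partitioning scheme sits in front of it.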

I think you’ve missed the mark with the title for this thread, the issues you’re trying to address are called Advanced Format or Long Sector, and they’re not really related to SSDs. In fact the push to have long sectors comes from the spinning rust industry, a considerable fraction of the usable surface of a modern hard disk is wasted on error correction, using longer sectors means a higher percentage is available for user data.

Advanced Format presents two issues to an operating system. One of them is alignment for best performance with so-called 512e drives, where the physical sectors are 4096 bytes but the logical sector size remains 512 bytes for compatibility. All operating systems work with these drives, but for best performance an OS should ensure when installing or creating new partitions that they’re properly aligned. Most of today’s operating systems do this, but Haiku doesn’t out of the box yet and your tip addresses that.

The second phase is called 4Kn, 4096 byte native logical sectors. Operating systems need to understand how to address devices with 4096 byte sectors, an OS that can’t support this won’t work with 4Kn drives, it will either not recognise the disk or corrupt data. There are starting to be 4Kn drives on the market, but right now they’re not the majority because of the compatibility concern.

Throughout none of this really impacts SSDs except from the point of view of modifying firmware to handle the newer SCSI or ATA commands. The native storage in SSDs is the “erase block” which is far larger than any disk sector (often 64kbytes). SSDs and other flash storage for consumers always hide the erase blocks behind a firmware layer that does read-modify-write, just as 512e hard disks do for unaligned writes, but in an SSD there’s no latency overhead from waiting for an additional rotation. The difference in size between logical sectors and erase blocks is what TRIM and similar commands are all about - by informing the drive of unused logical sectors it can make wiser choices about how to re-assign erase blocks, destroying data that the OS doesn’t care about rather than using resources to preserve it unnecessarily.

Okay, reading the page http://en.wikipedia.org/wiki/GUID_Partition_Table I see that in the partition table header (LBA 1), at offset 80, the size of the array of partition entries can be set smaller. (Based on 128 bytes per entry I calculate 120 partition entries possible.)

This should create a GPT that is only 16 KBytes in size.
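
For anyone who wants to check the numbers, here is the arithmetic as a small sketch. It assumes 512 byte logical sectors, one sector for the protective MBR (LBA 0), one for the GPT header (LBA 1), and 128 bytes per partition entry; the function name is made up.

```python
SECTOR = 512      # logical sector size assumed here
ENTRY_SIZE = 128  # bytes per GPT partition entry


def gpt_front_size(num_entries, sector=SECTOR, entry_size=ENTRY_SIZE):
    """Bytes used at the start of the disk: protective MBR + header + entries."""
    return sector + sector + num_entries * entry_size


print(gpt_front_size(128) / 1024)  # usual 128 entries
print(gpt_front_size(120) / 1024)  # reduced to 120 entries
```

With the usual 128 entries this comes to 17 KBytes, which matches what DriveSetup showed, and with 120 entries it comes to exactly 16 KBytes.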

Would this be worthwhile suggesting to the developers?

i guess it’s worth mentioning to the devs, indeed.

SSDs don’t have good compatibility with an MBR setup, i think it would be best to have GPT.

I will do that later today then.

I see what your concern is here. In practice the total amount of flash wear will be higher for such small writes at the filesystem layer anyway because the filesystem metadata (which is held elsewhere on the partition) must typically be updated too. Typically a last modified date is updated, and if the write was an append it will update the file’s notional length too. So overall I doubt you will have as much effect on overall lifetime as you might hope here, but it certainly can’t hurt.

Sure, I understand this consideration, but it’s a relatively modest fix. For something like Windows XP they have the excuse that the entire operating system pre-dates this issue. Even if Haiku R1 actually ships in the next few years it will post-date Long Sector and a reasonable person would wonder why nobody had the sight (for we’re not talking about foresight here) to get it working.

It’s not a matter of choice for Haiku. Communication with the actual drive is always done in terms of the drive’s logical sector size, the OS gets no choice in the matter. For both legacy 512n and the currently popular 512e drives the logical sector size is always 512 bytes. So what you call “faking it” is the normal mode of operation on these drives and the drive is responsible for read-modify-write to cope with any short writes. For 4Kn the logical sector size is 4096 bytes, there’s no way to read or write smaller sectors, since no smaller sectors exist in these drives.

To function correctly on 4Kn drives Haiku needs to ensure that every “layer” in the storage part of the OS understands that logical sector size is arbitrary per drive and not always 512 bytes, and that correspondingly all writes smaller than the logical sector size of the drive must be subject to read-modify-write in the OS if appropriate.

These calculations are a little “off” because as we saw above in a 512e drive (which is where this matters) the actual writes are always 512 byte granularity. So e.g. a 512 byte write always costs you one read-modify-write cycle, alignment comes into play only for writes larger than 512 bytes.

In a system like Haiku the “block cache” writes block-sized quantities of data to the disk most of the time. Right now these blocks are always as big as, or larger than the logical sector size so the OS just issues the appropriate number of sector writes to the drive. But on spinning rust with 4096 byte physical sectors, or SSDs with 4kbyte physical pages the blocks may actually be smaller than the sectors, in this case the drive performs read-modify-write.

In 4Kn this isn’t the drive’s problem, it accepts only 4096 byte logical sectors, so the OS is responsible for deciding whether to read the previous contents of a sector before writing new contents or to simply pad the data it has to the proper length. In many cases (e.g. most filesystem operations) the latter is entirely satisfactory.
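
The two choices look roughly like this. It is only a sketch: an in-memory `io.BytesIO` stands in for a 4Kn drive, the function names are invented for the example, and nothing here reflects how Haiku’s storage layers are actually written.

```python
import io

SECTOR = 4096  # logical sector size of the pretend 4Kn drive


def write_padded(dev, lba, data, sector=SECTOR):
    """Write a short buffer as one full sector, zero-padding the remainder."""
    if len(data) > sector:
        raise ValueError("data does not fit in one sector")
    dev.seek(lba * sector)
    dev.write(data.ljust(sector, b"\0"))


def write_rmw(dev, lba, offset, data, sector=SECTOR):
    """Read-modify-write: keep the rest of the sector's previous contents."""
    if offset + len(data) > sector:
        raise ValueError("data does not fit in one sector")
    dev.seek(lba * sector)
    buf = bytearray(dev.read(sector))       # read the whole sector
    buf[offset:offset + len(data)] = data   # modify the affected bytes
    dev.seek(lba * sector)
    dev.write(buf)                          # write the whole sector back


dev = io.BytesIO(b"\xff" * (2 * SECTOR))  # two sectors of old data
write_padded(dev, 0, b"new block")        # old contents of sector 0 are discarded
write_rmw(dev, 1, 10, b"patch")           # old contents of sector 1 are preserved
```

Padding costs one write and is fine when the old contents do not matter; read-modify-write costs an extra read but is the only safe option when they do.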

I was not aware of that being strictly the case, but it seems you are right.

[quote=NoHaikuForMe][quote=Earl Colby Pottinger]
According to GUID Partition Table - Wikipedia is the same as GPT (correct me if I am wrong).
[/quote]

Nope that’s quite right.
[/quote]

Thank you.

In fact I noticed that was mentioned on the WIKI page. I am going to test this out.

[Later]

Okay, I confirmed that Windows 7 does start its first partition at the 1 MByte offset. As Haiku’s MBR can be made to work at the 4 KByte offset and the GPT looks like it can be made to work at the 16 KByte offset, this 1 MByte offset looks like an extreme version of future proofing to me. However I can’t blame them, still far better to be too big than too small.

Yes, the issues do have to do with SSDs as well. The speed gains are a lot smaller with SSDs than with hard drives, but they do exist. Remember I originally said I was only giving a simplified version of why you want to align the writes, there are a lot more details to go into as to why to do this. More important to me, when you consider the cost per GByte for SSDs vs hard drives, tuning the SSDs for performance and longer life is still more important.

On a personal computer SSDs are for doing work faster. Hard drives are for bulk storage.

Yes, the tip does address this. But more important, it addresses the fact that most SSDs write into 4 KByte data blocks. As I pointed out before, writing across the data block boundaries gives a performance hit. Since Haiku-OS is not a super-high-speed system, it is unlikely that the speed difference will be noticed.

But worse, you end up using up the erase block faster: if you have a number of small writes (4 KBytes or less), each write will use up two data blocks instead of the single data block that would be replaced on an aligned drive.

I am concerned with the off-the-shelf SSDs that I expect other Haiku users to buy today. My three year old Intel SSD and my brand new Samsung SSD were easy to buy and install today. The 4Kn drives that you are talking about are not what most users need or plan to handle right now.

Haiku-OS does support 4 KByte sectors since I can set the sector size of the partition to that size when I create a partition. Note: it does not matter if Haiku-OS writes out a single 4 KByte sector or fakes it writing out eight(8) 512 Byte sectors in sequence as the SSD controller will concatenate the writes together and treat it as a single 4 KByte write.

That concatenation is important since it also applies to larger sequential writes to the data blocks: you only need to worry about the starting block and the ending block writes crossing data block boundaries and using extra data blocks.

Basically, on unaligned SSDs using 4 KByte sectors you have 100% overhead in writes of 4 KByte or less to the erase block. 4 KByte+ to 8 KByte writes are 50% overhead and larger writes will have 2 data blocks extra overhead. So if you are doing lots of small writes you are wearing down your SSD faster if it is out of alignment.
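
Here is the same estimate as a sketch, counting how many 4 KByte data blocks a write touches when it starts on a sector boundary of an aligned partition versus one shifted by 512 bytes. It uses the simplified model from this thread, and the function names are made up.

```python
PAGE = 4096  # SSD data block size assumed in this thread


def pages_touched(start, length, page=PAGE):
    """Number of data blocks that the byte range [start, start + length) overlaps."""
    return (start + length - 1) // page - start // page + 1


def overhead(length, misalignment=512, page=PAGE):
    """Extra data blocks touched by an unaligned write, relative to an aligned one."""
    aligned = pages_touched(0, length, page)
    unaligned = pages_touched(misalignment, length, page)
    return (unaligned - aligned) / aligned


for size in (4096, 8192, 1024 * 1024):
    print(size, overhead(size))
```

This gives 100% for a 4 KByte write and 50% for an 8 KByte write. In this simple count a large write touches one more data block than it would on an aligned partition, with both its first and its last block only partly overwritten.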

[quote=NoHaikuForMe]
Throughout none of this really impacts SSDs except from the point of view of modifying firmware to handle the newer SCSI or ATA commands. The native storage in SSDs is the “erase block” which is far larger than any disk sector (often 64kbytes). SSDs and other flash storage for consumers always hide the erase blocks behind a firmware layer that does read-modify-write, just as 512e hard disks do for unaligned writes, but in an SSD there’s no latency overhead from waiting for an additional rotation. The difference in size between logical sectors and erase blocks is what TRIM and similar commands are all about - by informing the drive of unused logical sectors it can make wiser choices about how to re-assign erase blocks, destroying data that the OS doesn’t care about rather than using resources to preserve it unnecessarily.[/quote]

While erase blocks come in different sizes, the data blocks inside that are written to are presently all fixed at 4 KBytes in size.

The problem you are skipping over is that small writes that cross the data block boundaries cause multiple blocks (2) to be used up from the erase block for each write. In the case of lots of small writes that means erase blocks will be used up twice as fast compared to an aligned partition.

As you noted, Haiku-OS does not presently support TRIM, this makes it all the more important to do something to align partitions in order to invalidate data blocks as little as possible. Which is the purpose of this tip as a single step in the right direction.

There is also the unlikely problem that if Haiku-OS tries to write to an SSD at its top speed, the SSD will slow the system down while it handles the extra read/modify/write operations.

Below is what I have been reading and basing my ideas on.

http://storageconference.org/2010/Papers/MSST/Seppanen.pdf

My new SSD: Samsung SSD 840: Testing the Endurance of TLC NAND
Reviews and tech specs on SSDs: SSDs - Latest Articles and Reviews on AnandTech