ICU in Haiku

nephele · June 15, 2024, 8:14pm

Why would using ICU be a bad thing for a browser?

nipos · June 15, 2024, 8:18pm

It’s not that ICU itself is a bad thing, but Serenity using only its own libraries from one repository made porting significantly easier, as I didn’t have to worry about any external dependencies at all.
The big error wall about ICU stuff went away by replacing icu66_devel with icu74_devel, so that one was quite easy in the end.
I expected it would take hours to get those dozens of errors solved.

cb88 · June 19, 2024, 3:58pm

Because ICU itself is ultra bloated. It also uses STL heavily and its usage of it is bloated. By comparison the competition is about 60x smaller.

nephele · June 19, 2024, 4:15pm

Haiku already uses ICU, so adding anything else is an additional dependency, not one dependency fewer.

If there are good reasons to remove the dependency in Haiku we can probably look into that though.

cb88 · June 19, 2024, 4:20pm

As I said already: usage of ICU is itself bloated, beyond the bloat in the library itself, due to STL bloat.

It may be Haiku is not additionally bloated but someone would have to evaluate that.

X512 · June 19, 2024, 5:53pm

ICU is indeed bloated and it contributes a lot to Haiku’s memory usage and installation image size. The C++ library is less bloated because it does not contain heavy language tables.

marcoapc · June 20, 2024, 1:32am

Here is UTF8-CPP, an alternative to ICU, along with comparisons with other alternatives:

PulkoMandy · June 20, 2024, 7:29am

ICU is not just about handling UTF-8. A lot of its size is actually data: timezone data, name of all languages in the world translated into every other language, and so on.

This is not bloat, it’s data. You can’t magically make it smaller unless you tell people to stop using some languages (not a great idea, right?)

It is very annoying to get a vague reference to “the competition”. What competition is it?

cb88 · June 20, 2024, 1:49pm

Literally anything else that processes UTF8… because nobody else uses the bloated tables or STL. There really isn’t any reason to be more specific than that… it’s just that bad. I certainly haven’t done the comparison between alternatives to suggest a best alternative, thus my original statement suggesting that someone would need to do so… if ICU were left by the wayside. Most of the ones I am aware of are in the 0.5 to 1.5MB range, tending toward the former.

Also, yes, that literally is bloat… where exactly do you use the names of other countries translated? Nowhere. OK, I can see maybe localizing a globe app, but even then… that’s the globe app’s job, not the native character library’s job… thus bloat. Because in the one place you use that, you want the drop-down to be natively readable, not localized.

Why store “data” that you know isn’t going to be used… you have this 30MB library that is 98% unused “data”.

marcoapc · June 20, 2024, 2:57pm

ICU doesn’t seem to do work efficiently and has a lot of unnecessary stuff on board.

PulkoMandy · June 20, 2024, 3:41pm

We don’t use ICU for anything even remotely related to UTF-8, so I’m confused. We handle UTF-8 ourselves in BString and BUnicodeChar, with only a few things using ICU because it makes no sense to reimplement it when we already have ICU doing it (things like determining if a character is a number or not, which depends on the language and is more complicated than just including the digits 0 to 9).
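To illustrate why "is this character a number?" is more than checking for 0 to 9, here is a minimal sketch. It uses Python's standard `unicodedata` module as a stand-in, not ICU or Haiku's BUnicodeChar, and the helper name `digit_value` is made up for this example.

```python
import unicodedata


def digit_value(ch):
    """Return the decimal value of a single character, or None if the
    Unicode database does not classify it as a decimal digit."""
    return unicodedata.decimal(ch, None)


# ASCII 7, Arabic-Indic digit three, Devanagari digit three, and a letter.
samples = ["7", "\u0663", "\u0969", "x"]
for ch in samples:
    print(repr(ch), digit_value(ch))
```

The two non-ASCII samples are decimal digits in their scripts, so a check limited to the characters 0 to 9 would miss them.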

We use it for formatting dates and times and units in various languages (translating a value to “2 hours, 30 minutes and 40 seconds”), for handling complex plural rules in localization (is zero singular or plural? it depends on the language. Some languages have rules very different from English, with several types of plurals). We use it for the timezone database, for country and language names, and so on.
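As a rough illustration of the plural-rule point, the sketch below hand-codes simplified rules for three languages. This is an assumption-laden approximation for small non-negative integers only: the function name `plural_category` is hypothetical, the rules are simplified from the CLDR plural categories, and the real data (which ICU ships) also covers fractions, very large numbers, and many more languages.

```python
def plural_category(lang, n):
    """Return a simplified CLDR-style plural category for a small
    non-negative integer n. Illustration only, not the full rules."""
    if lang == "en":
        # English: only exactly 1 is singular, so 0 is plural.
        return "one" if n == 1 else "other"
    if lang == "fr":
        # French: 0 and 1 both take the singular form.
        return "one" if n in (0, 1) else "other"
    if lang == "ru":
        # Russian: three forms chosen from the last one or two digits.
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "many"
    raise ValueError("no rule sketched for language: " + lang)


for lang in ("en", "fr", "ru"):
    print(lang, [plural_category(lang, n) for n in (0, 1, 2, 5, 11, 21)])
```

Even with three languages the answer to "is zero singular or plural?" already differs, which is the kind of table a localization library has to carry for every supported language.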

Of course, if all you need to do is determine whether a sequence of bits is a valid UTF-8 string, ICU is not the right tool. But if you are going to embed ICU for the 10000 other things it does, you may as well use its support classes when you need them.
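For comparison, plain UTF-8 validation really is a small job. A minimal sketch, relying on Python's built-in strict UTF-8 decoder rather than ICU or any Haiku API (the function name `is_valid_utf8` is just for this example):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("Haiku".encode("utf-8")))  # plain ASCII text
print(is_valid_utf8(b"\xe2\x82\xac"))          # euro sign, three bytes
print(is_valid_utf8(b"\xc3\x28"))              # bad continuation byte
print(is_valid_utf8(b"\xc0\xaf"))              # overlong encoding
```

None of this needs locale data, which is why a UTF-8-only library can be tiny while a localization library cannot.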

In the Locale preferences (locales, timezones), and this API is used in various other places as well.

PulkoMandy · June 20, 2024, 4:00pm

I don’t need random forum posts based on “seems to” and handwaving. Give me a library that implements all the things we use from ICU more efficiently, and we’ll happily switch to it.

I’m sure you will find out that it is a complex topic, that ICU does, in fact, do a lot of things and actually is pretty efficient at them, and that this explains why basically everyone uses it for that job.

cb88 · June 20, 2024, 4:15pm

Stop being so negative.

Secondly, if we only use a few things from it, tzdata, really? That should not be in ICU… IANA provides tzdata. The Time Zone Database is about half a MB including code.

KapiX · June 20, 2024, 4:56pm

Let me sum up: you criticize usage of a project, knowing almost nothing about how it is used, making a bunch of vague statements, and when someone explains how you are wrong, you complain about negativity.

Does IANA provide plural rules and other things PulkoMandy has mentioned, or do you propose using IANA for timezones, some other library for UTF-8 stuff and ICU for everything else? How is that not bloat?

If you want to have a constructive discussion, please make an inventory of things ICU is used for in Haiku, then make a list of replacements for all of those uses, and then share it here as a starting point. And maybe read a few “Falsehoods programmers believe about…” articles.

Currently your attitude can be described by “I don’t see any use for this data so no one needs it” which is arrogant to say the least.

PulkoMandy · June 20, 2024, 6:22pm

I will add that the data is, indeed, available without ICU from the CLDR project: Unicode CLDR - CLDR Releases/Downloads

It is a 30MB ZIP file. So, sure, you can remove ICU (14MB package), and use the newly available -16MB to write new, smaller and lighter code using the same data source. Personally, I don’t know how to write code with a negative size. But if you are smarter than me and the ICU developers, maybe you can.

marcoapc · June 20, 2024, 7:13pm

Is it necessary to depend on a single library for everything? Could we look at what is actually used and select more optimized libraries?

If the project relies on ICU because it needs a single library for everything, that is a matter of project design.

Using a library does not make it more efficient or give you better results; it is a matter of developer comfort, not a technical issue. Examples are xz with its backdoor and OpenSSL with Heartbleed: there is no real check of the source code’s security or quality, and large projects simply trade security for convenience.

PulkoMandy · June 20, 2024, 7:39pm

In the end, ICU will end up being installed anyways because, for example, WebKit depends on it (but not only that, various other things like Qt and so on). So, it’s hard to remove anything here. You may add more, supposedly less bloated libraries, but that’s just adding more implementations of the same thing.

I’m not sure what you mean here. XZ is a 1-person project, and that’s how it could be compromised and backdoored (and that was caught before it could do any serious damage). Openssl at heartbleed time was similar, a very small project with only a few people trying to maintain it.

Developers on such understaffed projects put the little time they have into them. You can’t blame them for the security problems, unless you paid them to provide a security guarantee. All open source licenses have a disclaimer for this (and any other problems). Other projects chose to package these libraries and accept these risks, because that’s still better than trying to rewrite everything. If I wrote an SSL implementation from scratch, it would surely be far less secure than OpenSSL. If I wrote a localization library, it would surely be a lot worse than ICU. Not because I’m a bad developer, but because I only have a vague, surface-view idea of how these things work, and also because I am not interested in getting a better idea. I already have enough things to do working in other areas, and building on the work of other people.

ICU was already discussed several times. There are ways to make its disk usage smaller. The first and most obvious one would be to research how to share the same ICU data between gcc2 and gcc13; currently we ship two versions. Likewise, we now ship multiple versions for gcc13, because various things in haikuports are not updated to the latest one. And, finally, we could actually look closer at what’s in the data and whether it’s possible to rebuild it with less data. But that’s not something I’m willing to spend hours on just to save maybe 5MB off it.

SCollins · June 20, 2024, 8:33pm

It’s 2024, why on earth are people obsessing over 30mb ?? Even rpi has orders of magnitude more storage, it’s not 1985 again.

Breathe

suhr · June 20, 2024, 8:52pm

It’s 2024, why on earth are people obsessing over 30mb

The reason why Haiku is so light is because people care.

marcoapc · June 20, 2024, 8:53pm

I agree, it is complicated to remove ICU completely, though it may be possible to do so in some software if necessary.

The problem is not having a single dev or a small team, but the quality of the source code and the lack of help from those who use the software as a final product. Large Linux companies use it, as do distros with security teams that are supposed to audit the source code, but it seems that this does not correspond to reality. xz is compression and decompression software with no main focus on security, so it was not treated as a critical area; it was used to exploit the systemd framework, a piece of software that does everything and at the same time nothing. OpenSSL is focused on security, so there should have been attention to auditing the source code, but there was no help in cleaning up obsolete code, not to mention the complexity of the code and the possibility of hiding a backdoor.