ICU in Haiku

nephele · June 30, 2024, 3:41pm

I’ve not used Google in years, so where does the “reliance on google search” in my comment come from?

KapiX · June 30, 2024, 3:51pm

I am not going to dispute that; I have noticed that Google requires more and more manipulation to find the result that you want.

But please give me some credit, I did more than just typing “icu bloated” and being satisfied with nothing.

What I found is one random comment on Hacker News (with similar “nobody needs the tables”), and a LibreOffice issue where they want to cut out the data they don’t use (and that’s a reasonable proposition actually).

I’m not convinced that 2 results means “widely known”.

I would expect an article, blog post, maybe some benchmark, and more than one. Or even more comments. If you have links to those, provide them.

cb88 · June 30, 2024, 3:52pm

Fair enough.

VoloDroid · June 30, 2024, 4:16pm

Well, it does “know” now, as a Google search for icu library bloat returns this thread as the 2nd and the 4th results.

This! Anyone involved in the AOSP development (or any Android app developer who once tried to figure out how plural rules work under the hood) knows it very well.

Apparently, the only viable option to completely replace the ICU4C functionality is ICU itself, just the one written from scratch, the ICU4X library.

Quoting ICU4X devs:

But… before one starts celebrating and suggesting to include it in Haiku, we should open the release announcement on Reddit and read the title: “The Unicode Consortium announces ICU4X 1.0, its new high-performance internationalization library. It’s written in Rust, with official C++ and JavaScript wrappers available.”

So, the fact that it is written in Rust means we can integrate it into Haiku itself pretty much never. Okay, let’s not say never, but definitely not in the next 10 years. Till then I think we can keep using ICU4C and look for more important problems to solve. Case closed?

p.s.: as a bonus, here’s an interesting discussion between Flutter devs about the suggested migration to ICU4X, where one can find the word “bloat” mentioned a few times. Another library, libgrapheme, is also suggested there, and a Flutter maintainer in the last comment explains why it’s not a viable solution for them, and why they prefer to stick with ICU instead (possibly switching to ICU4X). It’s quite interesting to read: Migrate to ICU4X (?) · Issue #113400 · flutter/flutter · GitHub

cocobean · June 30, 2024, 7:34pm

Note:

2023-12-13: ICU 74.2 released with date/time formatting bug fixes.

The issue with using ICU alternatives is also in keeping track of them for the changes and long-term support…

Begasus · July 1, 2024, 5:27am

I’m guessing 74.2 can be added to the depot (without dropping the existing ones), 74.1 is still needed for Haiku, so we can’t drop that for now. 75.0 has already been added.

cafeina · July 1, 2024, 5:34am

Could anyone who is able to do it run a performance test of the ICU library and how it might impact Haiku load time? The idea of a “fit ICU library” with separate “ICU data files” could be interesting if it’s proven that this approach has less performance impact than a “monolithic ICU”.

cocobean · July 1, 2024, 4:25pm

Yes. Compare load time and memory usage between the Haiku hrev57801 nightly image versus R1B4. Check for any ICU-related issues (validate any major concerns with the ICU devs, moreso).

Ref: ICU - International Components for Unicode - ICU 73
“ICU 73 … reduces C++ memory use in date formatting…”
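A comparison like the one suggested above could start from a small timing harness run on both images. This is only a sketch: `time_command` is a hypothetical helper, the default command is a placeholder to be replaced with a short-lived program that links against ICU, and it measures wall-clock start-to-exit time only, not memory usage.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Return the median wall-clock seconds needed to run cmd to completion."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so terminal I/O does not skew the timing.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Placeholder: pass the command to measure on the command line,
    # otherwise an empty Python run is timed as a baseline.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print(f"median start-to-exit time: {time_command(cmd):.4f} s")
```

Running the same command on the nightly and on R1B4, on the same hardware, gives two medians to compare; the first run after boot will include disk cache effects, so it is worth repeating the whole measurement.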

marcoapc · July 1, 2024, 5:49pm

The libgrapheme website has an interesting comparison of libraries:

https://libs.suckless.org/libgrapheme/

cocobean · July 1, 2024, 6:11pm

Nice… stuff relative to musl; libc too…

Evaluate current state (Y2024?!?)…as last code changes were Y2023. Although, this does help someone’s previous statements on bloat and other things with this reference:

Ref: libraries | suckless.org software that sucks less

"Motivation

The goal of this project is to be a suckless and statically linkable alternative to the existing bloated, complicated, overscoped and/or incorrect solutions for Unicode string handling (ICU, GNU’s libunistring, libutf8proc, etc.), motivating more hackers to properly handle Unicode strings in their projects and allowing this even in embedded applications.

The problem can be easily seen when looking at the sizes of the respective libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a, libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring (libunistring.a) is around 2MB, which is unacceptable for static linking. Both take many minutes to compile even on a good computer and require a lot of dependencies, including Python for ICU. On the other hand libgrapheme (libgrapheme.a) only weighs in at around 300K and is compiled (including Unicode data parsing and compression) in under a second, requiring nothing but a C99 compiler and POSIX make(1).

Some libraries, like libutf8proc and libunistring, are incorrect by basing their API on assumptions that haven’t been true for years (e.g. offering stateless grapheme cluster segmentation even though the underlying algorithm is not stateless). As an additional factor, libutf8proc’s UTF-8-decoder is unsafe, as it allows overlong encodings that can be easily used for exploits."
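For readers unfamiliar with the term, an "overlong encoding" is a UTF-8 sequence that uses more bytes than the code point needs, for example `C0 AF` for `/` (U+002F). The sketch below is a minimal strict decoder for a single code point, written only to illustrate the check; `decode_one` is a hypothetical helper and is not taken from libgrapheme, libutf8proc, or ICU.

```python
def decode_one(data):
    """Decode the first code point of a bytes object.

    Returns (code_point, length). Raises ValueError for malformed,
    overlong, or out-of-range sequences.
    """
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if b0 < 0x80:
        return b0, 1
    # For each sequence length, 'minimum' is the smallest code point
    # that genuinely needs that many bytes.
    if 0xC0 <= b0 <= 0xDF:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF7:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum:
        # The value would have fit in a shorter sequence.
        raise ValueError("overlong encoding")
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("code point out of range")
    return cp, length
```

A decoder that skips the `cp < minimum` check would accept `C0 AF` as `/`, which is how overlong forms can slip past filters that only look for the one-byte `2F`.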

marcoapc · July 1, 2024, 7:19pm

The creator of the libgrapheme project explains that development is slow because of refinement and care for the source code:

github.com/flutter/flutter

Migrate to ICU4X (?)

opened 06:22PM - 13 Oct 22 UTC

dnfield

engine a: internationalization P2 team-engine triaged-engine

ICU4X has released a stable product that can be consumed as a static lib via generated C++ headers. It promises some increased modularity and potentially a smaller binary. We should consider migrating to it.

https://home.unicode.org/announcing-icu4x-1-0/
https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/cpp.md

Some things unclear to me:

- We'd need a Rust toolchain for this.
- I imagine the API migration isn't entirely trivial.
- We'll have to update tests etc.

But some nice things:

- ICU4X supports dynamically loading locale-specific information out of the box. This should be able to save us a good chunk of binary size for at least some users.

VoloDroid · July 1, 2024, 8:59pm

It’s nice, but it’s nowhere near a full replacement of ICU. It’s a string-handling lib (word/sentence/grapheme segmentation and similar things). But no plural rules, no date formatting, etc.

PulkoMandy · July 2, 2024, 6:37am

There would be no problem integrating it, since it has a C++ wrapper. We already have Rust running, and this would be built at haikuports as an external package (as ICU already is). So, yes, if this is actually smaller and faster, and already implements the features we need, let’s do that.

It will, however, mean that Haiku itself is running on ICU4X, but all 3rd-party code (like WebKit), will be still requiring ICU4C, since the two projects are not API compatible. And so we end up shipping both ICU4C and ICU4X in Haiku releases, and in the end the overall total is larger and uses more memory. Oops.

SCollins · July 2, 2024, 11:12am

Google search is an advertisement device, Google is an advertising agency, Google is also a product and services retailer/wholesaler.

Google doesn’t care about code bloat, because it saves them development spending.

As to ICU, while 32 MB might be a big deal on a 486-era machine, it’s not on anything remotely modern built in the 21st century, and while I understand your concerns, premature optimization will stall forward progress on other aspects of the project.

As always though, I’m sure patches are welcome.

nephele · July 2, 2024, 11:18am

Adding Rust as a dependency requires an additional toolchain to port to any architecture. I don’t think that is a good idea.

cb88 · July 2, 2024, 10:21pm

I think that is probably true on the OS library side of things; however, the application side is also quite heavy. It would require benchmarking applications targeting either one to determine the difference, though. Indirectly related, but musl distros such as Alpine use a lot less RAM than glibc ones. Some have pointed out that ICU is already there, but what is the overhead per application, e.g. runtime overhead? Being dismissive of things like that is why roughly the same computing today starts at 300MB of RAM, where BeOS itself ran in less than 1/10th of that… so the bloat does exist, and it probably cannot be squarely laid on just new features. The same kind of dismissive logic is used to excuse Windows for using 4GB (not just cached but resident; if it were just cached, nobody would be having this conversation).

cb88 · July 2, 2024, 10:23pm

Perhaps, but Rust ports do exist for any relevant architecture. I mean, my pet architecture SPARC might not be too stable, but again… relevance being the key word, does it even matter?

I mean, it might be nice to use uutils instead of the C/C++ ones… and eliminate decades of bugs that have just continued to crop up in those. It’s not like Rust is required at runtime.

SCollins · July 2, 2024, 10:49pm

I’m not saying that it’s not a valid concern, but at some point, the functionality required creates the memory footprint. The memory usage you’re referring to is blatantly tracking information and the like, used for data harvesting.

cb88 · July 2, 2024, 10:57pm

Possibly, but then there is functionality added that isn’t required also, like the recent Linux debugger bitmap logo support… really? HPC has resorted to things like McKernel (it does something like running Linux on x number of cores for compatibility and services, plus McKernel on the rest with a compatible API to service high-performance, low-jitter threads).

marcoapc · July 3, 2024, 1:13pm

We need to know what we are using from the ICU, so we can look for alternatives or use .dat files from the ICU.

Using a Rust application will cause more bloating and require a lot of extra work.

I believe the best solution is to identify how we use the ICU, use Haiku’s native software to test, make a comparison with the .dat file, using ICU and other software together, for example libgrapheme, so we can define a path.
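As a first step towards knowing what is actually used, one could count which ICU headers a source tree includes. The script below is a rough sketch under a few assumptions: ICU4C public headers are included as `unicode/<name>.h`, only the listed file suffixes are scanned, and `count_icu_includes` is a hypothetical helper name. Header usage is only a proxy for API usage, so the result is a starting point rather than a definitive inventory.

```python
import collections
import os
import re
import sys

# Matches lines such as: #include <unicode/ucol.h>  or  #include "unicode/datefmt.h"
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]unicode/([\w.]+)[>"]', re.MULTILINE)
SOURCE_SUFFIXES = (".c", ".cpp", ".h", ".hpp")


def count_icu_includes(root):
    """Count how often each unicode/*.h header is included under root."""
    counts = collections.Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(SOURCE_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            # Tolerate odd encodings; only the include lines matter here.
            with open(path, encoding="utf-8", errors="replace") as handle:
                counts.update(INCLUDE_RE.findall(handle.read()))
    return counts


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for header, uses in count_icu_includes(root).most_common():
        print(f"{uses:5d}  {header}")
```

Pointing it at a checkout of the Haiku sources (path given as the first argument) lists the most frequently included ICU headers first, which hints at which parts of ICU, and which parts of its data, would need a replacement.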