BString and UTF8

Octopus · June 15, 2010, 10:33am

The BString class seems strange to me. Although it handles only char data, it promises to handle UTF8 strings. But whenever a char parameter is used, I wonder how to apply this to, e.g., a two-byte UTF8 character. Or is there also a char16-based BString alternative?

Octopus · June 15, 2010, 11:34am

I will give an example:

BString s = "costs are 0"; s += '€';

This will not work as expected.

It would be good to rewrite the BString class (internally still UTF8) so that, in a compatible way, it also accepts a kind of metachar instead of plain char parameters. Multibyte chars, which the compiler seems to translate correctly into multibyte integer values, could then be supported as well.

If you send me the BString source code, I would be glad to do that rewrite.

humdinger · June 15, 2010, 1:55pm

Hi Octopus!

I’m not sure how and what you’d like to improve with BString. There’s a guide on how to get the source, there’s also the SVN browser to have a look first. You may also want to join the haiku-developer mailing list.

That said, your above example works, if you use s += “€”; or s << “€”;

Regards,
Humdinger

Octopus · June 15, 2010, 2:39pm

Of course, I am about to join some of the developer channels.

And of course you could append a string instead of a (multibyte) char. But there is an overall problem: you have to know whether a char is a multibyte char or not. If you have only constants, this can be done. If you live in a country like the USA, where every letter is exactly one UTF8 byte, life is very easy. In Germany, for example, there are the letters “äöüÄÖÜß€”. In other countries there will be even more. The good thing is that the compiler treats chars like ‘€’ as a UTF8 multibyte metachar (integer). It would be no big problem to handle those metachars correctly wherever BString methods request chars as parameters. And it would look more natural, because it is what one intends to do. Currently those BString parameters are truncated to a meaningless byte without any warning about the loss of information.

P.S.: Another example:
#include <iostream>
#include <String.h>

int main()
{
	using namespace std;

	BString s = "Bärenstraße";

	cout << "ä: " << s.FindFirst('ä') << endl;

	return 0;
}

This will output 2 as the found position, which is erroneously one byte past the start of ‘ä’ in storage. The cause is the silent truncation of the char parameter in the FindFirst method.

PPS.: Multibyte characters might therefore also be located at very different places, because only 8 bits are compared.

Octopus · June 15, 2010, 2:56pm

Well, developers in, e.g., the USA or UK have it easy. All traditional chars are identical to their UTF8 counterparts. Here in Germany, for example, there are “äöüÄÖÜß€”. If you have them as constants, then you really are able to avoid using them as a char and replace such calls with an equivalent that takes a char string parameter. But this is not natural. Moreover, it leads to a lot of unwanted errors that are not understood (at first). Such a parameter, e.g. ‘ä’, is truncated to its last byte and might cause wrong results, (mostly) without warning. E.g.
BString s = "Bärenstraße"; s.FindFirst('ä'); will return a wrong position for the start of ‘ä’ in storage (confusion with other UTF8 chars that have the same last byte is possible, too). And if you are working with metachar variables, which you might need when living in Germany, things become even more complicated. Thus it would be helpful to have those BString methods work, in a compatible way, with metachar (int32) instead of single char parameters, which would require extending the routines a little.

X512 · June 15, 2010, 4:57pm

Don’t use chars. Use string:
BString s = "Bärenstraße";
int p = s.FindFirst("ä");
printf("%d", p); // Out: 1

fano · June 15, 2010, 6:53pm

I come from Italy, and for me “èéòàùì” are chars, NOT strings, so for me it is perfectly natural to write:

BString s = "Bèrenstraße";
int p = s.FindFirst('è');
printf("%d", p); // SHOULD Out: 1

int32 FindFirst (char c) const // Does this not work with UTF-8 chars? It should…
Find the first occurrence of the given character.
int32 FindFirst (const char *string, int32 fromOffset) const
Find the first occurrence of the given string, starting from the given offset.
int32 FindFirst (const BString &string, int32 fromOffset) const
Find the first occurrence of the given BString, starting from the given offset.
int32 FindFirst (const char *string) const // Why must I pass a “string” for a char (è)?
Find the first occurrence of the given string.

I don’t know if it can be that easy: char is in reality a synonym of “byte” (that is, 8 bits), so we need a method that can accept an int (or, if we want to be exotic, create a dedicated type wchar = int32)…
But will the trick of using ‘ì’ to indicate a “UTF-8 char” work?
Redefining char as int32 would, I fear, be nearly impossible for compatibility reasons…

fano · June 15, 2010, 8:10pm

This is a C++ Linux equivalent… and it works without tricks: ì is “a char” and is found at position 4… I did not have to use a string (that is, the char * = “ì” trick!).


#include <iostream>
#include <string>
using namespace std;

int
main(void)
{
	string testUTF8 = "Birìbò";
	char toFind = 'ì';

	size_t pos = testUTF8.find(toFind);
	if (pos == string::npos) { // find() reports "not found" as npos, not as 0
		cout << "The char is NOT found";
		return -1;
	}
	cout << "The char is found at position " << (int) pos << endl; // return 4 and it is right!

	return 0;
}

So Haiku’s FindFirst(char c) should work in an analogous way…

Octopus · June 15, 2010, 8:40pm

Not at all … 3 would be right, because indexing starts at 0.

PS.: Moreover, the variable toFind is not able to hold a multibyte value completely.

Octopus · June 15, 2010, 9:07pm

What sense does a UTF8-based BString class make if its methods are not able to handle multibyte chars as parameters? Then it would be more consistent to forget about UTF8. Instead, the class methods should be opened up to handle metachar parameters. That is what I have volunteered to do.

humdinger · June 16, 2010, 1:09pm

I’m just a newbie dabbler, so I don’t yet understand what’s the problem in using “é” instead of ‘é’. You may want to discuss your plans at the haiku-developer mailing list, that’s where the Haiku devs hang out. Then you can provide a patch and create a ticket for it on the bug tracker.

Regards,
Humdinger

Octopus · June 18, 2010, 9:37am

Any cooperative task in Haiku development seems to be too complicated for me. I still do not understand how things are organized here. Maybe a native English speaker would find it easier. I presume a short video showing a small example of how to interact might help some fans to contribute. For now I will stop posting my problems on this site and quietly keep watching how Haiku proceeds.

humdinger · June 18, 2010, 12:36pm

It’s really not that complicated.

1. (optionally) Present your issue and how you'd like to solve it on the developer mailing list and discuss it there.
2. Create a ticket in the bugtracker, announcing that you're working on it (maybe link to the mailing list discussion).
3. Get the relevant Haiku source, develop your changes while respecting the coding style and create a patch that you'll attach to your ticket. Either your patch is accepted or you're asked to work on it some more.

Regards, Humdinger

edit: PS: Don’t mistake these forum chats with the official development process. That is really done on the dev mailing list.

Zenja · June 18, 2010, 6:30pm

Do you realise the difference between ‘ä’ and “ä” in memory? The first item occupies sizeof(char) bytes, which is one byte. The second item is a zero-terminated byte array. Also, the reason your code worked under Linux was a coincidence that depends on the local user’s code page. On a typical US system, it would give different results than with a Western European code page. When you realise the difference, you’ve taken your first step towards understanding Unicode.

BString was created at a time when std::string was under-featured. Today it’s a different story. But then again, the entire codepage/unicode text handling facility in C based languages is also a mess. This is the price of legacy.

Octopus · June 18, 2010, 9:58pm

[quote=Zenja]… BString was created at a time when std::string was under-featured. Today it’s a different story. But then again, the entire codepage/unicode text handling facility in C based languages is also a mess. This is the price of legacy.[/quote] Therefore I suggested extending the BString class in a compatible way. Where pure char parameters are requested, metachar (int32) parameters should be supported as well. I had volunteered to do this. But it does not make much sense to put in such effort where the problem to be solved is not seen.

P.S.: ‘ä’ needs two bytes, ‘€’ will need three. They cannot be represented by char, but by metachar (int32).

humdinger · June 19, 2010, 5:23am

Maybe you missed what I added to my last post: “Don’t mistake these forum chats with the official development process. That is really done on the dev mailing list.” If you really want to work on the issue, you should consider taking it to the official list.

Regards,
Humdinger

Octopus · June 19, 2010, 9:19am

Well, I have subscribed to two mailing lists, but that does not enable me to post anything there. Logging on to those mailing lists is not possible, because passwords or authorization strings are missing.

humdinger · June 19, 2010, 9:55am

That’s odd. You just have to subscribe with the email address you’re going to use when posting to the list and confirm, once, the email sent to you by freelists.org. After that, all mails will be delivered to the account you specified, and posting should be a simple mail to haiku-development@freelists.org.

Regards,
Humdinger

TechnoMancer · July 1, 2010, 7:58pm

Unfortunately you cannot extend it in a compatible manner.
If you change parameter types then the symbols will mangle differently and anything compiled against the old one will die with symbol not found errors.

I would suggest creating a separate unicode supporting string and integrating that, as BString must remain binary compatible with the BeOS one and has limited usefulness.

mab · January 31, 2011, 11:39pm

I believe that the BString class can be extended in a compatible manner to handle UTF-8. I believe it can be done without changing the parameter types in the existing methods. It can be done so that all of the existing behavior can be done without changing the binary compatibility or the behavior of the existing methods. The one thing that my implementation won’t do is locale sensitive linguistic comparisons. For that, a Collator class needs to be added to the Locale Kit.

I’m working on an implementation and am planning on writing some articles that demonstrate the new methods that will be added.