Why does SortableIntField avoid UCS-16 surrogates - lucene

While reading the source code of SortableIntField, I noticed that this class avoids "UCS-16 surrogates" when converting an integer to a String (see the method int2sortableStr(int, char[], int) in NumberUtils.java).
What issue would these characters raise?

The comments in the given code are confusing; in fact, they contain a mistake. From Wikipedia:
Occasionally, articles about Unicode will mistakenly refer to UCS-2 as
"UCS-16". UCS-16 does not exist; the authors who make this error
usually intend to refer to UCS-2 or to UTF-16.
Your question #1: Why does SortableIntField avoid UCS-16 surrogates?
To decrease running time and to save space, for instance by avoiding endianness issues.
Your question #2: What issue would these characters raise?
Again, they would take more space, and if endianness is an issue, the running time would also increase. Also remember to catch your exceptions; otherwise you can easily bring your server down.
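To illustrate the kind of trouble an unpaired surrogate causes once it ends up inside a String (a minimal sketch of mine, not taken from the Lucene source): a char in the range U+D800 to U+DFFF is not a valid code point on its own, so the string no longer survives a round trip through UTF-8.

import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {
    public static void main(String[] args) {
        // An unpaired high surrogate is not a valid code point, so the UTF-8
        // encoder substitutes a replacement character and the value is lost.
        String lone = String.valueOf((char) 0xD800);
        byte[] utf8 = lone.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(lone)); // prints false
    }
}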

How to determine Thousands Separator using Format in VBA

I would like to determine the thousands separator in use while running VBA code on a target machine, without resorting to calling system built-in functions such as Separator = Application.ThousandsSeparator.
I am using the following simple code using 'Format':
ThousandSeparator = Mid(Format(1000, "#,#"), 2, 1)
The above seems to work fine, and would like to confirm if this is a safe method of doing it without resorting to system calls.
I would expect the result to be a single-character string in the form of , or . or ' or a space, as applicable to the locale on the machine.
Please note that I want to use only a language statement such as Format or similar (no system calls). Also, this relates to the thousands separator, not the decimal separator. The article Using VBA to detect which decimal sign the computer is using does not help or answer my question.
Thanks in advance.
The strict answer to whether it is safe to use Format to get the thousands separator is No.
E.g. on Windows, it is possible to enter up to three characters into the Thousands Separator field in the regional settings in the control panel.
Suppose you enter asd and click OK.
If you now call Format(1000, "#,#") it will give you 1a000. That is only the first letter of your thousands separator. You have failed to retrieve it correctly.
Reading the registry:
? CreateObject("WScript.Shell").RegRead("HKCU\Control Panel\International\sThousand")
you get back asd in full.
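Wrapped up as a small helper (a sketch: Windows-only, late-bound WScript.Shell, and it assumes the per-user registry value exists):

Function ThousandsSeparator() As String
    ' Read the full (possibly multi-character) separator from the registry;
    ' Format would silently truncate it to its first character.
    ThousandsSeparator = CreateObject("WScript.Shell") _
        .RegRead("HKCU\Control Panel\International\sThousand")
End Function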
To be fair, the Excel international properties do not seem to be of much help either. Application.International(xlThousandsSeparator) in this situation will return the separator originally defined in your computer's locale, not the value you've overridden it to.
Having said that, the practical answer is Yes: it would appear (and if you happen to know for sure, please post an answer here) that no culture uses a multi-character thousands separator (even in China, where scary things like 1億2345万6789 or 1億2345萬6789 exist, each of those grouping characters is a single UTF-16 character), and you are probably happy to ignore the people who decided to play with their locale settings in that fashion.

Char.ConvertFromUtf32 not available in Silverlight

I'm converting a WinForms app to Silverlight (VB.NET). What should I use instead of Char.ConvertFromUtf32 as it's not available to use in Silverlight?
UTF-32 is currently not supported in Silverlight, so you have to find a way around the limitation. I think you should stop for a moment and think about exactly why you need to read UTF-32-encoded text.
If you are reading such text from a database or a file on the server, I would perform the conversion server-side (if possible I would convert everything to UTF-8 and get rid of the UTF-32 data in one shot).
If you are parsing a user-provided file on the client side, I would detect the UTF-32 encoding and gently tell the user that the file encoding is not supported. UTF-32 is pretty rare nowadays, so I guess it should not be a very common case (but I could be wrong, not knowing your exact situation).
In order to detect the file encoding you have to look at the first few bytes (the byte order mark); there is more information here. If those bytes are not present, the task becomes much harder and involves some kind of heuristic based on character frequency.
From: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/types/how-to-convert-between-hexadecimal-strings-and-numeric-types
You can use a direct cast, like:
// Get the character corresponding to the integral value.
string stringValue = Char.ConvertFromUtf32(value); // not available in Silverlight
char charValue = (char)value;                      // the cast works in Silverlight
A small warning: the cast only works up to 0xFFFF. It will not work for the high Unicode range, 0x10000 to 0x10FFFF.
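If you do need the high range, you can build the UTF-16 surrogate pair yourself. A minimal sketch (the helper name FromCodePoint is mine, not a framework API):

static string FromCodePoint(int value)
{
    // Reject out-of-range values and the surrogate range itself.
    if (value < 0 || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        throw new ArgumentOutOfRangeException("value");
    // BMP code points fit in a single char.
    if (value <= 0xFFFF)
        return ((char)value).ToString();
    // Split the supplementary code point into a high/low surrogate pair.
    value -= 0x10000;
    char high = (char)(0xD800 + (value >> 10));
    char low = (char)(0xDC00 + (value & 0x3FF));
    return new string(new char[] { high, low });
}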
Also, if you need to parse \uXXXX, try this other question: How do I convert Unicode escape sequences to Unicode characters in a .NET string?

00626 SQL Loader error

How do I avoid the "characterset conversion buffer overflow" error in SQL*Loader (error SQL*Loader-00626)?
I am not able to find anything about this on the internet; please suggest a solution.
What is the character set of the input datafile? You might try specifying the character set in the control file:
CHARACTERSET char_set_name LENGTH SEMANTICS CHARACTER
By default, if not specified, Oracle will use byte-length semantics. Thus, if you define a field length in your control file as VARCHAR(20), in byte semantics you'd have a 20-byte buffer, but in character-length semantics you might have a 40-byte buffer. This would be my guess as to the source of the error.
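For example, a control file specifying both clauses might look like this (a sketch only: the file, table, and column names are made up, and you would substitute the datafile's actual character set):

LOAD DATA
CHARACTERSET UTF8
LENGTH SEMANTICS CHAR
INFILE 'data.dat'
INTO TABLE my_table
FIELDS TERMINATED BY ','
(my_col CHAR(4000))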
It's not a lot of help, but here's what the Oracle error manual has to say about that error:
SQL*Loader-00626: Character set conversion buffer overflow.
Cause: A conversion from the datafile character set to the client character set required more space than that allocated for the conversion buffer. The size of the conversion buffer is limited by the maximum size of a VARCHAR2 column.
Action: The input record is rejected. The data will not fit into the column.
It sounds like there isn't any way to work around this within SQLLoader. If it is affecting a small number of records then it may be easiest to simply handle those manually. If it is many records, then you probably need to find or create a different loading tool.
Just a few ideas for you to think about:
You could try to load different parts of the "string" into different fields in the database; maybe that way you can work around the limitation.
You could try to do the character set conversion in a different tool (some text editors may give you some options) and then load the file without it requiring the conversion.
Not sure if there's any merit in these ideas, but hopefully you can work something out.
Thanks for all your help. This problem has been resolved: we split the file, loaded it in chunks, and it worked fine.

Asc(Chr(254)) returns 116 in .Net 1.1 when language is Hungarian

I set the culture to Hungarian language, and Chr() seems to be broken.
System.Threading.Thread.CurrentThread.CurrentCulture = "hu-US"
System.Threading.Thread.CurrentThread.CurrentUICulture = "hu-US"
Chr(254)
This returns "ţ" when it should be "þ"
However, Asc("ţ") returns 116.
This: Asc(Chr(254)) returns 116.
Why would Asc() and Chr() be different?
I checked and the 'wide' functions do work correctly: ascw(chrw(254)) = 254
Chr(254) interprets the argument in a system-dependent way, by looking at the System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage property. See the MSDN article about Chr. You can check whether that value is what you expect; "hu-US" (the Hungarian locale as used in the US) might do something strange there.
As a side note, the current documentation for Asc() makes no promises about the code page used (such a statement was there until 3.0).
Generally I would stick to the Unicode variants (ending in -W) if at all possible, or use the Encoding class to explicitly specify the conversions.
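For instance (a small sketch): the wide variants operate on UTF-16 code units directly, so they are independent of the thread's ANSI code page.

Module WideDemo
    Sub Main()
        ' ChrW/AscW work on UTF-16 code units, not the ANSI code page,
        ' so the round trip is stable under any culture.
        Dim ch As Char = ChrW(254)  ' always U+00FE ("þ")
        Console.WriteLine(AscW(ch)) ' prints 254 regardless of culture
    End Sub
End Module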
My best guess is that your Windows tries to represent Chr(254) = "ţ" as a combined letter, where the first letter is Chr(116) = "t" and the second (a combining cedilla, "¸" or something like that) cannot be returned because Chr() only returns one letter.
Unicode text should not be handled character-by-character.
It sounds like you need to set the code page for the current thread -- the current culture shouldn't have any effect on Asc and Chr.
Both the Chr docs and the Asc docs have this line:
The returned character depends on the code page for the current thread, which is contained in the ANSICodePage property of the TextInfo class. TextInfo.ANSICodePage can be obtained by specifying System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage.
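For example, you can check which code page is in effect like this:

' Print the ANSI code page that Chr/Asc will use on this thread.
Console.WriteLine(System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage)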
I have seen several problems in VBA on the Mac where characters over 127 and some control characters are not treated properly.
This includes paragraph marks (especially in text copied from the internet or scanned), "¥", and "Ω".
They cannot always be searched for and cannot be used in file names (though they could in the past), and when tested they come up as another ASCII number. I have had to write algorithms to change these when files open, as they often look like the right character but then crash some of my macros by acting strangely. The character will look and act right when I save the file, but may be changed when it is reopened.
I will eventually try to switch to unicode, but I am not sure if that will help this issue.
This may not be the issue that you are observing, but I would not rule out isolated problems with certain characters like this. I have sent notes to MS about this in the past but have received no joy.
If you cannot find another solution and the character looks correct when you type it in, then I recommend using a macro snippet like the one below, which I run when updating tables. You of course have to set up theRange as the area you are looking at. A whole file can take a while.
Dim aChar As Long
For aChar = 1 To theRange.Characters.Count
    theRange.Characters(aChar).Select
    ' A character that reports Asc = 95 but is not actually "_" is one of
    ' the mangled ones; retype it as the intended character.
    If Asc(Selection.Text) = 95 And Selection.Text <> "_" Then Selection.TypeText "Ω"
Next aChar

Is it safe to convert a mysqlpp::sql_blob to a std::string?

I'm grabbing some binary data out of my MySQL database. It comes out as a mysqlpp::sql_blob type.
It just so happens that this BLOB is a serialized Google Protobuf. I need to de-serialize it so that I can access it normally.
This gives a compile error, since ParseFromString() is not intended for mysqlpp::sql_blob types:
protobuf.ParseFromString( record.data );
However, if I force the cast, it compiles OK:
protobuf.ParseFromString( (std::string) record.data );
Is this safe? I'm particularly worried because of this snippet from the mysqlpp documentation:
"Because C++ strings handle binary data just fine, you might think you can use std::string instead of sql_blob, but the current design of String converts to std::string via a C string. As a result, the BLOB data is truncated at the first embedded null character during population of the SSQLS. There’s no way to fix that without completely redesigning either String or the SSQLS mechanism."
Thanks for your assistance!
Judging by that quote, it doesn't look like it would be a problem: it is basically saying that if a null character is found in the blob, the string will stop there, but ASCII strings won't have random nulls in the middle of them. However, this might present a problem for internationalization (multibyte charsets may have nulls in the middle).
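If you want to sidestep the C-string conversion entirely, you can construct the std::string from an explicit pointer and length, or hand the buffer straight to protobuf. A sketch, assuming record.data exposes data() and length() the way mysqlpp::String does:

// Build the string with an explicit length so embedded '\0' bytes survive.
std::string raw(record.data.data(), record.data.length());
protobuf.ParseFromString(raw);

// Or skip the intermediate std::string altogether:
protobuf.ParseFromArray(record.data.data(), static_cast<int>(record.data.length()));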