How can I get pymongo to always return str and not unicode? - pymongo

From the pymongo docs:
"MongoDB stores data in BSON format. BSON strings are UTF-8 encoded so PyMongo must ensure that any strings it stores contain only valid UTF-8 data. Regular strings (str) are validated and stored unaltered. Unicode strings (unicode) are encoded UTF-8 first. The reason our example string is represented in the Python shell as u'Mike' instead of 'Mike' is that PyMongo decodes each BSON string to a Python unicode string, not a regular str."
It seems a bit silly to me that the database can only store UTF-8 encoded strings, but the return type in pymongo is unicode, meaning the first thing I have to do with every string from the document is once again call encode('utf-8') on it. Is there some way around this, i.e. telling pymongo not to give me unicode back but just give me the raw str?

No, there is no such feature in PyMongo; every string field is decoded from BSON as UTF-8 into a Python unicode object, and Python then represents that string internally as UCS-2 or some other format, depending on the Python version. See the code where the BSON decoder extracts a string.
In the upcoming PyMongo 3.x series we may add features for more flexible BSON decoding to allow developers to optimize uncommon use cases like this.
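If you really need byte strings back today, one workaround is to re-encode the decoded values yourself after fetching. Below is a minimal Python 2 sketch, not a PyMongo feature: it assumes an existing Collection object named collection and an illustrative query, and simply walks a decoded document, converting every unicode value back to a UTF-8 encoded str:

def to_utf8(value):
    # Recursively convert unicode objects back to UTF-8 encoded str
    if isinstance(value, unicode):
        return value.encode('utf-8')
    if isinstance(value, dict):
        return dict((to_utf8(k), to_utf8(v)) for k, v in value.items())
    if isinstance(value, list):
        return [to_utf8(item) for item in value]
    return value

doc = collection.find_one({'name': u'Mike'})  # string values come back as unicode
doc = to_utf8(doc)                            # now every string is a UTF-8 str

Note that this costs an extra pass over every document, which is exactly the overhead that more flexible BSON decoding hooks would be meant to avoid.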

Related

How to store Bytes/Slice(UInt8) as a string in Crystal?

I'm encoding an Object into Bytes (i.e. Slice(UInt8)) via MessagePack. How would I store this in a datastore client (e.g. Crystal-Redis) that only accepts Strings?
If you have no choice but to store the Slice as a String, you can encode it as one, at the cost of some performance.
There's Base64 strict_encode/decode:
# Serialize the object, then Base64-encode the resulting Slice(UInt8)
encoded = An_Object.to_msgpack # Slice(UInt8)
save_to_datastore "my_stuff", Base64.strict_encode(encoded)

# Later: read the Base64 string back and decode it before deserializing
from_storage = get_from_datastore "my_stuff"
if from_storage
  My_MsgPack_Mapping.from_msgpack(Base64.decode(from_storage))
end
Or you can use Slice#hexstring and String#hexbytes:
encoded = An_Object.to_msgpack # Slice(UInt8)
save_to_datastore "my_stuff", encoded.hexstring

from_storage = get_from_datastore "my_stuff"
if from_storage && from_storage.hexbytes?
  My_MsgPack_Mapping.from_msgpack(from_storage.hexbytes)
end
(Crystal-Redis users have another option: see this issue.)
Both Crystal and Redis should be able to handle strings containing invalid UTF-8 bytes, so you could also create a String directly from the slice, store that in Redis, and convert it back to a Slice on the way out.
This is of course not entirely safe: you should make sure to avoid invoking any string methods that expect a valid UTF-8 string.
But apart from that, this direct method should be perfectly fine. It is faster and more memory-efficient than using a string encoding.
redis.set key, String.new(slice)
redis.get(key).to_slice

Character encoding that won't change the higher bits after I set them

I'm looking for a character encoding that allows me to set a byte higher than 127. NSASCIIStringEncoding and NSUTF8StringEncoding replace those higher values.
The character encoding only matters when you're trying to interpret the bytes as characters. If that's what you need to do, and if you're using data that comes from some outside source, then use whatever encoding the outside source used.
On the other hand, if you're just trying to manage a collection of bytes (i.e. not characters), then look into using NSData instead. NSData doesn't care about character encodings, doesn't change the order of your bytes, and will happily keep track of as much data as you give it. (There's a mutable version if you need to modify the data it contains.)
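To illustrate the point outside of Cocoa, here is a small, purely illustrative Python sketch (not Objective-C): bytes above 127 only get altered when you force them through a character encoding, while a raw byte container keeps every byte intact, which is the role NSData plays.

raw = bytes([0x48, 0x69, 0xC3, 0xBC, 0xFF])   # includes bytes above 127

# Interpreting the bytes as ASCII text replaces the high bytes
text = raw.decode('ascii', errors='replace')
print(text)          # 'Hi' followed by replacement characters -- the high bytes are gone

# Keeping the data as raw bytes (the NSData approach) preserves every byte
print(list(raw))     # [72, 105, 195, 188, 255] -- unchanged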

Objective C - char with umlaute to NSString

I am using libical which is a library to parse the icalendar format (RFC 2445).
The problem is, that there may be some german umlaute for example in the location field.
Now libical returns a const char * for each value like:
"K\303\203\302\274nstlerhaus in M\303\203\302\274nchen"
I tried to convert it to NSString with:
[NSString stringWithCString:icalvalue_as_ical_string_r(value) encoding:NSUTF8StringEncoding];
But what I get is:
Künstlerhaus in München
Any suggestions? I would appreciate any help!
It seems your string got doubly UTF-8 encoded: "Künstlerhaus in München" is itself valid UTF-8, so if you UTF-8-decode it once more you should get the correct string.
Bear in mind though that you shouldn't be satisfied with that result. There are combinations where a doubly UTF-8 encoded string can't simply be recovered by decoding twice; some encoding combinations are irreversible. So in your situation I'd suggest you find out why the string got doubly UTF-8 encoded in the first place: perhaps the ical file is stored in the wrong encoding on disk, perhaps libical uses the wrong character set to read it, or, if you're getting the ical from a server, perhaps the charset declared for the calendar data is wrong, and so on.
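To see the double encoding concretely, here is a short Python sketch (illustrative only; not libical or Cocoa code) that takes the bytes libical returned and strips one encoding layer:

raw = b"K\303\203\302\274nstlerhaus in M\303\203\302\274nchen"

once = raw.decode('utf-8')                     # 'KÃ¼nstlerhaus in MÃ¼nchen' -- still mojibake after one decode
fixed = once.encode('latin-1').decode('utf-8') # undo the extra encoding layer
print(fixed)                                   # Künstlerhaus in München

In Cocoa terms the equivalent would be roughly to take the mojibake NSString, get its bytes with dataUsingEncoding: using NSISOLatin1StringEncoding, and decode those bytes as UTF-8 again; but as noted above, finding and fixing the source of the double encoding is the better solution.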
The C string does not seem to be plain UTF-8, as there are four bytes for each umlaut character. For example, ü would be encoded as \xc3\xbc (decimal 195 188, octal \303\274) in UTF-8. So the input is either already garbled when you receive it or it uses some other encoding.

Objective-C How to get unicode character

I want to get the Unicode code point for a given Unicode character in Objective-C. The NSString documentation says it uses UTF-16 encoding internally, and says:
The NSString class has two primitive methods—length and characterAtIndex:—that provide the basis for all other methods in its interface. The length method returns the total number of Unicode characters in the string. characterAtIndex: gives access to each character in the string by index, with index values starting at 0.
That seems to assume the characterAtIndex: method is Unicode-aware. However, it returns unichar, which is a 16-bit unsigned integer type:
- (unichar)characterAtIndex:(NSUInteger)index
The questions are:
Q1: How does it represent Unicode code points above U+FFFF?
Q2: If Q1 makes sense, is there a method to get the Unicode code point for a given Unicode character in Objective-C?
Thanks.
The short answer to Q1 ("How does it represent Unicode code points above U+FFFF?") is: you need to be UTF-16 aware and correctly handle surrogate code points. The info and links below should give you pointers and example code that let you do this.
The NSString documentation is correct. However, while you said that NSString uses UTF-16 encoding internally, it's more accurate to say that the public/abstract interface of NSString is UTF-16 based. This leaves the internal representation of a string a private implementation detail, but the public methods, such as characterAtIndex: and length, are always expressed in UTF-16 code units.
The reason for this is that it tends to strike the best balance between older ASCII-centric strings and Unicode-aware strings, largely because Unicode is a strict superset of ASCII (ASCII uses 7 bits for 128 characters, which map to the first 128 Unicode code points).
To represent a Unicode code point above U+FFFF, which obviously exceeds what can be stored in a single UTF-16 code unit, UTF-16 uses special surrogate code points: a high and a low surrogate are combined into a surrogate pair that together encode the code point (a short worked example of the arithmetic appears below). You can find details about this at:
Unicode UTF FAQ - What are surrogates?
Unicode UTF FAQ - What’s the algorithm to convert from UTF-16 to character codes?
Although the official Unicode UTF FAQ - How do I write a UTF converter? now recommends using International Components for Unicode, it used to recommend example code officially sanctioned and maintained by Unicode. That code is no longer directly available from Unicode.org, but you can still find copies of it in various open-source projects: ConvertUTF.c and ConvertUTF.h. If you need to roll your own, I'd strongly recommend examining this code first, as it is well tested.
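To make the surrogate-pair arithmetic concrete, here is a short sketch in Python rather than Objective-C; the same math applies to the unichar values returned by characterAtIndex:, and U+1D11E is just an arbitrary example above U+FFFF:

# U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in a single UTF-16 code unit
cp = 0x1D11E

# Encoding: split the code point into a high and a low surrogate
v = cp - 0x10000
high = 0xD800 + (v >> 10)    # 0xD834
low = 0xDC00 + (v & 0x3FF)   # 0xDD1E

# Decoding (what Q2 needs): recombine two surrogate code units into the code point
decoded = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
assert decoded == cp

# Cross-check against a real UTF-16 encoder
assert '\U0001D11E'.encode('utf-16-be') == b'\xd8\x34\xdd\x1e'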
From the documentation of length:
"The number returned includes the individual characters of composed character sequences, so you cannot use this method to determine if a string will be visible when printed or how long it will appear."
From this, I would infer that any characters above U+FFFF would be counted as two characters and would be encoded as a Surrogate Pair (see the relevant entry at http://unicode.org/glossary/).
If you have a UTF-32 encoded string with the character you wish to convert, you could create a new NSString with initWithBytesNoCopy:length:encoding:freeWhenDone: and use the result of that to determine how the character is encoded in UTF-16, but if you're going to be doing much heavy Unicode processing, your best bet is probably to get familiar with ICU (http://site.icu-project.org/).

Char.ConvertFromUtf32 not available in Silverlight

I'm converting a WinForms app to Silverlight (VB.NET). What should I use instead of Char.ConvertFromUtf32 as it's not available to use in Silverlight?
UTF-32 is currently not part of Silverlight, so you have to find a way around the limitation. I think you should stop for a moment and think about exactly why you need to read UTF-32-encoded text.
If you are reading such text from a database or a file on the server, I would perform the conversion server-side (if possible I would convert everything to UTF-8 and get rid of the UTF-32 data in one shot).
If you are parsing a user-provided file on the client side, I would detect the UTF-32 encoding and gently tell the user that the file encoding is not supported. UTF-32 is pretty rare nowadays, so I guess it should not be a very common case (but I could be wrong, not knowing your exact situation).
In order to detect the file encoding you have to look at the first few bytes (the byte order mark); if it is not present, the task becomes much harder and involves some kind of heuristics based on character frequency.
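As a rough sketch of that BOM check, in Python purely to show the byte patterns (the file name is hypothetical; a real Silverlight app would read the user-provided stream instead):

def detect_utf32_bom(path):
    # UTF-32 byte order marks: FF FE 00 00 (little-endian), 00 00 FE FF (big-endian)
    with open(path, 'rb') as f:
        head = f.read(4)
    if head == b'\xff\xfe\x00\x00':
        return 'utf-32-le'
    if head == b'\x00\x00\xfe\xff':
        return 'utf-32-be'
    return None

print(detect_utf32_bom('upload.txt'))  # 'utf-32-le', 'utf-32-be', or None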
From: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/types/how-to-convert-between-hexadecimal-strings-and-numeric-types
You can use a direct cast, like:
// Get the character corresponding to the integral value.
string stringValue = Char.ConvertFromUtf32(value);
char charValue = (char)value;
Small warning: the cast only works up to 0xFFFF. It will not work for high-range Unicode code points from 0x10000 to 0x10FFFF.
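If you do need the full range in Silverlight, the missing method can be reproduced with standard UTF-16 surrogate-pair arithmetic (the same idea discussed in the Objective-C question above). Here is a rough sketch of the logic in Python (the function name is made up; the intent is that you translate the arithmetic into your VB.NET code, where strings are already UTF-16):

def convert_from_utf32(cp):
    # Below 0x10000 a single UTF-16 code unit is enough
    if cp < 0x10000:
        return chr(cp)
    # Otherwise split the code point into a high and a low surrogate
    cp -= 0x10000
    return chr(0xD800 + (cp >> 10)) + chr(0xDC00 + (cp & 0x3FF))

pair = convert_from_utf32(0x1F600)    # a code point above 0xFFFF
print([hex(ord(c)) for c in pair])    # ['0xd83d', '0xde00'] -- two UTF-16 code units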
Also, if you need to parse \uXXXX, try this other question: How do I convert Unicode escape sequences to Unicode characters in a .NET string?