Representing data types e.g. Chars, Strings, Integers etc - vb.net

I am a .NET Developer and I do not believe I know enough about encoding. I have read this article: http://www.joelonsoftware.com/articles/Unicode.html.
Say I declare this string:
Dim TestString As String = "1"
I believe this will be represented as a Unicode character. Say I declare this integer:
Dim TestInt As Integer = 1
How is this represented? I assume that Unicode is not used? i.e. it is only used for String and Chars? Is that correct? Therefore I believe that on a 32 bit machine 1 would simply be represented as:
00000000 0000000 0000000 00000001
Do numeric data types have byte order marks: http://en.wikipedia.org/wiki/Byte_order_mark ?

All strings in .NET are UTF-16. From the language spec:
Visual Basic .NET defines the following primitive types:
...
The Char value type, which represents a single Unicode character and
maps to System.Char...
The String reference type, which
represents a sequence of Unicode characters and maps to System.String...
Why should an integral value types like an integer be represented with Unicode in computer memory? Unicode is (citing from Wikipedia):
a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
So yes, it's only used for Strings and Chars.
Also note that an Integer will always be 4-byte signed integer, no matter if you use a 32 bit or 64 bit machine.
Byte order marks are an entire different topic. As already said in a comment, it's used in text file or stream.

Related

Can ASCII arrays be manipulated as arrays without converting to String form?

This is a basic question, but I can't find anything on it, since I don't know what to search — each of my tries have come up with unrelated results.
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character? Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
Dim t1() As Byte = Text.Encoding.ASCII.GetBytes("Hello ")
Dim t2() As Byte = Text.Encoding.ASCII.GetBytes("World")
Dim msg As String = Text.Encoding.ASCII.GetString(t1.Concat(t2).ToArray)
Now msg should be "Hello World".
I would like this to work as I don't want to have to convert data I receive back to Strings in order to manipulate it before it is sent again.
What if I used something other than ASCII (like UTF-8, for example)?
If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character?
Yes. ASCII is a 7bit encoding, it does not support multi-byte characters. Any Unicode codepoint above U-007F will get converted to a ? character in ASCII.
If you were to use UTF-7 instead, for instance, it can encode individual Unicode codepoints into a sequence of multiple ASCII characters.
Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?
In your particular example, yes (provided you are using LINQ's Concat() method - there are other ways to concat arrays together). There is no data loss.
But for other examples, just know that you will have data loss if you convert non-ASCII characters to ASCII, or otherwise mismatch encodings between GetBytes() and GetString().
You can certainly manipulate byte arrays. Just make sure the arrays are in the same encoding if you merge them together.
.NET strings are counted sequences of UTF-16 code units (char), one or two of which encode a Unicode codepoint (int Char.ConvertToUtf32 ). Some codepoints are "combining characters", which when applied to a preceding "base character" form a grapheme (which is then rendered by a font into a glyph).
An encoder from Unicode to an encoding of another character set should attempt to preserve graphemes. In .NET, a grapheme is called a "text element."
So, yes, you can combine encoded byte sequences as long as you haven't defeated the encoder by converting parts of a grapheme into different byte sequences. If you are breaking a string into two before encoding, see TextElementEnumerator and StringInfo class.

Returning Unicode Name With Code Point

I know how to return a Unicode character from a code point. That's not what I'm after. What I want to know is how to return the name associated with a particular code point. For example, The code point for 🍀 is 1F340. And its name is FOUR LEAF CLOVER. Is it possible for us to return this name with its code point? I've read about 100 topics involving Unicode. But I haven't see one discussing my question. I hope that's possible.
Thank you for your help.
Have you considered the ICU library? It offers the following C API: http://icu-project.org/apiref/icu4c/uchar_8h.html#aa488f2a373998c7decb0ecd3e3552079
int32_t u_charName(
UChar32 code,
UCharNameChoice nameChoice,
char* buffer,
int32_t bufferLength,
UErrorCode* pErrorCode)
Retrieve the name of a Unicode character.
Depending on nameChoice, the character name written into the buffer is the "modern" name or the name that was defined in Unicode version 1.0. The name contains only "invariant" characters like A-Z, 0-9, space, and '-'. Unicode 1.0 names are only retrieved if they are different from the modern names and if the data file contains the data for them. gennames may or may not be called with a command line option to include 1.0 names in unames.dat.
Parameters
code The character (code point) for which to get the name. It must be 0<=code<=0x10ffff.
nameChoice Selector for which name to get.
buffer Destination address for copying the name. The name will always be zero-terminated. If there is no name, then the buffer will be set to the empty string.
bufferLength ==sizeof(buffer)
pErrorCode Pointer to a UErrorCode variable; check for U_SUCCESS() after u_charName() returns.
Returns
The length of the name, or 0 if there is no name for this character. If the bufferLength is less than or equal to the length, then the buffer contains the truncated name and the returned length indicates the full length of the name. The length does not include the zero-termination.
ICU is the right approach, but it's even simpler than Chris said. Foundation includes ICU already, for various text processing functions, including CFStringTransform(). Its transform parameter accepts "any valid ICU transform ID defined in the ICU User Guide for Transforms".
One of ICU's transforms is Any-Name:
Converts between characters and their Unicode names in curly braces. For example:
., ⇆ {FULL STOP}{COMMA}
(The syntax isn't exactly as documented, but it's close enough you can figure it out.)
There's also an Any-Hex transform which can be used for translating to/from the codepoint hex value.

Hexadecimal numbers vs. hexadecimal enocding (with base64 as well)

Encoding with hexadecimal numbers seems to be different from using hexadecimals to represent numbers. For example, then hex number 0x40 to me should be equal to 64, or BA_{64}, but when I put it through this hex to base64 converter, I get the output: QA== which to me is equal to some number times 64. Why is this?
Also when I check the integer value of the hex string deadbeef I get 3735928559, but when I check it other places I get: 222 173 190 239. Why is this?
Addendum: So I guess it is because it is easier to break the number into bit chunks than treat it as a whole number when encoding? That is pretty confusing to me but I guess I get it.
You may wish to read this:
http://en.wikipedia.org/wiki/Base64
In summary, base64 specifies a specific encoding, which involves using different values for letters than their ASCII encoding.
For the second part, one source is treating the entire string as a 32 bit integer, and the other is dividing it into bytes and giving the value of each byte.

Processing: How to convert a char datatype into its utf-8 int representation?

How can I convert a char datatype into its utf-8 int representation in Processing?
So if I had an array ['a', 'b', 'c'] I'd like to obtain another array [61, 62, 63].
After my answer I figured out a much easier and more direct way of converting to the types of numbers you wanted. What you want for 'a' is 61 instead of 97 and so forth. That is not very hard seeing that 61 is the hexadecimal representation of the decimal 97. So all you need to do is feed your char into a specific method like so:
Integer.toHexString((int)'a');
If you have an array of chars like so:
char[] c = {'a', 'b', 'c', 'd'};
Then you can use the above thusly:
Integer.toHexString((int)c[0]);
and so on and so forth.
EDIT
As per v.k.'s example in the comments below, you can do the following in Processing:
char c = 'a';
The above will give you a hex representation of the character as a String.
// to save the hex representation as an int you need to parse it since hex() returns a String
int hexNum = PApplet.parseInt(hex(c));
// OR
int hexNum = int(c);
For the benefit of the OP and the commenter below. You will get 97 for 'a' even if you used my previous suggestion in the answer because 97 is the decimal representation of hexadecimal 61. Seeing that UTF-8 matches with the first 127 ASCII entries value for value, I don't see why one would expect anything different anyway. As for the UnsupportedEncodingException, a simple fix would be to wrap the statements in a try/catch block. However that is not necessary seeing that the above directly answers the question in a much simpler way.
what do you mean "utf-8 int"? UTF8 is a multi-byte encoding scheme for letters (technically, glyphs) represented as Unicode numbers. In your example you use trivial letters from the ASCII set, but that set has very little to do with a real unicode/utf8 question.
For simple letters, you can literally just int cast:
print((int)'a') -> 97
print((int)'A') -> 65
But you can't do that with characters outside the 16 bit char range. print((int)'二') works, (giving 20108, or 4E8C in hex) but print((int)'𠄢') will give a compile error because the character code for 𠄢 does not fit in 16 bits (it's supposed to be 131362, or 20122 in hex, which gets encoded as a three byte UTF-8 sequence 239+191+189)
So for Unicode characters with a code higher than 0xFFFF you can't use int casting, and you'll actually have to think hard about what you're decoding. If you want true Unicode point values, you'll have to literally decode the byte print, but the Processing IDE doesn't actually let you do that; it will tell you that "𠄢".length() is 1, when in real Java it's really actually 3. There is -in current Processing- no way to actually get the Unicode value for any character with a code higher than 0xFFFF.
update
Someone mentioned you actually wanted hex strings. If so, use the built in hex function.
println(hex((int)'a')) -> 00000061
and if you only want 2, 4, or 6 characters, just use substring:
println(hex((int)'a').substring(4)) -> 0061

Do certain characters take more bytes than others?

I'm not very experienced with lower level things such as howmany bytes a character is. I tried finding out if one character equals one byte, but without success.
I need to set a delimiter used for socket connections between a server and clients. This delimiter has to be as small (in bytes) as possible, to minimize bandwidth.
The current delimiter is "#". Would getting an other delimiter decrease my bandwidth?
It depends on what character encoding you use to translate between characters and bytes (which are not at all the same thing):
In ASCII or ISO 8859, each character is represented by one byte
In UTF-32, each character is represented by 4 bytes
In UTF-8, each character uses between 1 and 4 bytes
In ISO 2022, it's much more complicated
US-ASCII characters (of whcich # is one) will take only 1 byte in UTF-8, which is the most popular encoding that allows multibyte characters.
It depends on the encoding. In Single-byte character sets such as ANSI and the various ISO8859 character sets it is one byte per character. Some encodings such as UTF8 are variable width where the number of bytes to encode a character depends on the glyph being encoded.
The answer of course is that it depends. If you are in a pure ASCII env, then yes, every char takes 1 byte, but if you are in a Unicode env (all of Windows for example), then chars can range from 1 to 4 bytes in size.
If you choose a char from the ASCII set, then yes your delimter is a small as possible.
No, all characters are 1 byte, unless you're using Unicode or wide characters (for accents and other symbols for example).
A character is 1 byte, or 8 bits, long which gives 256 possible combination to form characters with. 1 byte characters are called ASCII characters. They only use 7 bits (even though 8 are available, but you can't use this 8th bit) to form the standard alphabet and various symbols used when teletypes and typewriters were still common.
You can find an ASCII chart and what numbers correspond to what characters here.