Silverlight UTF8 encoder produces wacky output

I've been trying to trace down a bug for hours now and it has come down to this:
Dim length as Integer = 300
Dim buffer() As Byte = binaryReader.ReadBytes(length)
Dim text As String = System.Text.Encoding.UTF8.GetString(buffer, 0, buffer.Length)
The problem is the buffer contains 300 bytes but the length of the string 'text' is now 285. When I convert it back to bytes, the length is 521 bytes... WTF?
The same code in a normal WinForms app works perfectly. The data being read by the binary reader is a UTF8 encoded string. Any ideas why Silverlight is playing funny buggers?

I bet your stream contains some characters that require more than one byte. UTF8 uses a single byte when possible, but uses more bytes when the character is outside the ASCII range.
This explains why your buffer is longer than the string (300 vs 285).
Example:
string: "testä" (length = 5; the last char takes 2 bytes)
bytes: 0x74 | 0x65 | 0x73 | 0x74 | 0xc3 0xa4 (length = 6)
As to why it becomes even longer when you convert the text back to bytes, my best guess (also looking at the 521 size you get) is that you are using Encoding.Unicode instead of Encoding.UTF8 to perform the conversion. Encoding.Unicode is UTF-16, which uses at least two bytes for each character.
(btw. obviously this has nothing to do with Silverlight. You are probably testing the code with two different strings in WinForms vs. Silverlight. No worry, we've all made stupid mistakes like that :-) )
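To make the length difference concrete, here is a minimal Python sketch (not the .NET code from the question) that encodes the example string both ways. The variable names are only for illustration.

```python
# The example string from above: four ASCII letters plus one non-ASCII letter.
text = "testä"

utf8_bytes = text.encode("utf-8")       # 1 byte per ASCII char, 2 bytes for "ä"
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per char here (no BOM added)

print(len(text), len(utf8_bytes), len(utf16_bytes))
```

The character count and the byte count only agree when every character is plain ASCII.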

Related

Sending Unicode Characters greater than 0x7F through RS232

My application runs on Windows CE 6.0 using Compact Framework and is being used to issue remote commands to a device through RS-232. These commands are sent using bytes with specific hex values, e.g. sending 0x22 0x28 0x00 0x01 as a command sequence. I'm sending the bytes one at a time. The hex values are stored internally in a string for each command sequence, e.g. "22,28,00,01". I'm sending the bytes using the following code.
Dim i As Integer
Dim SendString() As String
Dim SendByte, a As String
DutCommand = "22,0A,00,02,E7,83" 'Sample command string
SendString = Split(DutCommand, ",") 'Split the string
For i = 0 To UBound(SendString) 'Send each byte after encoding
SendByte = Chr(CInt("&H" & SendString(i)))
CommPort.Write(SendByte)
Next
SendByte is being properly encoded even for values greater than 0x7F, but the last two bytes (0xE7 and 0x83) are being sent as 0x3F, the ASCII code for "?", since they are greater than 0x7F.
Am I missing a setting for the Comm port to handle encoding? Is there a simple method for sending the data with values greater than 0x7F?

You simply forgot to convert the hex values to bytes. It needs to look like this:
For i = 0 To UBound(SendString) 'Send each byte after encoding
Dim b = Byte.Parse(SendString(i), Globalization.NumberStyles.HexNumber)
CommPort.BaseStream.WriteByte(b)
Next
The non-stringy way is:
Dim DutCommand As Byte() = {&H22, &H0A, &H00, &H02, &HE7, &H83}
CommPort.Write(DutCommand, 0, DutCommand.Length)
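The same idea, parsing the comma-separated hex string into raw bytes before anything touches a text encoding, can be sketched in Python. The helper name parse_command is made up for this example, and no serial port is opened here.

```python
def parse_command(command):
    """Turn a string like "22,0A,00,02,E7,83" into the raw bytes it describes."""
    return bytes(int(part, 16) for part in command.split(","))

payload = parse_command("22,0A,00,02,E7,83")
print(payload.hex())
```

The resulting bytes object is what would be handed to a byte-oriented write call, so values above 0x7F pass through unchanged.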

I am assuming that you are using SerialPort.Write.
If so, notice what the documentation says:
By default, SerialPort uses ASCIIEncoding to encode the characters. ASCIIEncoding encodes all characters greater than 127 as (char)63 or '?'. To support additional characters in that range, set Encoding to UTF8Encoding, UTF32Encoding, or UnicodeEncoding.
Seems like the solution is pretty clear. You'll need to set the CommPort.Encoding property to the desired value.
See SerialPort.Encoding for more info.

As per the documentation for SerialPort.Write:
By default, SerialPort uses ASCIIEncoding to encode the characters.
ASCIIEncoding encodes all characters greater than 127 as (char)63 or
'?'. To support additional characters in that range, set Encoding to
UTF8Encoding, UTF32Encoding, or UnicodeEncoding.
You could also consider using the Write overload that actually just writes the raw bytes.
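As a rough illustration of the behaviour the documentation describes, the Python sketch below uses Python's own "ascii" codec with errors="replace" as a stand-in for ASCIIEncoding; it is not the .NET API. It also shows what a text encoding such as UTF-8 would put on the wire for the same value, which is worth checking against what the device expects.

```python
value = chr(0xE7)  # the character whose code point is 0xE7

as_ascii = value.encode("ascii", errors="replace")  # unencodable char becomes "?"
as_utf8 = value.encode("utf-8")                      # two bytes, not the single byte 0xE7
as_raw = bytes([0xE7])                               # the byte itself, no text encoding involved

print(as_ascii, as_utf8, as_raw)
```

For a binary protocol where each value must arrive as exactly one byte, only the raw form matches, which is what the byte-array Write overload sends.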

Redis int representation of a string is bigger when the string is more than 7 bytes but smaller otherwise

I'm trying to reduce Redis's objects size as much as I can and I've taken this whole week to experiment with it.
While testing different data representations I found out that an int representation of the string "hello" results in a smaller object. It may not look like much, but if you have a lot of data it can make the difference between using a few GB of memory and using dozens.
Look at the following example (you can try it yourself if you want):
> SET test:1 "hello"
> debug object test:1
> Value at:0xb6c9f380 refcount:1 encoding:raw serializedlength:6 lru:9535350 lru_seconds_idle:7
In particular you should look at serializedlength which is 6 (bytes) in this case.
Now, look at the following int representation of it:
> SET test:2 "857715"
> debug object test:2
> Value at:0xb6c9f460 refcount:1 encoding:int serializedlength:5 lru:9535401 lru_seconds_idle:2
As you see, it results in a byte shorter object (note also encoding:int which I think is suggesting that ints get handled in a more efficient way).
With the string "hello w" (you'll see in a few moments why I didn't use "hello world" instead) we get an even bigger saving when it's represented as an int:
> SET test:3 "hello w"
> SET test:4 "857715023" <- Int representation. Notice that I inserted a "0", if I don't, it results in a bigger object and the encoding is set to "raw" instead (after all a space is not an int).
>
> debug object test:3
> Value at:0xb6c9f3a0 refcount:1 encoding:raw serializedlength:8 lru:9535788 lru_seconds_idle:6
> debug object test:4
> Value at:0xb6c9f380 refcount:1 encoding:int serializedlength:5 lru:9535809 lru_seconds_idle:5
It looks cool as long as you don't exceed a 7-byte string. Look at what happens with the int representation of "hello wo":
> SET test:5 "hello wo"
> SET test:6 "85771502315"
>
> debug object test:5
> Value at:0xb6c9f430 refcount:1 encoding:raw serializedlength:9 lru:9535907 lru_seconds_idle:9
> debug object test:6
> Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:5
As you can see the int (12 bytes) is bigger than the string representation (9 bytes).
My question here is, what's going on behind the scenes when you represent a string as an int, that it is smaller until you reach 7 bytes?
Is there a way to increase this limit as you do with "list-max-ziplist-entries/list-max-ziplist-value" or a clever way to optimize this process so that it always (or nearly) results in a smaller object than a string?
UPDATE
I've experimented further with other tricks, and you can actually have ints that are smaller than the string regardless of its size, but that involves a little more work on the data structure modelling.
I've found out that if you split the int representation of a string into chunks of ~8 digits each, it ends up being smaller.
Take as an example the phrase "Hello World Hi Universe" and store it in a hash both as strings and as ints:
> HMSET test:7 "Hello" "World" "Hi" "Universe"
> HMSET test:8 "74111114" "221417113" "78" "2013821417184"
The results are as follows:
> debug object test:7
> Value at:0x7d12d600 refcount:1 encoding:ziplist serializedlength:40 lru:9567096 lru_seconds_idle:296
>
> debug object test:8
> Value at:0x7c17d240 refcount:1 encoding:ziplist serializedlength:37 lru:9567531 lru_seconds_idle:2
As you can see, the int version is 3 bytes smaller.
Organizing the data this way is the hard part, but it shows that it's possible nonetheless.
Still, I don't know where this limit is set. The ~700K of memory that is used persistently (even when there is no data inside) makes me think that there is a pre-defined "pool" dedicated to the optimization of int sets.
UPDATE2
I think I've found where this intset "pool" is defined in Redis source.
At line 81 in the file redis.h there is the define REDIS_SHARED_INTEGERS, set to 10000:
#define REDIS_SHARED_INTEGERS 10000
I suspect it's the one defining the limit of an intset byte length.
I have to try recompiling it with a higher value and see if I can use a longer int value (it'll most probably allocate more memory if it's the one I think it is).
UPDATE3
I want to thank Antirez for the reply! Didn't expect that.
As he pointed out, len != memory usage.
I went further in my experiment and saw that the objects already get slightly compressed (serialized). I may have missed something in the Redis documentation.
The confirmation comes from analyzing a Redis key with the command redis-memory-for-key key, which actually returns the memory usage and not the serialized length.
For example, let's take the "hello" string and int we used before, and see what's the result:
~ # redis-memory-for-key test:1
Key "test:1"
Bytes 101
Type string
~ #
~ # redis-memory-for-key test:2
Key "test:2"
Bytes 87
Type string
As you can see, the intset is smaller (87 bytes) than the string (101 bytes) anyway.
UPDATE4
Surprisingly, a longer intset seems to affect its serializedlength but not its memory usage.
This makes it possible to actually build a 2digit-char mapping that is still more memory efficient than a string, without even chunking it.
By 2digit-char mapping I mean that instead of mapping "hello" to "85121215" we map each char to a number with a fixed length of 2 digits, prefixing it with "0" if the number is < 10, like "0805121215".
A custom script would then proceed by splitting off every two digits and converting them to their equivalent char:
08 05 12 12 15
 |  |  |  |  |
 h  e  l  l  o
This is enough to avoid ambiguity (like "o" and "ae", which would otherwise both result in "15").
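A minimal Python sketch of such a custom script follows. It assumes the mapping is a=1 ... z=26 over lowercase letters only; the function names encode_word and decode_digits are made up for this example.

```python
def encode_word(word):
    """Map each lowercase letter to its 2-digit position in the alphabet (a=01 ... z=26)."""
    return "".join(f"{ord(ch) - ord('a') + 1:02d}" for ch in word)

def decode_digits(digits):
    """Split the digit string into pairs and map each pair back to its letter."""
    pairs = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return "".join(chr(int(pair) + ord('a') - 1) for pair in pairs)

print(encode_word("hello"))
print(decode_digits("0805121215"))
```

With fixed-width pairs, "o" encodes to "15" while "ae" encodes to "0105", so the two can no longer be confused.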
I'll show you this works by creating another set and therefore analyzing its memory usage like I did before:
> SET test:9 "0805070715"
Unix shell
----------
~ # redis-memory-for-key test:9
Key "test:9"
Bytes 87
Type string
You can see that we have a memory win here.
The same "hello" string compressed with Smaz for comparison:
>>> smaz.compress('hello')
'\x10\x98\x06'
// test:10 would be unfair as it results in a byte longer object
SET post:1 "\x10\x98\x06"
~ # redis-memory-for-key post:1
Key "post:1"
Bytes 99
Type string

My question here is, what's going on behind the scenes when you represent a
string as an int, that it is smaller until you reach 7 bytes?
Notice that the integer you supplied as test #6 is no longer actually encoded
as an integer, but as raw:
SET test:6 "85771502315"
Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:
So we see that a "raw" value occupies one byte plus the length of its string representation. In memory
you get that plus the overhead of the value.
The integer encoding, I suspect, encodes a number as a 32-bit integer; then it will always
need five bytes, one to tell its type, and four to store those 32 bits.
As soon as you overflow the maximum integer representable in 32 bits, which is either about 2 billion or about 4 billion depending on whether the value is signed or not, you need to revert to raw encoding.
So probably
2147483647 -> five bytes (TYPE_INT 0x7F 0xFF 0xFF 0xFF)
2147483649 -> eleven bytes (TYPE_RAW '2' '1' '4' '7' '4' '8' '3' '6' '4' '9')
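The guess above can be written down as a small Python model. This is only a sketch of the hypothesis in this answer (a signed 32-bit integer costs five bytes, anything else costs one byte plus its length), not Redis's actual serialization code. The function name is made up, and the rule that the value must round-trip through int() unchanged is an extra assumption of mine.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def guessed_serialized_length(value):
    """Predict serializedlength under the hypothesis described in this answer."""
    try:
        number = int(value)
    except ValueError:
        number = None
    # Assumption: only a canonical decimal string within signed 32-bit range counts as an int.
    if number is not None and str(number) == value and INT32_MIN <= number <= INT32_MAX:
        return 5  # one type byte + four bytes of integer
    return 1 + len(value.encode("utf-8"))  # one type byte + the raw string

for sample in ["hello", "857715", "hello w", "857715023", "hello wo", "85771502315"]:
    print(sample, guessed_serialized_length(sample))
```

The predictions for these six values agree with the serializedlength figures reported in the question (6, 5, 8, 5, 9 and 12).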
Now, how can you squeeze a string representation PROVIDED THAT YOU ONLY USE AN ASCII SET?
You can get the string (140 characters):
When in the Course of human events it becomes necessary for one people
to dissolve the political bands which have connected them with another
and convert each character to a six-bit representation; basically its index in the string
"ABCDEFGHIJKLMNOPQRSTUVWXYZ01234 abcdefghijklmnopqrstuvwxyz56789."
which is the set of all the characters you can use.
You can now encode four such "text-only characters" in three "binary characters", a sort of "reverse base 64 encoding"; base64 encoding will get three binary characters and create a four-byte sequence of ASCII characters.
If we were to code it as groups of integers, we would save a few bytes - maybe get it down
to 130 bytes - at the cost of a larger overhead.
With this type of "reverse base64" encoding, we can get 140 character to 35 groups of four characters, which become a string of 35x3 = 105 binary characters, raw encoded to 106 bytes.
As long, I repeat, as you never use characters outside the range above. If you do, you can
enlarge the range to 128 characters and 7 bits, thus saving 12.5% instead of 25%; 140 characters will then become 123 (140 x 7 / 8 = 122.5, rounded up), raw encoded to 124 bytes, and you save (141-124) = 17 bytes.
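Here is a minimal Python sketch of the six-bit "reverse base64" packing, using the 64-character alphabet given above. The function names pack6 and unpack6 are made up, and passing the character count to the decoder is a simplification I chose so that padding bits can be dropped.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ01234 abcdefghijklmnopqrstuvwxyz56789."

def pack6(text):
    """Pack each character as a 6-bit alphabet index; raises ValueError on other characters."""
    bits = 0
    for ch in text:
        bits = (bits << 6) | ALPHABET.index(ch)
    nbits = 6 * len(text)
    pad = -nbits % 8                      # zero bits appended to reach a whole byte
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")

def unpack6(data, count):
    """Recover `count` characters from bytes produced by pack6."""
    bits = int.from_bytes(data, "big") >> (len(data) * 8 - 6 * count)
    return "".join(ALPHABET[(bits >> (6 * (count - 1 - i))) & 0x3F] for i in range(count))

line = "When in the Course of human events it becomes necessary for one people"
packed = pack6(line)
print(len(line), len(packed))
```

Any 140-character input over this alphabet packs to 105 bytes, as computed above; the 70-character line here packs to 53.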
Compression
If you have much longer strings, you can compress them (i.e., you use a function such as deflate() or gzencode() or gzcompress() ). The straightforward way is to compress each string by itself, in which case the above string becomes 123 bytes. Easy to do.
Compressing many small strings: the Rube Goldberg approach
Since compression algorithms learn, and at the beginning they dare assume nothing, small strings will not compress greatly. They're "all beginning", so to speak. Just like an engine, performance is worse when running cold.
If you have a "corpus" of text these strings come from, you can use a time-consuming trick that "warms up" the compression engine and may double (or better) its performance.
Suppose you have two strings, COMMON and TARGET (the second one is the one you're interested in). If you z-compressed COMMON you would get, say, ZCMN. If you compressed TARGET you would get ZTRGT.
But as I said, since the gz compression algorithm is stream oriented, and it learns as it goes by, the compression ratio of the second half of any text (provided there aren't freakish statistical distribution changes between halves) is always appreciably higher than that of the first half.
So if you were to compress, say, COMMONTARGET, you'd get ZCMGHQI.
Notice that the first part of the string, as far as almost the end, is the same as before. Indeed if you compressed COMMONFOOBAR, you'd get something like ZCMQKL. And the second part is compressed better than before, even if we count the area of overlap as belonging entirely to the second string.
And this is the trick. Given a family of strings (TARGET, FOOBAR, CASTLE BRAVO), we compress not the strings, but the concatenation of those strings with a large prefix. Then we discard from the result the common compressed prefix. Thus TARGET is taken from the compression of COMMONTARGET (which is ZCMGHQI), and becomes GHQI instead of ZTRGT, with a 20% gain.
The decoder does the reverse: given GHQI, it first applies the common compressed prefix ZCM (which it must know); then it decodes the result, and finally discards the common uncompressed prefix, of which it need only know the length beforehand.
So the first sentence above (140 characters) becomes 123 when compressed by itself; if I take the rest of the Declaration and use it as a prefix, it compresses to 3355 bytes. This prefix plus my 140 bytes becomes 3409 bytes, of which 3352 are common, leaving 57 bytes.
At the cost of storing once the uncompressed prefix in the encoder, and the compressed prefix once in the decoder, and the whole thingamajig running five times as slow, I can now get those 140 bytes down to 57 instead of 123 - less than half of before.
This trick works great for small strings; for larger ones, the advantage isn't worth the pain. Also, different prefixes yield different results. The best prefixes are those that contain most of the sequences that are likely to appear in the string pool, ordered by increasing length.
Added bonus: the compressed prefix also doubles as a sort of weak encryption, as without that, you can't easily decode the compressed strings, even if you might be able to recover some pieces thereof.
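A closely related mechanism that zlib offers out of the box is the preset dictionary. It is not the same thing as cutting a shared compressed prefix off the output, but it pursues the same goal of "warming up" the compressor with common text. The Python sketch below uses the zdict parameter of zlib.compressobj and zlib.decompressobj; the COMMON text and the helper names are only placeholders for this example.

```python
import zlib

# Placeholder "corpus"; in practice this would be text typical of the strings to store.
COMMON = (b"When in the Course of human events it becomes necessary for one people "
          b"to dissolve the political bands which have connected them with another")

def compress_with_prefix(data, prefix=COMMON):
    """Compress data with the compressor primed by a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=prefix)
    return compressor.compress(data) + compressor.flush()

def decompress_with_prefix(blob, prefix=COMMON):
    """Reverse compress_with_prefix; the same dictionary must be supplied."""
    decompressor = zlib.decompressobj(zdict=prefix)
    return decompressor.decompress(blob) + decompressor.flush()

target = b"it becomes necessary for one people to dissolve the political bands"
plain = zlib.compress(target, 9)
primed = compress_with_prefix(target)
print(len(target), len(plain), len(primed))
```

As with the prefix trick, both sides have to hold the shared text, and the gain depends heavily on how well that text matches the strings being stored.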

Translating Hexadecimal To Value In Vb.net [For Torrent Trackers]

I'm basically trying to achieve this: how to get the peers from a torrent tracker
I'm stuck here:
Not only that, you have to send the actual value of the hash as a GET parameter. "76a36f1d11c72eb5663eeb4cf31e351321efa3a3" is a hexadecimal representation of the hash, but the tracker protocol specifies that you need to send the value of the hash (=bytestring). So you have to first decode the hexadecimal representation and then URL encode it: urllib.urlencode( [('info_hash', '76a36f1d11c72eb5663eeb4cf31e351321efa3a3'.decode('hex'))] ) == 'info_hash=v%A3o%1D%11%C7.%B5f%3E%EBL%F3%1E5%13%21%EF%A3%A3' # in Python.
I have researched quite a lot, but due to my newbie coding skills I can't manage to do the following in VB.NET. Could anyone please enlighten me?
I need to do the same thing:
Conversion from the hexadecimal representation to the bytestring value of the hash.
Thanks in advance

I was surprised there was not an easier way to turn a hex string into a byte array, but I didn't locate one quickly so here is the hard way:
Dim hex As String = "76a36f1d11c72eb5663eeb4cf31e351321efa3a3"
'prepend leading zero if needed
If hex.Length Mod 2 <> 0 Then
hex = "0" & hex
End If
Dim bytes As Byte() = New Byte((hex.Length \ 2) - 1) {}
For byteNum As Int32 = 0 To bytes.Length - 1
bytes(byteNum) = Byte.Parse(hex.Substring(byteNum * 2, 2),
Globalization.NumberStyles.AllowHexSpecifier)
Next
'percent-encode each byte directly; a round trip through a String would re-encode bytes above &H7F
Dim final As String = String.Concat(
Array.ConvertAll(bytes, Function(b) "%" & b.ToString("X2")))
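For comparison, the Python 2 one-liner quoted in the question can be written for Python 3 as below. This is a sketch using only the standard library (bytes.fromhex and urllib.parse.urlencode); the variable names are mine.

```python
from urllib.parse import urlencode

hex_hash = "76a36f1d11c72eb5663eeb4cf31e351321efa3a3"

# Decode the hexadecimal text into the 20 raw bytes of the hash.
raw_hash = bytes.fromhex(hex_hash)

# urlencode percent-encodes bytes values directly, without any text encoding step.
query = urlencode([("info_hash", raw_hash)])
print(query)
```

This gives a reference value to check the VB.NET result against: the decoded bytes must match, although the VB.NET code may percent-encode more characters than Python does.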

Why does this code encode random salt first as hexadecimal digits?

I'm looking at some existing code that is generating a salt which is used as input into an authentication hash.
The salt is 16 bytes long, and is generated by first using an OS random number generator to get 8 bytes of random data.
Then each byte in the 8 byte buffer is used to place data into 2 bytes of the 16 byte buffer as follows:
out[j] = hexTable[data[i] & 0xF];
out[j-1] = hexTable[data[i] >> 4 & 0xF];
Where out is the 16 byte salt, data is the initial 8 byte buffer, j and i are just loop incrementers obviously, and hexTable is just an array of the hex digits i.e. 0 to F.
Why is all this being done? Why isn't the 16 byte salt just populated with random data to begin with? Why go through this elaborate process?
Is what is being done here a standard way of generating salts? What's the benefit and point of this over just generating 16 random bytes in the first place?

This is simply conversion of your 8 random bytes to 16 hexadecimal digits.
It seems that someone misunderstood the concept of salt, or what input your hash needs, and thought it only accepts hexadecimal digits.
Maybe also the salt is stored somewhere where it is easier to store hexadecimal digits instead of pure bytes, and the programmer thought it would be good to be able to reuse the stored salt as-is (i.e. without converting it back to bytes first).
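To see what the loop in the question amounts to, here is a minimal Python sketch of the same expansion. The names HEX_TABLE and hex_salt are made up, and lowercase hex digits are an assumption since the question does not show the table.

```python
import os

HEX_TABLE = "0123456789abcdef"  # assumed contents of hexTable

def hex_salt(data):
    """Expand each byte into two hex digit characters, high nibble first."""
    out = []
    for byte in data:
        out.append(HEX_TABLE[(byte >> 4) & 0xF])
        out.append(HEX_TABLE[byte & 0xF])
    return "".join(out)

salt = hex_salt(os.urandom(8))  # 8 random bytes become 16 characters
print(salt)
```

Note that the result is 16 characters long but still carries only the 64 bits of randomness of the original 8 bytes.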

VBA - Read file byte by byte on system with Asian locale

I am trying to convert a file from binary to text, by simply replacing each character with the hexadecimal code. For example, character 'c' will be replaced by '63'.
I have code that works fine on other systems, but it breaks on the PC where I need to use it, because its default locale is set to Chinese.
I am using the following statements to read a byte -
ch$ = " "
Get #f%, , ch$
I suspect there is a problem when I am reading the file byte by byte, as it is skipping certain bytes because they form composite characters. It's probably reading 2 bytes which form an Asian character as one byte. It is thus forming a much smaller file than the expected size.
How can I read the file byte by byte?
Full code is pasted here: http://pastebin.com/kjpSnqzV

Your suspicion is correct. VB file reading automatically converts strings into Unicode from the default code page on the PC. On an Asian code page, some characters are represented as more than one byte.
I advise you to use a Byte variable rather than a string - that will stop VB being over helpful.
Dim ch As Byte
Get #f%, , ch
Another possible problem with the original code is that some byte sequences are illegal on Asian code pages (they don't represent valid characters). So your code could experience errors for some input files, but presumably you want it to work with any file.
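The effect is easy to reproduce outside VBA. The Python sketch below uses the "gbk" codec as a stand-in for a Chinese code page and three hand-picked bytes; it shows how decoding as text merges two bytes into one character, while working on the bytes directly keeps all three.

```python
# Three bytes: ASCII "c" followed by a two-byte GBK sequence.
data = bytes([0x63, 0xD6, 0xD0])

as_text = data.decode("gbk")                      # text view: the last two bytes form one character
hex_dump = "".join(f"{b:02x}" for b in data)      # byte view: every byte is kept

print(len(as_text), hex_dump)
```

Reading into a Byte variable in VBA corresponds to the byte view here, so no bytes are merged or skipped.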
