Conversion of UTF16 to UTF32 - Invalid surrogate pair

While converting an array of UTF-16 code units to UTF-32, if I come across a high surrogate and the next value is neither a high surrogate nor a low surrogate, should I invalidate both values in the UTF-16 array,
or should I invalidate just the high surrogate and proceed with converting the next value?
Reference: https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
Thanks.
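For what it's worth, the Unicode standard's recommended (though not required) practice, the "maximal subpart" rule, corresponds to the second option: emit one U+FFFD replacement for the unpaired high surrogate and then re-examine the next unit on its own. A minimal Python sketch of that policy (the function name and the U+FFFD substitution are illustrative choices, not the only valid error handling):

REPLACEMENT = 0xFFFD

def utf16_to_utf32(units):
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)  # invalidate only the high surrogate...
            i += 1                   # ...and decode the next unit on its own
        elif 0xDC00 <= u <= 0xDFFF:  # stray low surrogate
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out

# 0xD800 followed by 'A' (0x0041): one U+FFFD, then 'A' survives.
print([hex(c) for c in utf16_to_utf32([0xD800, 0x0041])])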

Related

Are there any downsides to using nanoid for primary key?

I know that UUIDs and incrementing integers are often used for primary keys.
I'm thinking of using nanoids instead because they are URL-friendly without being guessable / brute-force scrapeable (like incrementing integers are).
Would there be any reason not to use nanoids as primary keys in a database like Postgres? (For example: Maybe they drastically increase query time since they aren't ... aligned or something?)
https://github.com/ai/nanoid
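For reference, nanoid's default shape (21 characters drawn from a 64-symbol URL-safe alphabet, about 126 bits of entropy) can be reproduced with the standard library alone; this is a sketch of the idea, not the library's own implementation:

import secrets

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-"  # 64 URL-safe symbols

def nanoid_like(size: int = 21) -> str:
    # 21 symbols * 6 bits/symbol = ~126 bits of entropy
    return "".join(secrets.choice(ALPHABET) for _ in range(size))

print(nanoid_like())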
Most databases use incrementing ids because it's more efficient to insert a new value at the end of a B-tree based index.
If you insert a new value into a random place in the middle of a B-tree, it may have to split the B-tree nonterminal node, and that could cause the node at the next higher level to split, and so on up to the top of the B-tree.
This also has a greater risk of causing fragmentation, which means the index takes more space for the same number of values.
Read https://www.percona.com/blog/2015/04/03/illustrating-primary-key-models-in-innodb-and-their-impact-on-disk-usage/ for a great visualization about the tradeoff between using an auto-increment versus UUID in a primary key.
That blog is about MySQL, but the same issue applies to any B-tree based data structure.
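A rough way to observe the effect locally, using SQLite as a stand-in (any B-tree index behaves similarly; the absolute numbers are meaningless, and the gap is far larger on a disk-backed database):

import sqlite3, time, uuid

def time_inserts(keys):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id PRIMARY KEY, payload TEXT)")
    start = time.perf_counter()
    con.executemany("INSERT INTO t VALUES (?, 'x')", ((k,) for k in keys))
    con.commit()
    return time.perf_counter() - start

n = 200_000
print("sequential int keys:", time_inserts(range(n)))  # appends at the right edge of the index
print("random uuid keys:   ", time_inserts(str(uuid.uuid4()) for _ in range(n)))  # splits pages all over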
I'm not sure if there is a disadvantage to using nanoids, but they are often unnecessary. While UUIDs are long, they can be translated to a shorter format without losing entropy.
See the NPM package (https://www.npmjs.com/package/short-uuid).
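For illustration, the idea behind that package (this is not short-uuid's actual API, and the base-57 alphabet below is an arbitrary choice): re-encode the UUID's 128 bits in a bigger alphabet, so the string gets shorter while the entropy stays the same.

import uuid

ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"  # 57 symbols

def shorten(u: uuid.UUID) -> str:
    n = u.int  # the UUID as a 128-bit integer
    out = []
    while n:
        n, r = divmod(n, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(reversed(out))

u = uuid.uuid4()
print(u)           # 36-char hex-and-dashes form
print(shorten(u))  # ~22 chars, same 128 bits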
UUIDs are standardized by the Open Software Foundation (OSF) and described in RFC 4122. That means far more tools are likely to give you some perks around them.
Some examples:
MongoDB has a special type to optimize the storage of UUIDs. Not only does a Nano ID string take more space as text, but its binary form also takes more bits (126 bits for Nano ID versus 122 random bits for a UUID).
I once saw a logging tool extract the timestamp from the UUIDs. I can't remember which tool it was, but that information is recoverable (see the sketch after this list).
Also, the long, non-reduced form of a UUID is very easy to identify visually. When the end user is a developer, that can help convey the nature/source of the ID (clearly not a database auto-increment key, for example).
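On the timestamp point: version-1 UUIDs embed a 60-bit timestamp counted in 100-nanosecond ticks since 1582-10-15, which is what such tooling recovers (random version-4 UUIDs carry none). A quick sketch:

import uuid
from datetime import datetime, timedelta

u = uuid.uuid1()  # version 1: built from MAC address + timestamp
gregorian_epoch = datetime(1582, 10, 15)
created = gregorian_epoch + timedelta(microseconds=u.time / 10)
print(u, "->", created)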

Key types supported by Redis

What are the different key types supported by Redis? The documentation mentions all the various types (strings, sets, hashes, etc.) of values supported by Redis, but I couldn't quite find information about the key type.
From the Redis documentation (data types intro):
Redis keys
Redis keys are binary safe, this means that you can use any binary sequence as a key, from a string like "foo" to the content of a JPEG file. The empty string is also a valid key. A few other rules about keys:
Very long keys are not a good idea. For instance a key of 1024 bytes is a bad idea not only memory-wise, but also because the lookup of the key in the dataset may require several costly key-comparisons. Even when the task at hand is to match the existence of a large value, hashing it (for example with SHA1) is a better idea, especially from the perspective of memory and bandwidth.
Very short keys are often not a good idea. There is little point in writing "u1000flw" as a key if you can instead write "user:1000:followers". The latter is more readable and the added space is minor compared to the space used by the key object itself and the value object. While short keys will obviously consume a bit less memory, your job is to find the right balance.
Try to stick with a schema. For instance "object-type:id" is a good idea, as in "user:1000". Dots or dashes are often used for multi-word fields, as in "comment:1234:reply.to" or "comment:1234:reply-to".
The maximum allowed key size is 512 MB.
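The "hash large values" advice in those rules, sketched with the redis-py client (an assumed choice; any client works the same way):

import hashlib
import redis  # assumes the redis-py package

r = redis.Redis()

def mark_seen(value: bytes) -> None:
    # Key by the 20-byte SHA1 digest instead of the possibly huge value itself.
    r.set(b"seen:" + hashlib.sha1(value).digest(), 1)

def was_seen(value: bytes) -> bool:
    return r.exists(b"seen:" + hashlib.sha1(value).digest()) == 1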
In my experience, "any binary sequence" typically means a string; there may be client languages where you can achieve this with other data types, but I'm not familiar with them.
Keys in Redis are all strings, so it doesn't really matter what kind of value you pass to a client. Under the hood, the RESP protocol is used, and it passes the value to the engine as a string.
Example:
ZADD some_key 1 some_value
some_key is always a string; even if you pass 3 as the key, it is handled as a string. This is true for every client.
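The same behavior seen through a client; a sketch assuming redis-py, whose encoder turns the integer key into the string "3" before it goes over RESP:

import redis  # assumes the redis-py package

r = redis.Redis()
r.zadd(3, {"some_value": 1})  # integer passed as the key name...
print(r.zrange("3", 0, -1))   # ...retrievable under the string "3": [b'some_value']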

Search performance of non-clustered index on binary and integer columns

I store fixed-size binary hashes of at most 64 bits in a Microsoft SQL Server 2012 database. The hash size may also be 48 or 32 bits. Each hash has an identifier Id. The table structure is like this:
Id int NOT NULL PRIMARY KEY,
Hash binary(8) NOT NULL
I created a non-clustered index on the Hash column for performance and fast hash lookups. I also tried creating integer columns instead of binary(n), depending on the byte count n. For example, I changed the column type from binary(4) to int.
Are there differences between indices on column types binary(8) and bigint or between binary(4) and int and so on?
Is it reasonable to store hashes as integers to improve search performance?
Under the covers the index is limited to a certain byte length. The smaller the better for IO. It's easy enough to convert between the datatypes using convert(varbinary(25),Hash) syntax, once you have the value of interest. You don't want to invoke a ton of converts while you're looking up the records.
If there is a difference, it might be due to the collation or statistics being used. A collation just determines, for two values, whether one is greater than, less than, or equal to the other. Statistics let the query skip past lots of values because it "knows" the data distribution.
When you have large strings and attempt LIKE '%value' lookups, the index doesn't help much. Hashes should be random, which means the focus is on the number of bytes that must be compared to make a query decision: the fewer the better.
The unhelpful but accurate CYA that every database engineer will give you: it depends, and you should test it.
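To make the binary/integer equivalence concrete: a binary(8) value and a bigint are the same eight bytes, and converting client-side is trivial (a Python sketch; inside SQL Server you would use CONVERT as noted above):

import struct

h = bytes.fromhex("0123456789abcdef")  # a 64-bit hash as it sits in binary(8)
as_int = struct.unpack(">q", h)[0]     # same bytes read as a signed 64-bit int (bigint)
assert struct.pack(">q", as_int) == h  # round-trips exactly
print(as_int)                          # hashes with the top bit set come out negative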

Naming Convention and Valid Characters for a Redis Key

I was wondering what characters are considered valid in a Redis key. I have googled for some time and cannot find any useful info.
In Python, for example, a valid variable name must belong to the character class [a-zA-Z0-9_]. What are the corresponding requirements and conventions for Redis keys?
Part of this is answered here, but this isn't completely a duplicate, as you're asking about allowed characters as well as conventions.
As for valid characters in Redis keys, the manual explains this completely:
Redis keys are binary safe, this means that you can use any binary sequence as a key, from a string like "foo" to the content of a JPEG file. The empty string is also a valid key.
A few other rules about keys:
Very long keys are not a good idea, for instance a key of 1024 bytes is a bad idea not only memory-wise, but also because the lookup of the key in the dataset may require several costly key-comparisons. Even when the task at hand is to match the existence of a large value, to resort to hashing it (for example with SHA1) is a better idea, especially from the point of view of memory and bandwidth.
Very short keys are often not a good idea. There is little point in writing "u1000flw" as a key if you can instead write "user:1000:followers". The latter is more readable and the added space is minor compared to the space used by the key object itself and the value object. While short keys will obviously consume a bit less memory, your job is to find the right balance.
Try to stick with a schema. For instance "object-type:id" is a good idea, as in "user:1000". Dots or dashes are often used for multi-word fields, as in "comment:1234:reply.to" or "comment:1234:reply-to".
The maximum allowed key size is 512 MB.
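A hypothetical helper that applies the quoted "object-type:id" convention (the function name is illustrative):

def redis_key(*parts) -> str:
    # Joins key segments with ":" per the "object-type:id:field" schema.
    return ":".join(str(p) for p in parts)

print(redis_key("user", 1000, "followers"))    # user:1000:followers
print(redis_key("comment", 1234, "reply-to"))  # comment:1234:reply-to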

Using GUID (or similar) has performance penalty in Redis?

Does using a GUID or ulong key impact Redis DB performance?
Similar: Does name length impact performance in Redis?
This question is an old one, but the other answers are a bit misleading. Eric's answer is totally unrelated to Redis. Pfreixes's answer is based on personal assumptions and is simply wrong.
In fact, it's fairly safe to use GUID keys performance-wise, as even 300+ character keys don't significantly affect the performance of O(1) operations. Check this benchmark: Does name length impact performance in Redis?
A GUID is typically 32-36 characters long in its hex representation. As Evan Carrol noted in the comments, Redis keys are binary safe, so you can use the binary value and reduce the key size to 128 bits (16 bytes). Keys of that length won't hurt performance at all.
Also, the documentation suggests using hashing functions for really large keys: http://redis.io/topics/data-types-intro
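To illustrate the binary-key option with redis-py (an assumed client; since keys are binary safe, any client can do this):

import uuid
import redis  # assumes the redis-py package

r = redis.Redis()
g = uuid.uuid4()
r.set(str(g), "hex form: a 36-byte key")   # e.g. '1b4e28ba-2fa1-11e2-...'
r.set(g.bytes, "raw form: a 16-byte key")  # same GUID, less than half the key size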
Redis uses a hashing strategy to store all keys: every key is stored via a hash function, and much of Redis's key-related performance comes down to that function or something related to it.
The original key is also stored in order to resolve collisions between different keys, and yes, big keys can have an impact on memory handling and everything related to it: memory fragmentation, cache hits/misses, etc.