Does using a GUID (or similar) key carry a performance penalty in Redis?

Does using a GUID or ulong key impact Redis DB performance?
Similar: Does name length impact performance in Redis?

This question is an old one, but other answers are a bit misleading. Eric's answer is totally unrelated to Redis. Pfreixes's answer is based on personal assumptions and is simply wrong.
In fact, it's fairly safe to use GUID keys (performance-wise) as even 300+ character keys don't affect performance significantly on O(1) operations. Check this benchmark: Does name length impact performance in Redis?.
A GUID is typically 32-36 characters long in its hex representation. As Evan Carrol noted in the comments, Redis keys are binary safe, so you can use the raw binary value and reduce the key size to 128 bits (16 bytes). Keys of that length won't hurt performance at all.
The documentation also suggests hashing really large keys: http://redis.io/topics/data-types-intro
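To make the binary-safe point concrete, here is a minimal sketch using the redis-py client (the client, host, and key names are my own assumptions, not part of the answer above): it stores the same GUID once as a 32-character hex key and once as its raw 16-byte binary form.

import uuid
import redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis instance

guid = uuid.uuid4()

# Hex representation: 32 characters (36 with dashes).
r.set(f"user:{guid.hex}", "some value")

# Binary representation: Redis keys are binary safe, so the raw 16 bytes work too.
r.set(b"user:" + guid.bytes, "some value")

Either form is a perfectly ordinary Redis key; the binary one simply takes 16 bytes instead of 32.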

Redis uses a hash table to store all keys: every key is run through a hash function, and the performance of key handling in Redis comes down to that function and the structures around it.
The original key is also stored so that collisions between different keys can be resolved, and yes, very large keys can have an impact on memory handling and everything related to it: memory fragmentation, cache hits/misses, etc.
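To illustrate the mechanism this answer is describing, here is a deliberately simplified Python sketch of a chained hash table that keeps the full original key in each bucket so that colliding keys can be told apart by comparing the complete keys. It is only an illustration of the general idea, not Redis's actual dict implementation.

class TinyHashTable:
    # Simplified chained hash table: the full original key is kept in the bucket
    # so lookups can distinguish different keys that hash to the same slot.
    def __init__(self, slots=16):
        self.buckets = [[] for _ in range(slots)]

    def _slot(self, key):
        return hash(key) % len(self.buckets)

    def set(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # full-key comparison; longer keys cost a bit more here
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None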

Related

Does Redis keep all the characters of prefixed keys?

I may have 100 million long, partially static keys like:
someReal...LongStaticPrefix:12345
someReal...LongStaticPrefix:12
someReal...LongStaticPrefix:123456
Where only the last part of the key is dynamic, the rest is static.
Does Redis keep all keys long or does it make an internal alias or something like that?
Should I worry about storage or performance?
Or is it better if I make an internal alias for the keys to keep them short?
Redis does keep the whole key. This long prefix will impact your memory usage.
Given that Redis uses a hash map to store the keys, the performance impact is low. The hash map's load factor is usually between 0.5 and 1, which means there are typically only one or two keys per hash slot. So the performance cost of a long key is the extra network payload, the slightly longer hashing, and the longer comparisons against the one or two keys in the slot. It's likely negligible unless your prefix is really, really long.
Consider a shorter key prefix.
Before considering using a hash structure (HSET), consider if you are using redis cluster or if you may need to eventually. A single hash key cannot be sharded.
A minor optimization: consider putting the static part at the end (as a suffix) so that key comparisons in the hash slot chain fail faster.
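If you want to see the prefix overhead for yourself, a rough sketch with redis-py (assuming Redis 4.0+ so the MEMORY USAGE command is available, and a recent redis-py that exposes it; the key names are just examples) could look like this:

import redis

r = redis.Redis()

long_key = "someRealllllllllllllllllyLongStaticPrefix:12345"
short_key = "p:12345"

r.set(long_key, "x")
r.set(short_key, "x")

# MEMORY USAGE reports the bytes attributed to each key and its value;
# the difference here is essentially the extra key length plus allocator overhead.
print(r.memory_usage(long_key), r.memory_usage(short_key))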

Key types supported by Redis

What are the different key types supported by Redis? The documentation mentions the various types of values supported by Redis (strings, sets, hashes, etc.), but I couldn't quite find information about the key type.
From redis documentation (Data types intro):
Redis keys
Redis keys are binary safe, this means that you can use any binary sequence as a key, from a string like "foo" to the content of a JPEG file. The empty string is also a valid key.
A few other rules about keys:
Very long keys are not a good idea. For instance a key of 1024 bytes is a bad idea not only memory-wise, but also because the lookup of the key in the dataset may require several costly key-comparisons. Even when the task at hand is to match the existence of a large value, hashing it (for example with SHA1) is a better idea, especially from the perspective of memory and bandwidth.
Very short keys are often not a good idea. There is little point in writing "u1000flw" as a key if you can instead write "user:1000:followers". The latter is more readable and the added space is minor compared to the space used by the key object itself and the value object. While short keys will obviously consume a bit less memory, your job is to find the right balance.
Try to stick with a schema. For instance "object-type:id" is a good idea, as in "user:1000". Dots or dashes are often used for multi-word fields, as in "comment:1234:reply.to" or "comment:1234:reply-to".
The maximum allowed key size is 512 MB.
In my experience "any binary sequence" typically means a string, though there may be client languages where you can achieve this with other data types.
Keys in Redis are all strings, so it doesn't really matter what kind of value you pass into a client. Under the hood, the RESP protocol is used, and it passes the value to the engine as a string.
Example:
ZADD some_key 1 some_value
some_key is always a string; even if you pass 3 as the key, it is handled as a string. This is true for every client.
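As a quick illustration (a sketch using the redis-py client; the key and member names are made up), whatever Python value you pass as the key is serialized to a string on the wire, so the server only ever sees a string key:

import redis

r = redis.Redis()

r.zadd(3, {"some_value": 1})   # the int 3 is sent as the string "3"
print(r.exists("3"))           # 1 -- it is the same key
print(r.type("3"))             # b'zset'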

Naming Convention and Valid Characters for a Redis Key

I was wondering what characters are considered valid in a Redis key. I have googled for some time and cannot find any useful info.
In Python, for example, a valid variable name must consist of characters from the class [a-zA-Z0-9_]. What are the requirements and conventions for Redis keys?
Part of this is answered here, but this isn't completely a duplicate, as you're asking about allowed characters as well as conventions.
As for valid characters in Redis keys, the manual explains this completely:
Redis keys are binary safe, this means that you can use any binary sequence as a key, from a string like "foo" to the content of a JPEG file. The empty string is also a valid key.
A few other rules about keys:
Very long keys are not a good idea, for instance a key of 1024 bytes is a bad idea not only memory-wise, but also because the lookup of the key in the dataset may require several costly key-comparisons. Even when the task at hand is to match the existence of a large value, to resort to hashing it (for example with SHA1) is a better idea, especially from the point of view of memory and bandwidth.
Very short keys are often not a good idea. There is little point in writing "u1000flw" as a key if you can instead write "user:1000:followers". The latter is more readable and the added space is minor compared to the space used by the key object itself and the value object. While short keys will obviously consume a bit less memory, your job is to find the right balance.
Try to stick with a schema. For instance "object-type:id" is a good idea, as in "user:1000". Dots or dashes are often used for multi-word fields, as in "comment:1234:reply.to" or "comment:1234:reply-to".
The maximum allowed key size is 512 MB.
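As a concrete example of the "hash very long keys" advice in the quote above, here is a small sketch (the function and namespace names are made up) that derives a short, fixed-length key from an arbitrarily large value with SHA1:

import hashlib

def key_for(namespace: str, large_value: bytes) -> str:
    # SHA1 gives a fixed 40-character hex digest regardless of the input size.
    digest = hashlib.sha1(large_value).hexdigest()
    return f"{namespace}:{digest}"

print(key_for("doc", b"...imagine a multi-kilobyte payload here..."))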

Java EE/SQL: Is there a significant performance lag between primary key types?

Currently I am involved in learning some basics of the Java EE technology. I encountered a particular project and took a deeper look into the underlying database structure.
On the server side, I investigated a Java function that creates a primary key with a length of 32 characters (based on concatenating the time, a random hash, and an additional cryptographic nonce).
I am interested in an estimate of the performance loss caused by using such a primary key. If there is no security reason to create this kind of unique ID, wouldn't it be much better to let the underlying database generate increasing primary keys, starting at 0?
Wouldn't a SQL/JQL search be much faster when using numbers instead of strings?
Using numbers will probably be faster, but you should measure it with a test case if you need the performance ratio between both options.
I don't think number comparison vs string comparison will give a big performance advantage by itself. However:
larger fields typically mean less data per table block, so you have to read more blocks from the DB in a full scan (it will be slower)
accordingly, larger keys typically mean fewer keys per index block, so you have to read more index blocks in an index scan (it will be slower)
larger fields are, well, larger, so by definition they are less space-efficient.
Note that we are talking about data size, not data type: most likely an 8-byte integer will not be significantly more efficient than an 8-byte string.
Note also that using random IDs is usually more "clusterable" than sequence numbers, as sequences / autonumerics need to be administered centrally (although this can be mitigated using techniques such as the Hi-Lo algorithm; most current persistence frameworks support it).
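For readers unfamiliar with it, the Hi-Lo idea mentioned above can be sketched roughly as follows. This is a simplified, single-process illustration with a fake in-memory counter standing in for the database; real persistence frameworks allocate the "hi" block transactionally.

class HiLoGenerator:
    # The database hands out a "hi" block; the client generates "lo" values
    # locally, so most IDs are assigned without a database round trip.
    def __init__(self, fetch_next_hi, block_size=100):
        self.fetch_next_hi = fetch_next_hi   # callback that atomically increments a DB counter
        self.block_size = block_size
        self.hi = None
        self.lo = block_size                 # force a fetch on first use

    def next_id(self):
        if self.lo >= self.block_size:
            self.hi = self.fetch_next_hi()   # one round trip per block of IDs
            self.lo = 0
        self.lo += 1
        return self.hi * self.block_size + self.lo

counter = {"hi": 0}
def fake_fetch():
    counter["hi"] += 1
    return counter["hi"]

gen = HiLoGenerator(fake_fetch)
print([gen.next_id() for _ in range(5)])   # [101, 102, 103, 104, 105]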

INT vs Unique-Identifier for ID field in database

I am creating a new database for a web site using SQL Server 2005 (possibly SQL Server 2008 in the near future). As an application developer, I've seen many databases that use an integer (or bigint, etc.) for an ID field of a table that will be used for relationships. But lately I've also seen databases that use the unique identifier (GUID) for an ID field.
My question is whether one has an advantage over the other? Will integer fields be faster for querying and joining, etc.?
UPDATE: To make it clear, this is for a primary key in the tables.
GUIDs are problematic as clustered keys because of the high randomness. This issue was addressed by Paul Randal in the last Technet Magazine Q&A column: I'd like to use a GUID as the clustered index key, but the others are arguing that it can lead to performance issues with indexes. Is this true and, if so, can you explain why?
Now bear in mind that the discussion is specifically about clustered indexes. You say you want to use the column as an 'ID'; it is unclear whether you mean it as the clustered key or just the primary key. Typically the two overlap, so I'll assume you want to use it as the clustered index. The reasons why that is a poor choice are explained in the article linked above.
For non-clustered indexes GUIDs still have some issues, but not nearly as big as when they are the leftmost clustered key of the table. Again, the randomness of GUIDs introduces page splits and fragmentation, even if only at the non-clustered index level (a much smaller problem).
There are many urban legends surrounding GUID usage that condemn them based on their size (16 bytes) compared to an int (4 bytes) and promise horrible performance doom if they are used. This is slightly exaggerated. A 16-byte key can still be a very performant key in a properly designed data model. While it is true that being 4 times as big as an int results in lower-density non-leaf pages in indexes, this is not a real concern for the vast majority of tables. The b-tree is a naturally well-balanced structure, and the depth of tree traversal is seldom an issue, so seeking a value by a GUID key is similar in performance to seeking by an INT key. A leaf-page traversal (i.e. a table scan) does not look at the non-leaf pages at all, and the impact of the GUID's size on the leaf pages is typically quite small, as the record itself is usually significantly larger than the extra 12 bytes introduced by the GUID. So I'd take the hearsay advice based on "16 bytes vs. 4 bytes" with a rather large grain of salt. Analyze each case individually and decide whether the size impact makes a real difference: how many other columns are in the table (i.e. how much impact does the GUID size have on the leaf pages) and how many references use it (i.e. how many other tables grow because they need to store a larger foreign key).
I'm calling out all these details in a sort of makeshift defense of GUIDs because they've been getting a lot of bad press lately, and some of it is undeserved. They have their merits and are indispensable in any distributed system (the moment you're talking about data movement, be it via replication, the sync framework, or whatever). I've seen bad decisions made because of the GUID's bad reputation, when they were shunned without proper consideration. But it is true: if you have to use a GUID as the clustered key, make sure you address the randomness issue and use sequential GUIDs when possible.
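One common way to get roughly sequential GUIDs outside of SQL Server's NEWSEQUENTIALID() is a COMB-style GUID that puts a timestamp in the leading bytes. A rough Python sketch of the idea (an illustration only, not SQL Server's algorithm; note that SQL Server's uniqueidentifier type has its own byte ordering, so a real implementation would need to account for that):

import os
import time
import uuid

def comb_guid() -> uuid.UUID:
    # 48-bit millisecond timestamp in the most significant bytes, then 80 random bits.
    # Values generated this way sort roughly in generation order when compared as raw bytes.
    millis = int(time.time() * 1000)
    return uuid.UUID(bytes=millis.to_bytes(6, "big") + os.urandom(10))

print(comb_guid())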
And finally, to answer your question: if you don't have a specific reason to use GUIDs, use INTs.
A GUID is going to take up more space and be slower than an int, even if you use the newsequentialid() function. If you are going to do replication or use the sync framework, you pretty much have to use a GUID.
INTs are 4 bytes, BIGINTs are 8 bytes, and GUIDs are 16 bytes. The more space required to represent the data, the more resources required to process it -- disk space, memory, etc. So (a) they're slower, but (b) this probably only matters if volume is an issue (millions of rows, or thousands of transactions in very, very little time).
The advantage of GUIDs is that they are (pretty much) Globally Unique. Generate a guid using the proper algorithm (and SQL Server xxxx will use the proper algorithm), and no two guids will ever be alike--no matter how many computers you have generating them, no matter how frequently. (This does not apply after 72 years of usage--I forget the details.)
If you need unique identifiers generated across multiple servers, GUIDs may be useful. If you need maximum performance and under 2 billion values, ints are probably fine. Lastly and perhaps most importantly, if your data has natural keys, stick with them and forget the surrogate values.
If you positively, absolutely have to have a globally unique ID, then use a GUID. Meaning, if you're ever going to merge, sync, or replicate, you should probably use a GUID.
For less demanding cases, an int should suffice, depending on how large the table will grow.
As in most cases, the proper answer is, it depends.
Use them for replication etc, not as primary keys.
Kimberly L Tripp article
Against: Space, not strictly monotonic, page splits, bookmark/RIDs etc
For: er...
Fully agreed with JBrooks.
I want to say that when your table is large and you use SELECTs with JOINs, especially with derived tables, using GUIDs can significantly decrease performance.