I am utterly confused about HMSET.
If I create a hash with HMSET containing N (field, value) pairs, then:
Can I access the value of a single field in O(1) time? (The documentation says "Time complexity: O(N) where N is the number of fields being requested.", so accessing one field should be O(1).) Does it behave just like a dictionary? Is that assumption correct?
Yes - Hashes exhibit O(1) complexity when accessing a single field.
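A minimal sketch with the redis-py client, assuming a local Redis server; the key name "user:1000" and the field names are just illustrative:

```python
import redis

# Assumes a local Redis server and the redis-py client.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Create a hash with several field/value pairs (HSET supersedes the deprecated HMSET).
r.hset("user:1000", mapping={"name": "Alice", "age": "30", "city": "Berlin"})

# Fetching a single field is O(1), just like a dictionary lookup.
print(r.hget("user:1000", "name"))           # -> "Alice"

# Fetching k fields with HMGET is O(k), independent of the hash's total size.
print(r.hmget("user:1000", "name", "city"))  # -> ["Alice", "Berlin"]
```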
Related
What would be the worst-case time complexity for finding a key that appears twice in a sorted array using binary search? I know that the worst-case time complexity of binary search on a sorted array is O(log n), so if the key appears more than once the time complexity should be less than O(log n). However, I am not sure how to calculate this.
In the worst case the binary search needs to perform ⌊log_2(n) + 1⌋ iterations to find the element or to conclude that the element is not in the array.
With a duplicate you might need just one step fewer.
For instance, suppose the duplicated element appears at the first and second indices of the array (the same holds if it appears at the last index and the one before it).
In that case you would need ⌊log_2(n)⌋ comparisons, so the worst-case time complexity is still O(log n).
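A small Python sketch that counts iterations illustrates the point; the function name and test arrays are my own:

```python
def binary_search_steps(arr, key):
    """Standard iterative binary search; returns (index or None, iterations used)."""
    lo, hi = 0, len(arr) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if arr[mid] == key:
            return mid, steps
        elif arr[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, steps

# Worst case (key absent): floor(log2(n)) + 1 iterations.
arr = list(range(1, 17))                  # n = 16
print(binary_search_steps(arr, 17))       # (None, 5) = floor(log2(16)) + 1

# Searching for the smallest element: 4 iterations without a duplicate...
print(binary_search_steps(arr, 1))        # (0, 4)

# ...and one iteration fewer when that element is duplicated at the front.
arr_dup = [1, 1] + list(range(2, 16))     # still n = 16
print(binary_search_steps(arr_dup, 1))    # (1, 3)
```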
I need to map tens of thousands of 4-byte strings to one or more boolean values each. I don't mind using up a whole word for the booleans if it means faster retrieval. However, my data has such tight constraints that I imagine there are some, albeit minor, optimizations that can be made if these constraints are reported to the storage engine in advance. Does Redis have any way to take advantage of this?
Here is a sample of my data:
"DENL": false
"NLES": false
"NLUS": true
"USNL": true
"AEGB": true
"ITAE": true
"ITFR": false
The keys are the concatenation of two ISO 3166-1 alpha-2 codes, so they are guaranteed to be 4 uppercase English letters.
The data structures I have considered using are:
Hashes to map the 4 byte keys to a string representing the booleans
A separate set for each boolean value
And since my keys only contain uppercase English letters, there are only 26^4 = 456976 possible combinations (which comes out to about 56 KB per bit stored per key):
One or more strings that are accessed with bitwise operations (GETBIT, BITFIELD) using a function to convert the key string to a bit index.
I think that sets are probably the most elegant solution and that a binary string over all possible combinations would be the most efficient. I would like to know whether there is some kind of middle ground, like a set with fixed-length strings as members. I would expect a data type optimized for fixed-length strings to provide faster searching than one optimized for variable-length strings.
It is slightly better to use the 4-letter country-code combination as a simple key, with an empty value.
The set data type is really a hash map where the keys are the elements, added to the hash map with a NULL value. I wouldn't use a set, as that implies two hashes and two lookups into a hash map: the first for the set's key in the database and the second, internal to the set, for the element.
Use the existence of the key to mean either "needs customs declaration" or "does not need a customs declaration", as Tomasz says.
Using simple keys allows you to use the SET command with NX/XX conditions, which may be handy in your logic:
NX -- Only set the key if it does not already exist.
XX -- Only set the key if it already exists.
Use the EXISTS command instead of GET, as it is slightly faster (no type checking, no value fetching).
Another advantage of simple keys over sets is that you can get the values of multiple keys at once using MGET:
> MGET DENL NLES NLUS
1) ""
2) ""
3) (nil)
To be able to do complex queries, assuming these are rare and not optimized for performance, you can use SSCAN (if you go with sets) or KEYS (if you go with simple keys). However, if you go with simple keys you had better use a dedicated database; see SELECT.
To query for those with NL on the left side you would use:
KEYS NL??
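A minimal redis-py sketch of this simple-key approach, using key names from the sample data above; which codes are present is chosen to match the MGET output shown earlier:

```python
import redis

r = redis.Redis(decode_responses=True)

# Store each 4-letter code as its own key with an empty value.
# nx=True -> only set the key if it does not already exist (SET ... NX).
r.set("DENL", "", nx=True)
r.set("NLES", "", nx=True)

# EXISTS is enough for membership tests; no value has to be fetched.
print(r.exists("DENL"))                 # 1 -> present
print(r.exists("NLUS"))                 # 0 -> absent

# Batch lookups with MGET: present keys return "", absent keys return None.
print(r.mget("DENL", "NLES", "NLUS"))   # ["", "", None]

# Rare, non-performance-critical queries can use a glob pattern.
print(r.keys("NL??"))                   # e.g. ["NLES"]
```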
There are a couple of optimizations you could try:
use a set and treat membership as either "needs customs declaration" or "does not need a customs declaration", depending on which one has fewer values; then with SISMEMBER you can check whether your key is in that set, which gives you the correct answer,
have a look at the introduction to Redis data types, chapter "Bitmaps": if you pre-define all of your keys in some array, you can use the SETBIT and GETBIT operations to store the "needs customs declaration" flag for a given bit number (the key's index in the array), as in the sketch below.
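The key-to-bit-index conversion for the bitmap idea could look like this; the code_to_bit helper and the "needs_customs" key name are my own, assuming keys only ever contain the 26 uppercase letters A-Z:

```python
import redis

r = redis.Redis()

def code_to_bit(code: str) -> int:
    """Map a 4-letter uppercase code to a unique index in [0, 26**4)."""
    index = 0
    for ch in code:
        index = index * 26 + (ord(ch) - ord("A"))
    return index

# 26**4 = 456976 bits, i.e. roughly 56 KB for the whole bitmap.
r.setbit("needs_customs", code_to_bit("NLUS"), 1)
r.setbit("needs_customs", code_to_bit("AEGB"), 1)

print(r.getbit("needs_customs", code_to_bit("NLUS")))   # 1
print(r.getbit("needs_customs", code_to_bit("DENL")))   # 0
```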
Just out of curiosity: let's say I have a list containing N elements (which may repeat) and a function which returns the frequency of those elements. I think the time complexity of this program should be O(N), right? The function just needs to loop through the N elements and check whether each element already exists in the counts; if yes, +=, else = 1. Okay, so my friend and I have an argument: what if we also need to multiply each element by its frequency, and maybe divide by the total number of elements? My friend thinks the complexity should be O(N^2), but that doesn't sound right to me. What do you think, and why?
Thank you.
It depends on how you record the frequencies. If you use a plain array/list and have to scan it for the previous frequency value before each +=, the complexity is quadratic. However, if you maintain a hash table, where access is O(1) on average, the complexity is linear.
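A quick sketch of that argument in Python; the function name and example list are mine:

```python
from collections import defaultdict

def weighted_frequencies(items):
    """Count frequencies in one pass, then compute value * frequency / total.

    Both loops are O(N) on average, assuming dictionary (hash table) access
    is O(1); nothing here requires a nested scan over the list.
    """
    counts = defaultdict(int)
    for x in items:                      # first pass: O(N)
        counts[x] += 1

    total = len(items)
    weighted = {}
    for x, freq in counts.items():       # second pass: at most N distinct keys
        weighted[x] = x * freq / total
    return dict(counts), weighted

print(weighted_frequencies([1, 2, 2, 3, 3, 3]))
```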
Let's say we have a table or an array of length N.
What happens if the length of the table is odd?
With an even-length table the middle would be determined by doing N/2.
I would assume that for an odd-length table this still holds true.
Does it do N/2 and then take the integer part of the result and use that as "the middle", or does it round up/down? Or does it do something else entirely?
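For what it's worth, a typical textbook implementation simply uses integer (floor) division on the index range; a small sketch (function name mine):

```python
def middle_index(n):
    """Index of the 'middle' element for a 0-based array of length n.

    Integer division truncates, so for odd n this picks the exact middle
    element, and for even n it picks the left of the two middle candidates
    (some implementations round the other way; either choice keeps binary
    search at O(log n)).
    """
    lo, hi = 0, n - 1
    return (lo + hi) // 2

print(middle_index(7))   # 3 -> the exact middle of indices 0..6
print(middle_index(8))   # 3 -> lower of the two middle indices 3 and 4
```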
I am doing indexing of data in my IRE (Information Retrieval and Extraction) course. Instead of storing terms in the index, I am storing a termID, which is a mapping corresponding to the term. If the length of a term is 15, its size would be 15 bytes, i.e. 120 bits, while if I use a termID instead of the term I can definitely store it in fewer than 120 bits. One possible way is to maintain a dictionary of (term, termID) pairs, where termID runs from 1..n, n being the number of terms. The problems with this method are:
I have to keep this dictionary in RAM, and the dictionary size can be in GBs.
Finding the termID corresponding to a term takes O(log(n)), where n is the number of terms in the dictionary.
Can I make some function which takes a term as input and returns the mapping (an encoding) in O(1)? It is okay if there are a few collisions. (Just guessing that a few collisions in exchange for speed and memory is a good trade-off; BTW, I don't know how much it will affect my search results.)
Is there any other better way to do this?
I think you more or less gave the answer already by saying "it is OK if there are a few collisions". The trick is hashing. You can first reduce the number of "characters" in your search terms: e.g., drop numbers and special characters. Afterwards you can merge upper- and lower-case characters. Finally you could apply some simple replacements, e.g. replacing the German ü by ue (which is actually its origin). After doing so you probably end up with something like 32 distinct characters, so a four-character block can be packed into well under four bytes. If you reserve 4 bytes for each word, you still need to deal with longer words; there you can basically XOR the 4-byte blocks together.
An alternative approach would be a hybrid dictionary. If you build a dictionary for only the 10k most frequent words, you most likely already cover most of the text. Hence, you only need to keep part of your dictionary in memory, while for the remaining words you can use a dictionary on disk or maybe even ignore them.
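A sketch of such a hash-based term-to-ID mapping; the normalization rules, function name, and 32-bit ID size are illustrative assumptions, not a prescribed scheme:

```python
import hashlib
import re

def term_id(term: str, bits: int = 32) -> int:
    """Map a term to a fixed-size integer ID in O(1) time (independent of vocabulary size).

    Collisions are possible but rare for a 32-bit space and a vocabulary of a
    few million terms; increase `bits` if that is not acceptable.
    """
    # Normalization: lower-case, fold a few German umlauts, drop non-letters.
    term = term.lower()
    term = term.replace("ü", "ue").replace("ö", "oe").replace("ä", "ae")
    term = re.sub(r"[^a-z]", "", term)

    # A deterministic hash (unlike Python's built-in hash()) so IDs are stable across runs.
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

print(term_id("Zürich"))   # same ID every run, no dictionary lookup needed
print(term_id("zuerich"))  # identical after normalization
```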