In one use case I have to search by key, and in another I have to search by value. Given this, what's the best approach, since scanning the entire cache to filter by value can degrade performance?
Should I do a reverse store, i.e. store the value as the key and the key as the value, in the same logical table?
Or use a different Redis database and store the pair as value | key? I've seen a few posts suggesting that using multiple databases is a bad idea and deprecated.
Or is there a better alternative/approach?
Do you really need to use Redis? Redis (and key-value stores in general) is not optimized for this kind of task.
If you need to stick with Redis, you can create an index to implement search by value. It will not be as storage-efficient or as intuitive as, say, a SQL database table, though. See the documentation here: https://redis.io/topics/indexes
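To make the indexing idea concrete, here is a minimal redis-py sketch of a reverse index, assuming string values and made-up key names; note that updates and deletes would also have to maintain the index set.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def put(key, value):
    # Store the primary mapping and index the value in one atomic step.
    pipe = r.pipeline()
    pipe.set(key, value)
    # A Set per value acts as the reverse index; several keys may share a value.
    pipe.sadd(f"idx:value:{value}", key)
    pipe.execute()

def keys_for_value(value):
    # Direct index lookup instead of scanning every key.
    return r.smembers(f"idx:value:{value}")

put("user:1", "alice")
put("user:2", "alice")
print(keys_for_value("alice"))  # {b'user:1', b'user:2'}
```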
We are doing a migration from AWS Redshift to GCP BigQuery.
Problem statement:
We have a Redshift table that uses the IDENTITY column functionality to issue an internal EDW surrogate key (PK) for natural/business keys. These natural keys are from at least 20 different source systems for customers. We need a method to identify them in case natural keys are somehow duplicated (because we have so many source systems). In BigQuery, the functionality of the Redshift IDENTITY column does not exist. How can I replicate this in BQ?
We can't use GENERATE_UUID() because all our downstream clients have been using a BIGINT for the last 4 years. All history is keyed on BIGINT, and too much would need to change to move to a VARCHAR.
Does anyone have any ideas, recommendations or suggestions?
Some considerations I have made:
1. Load the data into Spark, keep it in memory, and use Scala or Python functions to issue the surrogate key.
2. Use a NoSQL data store (but this does not seem like a good fit for the use case).
Any ideas are welcome!
In cases like this, the idea is generally to identify an injective/bijective function that maps each input into a unique key space.
How about you try something like SELECT UNIX_MICROS(current_timestamp()) + x AS identity, where x is a number you can manage (using CASE statements or IF conditions) based on the business name or something similar?
You can also eliminate x from this formula if you intend to process things linearly in some order, such as one business entity at a time.
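As a sketch of the same formula outside SQL, here is a hypothetical Python version; the source-system offsets are invented for illustration:

```python
import time

# Hypothetical per-source-system offsets; in the SQL version these would
# come from CASE/IF expressions over the business name.
SOURCE_OFFSETS = {"crm": 1, "billing": 2, "web": 3}

def surrogate_key(source_system: str) -> int:
    # UNIX_MICROS(current_timestamp()) + x, expressed in Python.
    # Assumption: no two rows from the same source arrive in the same
    # microsecond; a batch load would need an extra per-row counter.
    return int(time.time() * 1_000_000) + SOURCE_OFFSETS[source_system]

print(surrogate_key("crm"))  # a BIGINT-sized integer, e.g. 1700000000000001
```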
Hope it helps.
From my understanding, we can compare SQL vs NoSQL to array vs hashmap/dict.
(Let's consider PostgreSQL vs MongoDB just for a context)
SQL is arranged in tables and searches through the rows for what you're looking for.
NoSQL is arranged in a key-value way, so if you know the key, you'll get the value "directly" without needing to search through anything.
With the above in mind, when I run a SQL query using only the primary key in my WHERE clause to fetch one item, does it still do a row search, or does it do a "direct" hit on the row?
I hope my question is clear.
Primary keys are guaranteed to be unique. Unique keys are implemented using indexes, which in all databases that I know of are B-trees.
A query on a primary key uses the B-tree to access the data. This is O(log n) in complexity.
Some databases support other index structures, such as hash tables. That would generally make such a lookup more like O(1) rather than O(log n).
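As a rough illustration of the difference (not how a real database implements its indexes), here is a small Python sketch using a sorted list with binary search as a stand-in for a B-tree, and a dict as a stand-in for a hash index:

```python
import bisect

# Stand-in for a B-tree index: keys kept in sorted order, found by binary search.
pks = list(range(0, 3_000_000, 3))  # one million primary keys

def btree_lookup(key):
    # O(log n): roughly 20 comparisons for a million keys.
    i = bisect.bisect_left(pks, key)
    return i if i < len(pks) and pks[i] == key else None

# Stand-in for a hash index: average O(1), a single bucket probe.
hash_index = {pk: row for row, pk in enumerate(pks)}

print(btree_lookup(300_000))    # found after ~20 comparisons
print(hash_index.get(300_000))  # found with one hash probe
```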
I don't think you are on a fruitful path trying to differentiate NoSQL from SQL databases by looking at such examples. You should look at the requirements they are trying to satisfy, starting with the ACID properties and concepts such as eventual consistency.
Is there a way to search by a parent part of the key in Redis?
For example: X:Y = [1,2] and X:Z = [4,6]
Both keys share the parent part X.
Can I run some sort of operation to get X = [1,2,4,6]?
Redis has no built-in ability to do that, but you can use it to build it.
Yes, you can search for keys in Redis by their name, but doing so is inefficient. Refer to SCAN for more information.
A more performant way is to index your keys, so searching is done in sub-linear time. Refer to Secondary Indexing with Redis for some pointers.
Once you've retrieved the names of your keys, it appears that you want the union of their values. One candidate data type that supports this functionality is the Redis Set via the SUNION command.
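Putting the SCAN and SUNION pieces together, a minimal redis-py sketch might look like this (key names taken from the question; a large keyspace would want the index-based approach instead of SCAN):

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

# Store each child key as a Redis Set, per the question's example.
r.sadd("X:Y", 1, 2)
r.sadd("X:Z", 4, 6)

# SCAN for every key under the X: prefix. This is linear in the total
# number of keys, which is why an explicit index is the sub-linear
# alternative mentioned above.
children = list(r.scan_iter(match="X:*"))

# SUNION combines the member sets: {b'1', b'2', b'4', b'6'}.
print(r.sunion(children))
```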
An entirely different alternative to scanning/indexing, sets, and unions is to use a single data structure for all the "keys" sharing the same prefix ("X"). The Redis Hash can do that for you, and while it doesn't offer an equivalent of the union operation on its fields, that can be implemented with a Lua script (or even in the application).
Other than these two approaches, I'm confident there are more ways to use Redis to achieve what you're trying to do. Choosing the right one is a matter of understanding all the requirements, but I'm afraid that information is missing from the question.
I know that having many partition keys reduces the opportunity for batch processing (entity group transactions) in Azure Table Storage. However, I wonder whether there is any performance issue in terms of reading as well. For example, suppose I designed my Azure table such that every new entity gets a new partition key, and I end up with 1M or more partition keys. Is there any performance disadvantage for read queries?
If the operation you perform most often is a Point Query (both PartitionKey and RowKey specified), the unique-partition-key design is quite good. However, if your typical operation is a Table Scan (no PartitionKey specified), the design will be awful.
You can refer to the chapter "Design for querying" in the Azure Table Design Guide for the details.
A point query is the most efficient way to retrieve a single entity, by specifying a single PartitionKey and RowKey using equality predicates. If your PartitionKey is unique, you may consider using a constant string as the RowKey so that you can leverage point queries. The choice of design also depends on how you plan to read/retrieve your data. If you always plan to use point queries to retrieve the data, this design makes sense.
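For illustration, here is what the two access patterns look like with the azure-data-tables Python SDK; the connection string, table name, and key values are hypothetical:

```python
from azure.data.tables import TableClient

# Hypothetical connection string and table name.
table = TableClient.from_connection_string(
    conn_str="<your-connection-string>", table_name="Customers"
)

# Point query: PartitionKey and RowKey both given, so storage can go
# straight to the entity. A constant RowKey works if PartitionKey is unique.
entity = table.get_entity(partition_key="customer-12345", row_key="const")

# Filter with no PartitionKey: every partition must be read (table scan).
scan = table.query_entities("Email eq 'a@example.com'")
```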
Please see the "New PartitionKey Value for Every Entity" section in the following article: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx. In short, it will scale very well, since our storage system can load balance across many partitions. However, if your application needs to retrieve data without specifying a PartitionKey, it will be inefficient because it will result in a table scan.
Please send me an email at ascl#microsoft.com if you want to discuss your table design further.
I'm developing a job service with features like radial search, full-text search, and the ability to combine full-text search with filtering out certain job listings (such as un-checking a checkbox so full-time jobs are no longer returned).
The developer who is working on Sphinx wants the database information to all be stored as integers with a key (so in the "Job Type" table, values might be stored such that 1 = "part-time" and 2 = "full-time"), whereas the other developers want to keep the database as strings (so the "Job Type" table says "part-time" or "full-time").
Is there a reason to keep the database as ints? Or should strings be fine?
Thanks!
Walker
Choosing your key can have a dramatic performance impact. Whenever possible, use ints instead of strings. This is called using a "surrogate key", where the key provides a unique and quick way to find the data, rather than the data standing on its own.
String comparisons are resource intensive, potentially orders of magnitude worse than comparing numbers.
You can drive your UI off of the surrogate key but show another column (such as job_type). This way, when you hit the database you pass the int in, and avoid searching the table for a row with a matching string.
When it comes to joining tables in the database, the joins will run much faster if you have ints or other numbers as your primary keys.
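To make the surrogate-key pattern concrete, here is a small self-contained sketch using Python's built-in sqlite3; the table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Lookup table: the integer id is the surrogate key.
    CREATE TABLE job_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    INSERT INTO job_type VALUES (1, 'part-time'), (2, 'full-time');

    -- Jobs reference the cheap integer key, not the string.
    CREATE TABLE job (id INTEGER PRIMARY KEY, title TEXT,
                      job_type_id INTEGER REFERENCES job_type(id));
    INSERT INTO job VALUES (10, 'Cashier', 1), (11, 'Engineer', 2);
""")

# The UI filters by the int; the string only appears in the display join.
for row in con.execute(
    "SELECT job.title, job_type.name FROM job "
    "JOIN job_type ON job_type.id = job.job_type_id "
    "WHERE job.job_type_id = ?", (2,)
):
    print(row)  # ('Engineer', 'full-time')
```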
Edit: In the specific case you have mentioned, if your field can only have two values and that is unlikely to change, you may want to look into something like a bit field, which you could name IsFullTime. A bit or boolean field holds a 1 or a 0 and nothing else, and typically doesn't need a related lookup table at all.
If you are normalizing your structure (I hope you are), then numeric keys will be the most efficient.
Aside from the usual reasons to use integer primary keys, the use of integers with Sphinx is essential, as the result set returned by a successful Sphinx search is a list of document IDs associated with the matched items. These IDs are then used to extract the relevant data from the database. Sphinx does not return rows from the database directly.
For more details, see the Sphinx manual, especially 3.5. Restrictions on the source data.
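For example, once Sphinx has returned its matched document IDs, the database fetch is just an IN query. This sketch assumes the hypothetical job table from the earlier example and a ready-made ID list:

```python
import sqlite3

# Suppose a Sphinx search already returned these document IDs, best match first.
matched_ids = [11, 10]

con = sqlite3.connect("jobs.db")  # hypothetical database file
placeholders = ",".join("?" * len(matched_ids))
rows = con.execute(
    f"SELECT id, title FROM job WHERE id IN ({placeholders})",
    matched_ids,
).fetchall()

# SQL's IN clause doesn't preserve Sphinx's ranking, so re-order in code.
by_id = {row[0]: row for row in rows}
ranked = [by_id[i] for i in matched_ids if i in by_id]
```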