Redis mass insertion consistency or limitation

I need to insert 6,925,144 unique keys into Redis, where each key holds a hash of data.
I used the Ruby mass-insertion script published on the main Redis site.
The whole insertion takes ~3 minutes, but DBSIZE after the insertion is 1,277,553, while I expected 6,925,144.
I am not sure whether Redis missed some records, whether DBSIZE is calculated differently for HASH keys, or whether 1,277,553 is just some natural limitation.
What is the best and easiest way to check the consistency of the insertion?

If you are sure about the number of records, the difference in DBSIZE is most likely real.
Keep an eye on the logs while doing the insert.
Do you have duplicates in the dataset? Do you need a merge function?
Does the ID get correctly encoded into a unique string or 32/64-bit int?
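Before blaming Redis, it is worth counting the distinct keys in the source data: a duplicate-heavy dataset would explain the gap exactly. A minimal Python sketch of that check (the one-record-per-line, key-first parsing is an assumption; adapt it to your real format):

```python
def count_distinct_keys(lines):
    """Return (total, distinct) counts for the key column."""
    seen = set()
    total = 0
    for line in lines:
        key = line.split()[0]  # assumption: key is the first whitespace-separated field
        seen.add(key)
        total += 1
    return total, len(seen)

# Tiny illustrative sample; in practice, pass open("data.txt") instead.
sample = ["k1 a", "k2 b", "k1 c", "k3 d"]
print(count_distinct_keys(sample))  # (4, 3): one duplicate key
```

If the distinct count matches DBSIZE, the "missing" records were simply overwrites of duplicate keys.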

Related

Should I split a Primary key into Partition Key and Row Key components?

I want to store data in an Azure Table. The Primary Key for this data will be an MD5 hash.
To get a good balance of performance and scalability it is a good idea to use a combination of both Partition Key and Row Key in the Azure Table.
I am considering splitting the MD5 hash into two parts at an arbitrary point. I will probably use the first three or so characters for the Partition Key so as to have a higher likelihood of collisions, and therefore end up with Partitions that each have a decent quantity of Row entries in them. The rest of the characters will make up the Row Key. This would mean the data is spread over 4,096 Partitions.
The overall dataset could become large, on the order of hundreds of thousands of records.
I am aware that atomic operations can more easily be done across entries in the same Partition; this is not a concern for me.
Is this Key-splitting approach worth considering? Or should I simply go for the simpler approach and have the Partition Key use the entire MD5 hash, with an empty Row Key?
Both of your approaches are fine. Basically, 4,096 partitions are sufficient for scaling; if you want even better scalability, use the full MD5 as the partition key, since you don't need atomic operations within a partition. Please note that the row key can't be an empty string, so consider using a constant string, or the same value as the partition key (the full MD5), instead.
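The splitting idea can be sketched in a few lines of Python; `split_md5_key` and the 3-character prefix are illustrative choices for this answer, not Azure API calls:

```python
import hashlib

def split_md5_key(entity_id: str, prefix_len: int = 3):
    """Split the MD5 hex digest of an id into (PartitionKey, RowKey).

    A prefix of 3 hex characters yields 16**3 = 4096 possible partitions.
    """
    digest = hashlib.md5(entity_id.encode()).hexdigest()
    return digest[:prefix_len], digest[prefix_len:]

pk, rk = split_md5_key("some-entity-id")
print(len(pk), len(rk))  # 3 29  (an MD5 hex digest is 32 characters)
```

The full-MD5-as-partition-key variant is just `prefix_len=32` with a constant row key.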

Out of Process in memory database table that supports queries for high speed caching

I have a SQL table that is accessed continually but changes very rarely.
The Table is partitioned by UserID and each user has many records in the table.
I want to save database resources and move this table closer to the application in some kind of memory cache.
In process caching is too memory intensive so it needs to be external to the application.
Key Value stores like Redis are proving inefficient due to the overhead of serializing and deserializing the table to and from Redis.
I am looking for something that can store this table (or partitions of data) in memory, but let me query only the information I need without serializing and deserializing large blocks of data for each read.
Is there anything that would provide Out of Process in memory database table that supports queries for high speed caching?
Searching has shown that Apache Ignite might be a possible option, but I am looking for more informed suggestions.
Since it's out-of-process, it has to do serialization and deserialization. The real problem is how to reduce that serialization/deserialization work. If you use Redis' STRING type, you cannot reduce it.
However, you can use HASH to solve the problem: map your SQL table to a HASH.
Suppose you have the following table: person: id (varchar), name (varchar), age (int). You can take the person id as the key, and name and age as fields. When you want to look up someone's name, you only need to fetch that one field (HGET person-id name); the other fields won't be deserialized.
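To make the field-level access concrete, here is a tiny in-memory Python sketch of that layout. It is a stand-in only; a real setup would use a client such as redis-py issuing HSET/HGET against a live server:

```python
# In-memory stand-in for the Redis HASH layout described above.
store = {}

def hset(key, field, value):
    """Set one field of the hash stored at key (mimics HSET)."""
    store.setdefault(key, {})[field] = value

def hget(key, field):
    """Get one field of the hash stored at key (mimics HGET)."""
    return store.get(key, {}).get(field)

# One hash per row of the person table, keyed by person id.
hset("person:42", "name", "alice")
hset("person:42", "age", "30")

# Fetching the name touches only that field, not the whole row.
print(hget("person:42", "name"))  # alice
```

The point of the pattern is that a read deserializes only the requested field, not the entire record.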
Ignite is indeed a possible solution for you, since you can optimize serialization/deserialization overhead by using its internal binary representation to access objects' fields. You may refer to this documentation page for more information: https://apacheignite.readme.io/docs/binary-marshaller
Access overhead may also be optimized by disabling the copy-on-read option: https://apacheignite.readme.io/docs/performance-tips#section-do-not-copy-value-on-read
Data collocation by user id is also possible with Ignite: https://apacheignite.readme.io/docs/affinity-collocation
As @for_stack said, HASH is very suitable for your case.
You said that each user has many rows in the DB, indexed by user_id and tag_id. So (user_id, tag_id) uniquely specifies one row; every row functionally depends on this tuple, so you can use the tuple as the HASH key.
For example, to save the row (user_id, tag_id, username, age) with the values ("123456", "FDSA", "gsz", 20) into Redis, you could do this:
HMSET 123456:FDSA username "gsz" age 20
When you want to query the username by user_id and tag_id, you could do this:
HGET 123456:FDSA username
So every HASH key is a combination of user_id and tag_id. If you want the keys to be more human-readable, you can add a prefix string such as "USERINFO", e.g. USERINFO:123456:FDSA.
BUT if you want to query with only a user_id and get all rows for that user_id, the method above is not enough.
You can build secondary indexes in Redis for your HASHes.
As said above, we use user_id:tag_id as the HASH key because it uniquely points to one row. To query all the rows for one user_id,
we can use a sorted set as a secondary index, recording which HASHes store the info about that user_id.
We add this to the sorted set:
ZADD user_index 0 123456:FDSA
As above, we set the member to the HASH key string and the score to 0. The rule is that every score in this zset must be 0; then we can use lexicographical ordering to do range queries. See ZRANGEBYLEX.
E.g., to get all the rows for user_id 123456:
ZRANGEBYLEX user_index [123456 (123457
It returns every HASH key whose prefix is 123456, and we then use those strings as HASH keys with HGET or HMGET to retrieve the information we want.
[ means inclusive, and ( means exclusive. Why 123457? To get all rows for a user_id, we build the exclusive upper bound by incrementing the ASCII value of the prefix's final character by one, which is how 123456 becomes 123457.
For more about lex indexes, see the ZRANGEBYLEX documentation.
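The bound construction for the prefix query can be sketched in Python; `lex_prefix_range` is a hypothetical helper, and the in-memory filter only imitates what ZRANGEBYLEX does server-side:

```python
def lex_prefix_range(prefix: str):
    """Build ZRANGEBYLEX bounds that match members starting with prefix.

    '[' marks an inclusive bound and '(' an exclusive one; the upper
    bound increments the prefix's final character ('123456' -> '123457').
    """
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return "[" + prefix, "(" + upper

lo, hi = lex_prefix_range("123456")
print(lo, hi)  # [123456 (123457

# The same bounds applied to an in-memory sorted list of members:
members = sorted(["123456:FDSA", "123456:ABCD", "999999:ZZZZ"])
matched = [m for m in members if lo[1:] <= m < hi[1:]]
print(matched)  # ['123456:ABCD', '123456:FDSA']
```

In real usage the two strings returned by `lex_prefix_range` would be passed straight to ZRANGEBYLEX.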
You can try Apache Mnemonic, started by Intel. Link: http://incubator.apache.org/projects/mnemonic.html. It supports serdeless (serialization-free) features.
For a read-dominant workload, the MySQL MEMORY engine should work fine (write DMLs lock the whole table). That way you don't need to change your data-retrieval logic.
Alternatively, if you're okay with changing the data-retrieval logic, then Redis is also an option. To add to what @GuangshengZuo has described, there's ReJSON, a dynamically loadable Redis module (for Redis 4+) which implements a document store on top of Redis. It can further relax the requirements for marshalling big structures back and forth over the network.
With just 6 principles (which I collected here), it is very easy for a SQL-minded person to adapt to the Redis approach. Briefly, they are:
The most important thing is that, don't be afraid to generate lots of key-value pairs. So feel free to store each row of the table in a different key.
Use Redis' hash map data type
Form key name from primary key values of the table by a separator (such as ":")
Store the remaining fields as a hash
When you want to query a single row, directly form the key and retrieve its results
When you want to query a range, use the wildcard "*" in your key pattern. But be aware that scanning keys blocks other Redis operations, so use this method only if you really have to.
The link just gives a simple table example and how to model it in Redis. Following those 6 principles, you can continue to think the way you do for normal tables (of course, without some not-so-relevant concepts such as CRUD, constraints, relations, etc.).
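Principle 3 (forming the key name from the primary-key values with a ":" separator) can be sketched as a one-line helper; `form_key` is a hypothetical name for this answer:

```python
def form_key(table: str, *pk_values) -> str:
    """Form a Redis key name from the table name and the primary-key
    values, joined with ':' (principle 3 above)."""
    return ":".join([table] + [str(v) for v in pk_values])

# A row of a hypothetical person table keyed by (id, tag):
print(form_key("person", 123, "FDSA"))  # person:123:FDSA
```

The remaining non-key columns then go into the hash stored under that key (principles 2 and 4).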
Using a Memcached and Redis combination on top of MySQL comes to mind.

Correct modeling in Redis for writing single entity but querying multiple

I'm trying to convert data from a SQL DB to Redis, in order to gain much higher throughput, because the load is very high. I'm aware of the downsides regarding persistence, storage costs, etc.
So, I have a table called "Users" with few columns. Let's assume: ID, Name, Phone, Gender
Around 90% of the requests are writes, each updating a single row.
Around 10% of the requests are reads, each fetching 20 rows.
I'm trying to get my head around the right modeling of this in order to get the max out of it.
If there were only updates - I would use Hashes.
But because of the 10% of Reads I'm afraid it won't be efficient.
Any suggestions?
Actually, the real question is whether you need to support partial updates.
Supposing partial updates are not required, you can store your record in a blob associated with a key (i.e. the string datatype). All write operations can be done in one roundtrip, since the record is always written at once. Several read operations can be done in one roundtrip as well, using the MGET command.
Now, supposing partial updates are required, you can store your record in a dictionary associated with a key (i.e. the hash datatype). All write operations can be done in one roundtrip (even partial ones). Several read operations can also be done in one roundtrip, provided the HGETALL commands are pipelined.
Pipelining several HGETALL commands is a bit more CPU-consuming than using MGET, but not by much. In terms of latency, it should not be significantly different, unless you execute hundreds of thousands of them per second on the Redis instance.
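A rough in-memory sketch of the two layouts discussed above (string blobs vs hashes). These are stand-ins only; real code would issue SET/MGET and HSET through a Redis client:

```python
import json

blobs, hashes = {}, {}

def set_blob(key, record):            # string type: whole record at once
    blobs[key] = json.dumps(record)

def mget(keys):                       # batch read, one roundtrip
    return [blobs.get(k) for k in keys]

def hset_fields(key, **fields):       # hash type: partial update
    hashes.setdefault(key, {}).update(fields)

set_blob("user:1", {"name": "a", "phone": "555", "gender": "f"})
set_blob("user:2", {"name": "b", "phone": "556", "gender": "m"})
print(len(mget(["user:1", "user:2"])))  # 2 records fetched in one call

hset_fields("user:1", phone="557")      # only the changed field is sent
print(hashes["user:1"])                 # {'phone': '557'}
```

With blobs, every write reserializes the full record; with hashes, a partial update touches only the changed fields, at the cost of pipelined HGETALLs on the read side.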

Finding Redis data by last update

I'm new to Redis and I want to use the following scheme:
key: EMPLOYEE_*ID*
value: *EMPLOYEE DATA*
I was thinking of adding a timestamp to the end of the key, but I'm not sure that would even help. Basically, I want to be able to get a list of the employees who are the most stale, i.e. those that haven't been updated for the longest time. What's the best way to accomplish this in Redis?
Keep another key with data about the employees (their key names) and their update timestamps; the best candidate for that is a Sorted Set. To maintain that key's data integrity, you'll have to update it with the pertinent changes whenever you update one of the employees' keys.
With that data structure in place, you can easily get the key names of the stalest employees with the ZRANGE command (lowest scores, i.e. oldest timestamps, first).
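A small Python sketch of that bookkeeping, using a plain dict as a stand-in for the sorted set (member -> score, where the score is the update timestamp; `touch` and `stalest` are hypothetical helpers standing in for ZADD and ZRANGE):

```python
import time

last_updated = {}  # stand-in for the sorted set: member -> score

def touch(employee_key, ts=None):
    """Record an update; a real client would call ZADD here."""
    last_updated[employee_key] = ts if ts is not None else time.time()

def stalest(n):
    """Return the n least-recently-updated employee keys
    (what ZRANGE with ascending scores would return)."""
    return sorted(last_updated, key=last_updated.get)[:n]

touch("EMPLOYEE_1", ts=100)
touch("EMPLOYEE_2", ts=200)
touch("EMPLOYEE_1", ts=300)  # an update moves the key to the fresh end
print(stalest(1))  # ['EMPLOYEE_2']
```

The crucial discipline is that every write to an employee key is paired with a `touch`, or the index silently drifts out of date.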
Have you tried filtering by expiration time? You could set the same expiration on all keys and refresh the expiration each time a key is updated. Then, with a Lua script, you could iterate through the keys and filter by remaining time-to-live: those with the smallest TTL are the ones that have not been updated recently.
This would work under some assumptions; it depends on how your system works. Also, the approach is O(N) in the number of employees, so while you may save space, it will not scale well with the number of entries and the frequency of scans.

DB2 v10 zos : identify free index values

My organisation has hundreds of DB2 tables, each of which has a randomly generated unique integer index. The random values are generated by either COBOL CICS mainframe programs or distributed Java applications. The normal approach is to randomly generate an integer value (only positive values are used), attempt to insert the data row, and retry whenever that index value has already been persisted. I would like to improve the performance of this approach, and I'm considering trying to identify integer values that have not yet been generated and persisted for each table; that would mean we never need to retry, since we would know our insert will work. Does DB2 have a function that can return unused index values?
The short answer is no.
The slightly longer answer is to point out that, if such a function existed, then in your case, on the first insert into one of your tables, the result set it returned would contain 2,147,483,647 (positive) integers. At 4 bytes each, that is 8,589,934,588 bytes.
Given the constraints of your existing system, what you're doing is probably the best that can be done. If the performance of retrying is unacceptable, I'm afraid redesigning your key scheme is the next step.
I think that's the question to ask: is this scheme of using random numbers for unique keys actually causing a performance problem? As the tables fill up the key space you will see more and more retries, but you have a relatively large key space. If you're seeing large numbers of retries, maybe your random numbers are less random than you'd like.
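To get a feel for how retry cost grows as the key space fills up, note that with n keys already used out of N possible values, each uniformly random draw collides with probability n/N, so the expected number of insert attempts is N/(N-n) (a geometric distribution). A small Python sketch of that arithmetic:

```python
N = 2**31 - 1  # positive 32-bit integers, per the question

def expected_attempts(n_used: int) -> float:
    """Expected insert attempts when n_used of N keys are taken."""
    return N / (N - n_used)

for fill in (0.1, 0.5, 0.9, 0.99):
    n = int(N * fill)
    print(f"{fill:.0%} full -> ~{expected_attempts(n):.1f} attempts per insert")
```

Even at 50% full, the average is only ~2 attempts per insert, which is why large retry counts usually point at a weak random generator rather than key-space exhaustion.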
Just a thought, but you could use one sequence for a group of tables. That way, the value will still be random (because you wouldn't know which table the next insert goes to) but based on a specific sequence, which means that most of the time you won't get a retry, because the numbers keep ascending. The same sequence can cycle after a few hundred million inserts and start to "fill in the blanks".
As far as other key ideas are concerned, you could also try a different key, maybe one based on a timestamp or ROWID. That would still be random but not repetitive.