Correct modeling in Redis for writing a single entity but querying multiple

I'm trying to move data that currently lives in a SQL database to Redis, in order to gain much higher throughput, because it is a very high-throughput workload. I'm aware of the downsides regarding persistence, storage costs, etc.
So, I have a table called "Users" with a few columns. Let's assume: ID, Name, Phone, Gender.
Around 90% of the requests are writes, each updating a single row.
Around 10% of the requests are reads, each fetching 20 rows.
I'm trying to get my head around the right modeling of this in order to get the max out of it.
If there were only updates, I would use hashes.
But because of the 10% of reads, I'm afraid it won't be efficient.
Any suggestions?

Actually, the real question is whether you need to support partial updates.
Supposing partial updates are not required, you can store your record in a blob associated with a key (i.e. the string datatype). All write operations can be done in one roundtrip, since the record is always written at once. Several read operations can be done in one roundtrip as well using the MGET command.
Now, supposing partial updates are required, you can store your record in a dictionary associated with a key (i.e. the hash datatype). All write operations can be done in one roundtrip (even if they are partial). Several read operations can also be done in one roundtrip provided the HGETALL commands are pipelined.
Pipelining several HGETALL commands is a bit more CPU-consuming than using MGET, but not that much. In terms of latency, it should not be significantly different, except if you execute hundreds of thousands of them per second on the Redis instance.
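For illustration, a minimal sketch of both options using the redis-py client; the user:{id} key naming and the field layout are assumptions, not anything prescribed by Redis.
# Hypothetical sketch with redis-py; key names (user:{id}) and fields are assumptions.
import json
import redis

r = redis.Redis(decode_responses=True)

# Option 1: whole record as a string blob (no partial updates).
def write_user_blob(user):
    r.set(f"user:{user['id']}", json.dumps(user))       # one roundtrip per write

def read_users_blob(ids):
    raw = r.mget([f"user:{i}" for i in ids])            # one roundtrip for all 20 rows
    return [json.loads(x) for x in raw if x is not None]

# Option 2: record as a hash (partial updates possible).
def update_user_fields(user_id, **fields):
    r.hset(f"user:{user_id}", mapping=fields)           # one roundtrip, even for a partial update

def read_users_hash(ids):
    pipe = r.pipeline(transaction=False)
    for i in ids:
        pipe.hgetall(f"user:{i}")                       # HGETALLs pipelined into a single roundtrip
    return pipe.execute()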

Related

Best Redis Data Type For Distributed Computation

I have an application that needs to use Redis with the following requirements:
A producer storing tens of millions of string records up to 128 bytes each
Indexing the records, as each worker needs to access records in its own predetermined range X to Y, so that multiple workers can process in parallel
Deleting the processed records and storing the results back in Redis under a different index
Which Redis data type is optimal for this?
I am considering sorted sets, where I would write the original strings in one set and the results in another, though I have read somewhere that they come with a 64-byte overhead and I'd like to save as much memory as possible, since that allows me to process more records. Another alternative is simple SET key value pairs, where I would use keys, say, 0-100,000,000 for the records to be processed and 100,000,000-200,000,000 for the corresponding result records.
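For illustration, the two layouts being compared could look roughly like this with the redis-py client; the key names (todo, todo:{i}) and the score-as-index convention are made up for the example.
# Hypothetical sketch of both candidate layouts with redis-py.
import redis

r = redis.Redis(decode_responses=True)

# Layout A: sorted sets, scored by record index; workers read their own score range.
def enqueue_sorted(index, record):
    r.zadd("todo", {record: index})

def fetch_range_sorted(start, stop):
    return r.zrangebyscore("todo", start, stop)

# Layout B: plain string keys, with the numeric index encoded in the key name.
def enqueue_plain(index, record):
    r.set(f"todo:{index}", record)

def fetch_range_plain(start, stop):
    keys = [f"todo:{i}" for i in range(start, stop + 1)]
    return r.mget(keys)          # one roundtrip for the whole slice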
Does anyone know how much memory overhead exists for each solution or can even propose a better one?

How well does a unique hash index perform in comparison to the record ID?

Following CQRS practices, I will need to supply a custom generated ID (like a UUID) in any create command. This means when using OrientDB as storage, I won't be able to use its generated RIDs, but rather perform lookups on a manual index using the UUIDs.
Now in the OrientDB docs it states that the performance of fetching records using the RID is independent of the database size O(1), presumably because it already describes the physical location of the record. Is that also the case when using a UNIQUE_HASH_INDEX?
Is it worth bending CQRS practices to request a RID from the database when assembling the create command, or is the performance difference negligible?
I have tested the performance of record retrieval based on RIDs and indexed UUID fields using a database holding 180,000 records. For the measurement, 30,000 records have been looked up, while clearing the local cache between each retrieval. This is the result:
RID: about 0.2s per record
UUID: about 0.3s per record
I've run queries while populating the database in 30,000-record steps. The retrieval time wasn't significantly influenced by the database size in either case. Don't mind the relatively high times, as this experiment was done on an overloaded PC; it's the relation between the two that is relevant.
To answer my own question: a UNIQUE_HASH_INDEX-based query is close enough to an RID-based query.

Redis PFADD to check an exists-in-set query

I have a requirement to process multiple records from a queue. But due to some external issues the items may sporadically occur multiple times.
I need to process each item only once.
What I planned to do is PFADD every record into Redis (as an md5sum) and then check whether that returns success. If it shows no increment, the record is a duplicate; otherwise, process the record.
This seems pretty straightforward, but I am getting too many false positives while using PFADD.
Is there a better way to do this ?
Being the probabilistic data structure that it is, Redis' HyperLogLog exhibits a 0.81% standard error. You can reduce (but never eliminate) the probability of false positives by using multiple HLLs, each counting the value of a different hash function applied to your record.
Also note that if you're using a single HLL there's no real need to hash the record - just PFADD as is.
Alternatively, use a Redis Set to keep all the identifiers/hashes/records and have 100%-accurate membership tests with SISMEMBER. This approach requires more (RAM) resources as you're storing each processed element, but unless your queue is really huge that shouldn't be a problem for a modest Redis instance. To keep memory consumption under control, switch between Sets according to the date and set an expiry on the Set keys (another approach is to use a single Sorted Set and manually remove old items from it by keeping their timestamp in the score).
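A minimal sketch of the Set-based variant with redis-py; the seen:<YYYYMMDD> key naming and the 48-hour expiry are assumptions chosen for the example.
# Hypothetical sketch with redis-py: exact membership tests via daily Sets with a TTL.
import datetime
import redis

r = redis.Redis(decode_responses=True)

def is_duplicate(record_id):
    key = "seen:" + datetime.date.today().strftime("%Y%m%d")
    added = r.sadd(key, record_id)   # SADD returns 1 if the member is new, 0 if it was already there
    r.expire(key, 48 * 3600)         # keep each daily Set for two days, then let it expire
    return added == 0

# Usage: skip the item when is_duplicate(...) is True, otherwise process it.
# Around midnight a stricter check would also consult the previous day's Set.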
In general, in distributed systems you have to choose between processing items either:
at most once
at least once
Processing something exactly once would be convenient; however, this is generally impossible.
That being said there could be acceptable workarounds for your specific use case, and as you suggest storing the items already processed could be an acceptable solution.
Be aware though that PFADD uses HyperLogLog, which is fast and scales well but only approximates the count of items, so in this case I do not think it is what you want.
However if you are fine with having a small probability of errors, the most appropriate data structure here would be a Bloom filter (as described here for Redis), which can be implemented in a very memory-efficient way.
A simple, efficient, and recommended solution would be to use a plain Redis key per item (or a hash field) storing a boolean-like value ("0"/"1" or "true"/"false"), for instance with HSET, or with SET and the NX option. You could also put it under a namespace if you wish. It has the added benefit that keys can be expired.
It would avoid using a Set (not the SET command, but rather the structure behind SINTER and SUNION), which doesn't necessarily work well with Redis Cluster if you want to scale to more than one node. SISMEMBER is still fine though (but it lacks some features of the approach above, such as a per-item time to live).
If you hash your records, I would also advise you to pick a hash function with a lower chance of collisions than MD5 (a collision means that two different objects end up with the same hash).
An alternative to hashing would be to assign a UUID to every item when putting it in the queue (or a SQUUID if you want to carry some time information).
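For concreteness, a small sketch of the key-per-item variant with redis-py; the dedup: prefix, the SHA-256 digest, and the 7-day TTL are assumptions used for illustration.
# Hypothetical sketch with redis-py: one namespaced key per item, written with SET ... NX EX.
import hashlib
import redis

r = redis.Redis(decode_responses=True)

def claim(record_bytes):
    digest = hashlib.sha256(record_bytes).hexdigest()   # lower collision risk than MD5
    key = f"dedup:{digest}"
    # nx=True makes SET succeed only if the key did not exist; ex sets a TTL in seconds.
    return r.set(key, "1", nx=True, ex=7 * 24 * 3600)

# Usage: process the record only when claim(...) returns True; None means it was seen before.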

What is the conventional way to store objects in a sorted set in redis?

What is the most convenient/fast way to implement a sorted set in Redis where the values are objects, not just strings?
Should I just store object IDs in the sorted set and then query each of them individually by its key, or is there a way to store the objects directly in the sorted set, i.e. must the value be a string?
It depends on your needs. If you need to share this data with other zsets/structures and want to write the value only once for every change, you can put an id as the zset member and add a hash to store the object. However, this implies additional queries when you read data from the zset (one ZRANGE + n HGETALLs for n members), but writing and synchronising the value across many structures is cheap (only the hash corresponding to the value is updated).
But if the data is "self-contained", with no or few accesses outside the zset, you can serialize your object to a chosen format (JSON, MessagePack, Kryo...) and store it directly as the zset member. This way you get better read performance from the zset (a single query with O(log(N)+M) complexity, which is about the best you can get), but you may have to duplicate the value in other zsets/structures if you need to read or write it elsewhere, which also means maintaining synchronisation of the value by hand.
Redis has good documentation on the complexity of each command, so check which queries you would write and estimate the total cost, so that you can compare these two options properly.
Also, don't forget that Redis only offers optimistic locking (WATCH/MULTI), so if you need pessimistic locking (because of contention, for instance) you will have to implement it by hand and/or with Lua scripts. If you need a lot of synchronisation, the first option seems better (lower read performance, but still good, and fewer queries and less complexity on writes); but if your values don't change much and memory space is not a problem, the second option will provide better read performance (you can duplicate the value in Redis and synchronize the copies periodically, for instance).
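For concreteness, a small redis-py sketch of both options; the leaderboard and obj:{id} key names are made up for the example.
# Hypothetical sketch with redis-py of the two modelling options for a zset of objects.
import json
import redis

r = redis.Redis(decode_responses=True)

# Option 1: the zset holds ids, the objects live in separate hashes.
def add_by_id(obj, score):
    r.hset(f"obj:{obj['id']}", mapping=obj)
    r.zadd("leaderboard", {obj["id"]: score})

def top_by_id(n):
    ids = r.zrange("leaderboard", 0, n - 1)      # one ZRANGE...
    pipe = r.pipeline(transaction=False)
    for i in ids:
        pipe.hgetall(f"obj:{i}")                 # ...plus n pipelined HGETALLs
    return pipe.execute()

# Option 2: the zset holds the serialized object itself.
def add_serialized(obj, score):
    r.zadd("leaderboard:blob", {json.dumps(obj, sort_keys=True): score})

def top_serialized(n):
    return [json.loads(m) for m in r.zrange("leaderboard:blob", 0, n - 1)]
Note that with option 2 an update means removing the old serialized member (ZREM) and adding the new one, since the member string itself carries the value.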
Short answer: yes, everything must be stored as a string.
Longer answer: you can serialize your object into any text-based format of your choosing. Most people choose MessagePack or JSON because they are compact and serializers are available in just about any language.

Stream with a lot of UPDATEs and PostgreSQL

I'm quite a newbie at PostgreSQL optimization and at choosing which jobs are appropriate for it and which are not. So, I want to know whether I'm trying to use PostgreSQL for an inappropriate job, or whether it is suitable and I just need to set everything up properly.
Anyway, I have a need for a database with a lot of data that changes frequently.
For example, imagine an ISP, having a lot of clients, each having a session (PPP/VPN/whatever), with two self-describing frequently updated properties bytes_received and bytes_sent. There is a table with them, where each session is represented by a row with unique ID:
CREATE TABLE sessions(
    id BIGSERIAL NOT NULL,
    username CHARACTER VARYING(32) NOT NULL,
    some_connection_data BYTEA NOT NULL,
    bytes_received BIGINT NOT NULL,
    bytes_sent BIGINT NOT NULL,
    CONSTRAINT sessions_pkey PRIMARY KEY (id)
);
And as accounting data flows in, this table receives a lot of UPDATEs like these:
-- There are *lots* of such queries!
UPDATE sessions SET bytes_received = bytes_received + 53554,
                    bytes_sent = bytes_sent + 30676
WHERE id = 42;
When we receive a never-ending stream with quite a lot (like 1-2 per second) of updates for a table with a lot (like several thousands) of sessions, probably thanks to MVCC, this makes PostgreSQL very busy. Are there any ways to speed everything up, or is Postgres simply not suitable for this task, so that I'd better move those counters to another storage like memcachedb and use Postgres only for fairly static data? But then I'd lose the ability to run occasional queries on this data, for example to find the TOP 10 downloaders, which is not really good.
Unfortunately, the amount of data cannot be lowered much. The ISP accounting example is made up to simplify the explanation. The real problem is with another system, whose structure is somewhat harder to explain.
Thanks for suggestions!
The database really isn't the best tool for collecting lots of small updates, but as I don't know your queryability and ACID requirements I can't really recommend anything else. If it's an acceptable approach, the application-side update aggregation suggested by zzzeek can lower the update load significantly.
There is a similar approach that can give you durability and the ability to query fresher data, at some performance cost. Create a buffer table that can collect the changes to the values that need to be updated, and insert the changes there. At regular intervals, in a transaction, rename the table to something else and create a new table in its place. Then, in a transaction, aggregate all the changes, apply the corresponding updates to the main table, and truncate the buffer table. This way, if you need a consistent and fresh snapshot of any data, you can select from the main table and join in all the changes from the active and renamed buffer tables.
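A rough sketch of the buffer-table idea using psycopg2; the sessions_buffer table, its delta columns, the fixed sessions_buffer_flush name, and the connection string are assumptions made for the example.
# Hypothetical sketch with psycopg2 of the rename-and-aggregate buffer table approach.
import psycopg2

conn = psycopg2.connect("dbname=isp")   # placeholder connection string

def record_delta(session_id, rx, tx):
    # Hot path: cheap INSERTs into the buffer instead of UPDATEs on the main table.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO sessions_buffer (id, rx_delta, tx_delta) VALUES (%s, %s, %s)",
            (session_id, rx, tx),
        )

def flush_buffer():
    # Swap the buffer out so writers keep inserting into a fresh, empty table.
    with conn, conn.cursor() as cur:
        cur.execute("ALTER TABLE sessions_buffer RENAME TO sessions_buffer_flush")
        cur.execute("CREATE TABLE sessions_buffer (LIKE sessions_buffer_flush INCLUDING ALL)")
    # Aggregate the renamed buffer into the main table, then get rid of it.
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE sessions s
               SET bytes_received = s.bytes_received + b.rx,
                   bytes_sent = s.bytes_sent + b.tx
              FROM (SELECT id, sum(rx_delta) AS rx, sum(tx_delta) AS tx
                      FROM sessions_buffer_flush GROUP BY id) b
             WHERE s.id = b.id
        """)
        # Dropped here so the fixed _flush name can be reused on the next cycle;
        # the answer above truncates the processed buffer instead.
        cur.execute("DROP TABLE sessions_buffer_flush")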
However, if neither approach is acceptable, you can also tune the database to deal better with heavy update loads.
To optimize the updates, make sure that PostgreSQL can use heap-only tuples (HOT) to store the updated versions of the rows. To do this, make sure there are no indexes on the frequently updated columns, and lower the fillfactor from the default 100%. You'll need to figure out a suitable fillfactor on your own, as it depends heavily on the details of the workload and the machine it is running on. The fillfactor needs to be low enough that almost all of the updates fit on the same database page before autovacuum has a chance to clean up the old, non-visible versions. You can tune autovacuum settings to trade off between the density of the database and vacuum overhead. Also, take into account that any long transactions, including statistical queries, will hold onto tuples that have changed after the transaction started. See the pg_stat_user_tables view to see what to tune, especially the relationship of n_tup_hot_upd to n_tup_upd and of n_live_tup to n_dead_tup.
Heavy updating will also create a heavy write-ahead log (WAL) load. Tuning the WAL behavior (see the docs for the settings) will help lower that. In particular, a higher checkpoint_segments number and a higher checkpoint_timeout can lower your IO load significantly by allowing more updates to happen in memory. See the relationship of checkpoints_timed vs. checkpoints_req in pg_stat_bgwriter to see how many checkpoints happen because either limit is reached. Raising shared_buffers so that the working set fits in memory will also help. Check buffers_checkpoint vs. buffers_clean + buffers_backend to see how many buffers were written to satisfy checkpoint requirements vs. simply running out of memory.
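To make the monitoring advice concrete, a small psycopg2 sketch; the fillfactor value of 50 and the connection string are only placeholders, not recommendations.
# Hypothetical sketch with psycopg2: lower the fillfactor and inspect the statistics views.
import psycopg2

conn = psycopg2.connect("dbname=isp")   # placeholder connection string

with conn, conn.cursor() as cur:
    # Leave free space on each page so updated row versions can stay on the same page (HOT).
    # Only affects newly written pages; existing data is repacked by VACUUM FULL / CLUSTER.
    cur.execute("ALTER TABLE sessions SET (fillfactor = 50)")

with conn, conn.cursor() as cur:
    # How many updates were heap-only, and how much dead-tuple bloat has accumulated?
    cur.execute("""
        SELECT n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup
          FROM pg_stat_user_tables
         WHERE relname = 'sessions'
    """)
    print(cur.fetchone())
    # Are checkpoints triggered by the timeout or by WAL volume, and are buffers written
    # at checkpoint time or by backends that ran out of clean pages?
    cur.execute("""
        SELECT checkpoints_timed, checkpoints_req,
               buffers_checkpoint, buffers_clean, buffers_backend
          FROM pg_stat_bgwriter
    """)
    print(cur.fetchone())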
You want to assemble statistical updates as they happen into an in-memory queue of some kind, or alternatively onto a message bus if you're more ambitious. A receiving process then aggregates these statistical updates on a periodic basis - anywhere from every 5 seconds to every hour, depending on what you want. The counts of bytes_received and bytes_sent are then updated, with counts that may represent many individual "update" messages summed together. Additionally, you should batch the update statements for multiple ids into a single transaction, ensuring that the update statements are issued in the same relative order with regard to the primary key, to prevent deadlocks against other transactions that might be doing the same thing.
In this way you "batch" activities into bigger chunks to control how much load is placed on the PG database, and also serialize many concurrent activities into a single stream (or multiple streams, depending on how many threads/processes are issuing updates). The tradeoff, which you tune via the flush period, is how much freshness vs. how much update load.
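A minimal sketch of that aggregation loop using psycopg2; the Aggregator class, the 5-second default period, and the column names are illustrative assumptions, not part of the answer.
# Hypothetical sketch with psycopg2: accumulate deltas in memory, flush them in id order.
import threading
import time
from collections import defaultdict

import psycopg2

class Aggregator:
    def __init__(self, dsn, period=5.0):
        self.conn = psycopg2.connect(dsn)
        self.period = period
        self.lock = threading.Lock()
        self.pending = defaultdict(lambda: [0, 0])   # session id -> [rx_delta, tx_delta]

    def record(self, session_id, rx, tx):
        # Hot path: just accumulate in memory, no database roundtrip.
        with self.lock:
            self.pending[session_id][0] += rx
            self.pending[session_id][1] += tx

    def flush_forever(self):
        while True:
            time.sleep(self.period)
            with self.lock:
                batch, self.pending = self.pending, defaultdict(lambda: [0, 0])
            if not batch:
                continue
            with self.conn, self.conn.cursor() as cur:
                # Issue updates in primary-key order to avoid deadlocks with other writers.
                for sid in sorted(batch):
                    rx, tx = batch[sid]
                    cur.execute(
                        "UPDATE sessions SET bytes_received = bytes_received + %s, "
                        "bytes_sent = bytes_sent + %s WHERE id = %s",
                        (rx, tx, sid),
                    )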