Our analytical ETL stores model results in a Snowflake table (2 columns: user-id, score).
We need to use that data in our low-latency service, and Snowflake is not suitable for that latency.
I thought about storing that table in a Redis collection.
I would like some ideas on how to keep Redis in sync with the table.
Any other solution for the latency problem is also welcome.
Well, it depends on how frequently your Snowflake data is updated, what process is updating the data (Snowplow or some external tool that you can hook into), and what latency you are prepared to accept between the Snowflake data changing and Redis having the values.
You could add a task to export the changes to S3, then have a Lambda watching the bucket/folder and pushing the changes into Redis.
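As a rough sketch of the Lambda leg of that pipeline, assuming the task unloads the changed rows to S3 as CSV files of user_id,score; the bucket layout, key names, and the Redis endpoint here are placeholders:

import csv
import io

import boto3
import redis

s3 = boto3.client("s3")
r = redis.Redis(host="my-redis.internal.example", port=6379)

def handler(event, context):
    # Triggered by S3 put notifications on the export bucket/prefix.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        pipe = r.pipeline()
        for user_id, score in csv.reader(io.StringIO(body)):
            pipe.set(f"score:{user_id}", score)   # one key per user
        pipe.execute()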
You could have the tool that loads the changes also pull the changes out and push them into Redis (we did a form of this).
You could have something poll the Snowflake data (this seems like the worst idea) and push changes into Redis. Polling the main table sounds bad, but you could also use a multi-table insert/merge so that when you update the main table you also insert into a changes table (or a stream), and then read from that in your Redis sync.
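A sketch of that stream-based variant, using the Snowflake Python connector and redis-py; the stream, table, and log-table names, the key pattern, and the credentials are all assumptions:

import redis
import snowflake.connector

sf = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="sync_wh", database="analytics", schema="public",
)
r = redis.Redis(host="localhost", port=6379)
cur = sf.cursor()

# One-time setup, normally run once in Snowflake:
#   CREATE STREAM IF NOT EXISTS scores_stream ON TABLE model_scores;

cur.execute("BEGIN")
cur.execute("SELECT user_id, score, METADATA$ACTION, METADATA$ISUPDATE FROM scores_stream")
for user_id, score, action, is_update in cur.fetchall():
    if action == "INSERT":                       # inserts and the new half of updates
        r.set(f"score:{user_id}", float(score))
    elif action == "DELETE" and not is_update:   # true deletes only
        r.delete(f"score:{user_id}")

# A plain SELECT does not advance the stream's offset; consuming the stream in a
# DML statement inside the same transaction does, so log what was processed.
cur.execute("INSERT INTO scores_sync_log "
            "SELECT user_id, score, CURRENT_TIMESTAMP() FROM scores_stream")
cur.execute("COMMIT")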
Our use case is using Redis Cache in our main flow to store centralized data for our processing units in order to improve performance.
In order to persist the data from Redis to the DB we came up with a few approaches:
Writing the data to the DB at the same time as we write it to Redis (this approach isn't practical, as it impacts performance).
Inserting a duplicate entry with a TTL and listening to Redis events in a different process, which in turn inserts them into the DB (downside: we lose events if that process is down; we can scale out the number of processes for availability in that case).
Having a Redis sorted set hold the last update time of each entry, and running ZRANGE periodically in a different process to find the most recently updated keys and insert them into the DB (downside: inserting into the sorted set is O(log N) and impacts performance in the main flow).
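A rough sketch of that third approach, assuming entries live under keys like entity:<id>, the sorted set is named dirty_keys (score = last update time), and save_to_db stands in for the real DB write:

import time

import redis

r = redis.Redis(host="localhost", port=6379)

def save_to_db(key: bytes, value: bytes) -> None:
    # Placeholder for the actual DB insert/upsert.
    ...

def flush_dirty_keys(flush_interval_seconds: float = 5.0) -> None:
    last_flush = 0.0
    while True:
        now = time.time()
        # Keys touched since the previous flush.
        for key in r.zrangebyscore("dirty_keys", last_flush, now):
            value = r.get(key)
            if value is not None:
                save_to_db(key, value)
        r.zremrangebyscore("dirty_keys", 0, now)   # clear only what we just flushed
        last_flush = now
        time.sleep(flush_interval_seconds)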
Is there another approach we're missing? We're using Azure Cache for Redis.
Thanks.
I have three microservices that need to communicate with each other. Microservice-1 is in charge of the data and the database (it reads from and writes to it). I will add a Redis cache store for Microservice-1 to cache data there. I want to put a Redis slave (replica) in front of the other 2 microservices to reduce communication with the actual microservice when the data is already in the cache store. Since all updates to the data have to go through Microservice-1, which will always update the cache, Redis replication will make sure the other two microservices get it too. Of course, if the data is not in the cache, they will call Microservice-1 for the data, which will update the cache.
Am I missing something with this approach?
This will definitely work in the "sunny day" case.
But sometimes there are storms, and in storms there's a chance of losing cache coherency (i.e. the DB and Redis disagree on the data).
For example, let's say that you have Microservice-1 update the DB and then update Redis. What happens if there's a crash between updating the DB and updating Redis?
On the other hand, what if you reverse the ordering (update Redis and then the DB)? Now Redis could be updated and not the DB.
Neither of these is insurmountable, but absent a transaction which ensures that either both or neither of Redis and the DB are updated, there will always be a time window where the change is in one but not the other. In that situation, it's probably worth embracing eventual consistency (e.g. periodically scan the DB and update Redis with recently updated records).
As an elaboration on that, a Command Query Responsibility Segregation with Event Sourcing (CQRS/ES) approach may prove useful: Microservice-1 gets split into two services, one which takes commands (requests to update) and another which handles queries. Instead of updating a row in a DB, the command service now appends (in a typical DB, an INSERT) an event which describes what changed. The query service can subscribe to those events and update Redis. Other microservices can also subscribe to the stream of events and update their own views (which can be remixed in any way they want) of Microservice-1's state.
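To make that concrete, here is a minimal sketch of the split, with an events table and a polling projector standing in for a real event log and subscription; the table name, the SQLite stand-in, and the key pattern are all assumptions:

import json
import sqlite3   # stand-in for the real DB holding the event log
import time

import redis

db = sqlite3.connect("events.db")
db.execute("CREATE TABLE IF NOT EXISTS events "
           "(seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
r = redis.Redis()

def handle_command(entity_id, new_value):
    # Command side: append an event instead of updating a row in place.
    db.execute("INSERT INTO events (payload) VALUES (?)",
               (json.dumps({"entity_id": entity_id, "value": new_value}),))
    db.commit()

def run_query_side_projector(poll_seconds=1.0):
    # Query side: subscribe (here: poll) to new events and project them into Redis.
    last_seq = 0
    while True:
        rows = db.execute("SELECT seq, payload FROM events WHERE seq > ? ORDER BY seq",
                          (last_seq,)).fetchall()
        for seq, payload in rows:
            event = json.loads(payload)
            r.set(f"entity:{event['entity_id']}", event["value"])
            last_seq = seq
        time.sleep(poll_seconds)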
I have a NoSQL database that we are using for data processing, as it is faster for my application than SQL. I'm treating our NoSQL database almost like a cache of the information, with SQL being the authority on the data and the NoSQL store being updated with changes. Right now this is done through our application: when a request for a change comes in, it is made in the SQL database and in the NoSQL database. This fails at times, as sometimes the NoSQL update fails or other situations cause the NoSQL database to get out of sync.
I could do a batch update every X minutes; however, there is a lot of information in the data stores, and it would take hours to ensure they are in sync. We have some timestamps to diff what has changed, but this is not always accurate.
I'm wondering what some recommended strategies are for keeping a data store (a secondary database cache) in sync with my main store?
I've done this with messaging in the past - specifically JMS with ActiveMQ. I would send the updates to the NoSQL store (Mongo) by using a queue. That way messages could accumulate in the queue, and if the connection to the NoSQL store was ever severed, it could pick up where it left off.
It worked really well because ActiveMQ was really stable and simple to work with.
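A sketch of that queue-based flow in Python rather than JMS, assuming ActiveMQ is reached over STOMP with stomp.py (8.x listener API) and MongoDB via pymongo; the queue name, message format, and connection details are assumptions:

import json
import time

import stomp
from pymongo import MongoClient

records = MongoClient("mongodb://localhost:27017")["cache_db"]["records"]

class NoSqlSyncListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Each message carries one update, e.g. {"_id": 42, "fields": {...}}.
        update = json.loads(frame.body)
        records.update_one({"_id": update["_id"]},
                           {"$set": update["fields"]},
                           upsert=True)

conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("", NoSqlSyncListener())
conn.connect("admin", "admin", wait=True)
conn.subscribe(destination="/queue/nosql-sync", id=1, ack="auto")

# The producer side (in the application, right after the SQL write) just does:
#   conn.send(destination="/queue/nosql-sync", body=json.dumps(update))

while True:            # keep the consumer process alive
    time.sleep(1)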
I've always seen this done with diffs like you mentioned. You introduce date fields all over and then keep track of the latest sync. The nice thing about this approach is that it easily allows you to replay transactions by modifying the last sync date.
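Something along these lines, assuming every row carries an updated_at column, SQLite stands in for the real SQL database, and the returned watermark is persisted by the caller:

import datetime
import sqlite3                      # stand-in for the real SQL database

from pymongo import MongoClient

sql = sqlite3.connect("main.db")
nosql = MongoClient()["cache_db"]["records"]

def sync_since(last_sync: str) -> str:
    # Copy rows changed since last_sync into the NoSQL store; return the new watermark.
    now = datetime.datetime.utcnow().isoformat(" ")
    rows = sql.execute(
        "SELECT id, payload, updated_at FROM records "
        "WHERE updated_at > ? AND updated_at <= ?",
        (last_sync, now)).fetchall()
    for row_id, payload, _updated_at in rows:
        nosql.replace_one({"_id": row_id}, {"_id": row_id, "payload": payload},
                          upsert=True)
    # Re-running with an older last_sync value replays that window.
    return now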
One last piece of advice ... write good tools around pumping data from point A to point B (in this case SQL to NoSQL). I wrote several tools to bulk load the NoSQL store from SQL at my last job and it made life easy if anything got really out of sync. Between scripts and bulk loading processes, I could always recover.
We have a big shopping and product-dealing system. We faced lots of problems with MySQL, so after some R&D we planned to use Redis and started integrating Redis into our system.
The following data, which previously hit the database directly, has now been moved to the Redis system:
User shopping cart details
Affiliate click tracking records
Product-dealing user data
Other site stats
I am not only storing the data in the Redis system; I have also written crons which move the Redis data into MySQL at intervals. This is the main point where I am facing issues.
I am looking for solutions to the points below:
Is there any other way to dump big data from Redis to MySQL?
If Redis fails, our data is stored in a file, so is it possible to store that data directly into the MySQL database?
Does Redis have any trigger system I can use to avoid the crons, like a queue system?
Is there any other way to dump big data from Redis to MySQL?
Redis has the ability (using BGSAVE) to generate a dump of the data in a non-blocking and consistent way.
https://github.com/sripathikrishnan/redis-rdb-tools
You could use Sripathi Krishnan's well-known package to parse a redis dump file (RDB) in Python, and populate the MySQL instance offline. Or you can convert the Redis dump to JSON format, and write scripts in any language you want to populate MySQL.
This solution is only interesting if you want to copy the complete data of the Redis instance into MySQL.
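For instance, an offline loader could look roughly like this, assuming the rdbtools package and MySQL via pymysql; the key handling, table, and column names are assumptions, and the other callbacks (rpush, hset, sadd, zadd) would be filled in the same way for the key types you care about:

import pymysql
from rdbtools import RdbParser, RdbCallback

db = pymysql.connect(host="localhost", user="app", password="...", database="shop")

class MySqlLoader(RdbCallback):
    def __init__(self):
        super().__init__(string_escape=None)

    def set(self, key, value, expiry, info):
        # Called for every plain string key found in the dump.
        with db.cursor() as cur:
            cur.execute("REPLACE INTO redis_strings (redis_key, redis_value) "
                        "VALUES (%s, %s)", (key, value))

parser = RdbParser(MySqlLoader())
parser.parse("/var/redis/6379/dump.rdb")
db.commit()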
Does Redis have any trigger system I can use to avoid the crons, like a queue system?
Redis has no trigger concept, but nothing prevents you from posting events to Redis queues each time something must be copied to MySQL. For instance, instead of:
# Add an item to a user shopping cart
RPUSH user:<id>:cart <item>
you could execute:
# Add an item to a user shopping cart
MULTI
RPUSH user:<id>:cart <item>
RPUSH cart_to_mysql <id>:<item>
EXEC
The MULTI/EXEC block makes it atomic and consistent. Then you just have to write a little daemon waiting on items of the cart_to_mysql queue (using BLPOP commands). For each dequeued item, the daemon has to fetch the relevant data from Redis, and populate the MySQL instance.
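For example, the daemon could look roughly like this, assuming the cart_to_mysql items have the "<id>:<item>" form from the example above and MySQL is reached via pymysql; the table and column names are assumptions:

import pymysql
import redis

r = redis.Redis(host="localhost", port=6379)
db = pymysql.connect(host="localhost", user="app", password="...", database="shop")

while True:
    # BLPOP blocks until an item is available and returns (queue_name, value).
    _, raw = r.blpop("cart_to_mysql")
    user_id, item = raw.decode("utf-8").split(":", 1)
    with db.cursor() as cur:
        cur.execute("INSERT INTO cart_items (user_id, item) VALUES (%s, %s)",
                    (user_id, item))
    db.commit()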
If Redis fails, our data is stored in a file, so is it possible to store that data directly into the MySQL database?
I'm not sure I understand the question here. But if you use the above solution, the latency between Redis updates and MySQL updates will be quite limited. So if Redis fails, you will only lose the very last operations (contrary to a solution based on cron jobs). It is of course not possible to have 100% consistency in the propagation of data, though.
Every day we have an SSIS package running to import data into a database.
Sometimes people are querying the database at the same time.
The loading (data import) times out because there's a table lock on the specific table.
What is the standard protocol for inserting data and querying data at the same time?
First you need to figure out where those locks are coming from. Use the link below to see if there are any locks.
How to: Determine Which Queries Are Holding Locks
If you have another process that holds a table lock, then there is not much you can do.
Are you sure the error is "not able to OBTAIN a table lock"? If so, look at changing your SSIS package to not use table locks.
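As a starting point for the lock check, something like this (assuming SQL Server reached through pyodbc; the connection string is a placeholder) lists which sessions hold or wait for locks in the current database:

import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("""
    SELECT l.request_session_id,
           l.resource_type,
           l.request_mode,
           l.request_status,
           s.program_name
    FROM sys.dm_tran_locks AS l
    JOIN sys.dm_exec_sessions AS s
      ON s.session_id = l.request_session_id
    WHERE l.resource_database_id = DB_ID()
""")
for row in cur.fetchall():
    print(row.request_session_id, row.resource_type, row.request_mode,
          row.request_status, row.program_name)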
There are several strategies.
One approach is to design your ETL pipeline to minimize lock time. All the data is prepared in staging tables and then, when complete, is switched in using fast partition switch operations; see Transferring Data Efficiently by Using Partition Switching. This way the ETL blocks reads only for a very short duration. It also has the advantage that the reads see all the ETL data at once, not intermediate stages. The drawback is a difficult implementation.
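A very rough outline of the switch step, assuming SQL Server via pyodbc; the table names are placeholders, and a real implementation needs identical schemas, constraints, and indexes on the staging, live, and archive tables:

import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes",
                      autocommit=False)
cur = conn.cursor()

# 1. The ETL (e.g. the SSIS data flow) loads into dbo.Sales_Staging while
#    readers of dbo.Sales are untouched; assume that has already happened.

# 2. Swap the new data in. SWITCH is a metadata-only operation, so the schema
#    locks on dbo.Sales are held only very briefly.
cur.execute("TRUNCATE TABLE dbo.Sales_Old")                   # switch target must be empty
cur.execute("ALTER TABLE dbo.Sales SWITCH TO dbo.Sales_Old")
cur.execute("ALTER TABLE dbo.Sales_Staging SWITCH TO dbo.Sales")
conn.commit()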
Another approach is to enable snapshot isolation and/or read committed snapshot in the database; see Row Versioning-based Isolation Levels in the Database Engine. This way reads no longer block behind the locks held by the ETL. The drawback is resource consumption: the hardware must be able to handle the additional load of row versioning.
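Enabling it is a one-time change per database (the database name below is a placeholder); note that switching READ_COMMITTED_SNAPSHOT on needs near-exclusive access to the database, so it is usually done in a maintenance window:

import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=master;Trusted_Connection=yes",
                      autocommit=True)
cur = conn.cursor()

# Plain READ COMMITTED readers start using row versions automatically:
cur.execute("ALTER DATABASE mydb SET READ_COMMITTED_SNAPSHOT ON")
# Optionally also allow explicit SNAPSHOT transactions:
cur.execute("ALTER DATABASE mydb SET ALLOW_SNAPSHOT_ISOLATION ON")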
Yet another approach is to move the data querying to a read-only standby server, e.g. using log shipping.