Redis write back cache still a manual task?

I am working on an assignment. The REST API (developed in Spring) has a method m() which simulates the cleaning of windows by a person. Towards the end, the cleaner has to write a unique phrase (a string) on the window. Phrases written by all cleaners are eventually saved in the MySQL DB. So each time m() is executed, a query is made to the DB to fetch all phrases written to the DB so far today. The cleaner method m() then generates a random string as a phrase, checks it against the queried phrases to make sure it's unique, and writes it to the DB. So there is one query per m() to fetch all phrases and one to write the phrase. Both happen on the same table.
This is a scenario that can take advantage of caching, so I turned to Redis. I also think a write-back cache is the best solution: every write goes to the cache instead of the DB, and every read comes from the cache as well. The cache can be copied to the DB in a new thread every hour (or at some configurable interval). I was reading "Can Redis write out to a database like PostgreSQL?" and it seems that some years back you had to do this manually.
My questions:
Is doing this manually still the way to go? If not, can someone point me to a Redis resource I can make use of?
If manual is the way to go, this is how I plan to implement it. Is it ideal?
Phrases written each hour will be appended to a list of objects (userid, phrase) in Redis. The list for midnight to 1 am will be called phrases_1, the one for 1 to 2 am phrases_2, and so on. Each hour a background thread will write the entire hour's list to the DB. Every time all phrases need to be fetched for checking, I will load all lists for the day from the cache (e.g. phrases_1, phrases_2) in a loop and consolidate them. (Later, when the number of users grows, I will have to shard, but that is not my immediate concern.)
Thanks.

Check https://github.com/RedisGears/rgsync (and https://redislabs.com/solutions/use-cases/caching/) which tries to address both the cases of write-back and write-through.
I'm yet to do a functionality test.
It is also interesting to note that a 2020 CMU paper (https://www.pdl.cmu.edu/PDL-FTP/Storage/2020.apocs.writeback.pdf) claims "writeback-aware caching is NPcomplete and Max-SNP hard"

Instead of going to Redis for uniqueness of data, you should create a unique index on the field you want to be unique, and MySQL will take care of the rest for you.
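A minimal sketch of that idea follows. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column names and the insert_unique_phrase helper are all hypothetical. With MySQL, a duplicate would surface as the driver's own integrity error instead of sqlite3.IntegrityError.

```python
import secrets
import sqlite3

# sqlite3 stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (userid INTEGER, phrase TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_phrases_phrase ON phrases (phrase)")


def insert_unique_phrase(conn, userid, make_phrase, max_attempts=5):
    """Insert a generated phrase, retrying if the unique index rejects it."""
    for _ in range(max_attempts):
        phrase = make_phrase()
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO phrases (userid, phrase) VALUES (?, ?)",
                    (userid, phrase),
                )
            return phrase
        except sqlite3.IntegrityError:
            continue  # duplicate phrase: generate another one and retry
    raise RuntimeError("could not generate a unique phrase")


first = insert_unique_phrase(conn, 1, lambda: secrets.token_hex(8))
```

With this approach there is no need to fetch all of today's phrases before each write: the insert itself is the uniqueness check.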

Related

Local db for simultaneously write operation

What I'm looking to do is a multithreaded Python script (more than 100 threads). Each thread will read a value from an API (every second) and put it in a table (table "ALLVALUES"), specifying the key and overwriting the existing value.
The main script will read the table ALLVALUES every 5 seconds and retrieve the last value.
I tried with SQLite, but sometimes the write operation in a specific thread fails because SQLite locks the DB when writing. I use the WAL configuration, but the results are the same.
Which architecture could I use to solve my problem?
Writes and reads must be very fast, so for this reason I'm looking for something local.
Thank you

How can data be synchronized between processes in SQL?

I'm wondering something perhaps extremely stupid, but I can't seem to find an answer (which is not a good sign, usually).
Assume we have a SQL server (MySQL, PostgreSQL; this question even applies to SQLite, though there's no server) and several clients connected to it. Countless times I've seen queries that, in my opinion, might be hard to sync.
So let's assume we have a table (usage statistics, say) with a row per day.
statistic (
day,
num_requests
)
(I avoid mentioning data types, since it's not the point, but the number of requests should be a number of some sort.)
So when a new web request is sent, the web server will ask this table's current statistic and increase the number of requests. No biggie right?
cursor.execute("""
    SELECT num_requests FROM statistic
    WHERE ...
""")
number = cursor.fetchone()[0]  # read the current count into the client
number += 1
cursor.execute("""
    UPDATE statistic SET num_requests=?
    WHERE ...
""", (number, ))
But what happens if two requests are handled somewhat simultaneously, perhaps on several clients? In different processes? They each ask for today's current statistic (just a read operation, non-blocking), they get the number of requests from this row (this step doesn't involve the server) and then they increment it by 1. At this point, if both requests are running somewhat simultaneously, they have both incremented the same number once and they each send an UPDATE request with their number.
In the end, the number of requests for today's statistic has increased by one, although there were two requests. I know there are mechanisms to ensure proper data synchronization, but I fail to see how they could address the situation in this case. A read is usually non-blocking as far as I know. A write can be blocking, but since the read for the other process has already happened, the second write operation will still store a wrong value. And I don't see any way to express that logically.
In other words, this seems like the point where we would lock the row in most programming languages, and say "from this point onward, you can neither read it nor write it, I'm working on it". The first request would execute its read (lock), increment and write, and then unlock. The second request would have to wait patiently for the lock to be released. I don't see that mechanism in SQL. Is it transparent and not even necessary? And if so, how does it work? Or have we lived our entire lives with problems like this?
Thanks!
# Let the database increment the value itself, in a single statement
cursor.execute("""
    UPDATE statistic SET num_requests=num_requests+1
    WHERE ...
""")

Best way of storing an array in an SQL database?

For an Android launcher (home screen) app project I want to implement a feature called "Sort by usage". This will sort by the launch count of an app within a user-settable timeframe.
The current idea for the implementation is to store an array of Unix epoch timestamps, one for each launch.
Additionally, it will store a counter caching the current number of launches within the selected timeframe, incremented with every launch. Of course, this would have to be rebuilt regularly as time passes, but only every few hours or after at least x percent of the selected timeframe, so computations definitely wouldn't run as often as without the counter, since this information is required every time any app entries on screen need to be sorted. I'm not quite sure whether it matters in any way during actual use, though.
I am now unsure how to store the timestamp array inside the SQL database. As there is a table holding one record with information about each launcher entry, I thought about the following options:
1. Store the array of Unix epochs in serialized form (maybe a JSON array) in one field of the entry's record
2. Create a separate table for launch times with
a. each record starting with an id associated with an entry followed by all launch times, one for each field
b. each record a combination of entry id and one launch time
These options would obviously have the advantage of storing the timestamp using an appropriate type.
I probably didn't quite understand why you need a second piece of data for your launch counter - the fact you saved a timestamp already means a launch - why not just count timestamps? Less updating, less record locking, more concurrency.
Now, let's say you've got a separate table with timestamps in a classic one to many setting.
Pros of this setup - you never need to update anything - just keep inserting. You can easily cluster your table by timestamp, run a filter on your timeframe and issue a group by and count rows. The client then will get the numbers and sort by count (I believe it's generally better to not sort in SQL). Cons - you need a join to parent table and probably need to get your indexes right.
Alternatively you store timestamps in a blob text (JSON, CSV, whatever) with your main records. This definitely means you'll have to update your records a lot, which potentially opens you up to locking issues. Then, I'm not entirely sure what you'll have to do to get your final launch counts - you read all entities, deserialise all timestamps, filter by timeframe and then count? It does feel a bit more convoluted in your case.
I don't think there's such a thing as a "best" way. You have to consider the pros and cons. From what I gather, you might be better off with the classic SQL approach, unless there's something I didn't catch that outweighs my points above.
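A rough sketch of the separate-table approach, assuming sqlite3 (which Android also uses) and a hypothetical launches table with entry_id and launched_at columns. The helper names are made up for illustration, and the join to the parent table is left out for brevity.

```python
import sqlite3

# Hypothetical schema: one row per launch, keyed by the launcher entry's id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE launches (entry_id INTEGER, launched_at INTEGER)")
conn.execute("CREATE INDEX idx_launches_time ON launches (launched_at)")


def record_launch(conn, entry_id, timestamp):
    """Append one launch; existing rows are never updated."""
    conn.execute("INSERT INTO launches VALUES (?, ?)", (entry_id, timestamp))


def launch_counts(conn, since):
    """Return {entry_id: number of launches at or after `since`}."""
    rows = conn.execute(
        "SELECT entry_id, COUNT(*) FROM launches "
        "WHERE launched_at >= ? GROUP BY entry_id",
        (since,),
    )
    return dict(rows.fetchall())


for entry_id, ts in [(1, 100), (1, 200), (2, 150), (2, 250), (2, 300), (3, 50)]:
    record_launch(conn, entry_id, ts)

counts = launch_counts(conn, since=120)
# Sorting is done on the client, as suggested above.
by_usage = sorted(counts, key=counts.get, reverse=True)
print(counts, by_usage)  # prints "{1: 1, 2: 3} [2, 1]"
```

Entries with no launches in the timeframe simply do not appear in the result, so the caller would treat a missing id as a count of zero.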

How to find number of rows inserted/deleted in MySQL

Is there a way to find out the number of rows inserted/deleted in a table in MySQL? Is this kind of statistics kept somewhere in the database? If not, what would be the best way to implement something to keep track of these statistics?
When I say how many, I mean within a certain period (last 24 hours, or since server was up, or last week etc)
When I need to keep track of deleted things, I just don't delete.
I change a column value that excludes the row from normal user results.
If space is an issue, you can set the contents you no longer care about to empty.
For inserted rows, you can use COUNT().
The Binary Log contains records of all queries that update or insert data. I don't know if it stores the number of affected rows, however.
There is also a General Query Log, which tracks all queries that were run.
(Information current for MySQL 5.0. If you're using an older version ymmv)
If I want to handle logging my SQL queries, I have 2 possibilities:
1. Turning the MySQL log function on
2. Writing my own 'trace' class
I prefer doing number 2.
Why?
Because it is more controllable. You can easily differentiate between INSERT, DELETE, UPDATE and other queries.
But that is not the only advantage of your own trace class, because creating trace files (so-called "logs") makes administrative tasks much easier.
You can structure the trace output, put it into a separate database, or store it in an XML or JSON file.
You can order things as you want them to be.
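A minimal sketch of what such a trace class could look like, assuming a Python DB-API connection (sqlite3 here as a stand-in for MySQL). The TracingConnection name and its rows_affected counter are hypothetical. To answer "how many in the last 24 hours", one would additionally store a timestamp with each entry.

```python
import sqlite3
from collections import Counter


class TracingConnection:
    """Hypothetical wrapper that tallies affected rows per statement type."""

    def __init__(self, conn):
        self.conn = conn
        self.rows_affected = Counter()

    def execute(self, sql, params=()):
        cursor = self.conn.execute(sql, params)
        # The first keyword of the statement decides which counter to bump.
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE"):
            self.rows_affected[verb] += cursor.rowcount
        return cursor


db = TracingConnection(sqlite3.connect(":memory:"))
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
for name in ("a", "b", "c"):
    db.execute("INSERT INTO items (name) VALUES (?)", (name,))
db.execute("DELETE FROM items WHERE name IN ('a', 'b')")

print(db.rows_affected["INSERT"], db.rows_affected["DELETE"])  # prints "3 2"
```

This only sees statements that go through the wrapper, so changes made by other clients or tools would not be counted.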

SQL Identity Column out of step

We have a set of databases that have a table defined with an Identity column as the primary key. As a subset of these are replicated to other servers, a seed system was created so that they could never clash. That system gives each database its own starting seed, with an increment of 50.
In this way the table on DB1 would generate 30001, 30051 etc., whereas DB2 would generate 30002, 30052 and so on.
I am looking at adding another database into this system (it is split for scaling/loading purposes) and have discovered that the identities have got out of sync on one or two of the databases, i.e. database 3, which should have numbers ending in 3, doesn't anymore. The seed and increment are still correct according to the table design.
I am obviously going to have to work around this problem somehow (probably by setting a high initial value), but can anyone tell me what would cause them to get out of sync like this? From a query on the DB I can see the sequence went as follows: 32403, 32453, 32456, 32474, 32524, 32574, and it has continued in increments of 50 ever since it went wrong.
As far as I am aware no bulk-inserts or DTS or anything like that has put new data into these tables.
Second (bonus) question - how to reset the identity so that it goes back to what I want it to actually be!
EDIT:
I know the design is in principle a bit ropey - I didn't ask for criticism of it, I just wondered how it could have got out of sync. I inherited this system and changing the column to a GUID - whilst undoubtedly the best theoretical solution - is probably not going to happen. The system evolved from a single DB to multiple DBs when the load got too large (a few hundred GBs currently). Each ID in this table will be referenced in many other places - sometimes a few hundred thousand times each (multiplied by about 40,000 for each item). Updating all those will not be happening ;-)
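As a quick check on the sequence quoted above, the remainders modulo the increment show exactly where the pattern broke. This is only arithmetic on the numbers given in the question; the variable names are made up.

```python
INCREMENT = 50
observed = [32403, 32453, 32456, 32474, 32524, 32574]

# Every ID generated by database 3 should leave remainder 3 when divided by 50.
residues = [n % INCREMENT for n in observed]
steps = [b - a for a, b in zip(observed, observed[1:])]

print(residues)  # prints "[3, 3, 6, 24, 24, 24]"
print(steps)     # prints "[50, 3, 18, 50, 50]"
```

Two consecutive values (32456 and 32474) are off the 50-step grid, and every later value keeps the remainder 24 of the second one.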
Replication = GUID column.
To set the value of the next ID to be 1000:
DBCC CHECKIDENT (orders, RESEED, 999)
If you want to actually use Primary Keys for some meaningful purpose other than uniquely identify a row in a table, then it's not an Identity Column, and you need to assign them some other explicit way.
If you want to merge rows from multiple tables, then you are violating the intent of Identity, which is for one table. (A GUID column will use values that are unique enough to solve this problem. But you still can't impute a meaningful purpose to them.)
Perhaps somebody used:
SET IDENTITY_INSERT {tablename} ON
INSERT INTO {tablename} (ID, ...)
VALUES(32456, ....)
SET IDENTITY_INSERT {tablename} OFF
Or perhaps they used DBCC CHECKIDENT to change the identity. In any case, you can use the same to set it back.
It's too risky to rely on this kind of identity strategy, since it's (obviously) possible that it will get out of synch and wreck everything.
With replication, you really need to identify your data with GUIDs. It will probably be easier for you to migrate your data to a schema that uses GUIDs for PKs than to try and hack your way around IDENTITY issues.
To address your question directly,
Why did it get out of sync may be interesting to discuss, but the only result you could draw from the answer would be to prevent it in the future; which is a bad course of action. You will continue to have these and bigger problems unless you deal with the design which has a fatal flaw.
How to set the existing values right is also (IMHO) an invalid question, because you need to do something other than set the values right - it won't solve your problem.
This isn't to disparage you, it's to help you the best way I can think of. Changing the design is less work both short term and long term. Not to change the design is the pathway to FAIL.
This doesn't really answer your core question, but one possibility to address the design would be to switch to a hi_lo algorithm. It wouldn't require changing the column away from an int, so it shouldn't be nearly as much work as changing to a GUID.
Hi_lo is used by the NHibernate ORM, but I couldn't find much documentation on it.
Basically, the way hi_lo works is that you have one central place where you keep track of your hi value: one table in one of the databases that every instance of your insert application can see. Then you need some kind of service (object, web service, whatever) that has a life somewhat longer than a single entity insert. When this service starts up, it goes to the hi table, grabs the current value, then increments the value in that table. Do the read and the increment under a lock so that you won't get any concurrency issues with other instances of the service (a plain read committed read is not enough on its own, because two instances could read the same value before either one updates it). Now you use the new service to get your next ID value. Internally it starts at the number it got from the DB and, each time it passes a value out, increments by 1, keeping track of this current value and the "range" it's allowed to pass out. A simplistic example would be this.
Service 1 gets 100 from the "hi_value" table in the DB and increments the DB value to 200.
Service 1 gets a request for a new ID. It passes out 100.
Another instance of the service, service 2 (either another thread, another middle tier worker machine, etc.), spins up, gets 200 from the DB and increments the DB value to 300.
Service 2 gets a request for a new ID. It passes out 200.
Service 1 gets a request for a new ID. It passes out 101.
If any of these ever gets to passing out more than 100 IDs before dying, it will go back to the DB, get the current value, increment it and start over. Obviously there's some art to this: how big should your range be, etc.
A very simple variation on this is a single table in one of your DBs that just contains the "nextId" value, basically reproducing Oracle's sequence concept manually.
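The walkthrough above can be sketched roughly as follows. This is not NHibernate's implementation, just an illustration that uses sqlite3 in place of the shared database. The HiLoGenerator class, the next_hi column and the block size of 100 are assumptions chosen to match the example numbers.

```python
import sqlite3

BLOCK_SIZE = 100  # the "range" each service instance may hand out


class HiLoGenerator:
    """Hypothetical ID service that reserves blocks of IDs from a shared table."""

    def __init__(self, conn, block_size=BLOCK_SIZE):
        self.conn = conn
        self.block_size = block_size
        self._current = 0
        self._limit = 0  # exclusive upper bound; 0 forces a reservation on first use

    def _reserve_block(self):
        # BEGIN IMMEDIATE takes the write lock before the read, so two
        # connections cannot both read the same hi value.
        self.conn.execute("BEGIN IMMEDIATE")
        try:
            hi = self.conn.execute("SELECT next_hi FROM hi_value").fetchone()[0]
            self.conn.execute(
                "UPDATE hi_value SET next_hi = ?", (hi + self.block_size,)
            )
            self.conn.execute("COMMIT")
        except Exception:
            self.conn.execute("ROLLBACK")
            raise
        self._current, self._limit = hi, hi + self.block_size

    def new_id(self):
        if self._current >= self._limit:
            self._reserve_block()  # block exhausted (or first call)
        value = self._current
        self._current += 1
        return value


# isolation_level=None lets the sketch manage transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE hi_value (next_hi INTEGER)")
conn.execute("INSERT INTO hi_value VALUES (100)")

service_1 = HiLoGenerator(conn)
service_2 = HiLoGenerator(conn)
ids = [service_1.new_id(), service_2.new_id(), service_1.new_id()]
print(ids)  # prints "[100, 200, 101]"
```

IDs left over in a block when a service dies are simply never used, which is the usual trade-off of this scheme: gaps in exchange for far fewer round trips to the shared table.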