Strategy for keeping separate Databases in Sync

I have a NoSQL database that we are using for data processing, as it is faster for my application than SQL. I'm treating our NoSQL database almost like a cache of information, with SQL being the authoritative store and the NoSQL store being updated with changes. Right now this is done through our application: when a request for a change comes in, it is made in the SQL database and in the NoSQL database. This fails at times, as sometimes the NoSQL update fails, or other situations cause the NoSQL database to get out of sync.
I could do a batch update every X minutes; however, there is a lot of information in the data stores, and it would take hours to ensure that they are in sync. We have some timestamps to diff what has been changed, but this is not always accurate.
I'm wondering what some recommended strategies are for keeping a secondary data store (a database cache) in sync with my main store?

I know I've done this with messaging in the past - specifically JMS with ActiveMQ. I would send the updates to a NoSQL store (Mongo) through a queue. That way messages could accumulate in the queue, and if the connection to the NoSQL store ever got severed, it could pick up where it left off.
It worked really well because ActiveMQ was really stable and simple to work with.
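A minimal sketch of that pattern, transposed to Python for illustration (the original was JMS/Java): stomp.py speaks STOMP to ActiveMQ, and pymongo applies the queued changes to Mongo. The queue name, document shape, and connection details are all invented.

```python
# Sketch only: publish SQL-side changes to an ActiveMQ queue, and let a
# separate consumer apply them to Mongo. If Mongo is down, messages
# simply accumulate in the broker until the consumer catches up.
import json

import pymongo
import stomp

QUEUE = "/queue/nosql-updates"  # hypothetical queue name

def publish_change(conn: stomp.Connection, change: dict) -> None:
    # Call this after the SQL transaction commits; "persistent" asks the
    # broker to keep the message across broker restarts.
    conn.send(destination=QUEUE, body=json.dumps(change),
              headers={"persistent": "true"})

class MongoApplier(stomp.ConnectionListener):
    """Consumer side: apply each queued change to the Mongo cache."""

    def __init__(self, collection):
        self.collection = collection

    def on_message(self, frame):  # stomp.py 8.x callback signature
        change = json.loads(frame.body)
        # Upsert keyed on the SQL primary key, so redeliveries are harmless.
        self.collection.replace_one({"_id": change["id"]}, change, upsert=True)

if __name__ == "__main__":
    records = pymongo.MongoClient()["appdb"]["records"]
    conn = stomp.Connection([("localhost", 61613)])
    conn.set_listener("applier", MongoApplier(records))
    conn.connect(wait=True)
    conn.subscribe(destination=QUEUE, id="applier-1", ack="auto")
```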
I've always seen this done with diffs like you mentioned. You introduce date fields all over and then keep track of the latest sync. The nice thing about this approach is that it easily allows you to replay transactions by modifying the last sync date.
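As a rough illustration of the diff approach (schema and names invented), a sync pass boils down to selecting everything stamped after the last watermark and upserting it; rewinding the watermark replays changes:

```python
# Sketch of a timestamp-diff sync pass. 'records' and 'updated_at' are
# invented; the watermark would be persisted somewhere durable.
import sqlite3
from datetime import datetime, timezone

def sync_since(sql_conn: sqlite3.Connection, mongo_collection, last_sync: str) -> str:
    started = datetime.now(timezone.utc).isoformat()
    rows = sql_conn.execute(
        "SELECT id, payload FROM records WHERE updated_at > ?",
        (last_sync,),
    )
    for rec_id, payload in rows:
        # Upserts make the pass safe to re-run with an older watermark.
        mongo_collection.replace_one(
            {"_id": rec_id}, {"_id": rec_id, "payload": payload}, upsert=True
        )
    return started  # the new watermark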
One last piece of advice ... write good tools around pumping data from point A to point B (in this case SQL to NoSQL). I wrote several tools to bulk load the NoSQL store from SQL at my last job and it made life easy if anything got really out of sync. Between scripts and bulk loading processes, I could always recover.
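In the same spirit, a recovery-style bulk loader can be as simple as rebuilding into a staging collection and swapping it in. A sketch with invented names, assuming Python/pymongo:

```python
# Sketch of a bulk "rebuild the cache from SQL" tool: load everything
# into a staging collection, then atomically rename it over the live one.
import sqlite3

import pymongo

def bulk_rebuild(sql_conn: sqlite3.Connection, mongo_db) -> None:
    staging = mongo_db["records_staging"]
    staging.drop()
    cursor = sql_conn.execute("SELECT id, payload FROM records")
    while True:
        batch = cursor.fetchmany(10_000)  # chunked, not row by row
        if not batch:
            break
        staging.insert_many({"_id": r[0], "payload": r[1]} for r in batch)
    staging.rename("records", dropTarget=True)  # swap in the rebuilt data
```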

Related

When/how to write data in redis cache to SQL database?

I have a fairly small relational database currently set up (SQLite, changing to PostgreSQL) that has some relatively simple many-to-one and many-to-many relations. The app uses WebSockets to give real-time updates to any clients, so I want to make any operations as quick as possible. I was planning on using redis to cache parts of the data in memory as required (the parts that will be read/written frequently) so that queries will be faster. I know that with the database currently so small, performance gains aren't going to be noticeable, but I want it to be scalable.
There seems to be a lot of material/information suggesting that using redis as a cache is a good idea, but I'm struggling to find much information about when it is suitable to write updates to the SQL database and how best to do it.
For example, should I write updates to the redis store, then send the updated data out to clients, and then write the same update to the SQL database, all in the same request (in that order)? (i.e. more frequent, smaller writes)
Or should I simply write updates to the redis store and send the updated data out to clients, then periodically (every minute?) read back from the redis store and save it to the SQL database? (i.e. less frequent but larger writes)
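For concreteness, a sketch of the first option (write-through in the request handler) using Python's redis client; the pub/sub channel standing in for the WebSocket layer, and the schema, are made up:

```python
# Write-through sketch: Redis first, notify clients, then SQL, all in
# one request. If the SQL write fails, the cache is ahead of the DB,
# which is the consistency trade-off the question is asking about.
import json
import sqlite3

import redis

r = redis.Redis()
db = sqlite3.connect("app.db")

def handle_update(item_id: int, fields: dict) -> None:
    r.hset(f"item:{item_id}", mapping={k: json.dumps(v) for k, v in fields.items()})
    r.publish("updates", json.dumps({"id": item_id, **fields}))  # relay to WebSocket clients
    with db:  # commits on success
        db.execute("UPDATE items SET data = ? WHERE id = ?",
                   (json.dumps(fields), item_id))
```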
Or is there some other best practice for keeping a redis store and SQL database consistent?
Would my first example perform poorly due to the larger number of writes to disk and the CPU being more active or would this be negligible?

Multiple applications on a network with the same SQL database

I will have multiple computers on the same network with the same C# application running, connecting to a SQL database.
I am wondering if I need to use Service Broker to ensure that if I update record A in table B on Machine 1, the change is pushed to Machine 2. I have seen applications that use messaging servers to accomplish this before, but I was wondering why it is necessary; surely if they connect to the same database, any changes made from one machine will be reflected on the other?
Thanks :)
This is mostly about consistency and latency.
If your applications always perform atomic operations on the database, and they always read whatever they need with no caching, everything will be consistent.
In practice, this is seldom the case. There are plenty of hidden opportunities for caching, like when you have an edit form: it holds the values the entity had before you started the edit process, but what if someone modified those in the meantime? You'd just overwrite their changes with your data.
Solving this involves a bunch of architectural decisions. Different scenarios require different approaches.
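One common building block for the edit-form case above (an illustration, not the only option) is optimistic concurrency: carry a version number with the form and make the UPDATE conditional on it. Schema invented:

```python
# Optimistic concurrency sketch: the UPDATE only succeeds if nobody
# bumped the version since we read the record into the edit form.
import sqlite3

def save_edit(db: sqlite3.Connection, record_id: int, name: str, version_seen: int) -> bool:
    with db:
        cur = db.execute(
            "UPDATE records SET name = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (name, record_id, version_seen),
        )
    # rowcount == 0 means a concurrent edit won; reload and merge instead
    # of silently overwriting the other user's changes.
    return cur.rowcount == 1
```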
Once data is committed in the database, everyone reading it will see the same thing - but only if they actually get around to reading it, and the two reads aren't separated by another commit.
Update notifications are mostly concerned with invalidating caches, and perhaps some push-style processing (e.g. IM client might show you a popup saying you got a new message). However, SQL Server notifications are not reliable - there is no guarantee that you'll get the notification, and even less so that you'll get it in time. This means that to ensure consistency, you must not depend on the cached data, and you have to force an invalidation once in a while anyway, even if you didn't get a change notification.
Remember, even if you're actually using a database that's close enough to ACID, it's usually not the default setting (for performance and availability, mostly). You need to understand what kind of guarantees you're getting, and how to write code to handle this. Even the most perfect ACID database isn't going to help your consistency if your application introduces those inconsistencies :)

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes, it would. You have very little data, so there is no need to try to 'cache' it in some way (apart from the fact that CSV might not be the best way to do this anyway).
There is nothing stopping you from making these requests from the webserver to the database server. With as little information as this, you will not find performance an issue; and even if it became one as everything grows, there is a lot to be gained on the database side (indexes etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment either, so again, a direct query would be best.
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Examples, and why you need not worry:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day, the chance of a 1-in-10,000 failure might seem better for the once-a-day method. Still, these issues should not come up very often, and when they do, you should be able to handle them (retry the request, give a message to the user). This is what enormous numbers of websites do, so trust me when I say this will not be an issue. Also, think about what it would mean if your daily update failed: that would present a bigger problem!
Performance issues
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level: use a caching system (non-CSV) on the database server, use a caching system on the webserver, and fix your indexes to stop performance from being a problem.
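For instance (a sketch, with table and column names guessed from the question), a composite index on the two filtered columns keeps the direct per-request query trivially cheap:

```python
# Index sketch: the website filters on product name(s) plus a date
# range, so index exactly those columns. Names are assumptions.
import sqlite3

db = sqlite3.connect("warehouse.db")
db.execute(
    "CREATE INDEX IF NOT EXISTS idx_prices_product_date "
    "ON prices (product_name, date)"
)
```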
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise
your second database (even on your webserver) gets all updates from the master and is read-only; you can only query it for data
your webserver cannot connect to the data warehouse, but can connect to the slave to read information. Even if there were an injection hack, it wouldn't matter, as the slave is read-only.
Now you never have a moment where you have to update the queried database yourself (the master-slave replication keeps it up to date), and there is no chance that queries from the webserver put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If that is the only input, just ensure it is validated as a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file containing just the connection function should do fine in most cases, so that a user can't, say, open your webpage in an HTML viewer and see your database connection string.
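To make that concrete, a sketch of the date-range query with bound parameters (table and column names are assumptions): user input never reaches the SQL text, so a stray DROP TABLE in a field is inert.

```python
# Parameterized query sketch: placeholders are bound by the driver,
# which is what makes injection a non-issue for this form.
import sqlite3

def prices_for(db: sqlite3.Connection, products: list[str], start: str, end: str):
    placeholders = ",".join("?" for _ in products)
    return db.execute(
        f"SELECT product_name, date, price FROM prices "
        f"WHERE product_name IN ({placeholders}) AND date BETWEEN ? AND ?",
        (*products, start, end),
    ).fetchall()
```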
As for the CSV, I would have to say that querying the database per user, especially when the feature is only used ~10 times weekly, would be much more efficient than the CSV. I consider the CSV overkill because, again, you only have ~10 users attempting to get some information; exporting an updated CSV every day would be too much work for such little payoff.
EDIT:
Also, if an attack is a big concern (which really depends on the nature of the business, the data being stored, and the visitors you receive), you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is possible that even with the best security an attack could happen. That mainly depends on whether the attackers want the information you have.

Is it possible to Update an SQL Server Database and an Informix Database at the same time?

I was wondering if it is possible to add/update/delete an SQL Server database table, as well as an Informix database table at the same time.
Both databases will have the same table (data and all), so the query would only change based on which database it is going to. For some reason, we need the data inside both databases and kept up to date in real time.
Is it possible to do this with a SQL Trigger or maybe a SProc?
Any insight into how to do this, or a push in the right direction, would be very much appreciated.
Doing a synchronous update, i.e. a distributed transaction using a linked server (which is possible from a trigger), is technically possible, but I would definitely advise against it. Aaron raises the issue of how reliable XA is in general, but my point is different: availability. Your update in SQL Server will fail if it cannot connect and update Informix. Downtime (patching, maintenance, not to mention disasters) of the Informix site will imply downtime of the SQL Server site, driving your five 9's toward nine 5's quite fast... This is why I strongly advocate decoupling the application of updates. Transactional Replication is such an example of decoupling, and it supports heterogeneous environments (i.e. an Informix client downstream to accept the changes).
You will have a delay in update visibility (the state in SQL Server will be reflected in Informix after a delay that can be milliseconds, seconds, minutes, even hours on a bad day). And the updates are one-way; nothing flows back from Informix to SQL Server. But doing master-master replication in a heterogeneous environment is something not even Chuck Norris would attempt, just saying.
Maintaining two different DBMS with a single transaction requires a transaction monitor such as the XA system to coordinate the transactions. There are such systems. The XA specification is typically the underlying standard. Both Microsoft's SQL Server and IBM's Informix work with such systems, and it is possible to have SQL Server and Informix controlled by the same transaction monitor. I have fewer qualms about the technical competency of such systems than the others who've answered; I share their concerns about whether it is appropriate for you.
Such systems are very heavyweight. If you want consistency, all transactions that modify the single table described in the question will need to use the same XA services (plural; likely one for insert, one for update, one for delete) to do so. Further, if the same transactions need to manage any other tables too, then you need to add and use services for those tables as well. It is this aspect that tends to make such systems difficult to manage.
Using a replication system with the potential for delay before the sites are consistent is probably better than trying for absolute synchronicity, unless there are cogent demands for such synchronicity.
If there really is a demand for absolute synchronicity, then use a transaction monitor.
Do not roll your own.
They are hard to get right. Handling all the special cases is tricky. And (under the hypothesis that you need absolute synchronicity) doing it wrong is costly but easy.
That depends on your definition of "possible". Technically, you can use a technique called "two-phase commit."
The idea is you send the data to both databases and then a "prepare commit" command which does everything necessary to commit the data except for committing it. If the prepare fails, the commit would fail too. If prepare succeeds, then commit must succeed.
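Schematically (with hypothetical transaction objects, not a real XA driver API), the flow looks like this:

```python
# Two-phase commit, in outline only. Real implementations go through a
# transaction monitor and XA-capable drivers; this just shows the shape.
def two_phase_commit(sqlserver_tx, informix_tx) -> bool:
    sqlserver_tx.execute_updates()
    informix_tx.execute_updates()
    # Phase 1: both sides do everything needed to commit, except commit.
    if not (sqlserver_tx.prepare() and informix_tx.prepare()):
        sqlserver_tx.rollback()
        informix_tx.rollback()
        return False
    # Phase 2: once both prepares succeed, the commits "must not fail".
    # A commit message lost here is exactly the failure mode described below.
    sqlserver_tx.commit()
    informix_tx.commit()
    return True
```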
Brilliant idea; it doesn't work in practice. One common case is that you send the commit to both databases and one of them gets lost on the way (network outage). That happens rarely, but when it does you have an inconsistent state and, since this step must not fail, no good way to clean up.
So my solution works like this:
You load the data into a new table which has two extra columns where you can say "server X has seen this record"
You add a job which copies all rows destined for server X to server X and updates the respective column (see the sketch below). Write the job in such a way that it can be aborted and restarted at any time, i.e. it must be able to cope with cases where data already exists on the target side.
That way, you can distribute the data to any number of servers in a consistent, fault tolerant way.
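A sketch of such a copy job (schema invented, and `target.upsert` is a stand-in for whatever idempotent write the target offers):

```python
# Per-server copy job: pick rows not yet marked for the target server,
# upsert them there, then mark them. Aborting and restarting is safe
# because the write is idempotent and the mark happens after it.
import sqlite3

def copy_to_server(source: sqlite3.Connection, target, server_column: str) -> None:
    # server_column comes from code (e.g. "seen_by_x"), never user input.
    rows = source.execute(
        f"SELECT id, payload FROM staging WHERE {server_column} = 0"
    ).fetchall()
    for rec_id, payload in rows:
        target.upsert(rec_id, payload)  # must tolerate already-copied rows
        with source:
            source.execute(
                f"UPDATE staging SET {server_column} = 1 WHERE id = ?",
                (rec_id,),
            )
```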

What's the best way to get a 'lot' of small pieces of data synced between a Mac App and the Web?

I'm considering MongoDB right now. Just so the goal is clear, here is what needs to happen:
In my app, Finch (finchformac.com for details), I have thousands and thousands of entries per day for each user recording which window they had open, the time they opened it, the time they closed it, and a tag if they chose one for it. I need this data to be backed up online so it can sync to their other Mac computers, etc. I also need to be able to draw charts online from their data, which means some complex queries hitting hundreds of thousands of records.
Right now I have tried using Ruby/Rails/Mongoid with a JSON parser on the app side, sending up data in increments of 10,000 records at a time; the data is then processed into other collections with a background map-reduce job. But this all seems to block and is ultimately too slow. What recommendations does anyone have for how to go about this?
You've got a complex problem, which means you need to break it down into smaller, more easily solvable issues.
Problems (as I see it):
1. You've got an application which is collecting data. You just need to store that data somewhere locally until it gets sync'd to the server.
2. You've received the data on the server and now you need to shove it into the database fast enough so that it doesn't slow down.
3. You've got to report on that data, and this sounds hard and complex.
You probably want to write this as some sort of API; for simplicity (and since you've got loads of spare processing cycles on the clients), you'll want these chunks of data processed on the client side into JSON ready to import into the database. Once you've got JSON you don't need Mongoid (you can just throw the JSON into the database directly). Also, you probably don't need Rails, since you're just creating a simple API, so stick with just Rack or Sinatra (possibly using something like Grape).
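The thread's stack is Ruby, but the "skip the ODM, insert the JSON directly" idea looks roughly like this in Python with pymongo (names invented):

```python
# Direct JSON ingestion sketch: no object-mapping layer, just parse the
# client's batch and hand it to the driver in one call.
import json

import pymongo

entries = pymongo.MongoClient()["finch"]["entries"]

def ingest(payload: str) -> None:
    docs = json.loads(payload)  # the client already shaped the documents
    entries.insert_many(docs, ordered=False)  # unordered: remaining docs are still attempted if one fails
```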
Now you need to solve the whole "this all seems to block and is ultimately too slow" issue. We've already removed Mongoid (so there's no need to convert from JSON -> Ruby objects -> JSON) and Rails. Before we get to doing a MapReduce on this data, you need to ensure it's getting loaded into the database quickly enough. Chances are you should architect the whole thing so that your MapReduce supports your reporting functionality. For syncing the data you shouldn't need to do anything but pass the JSON around. If your data isn't writing into your DB fast enough, you should consider sharding your dataset. This will probably be done using some user-based key, but you know your data schema better than I do. You need to choose your shard key so that when multiple users are syncing at the same time, they will probably be using different servers.
Once you've solved problems 1 and 2, you need to work on your reporting. This is probably supported by your MapReduce functions inside Mongo. My first comment on this part is to make sure you're running at least Mongo 2.0; in that release 10gen sped up MapReduce (my tests indicate that it is substantially faster than 1.8). Beyond this, you can achieve further increases by sharding and by directing reads to the secondary servers in your replica set (you are using a replica set?). If this still isn't working, consider structuring your schema to support your reporting functionality. This lets you use more cycles on your clients to do the work rather than loading your servers. But this optimisation should be left until after you've proven that conventional approaches won't work.
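Pointing reporting reads at the secondaries is a one-line connection option in most drivers; in Python/pymongo terms (hosts and replica-set name invented):

```python
# Read-preference sketch: reporting queries go to secondaries when
# available, keeping the primary free for the write-heavy sync path.
import pymongo

client = pymongo.MongoClient(
    "mongodb://host1,host2,host3/?replicaSet=rs0",  # hypothetical hosts
    readPreference="secondaryPreferred",
)
```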
I hope that wall of text helps somewhat. Good luck!