I have a problem to solve, and at one point it says this:
I decided to use a relational database to cache calculated data for the next calls.
What does the cache part mean? Where is that data stored? Is it saved in a temporary table? How can I access that information?
Thanks in advance!!
"Cache calculated data" means the results of some resource-consuming calculations are stored in a database for faster future access without re-calculation. The data can be stored in one or several tables, it can be versioned or not, so particular implementation may vary.
We need some help choosing the best database for the following upcoming situation:
Write:
We'll receive a huge amount of data every minute (roughly one million entries) and need to save it in a database. One of the unique identifiers of the entries is a timestamp.
Read:
The software should be able to load the stored data for a user-defined time range. The loading process should be as fast as possible.
Currently, we are using Microsoft SQL Server in our application. I'm not sure if this is the right technology for our new requirements.
Which database should we use? Do we need to replace MSSQL with something else, or can another database run alongside it?
Thank you!
I'm looking to use BigQuery as a scalable SQL database where transactions or modify/delete operations are not needed. Instead of modifying or deleting, I'm looking to create new time-based versions of the records so that I can audit any change and also roll back (if I need to). BigQuery seems to be advertised only for analytics/"big data", but never as a general-purpose database without modify/transactions. Is it wrong to use it that way? Am I missing something?
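To make the idea concrete, the pattern I mean looks roughly like this (the table and column names are made up): every change is appended as a new row with a version timestamp, and reads pick the latest version per key:

    -- Every change to a customer is a new row; nothing is updated or deleted.
    -- The latest version per customer_id wins; older rows remain for audit/roll-back.
    SELECT customer_id, name, status, valid_from
    FROM (
        SELECT customer_id, name, status, valid_from,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY valid_from DESC) AS rn
        FROM customer_versions
    ) AS v
    WHERE rn = 1;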
What is the best approach to load only the Delta into the analytics DB from a highly transactional DB?
Note:
We have a highly transactional system, and we are building an analytics database from it. At present, we wipe all the fact and dimension tables from the analytics DB and load the entire "processed" data set at midnight. The problem with this approach is that we load the same data over and over, along with the few new rows that were added or updated on that particular day. We need to load only the "delta" (newly inserted rows and old rows that were updated). Is there an efficient way to do this?
It is difficult to say much without knowing the details, e.g. the database schema, the database engine... However, the most natural approach for me is to use timestamps. This solution assumes that the entities (a single record in a table, or a group of related records) that are loaded/migrated from the transactional DB into the analytic one have a timestamp.
This timestamp says when a given entity was created or last updated. While loading/migrating data, you should take into account only those entities whose timestamp is greater than the date of the last migration. The advantage of this approach is that it is quite simple and does not require any specific tool. The question is whether you already have timestamps in your DB.
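A minimal sketch of that idea, assuming a hypothetical Orders table with an UpdatedAt column and a small control table that remembers when the last migration ran:

    -- Pick up only entities changed since the previous load.
    DECLARE @LastMigration DATETIME;
    SELECT @LastMigration = LastRunAt
    FROM dbo.EtlControl
    WHERE JobName = 'OrdersDelta';

    SELECT o.OrderID, o.CustomerID, o.Amount, o.UpdatedAt
    FROM dbo.Orders AS o
    WHERE o.UpdatedAt > @LastMigration;

    -- After a successful load, move the watermark forward.
    UPDATE dbo.EtlControl
    SET LastRunAt = GETDATE()
    WHERE JobName = 'OrdersDelta';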
Another approach might be to utilize some kind of change tracking mechanism. For example, MS SQL Server has something like that (see this article). However, I have to admit that I've never used it, so I'm not sure whether it is suitable in this case. If your database doesn't support change tracking, you can try to build it yourself based on triggers, but in general it is not an easy thing to do.
We need to load only the "delta" (newly inserted rows and old rows that were updated). Is there an efficient way to do this?
You forgot rows that got deleted, and that is the crux of the problem. Having an updated_at field on every table and polling for rows with updated_at > #last_poll_time works, more or less, but polling like this does not give you a transactionally consistent image, because each table is polled at a different moment. Tracking deleted rows induces complications at the app/data-model layer, as rows have to be either logically deleted (is_deleted) or moved to an archive table (for each table!).
Another solution is to write triggers in the database: attach a trigger to each table and have the trigger write the changes that occurred into a table_history table. Again, for each table. These solutions are notoriously difficult to maintain long term in the presence of schema changes (columns added or modified, tables dropped, etc.).
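A rough sketch of such a trigger, with hypothetical Customers/Customers_History tables (the real thing needs one of these per tracked table, plus handling for schema drift):

    -- Record every insert, update and delete into a history table.
    CREATE TRIGGER trg_Customers_History
    ON dbo.Customers
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- New and updated rows.
        INSERT INTO dbo.Customers_History (CustomerID, Name, ChangeType, ChangedAt)
        SELECT i.CustomerID, i.Name, 'I/U', GETDATE()
        FROM inserted AS i;

        -- Deleted rows (present in "deleted" but not in "inserted").
        INSERT INTO dbo.Customers_History (CustomerID, Name, ChangeType, ChangedAt)
        SELECT d.CustomerID, d.Name, 'D', GETDATE()
        FROM deleted AS d
        WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.CustomerID = d.CustomerID);
    END;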
But there are database-specific solutions that can help. For instance, SQL Server has Change Tracking and Change Data Capture. These can be leveraged to build an ETL pipeline that maintains an analytical data warehouse. Database schema changes are still a pain, though.
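As a rough illustration of the Change Tracking route (database and table names are hypothetical; see the SQL Server documentation for the full picture):

    -- Enable change tracking at the database and table level.
    ALTER DATABASE SalesDb
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.Orders
    ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

    -- Ask for everything that changed since the version saved by the last ETL run.
    DECLARE @last_sync_version BIGINT = 0;   -- normally read from an ETL control table

    SELECT ct.OrderID, ct.SYS_CHANGE_OPERATION   -- I, U or D
    FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct;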
There is no silver bullet, no pixie dust.
In our web app, we create a session table in the database to store temporary data, so a temp table is created and destroyed for every user. I have some 300 users for this web app, so these tables are constantly being created and destroyed.
I heard that this kind of design is not good due to performance issues.
I am using MS SQL Server 2005. Is there any way to store a result set temporarily without creating any table?
Please suggest me some solution.
Thanks.
Either:
use a single permanent database table for all users, with a UserID column to filter on (see the sketch after this list)
or
just use the session handling ability of your web platform to store the info
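A minimal sketch of the first option, with a hypothetical SessionData table (names are illustrative only):

    -- One shared table instead of a table per user.
    CREATE TABLE dbo.SessionData (
        UserID     INT          NOT NULL,
        ItemKey    VARCHAR(100) NOT NULL,
        ItemValue  VARCHAR(MAX) NULL,
        CreatedAt  DATETIME     NOT NULL DEFAULT GETDATE(),
        CONSTRAINT PK_SessionData PRIMARY KEY (UserID, ItemKey)
    );

    -- Each user only ever sees their own rows.
    DECLARE @UserID INT;
    SET @UserID = 42;   -- hypothetical user

    SELECT ItemKey, ItemValue
    FROM dbo.SessionData
    WHERE UserID = @UserID;

    -- Clean-up replaces the old DROP TABLE step.
    DELETE FROM dbo.SessionData WHERE UserID = @UserID;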
It sounds as if you are creating and dropping permanent tables. Have you tried using real temp tables (those with table names beginning with #)? Or table variables, if you have a small data set. Either of these can work quite well. If you use real temp tables, you need to make sure your tempdb is sized large enough to accommodate the usual number of users; growing tempdb can cause delays.
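For illustration, both variants look roughly like this (the source table and columns are made up):

    DECLARE @UserID INT;
    SET @UserID = 42;   -- hypothetical user

    -- Real temp table: lives in tempdb, scoped to the current session.
    CREATE TABLE #UserResults (
        ItemID   INT,
        ItemName VARCHAR(100)
    );
    INSERT INTO #UserResults (ItemID, ItemName)
    SELECT ItemID, ItemName FROM dbo.Items WHERE OwnerID = @UserID;
    -- ...use #UserResults...
    DROP TABLE #UserResults;   -- also dropped automatically when the session ends

    -- Table variable: better suited to small result sets.
    DECLARE @UserResults TABLE (
        ItemID   INT,
        ItemName VARCHAR(100)
    );
    INSERT INTO @UserResults (ItemID, ItemName)
    SELECT ItemID, ItemName FROM dbo.Items WHERE OwnerID = @UserID;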
I think a solution such as GenerateData is what you are looking for. You can create test/sample databases there and delete them when needed.
Depending on what you're actually doing (and whether you can refactor it), it may be more appropriate to use table variables, which are generally highly performant.
There is a question of whether the DB is really an appropriate place to persist data sets that exist only for your application's benefit. If the question isn't just academic, perhaps it would be better to keep the object representation of the data in your app's memory?
I wasn't sure how to word this question, so I'll try to explain. I have a third-party database on SQL Server 2005. I have another SQL Server 2008 instance, to which I want to "publish" some of the data in the third-party database. This database I shall then use as the back-end for a portal and Reporting Services; it shall be the data warehouse.
On the destination server I want to store the data in table structures different from those in the third-party DB. Some tables I want to denormalize, and there are lots of columns that aren't necessary. I'll also need to add additional fields to some of the tables, which I'll need to update based on data stored in the same rows. For example, there are varchar fields that contain info I'll want to populate other columns with. All of this should cleanse the data and make it easier to report on.
I can write the query (or queries) to get all the info I want into a particular destination table. However, I want to be able to keep it up to date with the source on the other server. It doesn't have to be updated immediately (although that would be good), but I'd like it to be updated perhaps every 10 minutes. There are hundreds of thousands of rows of data, but the changes to the data and the addition of new rows etc. aren't huge.
I've had a look around, but I'm still not sure of the best way to achieve this. As far as I can tell, replication won't do what I need. I could manually write the T-SQL to do the updates, perhaps using the MERGE statement, and then schedule it as a job with SQL Server Agent. I've also been having a look at SSIS, and that looks to be geared at the ETL kind of thing.
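For context, the sort of MERGE I have in mind for one destination table looks roughly like this (all table, column and linked-server names are placeholders, assuming the 2005 server is reachable as a linked server):

    MERGE dbo.DW_Customers AS target
    USING SourceServer.SourceDb.dbo.Customers AS source
        ON target.CustomerID = source.CustomerID
    WHEN MATCHED THEN
        UPDATE SET target.Name  = source.Name,
                   target.Email = source.Email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, Name, Email)
        VALUES (source.CustomerID, source.Name, source.Email)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;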
I'm just not sure what to use to achieve this, and I was hoping to get some advice on how one should go about doing this kind of thing. Any suggestions would be greatly appreciated.
For those tables whose schemas/relations are not changing, I would still strongly recommend replication.
For the tables whose data and/or relations are changing significantly, I would recommend that you develop a Service Broker implementation to handle that. The high-level approach with Service Broker (SB) is:
Table-->Trigger-->SB.Service >====> SB.Queue-->StoredProc(activated)-->Table(s)
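A heavily condensed sketch of that pipeline for a single hypothetical Orders table (all object names are illustrative; a real setup also needs error handling, poison-message handling, and Service Broker enabled on the database with ALTER DATABASE ... SET ENABLE_BROKER):

    -- Queue and service that the trigger will send change messages to.
    CREATE QUEUE dbo.OrderChangeQueue;
    CREATE SERVICE OrderChangeService ON QUEUE dbo.OrderChangeQueue ([DEFAULT]);
    GO

    -- Trigger: publish the changed keys to the queue instead of doing the work inline.
    CREATE TRIGGER trg_Orders_Publish ON dbo.Orders AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @h UNIQUEIDENTIFIER;
        DECLARE @msg XML;

        SET @msg = (SELECT OrderID FROM inserted
                    FOR XML PATH('row'), ROOT('changes'), TYPE);

        BEGIN DIALOG CONVERSATION @h
            FROM SERVICE OrderChangeService
            TO SERVICE 'OrderChangeService'
            ON CONTRACT [DEFAULT]
            WITH ENCRYPTION = OFF;

        SEND ON CONVERSATION @h MESSAGE TYPE [DEFAULT] (@msg);
        -- Deletes would need the same treatment using the "deleted" pseudo-table.
    END;
    GO

    -- Activated procedure: drains the queue and applies changes to the target table(s).
    CREATE PROCEDURE dbo.ProcessOrderChanges
    AS
    BEGIN
        DECLARE @msg XML;
        RECEIVE TOP (1) @msg = CAST(message_body AS XML)
        FROM dbo.OrderChangeQueue;
        -- ...shred @msg and insert/update the destination table(s) here...
    END;
    GO

    ALTER QUEUE dbo.OrderChangeQueue
        WITH ACTIVATION (STATUS = ON,
                         PROCEDURE_NAME = dbo.ProcessOrderChanges,
                         MAX_QUEUE_READERS = 1,
                         EXECUTE AS OWNER);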
I would not recommend SSIS for this, unless you wanted to go to something like daily exports/imports. It's fine for that kind of thing, but IMHO far too kludgy and cumbersome for either continuous or short-period incremental data distribution.
Nick, I have gone the SSIS route myself. I have jobs that run every 15 minutes that are based in SSIS and do the exact thing you are trying to do. We have a huge relational database, and we wanted to do complicated reporting on top of it using a product called Tableau. We quickly discovered that our relational model wasn't really so hot for that, so I built a cube over it with SSAS, and that cube is updated and processed every 15 minutes.
Yes, SSIS does give the aura of being mainly for straight ETL jobs, but I have found that it can be used for simple, quick jobs like this as well.
I think staging and partitioning will be too much for your case. I am implementing the same thing in SSIS now, but with a frequency of 1 hour, as I need to allow some time for support activities. I am sure that using SSIS is a good way of doing it.
During the design, I had thought of another way to achieve custom replication: customizing the Change Data Capture (CDC) process. This way you can get near-real-time replication, but it is a tricky thing.