I'm looking to use BigQuery as a scalable SQL database where transactions and modify/delete operations are not needed. Instead of modifying or deleting, I would create new time-based versions of the records so that I can audit any change and also roll back if I need to. BigQuery seems to be advertised only for analytics/"big data", never as a general-purpose database without modifications/transactions. Is it wrong to use it that way? Am I missing something?
Related
In my application (a C# application using Entity Framework and a SQL Server database), I need to create a daily task that updates/inserts data from a third-party application (both applications use SQL Server). For efficiency's sake, I am looking for a way to determine which records have been modified since the previous day, so that I import only those records.
I know I could add a modified_on column to the source table and create a trigger that updates the column whenever a record changes, but that would require changing the third-party application's database schema, which I want to avoid.
There's the change tracking feature, but it's of limited use to you: because you're using EF, the way the change data has to be queried is awkward. You may be able to use it somehow, but I doubt it would be elegant.
Far easier is to indeed change the schema, but add only a single column of type rowversion. That binary data type (loaded as byte[] in EF) is special: SQL Server assigns it a new, larger value every time something (such as the third-party application) inserts or updates the row. No triggers needed. You simply look up the largest value you have already processed and then query all rows whose value is larger than that.
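A minimal sketch of that pattern, assuming a hypothetical dbo.Customer source table and a dbo.SyncState table your import job uses purely as bookkeeping:

    -- Add the rowversion column; SQL Server maintains it automatically on every insert/update.
    ALTER TABLE dbo.Customer ADD RowVer rowversion;

    -- On each run, fetch the highest value processed last time and pull everything newer.
    DECLARE @LastProcessed binary(8) =
        (SELECT MaxRowVer FROM dbo.SyncState WHERE TableName = 'Customer');

    SELECT *
    FROM dbo.Customer
    WHERE RowVer > @LastProcessed
    ORDER BY RowVer;

    -- After the import succeeds, write MAX(RowVer) of the batch back into dbo.SyncState.

From EF you would typically run that last query as raw SQL, since LINQ has no greater-than operator for byte[].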
In addition to the change tracking suggested by John in another answer, you could consider setting up temporal tables.
You can run queries against the temporal history to identify the changed records and pull them from the main table accordingly.
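A rough sketch, assuming SQL Server 2016 or later and a hypothetical dbo.Customer table (history-table and column names are made up):

    -- Add the period columns and switch system versioning on.
    ALTER TABLE dbo.Customer ADD
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo),
        ValidFrom datetime2 GENERATED ALWAYS AS ROW START HIDDEN NOT NULL
            DEFAULT SYSUTCDATETIME(),
        ValidTo   datetime2 GENERATED ALWAYS AS ROW END HIDDEN NOT NULL
            DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999');

    ALTER TABLE dbo.Customer
        SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

    -- Rows inserted or updated since the last import: their current version started after it.
    DECLARE @LastImportUtc datetime2 = '2024-01-01';   -- persist this between runs
    SELECT *
    FROM dbo.Customer
    WHERE ValidFrom > @LastImportUtc;

    -- Deleted rows appear only in dbo.CustomerHistory (their ValidTo gets closed off).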
What is the best approach to load only the Delta into the analytics DB from a highly transactional DB?
Note:
We have a highly transactional system and we are building an analytics database from it. At present, we wipe all the fact and dimension tables in the analytics DB and load the entire "processed" data set at midnight. The problem with this approach is that we load the same data again and again every day, along with the few rows that were added or updated on that particular day. We need to load only the "delta" (newly inserted rows and old rows that were updated). Is there an efficient way to do this?
It is difficult to say much without knowing the details, e.g. the database schema, the database engine... However, the most natural approach for me is to use timestamps. This solution assumes that the entities (a single record in a table, or a group of related records) that are loaded/migrated from the transactional DB into the analytics DB carry a timestamp.
The timestamp records when a given entity was created or last updated. When loading/migrating data, you take into account only those entities whose timestamp is greater than the date of the last migration. The advantage of this approach is that it is quite simple and does not require any specific tooling. The question is whether you already have such timestamps in your DB.
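As a minimal sketch of the idea (all object names are hypothetical, and the LastModified column is assumed to be maintained by the transactional application):

    -- High-water mark recorded by the previous run, kept in a small ETL bookkeeping table.
    DECLARE @LastMigration datetime2 =
        (SELECT LastRunUtc FROM etl.MigrationLog WHERE TableName = 'Orders');

    -- Only the entities touched since the last migration.
    SELECT o.*
    FROM dbo.Orders AS o
    WHERE o.LastModified > @LastMigration;

    -- After a successful load, advance the mark to the largest timestamp actually seen,
    -- so rows modified while the load was running are not skipped next time.
    UPDATE etl.MigrationLog
    SET LastRunUtc = (SELECT MAX(LastModified) FROM dbo.Orders)
    WHERE TableName = 'Orders';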
Another approach might be to utilize some kind of change tracking mechanism. For example, MS SQL Server has one built in (see this article). However, I have to admit that I've never used it, so I'm not sure whether it is suitable in this case. If your database doesn't support change tracking, you can try to build your own based on triggers, but in general that is not an easy thing to do.
We need to load the "Delta" alone (rows which are inserted newly & the old rows which got updated). Any efficient way to do this?
You forgot the rows that got deleted, and that is the crux of the problem. Having an updated_at field on every table and polling for rows with updated_at > @last_poll_time works, more or less, but polling like this does not give you a transactionally consistent image, because each table is polled at a different moment. Tracking deleted rows adds complications at the app/data-model layer, as rows have to be either logically deleted (is_deleted) or moved to an archive table (for each table!).
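For illustration, a sketch of the logical-delete variant (table and column names are made up): the application, or an INSTEAD OF DELETE trigger, flags the row instead of removing it, so the poll sees the deletion as an ordinary update.

    DECLARE @order_id int = 42, @last_poll_time datetime2 = '2024-01-01';

    -- "Delete" becomes an update that also bumps the timestamp.
    UPDATE dbo.orders
    SET is_deleted = 1,
        updated_at = SYSUTCDATETIME()
    WHERE order_id = @order_id;

    -- The poll picks it up along with the real inserts/updates.
    SELECT *
    FROM dbo.orders
    WHERE updated_at > @last_poll_time;   -- rows with is_deleted = 1 come through too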
Another solution is to write triggers in the database: attach a trigger to each table and have it write the changes that occurred into a table_history table. Again, one for each table. These solutions are notoriously difficult to maintain long term in the presence of schema changes (columns added or modified, tables dropped, etc.).
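A rough sketch of one such history table and trigger (names made up; note that every new column of interest means touching both the history table and the trigger, which is exactly the maintenance burden mentioned above):

    CREATE TABLE dbo.orders_history
    (
        history_id  bigint IDENTITY PRIMARY KEY,
        order_id    int           NOT NULL,
        operation   char(1)       NOT NULL,   -- 'I', 'U' or 'D'
        changed_at  datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
        status      varchar(20)   NULL,       -- copies of the columns you care about
        amount      decimal(18,2) NULL
    );
    GO
    CREATE TRIGGER trg_orders_history ON dbo.orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Inserts and updates: record the new version of each affected row.
        INSERT dbo.orders_history (order_id, operation, status, amount)
        SELECT i.order_id,
               CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
               i.status, i.amount
        FROM inserted AS i;

        -- Deletes: record the old version of each removed row.
        INSERT dbo.orders_history (order_id, operation, status, amount)
        SELECT d.order_id, 'D', d.status, d.amount
        FROM deleted AS d
        WHERE NOT EXISTS (SELECT 1 FROM inserted);
    END;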
But there are database-specific solutions that can help. For instance, SQL Server has Change Tracking and Change Data Capture. These can be leveraged to build an ETL pipeline that maintains an analytical data warehouse. Database schema changes are still a pain, though.
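To give a flavour of the CDC route (a sketch only; it assumes an edition that includes CDC, SQL Server Agent running, and a hypothetical dbo.orders table):

    -- Enable CDC on the database and on the table of interest.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'orders',
        @role_name     = NULL;

    -- The ETL job then asks for everything captured in a window of log sequence numbers.
    -- In a real pipeline you would persist the last processed LSN instead of using the minimum.
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, N'all');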
There is no silver bullet, no pixie dust.
I'm looking for an auditing solution that does exactly what Change Data Capture (CDC) does, except I need it to also track the application user that made the change. I'm currently using SQL Server 2012 Enterprise and may be upgrading to 2014 later this year.
We already have an auditing solution in place that leverages DELETE, INSERT, and UPDATE triggers, but some new requirements might force us to update every audit trigger and corresponding audit table. Given the various problems we've run into with that solution over the years, this seems like as good a time as any to reevaluate and potentially replace it.
To give you an idea of what I'm currently working with (and may be able to leverage), we use a stored procedure (ConnectionInitialize) to store a user id with a SPID in a table (ApplicationUser) and then we delete the row using another stored procedure (ConnectionReset) once we're done making our deletes, inserts, and updates.
Looking at how we might use CDC, I investigated adding a trigger to something like the cdc.lsn_time_mapping table, but I couldn't find a way to map the LSN back to the SPID (and therefore the user id) that made the change. This also presented some other issues, in that CDC is always a little bit behind.
I looked into SQL Server Audit a little bit, but that presented some challenges of its own. We're using Transparent Data Encryption (TDE) to appease some of our security requirements, but SQL Server Audit looks like it'd need a separate encryption strategy; that and I'm more interested in the columns than in the actual SQL statements. Even so, these aren't deal-breakers for me, so I'm still looking into it.
Given what I'm trying to accomplish, does anyone have any feedback or recommendations?
By itself, CDC doesn't meet the requirements. The reason is that CDC only captures changes to your data, not the context under which those changes were made. You can, however, get what you're looking for if you're willing to tag your data with some audit columns. The basic idea is that you append a column to your table (or to a separate table if you aren't able to modify the actual table for whatever reason) and populate it with the user who last modified the record (simple enough to do via an insert/update trigger). Once that is actual data in the row, you can consume it however you need to (CDC being one possible mechanism).
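A rough sketch of that idea, reusing the ApplicationUser/SPID mapping described in the question (the audited table, key and column names here are assumptions):

    -- Hypothetical audit column on the audited table.
    ALTER TABLE dbo.Invoice ADD ModifiedBy int NULL;
    GO
    CREATE TRIGGER trg_Invoice_ModifiedBy ON dbo.Invoice
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Stamp the rows touched by this statement with the user registered for this SPID
        -- (dbo.ApplicationUser is populated by ConnectionInitialize, as in the question).
        UPDATE inv
        SET ModifiedBy = au.UserId
        FROM dbo.Invoice AS inv
        JOIN inserted AS i             ON i.InvoiceId = inv.InvoiceId
        JOIN dbo.ApplicationUser AS au ON au.Spid = @@SPID;
    END;

Because ModifiedBy is now ordinary column data, CDC (or any other mechanism) captures it alongside the rest of the change. The trigger's own UPDATE does not re-fire the trigger unless RECURSIVE_TRIGGERS is enabled on the database.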
Late answer but hopefully useful.
There is a third-party tool, ApexSQL Audit, capable of meeting your requirements. My previous company has been using it for years and has been satisfied with it.
There is a helpful comparison article you can read to find more details about audited data, auditing mechanisms, integrity protection, etc., covering both CDC and the audit tool in one place.
I have a database with 50 tables and I want to log users' requests, such as inserts, updates, or deletes, on all the tables in the database. I know I could create a trigger for each request type to do this.
What is the best way to do this from a performance perspective or is there a better way to track this?
You can also create audit tables which are populated by triggers (and which allow much more flexibility than Change Data Capture). The critical component is to capture sets of data, not to try to work row-by-row. It does add some overhead, yes, but if you write the triggers correctly it isn't that much. Be sure to capture who made the change (including which application, if you have multiple applications hitting the database) and when, as well as the old and new values. Set up one audit table per table you want audited (you get too much locking if you use only one audit table for everything). And at the time you set up your system, write the code to get data back from a bad transaction or set of transactions; that makes it much easier to recover when something does go wrong and you need to revert.
We use two tables per table audited: one contains the information about the process that made the changes (name of the application, date, user, etc., plus an audit id), the other contains the details about what was changed (old and new values, the ID of the affected record, and the affected column). This structure lets us use the same design for every table being audited, allows the underlying tables to change without having to change the audit tables, and makes it easy to script out the audit tables for a new table.
It is also easy for us to see which records were changed at the same time or by the same process, and to find out which of the many applications that touch our database was responsible for bad data, as well as who in particular was responsible. This helps us track down application bugs and figure out why the data was changed the way it was in some cases. It also makes it easier to track down all the data affected by a broken process rather than just the rows we knew about.
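A condensed sketch of the two-table layout described above, for one audited table (every name here is made up):

    CREATE TABLE dbo.Customer_Audit
    (
        AuditId   bigint IDENTITY PRIMARY KEY,
        AppName   nvarchar(128) NOT NULL,
        UserName  nvarchar(128) NOT NULL,
        AuditDate datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
    );

    CREATE TABLE dbo.Customer_AuditDetail
    (
        AuditDetailId bigint IDENTITY PRIMARY KEY,
        AuditId       bigint        NOT NULL REFERENCES dbo.Customer_Audit (AuditId),
        RecordId      nvarchar(50)  NOT NULL,   -- key of the affected row
        ColumnName    sysname       NOT NULL,
        OldValue      nvarchar(max) NULL,
        NewValue      nvarchar(max) NULL
    );
    GO
    -- Set-based trigger: one header row per statement, one detail row per changed column/row.
    CREATE TRIGGER trg_Customer_Audit ON dbo.Customer
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @AuditId bigint;
        INSERT dbo.Customer_Audit (AppName, UserName)
        VALUES (APP_NAME(), SUSER_SNAME());
        SET @AuditId = SCOPE_IDENTITY();

        INSERT dbo.Customer_AuditDetail (AuditId, RecordId, ColumnName, OldValue, NewValue)
        SELECT @AuditId, CAST(i.CustomerId AS nvarchar(50)), 'Email', d.Email, i.Email
        FROM inserted AS i
        JOIN deleted  AS d ON d.CustomerId = i.CustomerId
        WHERE ISNULL(d.Email, '') <> ISNULL(i.Email, '');   -- repeat per audited column
    END;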
If you have Enterprise Edition, look into Change Data Capture. If you don't have Enterprise and aren't interested in capturing the historical values of the columns that change, look into Change Tracking.
See Comparing Change Data Capture and Change Tracking to understand the differences between the two.
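For reference, a minimal Change Tracking sketch (database, table and key names are made up; the version number has to be persisted by your sync process between runs):

    ALTER DATABASE MyDb SET CHANGE_TRACKING = ON
        (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.orders ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

    -- On each sync, ask for everything since the version processed last time.
    DECLARE @last_sync bigint = 0;

    SELECT ct.order_id, ct.SYS_CHANGE_OPERATION, o.*
    FROM CHANGETABLE(CHANGES dbo.orders, @last_sync) AS ct
    LEFT JOIN dbo.orders AS o ON o.order_id = ct.order_id;   -- o.* is NULL for deleted rows

    SELECT CHANGE_TRACKING_CURRENT_VERSION();   -- store this as @last_sync for the next run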
Assuming all requests to insert, update and/or delete data go through some middle-tier data access layer, I would suggest you do your logging there. That is where we do all of ours. It is much simpler than trying to extract the actual insert/update/delete statements out of SQL Server.
If you want to audit the data itself, you can look into Change Data Capture (CDC), but this requires Enterprise Edition.
I wasn't sure how to word this question, so I'll try to explain. I have a third-party database on SQL Server 2005. I have another server running SQL Server 2008, to which I want to "publish" some of the data in the third-party database. That database will then be the back-end for a portal and for Reporting Services; it will be the data warehouse.
On the destination server I want to store the data in table structures different from those in the third-party DB. Some tables I want to denormalize, and there are lots of columns that aren't necessary. I'll also need to add additional fields to some of the tables, which I'll need to populate based on data stored in the same rows. For example, there are varchar fields containing info that I'll want to use to populate other columns. All of this should cleanse the data and make it easier to report on.
I can write the query (or queries) to get all the info I want into a particular destination table. However, I want to be able to keep it up to date with the source on the other server. It doesn't have to be updated immediately (although that would be good), but I'd like it to be updated perhaps every 10 minutes. There are hundreds of thousands of rows of data, but the volume of changes and new rows isn't huge.
I've had a look around, but I'm still not sure of the best way to achieve this. As far as I can tell, replication won't do what I need. I could manually write the T-SQL to do the updates, perhaps using the MERGE statement (I've sketched below the kind of thing I mean), and then schedule it as a job with SQL Server Agent. I've also been having a look at SSIS, and that looks to be geared towards this kind of ETL thing.
I'm just not sure what to use to achieve this, and I was hoping to get some advice on how one should go about doing this kind of thing. Any suggestions would be greatly appreciated.
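This is roughly the MERGE-in-an-agent-job approach I had in mind, with all object names made up and the 2005 box reached via a linked server:

    MERGE dbo.DimCustomer AS tgt
    USING (
        SELECT c.CustomerId,
               c.Name,
               LEFT(c.Notes, 50) AS Region          -- cleansing/derivation happens here
        FROM   [SourceServer].[ThirdPartyDb].dbo.Customer AS c
    ) AS src
        ON tgt.CustomerId = src.CustomerId
    WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.Region <> src.Region) THEN
        UPDATE SET tgt.Name = src.Name, tgt.Region = src.Region
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, Name, Region) VALUES (src.CustomerId, src.Name, src.Region)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;

The idea would be to wrap this in a stored procedure and have SQL Server Agent run it every 10 minutes.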
For those tables whose schemas/relations are not changing, I would still strongly recommend replication.
For the tables whose data and/or relations are changing significantly, I would recommend that you develop a Service Broker implementation to handle it. The high-level approach with Service Broker (SB) is:
Table-->Trigger-->SB.Service >====> SB.Queue-->StoredProc(activated)-->Table(s)
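A heavily condensed sketch of that pipeline, shown within a single database and with made-up names (cross-server distribution additionally needs routes and endpoints; error/poison-message handling and the actual upsert logic are omitted; the database is assumed to have Service Broker enabled):

    -- Plumbing: message type, contract, queues and services.
    CREATE MESSAGE TYPE [//DW/RowChanged] VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT [//DW/RowChangedContract] ([//DW/RowChanged] SENT BY INITIATOR);
    CREATE QUEUE dbo.DWSourceQueue;
    CREATE QUEUE dbo.DWTargetQueue;
    CREATE SERVICE [//DW/SourceService] ON QUEUE dbo.DWSourceQueue;
    CREATE SERVICE [//DW/TargetService] ON QUEUE dbo.DWTargetQueue ([//DW/RowChangedContract]);
    GO
    -- Table --> Trigger --> SB.Service: the trigger sends the affected keys as XML.
    CREATE TRIGGER trg_Orders_ToDW ON dbo.Orders AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @keys xml =
            (SELECT x.OrderId
             FROM (SELECT OrderId FROM inserted
                   UNION
                   SELECT OrderId FROM deleted) AS x
             FOR XML PATH('row'), ROOT('rows'), TYPE);
        IF @keys IS NULL RETURN;

        DECLARE @h uniqueidentifier;
        BEGIN DIALOG CONVERSATION @h
            FROM SERVICE [//DW/SourceService]
            TO SERVICE   '//DW/TargetService'
            ON CONTRACT  [//DW/RowChangedContract]
            WITH ENCRYPTION = OFF;
        SEND ON CONVERSATION @h MESSAGE TYPE [//DW/RowChanged] (@keys);
    END;
    GO
    -- SB.Queue --> StoredProc(activated) --> Table(s): the activated proc applies the changes.
    CREATE PROCEDURE dbo.ProcessDWQueue
    AS
    BEGIN
        DECLARE @h uniqueidentifier, @msg xml;
        RECEIVE TOP (1) @h   = conversation_handle,
                        @msg = CAST(message_body AS xml)
        FROM dbo.DWTargetQueue;
        IF @h IS NULL RETURN;
        -- ... shred @msg with .nodes('/rows/row') and upsert into the warehouse tables ...
        END CONVERSATION @h;
    END;
    GO
    ALTER QUEUE dbo.DWTargetQueue WITH ACTIVATION
        (STATUS = ON, PROCEDURE_NAME = dbo.ProcessDWQueue,
         MAX_QUEUE_READERS = 1, EXECUTE AS OWNER);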
I would not recommend SSIS for this, unless you wanted to go to something like daily exports/imports. It's fine for that kind of thing, but IMHO far too kludgey and cumbersome for either continuous or short-period incremental data distribution.
Nick, I have gone the SSIS route myself. I have jobs that run every 15 minutes that are built in SSIS and do exactly what you are trying to do. We have a huge relational database and wanted to do complicated reporting on top of it using a product called Tableau. We quickly discovered that our relational model wasn't really so hot for that, so I built a cube over it with SSAS, and that cube is updated and processed every 15 minutes.
Yes, SSIS does give the aura of being mainly for straight ETL jobs, but I have found that it can be used for simple, quick jobs like this as well.
I think staging and partitioning would be too much for your case. I am implementing the same thing in SSIS now, but with a frequency of one hour, as I need to leave some time for support activities. I am sure that SSIS is a good way of doing it.
During the design, I also considered another way to achieve custom replication, by customizing the Change Data Capture (CDC) process. That way you can get near-real-time replication, but it is a tricky thing to get right.