I have read many articles and posts about how a cursor is a massive performance hindrance compared to the equivalent single set-based query.
However, with a cursor, you are able to perform the desired operation successfully on all rows that did not err, and provide an error message for each row that did.
Is there some other way I can achieve this row granularity with set operations?
No, a set-based operation works - as the name tells us - on a set. It will succeed or fail as a whole.
A CURSOR (or any other procedural approach like WHILE or an external program) can be the best choice in this case.
If performance matters, I would prefer to use a tolerant staging table for the first set-based import, then perform some quality/cleaning actions there to ensure a successful transfer, and finally shift the cleaned data into your target tables (set-based).
This depends on the data, your business rules and - of course - the amount of rows.
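A minimal sketch of that staging pattern, assuming a hypothetical constraint-free staging table stg_orders, a target table orders and a lookup table customers (all names are illustrative):

-- 1) Tolerant, set-based load into the constraint-free staging table
INSERT INTO stg_orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount
FROM external_source;

-- 2) Quality checks: mark rows that would violate the target's rules
UPDATE stg_orders
SET error_msg = 'missing customer'
WHERE customer_id NOT IN (SELECT customer_id FROM customers);

-- 3) Set-based transfer of only the clean rows
INSERT INTO orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount
FROM stg_orders
WHERE error_msg IS NULL;

The rows left behind in stg_orders, together with their error_msg values, give you the per-row error reporting the cursor would otherwise provide.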
In a DWH environment, for performance reasons, I need to materialize a view into a table with approx. 100 columns and 50,000,000 records. Daily, ~60,000 new records are inserted and ~80,000 updates on existing records are performed. I am not allowed to use materialized views because the architect claims they lead to performance issues. I can't argue the case any more; it's an irrevocable decision and I have to accept it.
So I would like to do a daily full load at night, e.g. truncate and insert. But if the job fails, the table may not be empty; it must still contain the data from the last successful population.
Therefore I thought about something like a failover table that will be used instead if anything goes wrong:
IF v_load_job_failed THEN failover_table
ELSE regular_table
Is there something like a failover table that will be used instead of another table depending on a predefined condition? Something like a trigger that rewrites or manipulates a select-query before execution?
I know that is somewhat of a dirty workaround.
If you have space for a (brief) period of double storage, I'd recommend:
1) Clone existing table (all indexes, grants, etc) but name with _TMP
2) Load _TMP
3) Rename base table to _BKP
4) Rename _TMP to match Base table
5) Rename _BKP to _TMP
6) Truncate _TMP
ETA: #1 would be "one time"; 2-6 would be part of daily script.
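In Oracle syntax, the daily portion (steps 2-6) might look roughly like the following; the table and view names are illustrative, and the renames assume no dependent objects need re-pointing:

-- Step 2: load the clone
INSERT /*+ APPEND */ INTO report_table_tmp
SELECT * FROM source_view;
COMMIT;

-- Steps 3-5: swap the tables by renaming
ALTER TABLE report_table RENAME TO report_table_bkp;
ALTER TABLE report_table_tmp RENAME TO report_table;
ALTER TABLE report_table_bkp RENAME TO report_table_tmp;

-- Step 6: empty the spare copy for the next run
TRUNCATE TABLE report_table_tmp;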
This all assumes the performance of (1) detecting all new records and all updated records and (2) using MERGE (INSERT+UPDATE) to integrate those changed records into base table is "on par" with full load.
(Personally, I lean toward the full load approach anyway; on the day somebody tweaks a referential value that's incorporated into the view def and changes the value for all records, you'll find yourself waiting on a week-long update of 50,000,000 records. Such concerns are completely eliminated with full-load approach)
All that said, it should be noted that if MV is defined correctly, the MV-refresh approach is identical to this approach in every way, except:
1) Simpler / less moving pieces
2) More transparent (SQL of view def is attached to MV, not buried in some PL/SQL package or .sql script somewhere)
3) Will not have "blip" of time, between table renames, where queries / processes may not see table and fail.
ETA: It's possible to pull this off with "partition magic" in a couple of ways that avoid a "blip" of time where data or table is missing.
You can, for instance, have an even-day and an odd-day partition. On odd days, insert data (no commit), then truncate the even-day partition (which simultaneously drops the old day and exposes the new). But is it worth the complexity? You need to add a column to partition by, and deal with the complexity of reruns - if your logic isn't tight, you'll wind up truncating the data you just loaded. This does, however, prevent a blip.
One method that does avoid any "blip" and is a little less "whoops" prone:
1) Add "DUMMY" column that always has value 1.
2) Create _TMP table (also with "DUMMY" column) and partition by DUMMY column (so all rows go to same partition)
-- Daily script --
3) Load _TMP table
4) Exchange partition of _TMP table with main base table WITHOUT VALIDATION INCLUDING INDEXES
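In Oracle, step 4 is a single DDL statement; assuming the base table is partitioned on the DUMMY column into a single partition (here called p_all), it would look something like:

ALTER TABLE report_table
  EXCHANGE PARTITION p_all
  WITH TABLE report_table_tmp
  INCLUDING INDEXES
  WITHOUT VALIDATION;

The exchange is a metadata operation, so readers see the old data one moment and the new data the next, with no window where the table is empty.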
It bears repeating: all of these methods are equivalent in resource usage to an MV refresh; they're just more complex and tend to make developers feel "savvy" for solving problems that have already been solved.
Final note - addressing David Aldridge - first and foremost, daily refresh tables SHOULD NOT have logging enabled. In a recovery scenario, just make sure you have a step to run the refresh scripts once the base tables are restored.
Performance-wise, mileage is going to vary on this; but in my experience, the complexity of identifying and modifying changed/inserted rows can get very sticky (at some point, somebody will do something to the base data that your script did not take into account, yielding either incorrect results or performance obstacles). DWH environments tend to be geared to accommodate processes like this with little problem. Unless/until the full refresh proves to have overhead above and beyond what the system can tolerate, it's generally the simplest "set-it-and-forget-it" approach.
On that note, if data can be logically separated into "live rows which might be updated" vs "historic rows that will never be updated", you can come up with a partitioning scheme and process that only truncates/reloads the "live" data on a daily basis.
A materialized view is just a set of metadata with an underlying table, and there's no reason why you cannot maintain a table in a manner similar to a materialized view's internal mechanisms.
I'd suggest using a MERGE statement as a single query rather than a truncate/insert. It will either succeed in its entirety or roll back to leave the previous data intact. 60,000 new records and 80,000 modified records is not much.
I think that you cannot go far wrong if you at least start with a simple, single SQL statement and then see how that works for you. If you do decide to go with a multistep process then ensure that it automatically recovers itself at any stage where it might go wrong part way through -- that might turn out to be the tricky bit.
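A rough sketch of that single-statement approach in Oracle, assuming the view exposes a key column that identifies each record (all names and columns are illustrative):

MERGE INTO report_table t
USING source_view s
   ON (t.record_id = s.record_id)
 WHEN MATCHED THEN
      UPDATE SET t.col_a = s.col_a,
                 t.col_b = s.col_b
 WHEN NOT MATCHED THEN
      INSERT (record_id, col_a, col_b)
      VALUES (s.record_id, s.col_a, s.col_b);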
Which is better?
1) A cursor that loops over 30,000 records and performs the updates one by one
2) A script that contains 30,000 update commands
thanks
Both should take about the same time, mainly subject to how the CURSOR is declared.
Reason? You have 30,000 individual updates either way, which is usually the main factor.
Note that 30,000 individual UPDATES in one batch will probably fail because of batch size and compile time anyway...
SQL is a set-based language and you can most likely do a single UPDATE to update all rows in one go. If you can't, it is for one of two reasons:
You need "per row" logic: this can usually be achieved with CASE expressions, UDFs, etc. (see the sketch below)
You don't understand sets and SQL
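For example, a hypothetical "per row" rule like "orders over 100 get a different status" collapses into one set-based statement (table and column names are made up):

UPDATE orders
SET status = CASE
                WHEN amount > 100 THEN 'PRIORITY'
                ELSE 'STANDARD'
             END
WHERE status IS NULL;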
With more information (the SQL and logic) we could help you more...
There is a very easy way to tell: Do it and measure the time.
Other than that, having 30,000 lines does not make a lot of sense when you can have just 10.
Making updates this way for reasons other than data migration or maintenance doesn't sound wise either, and in those cases performance is not an issue - but maintainability and legibility always are.
You know, that depends on context.
It helps, though, to learn - SQL, for example. You are at too low a level to see the real optimizations possible here. SQL is a lot more than just UPDATE, INSERT and simple SELECT statements.
1) A cursor that loops over 30,000 records and performs the updates one by one
Linear, step-by-step processing. There is no way to parallelize, as SQL itself has no threading mechanisms available to the user; optimizations happen one by one - i.e. the query optimizer looks at one statement at a time.
2) A script that contains 30,000 update commands
Assuming the script is external, it could split the work and run it concurrently on multiple connections, i.e. run more than one update in parallel.
But there is more:
Make a script that calculates the new values.
Bulk import them into a temporary table using the bulk copy API.
Issue ONE update statement that applies the updated values from the temporary table to the final one.
Or have a script that issues a MERGE statement for a multi-row update? There are tons of variations if you know more of the SQL API than "update, open cursor, simple select".
I do that - though with a lot more data (batches of 50,000, sometimes 4-6 at the same time). The problem is that SQL bulk copy has some overhead, but I manage 75,000 inserts per second that way.
A lot depends on the business questions and the complexity of the logic - if it is simple updates, then the question is: calculated or externally driven? Multiplying values by 2 = calculated; updating addresses = data driven (i.e. you need the new data from somewhere).
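A sketch of the temp-table-plus-single-UPDATE variant described above, in SQL Server syntax; it assumes the new values have already been bulk-copied into a temporary table #new_values keyed the same way as the target (all names are illustrative):

-- One set-based statement applies every change at once
UPDATE t
SET    t.price       = n.price,
       t.description = n.description
FROM   dbo.products AS t
JOIN   #new_values  AS n
  ON   n.product_id = t.product_id;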
My question is mainly concerned with "what's best for performance", but kinda "philosophically" speaking as well (if it makes a difference)... so let's jump right in.
[TableA].[ColumnB] stores a value that needs to exist in [TableC].[ColumnD]. Right off the bat, no answers involving Foreign-keys - just assume that they're "not allowed" in this environment for whatever reason.
But due to "circumstances x,y,z", [TableA].[ColumnB] sometimes gets values that do not exist in [TableC].[ColumnD], because, let's say, [TableA] gets populated from an object that exists in running code as a "serialized blob", an in-memory representation of the data, and the [ColumnB] values got populated before those values were deleted from [TableC].[ColumnD] by some other process. ANYWAY, this is for example's sake, so don't get bogged down in the "why does this condition happen", just accept that it does.
To "fix" the problem, which method is best of these two: 1. make a Trigger that fires on-INSERT on [TableA], to Update [ColumnB] to the value that it should be (and assume I have a "mapping" of bad-to-good values). Or, 2. run a scheduled-Job every hour/minute/whatever that runs Update queries to change all possible "bad" values to their corresponding "good" values.
To put it more generally, what's better for performance and/or what is best practice: a Trigger, or a periodic Scheduled-Job? In context, let's say [TableA] is typically on the order of hundreds of thousands of rows, with Inserts happening 10-100 records at-a-time, as frequently as every few minutes to as rarely as a few times per day.
On-insert.
Triggers are like callbacks: they're more logically sound, and they spread any lag across every query. With continual checks (polling or cron jobs), you end up with more severe moments of lag every now and then. In almost all cases, triggers/callbacks are the better way to go, as 1ms of lag added to every query is better than 100ms of lag at seemingly random intervals.
Use of triggers is generally discouraged, but your load is light and your case seems to be a natural trigger case. Consider using an instead-of trigger to avoid two operations on the same row (one insert instead of an insert and an update). It may be the simplest and most reliable solution (as long as you have written reliable code in the trigger that won't cause the whole operation to crash).
Since you are considering a batch job, you are not concerned with timing issues, i.e. it's OK with your application that the tables may be out of sync for a minute or even an hour. That's the major difference from the trigger approach, which will guarantee that the tables are in sync all the time. Potential timing issues would make me uncomfortable. On the plus side, you won't be at risk of crashing the original insert operation with your trigger.
If you go this route, please consider the Change Tracking feature. Change tracking will indicate which rows have been inserted since the last time you checked, so you won't have to scan the whole table for new records. Alternatively, if your TableA has an IDENTITY primary or unique key, you can implement a similar design without the change tracking functionality.
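A minimal sketch of the instead-of-trigger idea in SQL Server syntax, assuming a hypothetical mapping table ValueMap(bad_value, good_value) and an extra column OtherColumn standing in for the rest of TableA:

CREATE TRIGGER trg_TableA_Insert ON TableA
INSTEAD OF INSERT
AS
BEGIN
    -- Insert each incoming row once, substituting the "good" value where a mapping exists
    INSERT INTO TableA (ColumnB, OtherColumn)
    SELECT COALESCE(m.good_value, i.ColumnB),
           i.OtherColumn
    FROM   inserted AS i
    LEFT JOIN ValueMap AS m
           ON m.bad_value = i.ColumnB;
END;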
Triggers are best for both performance and practice, as they maintain referential integrity while allowing the server to optimise for performance.
You didn't say what version of SQL Server you were using, but if it's 2008+, you can use Change Data Capture to keep track of data changes to your "primary" table. Then, periodically, you can run a batch over the change table and do whatever processing is required over that small set.
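If you do go that way, enabling CDC is a couple of system procedure calls (the schema and table names here are placeholders):

-- Enable CDC for the database, then for the table whose changes you want to track
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'TableA',
     @role_name     = NULL;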
I'm working on a project in which we will need to determine certain types of statuses for a large body of people, stored in a database. The business rules for determining these statuses are fairly complex and may change.
For example,
if a person is part of group X
and (if they have attribute O) has either attribute P or attribute Q,
or (if they don't have attribute O) has attribute P but not Q,
and don't have attribute R,
and aren't part of group Y (unless they also are part of group Z),
then status A is true.
Multiply by several dozen statuses and possibly hundreds of groups and attributes. The people, groups, and attributes are all in the database.
Though this will be consumed by a Java app, we also want to be able to run reports directly against the database, so it would be best if the set of computed statuses were available at the data level.
Our current design plan, then, is to have a table or view that consists of a set of boolean flags (hasStatusA? hasStatusB? hasStatusC?) for each person. This way, if I want to query for everyone who has status C, I don't have to know all of the rules for computing status C; I just check the flag.
(Note that, in real life, the flags will have more meaningful names: isEligibleForReview?, isPastDueForReview?, etc.).
So a) is this a reasonable approach, and b) if so, what's the best way to compute those flags?
Some options we're considering for computing flags:
Make the set of flags a view, and calculate the flag values from the underlying data in real time using SQL or PL/SQL (this is an Oracle DB). This way the values are always accurate, but performance may suffer, and the rules would have to be maintained by a developer.
Make the set of flags consist of static data, and use some type of rules engine to keep those flags up-to-date as the underlying data changes. This way the rules can be maintained more easily, but the flags could potentially be inaccurate at a given point in time. (If we go with this approach, is there a rules engine that can easily manipulate data within a database in this way?)
In a case like this I suggest applying Ward Cunningham's question: ask yourself "What's the simplest thing that could possibly work?".
In this case, the simplest thing might be to come up with a view that looks at the data as it exists and does the calculations and computations to produce all the fields you care about. Now, load up your database and try it out. Is it fast enough? If so, good - you did the simplest possible thing and it worked out fine. If it's NOT fast enough, good - the first attempt didn't work, but you've got the rules mapped out in the view code. Now you can go on to try the next iteration of "the simplest thing" - perhaps you write a background task that watches for inserts and updates and then jumps in to recompute the flags. If that works, fine and dandy. If not, go to the next iteration... and so on.
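As a sketch of that first iteration, the flags can start life as CASE expressions in an ordinary view. This is a deliberately trimmed-down version of the status-A rule (only the group-X and attribute-R clauses), and every table and column name is invented:

CREATE OR REPLACE VIEW person_status_flags AS
SELECT p.person_id,
       CASE
           WHEN g_x.person_id IS NOT NULL   -- is part of group X
            AND a_r.person_id IS NULL       -- does not have attribute R
           THEN 1 ELSE 0
       END AS has_status_a
FROM   persons p
LEFT JOIN group_members g_x
       ON g_x.person_id = p.person_id AND g_x.group_code = 'X'
LEFT JOIN person_attributes a_r
       ON a_r.person_id = p.person_id AND a_r.attr_code = 'R';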
Share and enjoy.
I would advise against making the statuses column names; rather, use a status ID and value, such as a customer status table with columns ID and Value.
I would have two methods for updating statuses. One is a stored procedure that either has all the logic or calls separate stored procs to figure out each status; you could make all this dynamic by having a function for each status evaluation, and the one stored proc could then call each function. The second method would be to have whatever stored proc(s) update user info call a stored proc to update all the users' statuses based upon the current data. These two methods would give you both real-time updates for the data that changed and, if you add a new status, a way to update all statuses with the new logic.
Hopefully you have one point of updates to the user data, such as a user update stored proc, and you can put the status update stored proc call in that procedure. This would also save having to schedule a task every n seconds to update statuses.
An option I'd consider would be for each flag to be backed by a deterministic function that returns the up-to-date value given the relevant data.
The function might not perform well enough, however, if you're calling it for many rows at a time (e.g. for reporting). So, if you're on Oracle 11g, you can solve this by adding virtual columns (search for "virtual column") to the relevant tables based on the function. The Result Cache feature should improve the performance of the function as well.
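A sketch of that shape in Oracle syntax; the function name, table and columns are invented, and whether a function-backed virtual column is appropriate depends on the real rules staying within Oracle's determinism restrictions:

CREATE OR REPLACE FUNCTION has_status_a (p_person_id IN NUMBER)
    RETURN NUMBER
    DETERMINISTIC
    RESULT_CACHE
IS
BEGIN
    -- evaluate the business rule for one person: 1 = true, 0 = false
    RETURN 0;   -- placeholder body
END;
/

ALTER TABLE persons ADD (
    has_status_a_flag NUMBER
        GENERATED ALWAYS AS (has_status_a(person_id)) VIRTUAL
);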
I have a normalized database and need to produce web-based reports frequently that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results computed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.
All tables have auto-increment primary integer keys, and I almost always add new rows and can arrange to clear the computed results if they change.
I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table, keeping track of the iterator state and the highest primary key (i.e. "high-water mark") seen. That's fine for a single table, but for multiple tables I'd end up keeping one high-water value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably change my database size from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
I see two approaches:
You move the data into a separate, denormalized database, with some precalculation, to optimize it for quick access and reporting (sounds like a small data warehouse). This implies you have to think about jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on the way you want the copying to be done (full/incremental), the frequency of copying and the complexity of the data model (both source and destination), it might take a while to implement and then to optimize the process. It has the advantage that it leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changes in the logic of the application (but you might find a way to minimize the impact on the logic using the database; you know the situation better than me :) ).
Can the reports be refreshed incrementally, or is it a full recalculation to rework the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and metadata table to define what report output versions are available), but most of the time this is overkill and you are better off just saving the query results off to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
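As a sketch of that bookkeeping, the high-water data can be as small as one row per source table; the table and column names here are made up, and the same idea works in both SQLite and MySQL:

-- Remember how far each source table has been summarised
CREATE TABLE refresh_watermark (
    table_name  TEXT PRIMARY KEY,
    max_id_seen INTEGER NOT NULL
);

-- Incremental pass: pick up only the rows added since the last run
SELECT o.*
FROM   orders o
WHERE  o.id > (SELECT max_id_seen
               FROM   refresh_watermark
               WHERE  table_name = 'orders');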
You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table
AFTER UPDATE OF order_value ON orders
FOR EACH ROW
BEGIN
  UPDATE summary
  SET total_order_value = total_order_value
                          - old.order_value
                          + new.order_value;
  -- OR: Do a complete recalculation
  -- OR: Store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).