SQL: How to properly create a summary table?

I have underlying tables on which the data changes constantly. Every minute or so, I run a stored procedure to summarize the data in those underlying tables into a summary table. The summarization takes a long time (~30 s), so it does not make sense to have a "summary view." Additionally, the summary table is constantly accessed by multiple users; it needs to be quick and responsive, and it cannot be down.
To solve this, I do the following in the stored procedure:
1. Summarize the data into a "new summary table" (this can take as long as it needs, because the "current summary table" is still serving the users).
2. Drop the "current summary table".
3. Rename the "new summary table" to "current summary table".
My questions are:
Is this safe/proper?
What happens if a user tries to access the "current summary table" when the summarization procedure is between steps 2 and 3 above?
What is the right way to do this? At the end of the day, I just need the summary to always be quickly accessible (this is important) and to be up to date (within a minute or so).

By using triggers on the detail tables, you can keep the summary in sync. For things like averages, you need to track the sum and the count in the summary table as well, so you can recompute the average. Row-level triggers may carry more overhead than triggers that fire once for all rows of an operation if you have bulk churn, assuming SQL Server offers both flavors of trigger as Oracle does. Inserts might create a summary row or update one, deletes may update or delete the summary row, and updates might change a key and so do both. Of course, there may be multiple kinds of summary row for any given detail row.
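In SQL Server, DML triggers fire once per statement and expose the affected rows through the inserted/deleted pseudo-tables. As a minimal sketch of the idea, assuming hypothetical tables detail(grp, amount) and summary(grp, total_amount, row_count) so that the average is total_amount / row_count, an insert trigger might look like this (deletes and updates would need analogous triggers):

CREATE TRIGGER trg_detail_insert ON detail
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Fold all rows of the insert into the summary in a single statement.
    MERGE summary AS s
    USING (SELECT grp, SUM(amount) AS amt, COUNT(*) AS cnt
           FROM inserted
           GROUP BY grp) AS i
        ON s.grp = i.grp
    WHEN MATCHED THEN
        UPDATE SET total_amount = s.total_amount + i.amt,
                   row_count    = s.row_count    + i.cnt
    WHEN NOT MATCHED THEN
        INSERT (grp, total_amount, row_count)
        VALUES (i.grp, i.amt, i.cnt);
END;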
Oracle has materialized views, so maybe SQL Server has them too. Oh look, it does! Google is my friend; so much to remember! At best, that would be something like shorthand for the above.
Such triggers can add a lot of latency to churn on the detail tables. For some uses, regenerating the summary table with a periodic query is enough. A procedure can truncate the previous table for reuse, generate the new summary into it, and then swap the names inside a transaction. If there is a timestamp in or for the table, the procedure can skip runs where nothing has changed. The lock, disk and CPU overhead of a periodic query is often a lot less than that of trigger churn.
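A minimal sketch of that regenerate-and-swap pattern in SQL Server, assuming hypothetical tables summary_current (served to users) and summary_stage (rebuilt each run); the name swap could equally be done with synonyms or partition switching:

-- Rebuild the staging copy (this can take as long as it needs).
TRUNCATE TABLE summary_stage;
INSERT INTO summary_stage (grp, total_amount, row_count)
SELECT grp, SUM(amount), COUNT(*)
FROM detail
GROUP BY grp;

-- Swap the names in one transaction; readers should block briefly rather than hit a missing table.
BEGIN TRANSACTION;
    EXEC sp_rename 'summary_current', 'summary_old';
    EXEC sp_rename 'summary_stage', 'summary_current';
    EXEC sp_rename 'summary_old', 'summary_stage';
COMMIT TRANSACTION;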
Some summaries, like the median, are very hard to support except by a view, but such a view can run fast if the columns are indexed (non-clustered, sorted rather than hash indexes), as queries can often be answered straight from non-clustered indexes. Excess indexes slow transactions (churn), so many shops use replicated tables for reporting, with few, narrow indexes on the parent transaction table and report-oriented indexes on the replicated table.

Related

Snowflake database: question on the performance of a table stored in Snowflake

We have continuous inserts, updates, and deletes in a table in a Snowflake DB. Can this slow down the performance of the table over time?
Yes. For two reasons.
First, INSERT, UPDATE, and DELETE operations fragment the micro-partition data, so even if the same number of rows is present after N hours/days, the layout of the rows can become unaligned with the access pattern of the queries you run; your performance profile can go from highly pruned partition reads to full-table reads.
Also, with a large number of changes, even if the data ends up perfectly ordered, the sheer fact that many changes are being made means you end up with far too many micro-partitions, which slows down your SQL compilation.
You can also get bad performance if you INSERT, UPDATE, and DELETE against the same table at the same time, as the second operation will be blocked by the first. This can waste wall-clock time and credits (if the operations run on different warehouses).
Some things you can do to avoid this: run clustering, and rebuild the tables during "down time". Don't delete the data; instead, insert the keys into "delete tables" and then left join and exclude the matches. We have done all of the above.
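A sketch of that "delete table" pattern, with hypothetical names (big_table, big_table_deletes, rows_to_remove):

-- Instead of deleting rows from the large table, record the keys to remove...
INSERT INTO big_table_deletes (id)
SELECT id FROM rows_to_remove;

-- ...and exclude them at read time (or during a periodic rebuild of the table).
SELECT t.*
FROM big_table AS t
LEFT JOIN big_table_deletes AS d
       ON d.id = t.id
WHERE d.id IS NULL;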

What is the best way to create a warehouse stock report: a SQL query, or a table updated by triggers?

Now I want to make an online stock-report balance sheet that calculates its result from hundreds of thousands of records across many tables.
I have two ways to do that.
First way:
Make a SQL view that calculates the result from all transactions and gets the result just in time (JIT).
Second way:
Make a stock-balance table and update it by triggers that run on each transaction.
What is the best way to get balance-sheet reports?
If I understand correctly, you want input on how to generate your stock balance report. This type of analysis usually boils down to a few factors: concurrency, performance and maintenance. My comments:
1. Dynamic calculation (many tables, hundreds of thousands of records)
Writes: stock table
Reads: stock table
PROS: better concurrency / throughput (only one write)
CONS: potential for slower reports (if tables grow too big and indexes are not maintained)
This is the better option if you have lots of transactions and want to avoid locking issues. You avoid having to update separate tables / structures when you update your stock table.
2. Maintain summary table via triggers run after each transaction
Writes: stock table, summary table
Reads: summary table
PROS: Fast reports (if you allow for "dirty" reads)
CONS: slower writes, potential for more locking issues
If you don't have a lot of transactions and you want fast performance, then this is worth looking at. Just keep in mind you have to do two UPDATEs, so your write operations will take longer. If you can do a "dirty" read (i.e. access the summary table without a READ lock), then this should give you very fast reports.
Another option to look at is indexed (materialized) views, which are like a hybrid of the two options above: Indexed views
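For reference, a minimal sketch of what such an indexed view could look like in SQL Server, assuming a hypothetical dbo.stock_transactions table with a NOT NULL quantity column (indexed views require schema binding and COUNT_BIG(*) when aggregating):

-- The view must be schema-bound and include COUNT_BIG(*) because it aggregates.
CREATE VIEW dbo.v_stock_balance
WITH SCHEMABINDING
AS
SELECT item_id,
       SUM(quantity) AS total_quantity,
       COUNT_BIG(*)  AS row_count
FROM dbo.stock_transactions
GROUP BY item_id;
GO

-- Materialize it by creating a unique clustered index on the view.
CREATE UNIQUE CLUSTERED INDEX IX_v_stock_balance
    ON dbo.v_stock_balance (item_id);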

Generated de-normalised View table

We have a system that makes use of a database view, which takes data from a few reference tables (lookups) and then does a lot of pivoting and complex work on a hierarchy table of (pretty much fixed and static) locations, returning a view of the data to the application.
This view is getting slow, as new requirements are added.
A solution that may be an option would be to create a normal table, select from the view into this table, and let the application use that highly indexed and fast table for its querying.
The issue is, I guess, that if the underlying tables change, the new table will show old results. But the data that drives this table changes very infrequently. And if it does, a business/technical process could be put in place so that an 'Update the Table' procedure is run to refresh this data. Or even an update/insert trigger on the primary driving table?
Is this practice advised/ill-advised? And are there ways of making it safer?
The ideal solution is to optimise the underlying queries.
In SSMS, run the slow query and include the actual execution plan (Ctrl + M); this will give you a graphical representation of how the query is being executed against your database.
Another helpful tool is to turn on IO statistics, since IO is usually the main bottleneck with queries. Put this line at the top of your query window:
SET STATISTICS IO ON;
Check whether SQL recommends any missing indexes (displayed in green in the execution plan); as you say the data changes infrequently, it should be safe to add additional indexes if needed.
In the execution plan you can hover your mouse over any element for more information. Check the estimated rows vs the actual rows returned; if they differ greatly, update the statistics for the tables, which can help the query optimiser find the best execution plan.
To do this for all tables in a database:
USE [Database_Name]
GO
exec sp_updatestats
Still no luck in optimising the view / query?
Be careful with update triggers: if the schema of the view/table changes (say you add a new column to the source table), the new column will not be inserted into your 'optimised' table unless you also update the trigger.
If it is not a business requirement to report on real-time data, there is not too much harm in having a separate optimised table for reporting (much like a data mart); just use a SQL Agent job to refresh it nightly during non-peak hours.
There are a few cons to this approach though:
More storage space / duplicated data
More complex database
Additional workload during the refresh
Decreased cache hits
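As a sketch of the nightly refresh mentioned above (hypothetical names dbo.report_table and dbo.slow_view; the column list is illustrative), a procedure like this could be scheduled from a SQL Agent job:

CREATE PROCEDURE dbo.refresh_report_table
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        -- Empty the reporting copy but keep its indexes...
        TRUNCATE TABLE dbo.report_table;
        -- ...then reload it from the slow view in one pass.
        INSERT INTO dbo.report_table (col_a, col_b, col_c)
        SELECT col_a, col_b, col_c
        FROM dbo.slow_view;
    COMMIT TRANSACTION;
END;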

Oracle - Failover table or query manipulation

In a DWH environment, for performance reasons, I need to materialize a view into a table with approx. 100 columns and 50,000,000 records. Daily, ~60,000 new records are inserted and ~80,000 updates on existing records are performed. By decision I am not allowed to use materialized views, because the architect claims this leads to performance issues. I can't argue the case any more; it's an irrevocable decision and I have to accept it.
So I would like to make a daily full load in the night, e.g. truncate and insert. But if the job fails, the table must not be left empty; it must still contain the data from the last successful population.
Therefore I thought about something like a failover table that will be used instead if anything goes wrong:
IF v_load_job_failed THEN failover_table
ELSE regular_table
Is there something like a failover table that will be used instead of another table depending on a predefined condition? Something like a trigger that rewrites or manipulates a select-query before execution?
I know that is somewhat of a dirty workaround.
If you have space for a (brief) period of double storage, I'd recommend:
1) Clone existing table (all indexes, grants, etc) but name with _TMP
2) Load _TMP
3) Rename base table to _BKP
4) Rename _TMP to match Base table
5) Rename _BKP to _TMP
6) Truncate _TMP
ETA: #1 would be "one time"; 2-6 would be part of daily script.
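A sketch of the daily portion (steps 2 through 6), with hypothetical names summary, summary_tmp and source_view:

-- Step 2: load the _TMP clone (direct-path insert keeps redo low if the table is NOLOGGING).
INSERT /*+ APPEND */ INTO summary_tmp
SELECT * FROM source_view;
COMMIT;

-- Steps 3-5: swap the names.
ALTER TABLE summary RENAME TO summary_bkp;
ALTER TABLE summary_tmp RENAME TO summary;
ALTER TABLE summary_bkp RENAME TO summary_tmp;

-- Step 6: empty yesterday's copy, ready for tomorrow's run.
TRUNCATE TABLE summary_tmp;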
This all assumes the performance of (1) detecting all new records and all updated records and (2) using MERGE (INSERT+UPDATE) to integrate those changed records into the base table is "on par" with a full load.
(Personally, I lean toward the full load approach anyway; on the day somebody tweaks a referential value that's incorporated into the view def and changes the value for all records, you'll find yourself waiting on a week-long update of 50,000,000 records. Such concerns are completely eliminated with full-load approach)
All that said, it should be noted that if MV is defined correctly, the MV-refresh approach is identical to this approach in every way, except:
1) Simpler / less moving pieces
2) More transparent (SQL of view def is attached to MV, not buried in some PL/SQL package or .sql script somewhere)
3) Will not have "blip" of time, between table renames, where queries / processes may not see table and fail.
ETA: It's possible to pull this off with "partition magic" in a couple of ways that avoid a "blip" of time where data or table is missing.
You can, for instance, have an even-day and an odd-day partition. On odd days, insert data (no commit), then truncate the even-day partition (which simultaneously drops the old day and exposes the new). But is it worth the complexity? You need to add a column to partition by, and deal with the complexity of reruns: if your logic isn't tight, you'll wind up truncating the data you just loaded. This does, however, prevent a blip.
One method that does avoid any "blip" and is a little less "whoops" prone:
1) Add "DUMMY" column that always has value 1.
2) Create _TMP table (also with "DUMMY" column) and partition by DUMMY column (so all rows go to same partition)
-- Daily script --
3) Load _TMP table
4) Exchange partition of _TMP table with main base table WITHOUT VALIDATION INCLUDING INDEXES
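The exchange itself might look like this in Oracle, assuming summary_tmp is partitioned on the DUMMY column with a single partition named p_all and the base table is summary:

-- Swaps the data segments instantly; no rows are physically moved.
ALTER TABLE summary_tmp
  EXCHANGE PARTITION p_all WITH TABLE summary
  INCLUDING INDEXES WITHOUT VALIDATION;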
It bears repeating: all of these methods are equivalent in resource usage to an MV refresh; they're just more complex and tend to make developers feel "savvy" for solving problems that have already been solved.
Final note - addressing David Aldridge - first and foremost, daily-refresh tables SHOULD NOT have logging enabled. In a recovery scenario, just make sure you have a step to run the refresh scripts once the base tables are restored.
Performance-wise, mileage is going to vary on this; but in my experience, the complexity of identifying and modifying changed/inserted rows can get very sticky (at some point, somebody will do something to the base data that your script did not take into account, yielding either incorrect results or performance obstacles). DWH environments tend to be geared to accommodate processes like this with little problem. Unless/until the full refresh proves to have overhead above and beyond what the system can tolerate, it's generally the simplest "set it and forget it" approach.
On that note, if data can be logically separated into "live rows which might be updated" vs "historic rows that will never be updated", you can come up with a partitioning scheme and process that only truncates/reloads the "live" data on a daily basis.
A materialized view is just a set of metadata with an underlying table, and there's no reason why you cannot maintain a table in a manner similar to a materialized view's internal mechanisms.
I'd suggest using a MERGE statement as a single query rather than a truncate/insert. It will either succeed in its entirety or roll back and leave the previous data intact. 60,000 new records and 80,000 modified records is not much.
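A minimal sketch of such a MERGE, assuming a key column id and hypothetical column and view names:

MERGE INTO summary_table t
USING (SELECT id, col_a, col_b FROM source_view) s
   ON (t.id = s.id)
WHEN MATCHED THEN
    UPDATE SET t.col_a = s.col_a,
               t.col_b = s.col_b
WHEN NOT MATCHED THEN
    INSERT (id, col_a, col_b)
    VALUES (s.id, s.col_a, s.col_b);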
I think that you cannot go far wrong if you at least start with a simple, single SQL statement and then see how that works for you. If you do decide to go with a multistep process then ensure that it automatically recovers itself at any stage where it might go wrong part way through -- that might turn out to be the tricky bit.

How should I keep accurate records summarising multiple tables?

I have a normalized database and need to produce web-based reports frequently that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results precomputed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.
All tables have auto-increment integer primary keys, and I almost always add new rows; I can arrange to clear the computed results if they change.
I approached a similar problem, where I needed a summary of a single table, by arranging to iterate over each row in the table and keeping track of the iterator state and the highest primary key (i.e. "high water" mark) seen. That's fine for a single table, but for multiple tables I'd end up keeping one high-water value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably grow my database from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
I see two approaches:
You move the data into a separate, denormalized database, with some precalculation, optimized for quick access and reporting (it sounds like a small data warehouse). This implies you have to think of some jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on the way you want the copying to be done (full/incremental), the frequency of copying and the complexity of the data model (both source and destination), it might take a while to implement and then to optimize the process. It has the advantage that it leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changes to the logic of the application (but you might find a way to minimize the impact on the logic using the database; you know the situation better than me :) ).
Can the reports be refreshed incrementally, or is it a full recalculation to rework the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and metadata table to define what report output versions are available), but most of the time this is overkill and you are better off just saving the query results off to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
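A sketch of that high-water-mark idea in SQLite (3.24+ for the upsert), with hypothetical tables orders, order_summary (unique on order_day) and refresh_state(table_name, last_id); one refresh_state row per source table:

BEGIN;
-- Fold in only the detail rows added since the last run...
INSERT INTO order_summary (order_day, total_value, order_count)
SELECT date(order_date), SUM(order_value), COUNT(*)
FROM orders
WHERE id > (SELECT last_id FROM refresh_state WHERE table_name = 'orders')
GROUP BY date(order_date)
ON CONFLICT(order_day) DO UPDATE SET
    total_value = total_value + excluded.total_value,
    order_count = order_count + excluded.order_count;

-- ...then advance the per-table high-water mark.
UPDATE refresh_state
SET last_id = (SELECT MAX(id) FROM orders)
WHERE table_name = 'orders';
COMMIT;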
You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table AFTER UPDATE OF order_value ON orders
BEGIN
    UPDATE summary
    SET total_order_value = total_order_value
                            - old.order_value
                            + new.order_value;
    -- OR: do a complete recalculation
    -- OR: store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).