How should I keep accurate records summarising multiple tables?

I have a normalized database and need to frequently produce web-based reports that involve joins across multiple tables. These queries are taking too long, so I'd like to store the computed results so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.
All tables have autoincrement integer primary keys, and I almost always add new rows; I can arrange to clear the computed results if existing rows change.
I approached a similar problem where I needed a summary of a single table by iterating over each row in the table and keeping track of the iterator state and the highest primary key (i.e. "highwater") seen. That's fine for a single table, but for multiple tables I'd end up keeping one highwater value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably grow my database from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).

I see two approaches:
You move the data into a separate, denormalized database, with some precalculation, optimized for quick access and reporting (it sounds like a small data warehouse). This implies you have to build some jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on how you want the copying done (full/incremental), the frequency of copying, and the complexity of the data model (both source and destination), it might take a while to implement and then to optimize the process. It has the advantage that it leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changes to the logic of the application (but you might find a way to minimize the impact on the logic using the database; you know the situation better than me :) ).

Can the reports be refreshed incrementally, or does reworking a report require a full recalculation? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and a metadata table to define which report output versions are available), but most of the time this is overkill and you are better off just saving the query results to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
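As a rough sketch of that bookkeeping (SQLite syntax; every table and column name here is invented for illustration), you could keep one row per report and source table recording the PK range already folded into the cached output:
CREATE TABLE report_watermarks (
    report_name  TEXT NOT NULL,
    source_table TEXT NOT NULL,
    min_pk       INTEGER NOT NULL,
    max_pk       INTEGER NOT NULL,
    refreshed_at TEXT NOT NULL,
    PRIMARY KEY (report_name, source_table)
);
-- An incremental refresh then only pulls rows above the stored high water mark:
-- SELECT * FROM orders
-- WHERE id > (SELECT max_pk FROM report_watermarks
--             WHERE report_name = 'daily_totals' AND source_table = 'orders');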

You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table AFTER UPDATE OF order_value ON orders
BEGIN
    UPDATE summary
    SET total_order_value = total_order_value
                            - old.order_value
                            + new.order_value;
    -- OR: do a complete recalculation
    -- OR: store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
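The UPDATE trigger above only covers changed rows; if you also insert and delete rows, you would want companion triggers so the summary stays in sync. A minimal sketch, assuming the same hypothetical orders/summary tables as above:
CREATE TRIGGER summary_after_insert AFTER INSERT ON orders
BEGIN
    UPDATE summary SET total_order_value = total_order_value + new.order_value;
END;

CREATE TRIGGER summary_after_delete AFTER DELETE ON orders
BEGIN
    UPDATE summary SET total_order_value = total_order_value - old.order_value;
END;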

In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.

You haven't said anything about your indexing strategy. I would look at that first, making sure that your indexes cover the reporting queries.
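For example (purely illustrative table and column names), if a report repeatedly runs an aggregate over a date range, a composite index that includes every referenced column lets the database answer the query from the index alone, without touching the base table:
-- Report query: SELECT customer_id, SUM(order_value)
--               FROM orders WHERE order_date >= '2015-01-01' GROUP BY customer_id;
CREATE INDEX idx_orders_report_covering
    ON orders (order_date, customer_id, order_value);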
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).

Related

Is it possible to implement point in time recovery (PITR) in PostgreSQL for a single table?

Let's say I have a database with lots of tables, but there's one big table that's being updated regularly. At any given point in time, this table contains billions of rows, and let's say that the table is updated so regularly that we can expect a 100% refresh of the table by the end of each quarter. So the volume of data being moved around is on the order of tens of billions of rows. Because this table is changing so constantly, I want to implement PITR, but only for this one table. I have two options:
Hack PostgreSQL's in-house PITR to apply only for one table.
Build it myself by creating a base backup, setting up continuous archiving, and using a Python script to execute the log of SQL statements up to a point in time (or using PostgreSQL's EXECUTE statement to loop through the archive). The big con with this is that it won't have the timeline functionality.
My problem is, I don't know if option 1 is even possible, and I don't know if option 2 even makes sense (looping through billions of rows sounds like it defeats the purpose of PITR, which is speed and convenience.) What other options do I have?

Best practice to update bulk data in table used for reporting in SQL

I have created a table for reporting purposes where I am storing data in about 50 columns, and at some time interval my scheduler executes a service which processes other tables and fills up data in my flat table.
Currently I am deleting and inserting data in that table, but I want to know if this is good practice, or whether I should check every column in every row, update it if any change is found, and insert a new record if the data does not exist.
FYI, total number of rows which are being reinserted is 100k+.
This is a very broad question that can only really be answered with access to your environment and discussion on your personal requirements. Obviously this is not possible via Stack Overflow.
This means you will need to make this decision yourself.
The information you need to understand to be able to do this are the types of table updates available and how you can achieve them, normally referred to as Slowly Changing Dimensions. There are several different types, each with their own advantages, disadvantages and optimal use cases.
Once you understand the how of getting your data to incrementally update as required, you can then look at the why and whether the extra processing logic required to achieve this is actually worth it. Your dataset of a few hundred thousand rows is not large and therefore probably does not need this level of processing just yet, though that assessment will depend on how complex and time-consuming your current process is and how long you have to run it.
It is probably faster to repopulate the table of 100k rows. To do an update, you still need to:
generate all the rows to insert
compare values in every row
update the values that have changed
The expense of updating rows lies largely in the logging and data-movement operations at the data page level. In addition, you still need to bring the data together.
If the update is updating a significant portion of rows, perhaps even just a few percent of them, then it is likely that all data pages will be modified. So the I/O is pretty similar.
When you simply replace the table, you will start by either dropping the table or truncating it. Those are relatively cheap operations because they are not logged at the row level. Then you are inserting into the table. Inserting 100,000 rows from one table to another should be pretty fast.
The above is general guidance. Of course, if you are only changing 3 rows in the table each day, then update is going to be faster. Or, if you are adding a new layer of data each day, then just an insert, with a handful of changed historical values might be a fine approach.
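A minimal sketch of the repopulate variant (generic SQL; the flat table, source tables and columns are placeholders), wrapped in a transaction so readers never see a half-filled table:
BEGIN TRANSACTION;
DELETE FROM report_flat;   -- or TRUNCATE TABLE report_flat, where your engine allows it inside a transaction
INSERT INTO report_flat (col1, col2, col3)
SELECT s.col1, s.col2, a.col3
FROM source_main s
JOIN source_aux a ON a.main_id = s.id;
COMMIT;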

Why would all tables not be temporal tables by default?

I'm creating a new database and plan to use temporal tables to log all changes. The data stored will be updated daily but will not exceed 5000 records per table.
Is there any reason I shouldn't just make all tables temporal?
Ps. I am aware of the space usage of temporal tables, this is not as far as I understand a problem
I am aware of the space usage of temporal tables, this is not as far as I understand a problem
On the contrary - it's a pretty big problem - and there are many other downsides too.
When you use Temporal Tables (at least in SQL Server), every UPDATE operation (even if the data is unchanged) results in a copy being made in the History table (granted, under-the-hood this may be a COW-optimized copy, but it's still another conceptual entity instance).
Secondly - from my personal experience working with LoB applications: most changes to databases are not important enough to justify creating an entire copy of a row. For example, imagine a table with 4 columns (CREATE TABLE People ( FirstName nvarchar(50), LastName nvarchar(50), Address nvarchar(200), Biography nvarchar(max) )): whenever a typo in FirstName is fixed, all of the data in the other columns is copied over, even if Biography contains 4GB worth of text data. Even if this is COW-optimized, it's still creating copies for every user action that results in a change.
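For reference, a minimal sketch of that People table as a SQL Server (2016+) temporal table; the PersonId key and the period columns are additions needed to make it system-versioned, and PeopleHistory is where a full copy of the old row lands on every UPDATE:
CREATE TABLE People
(
    PersonId  int IDENTITY PRIMARY KEY,
    FirstName nvarchar(50),
    LastName  nvarchar(50),
    Address   nvarchar(200),
    Biography nvarchar(max),
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PeopleHistory));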
Is there any reason I shouldn't just make all tables temporal?
The main reason, in my experience, is that it makes changing your table schema much harder, because the schemas (aka "table design") of the Active and History tables must be identical: so if you have a table with a NULL column that you want to change to a NOT NULL column, and you have NULL values in your History table, then you're stuck - at least until you write a data transformation step that supplies the History table with valid data. It's basically creating more work for yourself with little to gain.
Also, don't confuse Temporal Tables with Immutable, Append-only data-stores (like the Bitcoin Blockchain) - while they share similar design objectives (except true immutability) they exist to solve different problems - and if you consider the size requirements and scaling issues of the Ethereum block-chain (over a terabyte by now) then that should give you another idea why it's probably not a good idea.
Finally, even if Temporal Tables didn't have these issues - you still need to go through the effort to write your main software such that it can natively handle temporal data - and things like Entity Framework still don't have built-in support for querying Temporal Data.
...and even with all the historical records you've managed to save in the History table, what do you want it for? Do you really need to track every corrected typo and small, inconsequential change? How will your users react to needing to manually audit the changes to determine what's meaningful or not?
In short:
If your table design probably won't change much in the future...
AND small updates happen infrequently...
OR large updates happen regularly AND you need an audit record
...then go ahead and use Temporal Tables wherever you can.
If not, then you're just creating more future work for yourself with little to gain.
"log all changes" is not a good use case for the temporal features in SQL.
The use case for the SYSTEM TIME temporal feature is when there is a pressing requirement obligating you/the user to be able to "easily and quickly" reconstruct (scare quotes intended) the state that your database was in at a given moment in time. That is difficult, error-prone and expensive if all you have is a true log of past changes. But if you can suffice with keeping just a log of the changes, then do that (it will be difficult, error-prone and expensive to recreate past database states from the current state and your log, but that's not a pressing problem if there's no pressing need).
Also note that the SQL temporal features also encompass the notion of BUSINESS TIME, which is a different time dimension from SYSTEM TIME. Business time is aimed at keeping a history of the world situation; system time is aimed at keeping a history of your database itself, that is, a history of your records of the world situation.
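For the SYSTEM TIME case, reconstructing a past state is roughly a one-clause affair; a sketch in SQL Server syntax, reusing the People example from earlier in this thread:
SELECT FirstName, LastName, Address
FROM People
FOR SYSTEM_TIME AS OF '2018-06-01T00:00:00'
WHERE LastName = 'Smith';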

Oracle - Failover table or query manipulation

In a DWH environment, for performance reasons I need to materialize a view into a table with approx. 100 columns and 50,000,000 records. Daily, ~60,000 new records are inserted and ~80,000 updates are performed on existing records. By decision, I am not allowed to use materialized views because the architect claims this leads to performance issues. I can't argue the case any more; it's an irrevocable decision and I have to accept it.
So I would like to do a daily full load at night, e.g. truncate and insert. But if the job fails, the table may not be empty; it must still contain the data from the last successful population.
Therefore I thought about something like a failover table that will be used instead if anything goes wrong:
IF v_load_job_failed THEN failover_table
ELSE regular_table
Is there something like a failover table that will be used instead of another table depending on a predefined condition? Something like a trigger that rewrites or manipulates a select-query before execution?
I know that is somewhat of a dirty workaround.
If you have space for a (brief) period of double storage, I'd recommend:
1) Clone existing table (all indexes, grants, etc) but name with _TMP
2) Load _TMP
3) Rename base table to _BKP
4) Rename _TMP to match Base table
5) Rename _BKP to _TMP
6) Truncate _TMP
ETA: #1 would be "one time"; 2-6 would be part of daily script.
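A rough sketch of that daily script (Oracle syntax; the table and view names are placeholders, not from the original question):
-- step 2: load the staging copy
INSERT /*+ APPEND */ INTO base_table_tmp
SELECT * FROM v_report_source;
COMMIT;
-- steps 3-5: swap names so readers pick up the fresh copy
ALTER TABLE base_table RENAME TO base_table_bkp;
ALTER TABLE base_table_tmp RENAME TO base_table;
ALTER TABLE base_table_bkp RENAME TO base_table_tmp;
-- step 6: empty yesterday's copy, ready for tomorrow
TRUNCATE TABLE base_table_tmp;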
This all assumes that the performance of a full load is on par with the alternative of (1) detecting all new and updated records and (2) using MERGE (INSERT+UPDATE) to integrate those changed records into the base table.
(Personally, I lean toward the full load approach anyway; on the day somebody tweaks a referential value that's incorporated into the view def and changes the value for all records, you'll find yourself waiting on a week-long update of 50,000,000 records. Such concerns are completely eliminated with full-load approach)
All that said, it should be noted that if MV is defined correctly, the MV-refresh approach is identical to this approach in every way, except:
1) Simpler / less moving pieces
2) More transparent (SQL of view def is attached to MV, not buried in some PL/SQL package or .sql script somewhere)
3) Will not have "blip" of time, between table renames, where queries / processes may not see table and fail.
ETA: It's possible to pull this off with "partition magic" in a couple of ways that avoid a "blip" of time where data or table is missing.
You can, for instance, have an even-day and an odd-day partition. On odd days, insert data (no commit), then truncate the even-day partition (which simultaneously drops the old day and exposes the new). But is it worth the complexity? You need to add a column to partition by, and deal with the complexity of reruns: if your logic isn't tight, you'll wind up truncating the data you just loaded. This does, however, prevent a blip.
One method that does avoid any "blip" and is a little less "whoops" prone:
1) Add "DUMMY" column that always has value 1.
2) Create _TMP table (also with "DUMMY" column) and partition by DUMMY column (so all rows go to same partition)
-- Daily script --
3) Load _TMP table
4) Exchange partition of _TMP table with main base table WITHOUT VALIDATION INCLUDING INDEXES
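The exchange step itself is a single DDL statement, roughly (Oracle syntax; p_dummy and the table names follow the hypothetical setup above):
ALTER TABLE base_table_tmp
  EXCHANGE PARTITION p_dummy
  WITH TABLE base_table
  INCLUDING INDEXES
  WITHOUT VALIDATION;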
It bears repeating: all of these methods are equivalent in resource usage to an MV refresh; they're just more complex and tend to make developers feel "savvy" for solving problems that have already been solved.
Final note - addressing David Aldridge - first and foremost, daily refresh tables SHOULD NOT have logging enabled. In a recovery scenario, just make sure you have a step to run the refresh scripts once the base tables are restored.
Performance-wise, mileage is going to vary on this; but in my experience, the complexity of identifying and modifying changed/inserted rows can get very sticky (at some point, somebody will do something to base data that your script did not take into account; either yielding incorrect results or performance obstacles). DWH environments tend to be geared to accommodate processes like this with little problem. Unless/until the full refresh proves to have overhead above&beyond what the system can tolerate, it's generally the simplest "set-it-and-forget-it" approach.
On that note, if data can be logically separated into "live rows which might be updated" vs "historic rows that will never be updated", you can come up with a partitioning scheme and process that only truncates/reloads the "live" data on a daily basis.
A materialized view is just a set of metadata with an underlying table, and there's no reason why you cannot maintain a table in a manner similar to a materialized view's internal mechanisms.
I'd suggest using a MERGE statement as a single query rather than a truncate/insert. It will either succeed in its entirety or roll back to leave the previous data intact. 60,000 new records and 80,000 modified records is not much.
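A minimal sketch of such a statement (Oracle syntax; the target table, source view and column names are placeholders):
MERGE INTO report_table t
USING (SELECT * FROM v_report_source) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE SET
    t.col_a = s.col_a,
    t.col_b = s.col_b
WHEN NOT MATCHED THEN INSERT (id, col_a, col_b)
    VALUES (s.id, s.col_a, s.col_b);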
I think that you cannot go far wrong if you at least start with a simple, single SQL statement and then see how that works for you. If you do decide to go with a multistep process then ensure that it automatically recovers itself at any stage where it might go wrong part way through -- that might turn out to be the tricky bit.

What's the fastest way to copy data from one table to another in Django?

I have two models -
ChatCurrent - (which stores the messages for the current active chats)
ChatArchive - (which archives the messages for the chats that have ended)
The reason I'm doing this is so that the ChatCurrent table always has a minimal number of entries, making querying the table fast (I don't know if this works, please let me know if I've got this wrong).
So I basically want to copy (cut) data from the ChatCurrent to the ChatArchive model. What would be the fastest way to do this. From what I've read online, it seems that I might have to execute a raw SQL query, if you would be kind enough to even state the Query I'll be grateful.
Additional details -
Both the models have the same schema.
My opinion is that nowadays there is no reason to denormalize a database in this way to improve performance. Indexes, or partitioning plus indexes, should be enough.
Also, in case that, for semantic reasons, you prefer to have two tables (models), like Chat and ChatHistory (or ChatCurrent and ChatArchive) as you say, and to manage them with Django, I think that the right way to keep consistency is to create a ToArchive() method on ChatCurrent. This method will move chat entries to the historical chat model. You can perform this operation in the background, for example in a Celery task, so that online users don't have to wait for the request. Inside the Celery task, the fastest method to copy the data is raw SQL. Remember that you can encapsulate the SQL in a stored procedure.
Edited to include reply to your comment
You can call ChatCurrent.ToArchive() from the ChatCurrent.save() method:
from django.db import models

class ChatCurrent(models.Model):
    closed = models.BooleanField()

    def save(self, *args, **kwargs):
        super(ChatCurrent, self).save(*args, **kwargs)
        # Once the chat is marked as closed, move it to the archive table.
        if self.closed:
            self.ToArchive()

    def ToArchive(self):
        from django.db import connection, transaction
        cursor = connection.cursor()
        cursor.execute("insert into blah blah")  # raw SQL that copies this chat into the archive
        transaction.commit_unless_managed()
        #self.delete() #if needed (perhaps deleted on raw sql)
Try something like this:
INSERT INTO "ChatArchive" ("column1", "column2", ...)
SELECT "column1", "column2", ...
FROM "ChatCurrent" WHERE yourCondition;
and then just
DELETE FROM "ChatCurrent" WHERE yourCondition;
The thing you are trying to do is table partitioning.
Most databases support this feature without the need for manual book keeping.
Partitioning will also yield much better results than manually moving parts of the data to a different table. By using partitioning you avoid:
- Data inconsistency, which is easy to introduce because you will move records in bulk and then remove a lot of them from the source table; it's easy to make a mistake and copy only a portion of the data.
- Performance drop: moving the data around, and the associated overhead from transactions, will generally negate any benefit you got from reducing the size of the ChatCurrent table.
For a really quick rundown: table partitioning allows you to tell the database that parts of the data are stored and retrieved together. This significantly speeds up queries, as the database knows that it only has to look in a specific part of the data set. Example: chats from the current day, last hour, last month, etc. You can additionally store each partition on a different drive; that way you can keep your current chatter on a fast SSD and your history on regular, slower disks.
Please refer to your database manual to know the details about how it handles partitioning.
Example for PostgreSQL: http://www.postgresql.org/docs/current/static/ddl-partitioning.html
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
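A minimal declarative-partitioning sketch (PostgreSQL 10+ syntax; table and column names are invented for illustration):
CREATE TABLE chat_message (
    id      bigserial,
    chat_id bigint      NOT NULL,
    sent_at timestamptz NOT NULL,
    body    text
) PARTITION BY RANGE (sent_at);

CREATE TABLE chat_message_2015 PARTITION OF chat_message
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');
CREATE TABLE chat_message_2016 PARTITION OF chat_message
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

-- Old partitions can later be detached and archived or moved to slower storage:
-- ALTER TABLE chat_message DETACH PARTITION chat_message_2015;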
def copyRecord(self, recordId):
    # Copy a single EmailDetail row into CopyEmailDetail, field by field.
    # Note: iterating __dict__ also copies Django's internal _state and the
    # primary key, which you may want to skip.
    emailDetail = EmailDetail.objects.get(id=recordId)
    copyEmailDetail = CopyEmailDetail()
    for field in emailDetail.__dict__.keys():
        copyEmailDetail.__dict__[field] = emailDetail.__dict__[field]
    copyEmailDetail.save()
    logger.info("Record Copied %d" % copyEmailDetail.id)
As per the above solutions, don't copy over.
If you really want to have two separate tables to query, store your chats in a single table (and for preference, use all the database techniques mentioned here), and then have Current and Archive tables whose objects simply point to Chat objects.