Join or store directly - SQL

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation joins both tables and retrieves the latest date. However, an alternative, maybe less flexible, approach would be to simply store the latest date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ...), but I haven't had such a case yet and would like to hear whether someone here on Stack Overflow has had a similar situation and has a recommendation for either one for a specific reason.
Below is a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
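The join approach I currently use looks something like this (a rough sketch with placeholder names; the real columns differ):

    -- Rough sketch of the current join approach (placeholder names):
    SELECT a.id,
           MAX(b.process_date) AS last_process_date
    FROM   table_a a
    LEFT JOIN table_b b
           ON b.a_id = a.id
    GROUP BY a.id;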
Thanks

Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on table B covering the A foreign key and the date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B PK to the A PK.
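For example, something along these lines (a sketch with placeholder names; adjust to your actual schema):

    -- Sketch only; placeholder table/column names:
    CREATE INDEX ix_b_a_id_date ON table_b (a_id, process_date);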
I'm not a fan of UPDATEs on a warehouse table, which is why I didn't recommend a CURRENT_IND flag, but that's an alternative.

This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.
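For reference, the "latest row per parent" query such a system would run can be written with a window function, something like this (a sketch with placeholder names):

    -- Sketch with placeholder names; returns the newest B row for each A row:
    SELECT id, data, process_date
    FROM (
        SELECT a.id,
               b.data,
               b.process_date,
               ROW_NUMBER() OVER (PARTITION BY a.id
                                  ORDER BY b.process_date DESC) AS rn
        FROM table_a a
        JOIN table_b b ON b.a_id = a.id
    ) latest
    WHERE rn = 1;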
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you expect to run in production. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, denormalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on moderately complex systems, denormalized data schemas are at the heart of a lot of stupid bugs. It makes onboarding new developers harder, it adds complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect a lot of data.

We had a similar situation in our project tracking system, where the latest state of the project is stored in the projects table (cols: project_id, description, etc.) and the history of the project is stored in the project_history table (cols: project_id, update_id, description, etc.). Whenever there is a new update to a project, we need to find the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and taking MAX(update_id), but the cost would be high considering the number of project updates (a few hundred thousand) and the frequency of updates. So we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever there is a new update to a given project. HTH.
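Roughly, the write path looks like this (a simplified sketch, not our exact code; :project_id and :description are bind variables):

    -- Simplified sketch (not the exact production code); both statements
    -- run inside one transaction so the counter and the history stay in sync.
    UPDATE projects
    SET    max_update_id = max_update_id + 1
    WHERE  project_id = :project_id;

    INSERT INTO project_history (project_id, update_id, description)
    SELECT project_id, max_update_id, :description
    FROM   projects
    WHERE  project_id = :project_id;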

If I understand correctly, you have a table where each row is a parameter and another table that logs each parameter's value historically as a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table holds a listing of measures (29K records) and the historical parameter value table has the value for each parameter every hour - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history, so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record of the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense, because:
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join, and more importantly,
I do not have to get the latest value for each parameter from such a huge table.
So yes, in your case I would most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.
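The write path is essentially this (a simplified sketch; the real table and column names differ):

    -- Simplified sketch; real table/column names differ.
    INSERT INTO parameter_value_history (parameter_id, param_value, recorded_at)
    VALUES (:parameter_id, :param_value, :recorded_at);

    UPDATE parameters
    SET    latest_value    = :param_value,
           latest_value_at = :recorded_at
    WHERE  parameter_id = :parameter_id;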

Best practice to update bulk data in table used for reporting in SQL

I have created a table for reporting purposes where I am storing data in about 50 columns, and at some time interval my scheduler executes a service which processes other tables and fills up my flat table.
Currently I am deleting and re-inserting the data in that table, but I want to know if this is good practice, or whether I should check every column in every row, update it if any change is found, and insert a new record if the data does not exist.
FYI, the total number of rows being reinserted is 100k+.
This is a very broad question that can only really be answered with access to your environment and discussion of your specific requirements. Obviously this is not possible via Stack Overflow.
This means you will need to make this decision yourself.
The information you need to understand to be able to do this is the set of table update strategies available and how you can achieve them, normally referred to as Slowly Changing Dimensions. There are several different types, each with their own advantages, disadvantages and optimal use cases.
Once you understand the how of getting your data to incrementally update as required, you can then look at the why and whether the extra processing logic required to achieve this is actually worth it. Your dataset of a few hundred thousand rows is not large and may therefore not need this level of processing just yet, though that assessment will depend on how complex and time-consuming your current process is and how long you have to run it.
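As an illustration only, an incremental (Type 1 style) load is often written as a MERGE, roughly like this (placeholder table and column names, not a recommendation for your exact case):

    -- Illustration only; placeholder names. Note that <> comparisons do not
    -- match NULLs, so nullable columns need explicit handling.
    MERGE INTO report_table AS target
    USING staging_table AS source
        ON target.id = source.id
    WHEN MATCHED AND (target.col1 <> source.col1 OR target.col2 <> source.col2) THEN
        UPDATE SET col1 = source.col1,
                   col2 = source.col2
    WHEN NOT MATCHED THEN
        INSERT (id, col1, col2)
        VALUES (source.id, source.col1, source.col2);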
It is probably faster to repopulate the table of 100k rows. To do an update, you still need to:
generate all the rows to insert
compare values in every row
update the values that have changed
The expense of updating rows lies heavily in the logging and data-movement operations at the data page level. In addition, you still need to bring the data together.
If the update is updating a significant portion of rows, perhaps even just a few percent of them, then it is likely that all data pages will be modified. So the I/O is pretty similar.
When you simply replace the table, you will start by either dropping the table or truncating it. Those are relatively cheap operations because they are not logged at the row level. Then you are inserting into the table. Inserting 100,000 rows from one table to another should be pretty fast.
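In code, the full refresh is just this (a sketch with placeholder names; in practice you would list all 50 columns):

    -- Full refresh, sketched with placeholder names:
    TRUNCATE TABLE report_table;

    INSERT INTO report_table (id, col1, col2)   -- list all columns in practice
    SELECT id, col1, col2
    FROM   source_table;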
The above is general guidance. Of course, if you are only changing 3 rows in the table each day, then an update is going to be faster. Or, if you are adding a new layer of data each day, then a plain insert, plus a handful of updates to changed historical values, might be a fine approach.

Why would all tables not be temporal tables by default?

I'm creating a new database and plan to use temporal tables to log all changes. The data stored will be updated daily but will not exceed 5,000 records per table.
Is there any reason I shouldn't just make all tables temporal?
PS: I am aware of the space usage of temporal tables; as far as I understand, this is not a problem.
I am aware of the space usage of temporal tables; as far as I understand, this is not a problem.
On the contrary - it's a pretty big problem - and there are many other downsides too.
When you use Temporal Tables (at least in SQL Server), every UPDATE operation (even if the data is unchanged) results in a copy being made in the History table (granted, under the hood this may be a COW-optimized copy, but it's still another conceptual entity instance).
Secondly, from my personal experience working with LoB applications: most changes to databases are not important enough to justify creating an entire copy of a row. For example, imagine a table with 4 columns ( CREATE TABLE People ( FirstName nvarchar(50), LastName nvarchar(50), Address nvarchar(200), Biography nvarchar(max) ) ): whenever a typo in FirstName is fixed, all of the data in the other columns is copied over, even if Biography contains 4 GB worth of text data - even if this is COW-optimized, it's still creating copies for every user action that results in a change.
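For context, a system-versioned table in SQL Server is declared something like this (a sketch based on the hypothetical People table above):

    -- Sketch only, based on the hypothetical People table above (SQL Server syntax).
    CREATE TABLE dbo.People
    (
        PersonId   int IDENTITY(1,1) PRIMARY KEY,
        FirstName  nvarchar(50),
        LastName   nvarchar(50),
        Address    nvarchar(200),
        Biography  nvarchar(max),
        ValidFrom  datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo    datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PeopleHistory));
    -- Every UPDATE copies the previous row version (all columns, Biography included)
    -- into dbo.PeopleHistory.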
Is there any reason I shouldn't just make all tables temporal?
The main reason, in my experience, is that it makes changing your table schema much harder, because the schemas (aka "table design") of the Active and History tables must be identical: if you have a NULL column that you want to change to a NOT NULL column and you have NULL values in your History table, then you're stuck - at least until you write a data transformation step that supplies the History table with valid data. It's basically creating more work for yourself with little to gain.
Also, don't confuse Temporal Tables with immutable, append-only data stores (like the Bitcoin blockchain) - while they share similar design objectives (except true immutability), they exist to solve different problems - and if you consider the size requirements and scaling issues of the Ethereum blockchain (over a terabyte by now), that should give you another idea of why it's probably not a good idea.
Finally, even if Temporal Tables didn't have these issues - you still need to go through the effort to write your main software such that it can natively handle temporal data - and things like Entity Framework still don't have built-in support for querying Temporal Data.
...and even with all the historical records you've managed to save in the History table, what do you want it for? Do you really need to track every corrected typo and small, inconsequential change? How will your users react to needing to manually audit the changes to determine what's meaningful or not?
In short:
If your table design probably won't change much in the future...
AND small updates happen infrequently...
OR large updates happen regularly AND you need an audit record
...then go ahead and use Temporal Tables wherever you can.
If not, then you're just creating more future work for yourself with little to gain.
"log all changes" is not a good use case for the temporal features in SQL.
The use case for the SYSTEM TIME temporal feature is when there is a pressing requirement obligating you/the user to be able to "easily and quickly" (scare quotes intended) reconstruct the state that your database was in at a given moment in time. That is difficult, error-prone and expensive if all you have is a true log of past changes. But if you can suffice with keeping just a log of the changes, then do that (it will be difficult, error-prone and expensive to recreate past database states from the current state and your log, but that's not a pressing problem if there's no pressing need).
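For example, with the SYSTEM TIME feature a point-in-time reconstruction is a single query (SQL Server syntax; the table and column names here are assumed):

    -- Point-in-time query against a system-versioned table (names assumed):
    SELECT *
    FROM   dbo.People FOR SYSTEM_TIME AS OF '2019-06-01T00:00:00'
    WHERE  LastName = 'Smith';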
Also note that the SQL temporal features also encompass the notion of BUSINESS TIME, which is a different time dimension from SYSTEM TIME. Business time is targeted at keeping a history of the world situation; system time is targeted at keeping a history of your database itself, that is, a history of your records of the world situation.

Auto-deleting particular records in one table of Oracle database using SQL

I've got a question concerning auto-deleting particular records in one table of an Oracle database using SQL.
I am making small academic project of database for private clinic and I have to design Oracle database and client application in Java.
One of my ideas is to arrange a table "Visits" which stores all patient visits that took place in the past, for history purposes. The aforementioned table will grow pretty fast, so searching it will become slow.
So the idea is to make a smaller table called "currentVisits" which holds only appointments for future visits, because it will be much faster to search through ~1,000 records than a few million after a few years.
My question is how to implement auto-deleting records in SQL from the temporary table "currentVisits" after they have taken place.
Both tables will store fields like dateOfVisit, patientName, doctorID etc.
Is there any possibility to make this work in a simple way? For example, using triggers?
I am quite new to this topic, so thanks for every answer.
Don't worry about the data size. Millions of records is not particularly large for a database on modern computing hardware. You will need an appropriate data structure, however.
In this case, you will want an index on the column that indicates current records. In all likelihood, the current records will be appended onto the end of the table, so they will tend to congregate on a handful of data pages. This is a good thing.
If you have a heavy deletion load on the table, or you are using a clustered index, then the pages with the current records might be spread throughout the database. In that case, you want to include the "current" column in the clustered index.
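For example (Oracle syntax; assuming dateOfVisit is the column that distinguishes upcoming visits from past ones):

    -- Index to support "current/upcoming visits" lookups:
    CREATE INDEX idx_visits_date ON Visits (dateOfVisit);

    -- Upcoming visits only; with the index this stays fast even with millions of old rows:
    SELECT *
    FROM   Visits
    WHERE  dateOfVisit >= SYSDATE;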

audit table vs. Type 2 Slowly Changing Dimension

In SQL Server 2008+, we'd like to enable tracking of historical changes to a "Customers" table in an operational database.
It's a new table and our app controls all writing to the database, so we don't need evil hacks like triggers. Instead we will build the change tracking into our business object layer, but we need to figure out the right database schema to use.
The number of rows will be under 100,000 and the number of changes per record will average 1.5 per year.
There are at least two ways we've been looking at modelling this:
As a Type 2 Slowly Changing Dimension table called CustomersHistory, with columns for EffectiveStartDate, EffectiveEndDate (set to NULL for the current version of the customer), and auditing columns like ChangeReason and ChangedByUsername. Then we'd build a Customers view over that table which is filtered to EffectiveEndDate=NULL. Most parts of our app would query using that view, and only parts that need to be history-aware would query the underlying table. For performance, we could materialize the view and/or add a filtered index on EffectiveEndDate=NULL.
With a separate audit table. Every change to a Customer record writes once to the Customer table and again to a CustomerHistory audit table.
From a quick review of StackOverflow questions, #2 seems to be much more popular. But is this because most DB apps have to deal with legacy and rogue writers?
Given that we're starting from a blank slate, what are pros and cons of either approach? Which would you recommend?
In general, the issue with SCD Type II is that if the average number of changes to attribute values is very high, you end up with a very fat dimension table. This growing dimension table, joined with a huge fact table, slows down query performance gradually. It's like slow poisoning: initially you don't see the impact; when you realize it, it's too late!
Now I understand that you will create a separate materialized view with EffectiveEndDate = NULL and that it will be used in most of your joins. Additionally, in your case the data volume is comparatively low (100,000 rows). With an average of only 1.5 changes per year, I don't think data volume / query performance etc. are going to be your problem in the near future.
In other words, your table is truly a slowly changing dimension (as opposed to a rapidly changing dimension - where your option #2 is a better fit). In your case, I would prefer option #1.
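For what it's worth, the view and filtered index from option #1 might look something like this (SQL Server syntax; CustomerId and the other column names are assumed):

    -- Sketch of option #1; CustomerId and the other column names are assumed.
    CREATE VIEW dbo.Customers
    AS
    SELECT CustomerId, EffectiveStartDate, ChangeReason, ChangedByUsername  -- plus the business columns
    FROM   dbo.CustomersHistory
    WHERE  EffectiveEndDate IS NULL;
    GO

    -- Filtered index covering only the current rows:
    CREATE NONCLUSTERED INDEX ix_CustomersHistory_Current
        ON dbo.CustomersHistory (CustomerId)
        WHERE EffectiveEndDate IS NULL;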

Archiving the table: searching for the best way

There is a table which has 80,000 rows.
Every day I clone this table to another log table, giving it a name like 20101129_TABLE, and every day the prefix changes according to the date.
As you can calculate, that will be 2,400,000 rows every month.
Please advise on saving space, keeping access fast, and other advantages and disadvantages. How should I think about creating the best archive or log?
It is a table that has the account info: branch code, balance, etc.
It is quite tricky to answer your question since you are a bit vague on some important facts:
How often do you need the archived tables?
How free are you in your design choices?
If you don't need the archived data often and you are free in your design, I'd copy the data into an archive database. That will give you the option of storing the database on a separate disk (cost efficiency), and you can have a separate backup schedule for that database as well.
You could also store all the data in one table with just an additional column like ArchiveDate datetime. But I think this really depends on how you plan to access the data later.
Consider TABLE PARTITIONING (MSDN) - it is designed for exactly this kind of scenario. Not only can you spread data across partitions (and map partitions to different disks), you can also keep all data in the same table and let MSSQL do all the hard work in the background (which partition to use based on select criteria, etc.).
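A minimal sketch of what that might look like (SQL Server syntax; the names, columns, and boundary dates below are made up):

    -- Minimal sketch; names, columns, and boundary dates are made up.
    CREATE PARTITION FUNCTION pfAccountsByMonth (date)
        AS RANGE RIGHT FOR VALUES ('2010-11-01', '2010-12-01', '2011-01-01');

    CREATE PARTITION SCHEME psAccountsByMonth
        AS PARTITION pfAccountsByMonth ALL TO ([PRIMARY]);

    CREATE TABLE AccountsLog
    (
        LogDate    date           NOT NULL,
        BranchCode varchar(10)    NOT NULL,
        Balance    decimal(18, 2) NOT NULL
    ) ON psAccountsByMonth (LogDate);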