Archiving data in SQL

Need some advice on how best to approach this. Basically we have a few tables in our database along with archive versions of those tables for deleted data (e.g. Booking and Booking_archive). The table structure in both these tables is exactly the same, except for two extra columns in the archive table: DateDeleted and DeletedBy.
I have removed these archive tables, and just added the DateDeleted and DeletedBy columns to the actual table. My plan is to then partition this table so I can separate archived info from non-archived.
Is this the best approach? I just did not like the idea of having two tables just to distinguish between archived and non-archived data.
Any other suggestions/pointers for doing this?

The point of archiving is to improve performance, so I would say it's definitely better to separate the data into another table. In fact, I would go as far as creating an archive database on a separate server, and keeping the archived data there. That would yield the biggest performance gains. Runner-up architecture is a 2nd "archive" database on the same server with exactly duplicated tables.
Even with partitioning, you'll still have table-locking issues and hardware limitations slowing you down. Separate tables or dbs will eliminate the former, and separate server or one drive per partition could solve the latter.
As for storing the archived date, I don't think I would bother doing that on the production database. Might as well make that your timestamp on the archive-db tables, so when you insert the record it'll auto-stamp it with the datetime when it was archived.
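For example, a minimal sketch of that auto-stamping on the archive-db side, reusing the Booking_archive name from the question (column names and types are illustrative):
CREATE TABLE Booking_archive
(
    BookingId    INT          NOT NULL PRIMARY KEY,
    -- ...the remaining Booking columns...
    DeletedBy    VARCHAR(100) NOT NULL,
    DateArchived DATETIME     NOT NULL DEFAULT (GETDATE())  -- auto-stamped when the row is inserted into the archive
);
Inserts that omit DateArchived then pick up the archive timestamp automatically.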

The solution approach depends on:
The number of tables that have such archive tables
The arrival rate of data into the archive tables
Whether you want to invest in the software/hardware for a separate server
Based on the above, the options could be:
Same database, different schema on same server
Archive database on same server
Archive database on different server
Don't go for partitioning if the data is archived and has no chance of getting back into the main tables.
You might also add lifecycle-management columns to the archived data (retention period or expiry date) so that the archive data lifecycle can be managed effectively, as sketched below.
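For instance, a rough sketch of such lifecycle columns and a periodic purge, using the Booking_archive table from the question (the column name and the retention rule are just examples):
ALTER TABLE Booking_archive
    ADD RetentionExpiryDate DATE NULL;  -- e.g. set to DateDeleted + 7 years when the row is archived

-- Scheduled cleanup job:
DELETE FROM Booking_archive
WHERE RetentionExpiryDate < CAST(GETDATE() AS DATE);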

Related

How to keep 2 remote tables in sync and determine delta records

We have 10 tables on the vendor system and the same 10 tables on our DB side, along with 10 _HISTORIC tables (one for each table) to capture updated/new records.
We read the main tables from the vendor system using Informatica to truncate and load our tables. How do we find delta records without using triggers or CDC, since those come at a cost on the vendor system?
Four of the tables have around 200 columns and roughly 31K records each, with the expectation that 100-500 records might be updated daily.
We are using a left join in Informatica to load new records into our main and _HISTORIC tables.
But what is an efficient approach to find the updated records in the vendor tables and load them into our _HISTORIC tables?
For new records we use this query:
-- NEW RECORDS
INSERT INTO TABLEA_HISTORIC
SELECT A.*
FROM TABLEA A
LEFT JOIN TABLEB B
  ON A.PK = B.PK
WHERE B.PK IS NULL;
I believe a system-versioned temporal table is what you are looking for here. You can create a system-versioned table for any table in SQL Server 2016 or later.
For example, say I have a table Employee:
CREATE TABLE Employee
(
    EmployeeId VARCHAR(20) PRIMARY KEY,
    EmployeeName VARCHAR(255) NOT NULL,
    EmployeeDOJ DATE,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL, -- set automatically by the system when the row is inserted/updated
    ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,     -- auto-updated
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)                 -- defines the row validity period
)
WITH (SYSTEM_VERSIONING = ON);  -- required to keep history; a default history table is created automatically
The ValidFrom and ValidTo columns determine the time period during which that particular row version was active.
For more info, refer to the Microsoft article:
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
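Once the table is system-versioned, you can query past states with the FOR SYSTEM_TIME clause. A quick sketch against the Employee table above (the date and the EmployeeId value are arbitrary):
-- Rows as they looked at a given point in time:
SELECT EmployeeId, EmployeeName, ValidFrom, ValidTo
FROM Employee FOR SYSTEM_TIME AS OF '2021-06-01';

-- Full history of one employee, current and archived versions:
SELECT *
FROM Employee FOR SYSTEM_TIME ALL
WHERE EmployeeId = 'E001';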
Create staging tables and wipe & load them. Next, use them to find the differences that need to be loaded into your target tables.
The CDC logic needs to be performed this way, but it will not affect your source system.
Another way - not sure if possible in your case - is to load partial data based on some source system date or key. This way you stage only the new data. This improves performance a lot, but makes finding the deletes in source impossible.
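A minimal sketch of that staging-table diff for updated rows, assuming STG_TABLEA is the wipe-and-load copy of the vendor table (all object and column names are illustrative; in practice you would list every column you want compared):
-- UPDATED RECORDS: same key, different content
INSERT INTO TABLEA_HISTORIC
SELECT S.*
FROM STG_TABLEA S
JOIN TABLEA T
  ON S.PK = T.PK
WHERE EXISTS (SELECT S.COL1, S.COL2      -- EXCEPT compares the listed columns
              EXCEPT
              SELECT T.COL1, T.COL2);    -- and treats NULLs as equal, unlike <>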
A. To replicate a smaller subset of records from the source without making schema changes, there are a few options.
Transactional replication; however, this is not very flexible. For example, it would not allow any differences in the target database, and therefore is not a solution for you.
Identify a "date modified" field in the source. This obviously has to already exist, and will not allow you to identify deletes.
Use a "windowing approach" where you simply delete and reload the last month's transactions, again based on an existing date. This requires an existing date that isn't back-dated, and doesn't work for non-transactional tables (which are usually small enough to just do full copies anyway).
Turn on change tracking. Your vendor may or may not argue that this is a costly change (it isn't) or impacts application performance (it probably doesn't).
https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver15
Turning on change tracking will allow you to more easily identify changes to all tables.
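A minimal sketch of what that looks like, reusing the TABLEA/PK names from the question (the database name, retention period and @last_sync_version handling are illustrative):
ALTER DATABASE VendorDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.TABLEA ENABLE CHANGE_TRACKING;  -- requires a primary key on the table

-- On each sync, pull everything changed since the version saved after the previous run:
DECLARE @last_sync_version BIGINT = 0;          -- in practice, stored from CHANGE_TRACKING_CURRENT_VERSION()
SELECT A.*, CT.SYS_CHANGE_OPERATION             -- I = insert, U = update, D = delete
FROM CHANGETABLE(CHANGES dbo.TABLEA, @last_sync_version) AS CT
LEFT JOIN dbo.TABLEA AS A
  ON A.PK = CT.PK;                              -- deleted rows have no matching row in TABLEA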
You need to ask yourself: is it really an issue to copy the entire table? I have built solutions that simply copy entire large tables (far larger than 31K records) every hour and there is never an issue.
You need to consider what complications you introduce by building an incremental solution, and whether the associated maintenance and complexity is worth reducing a record copy from 31K rows (full table) to 500 rows (changed). A full copy of 31K records is actually pretty fast under normal circumstances (10 seconds or so).
B. Target table
As already recommended by many, you might want to consider a temporal table, although if you do decide to do full copies, a temporal table might not be the best option.

Keeping archived data schema up to date with running data warehouse

Recently our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full, and we need to come up with a way to archive old data which is not frequently accessed to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
However, we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
Is there a canonical way to maintain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.
Should I just accept the fact that if I need to actively maintain schemas, then the data is not really archived?
There are tons of materials on the internet if you search for the term "schema evolution" in the big data space.
The Athena documentation has a chapter with case-by-case examples of schema updates here.
If you are re-processing the whole archived dataset to handle a schema change, you are probably doing a bit too much.
Since you have Parquet files, and by default Athena resolves Parquet columns by name rather than by index, you are safe in almost all cases (adding new columns, dropping columns, etc.), except for column renames. To handle renamed columns (and also added/dropped columns), the fastest way is to use a view: in the view definition you can alias the renamed column. Also, if column renames make up most of your schema evolution and you do them a lot, you can consider Avro, which handles renames gracefully.
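A quick sketch of the view trick, assuming the live table renamed cust_id to customer_id while the archived Parquet files still carry the old name (table and column names are made up):
CREATE OR REPLACE VIEW bookings_archive_v AS
SELECT
    cust_id AS customer_id,   -- expose the archived column under its new name
    booking_date,
    amount
FROM bookings_archive;
Reports then query the view, so the rename never touches the Parquet files themselves.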
Plan A:
It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
Every period, "export tablespace" to remove the oldest partition from the partition scheme.
That tablespace will then be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data (see the sketch below).
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
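One way to script the rotation described above, using EXCHANGE PARTITION rather than transportable tablespaces (partition and table names are illustrative, assuming monthly RANGE partitions with a pFuture catch-all):
-- Detach the oldest month into its own table, then drop the now-empty partition:
CREATE TABLE booking_2020_10 LIKE booking;
ALTER TABLE booking_2020_10 REMOVE PARTITIONING;
ALTER TABLE booking EXCHANGE PARTITION p2020_10 WITH TABLE booking_2020_10;
ALTER TABLE booking DROP PARTITION p2020_10;

-- Add a partition to receive the next month's data:
ALTER TABLE booking REORGANIZE PARTITION pFuture INTO
    (PARTITION p2021_01 VALUES LESS THAN (TO_DAYS('2021-02-01')),
     PARTITION pFuture  VALUES LESS THAN MAXVALUE);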
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
Don't archive the old data.
Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.

Auto-deleting particular records in one table of Oracle database using SQL

I've got a question concerning auto-deleting particular records in one table of an Oracle database using SQL.
I am making a small academic project of a database for a private clinic, and I have to design the Oracle database and a client application in Java.
One of my ideas is to arrange a table "Visits" which stores all patient visits which took place in the past, for history purposes. The aforementioned table will grow pretty fast, so search performance will suffer.
So the idea is to make a smaller table called "currentVisits" which holds only appointments for future visits, because it will be much faster to search through ~1000 records than a few million after a few years.
My question is how to implement auto-deleting records in SQL from the temporary table "currentVisits" after the visits have taken place.
Both tables will store fields like dateOfVisit, patientName, doctorID etc.
Is there a simple way to make this work, for example using triggers?
I am quite new to this topic, so thanks for every answer.
Don't worry about the data size. Millions of records is not particularly large for a database on modern computing hardware. You will need an appropriate data structure, however.
In this case, you will want an index on the column that indicates current records. In all likelihood, the current records will be appended onto the end of the table, so they will tend to congregate on a handful of data pages. This is a good thing.
If you have a heavy deletion load on the table, or you are using a clustered index, then the pages with the current records might be spread throughout the database. In that case, you want to include the "current" column in the clustered index.
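A minimal sketch of the index suggested above, using the dateOfVisit column from the question (the index name is made up):
CREATE INDEX idx_visits_dateofvisit ON Visits (dateOfVisit);

-- "Current" visits then become a cheap index range scan on the single Visits table:
SELECT *
FROM Visits
WHERE dateOfVisit >= TRUNC(SYSDATE);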

Large Volume Database

We are creating a database where we store a large number of records. We estimate millions (billions after a few years) of records in one table, and we always INSERT and rarely UPDATE or DELETE any of the records. It's a kind of archive system where we insert historic records on a daily basis. We will generate different sorts of reports on these historic records on user request, so we have some concerns and require technical input from you people:
What is the best way to manage this kind of table and database?
What impact we may see in future for very large table?
Is there any limitation on number of records in one table or size of table?
How are we supposed to INSERT bulk records from different sources (mostly from Excel sheets)?
What is the best way to index large data tables?
Which ORM (object-relational mapping) should we use in this project?
Your last statement sums it up. There is no ORM that will deal nicely with this volume of data and reporting queries: employ SQL experts to do it for you. You heard it here first.
Otherwise
On disk: filegroups, partitioning etc
Compress less-used data
Is all data required? (Data retention policies)
No limit of row numbers or table size
INSERT via staging tables or staging databases, clean/scrub/lookup keys, then flush to the main table: DO NOT load the main table directly (sketched below)
As much RAM as you can buy. Then add more.
Few, efficient indexes
Do you have parent tables or a flat data mart? Have FKs, but don't use them (e.g. for update/delete in the parent table), so no supporting indexes are needed
Use a SAN (easier to add disk space, more volumes etc)
Normalise
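A minimal sketch of the staging-table load mentioned above (all object names, the file path and the scrub rules are illustrative; the Excel sheet is assumed to be exported to CSV first):
-- 1. Bulk-load the exported file into an indexless staging table:
BULK INSERT Staging_DailyRecords
FROM 'C:\loads\daily_records.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- 2. Clean/scrub, look up surrogate keys, then flush to the main table:
INSERT INTO dbo.HistoricRecords (SourceKey, RecordDate, Amount)
SELECT src.SourceKey, s.RecordDate, s.Amount
FROM Staging_DailyRecords AS s
JOIN dbo.Source AS src ON src.SourceCode = s.SourceCode  -- key lookup; unmatched rows are rejected
WHERE s.Amount IS NOT NULL;                              -- example scrub rule

TRUNCATE TABLE Staging_DailyRecords;                     -- ready for the next load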
Some of these are based on our experiences of around 10 billion rows through one of our systems in 30 months, with peaks of 40k rows+ per second.
See this too for high volume systems: 10 lessons from 35K tps
Summary: do it properly or not at all...
What is the best way to manage this kind of table and database?
If you are planning to store billions of records then you'll need plenty of disk space. I'd recommend a 64-bit OS running SQL Server 2008 R2 and as much RAM and HD space as is available. Depending on what performance you need, I'd be tempted to look into SSDs.
What impact we may see in future for very large table?
If you have the right hardware, with a properly indexed and properly normalized table, the only thing you should notice is that the reports will begin to run more slowly. Inserts may slow down slightly as the index file becomes bigger, and you'll just have to keep an eye on it.
Is there any limitation on number of records in one table or size of table?
On the right setup I described above, no. It's only limited by disk space.
How we suppose to INSERT bulk record from different sources (mostly from Excel sheet)?
I've run into problems running huge SQL queries but I've never tried to import from very large flat files.
What is the best way to index large data tables?
Index as few fields as necessary and keep them to numerical fields only.
Which is the best ORM (object relational Mapping) should we use in this project?
Sorry can't advise here.
Billions of rows in a "few years" is not an especially large volume. SQL Server should cope perfectly well with it, assuming your design and implementation are appropriate. There is no particular limit on the size of a table. Stick to solid design principles: normalize your tables, choose keys and data types carefully, and have a suitable partitioning and indexing strategy.

Archiving a table: searching for the best way

There is a table which has 80,000 rows.
Every day I will clone this table to another log table with a name like 20101129_TABLE,
and every day the prefix will be changed according to the date.
As you can calculate, that will be 2,400,000 rows every month.
Please advise on saving space, getting fast performance, and the other advantages and disadvantages. How should I think about creating the best archive or log?
It is a table that holds account info: branch code, balance, etc.
It is quite tricky to answer your question since you are a bit vague on some important facts:
How often do you need the archived tables?
How free are you in your design choices?
If you don't need the archived data often and you are free in your design, I'd copy the data into an archive database. That will give you the option of storing the database on a separate disk (cost-efficiency), and you can have a separate backup schedule for that database as well.
You could also store all the data in one table with just an additional column like ArchiveDate datetime. But I think this really depends on how you plan on accessing the data later.
Consider TABLE PARTITIONING (MSDN) - it is designed for exactly this kind of scenario. Not only can you spread data across partitions (and map partitions to different disks), you can also keep all data in the same table and let MSSQL do all the hard work in the background (which partition to use based on the select criteria, etc.).
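For completeness, a minimal sketch of what that looks like in SQL Server, partitioning by month on the date column (all names and boundary values are illustrative):
CREATE PARTITION FUNCTION pfByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2010-11-01', '2010-12-01', '2011-01-01');

CREATE PARTITION SCHEME psByMonth
AS PARTITION pfByMonth ALL TO ([PRIMARY]);   -- in practice, map partitions to separate filegroups/disks

CREATE TABLE AccountsLog
(
    LogDate    datetime      NOT NULL,
    BranchCode varchar(10)   NOT NULL,
    Balance    decimal(18,2) NOT NULL
) ON psByMonth (LogDate);
Queries that filter on LogDate are then automatically directed to the relevant partition(s).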