Archiving (old) application data in (Oracle) Database - sql

I have a fairly large business application related to Order Management, whose data is in an Oracle database.
All data can be related back to Orders.
The data size is now huge - millions of records per table - which is slowing down my application's SQL queries.
I am planning a replica schema of my main 'Order Schema', say an 'Archive Order Schema',
and SQL queries to move (old) order data from the main schema to the archive schema, one old Order at a time.
But the SQL queries are quite slow, and moving all the data for one old Order (across so many tables) takes a very long time.
Any design / approach / optimization inputs are welcome.
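For illustration, the per-order move I'm describing is roughly of this shape (a sketch only, not my actual code; the schema names orders / archive_orders, the tables order_headers / order_lines, and the bind variable are placeholders, and the real model spans many more child tables):

    -- Minimal sketch: copy one old order's rows into the archive schema,
    -- then delete them from the live schema, all in one transaction.
    INSERT INTO archive_orders.order_lines
    SELECT * FROM orders.order_lines WHERE order_id = :old_order_id;

    INSERT INTO archive_orders.order_headers
    SELECT * FROM orders.order_headers WHERE order_id = :old_order_id;

    DELETE FROM orders.order_lines   WHERE order_id = :old_order_id;
    DELETE FROM orders.order_headers WHERE order_id = :old_order_id;

    COMMIT;

Doing this one order at a time repeats the full statement overhead for every order; batching many order IDs per statement is generally much faster.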

First, as others have noted, a few million rows in Order Management tables is nothing. Even a few hundred million rows, or billions of rows, is not a challenge; we manage an EBS instance with larger Order Management tables without much effort. Make sure you are gathering schema statistics using the EBS Concurrent Request (not DBA tools), and check with your DBA about rebuilding your indexes. Also make sure you are patched up, since Oracle EBS patches often include new indexes to address logged performance problems. Run some AWR reports, or even a SQL trace, to find your bottlenecks, and work with Oracle Support.
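If you go the SQL trace route, the standard way to trace another session is DBMS_MONITOR; a minimal sketch (the SID and serial# values below are placeholders you would look up in V$SESSION) looks like:

    -- Turn on an extended SQL trace (including wait events) for one
    -- application session, reproduce the slow archival queries, then
    -- turn tracing off and analyze the trace file with tkprof.
    EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123, serial_num => 45678, waits => TRUE, binds => FALSE);

    -- ... run the slow workload from the application ...

    EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 123, serial_num => 45678);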
Next, DO NOT use SQL to archive seeded tables. You will have problems, and Oracle will not help you when you do. Instead, do some research and use Oracle's built-in, supported archive and purge processes. Start by reading Note 752322.1 on My Oracle Support; it will point you to how to manage your data in Order Management.

Related

SP running very slow in Azure DW even with tables with HASH distribution on unique column

I have a stored procedure in Azure DW which runs very slowly. I copied all the tables and the SP to a different server, and there it takes far less time to execute. I created the tables using HASH distribution on the unique field, but the SP still runs very slowly. Please advise how I can improve the performance of the SP in Azure DW.
From your latest comment, the data sample is way too small for any reasonable tests on SQL DW. Remember that SQL DW is MPP, while your local on-premises SQL Server is SMP. Even at DWU100, the underlying layout of this MPP architecture is very different from your local SQL Server. For instance, every SQL DW has 60 user databases powering the DW, and data is spread across them. The default storage is clustered columnstore, which is optimized for common DW-type workloads.
When a query is sent to DW, it has to build a distributed query plan that is pushed to the underlying DBs, each of which builds and executes a local plan, and the results are then assembled back up the stack. This sounds like a lot of overhead, and it is for small data sets and simple queries. However, when you are dealing with hundreds of TBs of data with billions of rows and you need to run complex aggregations, this additional overhead is relatively tiny; the benefit you get from the MPP processing power makes it inconsequential.
There's no hard number for the size at which you'll see real gains, but at least half a TB is a good starting point, and row counts really should be in the tens of millions. Of course, there are always edge cases where your data set isn't huge but the workload naturally lends itself to MPP, so you might still see gains, but that's not common. If your data size is in the tens or low hundreds of GB and won't grow significantly, you're likely to be much happier with Azure SQL Database.
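For reference, the distribution and storage choices are made when the table is created; a minimal sketch of a hash-distributed, clustered columnstore table (the table and column names here are invented) is:

    -- Hash-distributed table with clustered columnstore storage in SQL DW.
    -- dbo.FactSales and its columns are placeholder names.
    CREATE TABLE dbo.FactSales
    (
        SaleId      BIGINT        NOT NULL,
        CustomerKey INT           NOT NULL,
        SaleDate    DATE          NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    );

Picking a distribution column that is both high-cardinality and used in your joins and aggregations is what keeps data movement across the 60 distributions low.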
As for resource class management, workload monitoring, etc... check out the following links:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-concurrency
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-manage-monitor
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices

SQL Database Best Practices - Use of Archive tables?

I'm not a trained DBA, but perform some SQL tasks and have this question:
In SQL databases I've noticed the use of archive tables that mimic another table, with the exact same fields, and which are used to accept rows from the original table when that data is deemed ready for archiving. Since I've seen examples where those tables reside in the same database and on the same drive, my assumption is that this was done to increase performance. Such tables didn't have more than about 10 million rows in them...
Why would this be done instead of using a column to designate the status of the row, such as a boolean in/active flag?
At what point would this improve performance?
What would be the best pattern to structure this correctly, given that the data may still need to be queried (or unioned with current data)?
What else is there to say about this?
The notion of archiving is a physical, not logical, one. Logically the archive table contains the exact same entity and ought to be the same table.
Physical concerns tend to be pragmatic. The overarching notion is that "the database is getting too big/slow". Archiving records makes it easier to do things like:
Optimize the index structure differently. Archive tables can have more indexes without affecting insert/update performance on the working table. In addition, the indexes can be rebuilt with full pages, while the working table will generally want to have pages that are 50% full and balanced.
Optimize storage media differently. You can put the archive table on slower/less expensive disk drives that maybe have more capacity.
Optimize backup strategies differently. Working tables may require hot backups or log shipping while archive tables can use snapshots.
Optimize replication differently, if you are using it. If an archive table is only updated once per day via nightly batch, you can use snapshot as opposed to transactional replication.
Different levels of access. Perhaps you want different security access levels for the archive table.
Lock contention. If your working table is very hot, you'd rather have your MIS developers hit the archive table, where they are less likely to halt your operations when they run something and forget to specify dirty-read semantics.
The best practice would be not to use archive tables at all, but to move the data from the OLTP database into an MIS database, data warehouse, or data marts with denormalized data. But some organizations will have trouble justifying the cost of an additional DB system (they aren't cheap), and there are far fewer hurdles to adding another table to an existing DB.
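If you do end up with an in-database archive table, the move itself can be done atomically in SQL Server with DELETE ... OUTPUT; a sketch (dbo.Orders, dbo.OrdersArchive and the columns are placeholder names, and the archive table is assumed to have no triggers or foreign keys, which OUTPUT INTO requires):

    -- Move closed orders older than two years into an identically
    -- structured archive table in a single atomic statement.
    DELETE FROM dbo.Orders
    OUTPUT DELETED.*
    INTO dbo.OrdersArchive
    WHERE Status = 'CLOSED'
      AND OrderDate < DATEADD(YEAR, -2, GETDATE());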
I say this frequently, but...
Multiple tables of identical structure almost never make sense.
A status flag is a much better idea. There are proper ways to increase performance (partitioning/indexing) without denormalizing data or otherwise creating redundancies. 10 million records is pretty small in the world of modern RDBMSs, so what you're seeing is the product of poor planning or of a misunderstanding of databases.
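As an illustration of the flag approach (a sketch; the column and index names are invented, and it assumes SQL Server 2008 or later, which supports filtered indexes), an IsActive bit plus a filtered index keeps the hot working set cheap to reach without a second table:

    -- Status flag instead of an archive table: archived rows stay in
    -- place and remain queryable, while the filtered index covers only
    -- the active rows the application touches constantly.
    ALTER TABLE dbo.Orders ADD IsActive BIT NOT NULL DEFAULT (1);

    CREATE NONCLUSTERED INDEX IX_Orders_Active
        ON dbo.Orders (OrderDate, CustomerId)
        WHERE IsActive = 1;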

Best optimizing for a large SQL Server table (100-200 Mil records)

What are the best options/recommendations and optimizations that can be performed when working with a large SQL Server 2005 table that contains anywhere from 100-200 Million records?
Since you didn't state the purpose of the database, or the requirements, here are some general things, in no particular order:
Small clustered index on each table. Consider making this your primary key on each table. This will be very efficient and save on space in the main table and dependent tables.
Appropriate non-clustered indexes (covering indexes where possible)
Referential Integrity
Normalized Tables
Consistent naming on all database objects for easier maintenance
Appropriate Partitioning (table and index) if you have the Enterprise Edition of SQL Server
Appropriate check constraints on tables if you are going to allow direct data manipulation in the database.
Decide where your business rules are going to reside and don't deviate from that. In most cases they do not belong in the database.
Run Query Analyzer on your heavily used queries (at least) and look for table scans. Table scans will kill performance.
Be prepared to deal with deadlocks. With a database of this size, especially if there will be heavy writing, deadlocks could very well be a problem.
Take ample advantage of views to hide join complexity, open up opportunities for query optimization, and allow a flexible security implementation.
Consider using schemas to better organize data and allow a flexible security implementation.
Get familiar with Profiler. With a database of this size, you will more than likely be spending some time trying to determine query bottlenecks. Profiler can help you here.
A rule of thumb is if a table will contain more than 25 million records, you should consider table (and index) partitioning, but this feature is only available in the Enterprise edition of SQL Server (and correspondingly, developer edition).
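If you do have Enterprise Edition and decide to partition, the basic building blocks are a partition function, a partition scheme, and a table created on that scheme. A minimal sketch follows (all names, columns and boundary dates are made up, and everything is mapped to PRIMARY just to keep the example short):

    -- Range partitioning by order date, one partition per year.
    CREATE PARTITION FUNCTION pfOrderDate (DATETIME)
        AS RANGE RIGHT FOR VALUES ('2004-01-01', '2005-01-01', '2006-01-01');

    CREATE PARTITION SCHEME psOrderDate
        AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

    -- For an aligned unique index, the partitioning column must be part
    -- of the key, hence OrderDate in the primary key here.
    CREATE TABLE dbo.BigOrders
    (
        OrderId   BIGINT   NOT NULL,
        OrderDate DATETIME NOT NULL,
        Amount    MONEY    NOT NULL,
        CONSTRAINT PK_BigOrders PRIMARY KEY CLUSTERED (OrderDate, OrderId)
    ) ON psOrderDate (OrderDate);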

Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).
The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.
Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.
My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.
Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?
I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.
The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.
I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.
Relational database systems can definitely handle this kind of load, and if one server or disk starts to strain then most high-end databases have partitioning solutions.
You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.
The very first thing you should always be doing when you see a slow query is pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for your query, or a query that's written so as to be unable to take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.
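As an illustration (a sketch; the table, columns and date range below are invented), a covering index for an aggregate query is the kind of thing an execution plan review usually leads to:

    -- Hypothetical reporting query over a large transactions table.
    SELECT BusinessId, SUM(Amount) AS TotalAmount, AVG(Amount) AS AvgAmount
    FROM dbo.Transactions
    WHERE TxnDate >= '2008-01-01' AND TxnDate < '2009-01-01'
    GROUP BY BusinessId;

    -- Covering index: the plan can seek on the date range, and every
    -- referenced column is in the index, so no lookups or full scans.
    CREATE NONCLUSTERED INDEX IX_Transactions_TxnDate
        ON dbo.Transactions (TxnDate)
        INCLUDE (BusinessId, Amount);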
90 million rows should be about 90GB, thus your bottleneck is disk.
If you need these queries rarely, run them as is.
If you need these queries often, you have to split your data and precompute the grouping, summing and averaging on the part of your data that doesn't change (or hasn't changed since last time).
For example, if you process historical data for the last N years up to and including today, you could process it one month (or week, or day) at a time and store the totals and averages somewhere. Then at query time you only need to reprocess the period that includes today.
Some RDBMSs give you some control over when views are updated (at select, at source change, offline); if your complicated grouping, summing and averaging is in fact simple enough for the database to understand correctly, it could, in theory, update a few rows in the view on every insert/update/delete in your source tables in reasonable time.
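A minimal version of that precompute-and-combine approach (a sketch in T-SQL; dbo.MonthlyTotals, dbo.Transactions and their columns are invented, and the batch is assumed to run once per month off-peak):

    -- Off-peak batch: aggregate the most recently closed month and
    -- append the result to a summary table.
    DECLARE @MonthStart DATETIME, @MonthEnd DATETIME;
    SET @MonthStart = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0);
    SET @MonthEnd   = DATEADD(MONTH, 1, @MonthStart);

    INSERT INTO dbo.MonthlyTotals (MonthStart, BusinessId, TotalAmount, TxnCount)
    SELECT @MonthStart, BusinessId, SUM(Amount), COUNT(*)
    FROM dbo.Transactions
    WHERE TxnDate >= @MonthStart AND TxnDate < @MonthEnd
    GROUP BY BusinessId;

    -- Query time: precomputed closed months plus detail rows for the
    -- current, still-open month only.
    SELECT BusinessId, SUM(TotalAmount) AS TotalAmount
    FROM (
        SELECT BusinessId, TotalAmount FROM dbo.MonthlyTotals
        UNION ALL
        SELECT BusinessId, SUM(Amount)
        FROM dbo.Transactions
        WHERE TxnDate >= DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
        GROUP BY BusinessId
    ) AS combined
    GROUP BY BusinessId;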
It looks like you're calculating the same data over and over again from normalized data. One way to speed up processing in cases like this is to keep SQL, with its nice reporting, relationships and consistency, as the master, and use an OLAP cube which is recalculated every X minutes. Basically you build a big table of denormalized data on a regular basis, which allows quick lookups; the relational data is treated as the master, but the cube allows quick precalculated values to be retrieved at any point.
If that is only 1/20th of your data, you almost surely need to look into more scalable and efficient solutions, such as Google's BigTable. Have a look at NoSQL.
I personally think that MongoDB is an awesome in-between of NoSQL and RDBMS. It isn't relational, but it provides a lot more features than a simple document store.
In dimensional (Kimball methodology) models in our data warehouse on SQL Server 2005, we regularly have fact tables with that many rows just in a single month partition.
Some things are instant and some things take a while, it depends on the operation and how many stars are being combined and what's going on.
The same models perform poorly on Teradata, but it is my understanding that if we re-model in 3NF, Teradata parallelization will work a lot better. The Teradata installation is many times more expensive than the SQL Server installation, so it just goes to show how much of a difference modeling and matching your data and processes to the underlying feature set matters.
Without knowing more about your data, how it's currently modeled, and what indexing choices you've made, it's hard to say anything more.

What is the best way to partition large tables in SQL Server?

In a recent project, the "lead" developer designed a database schema where "larger" tables would be split across two separate databases, with a view on the main database that unions the two separate tables back together. The main database is what the application is driven off of, so these tables looked and felt like ordinary tables (except for some quirky things around updating). This seemed like a HUGE performance problem. We do see performance problems around these tables, but nothing that makes him change his mind about his design. I'm just wondering what the best way to do this is, or whether it is even worth doing?
I don't think you are really going to gain anything by partitioning the table across multiple databases on a single server. All you have essentially done is increase the overhead of working with the "table" in the first place, by having several instances of it (i.e. open in two different DBs) under a single SQL Server instance.
How large a dataset do you have? I have a client with a 6 million row table in SQL Server that contains two years' worth of sales data. They use it transactionally and for reporting without any noticeable speed problems.
Tuning the indexes and choosing the correct clustered index is crucial to performance of course.
If your dataset is really large and you are looking to partition, you will get more bang for your buck partitioning the table across physical servers.
Partitioning is not something to be undertaken lightly as there can be many subtle performance implications.
My first question is are you referring simply to placing larger table objects in separate filegroups (on separate spindles) or are you referring to data partitioning inside of a table object?
I suspect that the situation described is an attempt to have the physical storage of certain large tables on different spindles from the rest of the tables. In this case, adding the extra overhead of separate databases, losing any ability to enforce referential integrity across databases, and the security implications of enabling cross-database ownership chaining does not provide any benefit over using multiple filegroups within a single database. If, as is quite possible, the separate databases you refer to in your question are not even stored on separate spindles but are all stored on the same spindle then you negate even the slight performance benefit you could have gained by physically separating your disk activity and have received absolutely no benefit.
I would suggest that instead of using additional databases to hold large tables, you look into the Filegroup topic in SQL Server Books Online, or for a quick review see this article:
If you are interested in data partitioning (including partitioning into multiple file groups) then I recommend reading articles by Kimberly Tripp, who gave an excellent presentation at the time SQL Server 2005 came out about the improvements available there. A good place to start is this whitepaper
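For reference, placing a large table on its own filegroup needs no cross-database views at all; a sketch (the database name, file path, size and table are all placeholders):

    -- Add a dedicated filegroup and data file, then put the large table
    -- (via its clustered index) on that filegroup.
    ALTER DATABASE Sales ADD FILEGROUP FG_LargeTables;

    ALTER DATABASE Sales ADD FILE
    (
        NAME = 'Sales_Large01',
        FILENAME = 'E:\SQLData\Sales_Large01.ndf',
        SIZE = 10GB
    ) TO FILEGROUP FG_LargeTables;

    CREATE TABLE dbo.OrderHistory
    (
        OrderId   BIGINT   NOT NULL,
        OrderDate DATETIME NOT NULL,
        Amount    MONEY    NOT NULL
    ) ON FG_LargeTables;

    CREATE CLUSTERED INDEX CIX_OrderHistory
        ON dbo.OrderHistory (OrderId) ON FG_LargeTables;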
Which version of SQL Server are you using? SQL Server 2005 has partitioned tables, but in 2000 (or 7.0) you needed to use partitioned views.
Also, what was the reasoning for putting the table partitions in a separate database?
When I've had to partition tables in the past (pre-2005), it's usually by a date column or something similar, with a view over the various partitions. Books Online has a section that talks about how to do this and all of the rules around it. You need to follow the rules to make it work how it's supposed to work.
The key thing to remember is that your partitioning column must be part of the primary key and you want to try to always use that column in any access against the table so that the optimizer can ignore partitions that shouldn't be affected by the query.
Look up "partitioned table" in MSDN and you should be able to find a more complete tutorial for SQL Server 2005 partitioned tables as well as advice on how to set them up for maximum performance.
Are you asking about best practices in terms of database design, or convincing your lead to change his mind? :)
In terms of design... Back in the goode olde days, vertical partitioning was sometimes needed to work around database engine limitations, where the number of columns in a table was a hard limit, like 255 columns. These days the main benefits are purely for performance: putting rarely used columns, or blobs on a separate disk array. But if you're regularly pulling things from both tables it will likely be a loss. It sounds like your lead is suffering from a case of premature optimisation.
In terms of telling your lead he's wrong... that requires diplomacy. If he's aware of mutterings of discontent about performance, a benchmark is probably the best way to show the difference.
Create a new physical table somewhere with 'create table t1 as select * from view1' and then run some lengthy batch with the vertically partitioned table and your new table. If it's as bad as you say, the difference should be evident.
But this too may be premature optimisation. Find out what the end-users think of the performance. If the performance is good enough, for some definition of good, then don't fix what ain't broke.
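A rough version of that benchmark in T-SQL (a sketch; note that on SQL Server the equivalent of 'create table ... as select' is SELECT ... INTO, and view1, t1 and SomeDate are just placeholder names carried over from the suggestion above):

    -- Materialize the partitioned view into one ordinary table.
    SELECT * INTO dbo.t1 FROM dbo.view1;

    -- Compare I/O and elapsed time for the same query against both.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT COUNT(*) FROM dbo.view1 WHERE SomeDate >= '2008-01-01';
    SELECT COUNT(*) FROM dbo.t1    WHERE SomeDate >= '2008-01-01';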
There is a definite benefit to table partitioning (regardless of whether it's on the same or different filegroups/disks). If the partition column is correctly selected, you'll find that your queries hit only the required partition. So imagine you have 100 million records (I've partitioned tables much bigger than that - about 20+ billion rows): if, for the most part, more than 70% of your data access is for a certain category, timeline or type of data, it helps to keep the most accessed data in a separate partition. Plus, you can align the partitions with separate filegroups on various types of disks (SATA, Fibre Channel, SSDs) so that the most accessed/busy data sits on the fastest storage and the rarely accessed data sits, in effect, on slower disks.
SQL Server's partitioning ability is more limited than Oracle's, though: you can choose only one column for partitioning (even in SQL 2008). So you have to choose that column wisely, ideally one that also appears in most of your frequent queries. For the most part, people find it easiest to partition by a date column. However, although that seems logical, if your queries do not include that column in their conditions you won't gain much benefit from partitioning (in other words, your queries will hit all the partitions regardless).
It's much easier to partition data warehouse/data mining type databases than OLTP, as most DW queries are limited by time period.
That's why, given the volume of data handled by databases these days, it's wise to design the application so that every query is limited by some broader group such as time or geographical location; when such columns are chosen for partitioning, you'll gain the maximum benefit.
I would disagree with the assumption that nothing can be gained by partitioning.
If the partitioned data is physically and logically aligned, then the potential I/O of queries should be dramatically reduced.
For example, we have a table whose batch field is stored as an INT.
If we partition the data by this field and then re-run a query for a particular batch, we should be able to run SET STATISTICS IO ON before and after partitioning and see a reduction in I/O.
If we have a million rows per partition and each partition is written to a separate device, the query should be able to eliminate the nonessential partitions.
I've not done a lot of partitioning on SQL Server, but I do have experience of partitioning on Sybase ASE, where this is known as partition elimination. When I have time I'm going to test out the scenario on a SQL Server 2005 machine.
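A minimal way to check for elimination on SQL Server (a sketch; the table, column and partition-function names are placeholders) is to compare logical reads with SET STATISTICS IO ON, and to ask directly which partition a batch value maps to:

    -- Compare I/O for a single-batch query before and after partitioning.
    SET STATISTICS IO ON;
    SELECT COUNT(*) FROM dbo.BatchedData WHERE BatchId = 42;

    -- Which partition does a given batch value fall into?
    SELECT $PARTITION.pfBatch(42) AS PartitionNumber;

    -- Row counts per partition, to confirm the spread of data.
    SELECT partition_number, rows
    FROM sys.partitions
    WHERE object_id = OBJECT_ID('dbo.BatchedData')
      AND index_id IN (0, 1);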