We (our team) are in the process of building an audit reporting solution for a large online transactional website.
Our auditing approach is to enable CDC on the source tables, track every change that happens to those objects, capture the changes, and push them into destination tables for reporting.
As of now we have a one-to-one mapping between source and destination tables.
There will be only inserts in the destination tables, no updates or deletes.
So, over time, the audit tables will grow larger than the actual source tables, since they keep the full history of changes.
My plan is to flatten the destination tables into fewer tables based on subject/module, enable columnstore indexes on them, and then use them for reporting.
Are there any suggestions on the above approach, or is there a better alternative?
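For reference, a minimal sketch of the capture side in SQL Server looks like this; the database, schema, and table names below are placeholders, not the actual objects involved:

```sql
-- Hypothetical example: enable CDC on one source table (all names are placeholders).
USE SourceDb;
GO
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Orders',
    @role_name            = NULL,  -- no gating role for reading the change data
    @supports_net_changes = 0;     -- keep the full change history, as an audit trail needs
GO
```

The change rows then land in the cdc.dbo_Orders_CT change table, which is what the ETL reads from and pushes into the destination.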
I would recommend that you instead keep the structure as a single table and have a look at Partitioned Tables and Indexes:
SQL Server supports table and index partitioning. The data of partitioned tables and indexes is divided into units that can be spread across more than one filegroup in a database. The data is partitioned horizontally, so that groups of rows are mapped into individual partitions. All partitions of a single index or table must reside in the same database. The table or index is treated as a single logical entity when queries or updates are performed on the data.
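As a rough illustration (all names, boundary dates, and the single filegroup below are invented for the example), a monthly partition scheme combined with a clustered columnstore index on the audit table could look like this:

```sql
-- Hypothetical sketch: monthly RANGE RIGHT partitioning plus a clustered columnstore index.
CREATE PARTITION FUNCTION pf_AuditMonth (datetime2)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME ps_AuditMonth
    AS PARTITION pf_AuditMonth ALL TO ([PRIMARY]);   -- one filegroup keeps the example simple

CREATE TABLE dbo.OrderAudit
(
    AuditId    bigint IDENTITY NOT NULL,
    OrderId    int             NOT NULL,
    ChangeType char(1)         NOT NULL,   -- I / U / D as captured by CDC
    ChangedAt  datetime2       NOT NULL,
    INDEX ccix_OrderAudit CLUSTERED COLUMNSTORE
) ON ps_AuditMonth (ChangedAt);
```

New boundaries can be added with SPLIT RANGE as months roll forward, and reporting queries that filter on ChangedAt benefit from both partition elimination and columnstore segment elimination.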
Related
I am in the process of building a new data warehouse. On this occasion the warehouse needs to support incremental nightly updates. Pretty standard stuff, really.
Previously, when building a data warehouse, I have either used Created/Updated date columns to drive inserts and updates, or used the primary key of the source table, stored it in the warehouse as the natural key, and used HashBytes to compare row data.
However, on this particular warehouse the data will be fed from multiple SQL tables, which therefore have multiple primary keys and/or multiple Created/Updated dates.
What is a typical design pattern for dealing with this situation?
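For illustration, the single-source version of that HashBytes approach looks roughly like the sketch below (table and column names are made up, and CONCAT_WS assumes SQL Server 2017 or later); the open question is how best to extend it when several source tables, and therefore several keys, feed one warehouse row:

```sql
-- Hypothetical sketch: composite business key + HASHBYTES for change detection.
SELECT
    c.CustomerId,                         -- key from the first source table
    a.AddressId,                          -- key from the second source table
    HASHBYTES('SHA2_256',
              CONCAT_WS('|', c.CustomerName, c.Email, a.City, a.PostalCode)) AS RowHash
FROM dbo.Customer AS c
JOIN dbo.Address  AS a
  ON a.CustomerId = c.CustomerId;

-- During the nightly load, RowHash is compared with the value stored in the
-- warehouse for the same (CustomerId, AddressId) pair to decide between
-- insert, update, or no-op.
```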
I have a large table, close to 1 GB in size and growing every week, with about 190 million rows in total. I started getting alerts from HANA to partition this table, so I planned to partition it on a column that is frequently used in the WHERE clause.
My HANA system is a scale-out system with 8 nodes.
In order to compare query performance between the partitioned and un-partitioned versions of the table, I first created calculation views on top of the un-partitioned table and recorded the query performance.
I then partitioned the table using the HASH method, with the number of partitions equal to the number of servers, so that I would get good data distribution across the servers. I created a calculation view on top of it and recorded the query performance.
To my surprise, I found that the calculation view on the un-partitioned table performs better than the calculation view on the partitioned table.
This was a real shock. I am not sure why the calculation view on the non-partitioned table responds faster than the one on the partitioned table.
I have PlanViz output files but am not sure where to attach them.
Can anyone explain why this is the behaviour?
OK, this is not a straightforward question that can be answered definitively as such.
What I can do, though, is list some factors that will likely play a role here:
- a non-partitioned table needs a single access to the table structure, while the partitioned version requires at least one access for each partition
- if the SELECT does not actually provide a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place (see the sketch after this answer)
- HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for its columns, whereas a single-partition/non-partitioned table only needs one dictionary per column
- you mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is additional workload and waiting time that simply does not exist with non-partitioned tables
- when joining partitioned tables, each partition of the first table has to be joined with each partition of the second table if no partition-wise join is possible
There are other potential reasons why a query against partitioned tables can be slower than one against a non-partitioned table. All of this is explained extensively in the SAP HANA Administration Guide.
As general guidance, tables should only be partitioned if it cannot be avoided and the access patterns of the queries are well understood. It is definitely not a feature that you just "switch on" and everything will just work fine.
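To make the pruning point above concrete, here is a minimal sketch (invented table and column names) of a HASH-partitioned HANA table; only the query that filters on the partitioning column can be pruned down to a single partition:

```sql
-- Hypothetical sketch: HASH partitioning in SAP HANA.
CREATE COLUMN TABLE SALES_DOC (
    DOC_ID      BIGINT,
    CUSTOMER_ID INTEGER,
    DOC_DATE    DATE,
    AMOUNT      DECIMAL(15,2)
)
PARTITION BY HASH (CUSTOMER_ID) PARTITIONS 8;

-- Pruning possible: the predicate is on the partitioning column, so only the
-- partition that owns this hash value needs to be read.
SELECT SUM(AMOUNT) FROM SALES_DOC WHERE CUSTOMER_ID = 4711;

-- No pruning: every partition, potentially on every node, has to be scanned.
SELECT SUM(AMOUNT) FROM SALES_DOC WHERE DOC_DATE >= '2023-01-01';
```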
I have a table with 340 GB of data, but we use only the last week of data. So, to minimize cost, I am planning to move this data to a partitioned table or sharded tables.
I have done some experiments with sharded tables and partitioning. I created a partitioned table and loaded two days' worth of data (two partitions), and I created two sharded tables (individual tables). I then tried to pull the last two days' worth of data:
Full table - 27 sec
Partitioned table - 33 sec
Sharded tables - 91 sec
Please let me know which way is best. Based on the experiment, the result comes back quickest when I run against the full table, but that means a full table scan.
Thanks,
Per the official GCP documentation on Partitioning versus Sharding, you should use partitioned tables:
Partitioned tables perform better than tables sharded by date. When you create date-named tables, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance. The recommended best practice is to use partitioned tables instead of date-sharded tables.
The difference in performance seems to be due to some background optimizations that have run on the non-partitioned table, but are yet to run on the partitioned table (since the data is newer).
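For reference, a minimal sketch of the partitioned approach in BigQuery standard SQL (dataset, table, and column names are made up); the partition filter in the WHERE clause is what limits the scan, and the bytes billed, to the last two days:

```sql
-- Hypothetical sketch: a date-partitioned table in BigQuery.
CREATE TABLE mydataset.events
(
  event_date DATE,
  user_id    STRING,
  payload    STRING
)
PARTITION BY event_date
OPTIONS (require_partition_filter = TRUE);  -- forces every query to prune partitions

-- Only the last two daily partitions are scanned (and billed).
SELECT COUNT(*)
FROM mydataset.events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY);
```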
I have a huge database with a table containing billions of records. I need to do a monthly cleanup of this table (delete the oldest records based on a date field).
Since I need to delete a few hundred million records for one month's worth of data, doing a DELETE, or even deleting in chunks, takes too long because of the indexes that slow the process down.
bcp data out + truncate + bcp data in also takes too long.
The solution I now want to try is to partition the table into different filegroups (one month per partition). I understand the part about building the partitions, but how do I delete a filegroup along with its data?
You can switch a partition out to a new table and then drop that table. Filegroups do not really have anything to do with it, other than the restriction that the table you switch to must be on the same filegroup. You do not necessarily have to map your partitions to separate filegroups, although you may want to do that for other reasons.
Here's a good example of a partition-wise roll-off in SQL Server.
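A hedged sketch of that roll-off pattern (object names, columns, and the boundary value are placeholders): switch the oldest partition into an empty staging table with the same structure, clustered index, and filegroup, drop it, then merge away the now-empty boundary.

```sql
-- Hypothetical sketch: rolling the oldest month out of a partitioned table.
-- The staging table must match the partitioned table's columns, clustered index,
-- and filegroup for the SWITCH to succeed.
CREATE TABLE dbo.BigTable_Rolloff
(
    Id        bigint       NOT NULL,
    EventDate datetime2    NOT NULL,
    Payload   varchar(200) NULL,
    CONSTRAINT PK_BigTable_Rolloff PRIMARY KEY CLUSTERED (Id, EventDate)
) ON [PRIMARY];

-- Metadata-only operation: the oldest partition's rows now belong to the staging table.
ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Rolloff;

-- Getting rid of the data is then effectively instant, regardless of row count.
DROP TABLE dbo.BigTable_Rolloff;

-- Remove the now-empty boundary point from the partition function.
ALTER PARTITION FUNCTION pf_BigTableMonth() MERGE RANGE ('2022-01-01');
```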
Need some advice on how best to approach this. Basically, we have a few tables in our database along with archive versions of those tables for deleted data (e.g. Booking and Booking_archive). The table structure in both tables is exactly the same, except for two extra columns in the archive table: DateDeleted and DeletedBy.
I have removed these archive tables and simply added the DateDeleted and DeletedBy columns to the actual table. My plan is then to partition this table so I can separate the archived data from the non-archived data.
Is this the best approach? I just did not like the idea of having two tables purely to distinguish between archived and non-archived data.
Any other suggestions/pointers for doing this?
The point of archiving is to improve performance, so I would say it is definitely better to separate the data into another table. In fact, I would go as far as creating an archive database on a separate server and keeping the archived data there; that would yield the biggest performance gains. The runner-up architecture is a second "archive" database on the same server with exactly duplicated tables.
Even with partitioning, you will still have table-locking issues and hardware limitations slowing you down. Separate tables or databases will eliminate the former, and a separate server, or one drive per partition, could address the latter.
As for storing the archived date, I do not think I would bother doing that in the production database. You might as well make that the timestamp on the archive-db tables, so that when you insert a record it is automatically stamped with the datetime at which it was archived.
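A minimal sketch of that auto-stamping idea in T-SQL (table and column names are invented, and this assumes the archive table lives where the DELETE can reach it): the archive table defaults its DateDeleted and DeletedBy columns, and a single DELETE ... OUTPUT moves the rows across.

```sql
-- Hypothetical sketch: archive table that stamps rows as they arrive.
CREATE TABLE archive.Booking
(
    BookingId   int      NOT NULL,
    CustomerId  int      NOT NULL,
    BookingDate datetime NOT NULL,
    DateDeleted datetime NOT NULL CONSTRAINT DF_ArcBooking_DateDeleted DEFAULT (GETDATE()),
    DeletedBy   sysname  NOT NULL CONSTRAINT DF_ArcBooking_DeletedBy  DEFAULT (SUSER_SNAME())
);

-- Delete from the live table and land the removed rows in the archive in one statement;
-- DateDeleted and DeletedBy are filled in by their defaults.
DELETE FROM dbo.Booking
OUTPUT deleted.BookingId, deleted.CustomerId, deleted.BookingDate
INTO archive.Booking (BookingId, CustomerId, BookingDate)
WHERE BookingId = 42;   -- example key
```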
The solution approach depends on:
- the number of tables having such archive tables
- the arrival rate of data into the archive tables
- whether you want to invest in the software/hardware of a separate server
Based on the above, the various options could be:
- same database, different schema, on the same server
- an archive database on the same server
- an archive database on a different server
Don't go for partitioning if the data is archived and has no chance of getting back into the main tables.
You might also add lifecycle-management columns to the archived data (retention period or expiry date) so that the lifecycle of the archive data can also be managed effectively.
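For illustration (table and column names are placeholders), lifecycle management can be as simple as stamping a retention horizon when rows are archived and running a periodic purge:

```sql
-- Hypothetical sketch: retention column on the archive table plus a scheduled purge.
ALTER TABLE archive.Booking ADD ExpiryDate date NULL;

-- Stamp a retention horizon when rows are archived (e.g. keep for 7 years).
UPDATE archive.Booking
SET    ExpiryDate = DATEADD(YEAR, 7, DateDeleted)
WHERE  ExpiryDate IS NULL;

-- Scheduled purge: remove rows whose retention period has lapsed.
DELETE FROM archive.Booking
WHERE  ExpiryDate < CAST(GETDATE() AS date);
```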