Reporting tables in SQL

Our organization has a reporting application that queries a real-time transaction table to pull data for reports. Because the query runs against a transaction table that is continuously updated, report performance is dismal. We are trying to come up with a new DB design to improve performance.
My idea is to have a separate table for each year (e.g. reports_2014, reports_2015, reports_2016), since we only need to report on the last three years of data. Each table would be created at the end of the year from the real-time DB. The current-year table (reports_2016) on the reporting DB would be updated at midnight with the previous day's new records. My reporting query would use a view that is a UNION ALL of these three tables plus the rows in the real-time table from midnight up to the present moment.
Initially I felt this was a good design, provided I have good indexes on these history tables.
However, there is a catch arising from the way the application updates the real-time tables.
When I cancel a transaction, the status column of the original transaction record changes to cancelled, and a new cancellation record is inserted as well.
I could capture this with an AFTER INSERT trigger that records the updates made.
The problem is that if a cancellation record is posted while my ETL job is copying the previous day's data to the history table, I miss the update.
How do I capture this? Is there a way to delay the trigger until my ETL is complete? Or is there a better approach to this problem?
My apologies if this is not the right place to post this question.
Thanks,
Roopesh

Multiple parallel tables with the same structure are almost never a good idea in a database design. Databases offer two important methods for handling performance:
Indexes
Partitioning
as well as other methods, such as rewriting queries, spatial indexes, full text indexes, and so on.
In your case, instead of multiple tables, consider table partitions.
As for your process, you should be using the creation/modification date of records. I would envision a job running early in the morning, say at 1:00 a.m., and this job would gather the previous day's records. Any changes after midnight simply do not apply. They will be included the following day.
If the reporting needs to be real-time as well, then you should consider building the reporting into the application itself. Some methods are:
Following the same approach as above, but doing the reporting runs more frequently (say once per hour rather than once per day).
Modifying the existing triggers to handle updates to reporting tables as well as the base tables.
Wrapping all DML transactions in stored procedures that handle both the transactional tables and the reporting tables.
Re-architecting the system to use queues with multiple readers to handle the disparate processing needs.
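As a rough illustration of the trigger option in the list above, here is a sketch using Python's built-in sqlite3. The table and trigger names are made up, and SQLite's trigger syntax differs from SQL Server's, so treat this as the shape of the idea rather than something to paste into production.

```python
import sqlite3

# Hypothetical schema: report_transactions mirrors the base table, and two
# triggers keep it in step for both new rows and later status changes.
SCHEMA = """
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL
);
CREATE TABLE report_transactions (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL
);
CREATE TRIGGER trg_txn_insert AFTER INSERT ON transactions
BEGIN
    INSERT INTO report_transactions (id, status) VALUES (NEW.id, NEW.status);
END;
CREATE TRIGGER trg_txn_status AFTER UPDATE OF status ON transactions
BEGIN
    UPDATE report_transactions SET status = NEW.status WHERE id = NEW.id;
END;
"""


def make_trigger_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

The point is the second trigger: an insert-only trigger would never see the status of an existing row flip to cancelled, while an update trigger carries that change across in the same transaction as the cancellation itself.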

Thank you, Gordon, for your input. At this point ours is a real-time reporting system. The reporting database is a mirrored instance of the production transactional database: whenever a new transaction is entered in production, the same record flows instantly to the reporting database, which has exactly the same schema. We do have indexes on the columns that are queried frequently; however, because there are many inserts every hour, index performance degrades quite fast. We rebuild them once every two weeks, and that takes around 8 hours. That is why I thought having indexes on this huge transaction table, with many inserts every hour, may not be a good idea. Please correct me if I am wrong.
I am reading up on partitioning to see if it is a viable option for me. I discussed it with our DBA and got the following comment from him: 'The reporting database is a mirrored instance of the real-time production database. You have to implement partitioning on the production transactional database. Partitioning only the mirrored instance would not work, as your actual source DB is not partitioned.' I am not sure how far this is true. Do you know if there is such a dependency between partitioning and mirroring?

Related

De-duplicating BigQuery in an Asynchronous Real Time ETL Pipeline

Our Data Warehouse team is evaluating BigQuery as a data warehouse column store solution and has some questions regarding its features and best use. Our existing ETL pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us to occasionally replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real-time streaming insert API with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in or near real time to avoid data discrepancies.
We have a few questions and would appreciate answers to any of them. Any additional advice on using BigQuery in an ETL architecture is also appreciated.
Is there a common implementation for de-duplication of real-time streaming beyond the use of the tableId?
If we attempt a delsert (via a delete followed by an insert using the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
Is it possible to implement real-time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column store ETL technologies, but we have seen no documentation suggesting its use in BigQuery.
We let duplication happen and write our logic and queries so that every entity is treated as streamed data. For example, a user profile is streamed data: there are many rows for it over time, and when we need the latest state we use the most recent row.
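A small sketch of the "use the most recent row" query. It runs on Python's built-in sqlite3 purely so it can be tried locally; the profile_events table and its columns are invented for the example. In BigQuery the same thing would more likely be written with ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC), but the join against MAX(...) below expresses the same idea in portable SQL.

```python
import sqlite3

# Pick, per user, the row(s) carrying the newest event_ts. DISTINCT collapses
# exact duplicates produced by re-streaming the same event.
LATEST_PROFILE_SQL = """
SELECT DISTINCT p.user_id, p.name, p.event_ts
FROM profile_events AS p
JOIN (
    SELECT user_id, MAX(event_ts) AS max_ts
    FROM profile_events
    GROUP BY user_id
) AS latest
  ON latest.user_id = p.user_id AND latest.max_ts = p.event_ts
ORDER BY p.user_id
"""


def make_profile_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profile_events (user_id INTEGER, name TEXT, event_ts INTEGER)"
    )
    return conn


def latest_profiles(conn):
    return conn.execute(LATEST_PROFILE_SQL).fetchall()
```

Wrapping a query like this in a view gives readers a de-duplicated picture without ever deleting anything from the underlying table.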
Delsert is not suitable in my opinion, as you are limited to 96 DML statements per day per table. This means you need to store batches temporarily in a table, then later issue a single DML statement that handles the whole batch of rows and updates the live table from the temp table.
If you are considering delsert, it may be easier to write a query that reads only the most recent row.
Streaming followed by a scheduled merge is possible. In fact, you can rewrite data in the same table, e.g. to remove dups, or have a scheduled query batch the content of the temp table and write it to the live table. This is much the same as letting duplicates happen and dealing with them later in a query; it is also called re-materialization if you write to the same table.
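To make the staging-then-merge flow concrete, here is a sketch on Python's built-in sqlite3. The staging_events and live_events tables are hypothetical, and INSERT OR REPLACE is SQLite's spelling of an upsert; on BigQuery the scheduled step would be a single MERGE or a query that rewrites the destination table.

```python
import sqlite3


def make_staging_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Staging accepts anything, duplicates included.
        CREATE TABLE staging_events (event_id TEXT NOT NULL, payload TEXT NOT NULL);
        -- The live table is keyed, so each event_id can appear only once.
        CREATE TABLE live_events (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL);
    """)
    return conn


def merge_staging(conn):
    """Scheduled step: fold the whole staged batch into the live table."""
    with conn:  # merge and clear in one transaction
        conn.execute(
            """
            INSERT OR REPLACE INTO live_events (event_id, payload)
            SELECT event_id, payload FROM staging_events
            """
        )
        conn.execute("DELETE FROM staging_events")
```

Each run costs one batch statement against the live table no matter how many rows were streamed, which is what keeps this approach inside a per-table DML quota. Replaying old events is harmless because the merge is keyed on event_id.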

Maintenance of application log files

I want to create a log table to keep track of users and their actions on a website. For example, when a user logs in, a record will be created in the log table; when a user creates information, another record will be created. Similarly, every action creates a record in the log table. This way the log table will grow very fast. What is a better way to maintain such big tables, apart from creating triggers and scheduling scripts to clean up data frequently?
From my experience, excessive logging typically doesn't gain you much. A lot of people lose the usefulness of logging in the sheer volume of it. Just a little warning beforehand.
As for maintaining a table that size, I recommend partitioning the table and writing a specific set of stored procedures that make effective use of a few indexes you place on the table. Ad-hoc work on the table should be kept to a minimum, and when it is done, make sure the ad-hoc queries hit an index you set up on the table. Also, WITH (NOLOCK) will be your friend for SELECT statements if a large number of inserts are going on (at the cost of possibly reading uncommitted rows).
This is the general approach I use for the transaction tables I handle, and they typically get around 1-2 million rows a day.

Synchronize SQL Server databases

I have a new idea and a question about it that I would like to ask.
We have a CRM application on-premises (in house). We use that application more or less 24x7. We also run billing and payroll on the same CRM database, which is OLTP, and the same goes for SSRS reports.
It looks like whenever we perform an operation in the front end that inserts and updates a couple of entities at the same time, our application freezes until that process finishes. An example is extracting payroll for 500 employees for their activities during the last 2 weeks. Basically it summarizes total working hours, pulls those numbers from the database, and writes/updates a record saying the extract has been completed. So for 500 employees we are looking at around 40K-50K rows across the INSERT/SELECT/UPDATE statements together.
Nobody can do anything while this process runs! We are considering the following options to take care of this issue.
Running this process in off-hours
Or making a copy of the Dyna. CRM DB and doing these operations (extracting thousands of records and running multiple reports) on the copy.
My questions are:
First of all, how do we create the copy, and where should we create it (best practices)?
How do we keep it synchronized in real time?
If we only run SELECT statements on the copy DB, that is fine, but if we do any insert/update on the copy, how do we reflect that in the actual live DB? In short, how do we make sure the original and the copy stay synchronized with each other in real time?
I know I asked too many questions, but as a SQL person stepping into the CRM team and providing suggestions, you know what I am trying to say.
Thanks in advance, folks, for any suggestions.
Well, to answer your question regarding a live "copy" of a database, a good solution is an AlwaysOn availability group.
https://blogs.technet.microsoft.com/canitpro/2013/08/19/step-by-step-creating-a-sql-server-2012-alwayson-availability-group/
Though I don't think that is what you want in this situation. AlwaysOn availability groups are typically for database instances that need very short failover times. For example, if the primary DB server in the cluster goes down, it fails over to a secondary in a second or two at most, and end users notice only a brief hiccup.
What I think you would find better is to look at the insert statements hitting your database server and see why they prevent you from pulling data. If they are truly locking the table, changing a large number of your reads to "nolock" reads might help remedy the situation.
It would also help to know what kind of resources you have allocated, and whether you have proper indexing on the core tables of your DB. Without proper indexing, a lot of queries can take longer than normal, causing the locking you're seeing.
Finally, I would recommend table partitioning if the tables you are pulling from are too large. This can help with a lot of disk speed issues and also help optimize your queries if you partition by time segment (i.e. create a new partition every X months, so that a query pulling from one time segment reads only that one data file).
https://msdn.microsoft.com/en-us/library/ms190787.aspx
I would say you need to focus on efficiency more than on a "copy database", as from the sounds of it your volumes aren't high enough to need anything like that. I currently have a SQL Server transaction database taking 10 million+ inserts a day and I still run live reports against it. You just need the resources and proper indexing to accommodate it.

How to log daily activity in SQL

I want to keep track of certain things on my site that happen throughout the day, and then I want to be able to compile the data so that I can view it on a day to day basis. What is the best way to do this in a SQL database? Would there be a better method than keeping "date|action|data" for each time (could be thousands per day) something happens, and then when I want to look at it just pull "where date = X"? Seems like this will have a lot of overhead, but I'm not sure how else to do it.
MySQL may not be ideally suited to storing large amounts of log data, but with some optimizations it can do better. Here are some tips:
Use MyISAM with concurrent inserts
Rotate tables daily and use union to query
Use delayed inserts with MySQL or a job processing agent like Gearman
Retrieval of the data may not be a problem here, but the storage engine matters a lot, and MyISAM is your best option in this case given that you don't need transactions (which it doesn't support).
The maximum number of rows supported for MyISAM is ~4.29E+09, with up to 64 indexes per table, which I think is plenty for you.
Check out this detailed article regarding efficient logging with MySQL
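The "rotate tables daily and use UNION to query" tip could look roughly like this. The sketch uses Python's built-in sqlite3 rather than MySQL so it runs anywhere, and the log_YYYYMMDD naming scheme and helper functions are assumptions made for the example.

```python
import sqlite3
from datetime import date


def table_for(day: date) -> str:
    # The name is built only from a formatted date, so interpolating it
    # into SQL below cannot inject arbitrary text.
    return f"log_{day:%Y%m%d}"


def log_action(conn, day, action, data):
    name = table_for(day)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {name} (action TEXT NOT NULL, data TEXT)"
    )
    conn.execute(f"INSERT INTO {name} (action, data) VALUES (?, ?)", (action, data))


def query_days(conn, days):
    """Read several daily tables at once with UNION ALL (tables must exist)."""
    parts = [
        f"SELECT '{d.isoformat()}' AS day, action, data FROM {table_for(d)}"
        for d in days
    ]
    sql = " UNION ALL ".join(parts) + " ORDER BY day, action"
    return conn.execute(sql).fetchall()


def drop_day(conn, day):
    # Expiring a whole day is a cheap DROP instead of a large DELETE.
    conn.execute(f"DROP TABLE IF EXISTS {table_for(day)}")
```

A "where date = X" lookup then touches one small table, and old data is removed by dropping tables rather than deleting rows one by one.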

How to replicate database A to B, then truncate data on database A, leaving B alone?

I am having a problem with my SQL Server 2005 database. The database must sustain 1000 inserts per second. This is proving very difficult because the database must also handle reporting on the data, and therefore indexing. It seems to slow down after a couple of days, achieving only 300 inserts per second. By day 10 it is almost non-functional.
The requirement is to store 14 days' worth of data. So far I can only manage 3 or 4 before everything falls apart. Is there a simple solution to this problem?
I was thinking that I could replicate the primary database, letting the new database be the reporting database that stores the 14 days' worth of data, and then truncate the primary database daily. Would this work?
It is unlikely you will want reporting running against a database capturing 1000 records per second. I'd suggest two databases, one handling the constant stream of inserts and a second reporting database that only loads records at an interval, either by querying the first for a finite set since the last load or by caching the incoming data and loading it separately.
However, reporting in near real time against a database capturing 86 million rows per day and carrying approximately 1.2 billion rows will require significant planning and hardware. Further, on the back end, as you reach day 14 and start to remove old data you will put more load on the database. If you can run without logging, that will help the primary system, but the reporting system, with its indexing demands and such, will need some pretty significant performance work.
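For reference, the row counts quoted above follow directly from the figures in the question (1000 inserts per second, 14 days of retention). A quick check:

```python
INSERTS_PER_SECOND = 1000      # from the question
RETENTION_DAYS = 14            # from the question
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_day = INSERTS_PER_SECOND * SECONDS_PER_DAY
rows_retained = rows_per_day * RETENTION_DAYS

print(f"{rows_per_day:,} rows per day")                       # 86,400,000
print(f"{rows_retained:,} rows over {RETENTION_DAYS} days")   # 1,209,600,000
```

Once the table is full, every day also means removing roughly 86 million expired rows on top of inserting the same number of new ones.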
If the server has multiple hard drives, I would try to split the database (or even the tables) into partitions.
Yeah, you don't need to copy a database over and then truncate/delete the live database on the fly. My guess is that the slowness is because your transaction logs are growing like crazy?
I think you are trying to say that you want to "shrink" the database periodically. If you are using the FULL recovery model, I think that backing up the transaction logs every so often should bring things back down to normal.