Query pulling 12-15 GB of data from more than 120 tables - SQL

I have a query which pulls data from almost 125 different tables, and I have created some 13 nested stored procedures calling other stored procedures to pull all the required data. Surprise, surprise: the query takes ages to execute, and sometimes I have to kill the connection and rerun it.
I have been advised to make use of a staging table, move the required data there using SSIS packages and pull the data from there, but I am a bit reluctant to use SSIS as I'm not very comfortable with it. Also, this report is only requested once in a while, and moving around 10-15 GB of data for one report seems like a lot of hassle.
Any suggestions or ideas to make this hellish task a bit simpler, quicker and less error prone?

Create a reporting database. On some schedule, be it hourly, daily, or whatever frequency meets the needs of the report's users, ETL the data from the transactional database into the reporting database.
You can use SSIS or you could choose to execute some stored procedures for ETL. Regardless, you probably will schedule it with a SQL Agent Job.
Finally, in terms of designing your reporting database, consider transforming the data in a way that will help the reports' performance. Many people "flatten" or de-normalize data for the purpose of reporting. We ETL transactional data into a data warehouse that uses the "star schema" pattern, and we also have an Analysis Services database and MDX reports. Most likely you don't need to go that far for one report, but that is further down this same path of optimized data structures for reporting and BI.
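As a rough illustration of the flattening idea, here is a minimal T-SQL sketch; the table and column names (dbo.Orders, dbo.Customers, rpt.OrderFlat and so on) are hypothetical stand-ins for your own schema, and the reload procedure is the sort of thing you would schedule with a SQL Agent job.

-- Hypothetical flattened reporting table (assumes an rpt schema already exists).
CREATE TABLE rpt.OrderFlat
(
    OrderLineId   INT            NOT NULL PRIMARY KEY,
    OrderId       INT            NOT NULL,
    OrderDate     DATE           NOT NULL,
    CustomerName  NVARCHAR(200)  NOT NULL,
    Region        NVARCHAR(50)   NULL,
    ProductName   NVARCHAR(200)  NOT NULL,
    Quantity      INT            NOT NULL,
    LineTotal     DECIMAL(18,2)  NOT NULL
);
GO

-- Simple full-reload procedure; schedule it hourly/daily via SQL Agent.
CREATE PROCEDURE rpt.LoadOrderFlat
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE rpt.OrderFlat;

    INSERT INTO rpt.OrderFlat
        (OrderLineId, OrderId, OrderDate, CustomerName, Region, ProductName, Quantity, LineTotal)
    SELECT  ol.OrderLineId,
            o.OrderId,
            o.OrderDate,
            c.Name,
            c.Region,
            p.Name,
            ol.Quantity,
            ol.Quantity * ol.UnitPrice
    FROM    dbo.Orders     AS o
    JOIN    dbo.OrderLines AS ol ON ol.OrderId   = o.OrderId
    JOIN    dbo.Customers  AS c  ON c.CustomerId = o.CustomerId
    JOIN    dbo.Products   AS p  ON p.ProductId  = ol.ProductId;
END;

The report query then reads from rpt.OrderFlat with simple joins or none at all, which is where the performance gain comes from.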

Related

What's the best method of creating an SSRS report that will be run manually many times with different parameters?

I have an SSRS sales report that will be run many times a day by users, but with different parameters selected for the branch and product types.
The SQL query uses some large tables and is quite complex; therefore, running it many times is going to have a performance cost.
I assumed the best solution would be to create a dataset for the report with all permutations, run once overnight, and then apply filters when the users run the report.
I tried creating a snapshot in SSRS which doesn’t consider the parameters and therefore has all the required data, then filtering the Tablix using the parameters that the users selected. The snapshot works fine but it appears to be refreshed when the report is run with different parameters.
My next solution would be to create a table for the dataset which the report would then point to. I could recreate the table every night using a stored procedure. With a couple of small indexes the report would be lightning fast.
This solution would seem to work really well but my knowledge of SQL is limited, and I can’t help thinking this is not the right solution.
Is this suitable? Are there better ways? Can anybody confirm either way?
SSRS datasets have caching capabilities. I think you'll find this more useful than having to create extra DB tables and such.
Please see here https://learn.microsoft.com/en-us/sql/reporting-services/report-server/cache-shared-datasets-ssrs?view=sql-server-ver15
If the rate of change of the data is low enough, and SSRS caching doesn't suit your needs, then you could manually cache the record set from the report query (without the filtering) into its own table, and then modify the report to query from that table.
Oracle and most data warehouse implementations have a formal mechanism specifically for this called materialized views; there is no such luck in SQL Server, though you can easily implement the same pattern yourself.
There are 2 significant drawbacks to this:
The data in the new table is a snapshot at the point in time that it was loaded, so this technique is better suited to slow moving datasets or reports where it is not critical that the data is 100% accurate.
You will need to manage the lifecycle of the data in this table. Ideally you should set up a job or scheduled task to automate the refresh, but you could trigger a refresh as part of the logic in your report (not recommended, but possible).
Though it is possible, you should NOT consider using a TRIGGER to update the data: as you have already indicated, the query takes some time to execute, and this could have a major impact on the rest of your LOB application.
If you do go down this path, you should write the refresh logic into a stored procedure so that it can be executed when needed and from other internal and external automation mechanisms.
You should also add a column that records the date and time the dataset was refreshed, then replace any references in your report that display the date and time the report was printed with the time the data was prepared.
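A minimal sketch of that pattern, assuming a hypothetical report query against a dbo.Sales table; the table, column and procedure names are placeholders for your own report query, and the index at the end lines up with the report's parameter filters.

-- Hypothetical cache table for the unfiltered report dataset.
CREATE TABLE dbo.SalesReportCache
(
    Branch      NVARCHAR(50)   NOT NULL,
    ProductType NVARCHAR(50)   NOT NULL,
    SaleDate    DATE           NOT NULL,
    Amount      DECIMAL(18,2)  NOT NULL,
    LoadedAt    DATETIME2      NOT NULL   -- when the data was prepared
);
GO

-- Refresh procedure: run it nightly from a SQL Agent job.
CREATE PROCEDURE dbo.RefreshSalesReportCache
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @loadedAt DATETIME2 = SYSDATETIME();

    TRUNCATE TABLE dbo.SalesReportCache;

    -- The expensive report query goes here, without the parameter filtering;
    -- branch/product type become dataset or tablix filters in the report.
    INSERT INTO dbo.SalesReportCache (Branch, ProductType, SaleDate, Amount, LoadedAt)
    SELECT s.Branch, s.ProductType, s.SaleDate, s.Amount, @loadedAt
    FROM   dbo.Sales AS s;
END;
GO

-- A couple of small indexes matching the report parameters keep it fast.
CREATE INDEX IX_SalesReportCache_Branch_ProductType
    ON dbo.SalesReportCache (Branch, ProductType)
    INCLUDE (SaleDate, Amount, LoadedAt);

The report's dataset then points at dbo.SalesReportCache and can display MAX(LoadedAt) instead of the print time.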
It is also worth pointing out that performance issues with expensive queries in SSRS reports can often be overcome by reducing the functions and value formatting in the SQL query itself and moving that logic into the report definition. The same goes for filtering operations: you can easily add computed columns in the dataset definition or on the design surface, and you can implement filtering directly in the tablix too. There is no requirement that every record from the SQL query be displayed in the report, just as we do not need to show every column.
Sometimes some well-crafted indexes can help too; for complicated reports we can often find a balance between what the SQL engine can do efficiently and what the RDL can do for us.
Disclaimer: This is hypothetical advice; you should evaluate each report on a case-by-case basis.

Directly query databases with 1b rows of data using Tableau or PowerBI

I occasionally see people or companies showcasing querying a DB/cube/etc. from Tableau or PowerBI with less than 5s of response time, sometimes even less than 1s. How do they do this? Is the data optimized to the gills? Are they using a massive DB?
On a related note, I've been experimenting with analysing a much smaller dataset (100m rows) with Tableau against SQL DW, and it still takes nearly a minute to calculate. Should I try some other tech? Perhaps Analysis Services or a big data technology?
These are usually one-off data analysis assignments so I do not have to worry about data growth.
Live connections in Tableau will only be as fast as the underlying data source. If you look at your log (C:\Users\username\Documents\My Tableau Repository\Logs\log.txt), you will see the SQL Tableau issued to the database. Run that query on the server itself; it should take about the same amount of time. Side note: Tableau has a new data engine coming with the next release, called 'Hyper'. It should allow you to create an extract from 2b rows with very good performance. You can download the beta now; more info here

Does CQRS With OLTP and OLAP Databases Make Sense?

I have several OLTP databases with APIs talking to them. I also have ETL jobs pushing data to an OLAP database every few hours.
I've been tasked with building a custom dashboard showing high-level data from the OLAP database. I want to build several APIs pointing to the OLAP database. Should I:
Add to my existing APIs and call the OLAP database, using a CQRS-type pattern so reads come from OLAP while writes go to OLTP. My concern here is that there could be a mismatch between the data being read and written; how mismatched it is depends on how often the ETL jobs run (hours in my case).
Add to my existing APIs and call the OLAP database, then ask the client to choose whether they want OLAP or OLTP data where the APIs overlap. My concern here is that the client should not need to know about the implementation detail of where the data is coming from.
Write new APIs that only point to the OLAP database. This is a lot of extra work.
Don't use #1: when management asks for analytical reports, they are not bothered by the data mismatch introduced by the ETL window; obviously you will generate a CEO report after finishing the day's ETL.
Don't use #2: that way you'll load the transactional system with analytic overhead and dissolve the isolation between the purposes of the two systems (not good for operation and maintenance).
Use #3, as it's the best way to fetch processed results. Use modern tools like Excel, Power Query and Power BI, which let you create rich dashboards quickly, instead of digging into the tables and writing APIs yourself.

Warehouse and SSIS

I am developing an application whose database is very generic, so I really can't use it for reporting, and I need a solution for how to set up reporting. I'm a developer, so my knowledge of the DBA domain is limited. For now my idea is to create another database where I'll put denormalized data from the original DB. I saw that I could use SSIS for that and would be glad if someone could give me some advice on how to attack this problem. Should I sync the data once a day and run reports that way? Is there a solution to sync the data continuously so the reports would always be up to date? Please, any advice. Thanks!
Damir,
What I get from your message is that you are getting close to building a data warehouse using a star schema pattern.
You could have two databases, one with the normalized data and the other with the star schema pattern (your DW), and then create a script that takes your normalized data and puts it into your data warehouse. The frequency of that script is up to you: after each transaction, every hour, once a day, etc.
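As a rough, hedged sketch of what that load script might look like, assuming hypothetical source tables (SourceDb.dbo.Customers, SourceDb.dbo.Orders) and warehouse tables (DW.dbo.DimCustomer, DW.dbo.FactSales); adapt the names and the incremental cut-off to your own model.

-- Load the customer dimension: insert customers not yet in the warehouse.
INSERT INTO DW.dbo.DimCustomer (CustomerKey, CustomerName, Region)
SELECT  c.CustomerId, c.Name, c.Region
FROM    SourceDb.dbo.Customers AS c
WHERE   NOT EXISTS (SELECT 1
                    FROM DW.dbo.DimCustomer AS d
                    WHERE d.CustomerKey = c.CustomerId);

-- Append new facts since the last load (here a simple date cut-off).
DECLARE @lastLoadDate DATE = (SELECT MAX(OrderDate) FROM DW.dbo.FactSales);

INSERT INTO DW.dbo.FactSales (DateKey, CustomerKey, OrderDate, Amount)
SELECT  CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112)),  -- yyyymmdd date key
        o.CustomerId,
        o.OrderDate,
        o.Amount
FROM    SourceDb.dbo.Orders AS o
WHERE   o.OrderDate > ISNULL(@lastLoadDate, '19000101');

Wrap the two statements in a stored procedure and schedule it (for example with SQL Agent) at whatever frequency you decided on.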
The advantage of having a data warehouse is that you will be able to use OLAP cubes and the MDX language for your reports. It's a plus!
Hope this helps,
If you are on SQL Server 2008 or later, explore the MERGE statement.
For smaller tables, just truncate and reload. 'Smaller' could be subjective, but if a table takes less than 2-3 minutes to load, it could be termed small. Obviously, during that period any query that uses such tables would fail.
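For illustration, a hedged MERGE sketch with made-up table names; dbo.Customers_Source stands in for a staging or linked-server copy of the source table, and NULL handling is omitted for brevity.

-- Incremental sync of a reporting copy via MERGE (SQL Server 2008+).
MERGE dbo.Customers_Reporting AS tgt
USING dbo.Customers_Source    AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.Region <> src.Region) THEN
    UPDATE SET tgt.Name   = src.Name,
               tgt.Region = src.Region
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, Name, Region)
    VALUES (src.CustomerId, src.Name, src.Region)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;   -- remove rows that disappeared from the source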

How do I keep a table synchronized with a query in SQL Server - ETL?

I wasn't sure how to word this question, so I'll try to explain. I have a third-party database on SQL Server 2005. I have another SQL Server 2008 instance, to which I want to "publish" some of the data in the third-party database. I shall then use this database as the back-end for a portal and Reporting Services; it shall be the data warehouse.
On the destination server I want to store the data in different table structures to those in the third-party DB. Some tables I want to denormalize, and there are lots of columns that aren't necessary. I'll also need to add additional fields to some of the tables, which I'll need to update based on data stored in the same rows. For example, there are varchar fields that contain info I'll want to populate other columns with. All of this should cleanse the data and make it easier to report on.
I can write the query (or queries) to get all the info I want into a particular destination table. However, I want to be able to keep it up to date with the source on the other server. It doesn't have to be updated immediately (although that would be good), but I'd like it to be updated perhaps every 10 minutes. There are hundreds of thousands of rows of data, but the changes and additions of new rows etc. aren't huge.
I've had a look around but I'm still not sure of the best way to achieve this. As far as I can tell, replication won't do what I need. I could manually write the T-SQL to do the updates, perhaps using the MERGE statement, and then schedule it as a job with SQL Server Agent. I've also been having a look at SSIS, and that looks to be geared at this ETL kind of thing.
I'm just not sure what to use to achieve this, and I was hoping to get some advice on how one should go about doing this kind of thing. Any suggestions would be greatly appreciated.
For those tables whose schemas/relations are not changing, I would still strongly recommend replication.
For the tables whose data and/or relations are changing significantly, I would recommend that you develop a Service Broker implementation to handle that. The high-level approach with Service Broker (SB) is:
Table-->Trigger-->SB.Service >====> SB.Queue-->StoredProc(activated)-->Table(s)
I would not recommend SSIS for this, unless you wanted to go to something like daily exports/imports. It's fine for that kind of thing, but IMHO far too kludgey and cumbersome for either continuous or short-period incremental data distribution.
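A minimal, hedged sketch of that Service Broker pipeline, with placeholder names throughout (dbo.Orders, OrderChangedMsg, the two services, the activated procedure); it assumes the database already has ENABLE_BROKER set, leaves out error handling and conversation reuse, and the upsert into the reporting table is only stubbed in.

-- One-time plumbing: message type, contract, queues and services.
CREATE MESSAGE TYPE OrderChangedMsg VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT OrderChangedContract (OrderChangedMsg SENT BY INITIATOR);

CREATE QUEUE OrderChangeInitiatorQueue;
CREATE QUEUE OrderChangeTargetQueue;
CREATE SERVICE OrderChangeInitiatorService ON QUEUE OrderChangeInitiatorQueue;
CREATE SERVICE OrderChangeTargetService    ON QUEUE OrderChangeTargetQueue (OrderChangedContract);
GO

-- Trigger on the source table: package the changed keys as XML and send them.
CREATE TRIGGER trg_Orders_Changed ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @h UNIQUEIDENTIFIER, @body XML;
    SET @body = (SELECT OrderId FROM inserted FOR XML PATH('Order'), ROOT('Changes'), TYPE);
    IF @body IS NULL RETURN;  -- pure deletes: build from the deleted table instead if needed

    BEGIN DIALOG CONVERSATION @h
        FROM SERVICE OrderChangeInitiatorService
        TO SERVICE 'OrderChangeTargetService'
        ON CONTRACT OrderChangedContract
        WITH ENCRYPTION = OFF;             -- in production, reuse/end conversations

    SEND ON CONVERSATION @h MESSAGE TYPE OrderChangedMsg (@body);
END;
GO

-- Activated procedure: drain the queue and apply changes to the reporting table(s).
CREATE PROCEDURE dbo.ProcessOrderChanges
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @h UNIQUEIDENTIFIER, @body XML, @type SYSNAME;

    WHILE 1 = 1
    BEGIN
        WAITFOR (
            RECEIVE TOP (1) @h    = conversation_handle,
                            @body = CAST(message_body AS XML),
                            @type = message_type_name
            FROM OrderChangeTargetQueue
        ), TIMEOUT 5000;

        IF @@ROWCOUNT = 0 BREAK;

        IF @type = 'OrderChangedMsg'
        BEGIN
            -- Apply the changed keys to the destination table(s) here,
            -- e.g. with a MERGE keyed on the OrderId values in @body.
            PRINT 'apply changes';  -- stub
        END
        ELSE IF @type = 'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog'
            END CONVERSATION @h;
    END;
END;
GO

ALTER QUEUE OrderChangeTargetQueue
    WITH ACTIVATION (STATUS = ON,
                     PROCEDURE_NAME = dbo.ProcessOrderChanges,
                     MAX_QUEUE_READERS = 1,
                     EXECUTE AS OWNER);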
Nick, I have gone the SSIS route myself. I have jobs that run every 15 minutes that are based on SSIS and do exactly what you are trying to do. We have a huge relational database, and we wanted to do complicated reporting on top of it using a product called Tableau. We quickly discovered that our relational model wasn't really so hot for that, so I built a cube over it with SSAS, and that cube is updated and processed every 15 minutes.
Yes, SSIS does give the aura of being mainly for straight ETL jobs, but I have found that it can be used for simple, quick jobs like this as well.
I think staging and partitioning would be too much for your case. I am implementing the same thing in SSIS now, but with a frequency of one hour, as I need to allow some time for support activities. I am sure that using SSIS is a good way of doing it.
During the design, I also thought of another way to achieve custom replication: customizing the Change Data Capture (CDC) process. This way you can get near-real-time replication, but it is a tricky thing.
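For reference, a minimal sketch of the CDC route, with dbo.Orders as a placeholder source table; it only shows enabling CDC and reading the changes, and a real implementation would persist the last-processed LSN between runs.

-- Enable CDC on the database and on one source table (requires SQL Agent to be running).
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Orders',
     @role_name     = NULL;
GO

-- Read everything captured so far; in a scheduled job you would store and reuse
-- the last processed LSN instead of starting from the minimum each time.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders'),
        @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT  __$operation,        -- 1 = delete, 2 = insert, 3 = update (before), 4 = update (after)
        OrderId, CustomerId, Amount
FROM    cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');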