What are reporting cubes in regards to Oracle SQL? - sql

I am curious about what "reporting cubes" are and how they relate to Oracle SQL.
I read that they are similar to VLOOKUP in Excel, but I'm not understanding much else.
Thanks!

They're rather more than that! A Cube is an Online Analytical Processing (OLAP) database, as opposed to a normal DB which is an Online Transaction Processing (OLTP) DB. It's a database optimised for reporting - many times faster than querying an OLTP database. For example, I had a DB which took users up to 2 hours to get reports out. We put the data in an OLAP cube and the queries took less than 10 seconds.
This Wikipedia article is a reasonable place to start.
Note that most OLAP databases will not be updated in real time as the OLTP DB is updated, but will need extracts made on a regular basis. Also, designing an OLAP DB is not like designing an OLTP one. You need to analyse the queries the users are going to want, and split your data into Fact tables (the base data which is being reported) and Dimensions (how the users will want the data selected or summed). Not too difficult once you get your head round the idea, though.
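As a rough illustration of the Fact/Dimension split, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a real OLAP store. The table and column names (dim_product, dim_date, fact_sales) and the sample rows are made up for the example; a real cube would be designed around the queries your users actually ask.

```python
import sqlite3

def build_star_schema(conn):
    conn.executescript("""
        -- Dimensions: how users want the data selected or summed (hypothetical names).
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        -- Fact table: the base data being reported.
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            amount REAL
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(1, "Widget"), (2, "Gadget")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(1, 2024, 1), (2, 2024, 2)])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 1, 10.0), (1, 2, 15.0), (2, 1, 7.5), (2, 2, 2.5)])

def sales_by_product_and_month(conn):
    # Sum the fact rows, sliced by two dimensions.
    return conn.execute("""
        SELECT p.name, d.year, d.month, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY p.name, d.year, d.month
        ORDER BY p.name, d.year, d.month
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
rows = sales_by_product_and_month(conn)
```

Each row in `rows` is one (product, year, month, total) cell, which is roughly what a cube would hold pre-computed.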

Related

SSAS Cube processing option which has greater performance

I have SQL 2012 tabular and multidimensional models which are currently being processed through SQL jobs. All the models are processed with the 'Process Full' option every day. However, some of the models take a long time to process. Can anyone tell me which processing option is best and will not affect the performance of the SQL instance?
Without taking a look at your DB it's hard to know, but maybe I can give you a couple of hints:
It depends on how the data in the DB is being updated. If all the fact table data is deleted and inserted every night, Process Full is probably the best way to go. But maybe you can partition the data by date and process only the affected partitions.
On a multidimensional model you should check whether aggregations are taking too much time. If so, you should consider redesigning them.
On tabular models I found that unnecessarily big varchar fields can sometimes take huge amounts of time and memory to process. I found Kasper de Jonge's Server memory Analysis very helpful in identifying this kind of problem:
http://www.powerpivotblog.nl/what-is-using-all-that-memory-on-my-analysis-server-instance/bismservermemoryreport/
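To make the "process only the affected partitions" hint a bit more concrete, here is a small sketch of the selection logic in Python. It does not call SSAS at all; the partition names, date ranges and the list of changed dates are hypothetical, and how you detect changed dates depends entirely on your load process.

```python
from datetime import date

def partitions_to_process(partitions, changed_dates):
    """Return names of partitions whose date range contains a changed date."""
    return [
        name
        for name, (start, end) in partitions.items()
        if any(start <= d <= end for d in changed_dates)
    ]

# Hypothetical monthly partitions of a fact table: name -> (first day, last day).
partitions = {
    "sales_2024_01": (date(2024, 1, 1), date(2024, 1, 31)),
    "sales_2024_02": (date(2024, 2, 1), date(2024, 2, 29)),
    "sales_2024_03": (date(2024, 3, 1), date(2024, 3, 31)),
}

# Dates touched by last night's load (assumed to be known from the ETL job).
changed = [date(2024, 3, 14), date(2024, 3, 15)]

affected = partitions_to_process(partitions, changed)
```

Only the partitions in `affected` would then need processing, instead of a Process Full on the whole model.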

Multidimensional Data Warehouse Alternative for Reporting

I'm a few months into developing a reporting solution. Currently I am loading a relational data warehouse (Fact and Dimension tables) using SSIS. SSAS cubes and dimensions are then created from the relational Data warehouse. I then use SSRS to build reports using MDX queries.
The problem I have is that things are starting to get rather complicated trying to understand how multidimensional modelling works as well as MDX and cubes. Since the organization it's being designed for is rather small, I'm thinking that I should re-evaluate my approach.
I think maybe I should just eliminate SSAS from the picture and simply create reports that report directly off the relational data warehouse using SQL queries. The relational data warehouse could still be loaded nightly to allow up to date data for reporting.
I'm just wondering if that would be a good idea considering I'm not very experienced with data warehousing and SSAS. Also I wanted to know if keeping my relational data warehouse in dimension and fact tables would still work with SQL queries or would I need to redesign the tables. I don't want to make the decision to eliminate SSAS if that will end up causing more headaches or issues.
The reports will not include complicated calculations besides row counts and YTD percentages. For example "How many callers were male?" and "How many callers called for Product A?" Which are then broken down by month.
Any comments or suggestions are much appreciated because I'm starting to feel rather frustrated with trying to get SSAS cubes developed properly.
I was in a similar situation at my company. I had never used SSAS, and I was asked to do research on the benefits of using cubes to do some reporting. It was a pretty steep learning curve because my background is in development not data and reporting. SSAS is most useful when aggregate queries on a relational database are time consuming and if reports need to be broken down into hierarchies that an analyst can use to better understand the state of the business. Since SSAS stores aggregate info, queries of that nature are very quick. If your organization's data is small, the relational queries might be quick enough that you don't really need the benefit of storing aggregates.
Also, you need to take into consideration the maintainability of using SSAS. If you're having trouble figuring out SSAS and MDX, how easy a time will others have? I tried to explain an MDX query I wrote to my boss, who is experienced with SQL, but it's really quite different from relational queries. How easy is it going to be to add more complex reports?
A benefit to using SSAS is it can put the analyst in control of the report. Second, there are great tools and support. Finally, it's pretty easy to deploy and connect.
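For the kind of report described in the question ("How many callers were male?", broken down by month), a plain SQL query over the existing fact and dimension tables may well be enough. Here is a minimal sketch using Python's sqlite3 as a stand-in for the warehouse; the table names (dim_caller, fact_call), columns and sample rows are assumptions for illustration, and the query counts calls rather than distinct callers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_caller (caller_id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE fact_call (caller_id INTEGER, call_month TEXT, product TEXT);
""")
conn.executemany("INSERT INTO dim_caller VALUES (?, ?)",
                 [(1, "M"), (2, "F"), (3, "M")])
conn.executemany("INSERT INTO fact_call VALUES (?, ?, ?)", [
    (1, "2024-01", "A"), (2, "2024-01", "B"), (3, "2024-01", "A"),
    (2, "2024-02", "A"),
])

def male_calls_by_month(conn):
    # Per month: number of calls from male callers and their share of all calls.
    return conn.execute("""
        SELECT f.call_month,
               SUM(CASE WHEN c.gender = 'M' THEN 1 ELSE 0 END) AS male_calls,
               ROUND(100.0 * SUM(CASE WHEN c.gender = 'M' THEN 1 ELSE 0 END)
                     / COUNT(*), 1) AS male_pct
        FROM fact_call f
        JOIN dim_caller c ON c.caller_id = f.caller_id
        GROUP BY f.call_month
        ORDER BY f.call_month
    """).fetchall()

report = male_calls_by_month(conn)
```

The star-schema layout does not need to change for this: the fact table joins to the dimension exactly as the cube would have read it.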
You can remove SSAS from your architecture, yes, because every result you can get from an MDX query against SSAS you can also get from a T-SQL query against your data warehouse, since the cube was built by reading data from the DW. BUT bear in mind the following: the main advantage of an OLAP cube, in my view, is aggregations.
Very simple explanation: let's say you have a fact table called orders with 1 million orders per month. If you want to know how much you sold in a given month using SQL, you need to read row by row and sum the values to produce the total. That's about 1 million reads on your DB. If you have a cube with the proper aggregations configured, that value can be pre-calculated and pre-stored in the cube, so finding out how much you sold in a month takes only one read.
It's a matter of analysing your situation. If you have a small cube, aggregations may not be necessary and you can do fine with SQL, but depending on the situation they can be very helpful.
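The difference between scanning and reading a stored aggregate can be sketched in a few lines of Python. This is only a toy model of what a cube's aggregations do, not how SSAS stores them; the order rows are invented.

```python
from collections import defaultdict

def build_monthly_totals(orders):
    """Scan every order once and keep one pre-computed total per month."""
    totals = defaultdict(float)
    for month, amount in orders:
        totals[month] += amount
    return dict(totals)

# Hypothetical order rows: (month, amount).
orders = [("2024-01", 100.0), ("2024-01", 50.0), ("2024-02", 75.0)]

# Without aggregations: every query re-reads all matching rows.
january_scan = sum(amount for month, amount in orders if month == "2024-01")

# With aggregations: pay for the scan once at processing time,
# then each query is a single lookup.
monthly_totals = build_monthly_totals(orders)
january_lookup = monthly_totals["2024-01"]
```

Both approaches return the same number; the only difference is when the rows get read.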

Operational database schema to data mart schema, table reduction?

I'm starting to study SQL Server Analysis Services and I'm working my way through the training book, as well as the Developer Training Kit. In both, I find suggestions that the number of tables used in an OLAP database (ideally, star schema) is greatly reduced from the production OLTP database.
From the training kit:
We followed the data dimensional methodology to architect the data mart schema. From some 200 tables in the operational database, the data mart schema contained about 10 dimension tables and 2 fact tables.
From what I understand, the operational databases are usually (somewhat) normalised and the data mart schemas are heavily denormalised. I also believe that denormalising data usually involves adding more tables, not less.
I can't see how you can go from 200 tables to 12, unless you only need to report on a subset of data. And if you do only need to report on a subset of data, why can't you just use the appropriate tables in the operational database (unless there are significant performance gains to be made by using a denormalised star schema)?
Denormalizing is exactly the opposite of normalizing a database. In a normalized database everything is split apart into different tables to support concurrent writes to the data. This also has the side effect of storing any given piece of data exactly once (in an ideal third normal form data structure). A drawback of normalizing is that reads take a lot longer because the data is scattered and we need to join tables to make sense of it again (joins are pretty expensive operations).
When we denormalize, we take the data from multiple tables and merge it into one table, so we now have repeating data in that table. The repeating data is useful because we no longer have to join to any other table to get it. Writing to a denormalized data store is normally a bad idea because changing one value can mean a lot of writes to update every repeated copy in the table, whereas it would take only one write in a normalized database.
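Here is a small sketch of that trade-off, using Python's sqlite3 with made-up tables (customer, orders, orders_flat). It shows the same report answered with and without a join, and then how many rows a single change touches in the flattened table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each city is stored once, reads need a join.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Leeds'), (2, 'York');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 8.0);

    -- Denormalized: the city is repeated on every order row, no join needed.
    CREATE TABLE orders_flat AS
        SELECT o.order_id, o.amount, c.city
        FROM orders o JOIN customer c ON c.customer_id = o.customer_id;
""")

joined = conn.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
    GROUP BY c.city ORDER BY c.city
""").fetchall()

flat = conn.execute(
    "SELECT city, SUM(amount) FROM orders_flat GROUP BY city ORDER BY city"
).fetchall()

# The cost of the repetition: renaming a city is one row in `customer`
# but every matching row in `orders_flat`.
rows_touched = conn.execute(
    "UPDATE orders_flat SET city = 'Leeds City' WHERE city = 'Leeds'"
).rowcount
```

`joined` and `flat` hold the same totals; `rows_touched` is the number of copies that had to be rewritten for one logical change.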
OLTP stands for Online Transactional Processing; notice the word Transactional. Transactions are write operations, and the OLTP model is optimized for this. OLAP stands for Online Analytical Processing, with Analytical being the keyword, meaning lots of reads.
Going from 200 tables to 12 in an OLTP to OLAP process will, surprisingly, hold nearly all of the data in the OLTP database plus more. The OLTP database usually does not keep all of the changes over time, but OLAP specializes in this, so you get all of your historical data as well as current data.
The star schema is probably the most common for OLAP data stores, the snowflake schema is also pretty common. You should learn about both and how to properly use them. It's just another great tool in your arsenal.
These two books from IBM will answer your questions much more thoroughly, and they are free PDFs.
http://www.redbooks.ibm.com/abstracts/sg247138.html
http://www.redbooks.ibm.com/abstracts/sg242238.html

Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).
The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.
Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.
My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.
Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?
I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.
The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.
I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.
Relational database systems can definitely handle this kind of load, and if one server or disk starts to strain then most high-end databases have partitioning solutions.
You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.
The very first thing you should always be doing when you see a slow query is pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for your query, or a query that's written so as to be unable to take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.
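To show what "pull up the execution plan" looks like in practice, here is a sketch using SQLite's EXPLAIN QUERY PLAN from Python. SQLite is only a stand-in here (SQL Server, Oracle and DB2 each have their own plan tools and output), and the txn table and ix_txn_business index are hypothetical.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the human-readable steps of SQLite's plan for a query."""
    # The fourth column of each EXPLAIN QUERY PLAN row is the description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn (id INTEGER PRIMARY KEY, business_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO txn (business_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

sql = "SELECT SUM(amount) FROM txn WHERE business_id = 42"

# No usable index yet, so the plan should be a full table scan.
plan_before = query_plan(conn, sql)

# Covering index: holds both the filter column and the summed column.
conn.execute("CREATE INDEX ix_txn_business ON txn (business_id, amount)")

# With the index in place, the plan should become an index search.
plan_after = query_plan(conn, sql)
```

Comparing `plan_before` with `plan_after` is the same exercise as looking for scans in a SQL Server or DB2 plan and checking whether an index removes them.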
90 million rows should be about 90GB (assuming roughly 1 KB per row), so your bottleneck is disk.
If you need these queries rarely, run them as is.
If you need these queries often, you have to split your data and precompute your grouping, summing and averaging on the part of your data that doesn't change (or hasn't changed since last time).
For example, if you process historical data for the last N years up to and including today, you could process it one month (or week, or day) at a time and store the totals and averages somewhere. Then at query time you only need to reprocess the period that includes today.
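A minimal sketch of that idea in Python, with invented figures. One design choice worth noting: it stores a (sum, count) pair per period rather than the average itself, because averages of periods with different row counts cannot simply be averaged again, whereas sums and counts can be added.

```python
def summarise(amounts):
    """Reduce one period's rows to a (sum, count) pair."""
    return (sum(amounts), len(amounts))

def combine(summaries):
    """Merge per-period (sum, count) pairs into an overall total and average."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total, (total / count if count else None)

# Hypothetical stored summaries for closed months, computed once by a batch job.
stored = {
    "2024-01": summarise([100.0, 50.0]),
    "2024-02": summarise([75.0]),
}

# At query time only the period that includes today is re-read.
open_period = summarise([20.0, 5.0])

total, average = combine(list(stored.values()) + [open_period])
```

Here the closed months are never rescanned; only the two rows of the open period are read when the query runs.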
Some RDBMSs give you some control over when views are updated (at select, at source change, offline). If your complicated grouping, summing and averaging is in fact simple enough for the database to understand correctly, it could, in theory, update a few rows in the view at every insert/update/delete in your source tables in reasonable time.
It looks like you're calculating the same data over and over again from normalized data. One way to speed up processing in cases like this is to keep SQL with its nice reporting, relationships, consistency and such, and use an OLAP cube which is recalculated every x minutes. Basically you build a big table of denormalized data on a regular basis which allows quick lookups. The relational data is treated as the master, but the cube allows quick precalculated values to be retrieved from the database at any one point.
If that is only 1/20 of your data, you almost surely need to look into more scalable and efficient solutions, such as Google's BigTable. Have a look at NoSQL.
I personally think that MongoDB is an awesome in-between of NoSQL and RDBMS. It isn't relational, but it provides a lot more features than a simple document store.
In dimensional (Kimball methodology) models in our data warehouse on SQL Server 2005, we regularly have fact tables with that many rows just in a single month partition.
Some things are instant and some things take a while, it depends on the operation and how many stars are being combined and what's going on.
The same models perform poorly on Teradata, but it is my understanding that if we re-model in 3NF, Teradata parallelization will work a lot better. The Teradata installation is many times more expensive than the SQL Server installation, so it just goes to show how much modeling and matching your data and processes to the underlying feature set matters.
Without knowing more about your data, and how it's currently modeled and what indexing choices you've made it's hard to say anything more.

Performance of Aggregate Functions on Large Infrequently Changing Datasets

I need to extract some management information (MI) from data which is updated in overnight batches. I will be using aggregate functions to generate the MI from tables with hundreds of thousands and potentially millions of rows. The information will be displayed on a web page.
The critical factor here is the efficiency of SQL Server's handling of aggregate functions.
I am faced with two choices for generating the data:
Write stored procs/views to generate the information from the raw data which are called every time someone accesses a page
Create tables which are refreshed daily and act as a cache for the MI
What is the best approach to take?
Cache the values during your nightly load if the data doesn't change throughout the day. It will make retrieval much faster. I'm a big fan of summary tables when necessary. In your case, they're necessary!
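A summary table refresh can be as simple as the sketch below. It uses Python's sqlite3 rather than SQL Server, and the table names (raw_orders, mi_summary) and columns are assumptions; in SQL Server the same DELETE plus INSERT ... SELECT would typically live in a stored procedure called at the end of the nightly load.

```python
import sqlite3

def refresh_summary(conn):
    """Rebuild the cache table from the raw data in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM mi_summary")
        conn.execute("""
            INSERT INTO mi_summary (region, order_count, total_amount)
            SELECT region, COUNT(*), SUM(amount)
            FROM raw_orders
            GROUP BY region
        """)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (region TEXT, amount REAL);
    CREATE TABLE mi_summary (
        region TEXT PRIMARY KEY, order_count INTEGER, total_amount REAL
    );
    INSERT INTO raw_orders VALUES ('North', 10.0), ('North', 30.0), ('South', 5.0);
""")

# Would run once, after the overnight batch load has finished.
refresh_summary(conn)

# Page requests read the small cache table, not the raw data.
summary = conn.execute("SELECT * FROM mi_summary ORDER BY region").fetchall()
```

The web page then only ever touches `mi_summary`, so its cost no longer depends on how many raw rows there are.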
One thing you may want to look into, since you own SQL Server, is Analysis Services. By creating a Multidimensional Database, or a cube, these aggregations all happen automagically, and you can drill down and across your data to find numbers at the speed of thought, instead of trying to write reports that capture all of those numbers. Spend 10 minutes and watch the intro video of it, and I think you'll garner a real appreciation for SSAS's power.
It sounds to me like an Analysis Services cube would actually be the best fit for your problem. The cube processing can be run after the data loads occur to aggregate the data for later use.
However, you could also possibly use an indexed view, which, if designed correctly and used in conjunction with the NOEXPAND table hint, can provide a significant performance increase.
SQL 2005 Indexed Views
SQL 2008 Indexed Views