Does anyone know a method for getting a rough size estimate of an OLAP cube based on a star schema data warehouse? Something based on the number of dimensions, the number of records in the dimension tables, the number of fact records, and finally the number of aggregations or distinct records, etc.
The database I am looking at has a fact table of over 20 billion rows and a few dimension tables of 20 million, 70 million and 1.3 billion rows.
Thanks
Nicholas
I can see some roadblocks to creating this estimate. Knowing the row counts and cardinalities of the dimension tables in isolation isn't nearly as important as the relationships between them.
Imagine two low-cardinality dimensions with n and m unique values respectively. Caching OLAP aggregates over those dimensions produces anywhere from n + m values to n * m values depending on how closely the relationship resembles a pure bijection. Given only the information you provided, all you can say is you'll end up with fewer than 3.64 * 10^34 values, which is not very useful.
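If you want something tighter than that theoretical upper bound, one option is to measure how many dimension-key combinations actually occur in the fact table, since that relationship is the missing input. A minimal sketch, using placeholder names (fact_sales, dim_a_key, dim_b_key) that you would substitute with your own star schema:
-- Cardinality of each dimension as actually used in the fact table
SELECT COUNT(DISTINCT dim_a_key) AS n FROM fact_sales;
SELECT COUNT(DISTINCT dim_b_key) AS m FROM fact_sales;
-- Number of (dim_a, dim_b) combinations that really occur, versus the n * m worst case
SELECT COUNT(*) AS actual_pairs
FROM (SELECT DISTINCT dim_a_key, dim_b_key FROM fact_sales) AS pairs;
If actual_pairs comes out near n + m, the cube will be far smaller than the worst case; if it is near n * m, you really are close to the upper bound.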
I'm pessimistic there's an algorithm fast enough that it wouldn't make more sense to generate the cube and weigh it when you're done.
We wrote a research paper that seems relevant:
Kamel Aouiche and Daniel Lemire, A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP, DOLAP 2007, pp. 17-24, 2007.
http://arxiv.org/abs/cs.DB/0703058
Well, you can use a general rule of thumb: Analysis Services data is about 1/4 to 1/3 the size of the same data stored in a relational database.
Edward.
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/6b16d2b2-2913-4714-a21d-07ff91688d11/cube-size-estimation-formula
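As a rough illustration of that rule of thumb (a sketch only; the table names and the 2,000 GB figure are invented assumptions, not measurements from this thread), you would measure the relational footprint first and then scale it:
-- SQL Server: report current storage for the fact and dimension tables
EXEC sp_spaceused 'dbo.FactSales';      -- suppose this reports roughly 2,000 GB
EXEC sp_spaceused 'dbo.DimCustomer';    -- repeat for each dimension table
-- Rule-of-thumb arithmetic, not a measured result:
--   2,000 GB relational * (1/4 to 1/3) is roughly 500 - 660 GB of cube storage, before aggregations.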
Related
I'd like to know if the time taken to query from a table increases linearly with the number of rows in the table. In short, will the following query
SELECT * FROM my_table
take 10 times longer to run on average if the table has 10 times as many rows?
I think there are many factors that affect the speed of the query (like sharding of tables), but I'd like to know if on average we can expect it to be linear or perhaps sub-linear.
I tried running queries on different tables of different sizes and ended up with results that suggest it is sub-linear in time. But I'd like to make sure.
The time taken depends on many factors; it is not governed by any single one. It varies mostly with things like how the table is partitioned and how skewed the data is. More rows will generally make SELECT * take longer to run, but since so many factors are involved, it is difficult to express the behaviour in Big O notation.
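If you want an answer for your own system rather than in general, one practical approach is to time the same full scan at different table sizes and compare. A sketch, assuming a DBMS that supports EXPLAIN ANALYZE (MySQL 8.0+ or PostgreSQL) and the hypothetical my_table from the question:
-- Build a smaller copy and time a full scan of each (MySQL syntax shown; use random() in PostgreSQL)
CREATE TABLE my_table_sample AS SELECT * FROM my_table WHERE RAND() < 0.10;
EXPLAIN ANALYZE SELECT * FROM my_table_sample;
EXPLAIN ANALYZE SELECT * FROM my_table;
Comparing the reported execution times at 10% and 100% of the data gives you an empirical answer for your particular storage engine and hardware.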
Hi, I created a data model with 295K rows and 27 columns, which comes out to about 17 MB when I run a Power Pivot table over it. After adding another table with 227K rows to the data model, the size suddenly jumps to over 42 MB. Can anyone advise me on how to reduce the size of the data model...
Thanks
There are a few options. For reference, I would recommend reading this article, which discusses PowerPivot's compression techniques.
PowerPivot is more about the columns than the rows. The fewer distinct values you have in your columns, the better. So, can you roll up some of your dimensions or get rid of some columns BEFORE importing the data?
Reduce rows by aggregating your data. Do you really need that much detail, for 275K rows, in order to run your analysis? If you take a critical look, you'd be surprised how little you might actually need.
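As an illustration of both points (a sketch only; the table and column names are invented, not taken from your model), you can aggregate and trim columns in the source query before PowerPivot imports anything:
-- Import this query instead of the raw table: fewer columns, fewer distinct values,
-- and detail rows collapsed to one row per day per product.
SELECT CAST(OrderDate AS DATE) AS OrderDate,   -- dropping the time portion leaves far fewer distinct values
       ProductId,
       SUM(Quantity)    AS Quantity,
       SUM(SalesAmount) AS SalesAmount
FROM dbo.Sales
GROUP BY CAST(OrderDate AS DATE), ProductId;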
I have data which is a matrix of integer values which indicate a banded distribution curve.
I'm optimizing for SELECT performance over INSERT performance. There are max 100 bands.
I'll primarily be querying this data by summing or averaging bands across a period of time.
My question is can I achieve better performance by flattening this data across a table with 1 column for each band, or by using a single column representing the band value?
Flattened data
UserId ActivityId DateValue Band1 Band2 Band3....Band100
10001 10002 1/1/2013 1 5 100 200
OR Normalized
UserId ActivityId DateValue Band BandValue
10001 10002 1/1/2013 1 1
10001 10002 1/1/2013 2 5
10001 10002 1/1/2013 3 100
Sample query
SELECT UserId, AVG(Band1), AVG(Band2), AVG(Band3)...AVG(Band100)
FROM ActivityBands
WHERE DateValue > '1/1/2012' AND DateValue < '1/1/2013'
GROUP BY UserId
Store the data in the normalized format.
If you are not getting acceptable performance from this scheme, instead of denormalizing, first consider what indexes you have on the table. You're likely missing an index that would make this perform similarly to the denormalized table. Next, try writing a query that retrieves data from the normalized table so that the result set looks like the denormalized table, and use that query to create an indexed view. This will give you select performance identical to that of the denormalized table, while retaining the data organization benefits of proper normalization.
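A minimal sketch of what that could look like (the index choice and the pivoting pattern are assumptions for illustration, and I'm assuming the normalized table keeps the name ActivityBands; whether the query can also become an indexed view depends on SQL Server's indexed-view restrictions):
-- Covering index so the date-range, per-user aggregation avoids key lookups
CREATE NONCLUSTERED INDEX IX_ActivityBands_Date_User
    ON ActivityBands (DateValue, UserId)
    INCLUDE (Band, BandValue);
-- Pivot the normalized rows back into one row per user, averaging each band
SELECT UserId,
       AVG(CASE WHEN Band = 1   THEN BandValue END) AS Band1,
       AVG(CASE WHEN Band = 2   THEN BandValue END) AS Band2,
       AVG(CASE WHEN Band = 100 THEN BandValue END) AS Band100   -- ...and so on for the bands in between
FROM ActivityBands
WHERE DateValue > '1/1/2012' AND DateValue < '1/1/2013'
GROUP BY UserId;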
Denormalization optimizes exactly one means of accessing the data, at the expense of (almost all) others.
If you have only one access method that is performance critical, denormalization is likely to help, though proper index selection is of greater benefit. However, if you have multiple performance-critical access paths to the data, you are better off seeking other optimizations.
Creating an appropriate clustered index, putting your non-clustered indexes on SSDs, and increasing memory on your server are all techniques that improve performance for (almost) all accesses, rather than trading off between them.
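For example, a clustered index matched to the sample query's access pattern might look like this (a sketch; the key order is an assumption based on the date-range-then-user aggregation shown above, applied to the normalized table):
-- Cluster the normalized table so a date-range scan reads contiguous pages
CREATE CLUSTERED INDEX CIX_ActivityBands
    ON ActivityBands (DateValue, UserId, Band);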
If you are accessing all (or most) of the bands in each row, then the denormalized form is better. Much better in my experience.
The reason is simple. The data in the pages is much smaller, so many fewer pages need to be read to satisfy the query. The overhead of storing one band per row (the repeated key columns plus row header) is on the order of 32 bytes, so 100 bands come to about 3,200 bytes. Stored within a single record, the row size is about 100*4 + 8, or roughly 408 bytes. If your query reads a significant number of records, this reduces the I/O requirements significantly.
There is a caveat. If you only read one record's worth of data, the 100 normalized rows can fit on a single page in SQL, just as the one denormalized record fits on a single page, so the I/O for a single page read could be identical in the two cases. The benefit arises as you read more and more data.
Your sample query is reading hundreds or thousands of rows, so denormalization should benefit such a query.
If you want to fetch data very fast, then you should flatten the table and use indexes to improve selection over a broad column range, similar to what you have proposed. However, if you are more interested in fast updates, then third or fourth normal form combined with a lot of table joins should offer better performance.
I have an SSAS cube in which one of my dimensions has 5 million records. When I try to view data for the dimension, the report or Excel pivot takes a very long time and performance is poor. I can't categorize that particular dimension's data. The only way I can think of to restrict the data is to select the top 10K rows of the dimension that have metric values. Apart from restricting it to the top 10K dimension records, can anyone please suggest other possibilities?
Have you set up aggregations? I would venture to guess that the majority of the time being spent getting your data to a viewable point has to do with your measures. If I were you, I would try adding aggregations or raising the aggregation percentage in order to relieve some of the pressure at query time by shifting that workload to the processing time of the dimension/cube.
Generally, people set their aggregation levels at about 30% to start.
If you have done this already, I would think about upgrading your hardware on the server that your cube sits on. (depending on what you already have)
These are just suggestions as it could also be an issue in your cube design that is causing a lengthy runtime.
I would suggest you create a hierarchy rather than showing all 5 million records flat. Group by a substring at Level 1 (and, if required, a few more characters at Level 2), with the actual members falling under that group. For example:
Level 1   Value
A         Apple
A         Ant
This means you won't be showing all 5 million records at once, and it also makes aggregations much more effective.
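One way to produce that Level 1 attribute is to add a derived grouping column to the view or data source view that feeds the dimension. A sketch with placeholder names (DimProduct, ProductName):
-- The first character of the member name becomes Level 1 of the hierarchy
SELECT ProductKey,
       ProductName,
       UPPER(LEFT(ProductName, 1)) AS NameGroup   -- 'Apple' and 'Ant' both fall under 'A'
FROM dbo.DimProduct;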
I built an analytics engine that pulls 50-100 rows of raw data from my database (let's call it raw_table), runs a bunch of statistical measurements on it in PHP, and then comes up with exactly 140 datapoints that I then need to store in another table (let's call it results_table). All of these datapoints are very small ints ("40", "2.23", "-1024" are good examples of the types of data).
I know the maximum number of columns for MySQL is quite high (4000+), but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) If fewer columns is better, the 140 datapoints could be broken up into 20 rows of 7 datapoints, all with the same 'experiment_id'. HOWEVER, I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc.), so I wouldn't think this would perform better than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40", "2.23", "-1024" are good examples of what will be stored), I'm thinking SMALLINT for the column type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage of storing more rows (i.e. the normalized approach) depends on design and maintenance considerations in the face of change.
It also depends on whether the 140 columns always have the same meaning or whether the meaning differs per experiment; in other words, on properly modeling the data according to normalization rules, i.e. how the data relates to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35, etc. It could be infinitesimally quicker for one shape, of course, but have you considered the extra complexity in the PHP code to deal with a different shape?
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 million rows; however, if you want to retrieve a single data point from lots of experiments, then it will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of the rows makes little difference to the number of I/O operations required. If we assume that your 1 billion datapoints don't fit in RAM (which is NOT a safe assumption nowadays), the resulting performance may be approximately the same.
It is probably better database design to use the few-column (one datapoint per row) layout; but the many-column layout will use less disc space and perhaps be faster to populate.
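To make the two layouts concrete, here is a minimal MySQL sketch of both (the names are illustrative; SMALLINT is the asker's suggestion and would have to become DECIMAL or FLOAT if values like 2.23 must be stored exactly):
-- Wide layout: one row per experiment, 140 value columns
CREATE TABLE results_wide (
    experiment_id INT UNSIGNED NOT NULL PRIMARY KEY,
    value_001 SMALLINT NOT NULL,
    value_002 SMALLINT NOT NULL,
    -- ... value_003 through value_139 elided ...
    value_140 SMALLINT NOT NULL
) ENGINE=InnoDB;
-- Narrow layout: one row per datapoint, as described above
CREATE TABLE results_narrow (
    experiment_id INT UNSIGNED NOT NULL,
    datapoint_id  TINYINT UNSIGNED NOT NULL,     -- 1..140
    value         SMALLINT NOT NULL,
    PRIMARY KEY (experiment_id, datapoint_id)    -- InnoDB clusters on this, keeping one experiment's points together
) ENGINE=InnoDB;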