Reduce the size of Very Large Excel Data Model - powerpivot

Hi, I created a data model with 295K rows and 27 columns, which comes out to about 17 MB when I run a Power Pivot table. After adding another table with 227K rows to the data model, the size suddenly jumps to over 42 MB. Can anyone advise me on how to reduce the size of the data model?
Thanks

There are a few options. For reference, I would recommend reading this article, which discusses Power Pivot's compression techniques.
Power Pivot is more about the columns than the rows. The fewer distinct values you have in your columns, the better. So can you roll up some of your dimensions or get rid of some columns BEFORE importing the data?
Reduce rows by aggregating your data. Do you really need that much detail, across 275K rows, to run your analysis? You'd be surprised, if you take a critical look, how little you might need.
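As a rough sketch of what pre-aggregating before import might look like (the SalesDetail table and its columns are made-up names; substitute your own source):

```sql
-- A minimal sketch, assuming a hypothetical SalesDetail source table:
-- roll daily detail up to one row per product per month before importing
-- it into the data model, so Power Pivot sees far fewer rows.
SELECT
    ProductID,
    DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1) AS SaleMonth,
    SUM(Amount) AS TotalAmount,
    COUNT(*)    AS SaleCount
FROM SalesDetail
GROUP BY
    ProductID,
    DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1);
```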

Related

SSAS large dimension

I have an SSAS cube in which one of my dimensions has 5 million records. When I try to view data for the dimension, the report or Excel pivot takes a long time and performance is poor. I can't categorize that particular dimension's data. The only way I can think of to restrict the data is to select the top 10K rows from the dimension that have metric values. Apart from restricting it to the top 10K dimension records, can anyone please suggest other possibilities?
Have you set up aggregations? I would venture to guess that the majority of the time spent getting your data to a viewing point has to do with your measures. If I were you, I would try adding aggregations or upping the aggregation percentage to alleviate some of the pressure at query time by passing this workload to the processing time of the dimension/cube.
Generally, people set their aggregation levels at about 30% to start.
If you have done this already, I would think about upgrading the hardware on the server your cube sits on (depending on what you already have).
These are just suggestions, as it could also be an issue in your cube design that is causing a lengthy runtime.
I would suggest you create a hierarchy for showing the 5 million records: group by a substring at Level 1 (and, if required, a few more characters at Level 2), with the detail records falling under that group. For example:
Level 1 | Value
A       | Apple
A       | Ant
This way you won't be showing all 5 million records at once, and it also makes aggregations much more effective.
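One way to get those grouping attributes is to derive them in the view or query that feeds the dimension, along these lines (DimProduct and ProductName are made-up names):

```sql
-- A minimal sketch, assuming a hypothetical DimProduct dimension table:
-- derive Level 1 (first character) and Level 2 (first two characters)
-- attributes so the cube hierarchy can group the 5 million members.
SELECT
    ProductKey,
    ProductName,
    LEFT(ProductName, 1) AS Level1,
    LEFT(ProductName, 2) AS Level2
FROM DimProduct;
```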

Large Denormalized Table Optimization

I have a single large denormalized table that mirrors the makeup of a fixed-length flat file that is loaded yearly: 112 columns and 400,000 records. I have a unique clustered index on the 3 columns that make up the WHERE clause of the query run most often against this table. Index fragmentation is .01. Performance on that query is good, sub-second. However, returning all the records takes almost 2 minutes. The execution plan shows 100% of the cost is on a Clustered Index Scan (not a seek).
There are no queries that require a join (due to the denorm). The table is used for reporting. All fields are type nvarchar (of the length of the field in the data file).
Beyond normalizing the table, what else can I do to improve performance?
Try paginating the query. You can split the results into, let's say, groups of 100 rows. That way, your users will see the results pretty quickly. Also, if they don't need to see all the data every time they view the results, it will greatly cut down the amount of data retrieved.
Beyond this, adding parameters to the query that filter the data will reduce the amount of data returned.
This post is a good way to get started with pagination: SQL Pagination Query with order by
Just replace the "50" and "100" in the answer to use page variables and you're good to go.
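For example, a paging sketch using SQL Server's OFFSET/FETCH (2012 and later); @PageNumber, @PageSize, ReportTable, and the key columns are placeholder names:

```sql
-- A minimal paging sketch (SQL Server 2012+), with made-up object names.
DECLARE @PageNumber INT = 1;
DECLARE @PageSize   INT = 100;

SELECT *
FROM ReportTable
ORDER BY Col1, Col2, Col3              -- the clustered key, for a stable order
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
```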
Here are three ideas. First, if you don't need nvarchar, switch these to varchar. That will halve the storage requirement and should make things go faster.
Second, be sure that the lengths of the fields are less than nvarchar(4000)/varchar(8000). Anything larger causes the values to be stored on a separate page, increasing retrieval time.
Third, you don't say how you are retrieving the data. If you are bringing it back into another tool, such as Excel, or through ODBC, there may be other performance bottlenecks.
In the end, though, you are retrieving a large amount of data, so you should expect the time to be much longer than for retrieving just a handful of rows.
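If you go the varchar route, here is a rough sketch of how you might check and convert columns in SQL Server (the table and column names are made up, and you should confirm the data contains no Unicode-only characters before converting):

```sql
-- A minimal sketch: list nvarchar columns and their declared lengths
-- (sys.columns.max_length is in bytes, so divide by 2; -1 means MAX).
SELECT c.name, c.max_length / 2 AS declared_length
FROM sys.columns AS c
JOIN sys.types   AS t ON c.user_type_id = t.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.ReportTable')   -- made-up table name
  AND t.name = 'nvarchar';

-- Convert a column once you've confirmed it holds only single-byte data.
ALTER TABLE dbo.ReportTable ALTER COLUMN SomeColumn varchar(50) NULL;
```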
When you ask for all rows, you'll always get a scan.
400,000 rows X 112 columns X 17 bytes per column is 761,600,000 bytes. (I pulled 17 out of thin air.) Taking two minutes to move 3/4 of a gig across the network isn't bad. That's roughly the throughput of my server's scheduled backup to disk.
Do you have money for a faster network?
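For a sense of scale, using the same back-of-the-envelope numbers:

$$
\frac{761{,}600{,}000\ \text{bytes}}{120\ \text{s}} \approx 6.3\ \text{MB/s} \approx 51\ \text{Mbit/s},
$$

which is roughly half the capacity of a 100 Mbit/s link before any protocol overhead.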

Optimal row size to fetch at a time from a big table

I have a very big table containing around 20 million rows. I have to fetch some 4 million rows from this table based on some filtering criteria.
All the columns in the filtering criteria are covered by an index, and the table stats are up to date.
It has been suggested that, instead of loading all the rows in a single go, I use a batch size (say, 80,000 rows at a time), and that this will be faster than loading all the rows at once.
Does this idea make sense?
If it does, what would be the optimal number of rows to load at a time?
It can be much faster than a single SQL statement.
Split the data using the primary key.
Batch size: it depends on the row length and processing time. Start with 10,000.
Run the job in parallel threads if possible.
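A minimal keyset-paging sketch of that idea, in SQL Server syntax; BigTable, Id, Col1/Col2, and the filter predicate are placeholders for your real names:

```sql
-- Walk the primary key in batches so each pass is an index seek on Id
-- rather than a repeated scan. All object names here are made up.
CREATE TABLE #Batch (Id BIGINT PRIMARY KEY, Col1 INT, Col2 VARCHAR(50));

DECLARE @LastId    BIGINT = 0;
DECLARE @BatchSize INT    = 80000;

WHILE 1 = 1
BEGIN
    TRUNCATE TABLE #Batch;

    INSERT INTO #Batch (Id, Col1, Col2)
    SELECT TOP (@BatchSize) Id, Col1, Col2
    FROM BigTable
    WHERE Id > @LastId
      AND SomeFilterColumn = 'some value'   -- your filtering criteria
    ORDER BY Id;

    IF @@ROWCOUNT = 0 BREAK;

    SELECT @LastId = MAX(Id) FROM #Batch;

    -- ... process or bulk-copy this batch here ...
END;

DROP TABLE #Batch;
```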
Use SSIS to manipulate your data... it does everything you are looking for, such as threading and optimizations around load sizing and caching.
Spin up a cube or look into Business Intelligence Data Warehouse Tools...

How to calculate the likely size of an OLAP cube

Does anyone know a method to get a rough size estimate for an OLAP cube based on a star-schema data warehouse? Something based on the number of dimensions, the number of records in the dimension tables, the number of fact records, and finally the number of aggregations or distinct records, etc.
The database I am looking at has a fact table of over 20 billion rows and a few dimension tables of 20 million, 70 million and 1.3 billion rows.
Thanks
Nicholas
I can see some roadblocks to creating this estimate. Knowing the row counts and cardinalities of the dimension tables in isolation isn't nearly as important as the relationships between them.
Imagine two low-cardinality dimensions with n and m unique values respectively. Caching OLAP aggregates over those dimensions produces anywhere from n + m values to n * m values depending on how closely the relationship resembles a pure bijection. Given only the information you provided, all you can say is you'll end up with fewer than 3.64 * 10^34 values, which is not very useful.
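For what it's worth, that figure appears to be simply the product of the four row counts given in the question:

$$
(2\times10^{10}) \times (2\times10^{7}) \times (7\times10^{7}) \times (1.3\times10^{9}) = 3.64\times10^{34}.
$$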
I'm pessimistic there's an algorithm fast enough that it wouldn't make more sense to generate the cube and weigh it when you're done.
We wrote a research paper that seems relevant:
Kamel Aouiche and Daniel Lemire, A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP, DOLAP 2007, pp. 17-24, 2007.
http://arxiv.org/abs/cs.DB/0703058
Well, you can use a general rule of thumb: Analysis Services data is about 1/4 to 1/3 the size of the same data stored in a relational database.
Edward.
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/6b16d2b2-2913-4714-a21d-07ff91688d11/cube-size-estimation-formula

mysql - Creating rows vs. columns performance

I built an analytics engine that pulls 50-100 rows of raw data from my database (let's call it raw_table), runs a bunch of statistical measurements on it in PHP, and then comes up with exactly 140 datapoints that I then need to store in another table (let's call it results_table). All of these data points are very small numbers ("40", "2.23", "-1024" are good examples of the types of data).
I know the maximum # of columns for MySQL is quite high (4000+), but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) The 140 datapoints could be broken up into 20 rows of 7 data points, all with the same 'experiment_id', if fewer columns is better. HOWEVER, I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc.), so I wouldn't think this would perform better than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (which would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40","2.23","-1024" are good examples of what will be stored) I'm thinking smallint for the structure type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.
I think the advantage to storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
Also consider whether the 140 columns have the same meaning for every experiment or whether it differs per experiment, and model the data properly according to normalization rules, i.e. how the data relates to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.
You have 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35, etc. It could be infinitesimally quicker for one shape, of course, but have you considered the extra complexity in the PHP code needed to deal with a different shape?
Do you have a verified bottleneck, or is this just random premature optimisation?
You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 million rows; however, if you want to retrieve a single data point from lots of experiments, you will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints doesn't fit in ram (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
It is probably better database design to use few columns, but it will use less disk space, and perhaps be faster to populate, if you use lots of columns.
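To make the two layouts discussed above concrete, here is a minimal sketch in MySQL; the table and column names are made up, and SMALLINT follows the question's own suggestion (you would need DECIMAL or FLOAT if values like 2.23 must be kept exact):

```sql
-- Option A: one wide row per experiment (140 result columns).
CREATE TABLE results_wide (
    experiment_id INT UNSIGNED NOT NULL PRIMARY KEY,
    metric_001    SMALLINT NOT NULL,
    metric_002    SMALLINT NOT NULL
    -- ... metric_003 through metric_140 ...
) ENGINE=InnoDB;

-- Option B: one narrow row per datapoint (experiment_id, datapoint_id, value).
CREATE TABLE results_narrow (
    experiment_id INT UNSIGNED NOT NULL,
    datapoint_id  SMALLINT UNSIGNED NOT NULL,
    value         SMALLINT NOT NULL,
    PRIMARY KEY (experiment_id, datapoint_id)
) ENGINE=InnoDB;

-- Fetching all 140 datapoints for one experiment is a single-key lookup
-- in either layout:
SELECT * FROM results_wide   WHERE experiment_id = 42;
SELECT * FROM results_narrow WHERE experiment_id = 42;
```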