SSAS Tabular in-memory compression

I am testing SSAS Tabular on my existing data warehouse. I read that compression of data in memory would be fantastic, up to 10 times. The warehouse weighs about 600 MB, and the analytical model has about 60 measures (mostly row counts and basic calculations). In SQL Server Management Studio I checked the estimated size of the analytical database: ~1000 MB. Not what I expected (I was hoping for 100 MB at most).
I checked the memory usage of the msmdsrv.exe process using plain Resource Monitor. To my surprise, after a full process of the database, memory consumption of the msmdsrv process jumped from 200 MB to 1600 MB. I deployed a second instance of the same model connected to the same source, and it grew to over 2500 MB. So the estimated size was in fact correct.
The data warehouse is quite typical - star schema, facts and dimensions, nothing fancy.
Why was the data not compressed in any way? How is it possible that it takes even more memory than the uncompressed source warehouse?
I will be most grateful for any tips on this mystery :)

You should read and watch Marco Russo's materials about VertiPaq Analyzer. It will show you which parts of your model take up the most memory.
https://www.sqlbi.com/articles/data-model-size-with-vertipaq-analyzer/
https://www.sqlbi.com/tv/checking-model-size-using-vertipaq-analyzer-in-dax-studio/
And maybe this can shed some light:
https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
The Tabular model is based on a column store, which means that columns with many unique values get lower compression (e.g. an incremental ID column like TransactionID). Some tips:
-> Omit high-cardinality columns where possible.
-> Split columns when possible. If you have DateTime columns, split them into two parts (date and time); each part then has many more repeated values. A sketch of this follows the list.
-> The sort order of data in partitions can affect the compression rate [Run Length Encoding (RLE)].
-> Use a measure (it takes no space) instead of a calculated column (which does take space).
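For example, a minimal sketch of the date/time split, assuming SQL Server as the source and a hypothetical dbo.FactSales table with an OrderTimestamp column:

-- Split a datetime into separate date and time columns in the source view;
-- each part has far fewer distinct values, which helps VertiPaq compression.
SELECT
    CAST(OrderTimestamp AS date) AS OrderDate,      -- ~365 distinct values per year
    CAST(OrderTimestamp AS time(0)) AS OrderTime,   -- at most 86,400 distinct values
    SalesAmount
FROM dbo.FactSales;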

Related

SQL Server row length calculator

I am looking for an up to date tool to accurately calculate the total row size and page-density of any SQL table definition for SQL Server 2005+.
Please note that there are plenty of resources concerning calculating sizes of rows in existing tables, estimating techniques for sizing, etc. However, I am designing tables and have some options about column size which I am trying to balance with efficient data access - meaning that I can relocate less-frequently accessed long text into dedicated tables so that the most frequent queries operate at optimum speed.
Ideally there would be an online facility where a create statement can be cut and pasted, or a sproc I can run on a dev db.
The answer is a simple one until you start doing proper table design and balancing that against joins, FK data, and disk access.
I'd have a look and see how many data pages you are using, and remember that a read fetches an extent (8 data pages) from disk, not only the data page you are looking for. Then there are options such as data compression in your table, sparse columns, out-of-row storage for large types, and variable-length characters.
It's not about how much data is in a column; it's really about how many data reads and how much CPU you need to get it. You can test this by executing a query and looking at the actual query plan.
As for space used, you can use a stored procedure called sp_spaceused; a minimal usage sketch is below.
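A sketch (the table name is just a placeholder):

-- Space used by the current database as a whole
EXEC sp_spaceused;

-- Space used by a single table
EXEC sp_spaceused 'dbo.MyTable';

-- Per-query reads and CPU: enable these, then check the Messages tab
SET STATISTICS IO, TIME ON;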
Hope it helps
Walter

How to improve the performance of SAS Enterprise Guide 4

I have a table with almost 200,000,000 records. It takes a long long time to query.
Any idea about improving the performance?
Consider adding indexes on ID-type columns in that table.
BTW, this has nothing to do with SAS EG performance, but everything to do with the underlying BASE SAS engine.
In addition to indexing you should also consider:
Compression. When you build the dataset, make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk, resulting in less disk I/O (the slowest part of querying).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20, etc.
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPDE libnames any time a dataset grows beyond 10 GB. No additional SAS modules are required; this is enabled as part of Base SAS.

Is it sensible to store long, unique text strings in OLAP cubes for drillthrough retrieval (especially in SSAS)?

I'm motivated to store some long text strings in an OLAP cube, long on the order of 1,000s or 10,000s of characters -- but I'm wondering if this will lead me astray. (I'm also curious to learn a little more about how OLAP engines handle strings.) The particular use case I have in mind is that I have a unique, pre-existing "record description" for each of my OLAP facts, and I want to put those descriptions in the cube so that I have the option to get them back when I do a DRILLTHROUGH operation. In contrast, I don't need the record descriptions to appear when doing normal pivot table / aggregate type operations. (The descriptions are too long to display sensibly in a pivot table, plus each fact has a unique description, meaning it doesn't make sense to aggregate over descriptions.) My current dataset has around 700,000 facts, though I'm also curious if the answer would change for larger datasets.
My hope was that an OLAP server could do something sensible if I put these long strings in a cube. In the Sql Server / SSAS case in particular, I thought perhaps I'd put them in a dimension marked as ROLAP, to save memory usage, and use a degenerate dimension (aka a "fact dimension", in SSAS terminology), to avoid needless ETL complexities. But I'm curious if this would be regarded as a horrible practice for some reason, or if there are any hidden gotchas.
Update: My example use case is where you have a string associated with each OLAP fact. But it might also be instructive to consider the case where the strings are instead associated with each particular value of a particular dimension. (e.g. Suppose you had a Company dimension and each company had a somewhat lengthy Company Description string.)
Here's what I've been able to uncover about the implications of storing such strings in SSAS, especially SSAS 2008. Where I consider data structures, it's exclusively focused on MOLAP storage, which is what I've been experimenting with.
First, standard MS ETL (extract/transform/load, i.e. data import) tools like Business Intelligence Development Studio may try to prevent you from importing large textfields, especially varchar(max) fields, but there is a workaround, and it's proven effective for me. (For BIDS it involves manually setting the DataSize element in an XML file, potentially to the magic size of 163315555 bytes. Props to Matija Lah for figuring this out.)
Second, as far as I can tell, storing lots of long, unique strings shouldn't wreak havoc on the on-disk data structures used by SSAS. Also, the size of the string data on disk should be of the same order of magnitude as the string data in your data source. Here's some rough info on how SSAS handles strings:
The core OLAP data structures (e.g. for the attributes of a dimension, or for the facts of a measure group) don't directly contain strings; instead they contain offsets into "string store" files (extensions .ksstore, .asstore, .bsstore, or .string.data), which contain the actual string data.
Within a given string store, each string is represented only once. If several rows in your source data tables contain duplicate strings, then at the SSAS/MOLAP level that will translate into duplicated file offsets, rather than duplicated string values.
If your source string has length n, then the corresponding data structure in the string store has 8-ish bytes of overhead, plus 2 bytes per character, i.e. 2n bytes for the characters themselves. (Strings are stored in 2-byte Unicode format in SSAS.)
For some fantastic detail about this stuff, I suggest the book Microsoft SQL Server 2008 Analysis Services Unleashed, in particular chapter 20, "The Physical Data Model".
At least in my experiments, string store files do not seem to be compressed -- at least they're not notably smaller than an uncompressed string store would be.
I've verified experimentally that text data takes the same order of magnitude of bytes whether stored in SSAS MOLAP or in a SQL table. In particular, I did a "select sum(len(myfield)) from mytable" on one of my dimension tables, and then compared that to the size of the corresponding attribute's files in my SSAS data directory. The size was 172MB in SQL and 304MB in SSAS. (The SQL size was 147MB if I summed only unique strings, rather than all strings.) In my case the size difference was mostly explained by character encoding; my source SQL data is stored with one byte per character, whereas SSAS stores all strings with two bytes per character. I found that the .ksstore file totally dominated all the other files associated with this attribute in size, regardless of whether or not I optimized the attribute via AttributeHierarchyOptimizedState=FullyOptimized.
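Putting the earlier numbers together, you can roughly estimate the string store size straight from the source table; a sketch, reusing the mytable/myfield names from above:

-- ~8 bytes of overhead per distinct string, plus 2 bytes per character
-- (each string is stored only once per string store).
SELECT SUM(8 + 2 * LEN(myfield)) AS EstimatedStringStoreBytes
FROM (SELECT DISTINCT myfield FROM mytable) AS uniq;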
Third, there is a 4GB cap on the size of string store files, which limits the amount of unique text that can be associated, say, with a particular dimension/attribute. In my case I'm less than 10% of the way to the limit, but this might affect some people. (Quick order-of-magnitude calculation for the original post: 1M facts * 10,000 bytes per fact = 10GB-ish worth of text.) If you do hit this limit, you'll apparently hit it at cube "processing" time. Apparently it applies even to ROLAP dimensions. There may be some hacks to work around this. Note that SQL Server 2012 may remove this 4GB limitation.
Fourth, it seems that if long unique strings create a problem in SSAS, they do so at the level of the in-memory representation. One potential problem (that I haven't looked into in detail) is that having these extra strings cached in memory will keep SSAS from keeping other important data structures in memory, and thus degrade performance. Another problem, suggested by the book The Microsoft Data Warehouse Toolkit (though I haven't yet found this claim elsewhere), is that SSAS pads strings out to their full declared width in its in-memory data structures:
"The relational database stores variable length string columns ... However, other parts of the SQL Server toolset will fill these columns out to their full width. Notably, Integration Services and Analysis Services pad string columns with spaces as they are loaded into memory. Both Integration Services and Analysis Services love physical memory, so there's a cost to declaring string columns that are far wider than they need to be."
To conclude, so far storing my long string data in the cube seems convenient, and I haven't uncovered any reasons to expect disaster, so I'm giving it a try. I'll try to provide an update if things don't work out.
You could store the values relationally in a table and then create an integer surrogate key.
Add the integer surrogate to your UDM and create an SSRS drillthrough action
http://msdn.microsoft.com/en-US/library/ms174526(v=SQL.90).aspx
that looks up the text field by the key value.
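A sketch of that approach, with hypothetical table and column names:

-- Long descriptions stay relational; only the integer surrogate key
-- travels into the UDM.
CREATE TABLE dbo.RecordDescription
(
    DescriptionKey  INT IDENTITY(1,1) PRIMARY KEY,
    DescriptionText VARCHAR(MAX) NOT NULL
);

-- The drillthrough action's report query then looks the text up by key:
-- SELECT DescriptionText FROM dbo.RecordDescription
-- WHERE DescriptionKey = @DescriptionKey;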
I would use a degenerate dimension, but hide it via SSAS until requested via a Drillthrough Action.
I can't guide you on the internal storage of strings for the AS engine, but as for storing them in SQL, I would make sure your varchar(MAX) column is at the end of your columns, to speed up the SQL engine's scanning of those rows.
At 700,000 rows, with enough memory and disk I/O, you aren't taxing SQL much.
I haven't worked through all the possibilities described in and linked to from it yet, but this thread from 2007 is on the same topic and seems pretty relevant:
http://www.sqldev.org/sql-server-analysis-services/discussion-about-how-to-create-a-fact-drillthrough-dimension-the-best-way-34857.shtml
One new possibility raised here is that, rather than treating text stored in the fact table as a degenerate dimension, you could potentially treat it as a text-valued (vs numeric-valued) measure. Initial googling suggests that SSAS might support this but there are some tricks to getting this right, e.g. you probably want to disable aggregation for that measure, you might need to do something non-standard to get the field to appear in a drillthrough, and it might require SSAS enterprise edition.

Pivoting Large Data Quickly

We are developing a product which can be used for developing predictive models and for slicing and dicing data in order to provide BI.
We have two kinds of data access requirements.
For predictive modeling, we need to read data on a daily basis and do it row by row. For this, a normal SQL Server database is sufficient and we are not seeing any issues.
For slicing and dicing data of huge sizes, like 1 GB with, say, 300 M rows, we want to pivot that data easily with minimal response time.
The current SQL database has response time issues with this.
We would like our product to run on any normal client machine with 2 GB RAM and a Core 2 Duo processor.
I would like to know how I should store this data and how I can create a pivoting experience for each of the dimensions.
Ideally we will have, say, daily sales by salesperson, by region, and by product for a large corporation. Then we would like to slice and dice it based on any dimension, and also be able to perform aggregations, unique-value counts, maximum, minimum, average, and some other statistical functions.
I would build an in-memory cube on top of that data. To give you an example, icCube has sub-second response times for 3-4 measures over 50 M rows on a single Core i5 - without any cache or pre-aggregation (i.e., this response time is constant across all the dimensions).
Contact us directly for more details about how to integrate it into your product.
You could also use PowerPivot to do this. It is a free add-in for Excel 2010, which allows large data sets to be handled, sliced and diced, etc.
If you want to code around it, you can connect to the PowerPivot database (effectively an SSAS cube) using the SSAS database connector.
Hope that is of some use.

Do relational databases provide a feasible backend for a process historian?

In the process industry, lots of data is read, often at a high frequency, from several different data sources, such as NIR instruments as well as common instruments for pH, temperature, and pressure measurements. This data is often stored in a process historian, usually for a long time.
Due to this, process historians have different requirements than relational databases. Most queries to a process historian require either time stamps or time ranges to operate on, as well as a set of variables of interest.
The workload: frequent and many INSERTs, many SELECTs, few or no UPDATEs, almost no DELETEs.
Q1. Are relational databases a good backend for a process historian?
A very naive implementation of a process historian in SQL could be something like this.
+------------------------------------------------+
| Variable |
+------------------------------------------------+
| Id : integer primary key |
| Name : nvarchar(32) |
+------------------------------------------------+
+------------------------------------------------+
| Data |
+------------------------------------------------+
| Id : integer primary key |
| Time : datetime |
| VariableId : integer foreign key (Variable.Id) |
| Value : float |
+------------------------------------------------+
This structure is very simple, but probably slow for normal process historian operations, as it lacks "sufficient" indexes.
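For instance, the typical historian query shape ("these variables over this time range") would want something like the following, a sketch in SQL Server syntax:

-- Composite index matching time-range queries per variable; INCLUDE makes it
-- covering, so range scans never have to touch the base table.
CREATE INDEX IX_Data_Variable_Time
    ON Data (VariableId, Time)
    INCLUDE (Value);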
But, for example, if the Variable table consisted of 1,000 rows (a rather optimistic number), and data for all 1,000 variables were sampled once per minute (also an optimistic number), then the Data table would grow by 1,440,000 rows per day. Let's continue the example and estimate that each row takes about 16 bytes, which gives roughly 23 megabytes per day, not counting additional space for indexes and other overhead.
23 megabytes as such perhaps isn't that much, but keep in mind that the numbers of variables and samples in the example were optimistic, and that the system will need to be operational 24/7/365.
Of course, archiving and compression comes to mind.
Q2. Is there a better way to accomplish this? Perhaps using some other table structure?
I work with a SQL Server 2008 database that has similar characteristics; heavy on insertion and selection, light on update/delete. About 100,000 "nodes", all sampling at least once per hour. And there's a twist: all of the incoming data for each "node" needs to be correlated against the history and used for validation, forecasting, etc. Oh, and there's another twist: the data needs to be represented in 4 different ways, so there are essentially 4 different copies of this data, none of which can be derived from any of the other data with reasonable accuracy and within reasonable time. 23 megabytes would be a cakewalk; we're talking hundreds-of-gigabytes to terabytes here.
You'll learn a lot about scale in the process, about what techniques work and what don't, but modern SQL databases are definitely up to the task. This system that I just described? It's running on a 5-year-old IBM xSeries with 2 GB of RAM and a RAID 5 array, and it performs admirably, nobody has to wait more than a few seconds for even the most complex queries.
You'll need to optimize, of course. You'll need to denormalize frequently, and maintain pre-computed aggregates (or a data warehouse) if that's part of your reporting requirement. You might need to think outside the box a little: for example, we use a number of custom CLR types for raw data storage and CLR aggregates/functions for some of the more unusual transactional reports. SQL Server and other DB engines might not offer everything you need up-front, but you can work around their limitations.
You'll also want to cache - heavily. Maintain hourly, daily, weekly summaries. Invest in a front-end server with plenty of memory and cache as many reports as you can. This is in addition to whatever data warehousing solution you come up with if applicable.
One of the things you'll probably want to get rid of is that "Id" key in your hypothetical Data table. My guess is that Data is a leaf table - it usually is in these scenarios - and this makes it one of the few situations where I'd recommend a natural key over a surrogate. The same variable probably can't generate duplicate rows for the same timestamp, so all you really need is the variable and timestamp as your primary key. As the table gets larger and larger, having a separate index on variable and timestamp (which of course needs to be covering) is going to waste enormous amounts of space - 20, 50, 100 GB, easily. And of course every INSERT now needs to update two or more indexes.
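Something along these lines, sketched in SQL Server syntax:

-- Data keyed naturally: no surrogate Id column, no second index to maintain.
CREATE TABLE Data
(
    VariableId INTEGER  NOT NULL REFERENCES Variable (Id),
    Time       DATETIME NOT NULL,
    Value      FLOAT    NOT NULL,
    PRIMARY KEY (VariableId, Time)   -- one reading per variable per timestamp
);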
I really believe that an RDBMS (or SQL database, if you prefer) is as capable for this task as any other if you exercise sufficient care and planning in your design. If you just start slinging tables together without any regard for performance or scale, then of course you will get into trouble later, and when the database is several hundred GB it will be difficult to dig yourself out of that hole.
But is it feasible? Absolutely. Monitor the performance constantly and over time you will learn what optimizations you need to make.
It sounds like you're talking about telemetry data (time stamps, data points).
We don't use SQL databases for this (although we do use SQL databases to organize it); instead, we use binary streaming files to capture the actual data. There are a number of binary file formats that are suitable for this, including HDF5 and CDF. The file format we use here is a proprietary compressible format. But then, we deal with hundreds of megabytes of telemetry data in one go.
You might find this article interesting (links directly to Microsoft Word document):
http://www.microsoft.com/caseStudies/ServeFileResource.aspx?4000003362
It is a case study from the McLaren Group, describing how SQL Server 2008 is used to capture and process telemetry data from Formula One race cars. Note that they don't actually store the telemetry data in the database; instead, it is stored in the file system, and the FILESTREAM capability of SQL Server 2008 is used to access it.
I believe you're headed down the right path. We have a similar situation where we work. Data comes from various transport/automation systems across various industries such as manufacturing, auto, etc. Mainly we deal with the Big 3: Ford, Chrysler, GM. But we've also had a lot of data coming in from customers like CAT.
We ended up extracting data into a database and as long as you properly index your table, keep updates to a minimum and schedule maintenance (rebuild indexes, purge old data, update statistics) then I see no reason for this to be a bad solution; in fact I think it is a good solution.
Certainly a relational database is suitable for mining the data after the fact.
Various nuclear and particle physics experiments I have been involved with have explored several points on this spectrum, from not using an RDBMS at all, through storing just the run summaries (or the run summaries plus the slowly varying environmental conditions) in the DB, all the way to cramming every bit collected into the DB (though it was staged to disk first).
When and where the data rate allows, more and more groups are moving towards putting as much data as possible into the database.
IBM Informix Dynamic Server (IDS) has a TimeSeries DataBlade and RealTime Loader which might provide relevant functionality.
Your naïve schema records each reading 100% independently, which makes it hard to correlate across readings - both for the same variable at different times and for different variables at (approximately) the same time. That may be necessary, but it makes life harder when dealing with subsequent processing. How much of an issue that is depends on how often you will need to run correlations across all 1,000 variables (or even a significant percentage of them, where significant might be as small as 1% and would almost certainly start by 10%).
I would look to combine key variables into groups that can be recorded jointly. For example, if you have a monitor unit that records temperature, pressure and acidity (pH) at one location, and there are perhaps a hundred of these monitors in the plant that is being monitored, I would expect to group the three readings plus the location ID (or monitor ID) and time into a single row:
CREATE TABLE MonitorReading
(
    MonitorID   INTEGER  NOT NULL REFERENCES MonitorUnit,
    Time        DATETIME NOT NULL,
    PhReading   FLOAT    NOT NULL,
    Pressure    FLOAT    NOT NULL,
    Temperature FLOAT    NOT NULL,
    PRIMARY KEY (MonitorID, Time)
);
This saves having to do self-joins to see what the three readings were at a particular location at a particular time, and uses about 20 bytes instead of 3 * 16 = 48 bytes per row. If you are adamant that you need a unique ID integer for the record, that increases to 24 or 28 bytes (depending on whether you use a 4-byte or 8-byte integer for the ID column).
Yes, a DBMS is appropriate for this, although not the fastest option. You will need to invest in a reasonable system to handle the load though. I will address the rest of my answer to this problem.
It depends on how beefy a system you're willing to throw at the problem. There are two main limiters on how fast you can insert data into a DB: bulk I/O speed and seek time. A well-designed relational DB will perform at least 2 seeks per insertion: one to begin the transaction (in case the transaction cannot be completed), and one when the transaction is committed. Add to this the additional seeks needed to find your index entries and update them.
If your data are large, then the limiting factor will be how fast you can write data. For a hard drive, this will be about 60-120 MB/s. For a solid state disk, you can expect upwards of 200 MB/s. You will (of course) want extra disks for a RAID array. The pertinent figure is storage bandwidth AKA sequential I/O speed.
If you are writing a lot of small transactions, the limitation will be how fast your disk can seek to a spot and write a small piece of data, measured in I/Os per second (IOPS). We can estimate 4-8 seeks per transaction (a reasonable case with transactions enabled and an index or two, plus some integrity checks). For a hard drive, the seek time will be several milliseconds, depending on disk RPM. This will limit you to several hundred writes per second. For a solid-state disk, the seek time is under 1 ms, so you can write several THOUSAND transactions per second.
When updating indices, you will need to do about O(log n) small seeks to find where to update, so the DB will slow down as the record counts grow. Remember that a DB may not write in the most efficient format possible, so data size may be bigger than you expect.
So, in general, YES, you can do this with a DBMS, although you will want to invest in good storage to ensure it can keep up with your insertion rate. If you wish to cut costs, you may want to roll data over a specific age (say, 1 year) into a secondary, compressed archive format.
EDIT:
A DBMS is probably the easiest system to work with for storing recent data, but you should strongly consider the HDF5/CDF formats someone else suggested for storing older, archived data. They are flexible and widely supported formats that provide compression and VERY efficient storage of large time series and multi-dimensional arrays. I believe they also provide some methods of indexing into the data. You should be able to write a little code to fetch from these archive files if the data is too old to be in the DB.
There is probably a data structure that would be more optimal for your given case than a relational database.
Having said that, there are many reasons to go with a relational DB including robust code support, backup & replication technology and a large community of experts.
Your use case is similar to high-volume financial applications and telco applications. Both are frequently inserting data and frequently doing queries that are both time-based and include other select factors.
I worked on a mid-sized billing project that handled cable bills for millions of subscribers. That meant an average of around 5 rows per subscriber times a few million subscribers per month in the financial transaction table alone. That was easily handled by a mid-size Oracle server using (now) 4-year-old hardware and software. Large billing platforms can have 10x that many records per unit time.
Properly architected and with the right hardware, this case can be handled well by modern relational DB's.
Years ago, a customer of ours tried to load an RDBMS with real-time data collected from monitoring plant machinery. It didn't work in a simplistic way.
Is relational databases a good backend for a process historian?
Yes, but. It needs to store summary data, not details.
You'll need a front-end based in-memory and on flat files. Periodic summaries and digests can be loaded into an RDBMS for further analysis.
You'll want to look at data warehousing techniques for this. Most of what you want to do is split your data into two essential parts:
Facts. The data that has units. Actual measurements.
Dimensions. The various attributes of the facts -- date, location, device, etc.
This leads you to a more sophisticated data model.
Fact: Key, Measure 1, Measure 2, ..., Measure n, Date, Geography, Device, Product Line, Customer, etc.
Dimension 1 (Date/Time): Year, Quarter, Month, Week, Day, Hour
Dimension 2 (Geography): location hierarchy of some kind
Dimension 3 (Device): attributes of the device
Dimension n: attributes of each dimension of the fact
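As a rough sketch (table and column names are purely illustrative):

-- Illustrative star schema: a narrow fact table of measurements with
-- foreign keys out to descriptive dimension tables.
CREATE TABLE DimDate
(
    DateKey INTEGER PRIMARY KEY,
    Year    INTEGER, Quarter INTEGER, Month INTEGER, Day INTEGER, Hour INTEGER
);

CREATE TABLE DimDevice
(
    DeviceKey  INTEGER PRIMARY KEY,
    DeviceType VARCHAR(32),
    Location   VARCHAR(64)
);

CREATE TABLE FactReading
(
    DateKey   INTEGER NOT NULL REFERENCES DimDate,
    DeviceKey INTEGER NOT NULL REFERENCES DimDevice,
    Measure1  FLOAT,
    Measure2  FLOAT,
    PRIMARY KEY (DateKey, DeviceKey)   -- grain: one row per device per hour
);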
You may want to look at KDB. It is specifically optimized for this kind of usage: many inserts, few or no updates or deletes.
It isn't as easy to use as traditional RDBMS though.
The other aspect to consider is what kind of selects you're doing. Relational/SQL databases are great for doing complex joins dependent on multiple indexes, etc. They really can't be beaten for that. But if you're not doing that kind of thing, they're probably not such a great match.
If all you're doing is storing per-time records, I'd be tempted to roll your own file format ... even just output the stuff as CSV (groans from the audience, I know, but it's hard to beat for wide acceptance)
It really depends on your indexing/lookup requirements, and your willingness to write tools to do it.
You may want to take a look at a Stream Data Manager System (SDMS).
While not addressing all your needs (long-time persistence), sliding windows over time and rows and frequently changing data are their points of strength.
Some useful links:
Stanford Stream Data Manager
Stream Mill
Material about Continuous Queries
AFAIK major database makers all should have some kind of prototype version of an SDMS in the works, so I think it's a paradigm worth checking out.
I know you're asking about relational database systems, but those are unicorns. SQL DBMSs are probably a bad match for your needs because no current SQL system (that I know of) provides reasonable facilities for dealing with temporal data. Depending on your needs, you might or might not have another option in specialized tools and formats; see e.g. rrdtool.