Quickly Pivoting Large Data - SQL

We are developing a product that is used for building predictive models and for slicing and dicing data to provide BI.
We have two kinds of data access requirements.
For predictive modeling, we need to read data on a daily basis, row by row. For this, a normal SQL Server database is sufficient and we are not having any issues.
For slicing and dicing, the data is huge, say 1 GB with around 300 million rows, and we want to pivot it easily with minimal response time.
The current SQL database has response time issues with this.
We would like our product to run on any normal client machine with 2 GB of RAM and a Core 2 Duo processor.
I would like to know how I should store this data and how I can then create a pivoting experience for each of the dimensions.
Ideally we will have, say, daily sales by salesperson by region by product for a large corporation. We would then like to slice and dice it by any dimension and also be able to compute aggregations, unique values, maximum, minimum, average and other statistical functions.
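For concreteness, here is a minimal T-SQL sketch of the kind of fact table and slice-and-dice query we have in mind (table and column names are made up; a clustered columnstore index is just one storage option that suits this sort of scan-heavy aggregation):

    -- Hypothetical fact table: one row per salesperson / region / product / day.
    CREATE TABLE dbo.DailySales
    (
        SaleDate      DATE          NOT NULL,
        SalesPersonId INT           NOT NULL,
        RegionId      INT           NOT NULL,
        ProductId     INT           NOT NULL,
        Amount        DECIMAL(18,2) NOT NULL
    );

    -- Columnstore compression keeps the table small and speeds up large scans
    -- (requires SQL Server 2014 or later).
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_DailySales ON dbo.DailySales;

    -- "Slice and dice": group by any subset of dimensions and aggregate.
    SELECT
        RegionId,
        ProductId,
        SUM(Amount)                   AS TotalSales,
        COUNT(DISTINCT SalesPersonId) AS UniqueSalesPeople,
        MAX(Amount)                   AS MaxSale,
        MIN(Amount)                   AS MinSale,
        AVG(Amount)                   AS AvgSale
    FROM dbo.DailySales
    WHERE SaleDate >= '20170101' AND SaleDate < '20180101'
    GROUP BY RegionId, ProductId;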

I would build an in-memory cube on top of that data. To give you an example, icCube has sub-second response times for 3-4 measures over 50M rows on a single i5 core, without any cache or pre-aggregation (i.e., this response time is constant across all dimensions).
Contact us directly for more details about how to integrate it into your product.

You could also use PowerPivot to do this. It is a free add-in for Excel 2010 that allows large data sets to be handled, sliced and diced, etc.
If you want to code around it, you can connect to the PowerPivot database (effectively an SSAS cube) using the SSAS database connector.
Hope that is of some use.

Related

SSAS Tabular in-memory compression

I am testing SSAS Tabular on my existing data warehouse. I read that compression of data in memory will be fantastic, up to 10 times. The warehouse weighs about 600MB and the analytical model has about 60 measures (mostly row counts and basic calculations). In SQL Server Management Studio I checked the estimated size of the analytical database: ~1000MB. Not what I expected (I was hoping for 100MB at most).
I checked the memory usage of the msmdsrv.exe process using Resource Monitor. To my surprise, after a full process of the database, memory consumption of the msmdsrv process jumped from 200MB to 1600MB. I deployed a second instance of the same model connected to the same source and it grew to over 2500MB. So the estimated size was in fact correct.
Data Warehouse is quite typical - star schema, facts and dimensions, nothing fancy.
Why was the data not compressed in any way? How is it possible that it takes even more memory than the uncompressed source warehouse?
I will be most grateful for any tips on this mystery :)
You should read and watch Marco Russo's materials about VertiPaq Analyzer. With it you can find out which parts of your model take up most of the memory.
https://www.sqlbi.com/articles/data-model-size-with-vertipaq-analyzer/
https://www.sqlbi.com/tv/checking-model-size-using-vertipaq-analyzer-in-dax-studio/
And maybe this can shed some light:
https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
The Tabular model is based on a column store, which means that if a column has many unique values you get lower compression (e.g. an incremental ID column like TransactionID).
-> Omit high-cardinality columns where possible
-> Try to split columns when possible. If you have DateTime columns, split them into two parts (date and time); you then have more repeated values (see the sketch below)
-> The sort order of the data in partitions may affect the compression rate [Run Length Encoding (RLE)]
-> Use measures (they take no space) instead of calculated columns (they take up space)
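As a rough illustration of the date/time split (the view and column names here are invented), exposing two low-cardinality columns from the source instead of one nearly unique DATETIME:

    -- Loading the raw DATETIME gives almost one distinct value per row,
    -- which compresses poorly in a column store.
    SELECT TransactionId,
           TransactionTimestamp            -- e.g. 2017-03-01 14:22:37
    FROM dbo.FactTransaction;

    -- Splitting it in the source view yields two columns with far fewer
    -- distinct values (<= 366 dates per year, <= 86,400 seconds per day).
    SELECT TransactionId,
           CAST(TransactionTimestamp AS DATE)    AS TransactionDate,
           CAST(TransactionTimestamp AS TIME(0)) AS TransactionTime
    FROM dbo.FactTransaction;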

Storage of website analytical data - relational or time series?

We have a requirement to store website analytical data (think: views on a page, interactions, etc). Note: this is separate from Google Analytics data, as we want to own the data and enrich it as we see fit.
Storage requirements:
each 'event' will have a timestamp, event type and some other metadata (user id, etc)
the storage is append only. No updates or deletes
writes are consistent, but not IoT scale. Maybe 50/sec
estimating growth of about 100 million rows a year
Query requirements:
graphing data cumulatively over a period of time
slice/filter data by all the metadata as well as day/week/month/year slices
will likely need to be integrated into a larger data warehouse
Question: Is this a no-brainer for a time series DB like InfluxDB, or can I get away with a well-tuned SQL Server table?
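For reference, the well-tuned SQL Server table I have in mind would look roughly like this (all names are placeholders; a clustered columnstore index is one reasonable choice at ~100 million rows a year):

    -- Hypothetical append-only event table.
    CREATE TABLE dbo.PageEvent
    (
        EventId   BIGINT IDENTITY(1,1) NOT NULL,
        EventTime DATETIME2(0)         NOT NULL,
        EventType TINYINT              NOT NULL,
        UserId    INT                  NULL,
        Metadata  NVARCHAR(400)        NULL
    );

    -- Columnstore gives good compression and fast day/week/month aggregations
    -- for an insert-only, analytics-heavy workload.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_PageEvent ON dbo.PageEvent;

    -- Typical query: events per day for one event type, ready for cumulative graphing.
    SELECT CAST(EventTime AS DATE) AS EventDate,
           COUNT(*)                AS Events
    FROM dbo.PageEvent
    WHERE EventType = 1
    GROUP BY CAST(EventTime AS DATE)
    ORDER BY EventDate;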

Google BigQuery move to SQL Server, Big Data table optimisation

I have a curious question and as my name suggests I am a novice so please bear with me, oh and hi to you all, I have learned so much using this site already.
I have an MSSQL database for customers where I am trying to track their status on a daily basis, with various attributes being recorded in several tables, which are then joined together using a data table to create a master table that yields approximately 600 million rows.
As you can imagine, querying this beast on a middling server (Intel i5, OS on an SSD, 2TB 7200rpm HD, SQL Server 2017 Standard) is really slow. I was using Google BigQuery, but that got expensive very quickly. I have implemented indexes, which have somewhat sped up the process, but it is still not fast enough. A simple select distinct on customer id for a given attribute is still taking 12 minutes on average for a first run.
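For clarity, the kind of query I am running looks something like this (table and column names are illustrative only):

    -- One row per customer per day, ~600 million rows in total.
    SELECT DISTINCT CustomerId
    FROM dbo.CustomerDailyStatus
    WHERE Attribute42 = 1;   -- one of the ~100 binary attribute columns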
The whole point of having a daily view is to make it easier to have something like Tableau or Qlik connect to a single table, so the end user can create reports by just dragging in the required columns. I have thought of taking the main query that creates the master table and parameterizing it, but visualization tools aren't great at passing many variables.
This is a snippet of the table: there are approximately 300,000 customers, and a row per day is created for customers who joined between 2010 and 2017. They fall off the list if they leave.
My questions are:
1) Should I even bother creating a flat file, or should I just parameterize the query?
2) Are there any techniques I can use, aside from setting the smallest data types for each column, to keep the DB size to a minimum?
3) There are in fact over a hundred attribute columns; a lot of them, once they are set to either 0 or 1, seldom change. Is there another way to achieve this and save space?
4) What types of indexes should I have on the master table if many of the attributes are binary?
Any ideas would be gratefully received.

How to provide YTD, 12M, and Annualized measures in SQL Server data warehouse?

The project requires a usable data warehouse (DW) in SQL Server tables. They prefer no Analysis Services, with the SQL Server DW providing everything they need.
They are starting to use Power BI, and have expressed the desire to provide all facts and measures in SQL Server tables, as opposed to a multidimensional cube. The client has also used SSRS (to a large degree) and some Excel (by users).
At first they only required the Revenue FACT for Period x Product x Location. This uses a periodic snapshot type of fact, not a transactional grain fact.
So, to provide YTD measures for all periods, my first challenge was filling in the "empty" facts: combinations for which there was no revenue, but for which there was revenue in prior (and/or subsequent) periods. The fact table includes a column for the YTD measure.
I solved that by creating empty facts -- no revenue -- for the no revenue periods, such as this:
Period 1, Loc1, Widget1, $1 revenue, $1 YTD
Period 2, Loc1, Widget1, $0 revenue, $1 YTD (this $0 "fact" created for YTD)
Period 3, Loc1, Widget1, $1 revenue, $2 YTD
I'm just using YTD as an example, but the requirements include measures for Last 12 Months, and Annualized, in addition to YTD.
I found that the best way to tally YTD was actually to create records to hold the measure (YTD) where there is no fact coming from the transactional data (meaning: no revenue for that combination of dimensions).
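A rough T-SQL sketch of that approach, assuming a simple Period x Location x Product snapshot (all object names are illustrative): every dimension combination gets a row, missing revenue becomes $0, and YTD is a running total within the year.

    WITH AllCombos AS
    (
        -- every Period x Location x Product that should carry a YTD value
        SELECT p.PeriodKey, p.YearKey, l.LocationKey, w.ProductKey
        FROM dbo.DimPeriod p
        CROSS JOIN dbo.DimLocation l
        CROSS JOIN dbo.DimProduct w
    )
    SELECT
        c.PeriodKey,
        c.LocationKey,
        c.ProductKey,
        ISNULL(f.Revenue, 0) AS Revenue,          -- the $0 "fact" rows
        SUM(ISNULL(f.Revenue, 0)) OVER
            (PARTITION BY c.YearKey, c.LocationKey, c.ProductKey
             ORDER BY c.PeriodKey
             ROWS UNBOUNDED PRECEDING)            AS RevenueYTD
    FROM AllCombos c
    LEFT JOIN dbo.FactRevenue f
           ON  f.PeriodKey   = c.PeriodKey
           AND f.LocationKey = c.LocationKey
           AND f.ProductKey  = c.ProductKey;

The CROSS JOIN is also where the record-count blow-up described next comes from: every extra dimension multiplies the number of rows that exist only to carry the measure.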
Now the requirements need the revenue fact by two more dimensions, Market Segment and Customer. This means I need to refactor my existing stored procedures to do the same process, but now for a more granular fact:
Period x Widget x Location x Market x Customer
This will result in creating many more records to hold the YTD (and other) measures. There will be far more records for these measures than there are real facts.
Here are what I believe are possible solutions:
Just do it in SQL DW table(s). This makes it easy to use wherever needed. (like it is now)
Do this in Power BI -- assume DAX expression in PBIX?
SSAS Tabular -- is Tabular an appropriate place to calculate the YTD, etc. measures, or should that be handled at the reporting layer?
For what it's worth, the client is reluctant to use SSAS Tabular because they want to keep the number of layers to a minimum.
Follow up questions:
Is there a SQL Server architecture to provide this sort of solution as I did it, maybe reducing the number of records necessary?
If they use PowerBI for YTD, 12M, Annualized measures, what do I need to provide in the SQL DW, anything more than the facts?
Is this something that SSAS Tabular solves, inherently?
This is my experience:
I have always maintained that the DW should be where all of your data is. Then any client tool can use that DW and get the same answer.
I came across the same issue in a recent project: generating "same day last year" type calcs (along with YTD, fiscal YTD, etc.). SQL Server seemed like the obvious place to put these 'sparse' facts, but as I discovered (as you have), the sparsity just gets bigger and more complicated as the dimensions increase. You end up blowing out the size, continually coming back to chase down missing sparse facts, and, worst of all, having to come up with weird 'allocation' rules to push measures down to the required level of detail.
IMHO DAX is the place to do this, but there is a lot of pain in learning the language, especially if you come from a traditional relational background. I really do think it's the best thing since SQL, if you can just get past the learning curve.
One of the most obvious advantages of using DAX rather than the DW is that DAX recognises the current filters in the client tool (Power BI, Excel, or whatever) at run time and can adjust its calculation automatically. Obviously you can't do that with figures stored in the DW. For example, it can recognise that the visual or row is filtered to a given date, so your current year/prior year calcs automatically compute the correct YTD based on that date.
DAX has a number of 'calendar' type functions (called "time intelligence"), but they only work for a particular type of calendar and come with a lot of constraints, so usually you end up creating your own calendar table and building your calculations around that calendar table.
My advice is to start here: https://www.daxpatterns.com/ and try generating some YTD calcs in DAX
For what it's worth, the client is reluctant to use SSAS Tabular
because they want to keep the number of layers to a minimum.
Power BI already has a (required) modelling layer that effectively uses SSAS Tabular internally, so you already have an additional logical layer; it's just in the same tool as the reporting layer. The difference is that doing the modelling only in Power BI currently isn't an "Enterprise" approach: features such as model version control, partitioned loads and advanced row-level security aren't supported by Power BI (although who knows what next month will bring).
Layers are not bad things as long as you keep them under control. Otherwise we should just go back to monolithic COBOL programs.
It's certainly possible to start doing your modelling purely in Power BI and then, at a later stage when you need the features, control and scalability, migrate to SSAS Tabular.
One thing to consider is that the SSAS Tabular PaaS offering in Azure can get pretty pricey, but if you ever need partitioned loads (i.e. loading just this week's data into a very large cube with a lot of history), you'll need to use it.
Is there a SQL Server architecture to provide this sort of solution as I did it, maybe reducing the number of records necessary?
I guess that architecture would be defining the records in views, which has a lot of obvious drawbacks. There is a 'sparse' designator, but that just optimises storage for columns that have lots of NULLs, which may not even be the case here.
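For reference, the sparse designator is just a per-column storage attribute; a minimal illustration (the table and columns are invented, and it only pays off when the column is NULL for the large majority of rows):

    CREATE TABLE dbo.FactRevenueSnapshot
    (
        PeriodKey   INT           NOT NULL,
        LocationKey INT           NOT NULL,
        ProductKey  INT           NOT NULL,
        Revenue     DECIMAL(18,2) SPARSE NULL,  -- only worthwhile if mostly NULL
        RevenueYTD  DECIMAL(18,2) NULL
    );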
If they use PowerBI for YTD, 12M, Annualized measures, what do I need to provide in the SQL DW, anything more than the facts?
You definitely need a comprehensive calendar table defining the fiscal year.
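A cut-down sketch of such a calendar table (the column names and the July-June fiscal year are assumptions for illustration):

    CREATE TABLE dbo.DimDate
    (
        DateKey       INT  NOT NULL PRIMARY KEY,  -- e.g. 20170701
        [Date]        DATE NOT NULL,
        CalendarYear  INT  NOT NULL,
        CalendarMonth INT  NOT NULL,
        FiscalYear    INT  NOT NULL,              -- here: fiscal year starts 1 July
        FiscalPeriod  INT  NOT NULL               -- 1 = July ... 12 = June
    );

    INSERT INTO dbo.DimDate (DateKey, [Date], CalendarYear, CalendarMonth, FiscalYear, FiscalPeriod)
    SELECT CONVERT(INT, FORMAT(d, 'yyyyMMdd')),
           d,
           YEAR(d),
           MONTH(d),
           CASE WHEN MONTH(d) >= 7 THEN YEAR(d) + 1 ELSE YEAR(d) END,
           ((MONTH(d) + 5) % 12) + 1
    FROM (SELECT DATEADD(DAY, n.n, '2010-01-01') AS d
          FROM (SELECT TOP (4018) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
                FROM sys.all_objects a CROSS JOIN sys.all_objects b) AS n) AS dates;

The fiscal columns are what let either the DW or the DAX measures report by fiscal period rather than calendar period.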
Is this something that SSAS Tabular solves, inherently?
If you only want to report by calendar periods (1st Jan to 31st Dec), then the built-in time intelligence is "inherent", but if you want to report by fiscal periods, the built-in time intelligence can't be used. Regardless, you still need to define the DAX calcs, and they can get really big.
First, SSAS Tabular and Power BI use the same engine. So they are equally applicable.
And the ability to define measures that can be calculated across any slice of the data, using any of a large number of categorical attributes, is one of the main reasons why you want something like SSAS Tabular or Power BI in front of your SQL Server. (The others are caching, simplified end-user reporting, the ability to mash up data across sources, and custom security.)
Ideally, SQL Server should provide the Facts, along with single-column joins to any dimension tables, including a Date Dimension table. Power BI / SSAS Tabular would then layer on the DAX Measure definitions, the Filter Flow behavior and perhaps Row Level Security.

Using PowerBI to visualize large amounts of data on a SQL Data Warehouse

I have a SQL DW which is about 30 GB. I want to use Power BI to visualize this data, but I noticed Power BI Desktop only supports file sizes up to 250 MB. What is the best way to connect to Power BI to visualize this data?
You have a couple of choices depending on your use case:
Direct query of the source data
View based aggregations of the source data
Direct Query
For smaller datasets (think in the thousands of rows), you can simply connect PowerBI directly to Azure SQL Data Warehouse and use the table view to pull in the data as necessary.
View Based Aggregations
For larger datasets (think millions, billions, even trillions of rows) you're better served by running the aggregations within SQL Data Warehouse. This can take the shape of a view that creates the aggregations (think sales by hour instead of every individual sale), or you can create a permanent table at data-loading time through a CTAS operation that contains the aggregations your users commonly query against. With this latter CTAS model, the user's query becomes a simple select-with-filter operation (say, aggregated sales where the date is greater than today - 90 days). Once the view or reporting table is created, you can simply connect to Power BI as you normally would.
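A hedged example of the CTAS pattern (the object names, the hash distribution column and the 90-day filter are all assumptions):

    CREATE TABLE dbo.AggSalesByHour
    WITH
    (
        DISTRIBUTION = HASH(StoreId),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT StoreId,
           DATEADD(HOUR, DATEDIFF(HOUR, 0, SaleTimestamp), 0) AS SaleHour,  -- truncate to the hour
           SUM(Amount) AS TotalSales,
           COUNT(*)    AS SaleCount
    FROM dbo.FactSales
    WHERE SaleTimestamp >= DATEADD(DAY, -90, GETDATE())
    GROUP BY StoreId, DATEADD(HOUR, DATEDIFF(HOUR, 0, SaleTimestamp), 0);

Power BI then connects to the aggregated table like any other table or view.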
The PowerBI team has a blog post - Exploring Azure SQL Data Warehouse with PowerBI - that covers this as well.
You could also create a query (Power Query / M) that retrieves only the required level of data (i.e. groups, joins, filters, etc). If done right, the queries are translated to T-SQL and only a limited amount of data is downloaded into the Power BI designer.