For interview purposes, what questions can we expect from an SSAS perspective?
a) Entry/Beginners level (1-6 months)
b) Intermediate
c) Advanced
Thanks
Here are some general approaches I use for interviewing different groups of SSAS programmers:
Test Knowledge of BIDS for Developing Cubes
Ask the candidate to explain all the steps they need to complete in BIDS to create and publish a cube from scratch. For simplicity's sake, I usually ask them to assume they have a Kimball method data warehouse on one SQL Server that has 2 fact tables and 5 dimension tables.
Most candidates who claim to have SSAS experience can explain the life cycle of building a cube, but rarely can they actually explain the steps to build a cube correctly. Experienced users should talk about setting up the database connection, creating a DSV, generating a cube, generating dimension tables or modifying dimension tables created by the cube, defining attribute relationships for dimensions, defining relationships in the cube between fact and dimension tables, deploying the cube, etc. Candidates should know the terminology inside and out.
If the candidate describes the top-line process for building cubes in BIDS, then drill into details about the DSV. What are named queries? What are advantages and disadvantages of named queries? Should you link directly to tables, views, or named queries? Do views have any advantages over direct links to tables?
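For instance, a plain T-SQL view (hypothetical table and column names below) can play the same role as a DSV named query while keeping the logic in the database, where it can be tested and reused outside BIDS - a good candidate can discuss this trade-off:

    -- Hypothetical sketch: a view the DSV can reference instead of a named query.
    -- Keeping the logic in the database lets it be tested and reused outside BIDS.
    CREATE VIEW dbo.vw_FactSales
    AS
    SELECT
        f.SalesKey,
        f.OrderDateKey,
        f.ProductKey,
        f.CustomerKey,
        f.SalesAmount,
        f.OrderQuantity
    FROM dbo.FactSales AS f
    WHERE f.IsDeleted = 0;   -- filtering logic lives here rather than buried in the DSV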
Ask the candidate to describe in detail how they would add a new attribute to a dimension. Assume for simplicity's sake that someone has already added the column to the underlying database table and you now need to adjust the cube definition and deploy the changes.
Ask the candidate how cubes are maintained from day to day. Ask about the differences between fully processing cubes and dimensions versus partially processing cubes. Ask about what happens if a customer cancels an order and how that should propagate through the data warehouse. See if the candidate talks about ledger-style transactions versus status changes and how this impacts processing the fact table. Ask about how partitions are used, how they are defined, when you should use them, and when you shouldn't use them.
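To make the ledger-versus-status distinction concrete, here is a hedged T-SQL sketch (hypothetical table, column names, and keys): a ledger-style design records a cancellation as an offsetting fact row, so only the current partition needs reprocessing, while a status-change design updates the original row and can force older partitions to be reprocessed.

    -- Hypothetical ledger-style approach: the cancellation becomes a new, offsetting row
    -- dated when it was received, so history is preserved and only the current
    -- partition has to be reprocessed.
    INSERT INTO dbo.FactOrders (OrderKey, OrderDateKey, ProductKey, CustomerKey, OrderAmount, OrderQuantity)
    SELECT OrderKey, 20240215, ProductKey, CustomerKey, -OrderAmount, -OrderQuantity
    FROM dbo.FactOrders
    WHERE OrderKey = 12345;

    -- Hypothetical status-change alternative: the original row is updated in place,
    -- which means the partition holding the original order date must be reprocessed too.
    UPDATE dbo.FactOrders
    SET OrderStatus = 'Cancelled'
    WHERE OrderKey = 12345;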
Ask detailed questions about the advantages and disadvantages of date dimensions and time dimensions, how the dimensions should be maintained to handle new dates, etc. The candidate should be able to explain an automated method for maintaining dates, with holidays typically being the exception that needs manual upkeep.
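A minimal sketch of what such an automated method might look like (hypothetical DimDate table and columns), typically run from a SQL Server Agent job, with holidays still kept in a manually maintained lookup:

    -- Hypothetical sketch: extend DimDate through the end of next year.
    -- Holidays are usually maintained separately in a manual lookup table.
    DECLARE @d DATE = (SELECT DATEADD(DAY, 1, MAX(FullDate)) FROM dbo.DimDate);
    DECLARE @stop DATE = DATEFROMPARTS(YEAR(GETDATE()) + 1, 12, 31);

    WHILE @d <= @stop
    BEGIN
        INSERT INTO dbo.DimDate (DateKey, FullDate, CalendarYear, CalendarMonth, DayOfWeekName)
        VALUES (
            YEAR(@d) * 10000 + MONTH(@d) * 100 + DAY(@d),  -- e.g. 20240215
            @d,
            YEAR(@d),
            MONTH(@d),
            DATENAME(WEEKDAY, @d)
        );
        SET @d = DATEADD(DAY, 1, @d);
    END;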
Ask how changes to the cube are tested before publishing the changes to end users. I once interviewed a candidate who answered most of the technical questions about how to build a cube in BIDS correctly, but then couldn't explain to me how to test the cube. The candidate simply said he would publish the changes and then his manager would take care of everything. When I asked how he would test drill through actions, slicing behavior, etc., it became clear that the "architect" had no idea how any of this actually worked.
Ask how the candidate troubleshoots performance issues. Good answers should talk about SQL Profiler, testing MDX queries directly in Management Studio, monitoring key perfmon statistics, redefining attribute relationships and cube relationships, loading data into cleansed tables instead of using raw source tables, isolating analysis services performance from other application or sql server services, etc.
Test Knowledge of MDX
Ask the candidate some basic MDX questions. Ask questions like "I have a cube called new_cube and it has a products dimension and an orders fact table. Tell me roughly how you would filter this down to 3 orders." If the candidate can only explain how to do it in a GUI such as Excel or SSRS, then ask some deeper questions about returning nulls, returning all records regardless of nulls, or returning non-null values.
Ask the candidate about when they actually code MDX versus just using a GUI. Ask about which tools the candidate used to interact with the data. If it is Excel, then ask if they have used the OLAP extensions or data mining extensions. Ask what they can see in SQL Server Management Studio. If it is Excel, then ask how they handled refreshing data between months without having to change parameters. If it is SSRS, then ask how they handled multivalue parameters or changing dates for subscriptions. If they did most of their work in Management Studio, then ask questions about syntax and different methods for limiting data to a subset of users, orders, or dates.
Test Knowledge of Data Warehousing Design Principles
Ask questions about Kimball method data warehouses, star schemas, snowflake schemas, degenerate dimensions, date dimensions, time dimensions, surrogate keys, etc.
Ask questions about SQL Server database design principles such as the differences between non-clustered, clustered, and composite indexes, CTEs, table-valued functions, looping over data, the fizzbuzz test, creating and managing SQL Server Agent jobs and schedules, how to troubleshoot slow-performing queries, etc. An excellent SSAS architect should be an expert SQL DBA from a data warehousing perspective. Don't ask questions about replication, log shipping, mirroring, clustering, etc., since this is usually outside of the purview of data warehousing SQL DBAs.
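To anchor a couple of those topics, here is a small T-SQL sketch (hypothetical fact table and columns) combining a covering nonclustered index with a CTE - the kind of thing a strong candidate should be able to write and explain on a whiteboard:

    -- Hypothetical example: a covering nonclustered index plus a CTE that finds
    -- the most recent order per customer.
    CREATE NONCLUSTERED INDEX IX_FactOrders_Customer_Date
        ON dbo.FactOrders (CustomerKey, OrderDateKey)
        INCLUDE (OrderAmount);

    WITH RankedOrders AS
    (
        SELECT
            CustomerKey,
            OrderDateKey,
            OrderAmount,
            ROW_NUMBER() OVER (PARTITION BY CustomerKey ORDER BY OrderDateKey DESC) AS rn
        FROM dbo.FactOrders
    )
    SELECT CustomerKey, OrderDateKey, OrderAmount
    FROM RankedOrders
    WHERE rn = 1;   -- most recent order per customer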
Ask questions about SSIS. An excellent SSAS architect must understand how to build complex SSIS packages, including importing a filtered list of changing files from a directory, pulling data in via data flows, using fast load options for bulk inserts, using script components as sources or transformations, etc.
At the end of all of this, you should be able to determine if the candidate is an SSAS architect, a wannabe SSAS architect who has lots of SQL DBA data warehousing architecture experience, an SSAS report writer in Excel, SSRS, or another BI platform, a report writer who doesn't really understand what's happening under the covers, a newbie, or a faker. Keep in mind that a lot of really good data warehouse architects don't have much SSAS experience. If you are looking for an experienced SSAS architect, then they basically have to be able to do the entire Microsoft BI stack. Anyone else fits into some other category.
I'm assigned to design a multidimensional cube in SSAS.
As I am very new to SSAS, and this is currently in the analysis phase,
I just wanted to ask: is there any standard process or guideline I should follow, or any general questions I should prepare, prior to designing the cube?
One thing the client specifically mentioned is the volume of data:
One service area has 3 million rows, covering 3 years of data.
Does that mean we should plan a partitioning strategy? If yes, what should I be looking at? One thing that comes to mind:
Which field should we use to split the cube into partitions (am I heading in the right direction)?
What other factors should I consider during analysis?
SSAS design is a large topic with different angles. If I were in your shoes, I'd google for "SSAS Design" or something along those lines to learn more. For example, here's a sample chapter from a book provided by Microsoft themselves: https://www.microsoftpressstore.com/articles/article.aspx?p=2812063
I'd skip partitioning at this stage. See how it performs first and tune it later if really necessary. Usually partitioning is done on some accumulating field, like a date, where old data is not processed daily and only the latest data (partition) is updated (processed). This of course depends on the data you're dealing with.
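If partitioning does become necessary later, each SSAS partition is usually bound to a date-bounded query against the fact table. A hedged sketch of one monthly partition's source query, with hypothetical table and column names:

    -- Hypothetical source query for a single monthly partition (January 2020).
    -- Each partition gets its own non-overlapping date range; typically only the
    -- current partition is processed daily and older ones are left alone.
    SELECT
        f.OrderDateKey,
        f.ServiceAreaKey,
        f.Amount
    FROM dbo.FactService AS f
    WHERE f.OrderDateKey >= 20200101
      AND f.OrderDateKey <  20200201;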
Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
It is an offline use-case, meaning there are no inserts, deletes, or updates happening; instead, every X weeks a new complete dump is imported to replace the old data.
We would like an upper bound of 10 seconds to answer our queries. The faster the better; a maximum of 2 seconds per query would be great.
The actual data has around 100-200 million entries, and the growth rate is linear.
The system has to serve a limited number of concurrent users, at most 100.
Questions:
What would be the right database technology or mixture of technologies to solve our problem?
What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
(Bonus) What would be the estimated hardware requirements given a specific technology?
What we tried so far:
Setting up a document store with denormalized entries. Problem: It doesn't perform well on general queries because it has to scan too many entries for aggregations.
Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy to configure graphs and filters). BI tools are often used together with a data warehouse: The data is modeled, cleaned, and stored inside of the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However many companies (particularly smaller companies) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely a BI tool, you can de-normalize the data into a combined dataset, which would allow you a variety of filtering and calculation options while still maintaining high performance and speed.
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the amount of interactions, grouped and/or filtered by any of those columns.
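As a hedged illustration (hypothetical table and column names, loosely matching the columns above), the 2018 ACDC question might reduce to a query like:

    -- Hypothetical example: count 2018 interactions by initiating country and topic,
    -- filtered to responders whose top interest is ACDC.
    SELECT
        InitiatingCountry,
        Topic,
        COUNT(*) AS InteractionCount
    FROM dbo.FactInteraction
    WHERE InteractionDate >= '2018-01-01'
      AND InteractionDate <  '2019-01-01'
      AND RespondingTopInterest = 'ACDC'
    GROUP BY InitiatingCountry, Topic
    ORDER BY InteractionCount DESC;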
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)
Dear community,
I hope the headline gives you a hint of what I want to talk about / need advice.
I'm a BI Developer with 3 years of experience working on big BI projects - some in the healthcare industry and some in the finance industry when I was working at IBM.
On my current job I came to a startup company, the company has an operational DB for the purpose of the product and the data is on SQL Server DB.
For 4 months I was putting out fires caused by the mess my predecessor made, and now I'm ready for the next step: modeling the operational DB tables into a DWH so I can extract and use the data for analytical and BI purposes.
I don't have any resources at all, so I will build the DWH on the operational DB first; my vision is that the DWH will move to Snowflake after I get resources from my CTO.
The modeling issue:
When I'm tackling the issue of data modeling I run into some confusion about the right way to model the data: there is the traditional way I'm familiar with from IBM, but there are also cloud DWH modeling and hybrid approaches.
My model needs to be flexible and the data should be very fast to extract.
What is the best way to store and extract data for analytical purposes?
Fact tables with a lot of dimensions - the normalized approach
OR
Putting all the data I need into one wide table per granularity (thinking about the future move to Snowflake), so I would have several tables, each with its own granularity and its own domain.
I'm just interested to hear what some of you implemented at your companies and whether you have any advice or use cases you can share. I searched the web a lot and what I found is a lot of biased and very confusing info - nobody is really saying what works in the real world.
Thanks in advance!
Well, two key goals of normalisation are to reduce the disk space used and to optimise data retrieval, neither of which is all that relevant in Snowflake. Storage is dirt cheap. And for the most part the database is self-optimised - worst case you might have to set up clustering keys on very large tables (see: https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html).
I've found that big tables with lots of columns perform better than many smaller tables with joins. For example, when testing a flat table with 10 million rows and a clustering key set up, it was about 180% faster than obtaining the same result set from a more complex, multi-table model.
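A minimal sketch of that approach in Snowflake SQL (hypothetical table and column names), with a clustering key on the column most queries filter by:

    -- Hypothetical flat, denormalized table clustered on the event date.
    -- The wide table trades cheap storage for fewer joins at query time.
    CREATE OR REPLACE TABLE analytics.events_flat (
        event_date     DATE,
        customer_id    NUMBER,
        customer_name  VARCHAR,
        product_name   VARCHAR,
        amount         NUMBER(18,2)
    )
    CLUSTER BY (event_date);

    -- Typical query: the clustering key lets Snowflake prune micro-partitions.
    SELECT product_name, SUM(amount) AS total_amount
    FROM analytics.events_flat
    WHERE event_date BETWEEN '2019-01-01' AND '2019-12-31'
    GROUP BY product_name;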
If you're anticipating a lot of writeback and require object-level changes, then you should still consider normalisation - but you'd be better off with a star schema in that case.
I'm a few months into developing a reporting solution. Currently I am loading a relational data warehouse (Fact and Dimension tables) using SSIS. SSAS cubes and dimensions are then created from the relational Data warehouse. I then use SSRS to build reports using MDX queries.
The problem I have is that things are starting to get rather complicated trying to understand how multidimensional modelling works as well as MDX and cubes. Since the organization it's being designed for is rather small, I'm thinking that I should re-evaluate my approach.
I think maybe I should just eliminate SSAS from the picture and simply create reports that report directly off the relational data warehouse using SQL queries. The relational data warehouse could still be loaded nightly to allow up to date data for reporting.
I'm just wondering if that would be a good idea considering I'm not very experienced with data warehousing and SSAS. Also I wanted to know if keeping my relational data warehouse in dimension and fact tables would still work with SQL queries or would I need to redesign the tables. I don't want to make the decision to eliminate SSAS if that will end up causing more headaches or issues.
The reports will not include complicated calculations besides row counts and YTD percentages. For example "How many callers were male?" and "How many callers called for Product A?" Which are then broken down by month.
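For example, I imagine a plain T-SQL query along these lines (hypothetical table and column names) would cover most of these reports if I dropped SSAS:

    -- Hypothetical example: male callers per month, plus a running year-to-date count.
    SELECT
        d.CalendarYear,
        d.CalendarMonth,
        COUNT(*) AS MaleCallers,
        SUM(COUNT(*)) OVER (
            PARTITION BY d.CalendarYear
            ORDER BY d.CalendarMonth
            ROWS UNBOUNDED PRECEDING
        ) AS MaleCallersYTD
    FROM dbo.FactCall AS f
    JOIN dbo.DimCaller AS c ON c.CallerKey = f.CallerKey
    JOIN dbo.DimDate   AS d ON d.DateKey   = f.CallDateKey
    WHERE c.Gender = 'M'
    GROUP BY d.CalendarYear, d.CalendarMonth;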
Any comments or suggestions are much appreciated, because I'm starting to feel rather frustrated trying to get SSAS cubes developed properly.
I was in a similar situation at my company. I had never used SSAS, and I was asked to do research on the benefits of using cubes to do some reporting. It was a pretty steep learning curve because my background is in development not data and reporting. SSAS is most useful when aggregate queries on a relational database are time consuming and if reports need to be broken down into hierarchies that an analyst can use to better understand the state of the business. Since SSAS stores aggregate info, queries of that nature are very quick. If your organization's data is small, the relational queries might be quick enough that you don't really need the benefit of storing aggregates.
Also you need to take into consideration the maintainability of using SSAS. If you're having trouble figuring out SSAS and MDX, then how easy a time will others have? I tried to explain an MDX query I wrote to my boss, who is experienced with SQL, but it's really quite different from relational queries. How easy is it going to be to add more complex reports?
A benefit to using SSAS is it can put the analyst in control of the report. Second, there are great tools and support. Finally, it's pretty easy to deploy and connect.
Yes, you can remove SSAS from your architecture, because any result you can get from an MDX query against SSAS you can also get from a T-SQL query against your data warehouse, since the cube was built by reading data from the DW. BUT bear in mind the following: the main advantage of an OLAP cube, in my view, is aggregations.
Very simple explanation: let's say you have a fact table called orders with 1 million orders per month. If you want to know how much you sold in that month, using SQL you need to read row by row and sum the values to produce the total. That's roughly 1 million reads on your DB. If you have a cube with the proper aggregations configured, that value can be pre-calculated and pre-stored in the cube, so if you need to know how much you sold in a month, you will have only one read against your cube.
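A relational analogue of that idea, as a hedged sketch with hypothetical names: a small summary table (or an indexed view) rebuilt during the nightly load turns the monthly total into a single-row read instead of a million-row scan.

    -- Hypothetical scan-based query: reads every order row for the month.
    SELECT SUM(OrderAmount) AS MonthlySales
    FROM dbo.FactOrders
    WHERE OrderDateKey >= 20240101 AND OrderDateKey < 20240201;

    -- Hypothetical pre-aggregated summary table, rebuilt during the nightly load,
    -- playing the same role as a cube aggregation: one read per month.
    CREATE TABLE dbo.SalesByMonth (
        YearMonth  INT PRIMARY KEY,   -- e.g. 202401
        TotalSales DECIMAL(18, 2) NOT NULL
    );

    INSERT INTO dbo.SalesByMonth (YearMonth, TotalSales)
    SELECT OrderDateKey / 100, SUM(OrderAmount)
    FROM dbo.FactOrders
    GROUP BY OrderDateKey / 100;

    SELECT TotalSales FROM dbo.SalesByMonth WHERE YearMonth = 202401;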
It's a matter of analyzing your situation: if you have a small cube, maybe aggregations are not necessary and you can do fine with SQL, but depending on the situation they can be very helpful.
Over the years I have read a lot of people's opinions on how to get better performance out of their SQL (Microsoft SQL Server, just so we are all on the same page...) queries. However, they all seem to be tightly tied to either a high-performance OLTP setup or a data warehouse OLAP setup (cubes-galore...). However, my situation today is kind of in the middle of the 2, hence my indecision.
I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating to just the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating to just the sites. Basically it's a pretty big flattened table or view… Most of the columns are int, bit, char(3), or short varchar(s). My problem is that a good portion of these columns are available to be used in ad-hoc queries by the user, and the queries need to run as quickly as possible because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast… This data is read-only; the data doesn't change at all during the day and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation like an OLAP database with the read requirements of an OLTP database.
I see three options: 1) break the tables into smaller divisible units and sub-query everything, 2) make one flat table and really go to town on the indexing, or 3) create an OLAP cube and sub-query the rest based on whatever filter values I don't put in as cube dimensions. I have not done much with OLAP cubes so I frankly don't even know if that is an option, but from what I've done with them in the past I think it might be. Also, just to clarify, what I mean when I say "sub-query everything" is that instead of having a WHERE clause on the outer select, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed, to eliminate a really large Cartesian product. As for the second option of the one large table, I have heard and seen conflicting results with that approach, as it will save on joins but at the same time a table scan takes much longer.
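To make the "sub-query everything" idea concrete, here is a rough sketch using my table names (the column names and filter values are made up): each table gets its own filtered derived query, and the INNER JOINs then operate on the already-reduced sets.

    -- Hypothetical sketch of option 1: filter each table in its own derived query,
    -- then INNER JOIN the reduced sets rather than filtering one big joined result.
    SELECT c.ContactID, c.LastName, s.SiteID, s.SiteName
    FROM
        (SELECT ContactID, LastName
         FROM dbo.Contacts
         WHERE State = 'TX') AS c
    INNER JOIN
        (SELECT ContactID, SiteID
         FROM dbo.SiteContacts) AS sc
            ON sc.ContactID = c.ContactID
    INNER JOIN
        (SELECT SiteID, SiteName
         FROM dbo.Sites
         WHERE SiteType = 'RET') AS s
            ON s.SiteID = sc.SiteID;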
Ideas anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea if that’s the case, I’m new to that stuff too.
Thanks in advance to any and all opinions and help with this dilemma I’ve found myself in.
You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or, a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.
In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though. The fact may be the association of sites to contacts.
The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
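A minimal T-SQL sketch of that star schema (hypothetical columns), with the fact table referencing both dimensions by surrogate key:

    -- Hypothetical star schema sketch: two dimensions and a fact table keyed on surrogate keys.
    CREATE TABLE dbo.DimContact (
        ContactKey  INT IDENTITY(1,1) PRIMARY KEY,
        ContactName VARCHAR(100),
        ContactCity VARCHAR(50)
    );

    CREATE TABLE dbo.DimSite (
        SiteKey  INT IDENTITY(1,1) PRIMARY KEY,
        SiteName VARCHAR(100),
        SiteType CHAR(3)
    );

    -- The fact here is the association of a contact with a site on a given date.
    CREATE TABLE dbo.FactSiteContact (
        SiteKey            INT NOT NULL REFERENCES dbo.DimSite (SiteKey),
        ContactKey         INT NOT NULL REFERENCES dbo.DimContact (ContactKey),
        AssociationDateKey INT NOT NULL
    );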
It wouldn't be a very big problem to add numerous indexes to your architecture. The database is mostly read only, except for the refresh time. You won't have to worry about read performance while indexes are being updated. So, the architecture can accommodate all indexes that are needed (as long as you can dedicate enough downtime to refresh the data).
I agree with bobs' answer: throw an OLAP front end on it and query through the cube. The reason why this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates by multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.
The relational data underneath the cube will be great for detail drill-ins to find the individual facts that give a certain aggregate value. But querying directly the relational data will always be slow, because those aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.
OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.
What are the most common queries? For single pieces of data or aggregates?
If the granularity of SiteContacts is pretty close to that of Contacts (i.e., circa 3 million records, with most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).
On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.
OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.
Star schemas consist of fact tables with dimensions hanging off them - depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions with a factless fact table (sounds like an oxymoron, but is covered in Kimball's methodology) linking them.