SSAS Cube Design Issue linked with Performance - ssas

I have identified 3 dimension tables and 1 measure table.
It will be a star schema.
My measure group would have Count(A/c number).
Each dimension table has a lookup table tied to the A/c number in a roughly one-to-one relationship.
Dim1
--ID1
--Cat1
Dim2
--ID2
--Cat2
Dim3
--ID3
--Cat3
Fact
--A/c number
--Count(A/c)
--ID1
--ID2
--ID3
Above is just an example.
Of course, in reality there are 15 dimension tables (in a one-to-one relation with the fact table) and data close to a million records, which is why we need to come up with the best design for performance.
I know the fact/measure is always an aggregate or a measure of the business, and in this case the measure is Count(A/c number).
Question:
1. Do I need to add the A/c number to the fact table? Remember that adding the A/c number to the fact table would make the fact table huge. Good or bad, performance-wise?
2. Do I create an additional factless fact table similar to the fact table, where the fact table would have only Count(A/c number) and the factless fact table would have the actual A/c numbers along with the dimension values? This would be a big table. Good or bad, performance-wise?
3. Do I create an additional column (A/c number) along with the lookup values on the dimension tables, so the fact table would contain only facts? Good or bad, performance-wise?
I also need to know whether dimension processing/deployment or fact processing/deployment is (or should be) faster, and what is preferred in practice.
I want to know which option to select in a real-world scenario, or whether there is a better solution.
Please let me know!

If I understood correctly, you are talking about a degenerate dimension.
That's a common practice and, in my opinion, the correct way to tackle your issue.
For instance, let's say we have an order details table with a granularity of one row per order line. Something like this:
Please visit this link to see the image, because I'm still not able to post images in the forum:
http://i623.photobucket.com/albums/tt313/pauldj54/degeneratedDimension.jpg
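In case the image does not load, here is a rough sketch of that kind of table (the column names are my own, not necessarily those in the screenshot): the order number sits directly on the fact row as a degenerate dimension, with no separate order dimension table.
-- Hypothetical order-details fact table: one row per order line,
-- with OrderNumber kept on the fact row as a degenerate dimension.
CREATE TABLE FactOrderDetail (
    OrderNumber INT          NOT NULL,  -- degenerate dimension (no Order dimension table)
    OrderLine   INT          NOT NULL,
    ProductKey  INT          NOT NULL,  -- surrogate key of the product dimension
    Quantity    INT          NOT NULL,
    Amount      DECIMAL(9,2) NOT NULL
);

INSERT INTO FactOrderDetail VALUES (1001, 1, 10, 2, 20.00);
INSERT INTO FactOrderDetail VALUES (1001, 2, 11, 1, 15.00);
INSERT INTO FactOrderDetail VALUES (1002, 1, 10, 3, 30.00);

-- Three order lines, but only two distinct orders:
SELECT COUNT(DISTINCT OrderNumber) AS OrderCount FROM FactOrderDetail;  -- returns 2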
If your measure is the count of orders, from the example above the result is: 2
Please check the following link:
Creating Degenerated dimension
Let me know if you have further questions.
Kind Regards,
Paul

You described your volume as "close to a million records". That sounds trivial to process on any server (or even a desktop or laptop) built in the last 5 years.
Therefore I would not limit the design to solve an imagined performance issue.


Data Warehouse - Storing unique data over time

Basically we are building a reporting dashboard for our software. We are giving the Clients the ability to view basic reporting information.
Example (I've removed 99% of the complexity of our actual system from this example, as it should still get across what I'm trying to do):
One example metric would be the number of unique products viewed over a certain time period. For instance, if 5 products were each viewed by customers 100 times over the course of a month and you run the report for that month, it should just say 5 for the number of products viewed.
Are there any recommendations on how to go about storing data in such a way that it can be queried for any time range and return a unique count of products viewed? For the sake of this example, let's say there is a rule that the application cannot query the source tables directly, and we have to store summary data in a different database and query it from there.
As a side note, we have tons of other metrics we are storing, which we store aggregated by day. But this particular metric is different because of the uniqueness issue.
I personally don't think it's possible. And our current solution is that we offer 4 pre-computed time ranges where metrics affected by uniqueness are available. If you use a custom time range, then that metric is no longer available because we don't have the data pre-computed.
Your problem is that you're trying to change the grain of the fact table. This can't be done.
Your best option is what I think you are doing now - define aggregate fact tables at the grain of day, week and month to support your performance constraint.
You can address the custom time range simply by advising your users that this will be slower than the standard aggregations. For example, a user wanting to know the counts of unique products sold on Tuesdays can write a query like this, at the expense of some performance loss:
-- sale count per product on Tuesdays (one row per product code)
select dim_prod.pcode
      ,count(*) as sale_count
from fact_sale
join dim_prod on dim_prod.pkey = fact_sale.pkey
join dim_date on dim_date.dkey = fact_sale.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode
The query could also be written against a daily aggregate rather than a transactional fact; as it would scan less data, it would run faster, maybe even meeting your need.
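A rough sketch of that variant, assuming a hypothetical daily aggregate table agg_daily_sale with one row per product per day and a sale_count column:
-- same question answered from the daily aggregate instead of the transactional fact
select dim_prod.pcode
      ,sum(agg_daily_sale.sale_count) as sale_count
from agg_daily_sale
join dim_prod on dim_prod.pkey = agg_daily_sale.pkey
join dim_date on dim_date.dkey = agg_daily_sale.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode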
From the information that you have provided, I think you are trying to measure 'number of unique products viewed over a month (for example)'.
I'm not sure if you are using Kimball methodology to design your fact tables. I believe that in the Kimball methodology, an accumulating snapshot fact table would be recommended to meet such a requirement.
I might be preaching to the converted (apologies in that case), but if not, I would suggest you go through the following link, where the experts have explained the concept in detail:
http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I have also included another link from the Kimball Group, which explains the different types of fact tables in detail:
http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/
Hope that explains the concepts. More than happy to answer any questions (to the best of my ability).
Cheers
Nithin

Database modeling for stock prices

I have recently been given the assignment of modelling a database fit to store stock prices for over 140 companies. The data will be collected every 15 minutes for 8.5 hours each day from all these companies. The problem I'm facing right now is how to set up the database to achieve fast search/fetch given this data.
One solution would be to store everything in one table with the following columns:
| Company name | Price | Date | Etc... |
Or I could create a table for each company and just store the price and the date when the data was collected (and other parameters not known at the moment).
What are your thoughts about these kinds of solutions? I hope the problem was explained in sufficient detail; if not, please let me know.
Any other solution would be greatly appreciated!
I take it you're concerned about performance given the large number of records you're likely to generate: 140 companies * 4 data points per hour * 8.5 hours * 250 trading days per year means you're looking at around 1.2 million data points per year.
Modern relational database systems can easily handle that number of records in a single table, subject to some important considerations; I don't see an issue with storing 100 years of data points.
So, yes, your initial design is probably the best:
Company name | Price | Date | Etc... |
Create indexes on Company name and date; that will allow you to answer questions like:
what was the highest share price for company x
what was the share price for company x on date y
on date y, what was the highest share price
To help prevent performance problems, I'd build a test database, and populate it with sample data (tools like dbMonster make this easy), and then build the queries you (think you) will run against the real system; use the tuning tools for your database system to optimize those queries and/or indices.
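As a rough illustration of the single-table design and the indexes discussed above (all names are placeholders, not from the question):
-- hypothetical single-table design with indexes on company name and date
CREATE TABLE stock_price (
    company_name VARCHAR(100)  NOT NULL,
    price        DECIMAL(12,4) NOT NULL,
    price_date   DATETIME2     NOT NULL
);

CREATE INDEX ix_stock_price_company_date ON stock_price (company_name, price_date);
CREATE INDEX ix_stock_price_date ON stock_price (price_date);

-- e.g. "what was the highest share price for company x?"
SELECT MAX(price) FROM stock_price WHERE company_name = 'x';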
On top of what has already been said, I'd like to add the following: don't use "Company name" or something like "Ticker Symbol" as your primary key. As you're likely to find out, stock prices have two important characteristics that are often ignored:
some companies are quoted on multiple stock exchanges, and therefore have different quote prices on each exchange.
some companies are quoted multiple times on the same stock exchange, but in different currencies.
As a result, a properly generic solution should use the (ISIN, currency, stock exchange) triplet as the identifier for a quote.
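A minimal sketch of that idea, with illustrative names only:
-- identify a quote by (ISIN, currency, exchange), not by company name or ticker alone
CREATE TABLE quote (
    isin          CHAR(12)      NOT NULL,
    currency      CHAR(3)       NOT NULL,
    exchange_code VARCHAR(10)   NOT NULL,
    price_time    DATETIME2     NOT NULL,
    price         DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (isin, currency, exchange_code, price_time)
);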
The first, more important question is: what are the types and usage patterns of the queries that will be executed against this table? Is this an Online Transaction Processing (OLTP) application, where the great majority of queries are against a single record, or at most a small set of records? Or is it an Online Analytical Processing (OLAP) application, where most queries will need to read and process significantly large sets of data to generate aggregations and do analysis? These two very different types of systems should be modeled in different ways.
If it is the first type of app (OLTP), your first option is the better one, but the usage patterns and types of queries would still be important in determining the types of indices to place on the table.
If it is an OLAP application (and a system storing billions of stock prices sounds more like an OLAP app), then the data structure you set up might be better organized to store pre-aggregated data values, or even go all the way and use a multi-dimensional database like an OLAP cube, based on a star schema.
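For the OLAP case, one sketch of what pre-aggregated storage could look like (a hypothetical daily summary table; the names and columns are my assumptions):
-- one row per company per trading day, pre-aggregated from the 15-minute quotes
CREATE TABLE daily_price_summary (
    company_id  INT           NOT NULL,
    trade_date  DATE          NOT NULL,
    open_price  DECIMAL(12,4) NOT NULL,
    close_price DECIMAL(12,4) NOT NULL,
    high_price  DECIMAL(12,4) NOT NULL,
    low_price   DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (company_id, trade_date)
);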
Put them into a single table. Modern DB engines can easily handle those volumes you specified.
rowid | StockCode | priceTimeInUTC | PriceCode | AskPrice | BidPrice | Volume
rowid: an IDENTITY or UNIQUEIDENTIFIER surrogate key.
StockCode instead of Company: companies can have multiple types of stock.
PriceTimeInUTC standardizes every datetime to a single timezone (UTC); use datetime2 (more accurate).
PriceCode identifies what type of price it is: Options/Futures/CommonStock, PreferredStock, etc.
AskPrice is the buying price.
BidPrice is the selling price.
Volume (for buys/sells) might be useful for you.
Separately, have a StockCode table and a PriceCode table.
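Putting those columns together, a sketch of that layout might look like this (the data types here are my assumptions):
-- lookup tables first, then the price table
CREATE TABLE StockCode (
    StockCode   VARCHAR(20)  NOT NULL PRIMARY KEY,
    CompanyName VARCHAR(100) NOT NULL
);

CREATE TABLE PriceCode (
    PriceCode   TINYINT     NOT NULL PRIMARY KEY,
    Description VARCHAR(50) NOT NULL   -- Options/Futures/CommonStock, PreferredStock, ...
);

CREATE TABLE StockPrice (
    rowid          UNIQUEIDENTIFIER NOT NULL PRIMARY KEY DEFAULT NEWID(),
    StockCode      VARCHAR(20)      NOT NULL REFERENCES StockCode (StockCode),
    PriceTimeInUTC DATETIME2        NOT NULL,   -- all times standardized to UTC
    PriceCode      TINYINT          NOT NULL REFERENCES PriceCode (PriceCode),
    AskPrice       DECIMAL(12,4)    NOT NULL,   -- buying price
    BidPrice       DECIMAL(12,4)    NOT NULL,   -- selling price
    Volume         BIGINT           NULL
);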
That is a brute-force approach; the second you add searchable factors, it can change everything. A more flexible and elegant option is a star schema, which can scale to any amount of data. I am a private party working on this myself.

Database Design for Audit: Many Rows for Answers vs Many Columns

I am designing a table structure within a SQL Server 2008 database that will hold the results of an audit. The audit currently has 65 questions, each with possible answers of 0-4 or N/A. The table structure I have created to hold this data (still in testing) is described below. Upon submission, a record is created in the AuditDetail table for each question. If the chosen answer is 0, 1, or 2, the user must input details describing why it is low, how to fix it, and who is responsible (this creates a record in the AuditIssue table). Each question is described by two different categories, named QuestionCategory and ItemCategory.
The issue that I am concerned about is that with my current table design, 65 rows are added to the AuditDetail table for each audit that is submitted. This audit needs to be completed at least 70 times each month (it is used by many departments), so this table structure will add approximately 4,550 rows per month to the AuditDetail table. I am worried that this may negatively affect performance in the future and want to avoid having to redesign the table structure once I move this into a production environment.
The only other solution that I can come up with is to replace the AuditDetail table with a table that has a column for each question and stores the score for each audit in 1 row, across 65+ columns.
I feel that my current design follows the normalization rules, whereas I do not think creating a column for each question would. I am almost certain that the questions will change in the future (perhaps many times), including adding/removing questions and changing existing ones.
My searching for answers to this problem led me to these two sources:
Many rows or many columns
Storing Answers In Columns
I understand that it would not be ideal to add/remove columns each time a question changes. My question is: how badly will adding 4,550 rows per month affect the performance of my queries? I do not know if my situation is the same as the one described in "Storing Answers In Columns", because it seems they were only going to have around 100 rows in their table. If query performance is going to be drastically reduced, is there a better table structure that I have not thought of?
My queries will mostly be used to produce charts showing Total Audits Completed Monthly, Issues Opened vs Closed vs Overdue, the Top 10 questions that produce issues, and Monthly or Daily Audit Score (Answer/Total Possible Points per QuestionCategory, or Answer/Total Possible Points). Each of these charts will need to be sortable by Department, Month, Area, etc.
Confession: I tend to end up using correlated subqueries to produce some of these charts, which I know decreases query performance. I try to work around them, but since I am not a SQL master, I end up stuck with them.
The current table structure i am using for testing is as follows:
**AuditMain:**
--AuditId <-- PK
--DeptNumber <-- FK to Dept Table
--AuditorId <-- FK to Auditor Table
--StartDate
--Area_Id <-- FK to Area Table
**AuditDetail**
--DetailId <-- PK
--QuestionId <-- FK to Question Table
--Answer
--NotApplicable (boolean to determine if they chose N/A, needed to calculate the audit score)
--AuditId <-- FK to AuditMain
**AuditIssue**
--IssueId <-- PK
--IssueDescription
--Countermeasure
--PersonResponsible
--Status
--DueDate
--EndDate
--DetailId <--FK to AuditDetail
**AuditQuestion**
--QuestionId <-- PK
--QuestionNumber (corresponds to the question number on the audit input form)
--QuestionDescription
--QuestionCategoryId <-- FK to QuestionCategory
--ItemCategoryId <-- FK to ItemCategory
**QuestionCategory**
--QuestionCategoryId <-- PK
--CategoryDescription
--CategoryName
**ItemCategory**
--ItemCategoryId <--PK
--ItemCategoryDescription
Thanks for reading through so much explanation. I wanted to err on the side of too much information rather than too little, but please let me know if any further information is needed. I appreciate any and all suggestions!
Unless your production environment is seriously underpowered, it should be able to hold half a million rows in a table without seriously degrading performance. Retrieval performance will be greatly affected by the fields you use for queries and the fields you have built indexes on. This can make the difference between waiting seconds and waiting minutes.
There's too much detail to go into here, but there are many excellent tutorials on database design. The best of these tutorials will teach you how to design not only for performance, but also for future flexibility, which is just as important.
Your table structure looks pretty good at first glance.
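As an illustration, a chart like "Top 10 questions that produce issues" can usually be built with plain joins and GROUP BY rather than correlated subqueries; here is a hypothetical query against the schema you posted:
-- top 10 questions by number of issues raised (SQL Server 2008 syntax)
SELECT TOP 10
       q.QuestionNumber,
       q.QuestionDescription,
       COUNT(i.IssueId) AS IssueCount
FROM AuditQuestion q
JOIN AuditDetail   d ON d.QuestionId = q.QuestionId
JOIN AuditIssue    i ON i.DetailId   = d.DetailId
GROUP BY q.QuestionNumber, q.QuestionDescription
ORDER BY IssueCount DESC;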

database design - when to split tables?

Sometimes creating a separate table would produce much more work. Should I split it anyway?
For example: in my project I have a table of customers. Each customer has his own special price for each product (there are only 5 products, and more products are not planned for the future), and each customer also has unique days of the week when the company delivers the products to him.
Many operations, like changing days/prices for a customer or displaying the days and prices of all customers, would be much easier if the days and product prices were columns in the customers table rather than separate tables. So is it really wrong to create only one big customers table in such a case? What are the drawbacks?
UPDATE: They just informed me that after a year or so there is a chance they will add more products; they say their business won't exceed 20-30 products in any event.
I still can't understand why, in such a case where product prices have no relation to each other (each customer has his own special price), adding rows to a Products table is better than adding columns to the Customers table.
The only benefit I can think of is that a customer who has only 5 products won't have to 'carry' 20 nullable product columns (which saves space on the server). I don't have much experience, so maybe I'm missing the obvious?
Clearly, just saying that one should always normalize is not pragmatic. No advice is always true.
If you can say with certainty that 5 "items" will be enough for a long time, I think it is perfectly fine to just store them as columns if it saves you work.
If your prediction fails and a 6th item needs to be stored, you can add a new column. As long as the number of columns doesn't get out of hand, which is very likely to be the case here, this should not be a problem.
Just be careful with such tactics as the ability of many programmers to predict the future turns out to be very limited.
In the end only one thing counts: Delivering the requested solution at the lowest cost. Purity of code is not a goal.
Normalization is all about data integrity (consistency), nothing else; not about hard, easy, fast, slow, efficient, and other murky attributes. The current design almost certainly allows for data anomalies. If not right now, then the moment you try to track price changes, invoices, orders, etc., it is a dead end.
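For what it's worth, a rough sketch of the normalized alternative (all names here are made up for the example): one row per (customer, product) price and one row per (customer, delivery day), so a 6th, 7th, or 30th product is just a new row, never a schema change.
-- price per customer per product
CREATE TABLE CustomerProductPrice (
    CustomerId INT          NOT NULL,
    ProductId  INT          NOT NULL,
    Price      DECIMAL(9,2) NOT NULL,
    PRIMARY KEY (CustomerId, ProductId)
);

-- delivery days per customer (1 = Monday ... 7 = Sunday)
CREATE TABLE CustomerDeliveryDay (
    CustomerId INT     NOT NULL,
    Weekday    TINYINT NOT NULL,
    PRIMARY KEY (CustomerId, Weekday)
);

-- adding another product for a customer is an INSERT, not an ALTER TABLE
INSERT INTO CustomerProductPrice (CustomerId, ProductId, Price) VALUES (42, 6, 19.95);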

SQL Data Normalisation / Performance

I am working on a web API for the insurance industry and trying to work out a suitable data structure for the quoting of insurance.
The database already contains a "ratings" table which is basically:
sysID (PK, INT IDENTITY)
goods_type (VARCHAR(16))
suminsured_min (DECIMAL(9,2))
suminsured_max (DECIMAL(9,2))
percent_premium (DECIMAL(9,6))
[Unique Index on goods_type, suminsured_min and suminsured_max]
[edit]
Each type of goods typically has 3 - 4 ranges for suminsured
[/edit]
The list of goods_types rarely changes and most queries for insurance will involve goods worth less than $100. Because of this, I was considering de-normalising using tables in the following format (for all values from $0.00 through to $100.00):
Table Name: tblRates[goodstype]
suminsured (DECIMAL(9,2)) Primary Key
premium (DECIMAL(9,2))
Denormalising this data should be easy to maintain as the rates are generally only updated once per month at most. All requests for values >$100 will always be looked up in the primary tables and calculated.
My question(s) are:
1. Am I better off storing the suminsured values as DECIMAL(9,2) or as a value in cents stored in a BIGINT?
2. This de-normalisation method involves storing 10,001 values ($0.00 to $100.00 in $0.01 increments) in possibly 20 tables. Is this likely to be more efficient than looking up the percent_premium and performing a calculation? - Or should I stick with the main tables and do the calculation?
Don't create new tables. You already have an index on goods_type and the min and max values, so this SQL (for a known goods type and value):
SELECT percent_premium
FROM ratings
WHERE goods_type = 'PRECIOUST' AND :PREC_VALUE BETWEEN suminsured_min AND suminsured_max
will use your index efficiently.
The data type you are looking for is smallmoney. Use it.
The plan you suggest would use a binary search on 10,001 rows instead of 3 or 4.
That's hardly a performance improvement; don't do it.
As for the arithmetic, BIGINT will be slightly faster, though I think you will hardly notice that.
I am not entirely sure exactly what calculations we are talking about, but unless they are obnoxiously complicated, they will more than likely be much quicker than looking up data in several different tables. If possible, perform the calculations in the database (i.e. use stored procedures) to minimize the data traffic between your application layers too.
And even if the data loading were quicker, I think the idea of having to update de-normalized data as often as once a month (or even once a quarter) is pretty scary. You can probably do the job pretty quickly, but what about the next person handling the system? Would you require them to learn the database structure, remember which of the 20-some tables need to be updated each time, and do it correctly? I would say the possible performance gain from de-normalizing is not worth the risk of contaminating the data with incorrect information.
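A minimal sketch of that suggestion, using the ratings table described above; the procedure name and the exact formula (whether percent_premium is stored as a percentage or a fraction) are assumptions:
-- premium calculation done in the database rather than the application layer
CREATE PROCEDURE dbo.GetPremium
    @goods_type VARCHAR(16),
    @suminsured DECIMAL(9,2)
AS
BEGIN
    SELECT @suminsured * percent_premium / 100.0 AS premium  -- drop the / 100.0 if percent_premium is a fraction
    FROM   ratings
    WHERE  goods_type = @goods_type
      AND  @suminsured BETWEEN suminsured_min AND suminsured_max;
END;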