Data Warehouse - Storing unique data over time - sql

Basically we are building a reporting dashboard for our software. We are giving the Clients the ability to view basic reporting information.
Example: (I've removed 99% of the complexity of our actual system out of this example, as this should still get across what I'm trying to do)
One example metric would be...the number of unique products viewed over a certain time period. AKA, if 5 products were each viewed by customers 100 times over the course of a month. If you run the report for that month, it should just say 5 for number of products viewed.
Are there any recommendations on how to go about storing data in such a way where it can be queried for any time range, and return a unique count of products viewed. For the sake of this example...lets say there is a rule that the application cannot query the source tables directly, and we have to store summary data in a different database and query it from there.
As a side note, we have tons of other metrics we are storing, which we store aggregated by day. But this particular metric is different because of the uniqueness issue.
I personally don't think it's possible. And our current solution is that we offer 4 pre-computed time ranges where metrics affected by uniqueness are available. If you use a custom time range, then that metric is no longer available because we don't have the data pre-computed.

Your problem is that you're trying to change the grain of the fact table. This can't be done.
Your best option is what I think you are doing now - define aggregate fact tables at the grain of day, week and month to support your performance constraint.
You can address the custom time range simply by advising your users that this will be slower than the standard aggregations. For example, a user wanting to know the counts of unique products sold on Tuesdays can write a query like this, at the expense of some performance loss:
select distinct dim_prod.pcode
,count(*)
from fact_sale
join dim_prod on dim_prod.pkey = fact_sale.pkey
join dim_date on dim_date.dkey = fact_sale.dkey
where dim_date.day_name = 'Tuesday'
group by dim_prod.pcode
The query could also be written against a daily aggregate rather than a transactional fact, and as it would be scanning less data it would run faster, maybe even meeting your need

From the information that you have provided, I think you are trying to measure ' number of unique products viewed over a month (for example)'.
Not sure if you are using Kimball methodologies to design your fact tables. I believe in Kimball methodology, an Accumulating Snapshot Fact table will be recommended to meet such a requirement.
I might be preaching to the converted(apologies in that case), but if not then I would let you go through the following link where the experts have explained the concept in detail:
http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
I have also included another link from Kimball, which explains different types of fact tables in detail:
http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/
Hope that explains the concepts in detail. More than happy to answer any questions(to the best of my ability)
Cheers
Nithin

Related

ER Diagram for Government Statistics

I am new to database design and I am trying to practice with available government statistics for a small country. I have found almost 100 tables that store information collected for given years and months from a specific region. Some tables are updated monthly, while others are updated annually. I believe this means that in each table, there will be a natural composite PK made up of the year and month, or simply a PK made up of the year.
ER Diagram
In the above image, each parent attribute of Trip Survey represents one of the many data tables I have collected from public databanks specific to the region being researched (e.g. satisfaction_level, motivation_level, amount_spent all represent different surveys on the same population). Does it make sense to combine all of the tables into one table (e.g. Trip Survey)?
I'm not sure if my relationships are accurate (total and partial participation). My goal is to be able to queries the data to find points of correlation and make predictions about the future. I want to try and connect all of the tables over time.
The surveys collected can cover nearly any topic, but the common thread is they represent a moment in time, either monthly or annually. I want to eventually add a table of significant political events that may reflect outliers from trends.
Example Result: When motivation levels were low in 2018, spending was also down and length of stay was shorter relative to 'n' period.
As a newbie, any and all help is greatly appreciated.
Thank you
Simplify simplify simplify.
Start with one table, with at least some columns you comprehend. Load it into some dbms (pick one with geospatial capabilities and windowing functions, you may want them later: recent versions of MariaDB, MySQL and PostreSQL are fine choices). Import your table. This can be a pain in the axx neck to get right, but do your best to get it right anyhow.
Don't worry about primary keys or unique indexes when starting out. You're just exploring the data, not building it. Don't worry about buying or renting a server: most laptops can handle this kind of exploration just fine.
Pick a client program that keeps a history of the queries you put into it. HeidiSQL is a good choice. The relatively new Datagrip from Jetbrains is worth a look. Avoid Microsoft's SQL Server Management Studio: no history feature. (You'll often want to go back to something you tried a few hours or days ago when you're exploring, so the query history feature is vital.)
Then fiddle around with queries, especially aggregates ... e.g.
SELECT COUNT(*), year, origin, destination
FROM trip
GROUP BY year, origin, destination;
Look for interesting stuff you can glean from the one table. Get the hang of it. Then add another table that can be JOINed easily to the first table. Repeat your exploration.
That should get you started. Once you begin to understand your dataset, you can start ranking stuff, working out quintiles, and all that.
And, when you have to update or augment your data without reloading it, you'll need various primary / unique keys. That's in the future for you.

At what scale of data is the ROI of partitioning the most valuable?

So I'm looking into data warehousing and partitioning and am very curious at to what scale makes the most sense for partitioning a data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using Sql Server 2012. But I'm curious as to how others view this question in relation to the core concept rather than the implementation (Unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? Really important, different strategies for Oracle vs SQLServer vs IBM, etc.
Sample queries and run times. Partitions usage depends on the conditions in your where clause, what are you filtering on?
Does it make sense to create/use aggregate tables? Seems like a monthly aggregate would save you some time.
Partitions usage depends on the conditions in your where clause, what are you filtering on?
Lots of options based on the hardware and storage options available to you, need more details to make a more specific recommendation.
Here is an Ms-SQL 2012 database with 7 million records a day, with an ambition to grow the database to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of 320 partitions, which is well chosen below the maximum of 1000 partitions within a scheme. The maximum size for one partition in one table is now approx. 10 Gb, which makes it much easier to handle than the 3Tb size of the total.
A new file in the partition scheme is used for each new year. The 500Gb datafiles are suitable for backup and deletion.
When calculating data for one month the 4 processors are working in parallel to handle one partition each.

How to build a proper DB schema to have "periodic snapshots" of a table for a selected day? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Problem to be solved:
Im new to DataBases and Im trying to find out the best way to store changes in a table, that is a daily snapshot of some statuses: eg. "hotel_room_rentals" table (with 20 columns - every can change).
Id like to be able to generate that table for a selected day (e.g. data inside changes on production, so I have to store it somewhere else), or do some other transformations on it (e.g. average number of days rented in a period)
My theoretical example - detailed:
Let's say that Im creating a DB for a hotel.
In the production system I have a table that shows info for all 10 000 rooms in the hotel.
This is a daily snapshot - let's assume that the table is updated once per day.
Some attributes of a room change often: e.g. is_rented; customer_number, rate_usd.
Some attributes dont change too often: e.g. disabled_room, room_color, type_of_furniture.
Room_number obviously does not change (primary key)
Now I want to find the best way to track changes in this table; the best way to create statistics on base of this table (e.g. average number of days rented in a period) and to be able to generate the table for selected date (e.g. 2013-01-01)
MY IDEA:
Since I have no clue about databases, my idea is to copy the whole table every day, with 1 more column, called "DB_dump_date" (with a date). This is a pretty straightforward approach, which will probably require a lot of space; since my 10k rooms table, will have to be copied 365 times in a year.
OTHER SOLUTIONS:
On some other website, I was recommended to create two tables:
"Reservation" table with these columns: Startdate Enddate Room Rate Occupant_name
Then to transform this table into a FactReservations table: Date Room Is_occupied Rate Occupant_name
I do not understand how does this help me... in fact I assume I would have to make 20 intermediary tables and then 20 Fact tables (since I have 20 columns in my database).
QUESTIONS:
What are the recommended ways to deal with such problems?
Is there any DB schema that is prepared to deal with it, without the user making magic ETLs? (e.g. a DB that can optimize the problem by itself)
What are the alternatives?
How would you, smart people, do this? (preferably in MS Access... or some freeware technology)
edit:
one more thing - everything can change in the table, not only room reservetions, everything; and I want to be able to track the changes
stop - slow down - and take a breath.
do not - repeat do not make copies of tables each day. this approach is way off base.
your problem is a normalization problem. as you indicate - you have other suggestions on how to normalize - this is the direction you want to go.
Your goal will be to find a structure that accommodates the SQL statements that can answer your questions (and hopefully many more that you haven't thought up yet) This will be one static model where the tables do not change or get copied, but are instead static - and the only thing that changes is the data inside the tables. (ideally - to me there will also be few to no updates, only inserts)
You will certainly need a ROOM table, and a CUSTOMER table, and then a relation between them possibly RESERVATION.
these can then fill up - and you can get all the answers to the questions you posed without any copying or materialization or anything.. just SQL.
You need to focus on the requirements and start there. So far for requirements I see are:
-Generate that table for a selected day
-average number of days rented in a period
If we consider two extremes of design, at the more complex end would be a datamart with SCD tables, tracking changes to rooms, and at the simple end would be some kind of log table, along the lines of what you have already mentioned.
Reading between the lines, I don't really see any requirement for knowing the attributes of a room on a given day, but I do see a requirement for analysis of historical transactions.
So my suggestion is have a good hard think about your requirements before you start designing the database.
There is no magic design to cover this automatically. Dimensional design is a standard way of modelling business data to allow for easy analysis, but it might be over the top for your requirement.
Welcome to the world of databases! With that in mind – take almost everything that you know about Excel and throw it out the window. Whereas it’s much more difficult in Excel to define relationships between two sheets of a workbook and report off of those two different sheets, so the majority of the time it’s easier to simply copy the same data down a single sheet, it’s trivially easy to do using Access or any other relational database.
Typically what you’d want to do is create several normalized tables and define a relationship between them. Then, when querying the view, you can easily join between the tables to get the data that you need.
So, working off of the assumption that you’re building this for simple reporting and not to create a property management system (if you are looking at that – I’d recommend that you look at some of the players in the industry, like Micros or Agilysys), based on my experience working in the industry, I’d recommend the following table layout:
Reservations – this holds the reservation information (guest name,
arrival date, departure date, check-in date, check-out date, rate if
you use a blended rate, etc.)
Rooms – this holds information on your rack (number, wing code, max
guests, # beds, smoking/non, view, type, etc.)
Room Status – Only if you need to track if a room is on
reserve/hold/OOO/OTM (Status type, date start, date end)
Room Status Types – Types of room status holds and how it affects
inventory (type, out of inventory flag)
Rates (if you don’t use a blended rate) – one entry per reservation
per night (guest, rate)
Personally, I’m a huge fan of using surrogate keys for the unique identifiers, because all too often I've been burned where something changes in the business process and a natural key that was previously unique all of a sudden can be duplicated. In that vein, each table would have a surrogate key and the joins would be as follows:
Reservations – Rooms (many to one)
Rooms – Room Status (one to many)
Room Status – Room Status Types (many to one)
Reservations – Rates (one to many)
If you define the relationships properly in Access (i.e. foreign key relationships in other DBMS), it should automatically use them to build your joins when creating your queries (called Views in just about every other DBMS) or reports.
For learning about databases I’d recommend that you review:
Wikipedia on Join types
Wikipedia on Slowly Changing Dimension (you could use some of
these techniques to record changes in room information over time)
Wikipedia on Relational Databases
Office documentation on Access
Kimball Group Design Tips (great for data warehouse/datamart
design)
if you need to use your existing table then the following is not applicable. If the data can be migrated to a new schema then this will readily address the challenge. TRE is an approach which uses the current view paradigm for development but fully supports the time dimensions of data (which are system time=when the data goes into the db and valid time=the business time which applies to the data). By working in the current view approach of TRE this sort of problem is straightforward. Take a look at:- http://youtu.be/V1EcsuJxUno

Best way to calculate sum depending on dates with SQL

I don't know a good way to maintain sums depending on dates in a SQL database.
Take a database with two tables:
Client
clientID
name
overdueAmount
Invoice
clientID
invoiceID
amount
dueDate
paymentDate
I need to propose a list of the clients and order it by overdue amount (sum of not paid past invoices of the client). On big database it isn't possible to calculate it in real time.
The problem is the maintenance of an overdue amount field on the client. The amount of this field can change at midnight from one day to the other even if nothing changed on the invoices of the client.
This sum changes if the invoice is paid, a new invoice is created and due date is past, a due date is now past and wasn't yesterday...
The only solution I found is to recalculate every night this field on every client by summing the invoices respecting the conditions. But it's not efficient on very big databases.
I think it's a common problem and I would like to know if a best practice exists?
You should read about data warehousing. It will help you to solve this problem. It looks similar as what you just said
"The only solution I found is to recalculate every night this field
on every client by summing the invoices respecting the conditions. But
it's not efficient on very big databases."
But it has something more than that. When you read it, try to forget about normalization. Its main intention is for 'show' data, not 'manage' data. So, you would feel weird at beginning but if you understand 'why we need data warehousing', it will be very very interesting.
This is a book that can be a good start http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247 , classic one.
Firstly, I'd like to understand what you mean by "very big databases" - most RDBMS systems running on decent hardware should be able to calculate this in real time for anything less than hundreds of millions of invoices. I speak from experience here.
Secondly, "best practice" is one of those expressions that mean very little - it's often used to present someone's opinion as being more meaningful than simply an opinion.
In my opinion, by far the best option is to calculate it on the fly.
If your database is so big that you really can't do this, I'd consider a nightly batch (as you describe). Nightly batch runs are a pain - especially for systems that need to be available 24/7, but they have the benefit of keeping all the logic in a single place.
If you want to avoid nightly batches, you can use triggers to populate an "unpaid_invoices" table. When you create a new invoice record, a trigger copies that invoice to the "unpaid_invoices" table; when you update the invoice with a payment, and the payment amount equals the outstanding amount, you delete from the unpaid_invoices table. By definition, the unpaid_invoices table should be far smaller than the total number of invoices; calculating the outstanding amount for a given customer on the fly should be okay.
However, triggers are nasty, evil things, with exotic failure modes that can stump the unsuspecting developer, so only consider this if you have a ninja SQL developer on hand. Absolutely make sure you have a SQL query which checks the validity of your unpaid_invoices table, and ideally schedule it as a regular task.

How to handle archiving for seasonal database values on SQL Server

I am on SQL Server 2008 R2 and I am currently developing a database structure which contains seasonal values for some products.
By seasonal I mean that those values won't be useful after a particular date in terms of customer use. But, those values will be used for statistical results by internal stuff.
On the sales web site, we will add a feature for product search and one of my aim is to make this search as optimized as possible. But, more row inside the database table, less fast this search will become. So, I consider archiving the unused values.
I can handle auto archiving with SQL Server Jobs automatically. No problem there. But I am not sure how I should archive those values.
Best way I can come up with is that I create another table inside the same database with same columns and put them there.
Example :
My main table name is ProductPrices and there a primary key has been
defined for this database. Then, I have created another table named
ProdutcPrices_archive. I created a primary key field for this table
as well and the same columns as ProductPrices table except for
ProdutPrices primary key value. I don't think it is useful to
archive that value (do I think correct?).
For the internal use, I consider putting two table values together
with UNION (Is that the correct way?).
This database is meant to use for long time and it should be designed with best structure. I am not sure if I miss something here for the long run.
Any advice would be appreciated.
I'd consider one of two options initially
Use partitioning to separate the single table into current working set and archive data.
No need to use an archive table
Add validForm, ValidTo columns to implement a type 2 SCD
Then add an indexed view for ValidTo IS NULL to get the current set of data
I wouldn't have 2 separate tables if all data has to be "on-line" in one database.
This leads to a 3rd option: an entirely separate database with all data. Only "current" data stays in live. (as #Mike_Walsh's answer explains)
The indexed view option is easiest and works with standard edition (with NOEXPAND hint)
gbn brings up some good approaches. I think the "right" longer term answer for you is the t3rd option, though.
It sounds like you have two business use cases of your data -
1.) Real time Online Transaction Processing (OLTP). This is the POS transactions, inventory management, quick "how did receipts look today, how is inventory, are we having any operational problems?" kind of questions and keeps the business running day to day. Here you want the data necessary to conduct operations and you want a database optimized for updates/inserts/etc.
2.) Analytical type questions/Reporting. This is looking at month over month numbers, year over year numbers, running averages. These are the questions that you ask as that are strategic and look at a complete picture of your history - You'll want to see how last years Christmas seasonal items did against this years, maybe even compare those numbers with the seasonal items from that same period 5 years ago. Here you want a database that contains a lot more data than your OLTP. You want to throw away as little history as possible and you want a database largely optimized for reporting and answering questions. Probably more denormalized. You want the ability to see things as they were at a certain time, so the Type 2 SCDs mentioned by gbn would be useful here.
It sounds to me like you need to create a reporting database. You can call it a data warehouse, but that term scares people these days. Doesn't need to be scary, if you plan it properly it doesn't have to take you 6 years and 6 million dollars to make ;-)
This is definitely a longer term answer but in a couple years you'll be happy you spent the time creating one. A good book to understand the concept of dimensional modeling and thinking about data warehouses and their terminology is The Data Warehouse Toolkit.