Database design for summarized data - SQL

I have a new table I'm going to add to a bunch of other summarized data, basically to take some of the load off by pre-calculating weekly averages.
My question is whether I would be better off with one model over the other: one model with day-of-week as a column plus an additional column for price, or another model with a series of fields for the days of the week, each taking a price.
I'd like to know which would save me speed and/or headaches, or at least what the trade-off is.
I.e.:
ID OBJECT_ID MON TUE WED THU FRI SAT SUN SOURCE
OR
ID OBJECT_ID DAYOFWEEK PRICE SOURCE
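For concreteness, the two models as rough DDL (a sketch only; the table names and column types are just placeholders):

-- Model 1: one column per day of the week
CREATE TABLE weekly_prices (
    id        INT PRIMARY KEY,
    object_id INT NOT NULL,
    mon DECIMAL(10,2), tue DECIMAL(10,2), wed DECIMAL(10,2),
    thu DECIMAL(10,2), fri DECIMAL(10,2), sat DECIMAL(10,2),
    sun DECIMAL(10,2),
    source    INT NOT NULL
);

-- Model 2: one row per (object, day of week)
CREATE TABLE daily_prices (
    id        INT PRIMARY KEY,
    object_id INT NOT NULL,
    dayofweek TINYINT NOT NULL,  -- e.g. 1 = Sunday .. 7 = Saturday
    price     DECIMAL(10,2),
    source    INT NOT NULL
);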

I would give first preference to the following aggregate model:
ID | OBJECT_ID | DATE       | PRICE | SOURCE
---+-----------+------------+-------+-------
 1 | 100       | 2010/01/01 | 10.00 | 0
 2 | 100       | 2010/01/02 | 15.00 | 0
 3 | 100       | 2010/01/03 | 20.00 | 0
 4 | 100       | 2010/01/04 | 12.00 | 0
You would then be able to aggregate the above data to generate averages for every week/month/year very easily and relatively quickly.
To get the list of weekly averages, you would be able to do the following:
SELECT WEEK(date), AVG(price) FROM table GROUP BY WEEK(date);
For some further examples, the following query would return the average price on Sundays:
SELECT AVG(price) FROM table WHERE DAYOFWEEK(date) = 1;
Or maybe get the average daily price for the 8th week of the year:
SELECT AVG(price) FROM table WHERE WEEK(date) = 8;
It would also be quite easy to get monthly or yearly averages:
SELECT MONTH(date), AVG(price) FROM table GROUP BY MONTH(date);
I would only opt for more de-normalized options like the two you proposed if the above aggregations would still be too expensive to compute.
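If they ever do become too expensive, a middle ground is to materialize the weekly averages into their own summary table on a schedule, rather than reshaping the base table. A minimal sketch in MySQL, assuming the base table above is named prices:

-- Sketch: precompute weekly averages into a summary table
-- (grouping by YEAR as well as WEEK avoids mixing weeks from different years)
CREATE TABLE weekly_avg_prices AS
SELECT object_id,
       YEAR(date) AS yr,
       WEEK(date) AS wk,
       AVG(price) AS avg_price
FROM prices
GROUP BY object_id, YEAR(date), WEEK(date);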

I would vote for the second. With the first, you would need some constraints to ensure that any row has only one of MON, TUE, WED, THU, FRI, SAT, SUN. Of course, with the second, you might need some additional reference data to define the days of the week, to populate DAYOFWEEK.
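For example, an "exactly one day populated" rule could be sketched as a CHECK constraint (standard SQL; note that some engines, such as MySQL before 8.0.16, parse but ignore CHECK constraints; the weekly_prices name is assumed from the sketch above):

ALTER TABLE weekly_prices ADD CONSTRAINT one_day_only CHECK (
      (CASE WHEN mon IS NULL THEN 0 ELSE 1 END)
    + (CASE WHEN tue IS NULL THEN 0 ELSE 1 END)
    + (CASE WHEN wed IS NULL THEN 0 ELSE 1 END)
    + (CASE WHEN thu IS NULL THEN 0 ELSE 1 END)
    + (CASE WHEN fri IS NULL THEN 0 ELSE 1 END)
    + (CASE WHEN sat IS NULL THEN 0 ELSE 1 END)
    + (CASE WHEN sun IS NULL THEN 0 ELSE 1 END) = 1
);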
UPDATE:
OK, it wasn't clear that there would always be a price for every day. In that case my point about constraints isn't so valid. I'd still prefer the second model, though; it seems better normalized. I don't know enough about this case to say whether this is a good time to cast off some good normalization practices for clarity and performance, but it might be...

Performance penalty of putting all records in one table

I'm setting up an Azure SQL database to load about 1M rows a day.
I'm planning on loading all the data into one table with the following structure:
TAG_NAME | START_DATETIME | END_DATETIME | READING | READING_UOM | INTERVAL_SECS (computed column)
Each (TAG_NAME, START_DATETIME, END_DATETIME) combination is unique. So the following case is possible:
TAG_NAME | START_DATETIME      | END_DATETIME        | READING | READING_UOM | INTERVAL_SECS (computed column)
X        | 2020-01-01 01:00:00 | 2020-01-01 02:00:00 | 9.8     | m3          | 3600
X        | 2020-01-01 01:00:00 | 2020-01-02 02:00:00 | 232.1   | m3          | 90000
I'm planning to create indexes on TAG_NAME, START_DATETIME and END_DATETIME.
From there I will create views. For example, a view that pulls all the month-long readings for tags X, Y and Z.
Then another view that pulls the minute readings for tags X, Y and D.
And so on...
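For illustration, one such view might be sketched like this (the table name and the "month-long" cutoff are assumptions):

-- Sketch: month-long readings for a few tags
CREATE VIEW MonthlyReadings AS
SELECT TAG_NAME, START_DATETIME, END_DATETIME, READING, READING_UOM
FROM Readings
WHERE TAG_NAME IN ('X', 'Y', 'Z')
  AND INTERVAL_SECS >= 28 * 24 * 3600;  -- treat >= 28 days as month-long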
So my question is: is there a performance impact from loading everything into one table? Should I divide the inputs into 'minute', 'hour', 'month', etc. tables?
As @Grant Fritchey said, the longer the key, the fewer rows stored on a page, and so the greater the index depth. When the index becomes too large, it will impact performance.
Due to the rapid growth of the data, I think you should divide the fact table into several tables, such as an active table and historical archive tables, distinguished by year.
You can consider using columnstore indexes to compress the data and improve query performance.
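For example (a sketch; the table name is an assumption):

-- Compresses the whole table and speeds up large scans/aggregations.
-- Note it replaces any rowstore clustered index; a nonclustered
-- columnstore index is the alternative if rowstore seeks are still needed.
CREATE CLUSTERED COLUMNSTORE INDEX cci_Readings ON dbo.Readings;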
I would use a date dimension table configured with whatever specific columns you may need for slicing and grouping your data. If you only need Year, Month, and Day numbers, then that's all your date dimension would need. But, if you need Hours, Minutes, Weeks, Quarters, or anything else, you can include those columns in the date dimension table too.
Indexing on the date dimension is easy and quick, since the row count is small.
Then, your fact table above would have FK relationships to the date dimension table for your START_DATETIME and END_DATETIME.
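A rough sketch of that shape (all names and types here are assumptions):

-- Date dimension keyed by an integer like 20200101,
-- with date-key columns on the fact table referencing it
CREATE TABLE DimDate (
    DateKey int  NOT NULL PRIMARY KEY,  -- e.g. 20200101
    [Date]  date NOT NULL,
    [Year]  int  NOT NULL,
    [Month] int  NOT NULL,
    [Day]   int  NOT NULL
);

CREATE TABLE FactReadings (
    TAG_NAME       varchar(50)   NOT NULL,
    START_DATETIME datetime2     NOT NULL,
    END_DATETIME   datetime2     NOT NULL,
    StartDateKey   int           NOT NULL REFERENCES DimDate (DateKey),
    EndDateKey     int           NOT NULL REFERENCES DimDate (DateKey),
    READING        decimal(18,3) NOT NULL,
    READING_UOM    varchar(10)   NOT NULL
);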

How to aggregate number of customers in SQL Server?

I have transaction data like this:
| Time_Stamp          | Customer_ID | Amount | Department | Pay_Method  | Channel    |
|---------------------|-------------|--------|------------|-------------|------------|
| 2018-03-07 14:23:33 | 374856829   | 14.63  | Fruit      | Credit Card | Mobile App |
I have written an aggregation procedure like this:
INSERT INTO Days
(
    Year,
    Month,
    Day,
    Department,
    Pay_Method,
    Total_Dollars,
    Total_Transactions,
    Total_Customers
)
SELECT
    YEAR(Time_Stamp),
    MONTH(Time_Stamp),
    DAY(Time_Stamp),
    Department,
    Pay_Method,
    SUM(Amount),
    COUNT(*),
    COUNT(DISTINCT Customer_ID)
FROM
    Transactions
GROUP BY
    YEAR(Time_Stamp),
    MONTH(Time_Stamp),
    DAY(Time_Stamp),
    Department,
    Pay_Method
Which populates a data mart table like this:
| Year | Month | Day | Department | Pay_Method | Total_Dollars | Total_Transactions | Total_Customers |
|------|-------|-----|------------|------------|---------------|--------------------|-----------------|
| 2018 | 3     | 7   | Home       | Cash       | 2398540.57    | 543084             | 325783          |
| 2018 | 3     | 7   | Home       | Credit     | 7458392.47    | 1587695            | 758643          |
So far, so good.
I then have procedures which feed the charts UI like this:
SELECT
    Year,
    Month,
    Day,
    SUM(Total_Dollars),
    SUM(Total_Transactions),
    SUM(Total_Customers)
FROM
    Days
WHERE
    Department = IIF(@Department IS NULL, Department, @Department) AND
    Pay_Method = IIF(@Pay_Method IS NULL, Pay_Method, @Pay_Method)
GROUP BY
    Year,
    Month,
    Day
This all works great for Total_Transactions and Total_Dollars, but not for Total_Customers.
The Total_Customers numbers in the Days table are correct in each row, for that specific combination of Year, Month, Day, Department and Pay_Method, but when two of those rows are summed together, the total becomes inaccurate, because the same customer may have made multiple transactions using different Department(s) and Pay_Method(s) on the same date. The numbers become even more inaccurate when adding days together to get monthly customer counts, etc...
I thought the solution would be to try and trick SQL Server into considering "all" as possible values for the various "group by" fields, and played around with group by and case quite a bit but couldn't figure it out. Essentially, in addition to my Days table containing every specific combination of Year, Month, Day, Department and Pay_Method, I also need to generate rows where Year, Month, Day, Department and Pay_Method are considered as "any" or "all". Lastly, I don't need to generate rows where Year is "any" and Month and Day are specified (although it wouldn't hurt really), as no one cares for totals of March 7th in any year, etc...
Can someone help me write the query to properly populate my Days table?
Your problem is because the "grain" of your model is wrong. Grain is the term given to the level of detail in a fact table.
You always want to store your facts at the finest level of detail, then you can aggregate your data correctly. You were already at that point with your first table.
Rather than aggregating the data (incorrectly) into your second table, simply rewrite or amend that table to break your date/time into the fields you require for reporting.
By the way, if this is truly representative of your data, I suspect that you might actually be hiding an error in your transaction count. You may need a finer level of detail than "department", and I suspect it might be a concept like "product". What would happen to your model if a customer bought both apples and oranges?
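That said, if precomputed "any/all" rows are still wanted, SQL Server can generate them from the transaction-level data in one pass with GROUPING SETS. A sketch against the Transactions table above (extend the set list to whatever combinations the reports need; NULL in a grouping column means "all"):

SELECT
    YEAR(Time_Stamp)  AS [Year],
    MONTH(Time_Stamp) AS [Month],
    DAY(Time_Stamp)   AS [Day],
    Department,
    Pay_Method,
    SUM(Amount)                 AS Total_Dollars,
    COUNT(*)                    AS Total_Transactions,
    COUNT(DISTINCT Customer_ID) AS Total_Customers
FROM Transactions
GROUP BY GROUPING SETS (
    (YEAR(Time_Stamp), MONTH(Time_Stamp), DAY(Time_Stamp), Department, Pay_Method),
    (YEAR(Time_Stamp), MONTH(Time_Stamp), DAY(Time_Stamp)),
    (YEAR(Time_Stamp), MONTH(Time_Stamp)),
    (YEAR(Time_Stamp))
);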

Structuring Month-Based Data in SQL

I'm curious about the best way to structure data in a SQL database when I need to keep track of certain fields and how they differ month to month.
For example, say I had a users table in which I was trying to store 3 different values: name, email, and how many times they've logged in each month. Would it be best practice to create a new column for each month and store the number of times they logged in that month under that column? Or would it be better to create a new row/table for each month?
My instinct says creating new columns is the best way to reduce redundancy; however, I can see it getting a little unwieldy when the number of columns in the table changes over time. (I was also thinking that if I were to do it by column, it would warrant having a total_column that keeps track of all months at a time.)
Thanks!
In my opinion, the best approach is to store each login for each user.
Use a query to summarize the data the way you need it when you query it.
You should only be thinking about other structures if summarizing the detail doesn't meet performance requirements -- which for a monthly report don't seem so onerous.
Whatever you do, storing counts in separate columns is not the right thing to do. Every month, you would need to add another column to the table.
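A sketch of that approach (table and column names are assumptions):

-- One row per login event; summarize at query time
CREATE TABLE logins (
    user_id  INT      NOT NULL,
    login_at DATETIME NOT NULL
);

SELECT user_id,
       YEAR(login_at)  AS yr,
       MONTH(login_at) AS mth,
       COUNT(*)        AS login_count
FROM logins
GROUP BY user_id, YEAR(login_at), MONTH(login_at);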
I'm not an expert but in my opinion, it is best to store data in a separate table (in your case). That way you can manipulate the data easily and you don't have to modify the table design in the future.
PK: UserID & Date, or a new column (e.g. RowNo with auto-increment)
+--------+------------+-----------+
| UserID | Date       | NoOfTimes |
+--------+------------+-----------+
| 01     | 2018.01.01 | 1         |
| 01     | 2018.01.02 | 3         |
| 01     | 2018.01.03 | 5         |
| ..     |            |           |
| 02     | 2018.01.01 | 2         |
| 02     | 2018.01.02 | 6         |
+--------+------------+-----------+
Or
PK: UserID, Year & Month, or a new column (e.g. RowNo with auto-increment)
+--------+------+-------+-----------+
| UserID | Year | Month | NoOfTimes |
+--------+------+-------+-----------+
| 01     | 2018 | Jan   | 10        |
| 01     | 2018 | Feb   | 13        |
+--------+------+-------+-----------+
Before you create the table, please take a look at database normalization, especially the 1st (1NF), 2nd (2NF) and 3rd (3NF) normal forms.
https://www.tutorialspoint.com/dbms/database_normalization.htm
https://www.lifewire.com/database-normalization-basics-1019735
https://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/
https://www.studytonight.com/dbms/database-normalization.php
https://medium.com/omarelgabrys-blog/database-normalization-part-7-ef7225150c7f
Either approach is valid, depending on query patterns and join requirements.
One row for each month
For a user, the row containing the login count for the month will be inserted when data is available for that month. There will be 1 row per month per user. This design will make it easier to do joins on the month column. However, multiple rows will need to be accessed to get data for a user for the year (see the sketch after the example entries).
-- column list
name
email
month
login_count
-- example entries
'user1', 'user1@email.com','jan',100
'user2', 'user2@email.com','jan',65
'user1', 'user1@email.com','feb',90
'user2', 'user2@email.com','feb',75
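For instance, the yearly total then comes from summing those rows (the table name monthly_logins is an assumption):

-- Sketch: a user's yearly total from the one-row-per-month design
SELECT name, email, SUM(login_count) AS yearly_logins
FROM monthly_logins
GROUP BY name, email;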
One row for all months
You do not need to dynamically add columns, since the number of months is known in advance. The table can be created up front to accommodate all months. By default, all month_login_count columns would be initialized to 0. Then, the row would be updated as the login count for each month is populated. There will be 1 row per user. This design is not the best for doing joins by month. However, only one row needs to be accessed to get data for a user for the year.
-- column list
name
email
jan_login_count
feb_login_count
mar_login_count
apr_login_count
may_login_count
jun_login_count
jul_login_count
aug_login_count
sep_login_count
oct_login_count
nov_login_count
dec_login_count
-- example entries
'user1','user1@email.com',100,90,0,0,0,0,0,0,0,0,0,0
'user2','user2@email.com',65,75,0,0,0,0,0,0,0,0,0,0

SQL payments matrix

I want to combine two tables into one:
The first table: Payments
id | 2010_01 | 2010_02 | 2010_03
---+---------+---------+--------
 1 |   3.000 |     500 |       0
 2 |   1.000 |     800 |       0
 3 |     200 |   2.000 |     300
 4 |     700 |   1.000 |     100
The second table is ID and some date (different for every ID)
id | date       |
---+------------+
 1 | 2010-02-28 |
 2 | 2010-03-01 |
 3 | 2010-01-31 |
 4 | 2011-02-11 |
What I'm trying to achieve is to create table which contains all payments before the date in ID table to create something like this:
id | date       | T_00  | T_01  | T_02
---+------------+-------+-------+------
 1 | 2010-02-28 |   500 | 3.000 |
 2 | 2010-03-01 |     0 |   800 | 1.000
 3 | 2010-01-31 |   200 |       |
 4 | 2010-02-11 | 1.000 |   700 |
Where T_00 means payment in the same month as 'date' value, T_01 payment in previous month and so on.
Is there a way to do this?
EDIT:
I'm trying to achieve this in MS Access.
The problem is that I cannot connect the name of the first table's column with the date in the second (the easiest way would be to treat it as a variable).
I added T_00 to T_24 columns to the second (ID) table and was trying to UPDATE those fields:
set T_00 =
iif(year(date)&"_"&month(date)=2010_10,
but I realized that this would be too much code for Access to handle if I wanted to do it for every payment period and every T_xx column.
Even if I wrote the code for T_00, I would have to repeat it for the next 23 periods.
Your Payments table is de-normalized. Those date columns are repeating groups, meaning you've violated First Normal Form (1NF). It's especially difficult because your field names are actually data. As you've found, repeating groups are a complete pain in the ass when you want to relate the table to something else. This is why 1NF is so important, but knowing that doesn't solve your problem.
You can normalize your data by creating a view that UNIONs your Payments table.
Like so:
CREATE VIEW NormalizedPayments (id, Year, Month, Amount) AS
SELECT id,
       2010 AS Year,
       1 AS Month,
       [2010_01] AS Amount
FROM Payments
UNION ALL
SELECT id,
       2010 AS Year,
       2 AS Month,
       [2010_02] AS Amount
FROM Payments
UNION ALL
SELECT id,
       2010 AS Year,
       3 AS Month,
       [2010_03] AS Amount
FROM Payments
And so on if you have more. This is how the Payments table should have been designed in the first place.
It may be easier to use a date field with the value '2010-01-01' instead of Year and Month fields. It depends on your data. You may also want to add WHERE Amount IS NOT NULL to each query in the UNION, or you might want to use Nz([2010_01], 0.000) AS Amount. (The brackets are needed because the column names begin with a digit.) Again, it depends on your data and other queries.
It's hard for me to understand how you're joining from here, particularly how the id fields relate because I don't see how they do with the small amount of data provided, so I'll provide some general ideas for what to do next.
Next you can join your second table to this normalized Payments view. To actually produce the result you want, include a calculated field with the difference in months between the payment and the id's date. Then, create an actual crosstab (pivot) query to format your results, which is the proper way to display data like your tables do.
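As a rough sketch of that last step in Access (assuming the second table is named IdDates and the view above exists), a crosstab can bucket each payment by how many months it falls before the id's date:

TRANSFORM Sum(p.Amount)
SELECT d.id, d.[date]
FROM IdDates AS d INNER JOIN NormalizedPayments AS p ON p.id = d.id
WHERE DateSerial(p.[Year], p.[Month], 1) <= d.[date]
GROUP BY d.id, d.[date]
PIVOT "T_" & Format(DateDiff("m", DateSerial(p.[Year], p.[Month], 1), d.[date]), "00");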

Attributes of my Time dimension table in star schema

I'm building a DW with star schema modeling. I'll use it for a BI project with Pentaho.
I'll of course have a time dimension table. I'll analyze my fact table at different granularities (day, week, month, year, perhaps others).
Should I put one attribute for each of those granularities in my dimension table (so I have one day attribute, one month attribute, one year attribute ...), or should I just store the date and then calculate everything from it (get the month of the date, the year of the date ...)?
Thanks a lot for your help.
In addition to day, week, month, and year, you should think of other attributes like "company holiday", or "fiscal quarter". This can be an enormous resource for driving the same query off of different time windows.
I would add the attributes of the dates as their own columns. This does not take up significantly more space, and generally gives the query optimiser a better shot at working out how many of the dimension table records match a given criterion (for example, that the day_of_month = 31).
Typically, the more, the merrier.
Here is an example I'm using...
ledger@localhost-> select * from date_dimension where date = '2015-12-25';
-[ RECORD 1 ]----+--------------------
date | 2015-12-25
year | 2015
month | 12
monthname | December
day | 25
dayofyear | 359
weekdayname | Friday
calendarweek | 52
formatteddate | 25. 12. 2015
quartal | Q4
yearquartal | 2015/Q4
yearmonth | 2015/12
yearcalendarweek | 2015/52
weekend | Weekday
americanholiday | Holiday
austrianholiday | Holiday
canadianholiday | Holiday
period | Christmas season
cwstart | 2015-12-21
cwend | 2015-12-27
monthstart | 2015-12-01
monthend | 2015-12-31 00:00:00
It's based on queries from the PostgreSQL wiki here... https://wiki.postgresql.org/wiki/Date_and_Time_dimensions
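For reference, a minimal sketch of how the basic columns of such a table can be populated in PostgreSQL (only a subset of the columns above; the rest follow the same pattern):

INSERT INTO date_dimension (date, year, month, monthname, day, dayofyear, weekdayname, calendarweek)
SELECT d::date,
       EXTRACT(YEAR  FROM d)::int,
       EXTRACT(MONTH FROM d)::int,
       TO_CHAR(d, 'FMMonth'),
       EXTRACT(DAY   FROM d)::int,
       EXTRACT(DOY   FROM d)::int,
       TO_CHAR(d, 'FMDay'),
       EXTRACT(WEEK  FROM d)::int
FROM generate_series(DATE '2015-01-01', DATE '2025-12-31', INTERVAL '1 day') AS g(d);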
It would be interesting to augment this with further things:
Religious days (Easter, some of the numerous Saints' days, Ramadan, Jewish festivals, etc)
Statutory holidays for relevant jurisdictions. The firm I work for winds up publicizing Irish banking holidays because a number of the customers pay via bank transfers.
If you operate in France, you might want Lundi, Mardi, Mercredi, ... rather than English day names.
Daylight Saving Time (as true/false) would be a nice addition.