Attributes of my Time dimension table in star schema - sql

I'm building a data warehouse with a star schema. I'll use it for a BI project with Pentaho.
I will of course have a time dimension table. I'll analyze my fact table at different granularities (day, week, month, year, perhaps others).
Should I put one attribute for each of those granularities in my dimension table (one day attribute, one month attribute, one year attribute, ...), or should I store just the date and calculate everything from it (get the month of the date, the year of the date, ...)?
Thanks a lot for your help.

In addition to day, week, month, and year, you should think of other attributes like "company holiday", or "fiscal quarter". This can be an enormous resource for driving the same query off of different time windows.

I would add the attributes of the dates as their own columns. This does not take up significantly more space, and generally gives the query optimiser a better shot at working out how many of the dimension table records match a given criterion (for example, that the day_of_month = 31).
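As a rough sketch of what precomputing those columns could look like, the snippet below builds one dimension row per calendar date using only Python's standard library. The function names (`date_dimension_row`, `build_date_dimension`) and the column names are illustrative assumptions, not something prescribed by Pentaho or by any particular warehouse.

```python
from datetime import date, timedelta

# Explicit English names so the output does not depend on the system locale.
WEEKDAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")


def date_dimension_row(d: date) -> dict:
    """Return precomputed attributes for one calendar date (hypothetical columns)."""
    _iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "date": d.isoformat(),
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "weekday_name": WEEKDAY_NAMES[iso_weekday - 1],
        "iso_week": iso_week,
        "is_weekend": iso_weekday >= 6,  # Saturday = 6, Sunday = 7
    }


def build_date_dimension(start: date, end: date):
    """Yield one row per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield date_dimension_row(current)
        current += timedelta(days=1)


if __name__ == "__main__":
    print(date_dimension_row(date(2015, 12, 25)))
```

Rows like these would be loaded into the dimension table once, so that queries can filter on plain columns such as `quarter` or `is_weekend` instead of recomputing them from the date each time.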

Typically, the more, the merrier.
Here is an example I'm using...
ledger#localhost-> select * from date_dimension where date = '2015-12-25';
-[ RECORD 1 ]----+--------------------
date | 2015-12-25
year | 2015
month | 12
monthname | December
day | 25
dayofyear | 359
weekdayname | Friday
calendarweek | 52
formatteddate | 25. 12. 2015
quartal | Q4
yearquartal | 2015/Q4
yearmonth | 2015/12
yearcalendarweek | 2015/52
weekend | Weekday
americanholiday | Holiday
austrianholiday | Holiday
canadianholiday | Holiday
period | Christmas season
cwstart | 2015-12-21
cwend | 2015-12-27
monthstart | 2015-12-01
monthend | 2015-12-31 00:00:00
It's based on queries from the PostgreSQL wiki here... https://wiki.postgresql.org/wiki/Date_and_Time_dimensions
It would be interesting to augment this with further things:
Religious days (Easter, some of the numerous Saints' days, Ramadan, Jewish festivals, etc)
Statutory holidays for relevant jurisdictions. The firm I work for winds up publicizing Irish banking holidays because a number of the customers pay via bank transfers.
If you operate in France, you might want Lundi, Mardi, Mercredi, ... rather than English day names.
Daylight Saving Time (as true/false) would be a nice addition.

Related

How to aggregate number of customers in SQL Server?

I have transaction data like this:
| Time_Stamp | Customer_ID | Amount | Department | Pay_Method | Channel |
|---------------------|-------------|--------|------------|-------------|------------|
| 2018-03-07 14:23:33 | 374856829 | 14.63 | Fruit | Credit Card | Mobile App |
I have written an aggregation procedure like this:
INSERT INTO Days
(
Year,
Month,
Day,
Department,
Pay_Method,
Total_Dollars,
Total_Transactions,
Total_Customers
)
SELECT
YEAR(Time_Stamp),
MONTH(Time_Stamp),
DAY(Time_Stamp),
Department,
Pay_Method,
SUM(Amount),
COUNT(*),
COUNT(DISTINCT(Customer_ID))
FROM
Transactions
GROUP BY
YEAR(Time_Stamp),
MONTH(Time_Stamp),
DAY(Time_Stamp),
Department,
Pay_Method
Which populates a data mart table like this:
| Year | Month | Day | Department | Pay_Method | Total_Dollars | Total_Transactions | Total_Customers |
|------|-------|-----|------------|------------|---------------|--------------------|-----------------|
| 2018 | 3 | 7 | Home | Cash | 2398540.57 | 543084 | 325783 |
| 2018 | 3 | 7 | Home | Credit | 7458392.47 | 1587695 | 758643 |
So far, so good.
I then have procedures which feed the charts UI like this:
SELECT
Year,
Month,
Day,
SUM(Total_Dollars),
SUM(Total_Transactions),
SUM(Total_Customers)
FROM
Days
WHERE
Department = IIF(@Department IS NULL, Department, @Department) AND
Pay_Method = IIF(@Pay_Method IS NULL, Pay_Method, @Pay_Method)
GROUP BY
Year,
Month,
Day
This all works great for Total_Transactions and Total_Dollars, but not for Total_Customers.
The Total_Customers numbers in the Days table are correct in each row, for that specific combination of Year, Month, Day, Department and Pay_Method, but when two of those rows are summed together, the total becomes inaccurate, because the same customer may have made multiple transactions using different Department(s) and Pay_Method(s) on the same date. The numbers become even more inaccurate when adding days together to get monthly customer counts, etc...
I thought the solution would be to try and trick SQL Server into considering "all" as possible values for the various "group by" fields, and played around with group by and case quite a bit but couldn't figure it out. Essentially, in addition to my Days table containing every specific combination of Year, Month, Day, Department and Pay_Method, I also need to generate rows where Year, Month, Day, Department and Pay_Method are considered as "any" or "all". Lastly, I don't need to generate rows where Year is "any" and Month and Day are specified (although it wouldn't hurt really), as no one cares for totals of March 7th in any year, etc...
Can someone help me write the query to properly populate my Days table?
Your problem is that the "grain" of your model is wrong. Grain is the term for the level of detail in a fact table.
You always want to store your facts at the finest level of detail; only then can you aggregate your data correctly. You were already at that point with your first table.
Rather than aggregating the data (incorrectly) into your second table, simply rewrite or amend the first table to break your date/time into the fields you require for reporting.
By the way, if this is truly representative of your data, I suspect that you might actually be hiding an error in your transaction count. You may need a finer level of detail than "department", and I suspect it might be a concept like "product". What would happen to your model if a customer bought both apples and oranges?
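To make the grain issue concrete, here is a small sketch using Python's built-in sqlite3 module rather than SQL Server. The table layout and the three sample rows are made up for illustration. It compares summing the pre-aggregated distinct counts with counting distinct customers directly at the transaction grain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    " time_stamp TEXT, customer_id INTEGER, amount REAL,"
    " department TEXT, pay_method TEXT)"
)
# Hypothetical sample: customer 1 shops in two departments on the same day.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        ("2018-03-07 09:00:00", 1, 10.0, "Fruit", "Cash"),
        ("2018-03-07 10:00:00", 1, 5.0, "Home", "Credit Card"),
        ("2018-03-07 11:00:00", 2, 7.5, "Fruit", "Cash"),
    ],
)

# What summing the rows of a pre-aggregated table amounts to:
# distinct customers per (day, department, pay_method), then added up.
summed = conn.execute(
    """
    SELECT SUM(total_customers) FROM (
        SELECT COUNT(DISTINCT customer_id) AS total_customers
        FROM transactions
        GROUP BY date(time_stamp), department, pay_method
    )
    """
).fetchone()[0]

# Counting distinct customers straight from the finest grain.
exact = conn.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM transactions"
    " WHERE date(time_stamp) = '2018-03-07'"
).fetchone()[0]

print("summed per-group counts:", summed)
print("distinct customers that day:", exact)
```

With this sample the summed figure counts customer 1 twice, which is the overcount described in the question. Distinct counts are not additive across groups, so they have to be computed from the detailed rows for each combination of filters.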

Dynamically choose a column to query in Redshift

I want to take a one-string query response to populate a SELECT statement in another query. Some say it's impossible, but Redshift makes fools of us all.
Imagine I have a table day_of_week as follows:
Day of Week | Weekend
---------------------
Monday | No
Tuesday | No
Wednesday | No
Thursday | No
Friday | No
Saturday | Yes
Sunday | Yes
And another party_time like this:
Yes | No
--------------------------
All the time | None of the time
I want to allow someone to just tell me a day (e.g., "Wednesday") and then use the resulting Weekend value to query party_time.
eg
SELECT (SELECT Weekend FROM day_of_week WHERE "Day of Week" = 'Wednesday')
FROM party_time
Result: 'None of the time'
How?
SQL itself isn't dynamic/self-referential although most implementations have some sort of meta language to partly get around that.
The most obvious solution for your problem is to change table party_time:
Test Meaning
--------------------------
Yes All the time
No None of the time
Then you can use a join or a sub-select to get to your answer:
select meaning
from party_time
inner join day_of_week
on weekend = test
where Day_of_Week = 'Wednesday'
Example: http://sqlfiddle.com/#!9/f10b7/1

Can you define a custom "week" in PostgreSQL?

To extract the week of a given year we can use:
SELECT EXTRACT(WEEK FROM timestamp '2014-02-16 20:38:40');
However, I am trying to group weeks together in a bit of an odd format. My start of a week would begin on Mondays at 4am and would conclude the following Monday at 3:59:59am.
Ideally, I would like to create a query that provides a start and end date, then groups the total sales for that period by the weeks laid out above.
Example:
SELECT
(some custom week date),
SUM(sales)
FROM salesTable
WHERE
startDate BETWEEN 'DATE 1' AND 'DATE 2'
I am not looking to change the EXTRACT() function, rather create a query that would pull from the following sample table and output the sample results.
If 'DATE 1' in query was '2014-07-01' AND 'DATE 2' was '2014-08-18':
Sample Table:
itemID | timeSold | price
------------------------------------
1 | 2014-08-13 09:13:00 | 12.45
2 | 2014-08-15 12:33:00 | 20.00
3 | 2014-08-05 18:33:00 | 10.00
4 | 2014-07-31 04:00:00 | 30.00
Desired result:
weekBegin | priceTotal
----------------------------------
2014-07-28 04:00:00 | 30.00
2014-08-04 04:00:00 | 10.00
2014-08-11 04:00:00 | 32.45
Produces your desired output:
SELECT date_trunc('week', time_sold - interval '4h')
+ interval '4h' AS week_begin
, sum(price) AS price_total
FROM tbl
WHERE time_sold >= '2014-07-01 0:0'::timestamp
AND time_sold < '2014-08-19 0:0'::timestamp -- start of next day
GROUP BY 1
ORDER BY 1;
Explanation
date_trunc() is the superior tool here. You are not interested in week numbers, but in actual timestamps.
The "trick" is to subtract 4 hours from selected timestamps before extracting the week - thereby shifting the time frame towards the earlier bound of the ISO week. To produce the desired display, add the same 4 hours back to the truncated timestamps.
But apply the WHERE condition to the unmodified timestamps. Also, never use BETWEEN with timestamps, which have fractional digits. Use WHERE conditions as presented above. See:
Unexpected results from SQL query with BETWEEN timestamps
This operates on data type timestamp, i.e. with (shifted) "weeks" according to the current time zone. You might want to work with timestamptz instead. See:
Ignoring time zones altogether in Rails and PostgreSQL
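For readers who want to check the shift-truncate-shift idea outside the database, here is a minimal Python sketch of the same arithmetic. It assumes naive timestamps (no time zone) and weeks that start on Monday, as date_trunc('week', ...) does. The helper name `week_begin` is made up for illustration, and the rows are the sample data from the question.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from decimal import Decimal

WEEK_OFFSET = timedelta(hours=4)  # custom weeks start Monday 04:00


def week_begin(ts: datetime) -> datetime:
    """Start of the custom week (Monday 04:00) that contains ts."""
    shifted = ts - WEEK_OFFSET                               # move 04:00 to midnight
    monday = shifted - timedelta(days=shifted.weekday())     # back to Monday
    midnight = monday.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + WEEK_OFFSET                            # shift back for display


sales = [
    (datetime(2014, 8, 13, 9, 13), Decimal("12.45")),
    (datetime(2014, 8, 15, 12, 33), Decimal("20.00")),
    (datetime(2014, 8, 5, 18, 33), Decimal("10.00")),
    (datetime(2014, 7, 31, 4, 0), Decimal("30.00")),
]

totals = defaultdict(Decimal)  # Decimal() starts each bucket at 0
for sold_at, price in sales:
    totals[week_begin(sold_at)] += price

for start in sorted(totals):
    print(start, totals[start])
```

A sale at 03:59:59 on a Monday lands in the previous week's bucket, which is the boundary behaviour the question asks for.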

How can I see if a date is on a weekend?

I have a table:
ID | Name | TDate
1 | John | 1 May 2013, 8:67AM
2 | Jack | 2 May 2013, 6:43AM
3 | Adam | 3 May 2013, 9:53AM
4 | Max | 4 May 2013, 2:13AM
5 | Leny | 5 May 2013, 5:33AM
I need a query that will return all the items where TDate is a weekend. How would I write such a query?
WHAT I HAVE SO FAR
select
table.*,
EXTRACT (DAY FROM table.tdate )
from table
I did a select using EXTRACT to just see if I can get the right values. However, EXTRACT with the parameter DAY returns the day of the month. If I instead use WEEKDAY, as per the documentation here, then I get error:
ERROR: timestamp units "weekday" not recognized
SQL state: 22023
limit 1250
EDIT
TDate has a data type of datetime (timestamp). I just wrote it like that for easy reading. But regardless of the type, I could easily cast between types if need be.
I know the dates 4 May and 5 May are weekends (they fall on a Saturday and a Sunday). Does Firebird allow for a way to write a query that will return dates if they fall on weekends?
Try this:
SELECT ID, Name, TDate
FROM your_table
WHERE EXTRACT(WEEKDAY FROM TDate) IN (6,0)
UPDATE
condition must be (0,6) not (0,1).
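As a quick sanity check of the (0, 6) condition, the sketch below reproduces the same numbering in Python, assuming the convention of 0 = Sunday through 6 = Saturday. The helper names are hypothetical, and only the dates from the sample table are used (times are left out).

```python
from datetime import date


def weekday_sunday_zero(d: date) -> int:
    """Day number with 0 = Sunday ... 6 = Saturday."""
    return d.isoweekday() % 7  # isoweekday(): Monday = 1 ... Sunday = 7


def is_weekend(d: date) -> bool:
    return weekday_sunday_zero(d) in (0, 6)


rows = [
    (1, "John", date(2013, 5, 1)),
    (2, "Jack", date(2013, 5, 2)),
    (3, "Adam", date(2013, 5, 3)),
    (4, "Max", date(2013, 5, 4)),
    (5, "Leny", date(2013, 5, 5)),
]

weekend_names = [name for _id, name, tdate in rows if is_weekend(tdate)]
print(weekend_names)
```

Only the rows dated 4 May and 5 May 2013 pass the filter, matching what the question expects.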

Database design for summarized data

I have a new table I'm going to add to a bunch of other summarized data, basically to take some of the load off by calculating weekly avgs.
My question is whether I would be better off with one model over the other. One model with days of the week as a column with an additional column for price or another model as a series of fields for the DOW each taking a price.
I'd like to know which would save me in speed and/or headaches? Or at least the trade off.
IE.
ID OBJECT_ID MON TUE WED THU FRI SAT SUN SOURCE
OR
ID OBJECT_ID DAYOFWEEK PRICE SOURCE
I would give first preference to the following aggregate model:
ID | OBJECT_ID | DATE | PRICE | SOURCE
---+-----------+------------+--------+--------
1 | 100 | 2010/01/01 | 10.00 | 0
2 | 100 | 2010/01/02 | 15.00 | 0
3 | 100 | 2010/01/03 | 20.00 | 0
4 | 100 | 2010/01/04 | 12.00 | 0
You would then be able to aggregate the above data to generate averages for every week/month/year very easily and relatively quickly.
To get the list of weekly averages, you would be able to do the following:
SELECT WEEK(date), AVG(price) FROM table GROUP BY WEEK(date);
For some further examples, the following query would return the average price on Sundays:
SELECT AVG(price) FROM table WHERE DAYOFWEEK(date) = 1;
Or maybe get the average daily price for the 8th week of the year:
SELECT AVG(price) FROM table WHERE WEEK(date) = 8;
It would also be quite easy to get monthly or yearly averages:
SELECT MONTH(date), AVG(price) FROM table GROUP BY MONTH(date);
I would only opt for more de-normalized options like the two you proposed if the above aggregations would still be too expensive to compute.
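The queries above use MySQL-style date functions. As a runnable sketch of the same one-row-per-day idea, here is a version using Python's built-in sqlite3 module, where `strftime` stands in for WEEK(), MONTH() and DAYOFWEEK(). The table name `daily_price` is an assumption, and the dates are stored in ISO format (YYYY-MM-DD) so SQLite can parse them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_price ("
    " id INTEGER PRIMARY KEY, object_id INTEGER,"
    " date TEXT, price REAL, source INTEGER)"
)
conn.executemany(
    "INSERT INTO daily_price (object_id, date, price, source) VALUES (?, ?, ?, ?)",
    [
        (100, "2010-01-01", 10.0, 0),
        (100, "2010-01-02", 15.0, 0),
        (100, "2010-01-03", 20.0, 0),
        (100, "2010-01-04", 12.0, 0),
    ],
)

# Weekly averages; %W numbers weeks starting on Monday (00-53).
weekly = conn.execute(
    "SELECT strftime('%Y-%W', date) AS week, AVG(price)"
    " FROM daily_price GROUP BY week ORDER BY week"
).fetchall()

# Average price on Sundays; %w returns '0' for Sunday.
sunday_avg = conn.execute(
    "SELECT AVG(price) FROM daily_price WHERE strftime('%w', date) = '0'"
).fetchone()[0]

# Monthly averages.
monthly = conn.execute(
    "SELECT strftime('%Y-%m', date) AS month, AVG(price)"
    " FROM daily_price GROUP BY month ORDER BY month"
).fetchall()

print(weekly)
print(sunday_avg)
print(monthly)
```

Note that week numbering rules differ between databases, so the week labels here will not necessarily match what MySQL's WEEK() returns for the same dates.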
I would vote for the second. With the first, you would need some constraints to ensure that any row has only one of MON, TUE, WED, THU, FRI, SAT, SUN. Of course, with the second, you might need some additional reference data to define the days of the week, to populate DAYOFWEEK.
UPDATE:
Ok it wasn't clear there would always be a price for every day. In that case my point about constraints isn't so valid. I'd still prefer the second model though, it seems better normalized. I don't know enough about this case now to say if this is a good time to cast off some good normalization practices for clarity and performance, but it might be...