How to calculate running total (month to date) in SQL Server 2008 - sql

I'm trying to calculate a month-to-date total using SQL Server 2008, specifically a month-to-date count of activities per representative. Here are the results I want to generate:
| REPRESENTATIVE_ID | MONTH | WEEK | TOTAL_WEEK_ACTIVITY_COUNT | MONTH_TO_DATE_ACTIVITIES_COUNT |
|-------------------|-------|------|---------------------------|--------------------------------|
| 40 | 7 | 7/08 | 1 | 1 |
| 40 | 8 | 8/09 | 1 | 1 |
| 40 | 8 | 8/10 | 1 | 2 |
| 41 | 7 | 7/08 | 2 | 2 |
| 41 | 8 | 8/08 | 4 | 4 |
| 41 | 8 | 8/09 | 3 | 7 |
| 41 | 8 | 8/10 | 1 | 8 |
From the following tables:
ACTIVITIES_FACT table
+-------------------+------+-----------+
| Representative_ID | Date | Activity |
+-------------------+------+-----------+
| 41 | 8/03 | Call |
| 41 | 8/04 | Call |
| 41 | 8/05 | Call |
+-------------------+------+-----------+
LU_TIME table
+-------+-----------------+--------+
| Month | Date | Week |
+-------+-----------------+--------+
| 8 | 8/01 | 8/08 |
| 8 | 8/02 | 8/08 |
| 8 | 8/03 | 8/08 |
| 8 | 8/04 | 8/08 |
| 8 | 8/05 | 8/08 |
+-------+-----------------+--------+
I'm not sure how to do this: I keep running into problems with multiple-counting or aggregations not being allowed in subqueries.

A running total is the summation of a sequence of numbers which is updated each time a new number is added to the sequence, simply by adding the value of the new number to the running total.
I think he wants a running total for the month for each Representative_Id, so a simple GROUP BY week isn't enough. He probably wants his Month_To_Date_Activities_Count to be updated at the end of every week.
This query gives a running total (month to end-of-week date) ordered by Representative_Id, Week
SELECT a.Representative_ID, l.month, l.Week, Count(*) AS Total_Week_Activity_Count
,(SELECT count(*)
FROM ACTIVITIES_FACT a2
INNER JOIN LU_TIME l2 ON a2.Date = l2.Date
AND a.Representative_ID = a2.Representative_ID
WHERE l2.week <= l.week
AND l2.month = l.month) Month_To_Date_Activities_Count
FROM ACTIVITIES_FACT a
INNER JOIN LU_TIME l ON a.Date = l.Date
GROUP BY a.Representative_ID, l.Week, l.month
ORDER BY a.Representative_ID, l.Week
| REPRESENTATIVE_ID | MONTH | WEEK | TOTAL_WEEK_ACTIVITY_COUNT | MONTH_TO_DATE_ACTIVITIES_COUNT |
|-------------------|-------|------|---------------------------|--------------------------------|
| 40 | 7 | 7/08 | 1 | 1 |
| 40 | 8 | 8/09 | 1 | 1 |
| 40 | 8 | 8/10 | 1 | 2 |
| 41 | 7 | 7/08 | 2 | 2 |
| 41 | 8 | 8/08 | 4 | 4 |
| 41 | 8 | 8/09 | 3 | 7 |
| 41 | 8 | 8/10 | 1 | 8 |
SQL Fiddle Sample
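For reference, on SQL Server 2012 or later the same running total can be written with a windowed SUM instead of the correlated subquery. This is only a sketch assuming the same tables and columns as above; it will not run on SQL Server 2008, which does not allow ORDER BY inside the OVER clause for aggregates.
-- Sketch: requires SQL Server 2012+ for ORDER BY in the OVER clause of SUM.
SELECT Representative_ID, month, Week, Total_Week_Activity_Count,
       SUM(Total_Week_Activity_Count) OVER (
           PARTITION BY Representative_ID, month
           ORDER BY Week
           ROWS UNBOUNDED PRECEDING
       ) AS Month_To_Date_Activities_Count
FROM (SELECT a.Representative_ID, l.month, l.Week,
             COUNT(*) AS Total_Week_Activity_Count
      FROM ACTIVITIES_FACT a
      INNER JOIN LU_TIME l ON a.Date = l.Date
      GROUP BY a.Representative_ID, l.month, l.Week
     ) weekly
ORDER BY Representative_ID, Week;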

As I understand your question:
SELECT af.Representative_ID
, lt.Week
, COUNT(af.Activity) AS Qnt
FROM ACTIVITIES_FACT af
INNER JOIN LU_TIME lt ON lt.Date = af.date
GROUP BY af.Representative_ID, lt.Week
SqlFiddle

Representative_ID Week Month_To_Date_Activities_Count
41 2013-08-01 00:00:00.000 1
41 2013-08-08 00:00:00.000 3
USE tempdb;
GO
IF OBJECT_ID('#ACTIVITIES_FACT','U') IS NOT NULL DROP TABLE #ACTIVITIES_FACT;
CREATE TABLE #ACTIVITIES_FACT
(
Representative_ID INT NOT NULL
,Date DATETIME NULL
, Activity VARCHAR(500) NULL
)
IF OBJECT_ID('#LU_TIME','U') IS NOT NULL DROP TABLE #LU_TIME;
CREATE TABLE #LU_TIME
(
Month INT
,Date DATETIME
,Week DATETIME
)
INSERT INTO #ACTIVITIES_FACT(Representative_ID,Date,Activity)
VALUES
(41,'7/31/2013','Chat')
,(41,'8/03/2013','Call')
,(41,'8/04/2013','Call')
,(41,'8/05/2013','Call')
INSERT INTO #LU_TIME(Month,Date,Week)
VALUES
(8,'7/31/2013','8/01/2013')
,(8,'8/01/2013','8/08/2013')
,(8,'8/02/2013','8/08/2013')
,(8,'8/03/2013','8/08/2013')
,(8,'8/04/2013','8/08/2013')
,(8,'8/05/2013','8/08/2013')
--Begin Query
SELECT AF.Representative_ID
,LU.Week
,COUNT(*) AS Month_To_Date_Activities_Count
FROM #ACTIVITIES_FACT AS AF
INNER JOIN #LU_TIME AS LU
ON AF.Date = LU.Date
Group By AF.Representative_ID
,LU.Week
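If a month-to-date running total is also needed from this setup, SQL Server 2008 can get it with the same correlated-subquery pattern used in the first answer; a rough sketch against the temp tables above:
-- Sketch: month-to-date running total per representative (SQL Server 2008 compatible).
SELECT AF.Representative_ID
      ,LU.Month
      ,LU.Week
      ,COUNT(*) AS Week_Activity_Count
      ,(SELECT COUNT(*)
        FROM #ACTIVITIES_FACT AF2
        INNER JOIN #LU_TIME LU2 ON AF2.Date = LU2.Date
        WHERE AF2.Representative_ID = AF.Representative_ID
          AND LU2.Month = LU.Month
          AND LU2.Week <= LU.Week) AS Month_To_Date_Activities_Count
FROM #ACTIVITIES_FACT AS AF
INNER JOIN #LU_TIME AS LU ON AF.Date = LU.Date
GROUP BY AF.Representative_ID, LU.Month, LU.Week
ORDER BY AF.Representative_ID, LU.Week;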

Related

Count Since Last Max Within Window

I have been working on this query for most of the night, and just cannot get it to work. This is an addendum to this question. The query should find the "Seqnum" of the last Maximum over the last 10 records. I am unable to limit the last Maximum to just the window.
Below is my best effort at getting there although I have tried many other queries to no avail:
SELECT [id], high, running_max, seqnum,
MAX(CASE WHEN ([high]) = running_max THEN seqnum END) OVER (ORDER BY [id]) AS [lastmax]
FROM (
SELECT [id], [high],
MAX([high]) OVER (ORDER BY [id] ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS running_max,
ROW_NUMBER() OVER (ORDER BY [id]) as seqnum
FROM PY t
) x
When the above query is run, it produces the results below.
+----+--------+-------------+--------+---------+
| id | high   | running_max | seqnum | lastmax |
+----+--------+-------------+--------+---------+
| 1 | 28.12 | 28.12 | 1 | 1 |
| 2 | 27.45 | 28.12 | 2 | 1 |
| 3 | 27.68 | 28.12 | 3 | 1 |
| 4 | 27.4 | 28.12 | 4 | 1 |
| 5 | 28.09 | 28.12 | 5 | 1 |
| 6 | 28.07 | 28.12 | 6 | 1 |
| 7 | 28.2 | 28.2 | 7 | 7 |
| 8 | 28.7 | 28.7 | 8 | 8 |
| 9 | 28.05 | 28.7 | 9 | 8 |
| 10 | 28.195 | 28.7 | 10 | 8 |
| 11 | 27.77 | 28.7 | 11 | 8 |
| 12 | 28.27 | 28.7 | 12 | 8 |
| 13 | 28.185 | 28.7 | 13 | 8 |
| 14 | 28.51 | 28.7 | 14 | 8 |
| 15 | 28.5 | 28.7 | 15 | 8 |
| 16 | 28.23 | 28.7 | 16 | 8 |
| 17 | 27.59 | 28.7 | 17 | 8 |
| 18 | 27.6 | 28.51 | 18 | 8 |
| 19 | 27.31 | 28.51 | 19 | 8 |
| 20 | 27.11 | 28.51 | 20 | 8 |
| 21 | 26.87 | 28.51 | 21 | 8 |
| 22 | 27.12 | 28.51 | 22 | 8 |
| 23 | 27.22 | 28.51 | 23 | 8 |
| 24 | 27.3 | 28.5 | 24 | 8 |
| 25 | 27.66 | 28.23 | 25 | 8 |
| 26 | 27.405 | 27.66 | 26 | 8 |
| 27 | 27.54 | 27.66 | 27 | 8 |
| 28 | 27.65 | 27.66 | 28 | 8 |
+----+--------+-------------+--------+---------+
Unfortunately, the lastmax column is taking the last max over all of the previous records and not the max of the last 10 records only. The way it should come out is below:
It is important to note that there can be duplicates in the "High" column, so this will need to be taken into account.
Any help would be greatly appreciated.
This isn't a bug. The issue is that high and lastmax have to come from the same row. This is a confusing aspect when using window functions.
Your logic in the outer query is looking for a row where the lastmax on that row matches the high on that row. That last occurred on row 8. The subsequent maxima are "local", in the sense that there was a higher value on that particular row.
For instance, on row 25 the high is 27.66. That is the maximum value that you want from row 26 onward. But on row 25 itself, the running maximum is 28.23, which is clearly not equal to high on that row. So, it doesn't match in the outer query.
I don't think you can easily do what you want using window functions. There may be some tricky way.
A version using cross apply works. I've used id for the lastmax. I'm not sure if you really need seqnum:
select py.[id], py.high, t.high as running_max, t.id as lastmax
from py cross apply
(select top (1) t.*
from (SELECT top (10) t.*
from PY t
where t.id <= py.id
order by t.id desc
) t
order by t.high desc
) t;
Here is a db<>fiddle.
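If you want to try the CROSS APPLY query without the fiddle, only a small PY table is needed. The table and column names (id, high) come from the question; the rows below are the first ten from the result table above, and the exact column types are an assumption.
-- Minimal test data for the CROSS APPLY query above (types assumed).
CREATE TABLE PY (id INT PRIMARY KEY, high DECIMAL(9,3));

INSERT INTO PY (id, high)
VALUES (1, 28.12), (2, 27.45), (3, 27.68), (4, 27.40), (5, 28.09),
       (6, 28.07), (7, 28.20), (8, 28.70), (9, 28.05), (10, 28.195);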

SQL to display value for different dates

I have a table named Reading_Hist with columns ID, Reading, and ReadingDate. This table contains the history of the readings. Example:
+----+---------+-------------+
| ID | Reading | ReadingDate |
+----+---------+-------------+
| 1 | 12 | 9/12/2018 |
| 2 | 15 | 9/12/2018 |
| 1 | 16 | 9/5/2018 |
| 4 | 1 | 9/12/2018 |
| 3 | 65 | 9/12/2018 |
| 1 | 23 | 8/29/2018 |
| 3 | 25 | 9/5/2018 |
| 2 | 23 | 9/5/2018 |
| 4 | 3 | 9/5/2018 |
+----+---------+-------------+
I want to write SQL to display each ID with its current Reading in the first column, the reading taken a week before in the next, the reading taken two weeks before in the third, and finally the trend of the reading.
Example Result below.
+----+---------+------+------+-------+
| ID | Current | Wk_1 | Wk_2 | Trend |
+----+---------+------+------+-------+
| 1 | 12 | 16 | 23 | Down |
| 2 | 15 | 23 | NULL | Down |
| 3 | 65 | 25 | NULL | UP |
| 4 | 1 | 3 | NULL | Down |
+----+---------+------+------+-------+
You can use aggregation to get the most recent reading date per ID. Then left join the current readings, those of the week before, and those of two weeks before. Use CASE to calculate the trend.
It could look something like:
SELECT x.id,
rh2.reading current,
rh3.reading wk_1,
rh4.reading wk_2,
CASE
WHEN rh2.reading > rh3.reading THEN
'Up'
WHEN rh2.reading < rh3.reading THEN
'Down'
WHEN rh2.reading = rh3.reading THEN
'-'
END trend
FROM (SELECT rh1.id,
max(rh1.reading_date) reading_date
FROM reading_hist rh1
GROUP BY rh1.id) x
LEFT JOIN reading_hist rh2
ON rh2.id = x.id
AND rh2.reading_date = x.reading_date
LEFT JOIN reading_hist rh3
ON rh3.id = x.id
AND rh3.reading_date = dateadd(day, -7, x.reading_date)
LEFT JOIN reading_hist rh4
ON rh4.id = x.id
AND rh4.reading_date = dateadd(day, -14, x.reading_date);
Of course this requires that there are readings exactly 7 and 14 days before the last day of readings.
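If the readings are not spaced exactly 7 and 14 days apart, one alternative (not part of the answer above, just a sketch under that assumption) is to rank the readings per ID and pivot the three most recent ones:
-- Sketch: take the three latest readings per ID regardless of their spacing.
WITH ranked AS (
    SELECT id, reading, reading_date,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY reading_date DESC) AS rn
    FROM reading_hist
)
SELECT id,
       MAX(CASE WHEN rn = 1 THEN reading END) AS [Current],
       MAX(CASE WHEN rn = 2 THEN reading END) AS Wk_1,
       MAX(CASE WHEN rn = 3 THEN reading END) AS Wk_2,
       CASE
           WHEN MAX(CASE WHEN rn = 1 THEN reading END) > MAX(CASE WHEN rn = 2 THEN reading END) THEN 'Up'
           WHEN MAX(CASE WHEN rn = 1 THEN reading END) < MAX(CASE WHEN rn = 2 THEN reading END) THEN 'Down'
           ELSE '-'
       END AS Trend
FROM ranked
GROUP BY id;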

Get the Id of the matched data from other table. No duplicates of ID from both tables

Here is my table A.
| Id | GroupId | StoreId | Amount |
| 1 | 20 | 7 | 15000 |
| 2 | 20 | 7 | 1230 |
| 3 | 20 | 7 | 14230 |
| 4 | 20 | 7 | 9540 |
| 5 | 20 | 7 | 24230 |
| 6 | 20 | 7 | 1230 |
| 7 | 20 | 7 | 1230 |
Here is my table B.
| Id | GroupId | StoreId | Credit |
| 12 | 20 | 7 | 1230 |
| 14 | 20 | 7 | 15000 |
| 15 | 20 | 7 | 14230 |
| 16 | 20 | 7 | 1230 |
| 17 | 20 | 7 | 7004 |
| 18 | 20 | 7 | 65523 |
I want to get the result below without duplicating Ids from either table.
I need to get the Ids of tables A and B where Amount = Credit.
| A.ID | B.ID | Amount |
| 1 | 14 | 15000 |
| 2 | 12 | 1230 |
| 3 | 15 | 14230 |
| 4 | null | 9540 |
| 5 | null | 24230 |
| 6 | 16 | 1230 |
| 7 | null | 1230 |
My problem is that when I have 2 or more rows with the same Amount in table A, I get a duplicate ID from table B where it should be null. Please help me. Thank you.
I think you want a left join. But this is tricky because you have duplicate amounts, but you only want one to match. The solution is to use row_number():
select . . .
from (select a.*, row_number() over (partition by amount order by id) as seqnum
      from a
     ) a left join
     (select b.*, row_number() over (partition by credit order by id) as seqnum
      from b
     ) b
     on a.amount = b.credit and a.seqnum = b.seqnum;
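The ". . ." select list is left for you to fill in; assuming the columns shown in the question and the TABLE_A/TABLE_B names used in the second approach below, a written-out version of the same idea might look like this sketch:
-- Sketch only: fills in the elided select list; table names are taken from the second approach below.
SELECT a.Id AS A_ID, b.Id AS B_ID, a.Amount
FROM (SELECT a.*, ROW_NUMBER() OVER (PARTITION BY Amount ORDER BY Id) AS seqnum
      FROM TABLE_A a
     ) a
LEFT JOIN
     (SELECT b.*, ROW_NUMBER() OVER (PARTITION BY Credit ORDER BY Id) AS seqnum
      FROM TABLE_B b
     ) b
  ON a.Amount = b.Credit AND a.seqnum = b.seqnum
ORDER BY a.Id;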
Another approach, I think simpler and shorter :)
select ID [A.ID],
(select top 1 ID from TABLE_B where Credit = A.Amount) [B.ID],
Amount
from TABLE_A [A]

How do I select columns whenever they change?

I'm trying to create a slowly changing dimension (type 2 dimension) and am a bit lost on how to logically write it out. Say that we have a source table with a grain of Person | Country | Department | Login Time. I want to create this dimension table with Person | Country | Department | Eff Start time | Eff End Time.
Data could look like this:
Person | Country | Department | Login Time
------------------------------------------
Bob | CANADA | Marketing | 2009-01-01
Bob | CANADA | Marketing | 2009-02-01
Bob | USA | Marketing | 2009-03-01
Bob | USA | Sales | 2009-04-01
Bob | MEX | Product | 2009-05-01
Bob | MEX | Product | 2009-06-01
Bob | MEX | Product | 2009-07-01
Bob | CANADA | Marketing | 2009-08-01
What I want in the Type 2 dimension would look like this:
Person | Country | Department | Eff Start time | Eff End Time
------------------------------------------------------------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 | NULL
Assume that Bob's name, Country, and Department haven't been updated since 2009-08-01, so the Eff End Time is left as NULL.
What function would work best here? This is on Netezza, which uses a flavor of Postgres.
Obviously GROUP BY would not work here because the same groupings can recur later on (I added Bob | CANADA | Marketing as the last row to show this).
EDIT
Including a hash column on Person, Country, and Department would make sense, correct? I'm thinking of using logic like:
SELECT PERSON, COUNTRY, DEPARTMENT
FROM table t1
where
person = person
AND t1.hash <> hash_function(person, country, department)
Answer
create table so (
person varchar(32)
,country varchar(32)
,department varchar(32)
,login_time date
) distribute on random;
insert into so values ('Bob','CANADA','Marketing','2009-01-01');
insert into so values ('Bob','CANADA','Marketing','2009-02-01');
insert into so values ('Bob','USA','Marketing','2009-03-01');
insert into so values ('Bob','USA','Sales','2009-04-01');
insert into so values ('Bob','MEX','Product','2009-05-01');
insert into so values ('Bob','MEX','Product','2009-06-01');
insert into so values ('Bob','MEX','Product','2009-07-01');
insert into so values ('Bob','CANADA','Marketing','2009-08-01');
/* ************************************************************************** */
with prm as ( --Create an ordinal primary key.
select
*
,row_number() over (
partition by person
order by login_time
) rwn
from
so
), chn as ( --Chain events to their previous and next event.
select
cur.rwn
,cur.person
,cur.country
,cur.department
,cur.login_time cur_login
,case
when
cur.country = prv.country
and cur.department = prv.department
then 1
else 0
end prv_equal
,case
when
(
cur.country = nxt.country
and cur.department = nxt.department
) or nxt.rwn is null --No next record should be equivalent to matching.
then 1
else 0
end nxt_equal
,case prv_equal
when 0 then cur_login
else null
end eff_login_start_sparse
,case
when eff_login_start_sparse is null
then max(eff_login_start_sparse) over (
partition by cur.person
order by rwn
rows unbounded preceding --The secret sauce.
)
else eff_login_start_sparse
end eff_login_start
,case nxt_equal
when 0 then cur_login
else null
end eff_login_end
from
prm cur
left outer join prm nxt on
cur.person = nxt.person
and cur.rwn + 1 = nxt.rwn
left outer join prm prv on
cur.person = prv.person
and cur.rwn - 1 = prv.rwn
), grp as ( --Group by login starts.
select
person
,country
,department
,eff_login_start
,max(eff_login_end) eff_login_end
from
chn
group by
person
,country
,department
,eff_login_start
), led as ( --Change the effective end to be the next start, if desired.
select
person
,country
,department
,eff_login_start
,case
when eff_login_end is null
then null
else
lead(eff_login_start) over (
partition by person
order by eff_login_start
)
end eff_login_end
from
grp
)
select * from led order by eff_login_start;
This code returns the following table.
PERSON | COUNTRY | DEPARTMENT | EFF_LOGIN_START | EFF_LOGIN_END
--------+---------+------------+-----------------+---------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 |
Explanation
I must have solved this four or five times in the past few years and keep neglecting to write it down formally. I'm glad to have the chance to do it, so this is a great question.
When attempting this, I like writing down the problem in matrix form. Here's the input, presuming that all values have the same key in the SCD.
Cv | Ce
----|----
A | 10
A | 11
B | 14
C | 16
D | 18
D | 25
D | 34
A | 40
Where Cv is the value that we'll need to compare against (again, presuming that the key value for the SCD is equal in this data; we'll be partitioning over the key value the entire time so it's irrelevant to the solution) and Ce is the event time.
First, we need an ordinal primary key. I've designated this Ck in the table. This will allow us to join the table to itself to get the previous and next events. I've called these columns Pk (previous key), Nk (next key), Pv, and Nv.
Cv | Ce | Ck | Pk | Pv | Nk | Nv |
----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A |
A | 11 | 2 | 1 | A | 3 | B |
B | 14 | 3 | 2 | A | 4 | C |
C | 16 | 4 | 3 | B | 5 | D |
D | 18 | 5 | 4 | C | 6 | D |
D | 25 | 6 | 5 | D | 7 | D |
D | 34 | 7 | 6 | D | 8 | A |
A | 40 | 8 | 7 | D | | |
Now we need some columns to see if we're at the beginning or end of a contiguous event block. I'll call these Pc and Nc, for contiguous. Pc is defined as Pv = Cv => true. 1 represents true and 0 represents false. Nc is defined similarly, except that the null case defaults to true (we'll see why in a minute)
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc |
----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 |
A | 40 | 8 | 7 | D | | | 0 | 1 |
Now you can start to see how the 1,1 combination of Pc,Nc is a completely useless record. We know this intuitively, since Bob's Mex/Product combination on the 6th row is pretty much useless information when building an SCD.
So let's get rid of the useless information. I'll add two new columns here: an almost-complete effective start time called Sn and an actually-complete effective end time called Ee. Sn is populated with Ce when Pc is 0 and Ee is populated with Ce when Nc is 0.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee |
----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | |
This looks really close, but we still have the problem that we can't group by Cv (person/country/department). What we need is for Sn to populate all those nulls with the previous value of Sn. You could join this table to itself on an earlier Ck and take the maximum, but I'm going to be lazy and use Netezza's analytic functions and the rows unbounded preceding clause. It's a shortcut to the method I just described. So we're going to create another column called Es, effective start, defined as follows.
case
when Sn is null
then max(Sn) over (
partition by k --key value of the SCD
order by Ck
rows unbounded preceding
)
else Sn
end Es
With that definition, we get this.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee | Es |
----|----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | | 10 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 | 10 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | | 18 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | | 18 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 | 18 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | | 40 |
The rest is trivial. Group by Es and grab the max of Ee to obtain this table.
Cv | Es | Ee |
----|----|----|
A | 10 | 11 |
B | 14 | 14 |
C | 16 | 16 |
D | 18 | 34 |
A | 40 | |
If you want to populate the effective end time with the next start, join the table again to itself or use the lead() window function to grab it.

Rolling total with no sub-select and no vendor specific extensions

What I'm trying to achieve: rolling total for quantity and amount for a given day, grouped by hour.
It's easy in most cases, but if you have some additional columns (dir and product in my case) and you don't want to group/filter on them, that's a problem.
I know there are extensions in Oracle and MSSQL specifically for that, and there's SELECT OVER PARTITION in Postgres.
At the moment I'm working on an app prototype, and it's backed by MySQL, and I have no idea what it will be using in production, so I'm trying to avoid vendor lock-in.
The entire table:
> SELECT id, dir, product, date, hour, quantity, amount FROM sales
ORDER BY date, hour;
+------+-----+---------+------------+------+----------+--------+
| id | dir | product | date | hour | quantity | amount |
+------+-----+---------+------------+------+----------+--------+
| 2230 | 65 | ABCDEDF | 2014-09-11 | 1 | 1 | 10 |
| 2231 | 64 | ABCDEDF | 2014-09-11 | 3 | 4 | 40 |
| 2232 | 64 | ABCDEDF | 2014-09-11 | 5 | 5 | 50 |
| 2235 | 64 | ZZ | 2014-09-11 | 7 | 6 | 60 |
| 2233 | 64 | ABCDEDF | 2014-09-11 | 7 | 6 | 60 |
| 2237 | 66 | ABCDEDF | 2014-09-11 | 7 | 6 | 60 |
| 2234 | 64 | ZZ | 2014-09-18 | 3 | 1 | 11 |
| 2236 | 66 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2227 | 64 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2228 | 64 | ABCDEDF | 2014-09-18 | 5 | 2 | 200 |
| 2229 | 64 | ABCDEDF | 2014-09-18 | 7 | 3 | 300 |
+------+-----+---------+------------+------+----------+--------+
For a given date:
> SELECT id, dir, product, date, hour, quantity, amount FROM sales
WHERE date = '2014-09-18'
ORDER BY hour;
+------+-----+---------+------------+------+----------+--------+
| id | dir | product | date | hour | quantity | amount |
+------+-----+---------+------------+------+----------+--------+
| 2227 | 64 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2236 | 66 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2234 | 64 | ZZ | 2014-09-18 | 3 | 1 | 11 |
| 2228 | 64 | ABCDEDF | 2014-09-18 | 5 | 2 | 200 |
| 2229 | 64 | ABCDEDF | 2014-09-18 | 7 | 3 | 300 |
+------+-----+---------+------------+------+----------+--------+
The results that I need, using sub-select:
> SELECT date, hour, SUM(quantity),
( SELECT SUM(quantity) FROM sales s2
WHERE s2.hour <= s1.hour AND s2.date = s1.date
) AS total
FROM sales s1
WHERE s1.date = '2014-09-18'
GROUP by date, hour;
+------------+------+---------------+-------+
| date | hour | sum(quantity) | total |
+------------+------+---------------+-------+
| 2014-09-18 | 3 | 3 | 3 |
| 2014-09-18 | 5 | 2 | 5 |
| 2014-09-18 | 7 | 3 | 8 |
+------------+------+---------------+-------+
My concerns for using sub-select:
once there are around a million records in the table, the query may become too slow; I'm not sure whether it can be optimized, even though it has no HAVING clauses.
if I had to filter on a product or dir, I would have to put those conditions in both the main SELECT and the sub-SELECT (WHERE product = / WHERE dir =).
a sub-select can only return a single sum, while I need two of them (sum(quantity) and sum(amount)); attempting both gives ERROR 1241 (21000): Operand should contain 1 column(s).
The closest result I was able to get using JOIN:
> SELECT DISTINCT(s1.hour) AS ih, s2.date, s2.hour, s2.quantity, s2.amount, s2.id
FROM sales s1
JOIN sales s2 ON s2.date = s1.date AND s2.hour <= s1.hour
WHERE s1.date = '2014-09-18'
ORDER by ih;
+----+------------+------+----------+--------+------+
| ih | date | hour | quantity | amount | id |
+----+------------+------+----------+--------+------+
| 3 | 2014-09-18 | 3 | 1 | 100 | 2236 |
| 3 | 2014-09-18 | 3 | 1 | 100 | 2227 |
| 3 | 2014-09-18 | 3 | 1 | 11 | 2234 |
| 5 | 2014-09-18 | 3 | 1 | 100 | 2236 |
| 5 | 2014-09-18 | 3 | 1 | 100 | 2227 |
| 5 | 2014-09-18 | 5 | 2 | 200 | 2228 |
| 5 | 2014-09-18 | 3 | 1 | 11 | 2234 |
| 7 | 2014-09-18 | 3 | 1 | 100 | 2236 |
| 7 | 2014-09-18 | 3 | 1 | 100 | 2227 |
| 7 | 2014-09-18 | 5 | 2 | 200 | 2228 |
| 7 | 2014-09-18 | 7 | 3 | 300 | 2229 |
| 7 | 2014-09-18 | 3 | 1 | 11 | 2234 |
+----+------------+------+----------+--------+------+
I could stop here and just use those results to group by ih (hour), calculate the sums for quantity and amount, and be happy. But something eats at me, telling me that this is wrong.
If I remove DISTINCT, most rows end up duplicated. Replacing JOIN with its variants doesn't help.
Once I remove s2.id from the statement, I get a complete mess with meaningful rows disappearing/collapsing (e.g. ids 2236/2227 get collapsed):
> SELECT DISTINCT(s1.hour) AS ih, s2.date, s2.hour, s2.quantity, s2.amount
FROM sales s1
JOIN sales s2 ON s2.date = s1.date AND s2.hour <= s1.hour
WHERE s1.date = '2014-09-18'
ORDER by ih;
+----+------------+------+----------+--------+
| ih | date | hour | quantity | amount |
+----+------------+------+----------+--------+
| 3 | 2014-09-18 | 3 | 1 | 100 |
| 3 | 2014-09-18 | 3 | 1 | 11 |
| 5 | 2014-09-18 | 3 | 1 | 100 |
| 5 | 2014-09-18 | 5 | 2 | 200 |
| 5 | 2014-09-18 | 3 | 1 | 11 |
| 7 | 2014-09-18 | 3 | 1 | 100 |
| 7 | 2014-09-18 | 5 | 2 | 200 |
| 7 | 2014-09-18 | 7 | 3 | 300 |
| 7 | 2014-09-18 | 3 | 1 | 11 |
+----+------------+------+----------+--------+
Summing doesn't help; it only adds to the mess.
The first row (hour = 3) should have SUM(s2.quantity) equal to 3, but it has 9. What SUM(s1.quantity) shows is a complete mystery to me.
> SELECT DISTINCT(s1.hour) AS hour, sum(s1.quantity), s2.date, SUM(s2.quantity)
FROM sales s1 JOIN sales s2 ON s2.date = s1.date AND s2.hour <= s1.hour
WHERE s1.date = '2014-09-18'
GROUP BY hour;
+------+------------------+------------+------------------+
| hour | sum(s1.quantity) | date | sum(s2.quantity) |
+------+------------------+------------+------------------+
| 3 | 9 | 2014-09-18 | 9 |
| 5 | 8 | 2014-09-18 | 5 |
| 7 | 15 | 2014-09-18 | 8 |
+------+------------------+------------+------------------+
Bonus points/boss level:
I also need a column that will show total_reference, the same rolling total for the same periods for a different date (e.g. 2014-09-11).
If you want a cumulative sum in MySQL, the most efficient way is to use variables:
SELECT date, hour,
(#q := q + #q) as cumeq, (#a := a + #a) as cumea
FROM (SELECT date, hour, SUM(quantity) as q, SUM(amount) as a
FROM sales s
WHERE s.date = '2014-09-18'
GROUP by date, hour
) dh cross join
(select #q := 0, #a := 0) vars
ORDER BY date, hour;
If you are planning on working with databases such as Oracle, SQL Server, and Postgres, then you should develop against a database that is closer in functionality and supports the ANSI standard window functions. The right way to do this is with window functions, but MySQL doesn't support those. Postgres, SQL Server, and Oracle all have free versions that you can use for development purposes.
Also, with proper indexing, you shouldn't have a problem with the subquery approach, even on large tables.
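For databases that do support the standard window functions mentioned above (Postgres, SQL Server 2012+, Oracle, and also MySQL as of 8.0), both rolling totals collapse into a windowed SUM over the grouped rows. This is just a sketch using the sales table and columns from the question:
-- Sketch: ANSI window-function version of the rolling totals (not valid on MySQL 5.x).
SELECT date, hour,
       SUM(quantity) AS q,
       SUM(amount) AS a,
       SUM(SUM(quantity)) OVER (PARTITION BY date ORDER BY hour) AS total_quantity,
       SUM(SUM(amount)) OVER (PARTITION BY date ORDER BY hour) AS total_amount
FROM sales
WHERE date = '2014-09-18'
GROUP BY date, hour
ORDER BY date, hour;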