How to fill missing dates by groups in a table in sql - sql

I want to know how to use loops to fill in missing dates with value zero based on the start/end dates by groups in sql so that i have consecutive time series in each group. I have two questions.
how to loop for each group?
How to use start/end dates for each group to dynamically fill in missing dates?
My input and expected output are listed as below.
Input: I have a table A like
date value grp_no
8/06/12 1 1
8/08/12 1 1
8/09/12 0 1
8/07/12 2 2
8/08/12 1 2
8/12/12 3 2
Also I have a table B which can be used to left join with A to fill in missing dates.
date
...
8/05/12
8/06/12
8/07/12
8/08/12
8/09/12
8/10/12
8/11/12
8/12/12
8/13/12
...
How can I use A and B to generate the following output in sql?
Output:
date value grp_no
8/06/12 1 1
8/07/12 0 1
8/08/12 1 1
8/09/12 0 1
8/07/12 2 2
8/08/12 1 2
8/09/12 0 2
8/10/12 0 2
8/11/12 0 2
8/12/12 3 2
Please send me your code and suggestion. Thank you so much in advance!!!

You can do it like this without loops
SELECT p.date, COALESCE(a.value, 0) value, p.grp_no
FROM
(
SELECT grp_no, date
FROM
(
SELECT grp_no, MIN(date) min_date, MAX(date) max_date
FROM tableA
GROUP BY grp_no
) q CROSS JOIN tableb b
WHERE b.date BETWEEN q.min_date AND q.max_date
) p LEFT JOIN TableA a
ON p.grp_no = a.grp_no
AND p.date = a.date
The innermost subquery grabs min and max dates per group. Then cross join with TableB produces all possible dates within the min-max range per group. And finally outer select uses outer join with TableA and fills value column with 0 for dates that are missing in TableA.
Output:
| DATE | VALUE | GRP_NO |
|------------|-------|--------|
| 2012-08-06 | 1 | 1 |
| 2012-08-07 | 0 | 1 |
| 2012-08-08 | 1 | 1 |
| 2012-08-09 | 0 | 1 |
| 2012-08-07 | 2 | 2 |
| 2012-08-08 | 1 | 2 |
| 2012-08-09 | 0 | 2 |
| 2012-08-10 | 0 | 2 |
| 2012-08-11 | 0 | 2 |
| 2012-08-12 | 3 | 2 |
Here is SQLFiddle demo

I just needed the query to return all the dates in the period I wanted. Without the joins. Thought I'd share for those wanting to put them in your query. Just change the 365 to whatever timeframe you are wanting.
DECLARE #s DATE = GETDATE()-365, #e DATE = GETDATE();
SELECT TOP (DATEDIFF(DAY, #s, #e)+1)
DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY number)-1, #s)
FROM [master].dbo.spt_values
WHERE [type] = N'P' ORDER BY number

The following query does a union with tableA and tableB. It then uses group by to merge the rows from tableA and tableB so that all of the dates from tableB are in the result. If a date is not in tableA, then the row has 0 for value and grp_no. Otherwise, the row has the actual values for value and grp_no.
select
dat,
sum(val),
sum(grp)
from
(
select
date as dat,
value as val,
grp_no as grp
from
tableA
union
select
date,
0,
0
from
tableB
where
date >= date '2012-08-06' and
date <= date '2012-08-13'
)
group by
dat
order by
dat
I find this query to be easier for me to understand. It also runs faster. It takes 16 seconds whereas a similar right join query takes 32 seconds.
This solution only works with numerical data.
This solution assumes a fixed date range. With some extra work this query can be adapted to limit the date range to what is found in tableA.

Related

SQL select row containing all of the values in interval

I know the question is poorly worded, I'm sorry, I can't really put this problem into words. Here is a representation:
I have two tables: product and availability. A product can have multiple dates when it's available. Example:
Table 1 (products):
id | name | ....
----------------------------------
1 | My product 1 | ....
2 | My product 2 | ....
Table 2 (availability):
id | productId | date
-----------------------------------------
1 | 1 | 2021-01-15
2 | 1 | 2021-01-16
3 | 1 | 2021-01-17
4 | 2 | 2021-01-15
5 | 2 | 2021-01-16
Is there an sql statement that, given an interval, allows us to fetch a list of products having a row in the availabilty table for each element of the interval?
For example, given the interval [2021-01-15 -> 2021-01-17], the request should return product 1 because it's available during the entire period (it has a row for each element: the 15th, 16th and 17th). Product2 isn't returned because it's not available on 2021-01-17.
Is there a way to do this in SQL or do I have to use PL/SQL?
Any help is appreciated,
Thanks
You can use analytical function as follows:
select p.* from
(select p.*, count(distinct a.date) over (partition by a.productid) as cnt
from products p
join availability a on a.productid = p.id
where a.date >= date '201-01-15'
and a.date < date '201-01-17' + 1 )
where cnt = date '201-01-17' - date '201-01-15' + 1
Finally, came up with this, thanks #Popeye for the inspiration.
select occurence.pid from
(
select a.product_id as pid, count(distinct a.date::date) as cnt
from availability a
where a.date >= '2021-01-15'
and a.date < '2021-01-17'::date + 1
group by a.product_id
) as occurence
where cnt = '2021-01-17'::date - '2021-01-15'::date + 1;

Without using DISTINCT, how to group data without altering value?

I have a feeling this is a dumb question with a simple answer, but here goes.
How can I group the following data without using DISTINCT? #Table has 5 rows, which shows data for Hrs 5-9. I just don't like DISTINCT.
Since I need to display all hours of the day upto Hr9 (including 0-4), I'm joining it with table DimTime. DimTime has all hours, but with its 15-min intervals. So, DimTime looks like this:
Hour Minute
0 0
0 15
0 30
0 45
1 0
1 15
1 30
1 45
So here's my script:
declare #table table
(
Hour int,
Value int
)
insert into #table select 5, 25
insert into #table select 6, 34
insert into #table select 7, 54
insert into #table select 8, 65
insert into #table select 9, 11
select d.hour, t.hour, sum(value)
from #table t
left join dimtime d on d.hour = t.hour
group by d.hour, t.hour
If I use GROUP BY, then I need to have an aggregate function. So if I use SUM, it'll multiply all values by 4. If I remove the aggregate function, I'll get a syntax error.
Also, I cannot use a CTE since the contents in #table comes from a CTE (I just didn't include it here).
Here's the result that I need to display:
Hour Value
0 null
1 null
2 null
3 null
4 null
5 25
6 34
7 54
8 65
9 11
Simply add a condition WHERE minute = 0 to return only one row per hour.
If you really with to skip the sorting operation on dimtime with the use of distinct clause then check the below explanation.
Display all hours (0-9) from dimtime and sum the value given in #table for a particular hour:
SELECT
d.hour, SUM(t.value)
FROM
dimtime d
LEFT JOIN #table t
ON d.hour = t.hour
WHERE d.minute = 0 -- retrieves one row for every hour from dimtime
GROUP BY d.hour
ORDER BY d.hour -- not needed, but will give you resultset sorted by hour
Assuming that you have a row with value minute = 0 in your dimtable for every hour you could just limit the rows retrieved for join operation. That will work with any value from list 0, 15, 30, 45.
SUM() will work properly by summing all the values for a given hour in #table. If there are no rows with a particular hour, it will return 0 value.
You should have a better reason for not using a programming function than "I just don't like it"
You can have a CTE that uses another CTE
#dnoeth provided an excellent answer, but here's another option:
SELECT
d.hour,
t.value
FROM
#table t
INNER JOIN (SELECT DISTINCT hour FROM dimTime) d ON d.hour = t.hour
Try
SELECT *
FROM DimTime D
LEFT JOIN myTable T
ON D.Hour = T.Hour
WHERE D.Minute = 0
SQL Fiddle Demo
Output
| Hour | Minute | Hour | Value |
|------|--------|--------|--------|
| 0 | 0 | (null) | (null) |
| 1 | 0 | (null) | (null) |
| 2 | 0 | (null) | (null) |
| 3 | 0 | (null) | (null) |
| 4 | 0 | (null) | (null) |
| 5 | 0 | 5 | 25 |
| 6 | 0 | 6 | 34 |
| 7 | 0 | 7 | 54 |
| 8 | 0 | 8 | 65 |
| 9 | 0 | 9 | 11 |
If I use GROUP BY, then I need to have an aggregate function
Only if you include expressions in your SELECT that are not part of your group key. You could certainly do
select d.hour, t.hour, value
from #table t
inner join dimtime d
on d.hour = t.hour
group by d.hour, t.hour, value
or
select d.hour, t.hour, MIN(value)
from #table t
inner join dimtime d
on d.hour = t.hour
group by d.hour, t.hour
Note that the first query gives you the exact same results as DISTINCT (and may even be compiled to the same query plan) so I'm not sure what your aversion is to DISTINCT.

SQL Server - Select Distinct of two columns, where the distinct column selected has a maximum value based on two other columns

I have 2 tables - TC and T, with columns specified below. TC maps to T on column T_ID.
TC
----
T_ID,
TC_ID
T
-----
T_ID,
V_ID,
Datetime,
Count
My current result set is:
V_ID TC_ID Datetime Count
----|-----|------------|--------|
2 | 1 | 2013-09-26 | 450600 |
2 | 1 | 2013-12-09 | 14700 |
2 | 1 | 2014-01-22 | 15000 |
2 | 1 | 2014-01-22 | 15000 |
2 | 1 | 2014-01-22 | 7500 |
4 | 1 | 2014-01-22 | 1000 |
4 | 1 | 2013-12-05 | 0 |
4 | 2 | 2013-12-05 | 0 |
Using the following query:
select T.V_ID,
TC.TC_ID,
T.Datetime,
T.Count
from T
inner join TC
on TC.T_ID = T.T_ID
Result set I want:
V_ID TC_ID Datetime Count
----|-----|------------|--------|
2 | 1 | 2014-01-22 | 15000 |
4 | 1 | 2014-01-22 | 1000 |
4 | 2 | 2013-12-05 | 0 |
I want to write a query to select each distinct V_ID + TC_ID combination, but only with the maximum datetime, and for that datetime the maximum count. E.g. for the distinct combination of V_ID = 2 and TC_ID = 1, '2014-01-22' is the maximum datetime, and for that datetime, 15000 is the maximum count, so select this record for the new table. Any ideas? I don't know if this is too ambitious for a query and I should just handle the result set in code instead.
One method uses row_number():
select v_id, tc_id, datetime, count
from (select T.V_ID, TC.TC_ID, T.Datetime, T.Count,
row_number() over (partition by t.V_ID, tc.tc_id
order by datetime desc, count desc
) as seqnum
from t join
tc
on tc.t_id = t._id
) tt
where seqnum = 1;
The only issue is that some rows have the same maximum datetime value. SQL tables represent unordered sets, so there is no way to determine which is really the maximum -- unless the datetime really has a time component or another column specifies the ordering within a day.
It is possible to solve this using CTEs. First, extracting the data from your query. Second, get the maxdates. Third, get the highest count for each maxdate.:
;WITH Dataset AS
(
select T.V_ID,
TC.TC_ID,
T.[Datetime],
T.[Count]
from T
inner join TC
on TC.T_ID = T._ID
),
MaxDates AS
(
SELECT V_ID, TC_ID, MAX(t.[Datetime]) AS MaxDate
FROM Dataset t
GROUP BY t.V_ID, t.TC_ID
)
SELECT t.V_ID, t.TC_ID, t.[Datetime], MAX(t.[Count]) AS [Count]
FROM Dataset t
INNER JOIN MaxDates m ON t.V_ID = m.V_ID AND t.TC_ID = m.TC_ID AND m.MaxDate = t.[Datetime]
GROUP BY t.V_ID, t.TC_ID, t.[Datetime]
Just to keep it simple:
You need to group by T.V_ID,TC.TC_ID,
with selecting the max of date and then to get the maximum count, you must use a sub query as follows,
select T.V_ID,
TC.TC_ID,
max(T.Datetime) as Date_Time,
(select max(Count) from T as tb where v_ID = T.v_ID and DateTime = max(T.DateTime)) as Count
from T
inner join TC
on TC.T_ID = T._ID
group by T.V_ID,TC.TC_ID,

Get Monthly Totals from Running Totals

I have a table in a SQL Server 2008 database with two columns that hold running totals called Hours and Starts. Another column, Date, holds the date of a record. The dates are sporadic throughout any given month, but there's always a record for the last hour of the month.
For example:
ContainerID | Date | Hours | Starts
1 | 2010-12-31 23:59 | 20 | 6
1 | 2011-01-15 00:59 | 23 | 6
1 | 2011-01-31 23:59 | 30 | 8
2 | 2010-12-31 23:59 | 14 | 2
2 | 2011-01-18 12:59 | 14 | 2
2 | 2011-01-31 23:59 | 19 | 3
How can I query the table to get the total number of hours and starts for each month between two specified years? (In this case 2011 and 2013.) I know that I need to take the values from the last record of one month and subtract it by the values from the last record of the previous month. I'm having a hard time coming up with a good way to do this in SQL, however.
As requested, here are the expected results:
ContainerID | Date | MonthlyHours | MonthlyStarts
1 | 2011-01-31 23:59 | 10 | 2
2 | 2011-01-31 23:59 | 5 | 1
Try this:
SELECT c1.ContainerID,
c1.Date,
c1.Hours-c3.Hours AS "MonthlyHours",
c1.Starts - c3.Starts AS "MonthlyStarts"
FROM Containers c1
LEFT OUTER JOIN Containers c2 ON
c1.ContainerID = c2.ContainerID
AND datediff(MONTH, c1.Date, c2.Date)=0
AND c2.Date > c1.Date
LEFT OUTER JOIN Containers c3 ON
c1.ContainerID = c3.ContainerID
AND datediff(MONTH, c1.Date, c3.Date)=-1
LEFT OUTER JOIN Containers c4 ON
c3.ContainerID = c4.ContainerID
AND datediff(MONTH, c3.Date, c4.Date)=0
AND c4.Date > c3.Date
WHERE
c2.ContainerID is null
AND c4.ContainerID is null
AND c3.ContainerID is not null
ORDER BY c1.ContainerID, c1.Date
Using recursive CTE and some 'creative' JOIN condition, you can fetch next month's value for each ContainterID:
WITH CTE_PREP AS
(
--RN will be 1 for last row in each month for each container
--MonthRank will be sequential number for each subsequent month (to increment easier)
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY ContainerID, YEAR(Date), MONTH(DATE) ORDER BY Date DESC) RN
,DENSE_RANK() OVER (ORDER BY YEAR(Date),MONTH(Date)) MonthRank
FROM Table1
)
, RCTE AS
(
--"Zero row", last row in decembar 2010 for each container
SELECT *, Hours AS MonthlyHours, Starts AS MonthlyStarts
FROM CTE_Prep
WHERE YEAR(date) = 2010 AND MONTH(date) = 12 AND RN = 1
UNION ALL
--for each next row just join on MonthRank + 1
SELECT t.*, t.Hours - r.Hours, t.Starts - r.Starts
FROM RCTE r
INNER JOIN CTE_Prep t ON r.ContainerID = t.ContainerID AND r.MonthRank + 1 = t.MonthRank AND t.Rn = 1
)
SELECT ContainerID, Date, MonthlyHours, MonthlyStarts
FROM RCTE
WHERE Date >= '2011-01-01' --to eliminate "zero row"
ORDER BY ContainerID
SQLFiddle DEMO (I have added some data for February and March in order to test on different lengths of months)
Old version fiddle

Statistical Mode with postgres

I have a table that has this schema:
create table mytable (creation_date timestamp,
value int,
category int);
I want the maximum ocurrence of a value every each hour per category, Only on week days. I had made some progress, I have a query like this now:
select category,foo.h as h,value, count(value) from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value;
as result I got something like this:
category | h | value | count
---------+----------+---------+-------
1 | 00:00:00 | 2 | 1
1 | 01:00:00 | 2 | 1
1 | 02:00:00 | 2 | 6
1 | 03:00:00 | 2 | 31
1 | 03:00:00 | 3 | 11
1 | 04:00:00 | 2 | 21
1 | 04:00:00 | 3 | 9
1 | 13:00:00 | 1 | 14
1 | 14:00:00 | 1 | 10
1 | 14:00:00 | 2 | 7
1 | 15:00:00 | 1 | 52
for example at 04:00 I have to values 2 and 3, with counts of 21 and 9 respectively, I only need the value with highest count which would be the statiscal mode.
BTW I have more than 2M records
This can be simpler:
SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
category
, extract(hour FROM creation_date)::int AS h
, count(*)::int AS max_ct
, value
FROM mytable
WHERE extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP BY 1,2,4
ORDER BY 1,2,3 DESC;
Basically these are the steps:
Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
Extract hour from timestamp as h.
Group by category, h and value.
Count the rows per combination of the three; cast to integer - we don't need bigint.
Order by category, h and the highest count (DESC).
Only pick the first row (highest count) per (category, h) with the according category.
I am able to do this in one query level, because DISTINCT is applied after the aggregate function.
The result will hold no rows for any (category, h) without no entries at all. If you need to fill in the blanks, LEFT JOIN to this:
SELECT c.category, h.h
FROM cat_tbl c
CROSS JOIN (SELECT generate_series(0, 23) AS h) h
Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query on that to finalise the results.
Assuming you called the temporary table "summary_table", the following query should do it.
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s2.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value));
If you don't want to create a table, you could use a WITH clause to attach your query to this one.
with summary_table as (
select category,foo.h as h,value, count(value) as count from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value)
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s1.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value));