Irregular grouping of timestamp variable - SQL

I have a table organized as follows:
id       lateAt
1231235  2019/09/14
1242123  2019/09/13
3465345  NULL
5676548  2019/09/28
8986475  2019/09/23
Where lateAt is a timestamp of when a certain loan's payment became late. So, for each current date (I need to look at these numbers daily) there is a certain number of entries which are late by 0-15, 15-30, 30-45, 45-60, 60-90 and 90+ days.
This is my desired output:
lateGroup  Count
0-15       20
15-30      22
30-45      25
45-60      32
60-90      47
90+        57
This is something I can easily calculate in R, but to get the results back to my BI dashboard I'd have to create a new table in my database, which I don't think is good practice. What is the SQL-native approach to this problem?

I would define the "late groups" using a range, then join against the number of days:
with groups (grp) as (
  values
    (int4range(0, 15, '[)')),
    (int4range(15, 30, '[)')),
    (int4range(30, 45, '[)')),
    (int4range(45, 60, '[)')),
    (int4range(60, 90, '[)')),
    (int4range(90, null, '[)'))
)
select grp, count(t.user_id)
from groups g
  left join the_table t on g.grp @> (current_date - t.late_at)
group by grp
order by grp;
int4range(0,15, '[)') creates a range from 0 (inclusive) to 15 (exclusive); @> tests whether a range contains a value.
Online example: https://rextester.com/QJSN89445
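To see the range containment operator on its own, here is a small illustrative aside (standard PostgreSQL behavior, not part of the linked example):
select int4range(0, 15, '[)') @> 0;      -- true: 0 is included
select int4range(0, 15, '[)') @> 15;     -- false: 15 is excluded
select int4range(90, null, '[)') @> 120; -- true: the upper end is unbounded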

The quick and dirty way to do this in SQL is:
SELECT '0-15' AS lateGroup,
       COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 0
  AND (CURRENT_DATE - t.lateAt) < 15
UNION
SELECT '15-30' AS lateGroup,
       COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 15
  AND (CURRENT_DATE - t.lateAt) < 30
UNION
SELECT '30-45' AS lateGroup,
       COUNT(*) AS lateGroupCount
FROM my_table t
WHERE (CURRENT_DATE - t.lateAt) >= 30
  AND (CURRENT_DATE - t.lateAt) < 45
-- Etc...
For production code, you would want to do something more like Ross' answer.

You didn't mention which DBMS you're using, but nearly all of them will have a construct known as a "value constructor" like this:
select bins.lateGroup, bins.minVal, bins.maxVal FROM
(VALUES
('0-15',0,15),
('15-30',15.0001,30), -- increase by a small fraction so bins don't overlap
('30-45',30.0001,45),
('45-60',45.0001,60),
('60-90',60.0001,90),
('90-99999',90.0001,99999)
) AS bins(lateGroup,minVal,maxVal)
If your DBMS doesn't have it, then you can probably use UNION ALL:
SELECT '0-15' as lateGroup, 0 as minVal, 15 as maxVal
union all SELECT '15-30',15,30
union all SELECT '30-45',30,45
Then your complete query, with the sample data you provided, would look like this:
--- example from SQL Server 2012 SP1
--- first let's set up some sample data
create table #temp (id int, lateAt datetime);
INSERT #temp (id, lateAt) values
(1231235,'2019-09-14'),
(1242123,'2019-09-13'),
(3465345,NULL),
(5676548,'2019-09-28'),
(8986475,'2019-09-23');
--- here's the actual query
select lateGroup, count(*) as Count
from #temp as T,
     (VALUES
        ('0-15', 0, 15),
        ('15-30', 15.0001, 30), -- increase by a small fraction so bins don't overlap
        ('30-45', 30.0001, 45),
        ('45-60', 45.0001, 60),
        ('60-90', 60.0001, 90),
        ('90-99999', 90.0001, 99999)
     ) AS bins(lateGroup, minVal, maxVal)
where datediff(day, lateAt, getdate()) between minVal and maxVal
group by lateGroup
order by lateGroup;
--- remove our sample data
drop table #temp;
Here's the output:
lateGroup  Count
15-30      2
30-45      2
Note: rows with null lateAt are not counted.
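If you also need those NULL rows reported, one option (a variation I'm sketching here, not part of the query above) is a separate catch-all bucket:
select 'unknown' as lateGroup, count(*) as Count
from #temp
where lateAt is null;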

I think you can do it all in one clear query:
with cte_lategroup as
(
select *
from (values(0,15,'0-15'),(15,30,'15-30'),(30,45,'30-45')) as t (mini, maxi, designation)
)
select t2.designation
     , count(*)
from test t
left outer join cte_lategroup t2
    on current_date - t.lateat >= t2.mini
   and current_date - t.lateat < t2.maxi
group by t2.designation;
With a data set like yours:
create table test
(
id int
, lateAt date
);
insert into test
values (1231235, to_date('2019/09/14', 'yyyy/mm/dd'))
,(1242123, to_date('2019/09/13', 'yyyy/mm/dd'))
,(3465345, null)
,(5676548, to_date('2019/09/28', 'yyyy/mm/dd'))
,(8986475, to_date('2019/09/23', 'yyyy/mm/dd'));


cross join to get all dates and hours and avoid duplicate values

We have 2 tables:
sales
hourt (only 1 field (hourt) of numbers: 0 to 23)
The goal is to list all dates and all 24 hours for each day and group hours that have sales. For hours that do not have sales, zero will be shown.
This query cross joins the sales table with the hourt table and does list all dates and 24 hours. However, there are also many duplicate rows. How can we avoid the duplicates?
We're using Amazon Redshift (based on Postgres 8.0).
with h as (
SELECT
a.purchase_date,
CAST(DATE_PART("HOUR", AT_TIME_ZONE(AT_TIME_ZONE(CAST(a.purchase_date AS
DATETIME), "0:00"), "PST")) as INTEGER) AS Hour,
COUNT(a.quantity) AS QtyCount,
SUM(a.quantity) AS QtyTotal,
SUM(a.price) AS Price
FROM sales a
GROUP BY CAST(DATE_PART("HOUR",
AT_TIME_ZONE(AT_TIME_ZONE(CAST(a.purchase_date AS DATETIME), "0:00"),
"PST")) as INTEGER),
DATE_FORMAT(AT_TIME_ZONE(AT_TIME_ZONE(CAST(a.purchase_date AS DATETIME),
"0:00"), "PST"), "yyyy-MM-dd")
ORDER by a.purchase_date
),
hr as (
SELECT
CAST(hourt AS INTEGER) AS hourt
FROM hourt
),
joined as (
SELECT
purchase_date,
hourt,
QtyCount,
QtyTotal,
Price
FROM h
cross JOIN hr
)
SELECT *
FROM joined
Order by purchase_date,hourt
(The sample tables and the desired results table were shown as images in the original post. Before the cross join, the query returned the correct sales and grouped hours.)
You need to create a series of all the hour values and left join your data back to that. Comments inline explain the logic.
WITH data AS (-- Do the basic aggregation first
SELECT DATE_TRUNC('hour',a.purchase_date) purchase_hour --Truncate timestamp to the hour is simpler
,COUNT(a.quantity) AS QtyCount
,SUM(a.quantity) AS QtyTotal
,SUM(a.price) AS Price
FROM sales a
GROUP BY DATE_TRUNC('hour',a.purchase_date)
ORDER BY DATE_TRUNC('hour',a.purchase_date)
-- SELECT '2017-01-13 12:00:00'::TIMESTAMP purchase_hour, 1 qty_count, 1 qty_total, 119 price
-- UNION ALL SELECT '2017-01-13 15:00:00'::TIMESTAMP purchase_hour, 1 qty_count, 1 qty_total, 119 price
-- UNION ALL SELECT '2017-01-14 21:00:00'::TIMESTAMP purchase_hour, 1 qty_count, 1 qty_total, 119 price
)
,time_range AS (--Calculate the start and end **date** values
SELECT DATE_TRUNC('day',MIN(purchase_hour)) start_date
, DATE_TRUNC('day',MAX(purchase_hour))+1 end_date
FROM data
)
,hr AS (--Generate all hours between start and end
SELECT (SELECT start_date
FROM time_range
LIMIT 1) --Limit 1 so the optimizer knows it's not a correlated subquery
+ ((n-1) --Make the series start at zero so we don't miss the starting value
* INTERVAL '1 hour') AS "hour"
FROM (SELECT ROW_NUMBER() OVER () n
FROM stl_query --Can use any table here as long as it has enough rows
LIMIT 100) series
WHERE "hour" < (SELECT end_date FROM time_range LIMIT 1)
)
--Use NVL to replace missing values with zeroes
SELECT hr.hour AS purchase_hour --Timestamp like `2017-01-13 12:00:00`
, NVL(data.qty_count, 0) AS qty_count
, NVL(data.qty_total, 0) AS qty_total
, NVL(data.price, 0) AS price
FROM hr
LEFT JOIN data
ON hr.hour = data.purchase_hour
ORDER BY hr.hour
;
I achieved the desired results by using a LEFT JOIN (table A with table B) instead of a CROSS JOIN of these two tables:
Table A has all the dates and hours
Table B is the first part of the original query
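For reference, a minimal sketch of that left join; table_a and table_b are hypothetical stand-ins for the two tables described above:
-- table_a: all dates and hours; table_b: the hourly aggregation
select a.purchase_date,
       a.hourt,
       nvl(b.QtyCount, 0) as QtyCount,
       nvl(b.QtyTotal, 0) as QtyTotal,
       nvl(b.Price, 0) as Price
from table_a a
left join table_b b
  on a.purchase_date = b.purchase_date
 and a.hourt = b.hourt
order by a.purchase_date, a.hourt;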

Counting an already counted column in SQL (DB2)

I'm pretty new to SQL and have this problem:
I have a filled table with a date column and other, uninteresting columns.
date | name | name2
2015-03-20 | peter | pan
2015-03-20 | john | wick
2015-03-18 | harry | potter
What I'm doing right now is counting everything for a date:
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
What I want to do now is count the resulting lines and only return them if there are fewer than 10 resulting lines.
What I tried so far is surrounding the whole query with a temp table and then counting everything, which gives me the number of resulting lines (yeah):
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
)
select count(*)
from temp_count
What is still missing is the check whether that number is smaller than 10.
I was searching in this forum and came across some "having" constructs to use, but that forced me to use a "group by", which I can't.
I was thinking about something like this:
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
)
select *
from temp_count
having count(*) < 10
Maybe I'm too tired to think of an easy solution, but I can't solve this so far.
Edit: A picture for clarification since my English is horrible:
http://imgur.com/1O6zwoh
I want to see the 2-column results ONLY IF there are fewer than 10 rows overall.
I think you just need to move your having clause to the inner query so that it is paired with the GROUP BY:
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
having count(*) < 10
)
select *
from temp_count
If what you want is to know whether the total # of records (after grouping) reaches 10, then you could do this:
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
)
select date, counter
from (
select date, counter, row_number() over (order by date) as rseq
from temp_count
) x
group by date, counter
having max(rseq) >= 10
This will return 0 rows if there are fewer than 10 in total, and will deliver ALL the results if there are 10 or more (with this you can also just take the first 10 rows if needed).
In your temp_count table, you can filter results with the WHERE clause:
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
)
select *
from temp_count
where counter < 10
Something like:
with t(dt, rn, cnt) as (
  select date, row_number() over (order by date) as rn
       , count(1) as cnt
  from testtable
  where date >= current date - 10 days
  group by date
)
select dt, cnt
from t
where 10 >= (select max(rn) from t);
will do what you want (I think)
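For completeness, the same "fewer than 10 rows overall" condition can also be phrased with a scalar subquery against the CTE (a sketch, untested on DB2):
with temp_count (date, counter) as
(
  select date, count(*)
  from testtable
  where date >= current date - 10 days
  group by date
)
select *
from temp_count
where (select count(*) from temp_count) < 10;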

How to write an SQL query for the following case?

I have one Change Report table which has two columns: ChangedTime and FileName.
Please consider that this table has over 1000 records.
Here I need to query all the changes based on the following factors:
i) Interval (i.e. 1 min)
ii) No. of files
It means: when we are given an interval of 1 minute and a file count of 10, and the number of changed files is more than 10 in any 1-minute interval, we need to get all the changed files that exist in that 1-minute interval.
Example:
i) Consider we have 15 changes in the interval 11:52 to 11:53.
ii) And consider we have 20 changes in the interval 12:58 to 12:59.
Now my expected result would be 35 records.
Thanks in advance.
You need to aggregate by the interval and then do the count. Assuming that an interval starting at time 0 is ok, the following should work:
declare @interval int = 1;
declare @limit int = 10;
select sum(cnt)
from (select count(*) as cnt
      from t
      group by DATEDIFF(minute, 0, ChangedTime) / @interval
     ) t
where cnt >= @limit;
If you have another time in mind for when intervals should start, then substitute that for 0.
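To illustrate the bucketing expression (timestamps invented for the example), two changes in the same minute produce the same bucket number and therefore group together:
select DATEDIFF(minute, 0, cast('2019-09-14 11:52:10' as datetime)) / @interval as bucket_a,
       DATEDIFF(minute, 0, cast('2019-09-14 11:52:50' as datetime)) / @interval as bucket_b;
-- bucket_a = bucket_b; a change at 11:53:10 would land in the next bucket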
EDIT:
For your particular query:
select sum(ChangedTime)
from (select count(*) as ChangedTime
      from [MyDB].[dbo].[Log_Table.in_PC]
      group by DATEDIFF(minute, 0, ChangedTime) / @interval
     ) t
where ChangedTime >= @limit;
You can't have a three part alias name on a subquery. t will do.
Something like this should work:
You count the number of records using the COUNT() function.
Then you limit the selection with the WHERE clause:
SELECT COUNT(FileName)
FROM "YourTable"
WHERE ChangedTime >= "StartInterval"
  AND ChangedTime <= "EndInterval";
Another method that is useful in a WHERE clause is BETWEEN: http://msdn.microsoft.com/en-us/library/ms187922.aspx
You didn't state which SQL DB you are using, so I assume it's MSSQL.
select count(*)
from (select a.FileName,
             b.ChangedTime startTime,
             a.ChangedTime endTime,
             DATEDIFF(minute, b.ChangedTime, a.ChangedTime) timeInterval
      from yourtable a, yourtable b
      where a.FileName = b.FileName
        and a.ChangedTime > b.ChangedTime
        and DATEDIFF(minute, b.ChangedTime, a.ChangedTime) = 1) temp
group by temp.FileName

Join a count query on generate_series() and retrieve Null values as '0'

I want to count ID's per month using generate_series(). This query works in PostgreSQL 9.1:
SELECT (to_char(serie,'yyyy-mm')) AS year, sum(amount)::int AS eintraege FROM (
SELECT
COUNT(mytable.id) as amount,
generate_series::date as serie
FROM mytable
RIGHT JOIN generate_series(
(SELECT min(date_from) FROM mytable)::date,
(SELECT max(date_from) FROM mytable)::date,
interval '1 day') ON generate_series = date(date_from)
WHERE version = 1
GROUP BY generate_series
) AS foo
GROUP BY Year
ORDER BY Year ASC;
This is my output:
"2006-12" | 4
"2007-02" | 1
"2007-03" | 1
But what I want to get is this output ('0' value in January):
"2006-12" | 4
"2007-01" | 0
"2007-02" | 1
"2007-03" | 1
Months without id should be listed nevertheless.
Any ideas how to solve this?
Sample data:
drop table if exists mytable;
create table mytable(id bigint, version smallint, date_from timestamp);
insert into mytable(id, version, date_from) values
(4084036, 1, '2006-12-22 22:46:35'),
(4084938, 1, '2006-12-23 16:19:13'),
(4084938, 2, '2006-12-23 16:20:23'),
(4084939, 1, '2006-12-23 16:29:14'),
(4084954, 1, '2006-12-23 16:28:28'),
(4250653, 1, '2007-02-12 21:58:53'),
(4250657, 1, '2007-03-12 21:58:53')
;
Untangled, simplified and fixed, it might look like this:
SELECT to_char(s.tag,'yyyy-mm') AS monat
, count(t.id) AS eintraege
FROM (
SELECT generate_series(min(date_from)::date
, max(date_from)::date
, interval '1 day'
)::date AS tag
FROM mytable t
) s
LEFT JOIN mytable t ON t.date_from::date = s.tag AND t.version = 1
GROUP BY 1
ORDER BY 1;
db<>fiddle here
Among all the noise, misleading identifiers and unconventional format, the actual problem was hidden here:
WHERE version = 1
You made correct use of RIGHT [OUTER] JOIN. But adding a WHERE clause that requires an existing row from mytable converts the RIGHT [OUTER] JOIN to an [INNER] JOIN effectively.
Move that filter into the JOIN condition to make it work.
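To see the difference in isolation, here is a minimal sketch; series_of_days is a hypothetical stand-in for the generated day series:
-- WHERE filters after the join; NULL-extended rows fail "t.version = 1",
-- so the outer join silently degrades to an inner join:
select s.tag, count(t.id)
from series_of_days s
left join mytable t on t.date_from::date = s.tag
where t.version = 1
group by 1;
-- With the predicate in the join condition, days without matches survive with count 0:
select s.tag, count(t.id)
from series_of_days s
left join mytable t on t.date_from::date = s.tag and t.version = 1
group by 1;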
I simplified some other things while being at it.
Better yet:
SELECT to_char(mon, 'yyyy-mm') AS monat
, COALESCE(t.ct, 0) AS eintraege
FROM (
SELECT date_trunc('month', date_from)::date AS mon
, count(*) AS ct
FROM mytable
WHERE version = 1
GROUP BY 1
) t
RIGHT JOIN (
SELECT generate_series(date_trunc('month', min(date_from))
, max(date_from)
, interval '1 mon')::date
FROM mytable
) m(mon) USING (mon)
ORDER BY mon;
db<>fiddle here
It's much cheaper to aggregate first and join later - joining one row per month instead of one row per day.
It's cheaper to base GROUP BY and ORDER BY on the date value instead of the rendered text.
count(*) is a bit faster than count(id), while equivalent in this query.
generate_series() is a bit faster and safer when based on timestamp instead of date. See:
Generating time series between two dates in PostgreSQL
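For example (illustrative only), a month series generated from timestamps and cast to date at the end:
select generate_series(timestamp '2006-12-01'
                     , timestamp '2007-03-01'
                     , interval '1 mon')::date as mon;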

How to write a database view that expands data into multiple rows?

I have a database table that contains collection data for product collected from a supplier and I need to produce an estimate of month-to-date production figures for that supplier using an Oracle SQL query. Each day can have multiple collections, and each collection can contain product produced across multiple days.
Here's an example of the raw collection data:
Date        Volume  CollectionNumber  ProductionDays
2011-08-22  500     1                 2
2011-08-22  200     2                 2
2011-08-20  600     1                 2
Creating a month-to-date estimate is tricky because the first day of the month may have a collection for two days worth of production. Only a portion of that collected volume is actually attributable to the current month.
How can I write a query to produce this estimate?
My gut feeling is that I should be able to create a database view that transforms the raw data into estimated daily production figures by summing collections on the same day and distributing collection volumes across the number of days they were produced on. This would allow me to write a simple query to find the month-to-date production figure.
Here's what the above collection data would look like after being transformed into estimated daily production figures:
Date        VolumeEstimate
2011-08-22  350
2011-08-21  350
2011-08-20  300
2011-08-19  300
Am I on the right track? If so, how can this be implemented? I have absolutely no idea how to do this type of transformation in SQL. If not, what is a better approach?
Note: I cannot do this calculation in application code since that would require a significant code change which we can't afford.
try
CREATE TABLE TableA (ProdDate DATE, Volume NUMBER, CollectionNumber NUMBER, ProductionDays NUMBER);
INSERT INTO TableA VALUES (TO_DATE ('20110822', 'YYYYMMDD'), 500, 1, 2);
INSERT INTO TableA VALUES (TO_DATE ('20110822', 'YYYYMMDD'), 200, 2, 2);
INSERT INTO TableA VALUES (TO_DATE ('20110820', 'YYYYMMDD'), 600, 1, 2);
COMMIT;
CREATE VIEW DailyProdVolEst AS
SELECT DateList.TheDate, SUM (DateRangeSums.DailySum) VolumeEstimate FROM
(
SELECT ProdStart, ProdEnd, SUM (DailyProduction) DailySum
FROM
(
SELECT (ProdDate - ProductionDays + 1) ProdStart, ProdDate ProdEnd, CollectionNumber, VolumeSum/ProductionDays DailyProduction
FROM
(
Select ProdDate, CollectionNumber, ProductionDays, Sum (Volume) VolumeSum FROM TableA
GROUP BY ProdDate, CollectionNumber, ProductionDays
)
)
GROUP BY ProdStart, ProdEnd
) DateRangeSums,
(
SELECT A.MinD + MyList.L TheDate FROM
(SELECT MIN (ProdDate - ProductionDays + 1) MinD FROM TableA) A,
(SELECT LEVEL - 1 L FROM DUAL CONNECT BY LEVEL <= (SELECT Max (ProdDate) - MIN (ProdDate - ProductionDays + 1) + 1 FROM TableA)) MyList
) DateList
WHERE DateList.TheDate BETWEEN DateRangeSums.ProdStart AND DateRangeSums.ProdEnd
GROUP BY DateList.TheDate;
The view DailyProdVolEst gives you dynamically the result you described... though some "constraints" apply:
the combination of ProdDate and CollectionNumber should be unique.
the ProductionDays need to be > 0 for all rows
EDIT - as requested in a comment:
How this query works:
It finds out what the smallest + biggest date in the table are, then builds rows with each row being a date in that range (DateList)... this is matched up against a list of rows containing the daily sum for unique combinations of ProdDate Start/End (DateRangeSums) and sums it up on the date level.
What do SUM (DateRangeSums.DailySum) and SUM (DailyProduction) do ?
Both sum things up - SUM (DateRangeSums.DailySum) sums up in cases of partially overlapping date ranges, and SUM (DailyProduction) sums up within one date range if there is more than one CollectionNumber. Without SUM the GROUP BY wouldn't be needed.
I think a UNION query will do the trick for you. You aren't using the CollectionNumber field in your example, so I excluded it from the sample below.
Something similar to the below query should work (Disclaimer: My oracle db isn't accessible to me at the moment):
SELECT Date, SUM(Volume) VolumeEstimate
FROM
  (SELECT Date, SUM(Volume / ProductionDays) Volume
   FROM [Table]
   GROUP BY Date
   UNION
   SELECT (Date - 1) Date, SUM(Volume / 2)
   FROM [Table]
   WHERE ProductionDays = 2
   GROUP BY Date - 1)
GROUP BY Date
It sounds like what you want to do is sum up by day and then use a tally table to divide out the results.
Here's a runnable example with your data in T-SQL dialect:
DECLARE @tbl AS TABLE (
    [Date] DATE
  , Volume INT
  , CollectionNumber INT
  , ProductionDays INT);
INSERT INTO @tbl
VALUES ('2011-08-22', 500, 1, 2)
     , ('2011-08-22', 200, 2, 2)
     , ('2011-08-20', 600, 1, 2);
WITH Numbers AS (SELECT 1 AS N UNION ALL SELECT 2 AS N)
, AssignedVolumes AS (
    SELECT t.*
         , t.Volume / t.ProductionDays AS PerDay
         , DATEADD(d, 1 - n.N, t.[Date]) AS AssignedDate
    FROM @tbl AS t
    INNER JOIN Numbers AS n
      ON n.N <= t.ProductionDays
)
SELECT AssignedDate
     , SUM(PerDay)
FROM AssignedVolumes
GROUP BY AssignedDate;
I dummied up a simple numbers table with only 1 and 2 in it to perform the pivot. Typically you'll have a table with a million numbers in sequence.
For Oracle, the only thing you should need to change would be the DATEADD.
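For instance, a hypothetical Oracle translation (an untested sketch reusing TableA from the first answer, with a CONNECT BY row generator in place of the Numbers CTE):
WITH Numbers AS (SELECT LEVEL AS N FROM dual CONNECT BY LEVEL <= 2)
SELECT t.ProdDate - (n.N - 1) AS AssignedDate
     , SUM(t.Volume / t.ProductionDays) AS VolumeEstimate
FROM TableA t
JOIN Numbers n ON n.N <= t.ProductionDays
GROUP BY t.ProdDate - (n.N - 1)
ORDER BY 1 DESC;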