Query question regarding aggregates over a date range - sql

I have a data set where the structure could be like this
yes_no date
0 1/1/2011
1 1/1/2011
1 1/2/2011
0 1/4/2011
1 1/9/2011
Given a start data and and end date, I would like to create a query where it would aggregate over the date and provide a 0 for dates that do not exist in the table, for dates between start_data and end_date including both
This is in SQL.
I am stumped. I can get the aggregate queries very simply, but i don't know how to get zeros for dates that do not exist in the table.

If you're working with a DBMS that supports common table expressions, the following will generate a derived table of dates that you can then left join to your table. This was written for MSSQL, so you may need to derive your dates differently (i.e., an object other than master..spt_values)
with AllDates as (
select top 100000
convert(datetime, row_number() over (order by x.name)) as 'Date'
from
master..spt_values x
cross join master..spt_values y
)
select
ad.Date, isnull(yt.yn, 0)
from
AllDates ad
left join (
select date, sum(yes_no) yn
from YourTable yt
) yt
on ad.date = yt.date
where
ad.Date between YourStartDate and YourEndDate

Generating the dates has to be the way to go.
In ORACLE you could join on to a list of dates, why not..
(SELECT TRUNC(startdate + LEVEL)
FROM DUAL CONNECT BY LEVEL <(enddate-startdate))
If you can't generate your dates on-the-fly
a database agnostic solution would be to create a table containing all of the dates you will ever need and join on to that. (this should be your last resort)
here's the pseudeo code, you will need to substitute mydates for either the on-the fly sql or date table select
SELECT
CASE WHEN COUNT(b.date)=0
THEN
0
ELSE
1
END as yes_no
FROM (mydates) a
LEFT JOIN aggtable b ON a.date=b.date

Related

Add missing months with values from previous month

I need to use this SQL query for a software and get the time in a particular format hence the reason for the Time column however I need the query to insert the months that are missing with the value from the previous month. This is the query I currently have.
SELECT [accountnumber],SUM([postingamount]) AS Amount, [accountingdate],
convert(varchar(4),year(accountingdate))+'M'+ Format(DATEPART( MONTH, accountingdate) , '00')
AS [Time]
FROM [7 GL Detail MACL]
where [accountingdate]>='2019-01-01'
GROUP BY [accountingdate],[postingamount],[accountnumber]
Current Results
Expected Results
Since you didn't specify the RDBMS system you're using, I can't guarantee that this logic will work because every system uses slightly different SQL syntax.
However I used Rasgo datespine function to generate this SQL, as it is quite complex to wrap your head around, and tested it on Snowflake.
The main differences between Snowflake and other systems are: DATEADD and TABLE (GENERATOR())
In case you can't modify this to work in your system, here are the basic steps which you'll want to follow:
Select unique accountnumbers
Select unique dates (month beginnings?) This is where Snowflake uses GENERATOR but other systems might actually have a Calendar table you can select from
Cross Join (cartesian join) these to create every possible combination of accountnumber and date
Outer Join #3 to your data (might have to truncate your date to month-begin)
Filter out rows that dont apply. Like for instance you might have just inserted a row for 1/1/2019 for an account that didn't even begin until 12/12/2020.
WITH GLOBAL_SPINE AS (
SELECT
ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID,
DATEADD('MONTH', (INTERVAL_ID - 1), '2019-01-01'::timestamp_ntz) as SPINE_START,
DATEADD('MONTH', INTERVAL_ID, '2022-06-01'::timestamp_ntz) as SPINE_END
FROM TABLE (GENERATOR(ROWCOUNT => 42))
),
GROUPS AS (
SELECT
accountnumber,
MIN(DESIRED_INTERVAL) AS LOCAL_START,
MAX(DESIRED_INTERVAL) AS LOCAL_END
FROM [7 GL Detail MACL]
GROUP BY
accountnumber
),
GROUP_SPINE AS (
SELECT
accountnumber,
SPINE_START AS GROUP_START,
SPINE_END AS GROUP_END
FROM GROUPS G
CROSS JOIN LATERAL (
SELECT
SPINE_START, SPINE_END
FROM GLOBAL_SPINE S
WHERE S.SPINE_START >= G.LOCAL_START
)
)
SELECT
G.accountnumber AS GROUP_BY_accountnumber,
GROUP_START,
GROUP_END,
T.*
FROM GROUP_SPINE G
LEFT JOIN {{ your_table }} T
ON DESIRED_INTERVAL >= G.GROUP_START
AND DESIRED_INTERVAL < G.GROUP_END
AND G.accountnumber = T.accountnumber;
You were also doing an aggregation step, but I figure once you get this complicated part down, you can figure out how to finally aggregate it the way you want it.

PostgreSQL GROUP BY that includes zeros

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from postgres. But perhaps it would better if I could just tell postgresql to include zeros.
I suspect this isn't easy at all, because postgres has no obvious way of knowing what I'm up to. But in the interests of learning more about postgres and SQL, I thought I'd have try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it take some fun twists through constructing SQL queries.
#a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so had trimmed out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become inner (I think) and the date sequence drops its zeros again. Fwiw, the table TT is a table because sometimes it actually is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1's instead of 0's at the places that should be 0.
The reason for that, however, is because I've now selected every date, so, of course, there's one of each. I need to include an additional field (which will be NULL) and count that.
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count an unrelated field, which I'd simply missed on my first several readings.
You will need to join to a list of dates. This can e.g. be done using generate_series()
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).

Iterate through SQL table creating another SQL Table

Wonder if someone could cast an eye over the following problem:
I'm running a SQL SELECT statement which gives me the following results:
DATE NumberOfHours
2017-05-01 4
2017-06-01 38
2017-07-01 68
And what I'm trying (like to be able to) to do is off the back of this table create another table that contains 4 rows for 2017-05-01, 38 Rows for 2017-06-01 and 68 rows for 2017-07-01. So I end up with a table that's got 110 rows in it.
I'm at a bit of a loss as to how this could be achieved...could anyone assist?
////////////////////////////////////////////////////////////
Using the response listed by Gordon Linoff I managed to get this working working by using:
with cte as (
SELECT DATEADD(month, datediff(month,0,L.DateAdded),0) AS 'Date', CEILING(SUM(l.CPDHours))AS NumberOfHours
FROM WebsiteICA_SF.dbo.CPD_Log L
WHERE L.DateAdded >= DATEADD(month, -6, GETDATE())
AND (L.Provider = 'ICA' OR L.Provider like 'International Compli%')
GROUP BY DATEADD(month, datediff(month,0,L.DateAdded),0)
union all
select date, NumberOfHours - 1
from cte
where NumberOfHours > 1
)
select 1 AS 'ObId', date, 'ICA' AS Provider, '# ICA' AS DataType
from cte
order by DATEADD(month, datediff(month,0,cte.Date),0)
OPTION (maxrecursion 10000);
One simple method is a recursive CTE:
with cte as (
select date, NumberOfHours
from t
union all
select date, NumberOfHours - 1
from cte
where NumberOfHours > 1
)
select date
from cte;
By default, this is limited to a maximum of 100 hours. However, that is easily changed using the MAXRECURSION option.
Other methods generally rely on a second table to generate numbers. I also like this approach because it is a gentle introduction to recursive CTEs.
Here is a nice SQL Fiddle.
So you have a result set with 3 rows and one column in it which tells you how many rows it represents. You want to generate that many rows.
Not sure what you want to store in that, but here is a solution to the base problem:
Create a table (temp table or CTE is fine too) which contains only one column, storing numbers from 0 to whatever. This is called Tally Table or Numbers Table.
Join this table to your resultset:
WITH NumbersCTE AS (
-- This will give you a bunch of Numbers
-- Persist a table if you want to use it more frequently
SELECT ROW_NUMBER() OVER (ORDER BY name) AS Number FROM sys.columns
)
SELECT
MT.Date,
N.Number
FROM
dbo.MyTable MT
INNER JOIN NumbersCTE N
ON N.Number <= MT.NumberOfHours
As Pieter Geerkens pointed out in the comments, the above method is not the best to generate a numbers table, but for demostration puposes it is fine.
For more info about how to generat tally tables in SQL Server, you can check
http://www.sqlservercentral.com/blogs/dwainsql/2014/03/27/tally-tables-in-t-sql/

SQL Server get customer with 7 consecutive transactions

I am trying to write a query that would get the customers with 7 consecutive transactions given a list of CustomerKeys.
I am currently doing a self join on Customer fact table that has 700 Million records in SQL Server 2008.
This is is what I came up with but its taking a long time to run. I have an clustered index as (CustomerKey, TranDateKey)
SELECT
ct1.CustomerKey,ct1.TranDateKey
FROM
CustomerTransactionFact ct1
INNER JOIN
#CRTCustomerList dl ON ct1.CustomerKey = dl.CustomerKey --temp table with customer list
INNER JOIN
dbo.CustomerTransactionFact ct2 ON ct1.CustomerKey = ct2.CustomerKey -- Same Customer
AND ct2.TranDateKey >= ct1.TranDateKey
AND ct2.TranDateKey <= CONVERT(VARCHAR(8), (dateadd(d, 6, ct1.TranDateTime), 112) -- Consecutive Transactions in the last 7 days
WHERE
ct1.LogID >= 82800000
AND ct2.LogID >= 82800000
AND ct1.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
AND ct2.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
GROUP BY
ct1.CustomerKey,ct1.TranDateKey
HAVING
COUNT(*) = 7
Please help make it more efficient. Is there a better way to write this query in 2008?
You can do this using window functions, which should be much faster. Assuming that TranDateKey is a number and you can subtract a sequential number from it, then the difference constant for consecutive days.
You can put this in a query like this:
SELECT CustomerKey, MIN(TranDateKey), MAX(TranDateKey)
FROM (SELECT ct.CustomerKey, ct.TranDateKey,
(ct.TranDateKey -
DENSE_RANK() OVER (PARTITION BY ct.CustomerKey, ct.TranDateKey)
) as grp
FROM CustomerTransactionFact ct INNER JOIN
#CRTCustomerList dl
ON ct.CustomerKey = dl.CustomerKey
) t
GROUP BY CustomerKey, grp
HAVING COUNT(*) = 7;
If your date key is something else, there is probably a way to modify the query to handle that, but you might have to join to the dimension table.
This would be a perfect task for a COUNT(*) OVER (RANGE ...), but SQL Server 2008 supports only a limited syntax for Windowed Aggregate Functions.
SELECT CustomerKey, MIN(TranDateKey), COUNT(*)
FROM
(
SELECT CustomerKey, TranDateKey,
dateadd(d,-ROW_NUMBER()
OVER (PARTITION BY CustomerKey
ORDER BY TranDateKey),TranDateTime) AS dummyDate
FROM CustomerTransactionFact
) AS dt
GROUP BY CustomerKey, dummyDate
HAVING COUNT(*) >= 7
The dateadd calculates the difference between the current TranDateTime and a Row_Number over all date per customer. The resulting dummyDatehas no actual meaning, but is the same meaningless date for consecutive dates.

Unpivot date columns to a single column of a complex query in Oracle

Hi guys, I am stuck with a stubborn problem which I am unable to solve. Am trying to compile a report wherein all the dates coming from different tables would need to come into a single date field in the report. Ofcourse, the max or the most recent date from all these date columns needs to be added to the single date column for the report. I have multiple users of multiple branches/courses for whom the report would be generated.
There are multiple blogs and the latest date w.r.t to the blogtitle needs to be grouped, i.e. max(date_value) from the six date columns should give the greatest or latest date for that blogtitle.
Expected Result:
select u.batch_uid as ext_person_key, u.user_id, cm.batch_uid as ext_crs_key, cm.crs_id, ir.role_id as
insti_role, (CASE when b.JOURNAL_IND = 'N' then
'BLOG' else 'JOURNAL' end) as item_type, gm.title as item_name, gm.disp_title as ITEM_DISP_NAME, be.blog_pk1 as be_blogPk1, bc.blog_entry_pk1 as bc_blog_entry_pk1,bc.pk1,
b.ENTRY_mod_DATE as b_ENTRY_mod_DATE ,b.CMT_mod_DATE as BlogCmtModDate, be.CMT_mod_DATE as be_cmnt_mod_Date,
b.UPDATE_DATE as BlogUpDate, be.UPDATE_DATE as be_UPDATE_DATE,
bc.creation_date as bc_creation_date,
be.CREATOR_USER_ID as be_CREATOR_USER_ID , bc.creator_user_id as bc_creator_user_id,
b.TITLE as BlogTitle, be.TITLE as be_TITLE,
be.DESCRIPTION as be_DESCRIPTION, bc.DESCRIPTION as bc_DESCRIPTION
FROM users u
INNER JOIN insti_roles ir on u.insti_roles_pk1 = ir.pk1
INNER JOIN crs_users cu ON u.pk1 = cu.users_pk1
INNER JOIN crs_mast cm on cu.crsmast_pk1 = cm.pk1
INNER JOIN blogs b on b.crsmast_pk1 = cm.pk1
INNER JOIN blog_entry be on b.pk1=be.blog_pk1 AND be.creator_user_id = cu.pk1
LEFT JOIN blog_CMT bc on be.pk1=bc.blog_entry_pk1 and bc.CREATOR_USER_ID=cu.pk1
JOIN gradeledger_mast gm ON gm.crsmast_pk1 = cm.pk1 and b.grade_handler = gm.linkId
WHERE cu.ROLE='S' AND BE.STATUS='2' AND B.ALLOW_GRADING='Y' AND u.row_status='0'
AND u.available_ind ='Y' and cm.row_status='0' and and u.batch_uid='userA_157'
I am getting a resultset for the above query with multiple date columns which I want > > to input into a single columnn. The dates have to be the most recent, i.e. max of the dates in the date columns.
I have successfully done the Unpivot by using a view to store the above
resultset and put all the dates in one column. However, I do not
want to use a view or a table to store the resultset and then do
Unipivot simply because I cannot keep creating views for every user
one would query for.
The max(date_value) from the date columns need to be put in one single column. They are as follows:
* 1) b.entry_mod_date, 2) b.cmt_mod_date ,3) be.cmt_mod_date , 4) b.update_Date ,5) be.update_date, 6) bc.creation_date *
Apologies that I could not provide the desc of all the tables and the
fields being used.
Any help to get the above mentioned max of the dates from these
multiple date columns into a single column without using a view or a
table would be greatly appreciated.*
It is not clear what results you want, but the easiest solution is to use greatest().
with t as (
YOURQUERYHERE
)
select t.*,
greatest(entry_mod_date, cmt_mod_date, cmt_mod_date, update_Date,
update_date, bc.creation_date
) as greatestdate
from t;
select <columns>,
case
when greatest (b_ENTRY_mod_DATE) >= greatest (BlogCmtModDate) and greatest(b_ENTRY_mod_DATE) >= greatest(BlogUpDate)
then greatest( b_ENTRY_mod_DATE )
--<same implementation to compare each time BlogCmtModDate and BlogUpDate separately to get the greatest then 'date'>
,<columns>
FROM table
<rest of the query>
UNION ALL
Select <columns>,
case
when greatest (be_cmnt_mod_Date) >= greatest (be_UPDATE_DATE)
then greatest( be_cmnt_mod_Date )
when greatest (be_UPDATE_DATE) >= greatest (be_cmnt_mod_Date)
then greatest( be_UPDATE_DATE )
,<columns>
FROM table
<rest of the query>
UNION ALL
Select <columns>,
GREATEST(bc_creation_date)
,<columns>
FROM table
<rest of the query>