Using Distinct and MAX(date) in a large data - sql

I have a table that stores the list of users who have accessed a product (with the access date).
I have written the below query to get the list of users who have accessed the product B between '2021-02-01' and '2021-02-26'.
SELECT DISTINCT UserName,Country,ADate,Product FROM Report WHERE UserName != '-' and Product='B' and (CAST(ADate AS DATE) BETWEEN #startdate AND #enddate)
then it gives the below result:
UserName Country ADate Product
-------- ------ -------- ---------
asson IN 2021-02-10 00:00:00.000 B
rajan US 2021-02-23 00:00:00.000 B
rajan US 2021-02-25 00:00:00.000 B
moody US 2021-02-14 00:00:00.000 B
rajon US 2021-02-01 00:00:00.000 B
lukman US 2021-02-10 00:00:00.000 B
Since the user rajan accessed the product on two different days, the result shows two entries for rajan even though I added DISTINCT. So I modified the query as below:
SELECT DISTINCT UserName,Country,max(ADate),Product FROM Report WHERE UserName != '-' and Product='B' and (CAST(ADate AS DATE) BETWEEN #startdate AND #enddate) group by Username,product
This query gives me the required result. The problem I am facing now is that when I query a range longer than a month (say, data spanning two months), some data is missing from the output. I believe it might be due to the MAX(ADate). Can anyone suggest how to fix this?

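This will give you the latest access date of each user by month (the month expression is repeated in the GROUP BY because not every DBMS allows grouping by a column alias):
SELECT DISTINCT UserName,Country, month(ADate) as month, max(ADate),Product FROM Report WHERE UserName != '-' and Product='B' group by UserName, Country, month(ADate), Product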
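As a rough illustration of why the per-user MAX(ADate) "loses" rows over a longer range, here is a minimal sketch using Python's built-in sqlite3 module. The extra rajan row in March is hypothetical, and SQLite's strftime('%Y-%m', ...) stands in for month(); grouping on year and month together is an assumption made here so that the same month of different years is not merged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Report (UserName TEXT, Country TEXT, ADate TEXT, Product TEXT)"
)
conn.executemany(
    "INSERT INTO Report VALUES (?, ?, ?, ?)",
    [
        ("asson", "IN", "2021-02-10", "B"),
        ("rajan", "US", "2021-02-23", "B"),
        ("rajan", "US", "2021-02-25", "B"),
        ("rajan", "US", "2021-03-04", "B"),  # hypothetical row in a second month
        ("moody", "US", "2021-02-14", "B"),
    ],
)

# Grouping by user only: MAX(ADate) keeps a single row per user for the
# whole range, so rajan's February accesses disappear.
per_user = conn.execute(
    """
    SELECT UserName, Country, MAX(ADate), Product
    FROM Report
    WHERE UserName != '-' AND Product = 'B'
      AND ADate BETWEEN '2021-02-01' AND '2021-03-31'
    GROUP BY UserName, Country, Product
    ORDER BY UserName
    """
).fetchall()

# Grouping by user and year-month: one row per user per month.
per_user_month = conn.execute(
    """
    SELECT UserName, Country, strftime('%Y-%m', ADate) AS ym,
           MAX(ADate), Product
    FROM Report
    WHERE UserName != '-' AND Product = 'B'
      AND ADate BETWEEN '2021-02-01' AND '2021-03-31'
    GROUP BY UserName, Country, strftime('%Y-%m', ADate), Product
    ORDER BY UserName, ym
    """
).fetchall()

for row in per_user_month:
    print(row)
```

With this data the first query returns three rows and the second returns four, because rajan appears once for February and once for March.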


Subsetting on dates for a SQL query

Using Snowflake, I am attempting to select customers that have no current subscriptions, eliminating all IDs that have current/active contracts.
Each ID will typically have multiple records associated with a contract/renewal history for a particular ID/customer.
A customer is inactive only if none of its contracts extends beyond the current date. There are likely multiple past contracts that have lapsed, but the account is still active if any one of its contract end dates goes beyond the current date.
Consider the following table:
Date_Start  Date_End    Name    ID
----------  ----------  ------  ---
2015-07-03  2019-07-03  Piggly  001
2019-07-04  2025-07-04  Piggly  001
2013-10-01  2017-12-31  Doggy   031
2018-01-01  2018-06-30  Doggy   031
2020-01-01  2021-03-14  Catty   022
2021-03-15  2024-06-01  Catty   022
1999-06-01  2021-06-01  Horsey  052
2021-06-02  2022-01-01  Horsey  052
2022-01-02  2022-07-04  Horsey  052
The desired output is the non-active customers, i.e. those that do not have an end date beyond Jan 5th 2023 (or the current/an arbitrary date):
Name    ID
------  ---
Doggy   031
Horsey  052
My first attempt was:
SELECT Name, ID
FROM table
WHERE Date_End < GETDATE()
but the obvious problem is that I'll also be selecting past contracts of customers who haven't expired/churned and who have a contract that goes beyond the current date.
How do I resolve this?
As there are many rows per name and ID, you should aggregate the data and then use a HAVING clause to select only those you are interested in.
SELECT name, id
FROM table
GROUP BY name, id
HAVING MAX(date_end) < GETDATE();
You can work it out with an EXCEPT operator, if your DBMS supports it:
SELECT DISTINCT Name, ID FROM tab
EXCEPT
SELECT DISTINCT Name, ID FROM tab WHERE Date_end > <your_date>
This removes the active <Name, ID> pairs from the full set of pairs.
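Both suggestions can be checked against the sample data from the question. The following is only a sketch using Python's built-in sqlite3 module rather than Snowflake: the table name contracts is hypothetical, and the cutoff date is passed as a parameter in place of GETDATE().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (Date_Start TEXT, Date_End TEXT, Name TEXT, ID TEXT)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [
        ("2015-07-03", "2019-07-03", "Piggly", "001"),
        ("2019-07-04", "2025-07-04", "Piggly", "001"),
        ("2013-10-01", "2017-12-31", "Doggy", "031"),
        ("2018-01-01", "2018-06-30", "Doggy", "031"),
        ("2020-01-01", "2021-03-14", "Catty", "022"),
        ("2021-03-15", "2024-06-01", "Catty", "022"),
        ("1999-06-01", "2021-06-01", "Horsey", "052"),
        ("2021-06-02", "2022-01-01", "Horsey", "052"),
        ("2022-01-02", "2022-07-04", "Horsey", "052"),
    ],
)

cutoff = "2023-01-05"  # stands in for the current date

# Aggregate per customer and keep those whose latest end date has passed.
via_having = conn.execute(
    """
    SELECT Name, ID
    FROM contracts
    GROUP BY Name, ID
    HAVING MAX(Date_End) < ?
    ORDER BY Name
    """,
    (cutoff,),
).fetchall()

# Set difference: all customers minus those with a contract past the cutoff.
via_except = conn.execute(
    """
    SELECT DISTINCT Name, ID FROM contracts
    EXCEPT
    SELECT DISTINCT Name, ID FROM contracts WHERE Date_End > ?
    ORDER BY 1
    """,
    (cutoff,),
).fetchall()

print(via_having)
print(via_except)
```

One difference worth noting: a contract ending exactly on the cutoff date is treated as active by the HAVING form (strict <) and as inactive by the EXCEPT form (strict >), so the boundary should be chosen deliberately.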

Interpolate missing values in a query by date

Given the following query
SELECT
DATEADD(DAY, DATEDIFF(DAY, 0, [Created]), 0) [Date],
[Type], COUNT(*) as [Total]
FROM
Submissions
WHERE
[Offer] = 'template1'
GROUP BY
DATEADD(DAY, DATEDIFF(DAY, 0, [Created]), 0),
[Type]
ORDER BY 1;
I get the following output:
Date Type Total
----------------------- -------------------- -----------
2021-04-30 00:00:00.000 Online 1
2021-05-01 00:00:00.000 Mail 1
2021-05-01 00:00:00.000 Online 2
2021-05-10 00:00:00.000 Mail 1
My goal is to ensure that for each date, both types are summarized. In the event that no rows for a given type exist, I'd like to show 0 instead of missing the row entirely. How can I reform the query so that, for example, 2 rows exist for 2021-04-30, one with type Online as shown, and one with type Mail with a total of 0?
I got it working using something like below, but this seems like a pretty brute force way of going about it.
SELECT [Date], [Type], [Total] FROM
(
SELECT
DATEADD(DAY, DATEDIFF(DAY, 0, [Created]), 0) [Date],
[Type]
FROM
Submissions
WHERE [Offer] = 'template1'
) t1
PIVOT (
COUNT([Type])
FOR [Type] in ([Mail],[Online])
) p
UNPIVOT
(
[Total] FOR [Type] in ([Mail],[Online])
) p2
This results in what I am looking for:
Date Type Total
----------------------- ------------------- -----------
2021-04-30 00:00:00.000 Mail 0
2021-04-30 00:00:00.000 Online 1
2021-05-01 00:00:00.000 Mail 1
2021-05-01 00:00:00.000 Online 2
2021-05-10 00:00:00.000 Mail 1
2021-05-10 00:00:00.000 Online 0
Even your brute force approach doesn't work if the submission table has no rows for a particular date.
The standard approach is to use dimension tables to create a template of all the rows you desire, then left join your fact table on to it.
SELECT
calendar.date,
type.label,
COUNT(fact.id)
FROM
calendar
CROSS JOIN
type
LEFT JOIN
submissions AS fact
ON fact.created >= calendar.date
AND fact.created < calendar.date + 1
AND fact.type = type.label
AND fact.offer = 'template1'
WHERE
calendar.date BETWEEN ? AND ?
AND type.label IN ('Mail', 'Online')
GROUP BY
calendar.date,
type.label
Please excuse typos, I'm on my phone
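For illustration, the calendar/type template approach can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the tables calendar(day) and types(label), the id column on submissions, and the sample rows are all hypothetical, and SQLite's date(day, '+1 day') replaces calendar.date + 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE calendar (day TEXT);
    CREATE TABLE types (label TEXT);
    CREATE TABLE submissions (id INTEGER, created TEXT, type TEXT, offer TEXT);

    INSERT INTO calendar VALUES ('2021-04-30'), ('2021-05-01'), ('2021-05-02');
    INSERT INTO types VALUES ('Mail'), ('Online');
    INSERT INTO submissions VALUES
        (1, '2021-04-30 09:15:00', 'Online', 'template1'),
        (2, '2021-05-01 10:00:00', 'Mail',   'template1'),
        (3, '2021-05-01 11:30:00', 'Online', 'template1'),
        (4, '2021-05-01 17:45:00', 'Online', 'template1'),
        (5, '2021-05-01 12:00:00', 'Mail',   'other_offer');
    """
)

# calendar x types builds every (day, type) pair; the LEFT JOIN attaches
# matching submissions, and COUNT(fact.id) ignores the NULLs from
# unmatched pairs, which yields 0 for them.
rows = conn.execute(
    """
    SELECT calendar.day, types.label, COUNT(fact.id) AS total
    FROM calendar
    CROSS JOIN types
    LEFT JOIN submissions AS fact
           ON fact.created >= calendar.day
          AND fact.created <  date(calendar.day, '+1 day')
          AND fact.type  = types.label
          AND fact.offer = 'template1'
    WHERE calendar.day BETWEEN ? AND ?
    GROUP BY calendar.day, types.label
    ORDER BY calendar.day, types.label
    """,
    ("2021-04-30", "2021-05-02"),
).fetchall()

for row in rows:
    print(row)
```

The offer filter sits in the ON clause rather than the WHERE clause on purpose: moving it to WHERE would discard the unmatched (day, type) pairs and bring back the missing rows problem.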

Can not understand the logic of this query

This query is trying to get the s1ppmp (the price of the product) for each s1ilie (size), each s1iref (reference) and s1ydat (the latest date) for the price, because one product can have more than one price on different dates, for example a Black Friday price versus the normal price on other days.
The anmoisjour comes from the calendar table, but there is no connection between the calendar table and the main table msk110, so I don't understand the logic of this query.
SELECT s1isoc,
s1ilie,
s1iref,
s1ydat,
anmoisjour,
s1ppmp
FROM msk110
INNER JOIN (SELECT s1isoc AS isoc,
s1ilie AS ilie,
s1iref AS iref,
MAX(s1ydat) AS ydat,
anmoisjour
FROM calendrier,
msk110
WHERE s1ydat <= anmoisjour
AND anmoisjour BETWEEN 20100101 AND 20302131
GROUP BY s1isoc,
s1ilie,
s1iref,
anmoisjour) a ON s1isoc = isoc
AND s1ilie = ilie
AND s1iref = iref
AND s1ydat = ydat
WHERE s1isoc = 1
AND anmoisjour BETWEEN 20100101 AND 20302131
ORDER BY anmoisjour,
s1ydat;
s1isoc, s1ilie, s1iref, s1ydat, and s1ppmp come from msk110, and anmoisjour belongs to the calendar table, which is a date table.
I believe the confusion is the way that the calendar table is joined.
If anmoisjour is the day column of the calendar table and this table holds 1 row per day, the WHERE filter anmoisjour BETWEEN 20100101 AND 20302131 makes calendrier hold a row for each day for 20 years (2010 to 2030).
The product prices table msk110 is linked to the calendar table calendrier not by equality on the date but by an upper bound on it (msk110.s1ydat <= calendrier.anmoisjour). This means that, for example, a row whose msk110.s1ydat is 2015-01-01 will join against every row of the calendar table between 2015-01-01 and 2030-12-31.
The GROUP BY includes the calendar table's date (calendrier.anmoisjour). This means that if a particular product and size has price rows on several dates, say only on 2015-01-01, 2017-01-01 and 2020-01-01, then the result of the GROUP BY would be the following (ordered by calendar date, showing NULL rows only to demonstrate):
MAX(s1ydat) anmoisjour
null 2010-01-01
null ...
null 2014-12-31
2015-01-01 2015-01-01
2015-01-01 2015-01-02
2015-01-01 ...
2015-01-01 2016-01-01
2015-01-01 ...
2017-01-01 2017-01-01
2017-01-01 2017-01-02
2017-01-01 ...
2017-01-01 2019-12-31
2020-01-01 2020-01-01
2020-01-01 2025-01-01
2020-01-01 ...
What your query shows is, for each calendar day over those 20 years, the row of the product table with the most recent s1ydat on or before that day, i.e. the last price recorded for that product up to that day, restricted to s1isoc = 1 (I don't know what that means).
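To make the join pattern concrete, here is a small sketch using Python's built-in sqlite3 module. It is a simplification under assumptions: msk110 is reduced to three columns, the calendar covers five days, and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE calendrier (anmoisjour INTEGER);
    CREATE TABLE msk110 (s1iref TEXT, s1ydat INTEGER, s1ppmp REAL);

    INSERT INTO calendrier VALUES
        (20150101), (20150102), (20150103), (20150104), (20150105);
    -- Hypothetical price history: two price rows for one product.
    INSERT INTO msk110 VALUES
        ('REF1', 20150102, 10.0),
        ('REF1', 20150104, 8.0);
    """
)

# Inner query: for every calendar day, the latest price date on or before
# that day. Outer query: fetch the price row carrying that date.
rows = conn.execute(
    """
    SELECT a.anmoisjour, p.s1ydat, p.s1ppmp
    FROM msk110 AS p
    INNER JOIN (SELECT s1iref AS iref,
                       MAX(s1ydat) AS ydat,
                       anmoisjour
                FROM calendrier, msk110
                WHERE s1ydat <= anmoisjour
                GROUP BY s1iref, anmoisjour) AS a
            ON p.s1iref = a.iref
           AND p.s1ydat = a.ydat
    ORDER BY a.anmoisjour
    """
).fetchall()

for row in rows:
    print(row)
```

The first calendar day produces no row because no price exists yet on that day, and each later day repeats the most recent price until a newer one takes over.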

Teradata SQL: Determine how many accounts had status change in given month

Ok, so I have a table that looks something like this:
Acct_id Eff_dt Expr_dt Prod_cd Open_dt
-------------------------------------------------------
111 2012-05-01 2013-06-01 A 2012-05-01
111 2013-06-02 2014-03-08 A 2012-05-01
111 2014-03-09 9999-12-31 B 2012-05-01
222 2015-07-15 2015-11-11 A 2015-07-15
222 2015-11-12 2016-08-08 B 2015-07-15
222 2016-08-09 9999-12-31 A 2015-07-15
333 2016-01-01 2016-04-15 B 2016-01-01
333 2016-04-16 2016-08-08 B 2016-01-01
333 2016-08-09 9999-12-31 A 2016-01-01
444 2017-02-03 2017-05-15 A 2017-02-03
444 2017-05-16 2017-12-02 A 2017-02-03
444 2017-12-03 9999-12-31 B 2017-02-03
555 2017-12-12 9999-12-31 B 2017-12-12
There are many more columns that I'm not including as they're otherwise not relevant.
What I'm trying to determine is how many accounts had a change in Prod_cd in a given month, but then only in one direction (so from A > B in this example). Sometimes however an account was first opened as B, and then later changed to A. Or it was opened as A, changed to B, and moved back to A. I only want to know the current set of accounts where in a given month the Prod_cd changed from A to B.
Eff_dt is the date when a change was made to an account (could be any change, such as address change, name change, or what I'm looking for, product code change).
Expr_dt is the expiration date of that row, essentially the last day before a new change was made. When the date of that row is 9999-12-31, that's the most current row.
Open_dt is the date the account was created.
I created a query at first that was something like this:
select
count(distinct acct_id)
from table
where prod_cd = 'B'
and expr_dt = '9999-12-31'
and eff_dt between '2017-12-01' and '2017-12-31'
and open_dt < '2017-12-01'
But it's giving me results that don't look right. I want to specifically track the # of conversions that happened, but the count of accounts I'm getting seems way too high.
There is probably a way to create a more reliable query using window functions, but given that the Prod_cd changes can happen in multiple directions, I'm not sure how to write that query. Any help would be appreciated!
If you are specifically looking for the switch A --> B, then the simplest method is to use lag(). But, Teradata requires a slightly different formulation:
select count(distinct acct_id)
from (select t.*,
max(prod_cd) over (partition by acct_id order by eff_dt rows between 1 preceding and 1 preceding) as prev_prod_cd
from t
) t
where prod_cd = 'B' and prev_prod_cd = 'A' and
expr_dt = '9999-12-31' and
eff_dt between '2017-12-01' and '2017-12-31' and
open_dt < '2017-12-01';
I am guessing that the date conditions go in the outer query, meaning that the lag() does not use them.
Similar to Gordon's answer, but using a supported window function (instead of LAG) and using Teradata's QUALIFY clause to do the lag-gy lookup:
SELECT DISTINCT acct_id
FROM mytable
QUALIFY
MAX(prod_cd) OVER (PARTITION BY acct_id ORDER BY eff_dt ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) = 'A'
AND prod_cd = 'B'
AND expr_dt = '9999-12-31'
AND eff_dt between DATE '2013-01-01' and DATE '2017-12-31'
AND open_dt < DATE '2017-12-01'
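The effect of the previous-row lookup can be sketched with Python's built-in sqlite3 module, which is an assumption here (it needs SQLite 3.25 or later for window functions) and not Teradata. The table name acct_hist and account 666, which has two consecutive B rows, are hypothetical; 666 stands for an account whose December row was created by some other change, such as an address update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE acct_hist "
    "(acct_id INTEGER, eff_dt TEXT, expr_dt TEXT, prod_cd TEXT, open_dt TEXT)"
)
conn.executemany(
    "INSERT INTO acct_hist VALUES (?, ?, ?, ?, ?)",
    [
        (111, "2012-05-01", "2013-06-01", "A", "2012-05-01"),
        (111, "2013-06-02", "2014-03-08", "A", "2012-05-01"),
        (111, "2014-03-09", "9999-12-31", "B", "2012-05-01"),
        (444, "2017-02-03", "2017-05-15", "A", "2017-02-03"),
        (444, "2017-05-16", "2017-12-02", "A", "2017-02-03"),
        (444, "2017-12-03", "9999-12-31", "B", "2017-02-03"),
        (555, "2017-12-12", "9999-12-31", "B", "2017-12-12"),
        # Hypothetical account: already B, new row for a non-product change.
        (666, "2017-01-10", "2017-12-19", "B", "2017-01-10"),
        (666, "2017-12-20", "9999-12-31", "B", "2017-01-10"),
    ],
)

date_filter = """
    prod_cd = 'B'
    AND expr_dt = '9999-12-31'
    AND eff_dt BETWEEN '2017-12-01' AND '2017-12-31'
    AND open_dt < '2017-12-01'
"""

# Original approach: counts any current B row that became effective in the month.
naive = conn.execute(
    f"SELECT COUNT(DISTINCT acct_id) FROM acct_hist WHERE {date_filter}"
).fetchone()[0]

# Previous-row lookup: additionally require the preceding row to be product A.
converted = [
    row[0]
    for row in conn.execute(
        f"""
        SELECT DISTINCT acct_id
        FROM (SELECT h.*,
                     MAX(prod_cd) OVER (PARTITION BY acct_id
                                        ORDER BY eff_dt
                                        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
                                       ) AS prev_prod_cd
              FROM acct_hist AS h) AS t
        WHERE prev_prod_cd = 'A' AND {date_filter}
        ORDER BY acct_id
        """
    )
]

print(naive, converted)
```

With this data the original filter counts two accounts (444 and 666), while the previous-row check keeps only 444, which is one plausible reason the original count looked too high.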

How to use sub query in group by clause sql server 2005

My table data is as follows:
FinishDate SpecialistName jobstate
----------------------- --------------- ---------
2012-10-01 00:00:00.000 Josh FINISHED
2012-10-01 00:00:00.000 Josh FINISHED
2012-10-01 00:00:00.000 Sam FINISHED
2012-10-01 00:00:00.000 Robin FINISHED
2012-10-01 00:00:00.000 Robin FINISHED
2012-10-01 00:00:00.000 Joy FINISHED
2012-10-01 00:00:00.000 Joy INCOMMING
2012-10-02 00:00:00.000 Joy FINISHED
My query is as follows:
select Count(*) [Count] from employee
where convert(varchar,FinishDate,112)>='20121001'
and convert(varchar,FinishDate,112) <='20121001'
and JobState='FINISHED'
group by SpecialistName
If a particular specialist finishes multiple jobs on the same day, I want that specialist to count as 1.
For example, if Robin, Josh and Sam finish 10 jobs between them on the same day, then 3 should be shown for that day.
The output would then look like this:
FinishDate Count
----------------------- ------
2012-10-01 00:00:00.000 3
2012-10-02 00:00:00.000 5
2012-10-03 00:00:00.000 15
Please guide me on how to change my SQL to get the desired result. Thanks.
Try something along these lines. The syntax may not be perfect (I wrote it "freehand"):
Select
TheDate
, Count(*) [Count]
From
(
select
convert(varchar,FinishDate,112) TheDate
, SpecialistName
from employee
where convert(varchar,FinishDate,112)>='20121001'
and convert(varchar,FinishDate,112) <='20121001'
and JobState='FINISHED'
group by
convert(varchar,FinishDate,112)
, SpecialistName
) t1
Group By
TheDate
It has to be two selects because the groupings that you want are different. If you did a single select grouping by FinishDate and SpecialistName what you would get would be a count of the distinct combinations of those two.
What you want is to get the distinct SpecialistNames that had at least one entry in a date. Distinct because you care that they had an entry, but not whether they had 1 or 3 or 17. This is done by the inner query.
Then you want to take these distinct SpecialistName with corresponding date and summarize them by FinishDate to get a count of specialists by date. This is done by the outer query.
Part of your comment mentions Distinct and you could in fact use Select Distinct instead of Group By in the inner query since we don’t need a count there. The outer query does require the Group By since you do need a count. My own bias is to use group by rather than distinct in case I need an aggregate function later, but that’s me. It would be perfectly OK to use Select Distinct if you prefer.
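The two-level grouping can be tried on the sample rows with Python's built-in sqlite3 module. This is a sketch, not SQL Server 2005: SQLite's date() replaces convert(varchar, FinishDate, 112), and the single-query COUNT(DISTINCT ...) variant is an alternative added here for comparison, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (FinishDate TEXT, SpecialistName TEXT, JobState TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("2012-10-01 00:00:00.000", "Josh", "FINISHED"),
        ("2012-10-01 00:00:00.000", "Josh", "FINISHED"),
        ("2012-10-01 00:00:00.000", "Sam", "FINISHED"),
        ("2012-10-01 00:00:00.000", "Robin", "FINISHED"),
        ("2012-10-01 00:00:00.000", "Robin", "FINISHED"),
        ("2012-10-01 00:00:00.000", "Joy", "FINISHED"),
        ("2012-10-01 00:00:00.000", "Joy", "INCOMMING"),
        ("2012-10-02 00:00:00.000", "Joy", "FINISHED"),
    ],
)

# Inner query: one row per (date, specialist). Outer query: count per date.
nested = conn.execute(
    """
    SELECT TheDate, COUNT(*) AS cnt
    FROM (SELECT date(FinishDate) AS TheDate, SpecialistName
          FROM employee
          WHERE JobState = 'FINISHED'
          GROUP BY date(FinishDate), SpecialistName) AS t1
    GROUP BY TheDate
    ORDER BY TheDate
    """
).fetchall()

# Alternative in a single query: count distinct specialists per date.
count_distinct = conn.execute(
    """
    SELECT date(FinishDate) AS TheDate, COUNT(DISTINCT SpecialistName) AS cnt
    FROM employee
    WHERE JobState = 'FINISHED'
    GROUP BY date(FinishDate)
    ORDER BY TheDate
    """
).fetchall()

print(nested)
print(count_distinct)
```

On the sample rows, four specialists (Josh, Sam, Robin and Joy) have a FINISHED job on 2012-10-01 and one on 2012-10-02, and both queries agree on that.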