Cannot understand the logic of this query - sql

This query is trying to get s1ppmp (the product price) for each s1ilie (size), each s1iref (reference), and s1ydat (the latest date) for that price, because one product can have different prices on different dates, for example a Black Friday price and a normal price for the other days.
anmoisjour comes from the calendar table, but there is no direct connection between the calendar table (calendrier) and the main table msk110, so I don't understand the logic of this query.
SELECT s1isoc,
       s1ilie,
       s1iref,
       s1ydat,
       anmoisjour,
       s1ppmp
FROM msk110
INNER JOIN (SELECT s1isoc AS isoc,
                   s1ilie AS ilie,
                   s1iref AS iref,
                   MAX(s1ydat) AS ydat,
                   anmoisjour
            FROM calendrier,
                 msk110
            WHERE s1ydat <= anmoisjour
              AND anmoisjour BETWEEN 20100101 AND 20302131
            GROUP BY s1isoc,
                     s1ilie,
                     s1iref,
                     anmoisjour) a ON s1isoc = isoc
                                  AND s1ilie = ilie
                                  AND s1iref = iref
                                  AND s1ydat = ydat
WHERE s1isoc = 1
  AND anmoisjour BETWEEN 20100101 AND 20302131
ORDER BY anmoisjour,
         s1ydat;
s1isoc, s1ilie, s1iref, s1ydat, and s1ppmp come from msk110, and anmoisjour belongs to the calendar table (calendrier), which is a date table.

I believe the confusion comes from the way the calendar table is joined.
If anmoisjour is the day column of the calendar table and this table holds one row per day, the filter anmoisjour BETWEEN 20100101 AND 20302131 restricts calendrier to one row per day from 2010 through 2030.
The product prices table msk110 is not linked to the calendar table calendrier by equal dates, but by an inequality (msk110.s1ydat <= calendrier.anmoisjour). This means that, for example, a row whose msk110.s1ydat is 2015-01-01 joins against every calendar row between 2015-01-01 and 2030-12-31.
The GROUP BY includes the calendar date (calendrier.anmoisjour), so MAX(s1ydat) is computed per calendar day. If a particular product and size has prices only on the dates 2015-01-01, 2017-01-01 and 2020-01-01, the result of the GROUP BY would be the following (ordered by calendar date; the NULL rows are shown only to illustrate the days before the first price, which the join actually drops):
MAX(s1ydat)   anmoisjour
null          2010-01-01
null          ...
null          2014-12-31
2015-01-01    2015-01-01
2015-01-01    2015-01-02
2015-01-01    ...
2015-01-01    2016-01-01
2015-01-01    ...
2017-01-01    2017-01-01
2017-01-01    2017-01-02
2017-01-01    ...
2017-01-01    2019-12-31
2020-01-01    2020-01-01
2020-01-01    2025-01-01
2020-01-01    ...
What your query shows, for every calendar day from 2010 through 2030, is the row of the product table with the most recent price date on or before that day, that is, the price in effect on that day. It is also restricted to s1isoc = 1 (I don't know what that column means).
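To see the effect of the inequality join on a small scale, here is a minimal sketch that runs the same query shape with Python's built-in sqlite3 module instead of the original database. The three prices and the handful of calendar days are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE msk110 (s1isoc INT, s1ilie INT, s1iref INT, s1ydat INT, s1ppmp REAL);
CREATE TABLE calendrier (anmoisjour INT);

-- Hypothetical data: one product/size with three price changes.
INSERT INTO msk110 VALUES
  (1, 10, 500, 20150101, 9.99),
  (1, 10, 500, 20170101, 7.49),
  (1, 10, 500, 20200101, 8.99);

-- Hypothetical calendar: a few sample days instead of one row per day.
INSERT INTO calendrier VALUES
  (20141231), (20150101), (20160615), (20170101),
  (20191231), (20200101), (20250101);
""")

query = """
SELECT p.s1ydat, a.anmoisjour, p.s1ppmp
FROM msk110 p
INNER JOIN (
    -- For each calendar day, the latest price date on or before that day.
    SELECT s1isoc AS isoc, s1ilie AS ilie, s1iref AS iref,
           MAX(s1ydat) AS ydat, anmoisjour
    FROM calendrier, msk110
    WHERE s1ydat <= anmoisjour
    GROUP BY s1isoc, s1ilie, s1iref, anmoisjour
) a ON p.s1isoc = a.isoc
   AND p.s1ilie = a.ilie
   AND p.s1iref = a.iref
   AND p.s1ydat = a.ydat
WHERE p.s1isoc = 1
ORDER BY a.anmoisjour, p.s1ydat
"""

rows = conn.execute(query).fetchall()
for price_date, calendar_day, price in rows:
    print(calendar_day, "-> price from", price_date, "=", price)
```

The calendar day 20141231 does not appear in the output because no price date is on or before it; every later day is paired with the price that was in effect on that day.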

Related

Subsetting on dates for a SQL query

Using Snowflake, I am attempting to select customers that have no current subscriptions, eliminating all IDs that have current/active contracts.
Each ID typically has multiple records that form its contract/renewal history.
A customer is inactive only if no contract end date goes beyond the current date. There are likely multiple past contracts that have lapsed, but the account is still active if one of its contract end dates goes beyond the current date.
Consider the following table:
Date_Start  Date_End    Name    ID
2015-07-03  2019-07-03  Piggly  001
2019-07-04  2025-07-04  Piggly  001
2013-10-01  2017-12-31  Doggy   031
2018-01-01  2018-06-30  Doggy   031
2020-01-01  2021-03-14  Catty   022
2021-03-15  2024-06-01  Catty   022
1999-06-01  2021-06-01  Horsey  052
2021-06-02  2022-01-01  Horsey  052
2022-01-02  2022-07-04  Horsey  052
The desired output is the non-active customers, meaning those with no end date beyond Jan 5th 2023 (or the current date, or any arbitrary date):
Name    ID
Doggy   031
Horsey  052
My first attempt was:
SELECT Name, ID
FROM table
WHERE Date_End < GETDATE()
but the obvious problem is that I'll also be selecting past contracts of customers who haven't expired/churned and who have a contract that goes beyond the current date.
How do I resolve this?
As there are many rows per name and ID, you should aggregate the data and then use a HAVING clause to select only those you are interested in.
SELECT name, id
FROM table
GROUP BY name, id
HAVING MAX(date_end) < GETDATE();
You can work it out with an EXCEPT operator, if your DBMS supports it:
SELECT DISTINCT Name, ID FROM tab
EXCEPT
SELECT DISTINCT Name, ID FROM tab WHERE Date_end > <your_date>
This removes the active <Name, ID> pairs from the full set.
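Both suggestions can be checked against the sample data from the question. The following is a rough sketch using Python's built-in sqlite3 module, not Snowflake; the table name contracts is made up for the demo, and a fixed cut-off date of 2023-01-05 is passed as a parameter in place of GETDATE().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "contracts" is a hypothetical table name; the rows are the question's sample data.
conn.execute(
    "CREATE TABLE contracts (date_start TEXT, date_end TEXT, name TEXT, id TEXT)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [
        ("2015-07-03", "2019-07-03", "Piggly", "001"),
        ("2019-07-04", "2025-07-04", "Piggly", "001"),
        ("2013-10-01", "2017-12-31", "Doggy", "031"),
        ("2018-01-01", "2018-06-30", "Doggy", "031"),
        ("2020-01-01", "2021-03-14", "Catty", "022"),
        ("2021-03-15", "2024-06-01", "Catty", "022"),
        ("1999-06-01", "2021-06-01", "Horsey", "052"),
        ("2021-06-02", "2022-01-01", "Horsey", "052"),
        ("2022-01-02", "2022-07-04", "Horsey", "052"),
    ],
)

as_of = "2023-01-05"  # stands in for GETDATE()

# Aggregate per customer and keep those whose latest end date has passed.
having_sql = """
SELECT name, id
FROM contracts
GROUP BY name, id
HAVING MAX(date_end) < ?
ORDER BY name
"""

# Start from all customers and remove those with any contract still running.
except_sql = """
SELECT name, id FROM contracts
EXCEPT
SELECT name, id FROM contracts WHERE date_end > ?
ORDER BY name
"""

inactive_having = conn.execute(having_sql, (as_of,)).fetchall()
inactive_except = conn.execute(except_sql, (as_of,)).fetchall()
print(inactive_having)
print(inactive_except)
```

The two forms differ only for a contract that ends exactly on the cut-off date: the HAVING version treats that customer as still active, while the EXCEPT version as written treats them as inactive.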

Counting subscriber numbers given events on SQL

I have a dataset on mysql in the following format, showing the history of events given some client IDs:
Base Data
Text of the dataset (subscriber_table):
user_id type created_at
A past_due 2021-03-27 10:15:56
A reactivate 2021-02-06 10:21:35
A past_due 2021-01-27 10:30:41
A new 2020-10-28 18:53:07
A cancel 2020-07-22 9:48:54
A reactivate 2020-07-22 9:48:53
A cancel 2020-07-15 2:53:05
A new 2020-06-20 20:24:18
B reactivate 2020-06-14 10:57:50
B past_due 2020-06-14 10:33:21
B new 2020-06-11 10:21:24
date_table:
full_date
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2021-01-01
2021-02-01
2021-03-01
I have been struggling to come up with a query that counts subscribers for each month in a range. Some months do not necessarily appear in the event table, either because the client is still subscribed or because they cancelled and later resubscribed. The output I am looking for is this:
Output
date subscriber_count
2020-05-01 0
2020-06-01 2
2020-07-01 2
2020-08-01 1
2020-09-01 1
2020-10-01 2
2020-11-01 2
2020-12-01 2
2021-01-01 2
2021-02-01 2
2021-03-01 2
Reactivation and Past Due events do not change the subscription status of the client; only the Cancel and New events do. If the client cancels in a month, they should still be counted as active for that month.
My initial approach was to get the latest entry given a month per subscriber ID and then join them to the premade date table, but when I have months missing I am unsure on how to fill them with the correct status. Maybe a lag function?
with last_record_per_month as (
select
date_trunc('month', created_at)::date order by created_at) as month_year ,
user_id ,
type,
created_at as created_at
from
subscriber_table
where
user_id in ('A', 'B')
order by
created_at desc
), final as (
select
month_year,
created_at,
type
from
last_record_per_month lrpm
right join (
select
date_trunc('month', full_date)::date as month_year
from
date_table
where
full_date between '2020-05-01' and '2021-03-31'
group by
1
order by
1
) dd
on lrpm.created_at = dd.month_year
and num = 1
order by
month_year
)
select
*
from
final
I do have a premade base table with every single date in many years to use as a joining table
Any help with this is GREATLY appreciated
Thanks!
The approach here is to take the subscriber rows of type 'new' as the base and map them to the 'cancel' rows using a self join. Then take the date table as the base and count the distinct users that are active in each month to get the result.
SELECT full_date, COUNT(DISTINCT user_id) FROM date_tbl
LEFT JOIN(
SELECT new.user_id,new.type,new.created_at created_at_new,
IFNULL(cancel.created_at,CURRENT_DATE) created_at_cancel
FROM subscriber new
LEFT JOIN subscriber cancel
ON new.user_id=cancel.user_id
AND new.type='new' AND cancel.type='cancel'
AND new.created_at<= cancel.created_at
WHERE new.type IN('new'))s
ON DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m')
AND DATE_FORMAT(s.created_at_cancel, '%Y-%m')>=DATE_FORMAT(full_date, '%Y-%m')
GROUP BY 1
Let me break down some sections.
First, the subscriber table is self-joined on user_id, with the left side restricted to 'new' rows and the right side to 'cancel' rows: new.type='new' AND cancel.type='cancel'
A 'new' event should always precede its cancellation, hence the condition new.created_at<= cancel.created_at
Since we only care about the 'new' rows in the base table, the WHERE clause keeps only those: new.type IN('new'). The subquery returns one row per matching pair of 'new' and 'cancel' events, with CURRENT_DATE standing in for the cancel date when there is no later cancellation.
We then LEFT JOIN the date table to this subquery so that the year and month of created_at_new is less than or equal to that of full_date, DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m'), and the year and month of created_at_cancel is greater than or equal to it.
Lastly, we group by full_date and count the distinct users.
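The steps above can be tried on the question's sample data. This is a sketch under a few assumptions: it uses Python's built-in sqlite3 module rather than MySQL, so DATE_FORMAT(x, '%Y-%m') is replaced by substr(x, 1, 7); a fixed as_of date stands in for CURRENT_DATE so the result is reproducible; the single-digit hours from the sample are zero-padded so the timestamps compare correctly as text; and the aliases n and c replace new and cancel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriber (user_id TEXT, type TEXT, created_at TEXT);
CREATE TABLE date_tbl (full_date TEXT);
""")
conn.executemany(
    "INSERT INTO subscriber VALUES (?, ?, ?)",
    [
        ("A", "past_due", "2021-03-27 10:15:56"),
        ("A", "reactivate", "2021-02-06 10:21:35"),
        ("A", "past_due", "2021-01-27 10:30:41"),
        ("A", "new", "2020-10-28 18:53:07"),
        ("A", "cancel", "2020-07-22 09:48:54"),
        ("A", "reactivate", "2020-07-22 09:48:53"),
        ("A", "cancel", "2020-07-15 02:53:05"),
        ("A", "new", "2020-06-20 20:24:18"),
        ("B", "reactivate", "2020-06-14 10:57:50"),
        ("B", "past_due", "2020-06-14 10:33:21"),
        ("B", "new", "2020-06-11 10:21:24"),
    ],
)
months = [
    "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-09-01",
    "2020-10-01", "2020-11-01", "2020-12-01", "2021-01-01", "2021-02-01",
    "2021-03-01",
]
conn.executemany("INSERT INTO date_tbl VALUES (?)", [(m,) for m in months])

query = """
SELECT d.full_date, COUNT(DISTINCT s.user_id) AS subscriber_count
FROM date_tbl d
LEFT JOIN (
    -- One row per ('new', later 'cancel') pair; open subscriptions end at :as_of.
    SELECT n.user_id,
           n.created_at AS created_at_new,
           IFNULL(c.created_at, :as_of) AS created_at_cancel
    FROM subscriber n
    LEFT JOIN subscriber c
           ON n.user_id = c.user_id
          AND c.type = 'cancel'
          AND n.created_at <= c.created_at
    WHERE n.type = 'new'
) s
  ON substr(s.created_at_new, 1, 7) <= substr(d.full_date, 1, 7)
 AND substr(s.created_at_cancel, 1, 7) >= substr(d.full_date, 1, 7)
GROUP BY d.full_date
ORDER BY d.full_date
"""

rows = conn.execute(query, {"as_of": "2021-03-31"}).fetchall()
for full_date, subscriber_count in rows:
    print(full_date, subscriber_count)
```

Comparing on the year-month prefix is what keeps a client counted as active for the whole month in which they cancel.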

Using Distinct and MAX(date) in a large data

I have a table that stores the list of users who have accessed a product(with the accessed date).
I have written the below query to get the list of users who have accessed the product B between '2021-02-01' and '2021-02-26'.
SELECT DISTINCT UserName,Country,ADate,Product FROM Report WHERE UserName != '-' and Product='B' and (CAST(ADate AS DATE) BETWEEN #startdate AND #enddate)
then it gives the below result:
UserName Country ADate Product
-------- ------ -------- ---------
asson IN 2021-02-10 00:00:00.000 B
rajan US 2021-02-23 00:00:00.000 B
rajan US 2021-02-25 00:00:00.000 B
moody US 2021-02-14 00:00:00.000 B
rajon US 2021-02-01 00:00:00.000 B
lukman US 2021-02-10 00:00:00.000 B
Since the user rajan accessed the product on 2 different days, the result shows 2 entries for rajan even though I added DISTINCT. So I modified the query as below:
SELECT DISTINCT UserName,Country,max(ADate),Product FROM Report WHERE UserName != '-' and Product='B' and (CAST(ADate AS DATE) BETWEEN #startdate AND #enddate) group by UserName,Country,Product
This query gives me the required result. But the problem I am facing now is that when I select a date range longer than a month (say, data spanning 2 months), some data is missing from the output. I believe it might be due to the MAX(ADate). Can anyone suggest a good way to get rid of this issue?
This will give you the latest access date of each user by month:
SELECT DISTINCT UserName,Country, month(ADate) as month, max(ADate),Product FROM Report WHERE UserName != '-' and Product='B' group by UserName, Country, month(ADate), Product
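As a rough illustration of the per-month grouping, the sketch below uses Python's built-in sqlite3 module, so MONTH(ADate) is replaced by strftime('%Y-%m', ADate). Grouping on year and month together is a deliberate change from the query above, so that the same month in different years is not merged. The March row for rajan is invented to show a range that spans two months; the other rows are the question's sample output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Report (UserName TEXT, Country TEXT, ADate TEXT, Product TEXT)"
)
conn.executemany(
    "INSERT INTO Report VALUES (?, ?, ?, ?)",
    [
        ("asson", "IN", "2021-02-10 00:00:00.000", "B"),
        ("rajan", "US", "2021-02-23 00:00:00.000", "B"),
        ("rajan", "US", "2021-02-25 00:00:00.000", "B"),
        ("moody", "US", "2021-02-14 00:00:00.000", "B"),
        ("rajon", "US", "2021-02-01 00:00:00.000", "B"),
        ("lukman", "US", "2021-02-10 00:00:00.000", "B"),
        # Hypothetical extra row so the data covers more than one month.
        ("rajan", "US", "2021-03-04 00:00:00.000", "B"),
    ],
)

query = """
SELECT UserName,
       Country,
       strftime('%Y-%m', ADate) AS year_month,
       MAX(ADate) AS last_access,
       Product
FROM Report
WHERE UserName != '-' AND Product = 'B'
GROUP BY UserName, Country, strftime('%Y-%m', ADate), Product
ORDER BY UserName, year_month
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With this grouping rajan appears once per month, each time with the latest access date within that month, instead of once for the whole range.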

How to average values in one table based on the condition involving another table in SQL?

I have two tables. One defines time intervals (beginning and end). Time intervals are not equal in length. Another contains product ID, start and end date of the product.
TableOne:
Interval StartDateTime EndDateTime
202020201 2020-01-01 00:00:00 2020-02-10 00:00:00
202020202 2020-02-10 00:00:00 2020-02-20 00:00:00
TableTwo
ProductID ProductStartDateTime ProductEndDateTime
ASSDWE1 2018-01-04 00:12:00 2020-04-10 20:00:30
ADFGHER 2020-01-05 00:11:30 2020-01-19 00:00:00
ASDFVBN 2017-10-10 00:12:10 2020-02-23 00:23:23
I need to compute the average length of the products from TableTwo that existed during the time intervals defined in TableOne. If the product existed throughout the time interval from TableOne, then the length of the product during this time interval is defined as its length from its start date until the end of the time interval.
I tried the following
select
a.*,
(select
AVG(datediff(day, b.ProductStartDateTime, IIF (b.ProductEndDateTime> a.EndDateTime, a.EndDateTime
,b.ProductEndDateTime))) --compute average length of the products
FROM #TableTwo b
WHERE ( not (b.ProductEndDateTime <= a.StartDateTime ) and not (b.ProductStartDateTime >= a.EndDateTime) )
-- select products that existed during interval from #TableOne
) as AverageProductLength
from #TableOne a
I get the error "Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression."
The result I want:
Interval StartDateTime EndDateTime AverageProductLength
202020201 2020-01-01 00:00:00 2020-02-10 00:00:00 23
202020202 2020-02-10 00:00:00 2020-02-20 00:00:00 34.5
Is there a way I can do the averaging?

Teradata SQL: Determine how many accounts had status change in given month

Ok, so I have a table that looks something like this:
Acct_id Eff_dt Expr_dt Prod_cd Open_dt
-------------------------------------------------------
111 2012-05-01 2013-06-01 A 2012-05-01
111 2013-06-02 2014-03-08 A 2012-05-01
111 2014-03-09 9999-12-31 B 2012-05-01
222 2015-07-15 2015-11-11 A 2015-07-15
222 2015-11-12 2016-08-08 B 2015-07-15
222 2016-08-09 9999-12-31 A 2015-07-15
333 2016-01-01 2016-04-15 B 2016-01-01
333 2016-04-16 2016-08-08 B 2016-01-01
333 2016-08-09 9999-12-31 A 2016-01-01
444 2017-02-03 2017-05-15 A 2017-02-03
444 2017-05-16 2017-12-02 A 2017-02-03
444 2017-12-03 9999-12-31 B 2017-02-03
555 2017-12-12 9999-12-31 B 2017-12-12
There are many more columns that I'm not including as they're otherwise not relevant.
What I'm trying to determine is how many accounts had a change in Prod_cd in a given month, but then only in one direction (so from A > B in this example). Sometimes however an account was first opened as B, and then later changed to A. Or it was opened as A, changed to B, and moved back to A. I only want to know the current set of accounts where in a given month the Prod_cd changed from A to B.
Eff_dt is the date when a change was made to an account (could be any change, such as address change, name change, or what I'm looking for, product code change).
Expr_dt is the expiration date of that row, essentially the last day before a new change was made. When the date of that row is 9999-12-31, that's the most current row.
Open_dt is the date the account was created.
I created a query at first that was something like this:
select
count(distinct acct_id)
from table
where prod_cd = 'B'
and expr_dt = '9999-12-31'
and eff_dt between '2017-12-01' and '2017-12-31'
and open_dt < '2017-12-01'
But it's giving me results that don't look right. I want to specifically track the # of conversions that happened, but the count of accounts I'm getting seems way too high.
There is probably a way to create a more reliable query using window functions, but given that the Prod_cd changes can happen in multiple directions, I'm not sure how to write that query. Any help would be appreciated!
If you are specifically looking for the switch A --> B, then the simplest method is to use lag(). But, Teradata requires a slightly different formulation:
select count(distinct acct_id)
from (select t.*,
max(prod_cd) over (partition by acct_id order by eff_dt rows between 1 preceding and 1 preceding) as prev_prod_cd
from t
) t
where prod_cd = 'B' and prev_prod_cd = 'A' and
expr_dt = '9999-12-31' and
eff_dt between '2017-12-01' and '2017-12-31' and
open_dt < '2017-12-01';
I am guessing that the date conditions go in the outer query, meaning that the lag() does not use them.
Similar to Gordon's answer, but using a supported window function (instead of LAG) and using Teradata's QUALIFY clause to do the lag-gy lookup:
SELECT DISTINCT acct_id
FROM mytable
QUALIFY
MAX(prod_cd) OVER (PARTITION BY acct_id ORDER BY eff_dt ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) = 'A'
AND prod_cd = 'B'
AND expr_dt = '9999-12-31'
AND eff_dt between DATE '2013-01-01' and DATE '2017-12-31'
AND open_dt < DATE '2017-12-01'
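The "previous row" window can be tried on the sample rows from the question. This is a sketch with Python's built-in sqlite3 module rather than Teradata, so QUALIFY is not available and the window value is computed in a subquery, as in the first answer. It assumes a bundled SQLite of version 3.25 or later (needed for window functions); the table name accounts is made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "accounts" is a hypothetical table name; the rows are the question's sample data.
conn.execute(
    "CREATE TABLE accounts "
    "(acct_id INT, eff_dt TEXT, expr_dt TEXT, prod_cd TEXT, open_dt TEXT)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?, ?)",
    [
        (111, "2012-05-01", "2013-06-01", "A", "2012-05-01"),
        (111, "2013-06-02", "2014-03-08", "A", "2012-05-01"),
        (111, "2014-03-09", "9999-12-31", "B", "2012-05-01"),
        (222, "2015-07-15", "2015-11-11", "A", "2015-07-15"),
        (222, "2015-11-12", "2016-08-08", "B", "2015-07-15"),
        (222, "2016-08-09", "9999-12-31", "A", "2015-07-15"),
        (333, "2016-01-01", "2016-04-15", "B", "2016-01-01"),
        (333, "2016-04-16", "2016-08-08", "B", "2016-01-01"),
        (333, "2016-08-09", "9999-12-31", "A", "2016-01-01"),
        (444, "2017-02-03", "2017-05-15", "A", "2017-02-03"),
        (444, "2017-05-16", "2017-12-02", "A", "2017-02-03"),
        (444, "2017-12-03", "9999-12-31", "B", "2017-02-03"),
        (555, "2017-12-12", "9999-12-31", "B", "2017-12-12"),
    ],
)

query = """
SELECT DISTINCT acct_id
FROM (
    SELECT a.*,
           -- A one-row frame on the preceding row acts like LAG(prod_cd).
           MAX(prod_cd) OVER (
               PARTITION BY acct_id
               ORDER BY eff_dt
               ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
           ) AS prev_prod_cd
    FROM accounts a
) t
WHERE prod_cd = 'B'
  AND prev_prod_cd = 'A'
  AND expr_dt = '9999-12-31'
  AND eff_dt BETWEEN '2017-12-01' AND '2017-12-31'
  AND open_dt < '2017-12-01'
"""

converted = conn.execute(query).fetchall()
print(converted)
```

Account 555 is excluded because it was opened as B and has no previous row, so prev_prod_cd is NULL; account 111 did switch from A to B, but in 2014, outside the December 2017 window.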