Creating an indicator based on the same ID in different years in PROC SQL/SQL

I'd like to create an indicator based on the ID and product type. My data:
Year ID Purchase_Category
2020 1 Kitchen
2020 2 Home
2020 2 Kitchen
2020 3 Home
2021 1 Home
2021 2 Kitchen
2021 3 Kitchen
If someone with the same ID purchased Kitchen in 2020 and then Home in 2021 or vice versa, then they are deemed holistic. ID 2 in this case is not holistic because Home and Kitchen were purchased in the same year.
The output should look like this:
ID Indicator
1 Holistic
2 Not Holistic
3 Holistic

Something like this might work:
SELECT ID, CASE COUNT(*) WHEN 1 THEN 'Not Holistic' ELSE 'Holistic' END AS Indicator
FROM (SELECT ID, Year FROM DATA GROUP BY ID, Year) years_per_id
GROUP BY ID
First, determine the distinct years per ID; then, from that set, if an ID appears only once, everything was purchased in the same year; otherwise products were purchased in different years. (Note the derived table needs an alias in most databases.)

You just need a distinct count on the Year column per ID. No need for two steps.
select ID,
case when count(distinct "Year") > 1
then 'Holistic' else 'Not Holistic' end as Indicator
from T
group by ID
It would be just as easy to say:
case when max("Year") > min("Year") then ...
I don't know which one seems more natural. If you have a lot of data the second approach is potentially faster.
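As a runnable check, here is an in-memory SQLite sketch (the table name T follows the answer above). Note that on this sample data count(DISTINCT Year) alone would actually flag ID 2 as holistic, since ID 2 bought in both 2020 and 2021; matching the desired output also requires that no single year mixes both categories, which the extra cats_in_year condition below adds:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (Year INT, ID INT, Purchase_Category TEXT)")
con.executemany("INSERT INTO T VALUES (?, ?, ?)", [
    (2020, 1, "Kitchen"), (2020, 2, "Home"), (2020, 2, "Kitchen"),
    (2020, 3, "Home"),    (2021, 1, "Home"), (2021, 2, "Kitchen"),
    (2021, 3, "Kitchen"),
])

# Holistic = purchases span more than one year AND no single year
# mixes both categories (the second test is what rules out ID 2).
rows = con.execute("""
    SELECT ID,
           CASE WHEN COUNT(*) > 1 AND MAX(cats_in_year) = 1
                THEN 'Holistic' ELSE 'Not Holistic' END AS Indicator
    FROM (SELECT ID, Year,
                 COUNT(DISTINCT Purchase_Category) AS cats_in_year
          FROM T
          GROUP BY ID, Year) y
    GROUP BY ID
    ORDER BY ID
""").fetchall()
print(rows)
```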

Related

SQL - Monthly cumulative count of new customer based on a created date field

I have Customer records that look like this:
Customer_Number Create_Date
---------------------------
34343           01/22/2001
54554           03/03/2020
85296           01/01/2001
...
I have about a thousand of these records (customer number is unique) and the bossman wants to see how the number of customers has grown over time.
The output I need:
Customer_Count Monthly_Bucket
-----------------------------
7              01/01/2021
9              02/01/2021
13             03/01/2021
20             04/01/2021
The customer count is cumulative, and the Monthly_Bucket will just feed the graphing package to make a nice bar chart answering the question "how many customers do we have in total in a particular month, and how is it growing over time?"
Try the following SELECT with a correlated sub-query (note that it compares MONTH values only, so it is restricted to a single year):
SELECT DISTINCT
       Customer_Count =
       (
         SELECT COUNT(s.[Create_Date])
         FROM [Customer_Sales] s
         WHERE YEAR(s.[Create_Date]) = YEAR(t.[Create_Date])
           AND MONTH(s.[Create_Date]) <= MONTH(t.[Create_Date])
       ),
       Monthly_Bucket = MONTH(t.[Create_Date])
FROM Customer_Sales t
WHERE YEAR(t.[Create_Date]) = ????
Where [Customer_Sales] is the sales table and ???? is your year.
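A windowed alternative avoids the per-row sub-query and spans years. Here is an in-memory SQLite sketch (table and column names follow the question; dates are stored as ISO strings for illustration, unlike the MM/DD/YYYY in the sample): group customers into month buckets, then take a running sum of the per-month counts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer_Sales (Customer_Number INT, Create_Date TEXT)")
con.executemany("INSERT INTO Customer_Sales VALUES (?, ?)", [
    (34343, "2021-01-22"), (85296, "2021-01-01"),
    (54554, "2021-02-03"), (11111, "2021-03-15"),
])

# Count new customers per month bucket, then turn that into a running total.
rows = con.execute("""
    SELECT Monthly_Bucket,
           SUM(New_Customers) OVER (ORDER BY Monthly_Bucket) AS Customer_Count
    FROM (SELECT strftime('%Y-%m', Create_Date) AS Monthly_Bucket,
                 COUNT(*) AS New_Customers
          FROM Customer_Sales
          GROUP BY 1)
""").fetchall()
print(rows)
```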

Finding a lagged cumulative total from a table containing both cumulative and delta entries

I have a SQL table with a schema where a value is either a cumulative value for a particular category, or a delta on top of the previous value. While I appreciate this is not a particularly great design, it comes from an external source and thus I can't change it in any way.
The table looks something like the following:
Date Category AmountSoldType AmountSold
-----------------------------------------------------
Jan 1 Apples Cumulative 100
Jan 1 Bananas Cumulative 50
Jan 2 Apples Delta 20
Jan 2 Bananas Delta 10
Jan 3 Apples Delta 25
Jan 3 Bananas Cumulative 75
For this example, I want to produce the total cumulative number of fruits sold by item at the beginning of each day:
Date Category AmountSold
--------------------------------
Jan 1 Apples 0
Jan 1 Bananas 0
Jan 2 Apples 100
Jan 2 Bananas 50
Jan 3 Apples 170
Jan 3 Bananas 60
Jan 4 Apples 195
Jan 4 Bananas 75
Intuitively, I want to take the most recent cumulative total, and add any deltas that have appeared since that entry.
I imagine something akin to
SELECT Date, Category
LEAD((subquery??), 1) OVER (PARTITION BY Category ORDER BY Date) AS Amt
FROM Fruits
GROUP BY Date, Category
ORDER BY Date ASC
is what I want, but I'm having trouble putting the right subquery together. Any suggestions?
You seem to want to add the deltas to the most recent cumulative -- all before the current date.
If so, I think this logic does what you want:
select f.*,
       (max(case when date = date_cumulative then amountsold end) over
            (partition by category order by date) +
        coalesce(sum(case when date > date_cumulative then amountsold end) over
            (partition by category order by date
             rows between unbounded preceding and 1 preceding), 0)
       ) as amt
from (select f.*,
             max(case when AmountSoldType = 'Cumulative' then date end) over
                 (partition by category order by date
                  rows between unbounded preceding and current row) as date_cumulative
      from fruits f
     ) f
I'm a bit confused by this data set (notwithstanding the mistake in adding up the apples). I assume the raw data states end-of-day figures, so for example 20 apples were sold on Jan 2 (because there is a delta of 20 reported for that day).
In your example results, it does not appear valid to say that zero apples were sold on Jan 1. It isn't actually possible to say how many were sold on that day, because it is not clear whether the 100 cumulative apples were accrued during Jan 1 (and thus should be excluded from the start-of-day figure you seek) or whether they were accrued on previous days (and should be included), or some mix of the two. That day's data should thus be null.
It is also not clear whether all data sets must begin with a cumulative, or whether a data set can begin with a delta (which might require working backwards from a subsequent cumulative). Nor is it clear whether you potentially have access to multiple data sets from your external source forming one continuous, consistent sequence, or whether "cumulatives" relate purely to a single data set received. I'm going to assume at least that all data sets begin with a cumulative.
All that said, this problem is a simple case of firstly converting all rows into either all deltas, or all cumulatives. Assuming we go for all cumulatives, then recursing through each row in order, it is a case of either selecting the AmountSold as-is (if the row is a cumulative), or adding the AmountSold to the result of the previous step (if it is a delta).
Once pre-processed like this, then for a start-of-day cumulative, it is all just a question of looking at the previous day's cumulative (which was an end-of-day cumulative, if my initial assumption was correct that all raw data relates to end-of-day figures).
Using the LAG function in this final step to get the previous day's cumulative, will also neatly produce a null for the first row.
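The two-pass approach described above can be sketched in plain Python (using the corrected apple arithmetic: the Jan 3 end-of-day cumulative for apples is 145, so the Jan 4 start-of-day figure is 145, not 195):

```python
from collections import defaultdict

# Rows as (date, category, amount_sold_type, amount_sold), ordered by date.
rows = [
    (1, "Apples",  "Cumulative", 100),
    (1, "Bananas", "Cumulative", 50),
    (2, "Apples",  "Delta",      20),
    (2, "Bananas", "Delta",      10),
    (3, "Apples",  "Delta",      25),
    (3, "Bananas", "Cumulative", 75),
]

# Pass 1: convert every row to an end-of-day cumulative per category.
running = {}
end_of_day = defaultdict(dict)          # category -> {date: cumulative}
for date, cat, kind, amount in rows:
    running[cat] = amount if kind == "Cumulative" else running[cat] + amount
    end_of_day[cat][date] = running[cat]

# Pass 2: the start-of-day figure is the previous day's end-of-day cumulative
# (None for the first day, just as the LAG step in the answer would produce).
start_of_day = {}
for cat, by_date in end_of_day.items():
    dates = sorted(by_date)
    prev = None
    for d in dates + [dates[-1] + 1]:
        start_of_day[(d, cat)] = prev
        prev = by_date.get(d, prev)
print(start_of_day)
```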

Count Distinct values in one column based on other columns

I have a table that looks like the following:
app_id supplier_reached creation_date platform
10001 1 9/11/2018 iOS
10001 2 9/18/2018 iOS
10002 1 5/16/2018 android
10003 1 5/6/2018 android
10004 1 10/1/2018 android
10004 1 2/3/2018 android
10004 2 2/2/2018 web
10005 4 1/5/2018 web
10005 2 5/1/2018 android
10006 3 10/1/2018 iOS
10005 4 1/1/2018 iOS
The objective is to find the unique number of app_id submitted per month.
If I just do a count(distinct app_id) I will get the following results:
Group by month   count(app number)
Jan              1
Feb              1
May              3
September        1
October          2
However, an application is considered unique based on a combination of other fields as well. For example, for the month of January, the app_id is the same however a combination of app_id, supplier_reached and platform show different values and hence the app_id should be counted twice.
Following the same pattern, the desired result should be:
Group by month   Desired answer
Jan              2
Feb              2
May              3
September        2
October          2
Lastly, there can be many other columns in the table which may or may not contribute to the uniqueness of an application.
Is there a way to do this type of count in SQL?
I am using Redshift.
As pointed out above, in Redshift count(distinct ...) does not work with multiple fields.
You can first group by the columns that you want to be unique and then count the records, like this:
select month, count(1) as app_number
from (
  select month, app_id, supplier_reached, platform
  from your_table
  group by 1, 2, 3, 4
) t
group by 1
I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:
count(distinct app_id || ':' || supplier_reached || ':' || platform)
Your objective is worded incorrectly.
You don't want
to find the unique number of app_id submitted per month
you want
to find the unique number of app_id + supplier_reached + platform combinations submitted per month.
And so you need either a) a combination of columns, like count(distinct col1||col2||col3), or b)
select t1.month, count(*)
from (select distinct
        app_id,
        supplier_reached,
        platform,
        month
      from sometable) t1
group by t1.month
Actually, you can count distinct ROW values conveniently in Postgres:
SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM tbl
GROUP BY 1;
The ROW keyword would be just noise here:
count(DISTINCT ROW(app_id, supplier_reached, platform))
I would discourage concatenating columns for the purpose. This is comparatively expensive, error prone (think of distinct data types and locale-dependent text representation) and introduces corner-case errors if the used separator can be contained in column values.
Alas, not supported by Redshift, whose documentation lists among unsupported PostgreSQL features:
...
Value expressions:
  Subscripted expressions
  Array constructors
  Row constructors
...
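SQLite has the same limitation as Redshift here (no multi-column COUNT(DISTINCT)), so it can stand in for a runnable sketch of the group-then-count workaround on the question's data (dates stored as ISO strings for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE apps
               (app_id INT, supplier_reached INT, creation_date TEXT, platform TEXT)""")
con.executemany("INSERT INTO apps VALUES (?, ?, ?, ?)", [
    (10001, 1, "2018-09-11", "iOS"),     (10001, 2, "2018-09-18", "iOS"),
    (10002, 1, "2018-05-16", "android"), (10003, 1, "2018-05-06", "android"),
    (10004, 1, "2018-10-01", "android"), (10004, 1, "2018-02-03", "android"),
    (10004, 2, "2018-02-02", "web"),     (10005, 4, "2018-01-05", "web"),
    (10005, 2, "2018-05-01", "android"), (10006, 3, "2018-10-01", "iOS"),
    (10005, 4, "2018-01-01", "iOS"),
])

# Group by the columns that define uniqueness, then count rows per month.
rows = con.execute("""
    SELECT month, COUNT(*) AS app_number
    FROM (SELECT strftime('%m', creation_date) AS month,
                 app_id, supplier_reached, platform
          FROM apps
          GROUP BY 1, 2, 3, 4) t
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)
```

This reproduces the desired answer: 2 for January, 2 for February, 3 for May, 2 for September, 2 for October.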

SQL/HQL - Partition By

Trying to understand Partition By and getting super confused, I have the following data:
Name ID Product Date Amount
Jason 1 Car Jan 2017 $10
Jason 1 Car Feb 2017 $5
Jason 2 Car Jan 2017 $50
Jason 2 Car Feb 2017 $60
Jason 3 House Jan 2017 $20
Jason 3 House Feb 2017 $30
Would doing:
Select Name, ID, Product, Date, Amount,
       LAG(Amount, 1) OVER (PARTITION BY Name ORDER BY Date)
FROM table
give me Jason's correct previous month amount for the appropriate Product and ID number? So, for example at Feb 2017: Jason, ID 1 and Product Car's should give me the amount $5.
OR
Would I need to modify the PARTITION BY to include the Product and ID, such as:
Select Name, ID, Product, Date, Amount,
       LAG(Amount, 1) OVER (PARTITION BY Name, ID, Product ORDER BY Date)
FROM table
Thanks!
I myself also came here in search of some understanding of the "partition by" clause. But to answer your question: LAG returns the previous row's value within each partition, so with PARTITION BY Name alone the previous row could belong to a different ID or Product (every sample row has the name Jason). To get the previous month's amount for the appropriate Product and ID, you do need the second form, with Name, ID, and Product all in the PARTITION BY clause.
Essentially, you would then have your existing five columns, plus one containing the previous row's Amount within each Name/ID/Product group.
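An in-memory SQLite sketch of the fuller PARTITION BY (dates stored as ISO strings so they order correctly; the table name sales is an assumption):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales
               (Name TEXT, ID INT, Product TEXT, Date TEXT, Amount INT)""")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    ("Jason", 1, "Car",   "2017-01-01", 10),
    ("Jason", 1, "Car",   "2017-02-01", 5),
    ("Jason", 2, "Car",   "2017-01-01", 50),
    ("Jason", 2, "Car",   "2017-02-01", 60),
    ("Jason", 3, "House", "2017-01-01", 20),
    ("Jason", 3, "House", "2017-02-01", 30),
])

# Partition by Name, ID, Product so LAG never crosses into another ID's rows.
rows = con.execute("""
    SELECT Name, ID, Product, Date, Amount,
           LAG(Amount, 1) OVER (PARTITION BY Name, ID, Product
                                ORDER BY Date) AS Prev_Amount
    FROM sales
    ORDER BY ID, Date
""").fetchall()
for r in rows:
    print(r)
```

For the Feb 2017 row of ID 1's Car, Prev_Amount is 10 (the Jan value); each partition's first row gets NULL.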

Using Having statement and In Statement together

Here is my hypothetical table, (lets call it Table Watched)
MovieID, UserID, Date
---------------------
1 1 June
1 2 June
2 3 July
2 2 August
3 1 August
3 2 August
4 1 August
I want to get all movies only watched by users 1 or 2. So, in this case, the result should be:
MovieID
--------
1
3
4
So, I was thinking to write a query like,
SELECT MovieID FROM Watched GROUP BY MovieID HAVING ALL UserID in (1,2)
It does not work, and I am not sure whether there is another working query along the lines I am thinking. My thinking is the following:
Group all records with the MovieID
Eliminate the groups which has another user than 1 or 2
What should be the right way of doing this?
PS: I am using Oracle Database 12c.
Eliminate the groups which has another user than 1 or 2
You are on the right track, but you don't need GROUP BY - you can use DISTINCT instead. To get the results you're looking for you can use NOT IN instead of HAVING:
SELECT DISTINCT
MovieID
FROM Watched
WHERE MovieID NOT IN
( SELECT MovieID FROM Watched
WHERE UserID NOT IN (1,2))
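A runnable check of that query against the sample table (in-memory SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Watched (MovieID INT, UserID INT, Date TEXT)")
con.executemany("INSERT INTO Watched VALUES (?, ?, ?)", [
    (1, 1, "June"),   (1, 2, "June"),   (2, 3, "July"),  (2, 2, "August"),
    (3, 1, "August"), (3, 2, "August"), (4, 1, "August"),
])

# Keep only movies that have no watcher outside the set {1, 2}.
movies = [r[0] for r in con.execute("""
    SELECT DISTINCT MovieID
    FROM Watched
    WHERE MovieID NOT IN
        ( SELECT MovieID FROM Watched
          WHERE UserID NOT IN (1, 2) )
    ORDER BY MovieID
""")]
print(movies)
```

An equivalent grouped form, closer to the asker's HAVING idea, is GROUP BY MovieID HAVING SUM(CASE WHEN UserID NOT IN (1, 2) THEN 1 ELSE 0 END) = 0.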