This is a practice question from StrataScratch, and I'm stuck on the final HAVING clause.
Problem statement:
Find the total number of downloads for paying and non-paying users by date. Include only records where non-paying customers have more downloads than paying customers. The output should be sorted by earliest date first and contain 3 columns date, non-paying downloads, paying downloads.
There are three tables:
ms_user_dimension (user_id, acc_id)
ms_acc_dimension (acc_id, paying_customer)
ms_download_facts (date, user_id, downloads)
This is my code so far
SELECT date,
SUM(CASE WHEN paying_customer = 'no' THEN cnt END) AS no,
SUM(CASE WHEN paying_customer = 'yes' THEN cnt END) AS yes
FROM (
SELECT date, paying_customer, SUM(downloads) AS cnt
FROM ms_download_facts d
LEFT JOIN ms_user_dimension u ON d.user_id = u.user_id
LEFT JOIN ms_acc_dimension a ON u.acc_id = a.acc_id
GROUP BY 1, 2
ORDER BY 1, 2
) prePivot
GROUP BY date
HAVING no > yes;
If I remove the HAVING no > yes at the end, the code runs and I can see three columns: date, no, and yes. However, if I add the HAVING clause, I get the error "column "no" does not exist...LINE 13: HAVING no > yes"
I can't figure out for the life of me what's going on here. Please let me know if you can spot the problem. TIA!
You don't need a subquery for this:
SELECT d.date,
SUM(CASE WHEN a.paying_customer = 'no' THEN d.downloads END) AS no,
SUM(CASE WHEN a.paying_customer = 'yes' THEN d.downloads END) AS yes
FROM ms_download_facts d LEFT JOIN
ms_user_dimension u
ON d.user_id = u.user_id LEFT JOIN
ms_acc_dimension a
ON u.acc_id = a.acc_id
GROUP BY d.date
HAVING SUM(CASE WHEN a.paying_customer = 'no' THEN d.downloads END) > SUM(CASE WHEN a.paying_customer = 'yes' THEN d.downloads END);
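You can simplify the HAVING clause to:
HAVING SUM(CASE WHEN a.paying_customer = 'no' THEN d.downloads ELSE -d.downloads END) > 0
This version assumes that paying_customer only takes on the values 'yes' and 'no'; any other value, including a NULL produced by the LEFT JOINs, is subtracted as if it were paying.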
You may be able to simplify the query further, depending on the database you are using.
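To check that the two HAVING forms agree, here is a minimal sketch that runs both against an in-memory SQLite database. The rows are made up for illustration, the aliases non_paying and paying are my own choice (the question uses no and yes), and SQLite stands in for whatever engine StrataScratch runs. Note that the sign trick only compares downloads when each row is weighted by d.downloads; weighting by 1 would compare row counts instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ms_user_dimension (user_id INTEGER, acc_id INTEGER);
CREATE TABLE ms_acc_dimension (acc_id INTEGER, paying_customer TEXT);
CREATE TABLE ms_download_facts (date TEXT, user_id INTEGER, downloads INTEGER);
INSERT INTO ms_user_dimension VALUES (1, 10), (2, 20);
INSERT INTO ms_acc_dimension VALUES (10, 'yes'), (20, 'no');
-- Made-up rows: non-paying leads on 2020-08-15, paying leads on 2020-08-16
-- (even though non-paying has more *rows* on that second date).
INSERT INTO ms_download_facts VALUES
  ('2020-08-15', 1, 3), ('2020-08-15', 2, 5),
  ('2020-08-16', 1, 9), ('2020-08-16', 2, 1), ('2020-08-16', 2, 1);
""")

BASE = """
SELECT d.date,
       SUM(CASE WHEN a.paying_customer = 'no'  THEN d.downloads END) AS non_paying,
       SUM(CASE WHEN a.paying_customer = 'yes' THEN d.downloads END) AS paying
FROM ms_download_facts d
LEFT JOIN ms_user_dimension u ON d.user_id = u.user_id
LEFT JOIN ms_acc_dimension a ON u.acc_id = a.acc_id
GROUP BY d.date
HAVING {condition}
ORDER BY d.date
"""

# Form 1: repeat both aggregate expressions in HAVING.
two_sums = conn.execute(BASE.format(condition="""
    SUM(CASE WHEN a.paying_customer = 'no'  THEN d.downloads END)
  > SUM(CASE WHEN a.paying_customer = 'yes' THEN d.downloads END)
""")).fetchall()

# Form 2: one signed sum, weighted by downloads.
signed = conn.execute(BASE.format(condition="""
    SUM(CASE WHEN a.paying_customer = 'no'
             THEN d.downloads ELSE -d.downloads END) > 0
""")).fetchall()

print(two_sums)
print(signed)
```

Both queries keep only the date on which non-paying downloads exceed paying downloads.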
The database doesn't allow column aliases from the SELECT list in the HAVING clause, because HAVING is logically evaluated before those aliases exist. Replace no with:
SUM(CASE WHEN paying_customer = 'no' THEN cnt END)
and do the same for yes:
SELECT date,
SUM(CASE WHEN paying_customer = 'no' THEN cnt END) AS no,
SUM(CASE WHEN paying_customer = 'yes' THEN cnt END) AS yes
FROM (
SELECT date, paying_customer, SUM(downloads) AS cnt
FROM ms_download_facts d
LEFT JOIN ms_user_dimension u ON d.user_id = u.user_id
LEFT JOIN ms_acc_dimension a ON u.acc_id = a.acc_id
GROUP BY 1, 2
ORDER BY 1, 2
) prePivot
GROUP BY date
HAVING SUM(CASE WHEN paying_customer = 'no' THEN cnt END) > SUM(CASE WHEN paying_customer = 'yes' THEN cnt END);
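If repeating the aggregate expressions feels noisy, another option is to wrap the pivot in an outer query and filter there with WHERE, since the outer query sees the aliases as ordinary columns. A rough sketch, run through SQLite from Python; the sample rows and the alias names non_paying and paying are assumptions of mine, not part of the original problem:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ms_user_dimension (user_id INTEGER, acc_id INTEGER);
CREATE TABLE ms_acc_dimension (acc_id INTEGER, paying_customer TEXT);
CREATE TABLE ms_download_facts (date TEXT, user_id INTEGER, downloads INTEGER);
INSERT INTO ms_user_dimension VALUES (1, 10), (2, 20);
INSERT INTO ms_acc_dimension VALUES (10, 'yes'), (20, 'no');
-- Made-up sample rows.
INSERT INTO ms_download_facts VALUES
  ('2020-08-15', 1, 3), ('2020-08-15', 2, 5),
  ('2020-08-16', 1, 9), ('2020-08-16', 2, 2);
""")

rows = conn.execute("""
SELECT date, non_paying, paying
FROM (
    -- Inner query: pivot downloads into one column per customer type.
    SELECT d.date,
           SUM(CASE WHEN a.paying_customer = 'no'  THEN d.downloads END) AS non_paying,
           SUM(CASE WHEN a.paying_customer = 'yes' THEN d.downloads END) AS paying
    FROM ms_download_facts d
    LEFT JOIN ms_user_dimension u ON d.user_id = u.user_id
    LEFT JOIN ms_acc_dimension a ON u.acc_id = a.acc_id
    GROUP BY d.date
) pivoted
-- Outer query: the aliases are now real column names, so WHERE can use them.
WHERE non_paying > paying
ORDER BY date
""").fetchall()

print(rows)
```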
Related
I am trying to return year-over-year results based on date criteria. I would also like to include additional information in the query, i.e. the first date of activity and the first date of activity with a spot name like '%6%'. My current query multiplies the correct amounts by 6, and I can't figure out how to fix it. When I remove the first "where" condition, I get the correct amounts. Any help would be appreciated.
Select
V.IGB_License,
DBA,
V.Sci_Games_Name,
convert(date,v2.Activity_date) as First6thMachineDate,
convert(date,V3.Activity_date) as GoLiveDate,
sum(case when (v.Activity_date between '1/23/2019' and DATEADD(YEAR,-2,getdate()-1)) then v.Funds_in else 0 end) as FundsIn2019,
sum(case when (v.Activity_date between '1/23/2020' and DATEADD(YEAR,-1,getdate()-1)) then v.Funds_in else 0 end) as FundsIn2020,
sum(case when (v.Activity_date between '1/23/2021' and getdate()) then v.Funds_in else 0 end) as FundsIn2021,
sum(case when (v.Activity_date between '1/23/2019' and DATEADD(YEAR,-2,getdate()-1)) then v.Net_funds else 0 end) as NetFunds2019,
sum(case when (v.Activity_date between '1/23/2020' and DATEADD(YEAR,-1,getdate()-1)) then v.Net_funds else 0 end) as NetFunds2020,
sum(case when (v.Activity_date between '1/23/2021' and getdate()) then v.Net_funds else 0 end) as NetFunds2021
From VGT_activity V
Left Join Locations on v.IGB_License = Locations.IGB_License
left join VGT_activity V2 on v.IGB_License = v2.IGB_License
Left join VGT_activity V3 on v.IGB_License = v3.IGB_License
Where v2.Activity_date = (
Select Min(V1.Activity_date)
From VGT_activity V1
Where v1.IGB_License = V2.IGB_License
and Spot_name like '%6%'
)
and v3.Activity_date = (
Select Min(V1.Activity_date)
From VGT_activity V1
Where v1.IGB_License = V3.IGB_License
)
group by V.IGB_License, dba, V.Sci_Games_Name, v2.Activity_date, v3.Activity_date
order by 4
I am working on Hive and am facing an issue with rolling counts. The sample data I am working on is as shown below:
and the output I am expecting is as shown below:
I tried using the following query but it is not returning the rolling count:
select event_dt,status, count(distinct account) from
(select *, row_number() over (partition by account order by event_dt
desc)
as rnum from table.A
where event_dt between '2018-05-02' and '2018-05-04') x where rnum =1
group by event_dt, status;
Please help me with this if someone has solved a similar issue.
You seem to just want conditional aggregation:
select event_dt,
sum(case when status = 'Registered' then 1 else 0 end) as registered,
sum(case when status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when status = 'suspended' then 1 else 0 end) as suspended,
sum(case when status = 'reactive' then 1 else 0 end) as reactive
from table.A
group by event_dt
order by event_dt;
EDIT:
This is a tricky problem. The solution I've come up with does a cross-product of dates and users and then calculates the most recent status as of each date.
So:
select a.event_dt,
sum(case when aa.status = 'Registered' then 1 else 0 end) as registered,
sum(case when aa.status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when aa.status = 'suspended' then 1 else 0 end) as suspended,
sum(case when aa.status = 'reactive' then 1 else 0 end) as reactive
from (select d.event_dt, ac.account, a.status,
max(case when a.status is not null then a.timestamp end) over (partition by ac.account order by d.event_dt) as last_status_timestamp
from (select distinct event_dt from table.A) d cross join
(select distinct account from table.A) ac left join
(select a.*,
row_number() over (partition by account, event_dt order by timestamp desc) as seqnum
from table.A a
) a
on a.event_dt = d.event_dt and
a.account = ac.account and
a.seqnum = 1 -- get the last one on the date
) a left join
table.A aa
on aa.timestamp = a.last_status_timestamp and
aa.account = a.account
group by a.event_dt
order by a.event_dt;
What this is doing is creating a derived table with rows for all accounts and dates. This has the status on certain days, but not all days.
The cumulative max for last_status_timestamp calculates the most recent timestamp that has a valid status. This is then joined back to the table to get the status on that date. Voila! This is the status used for the conditional aggregation.
The cumulative max and join is a work-around because Hive does not (yet?) support the ignore nulls option in lag().
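To make the carry-forward logic concrete, here is a small sketch of the same idea run in SQLite from Python (not Hive). It assumes SQLite 3.25 or later for window functions. The table name events, the column name ts (standing in for timestamp), and all rows are made up for illustration, and timestamps are assumed unique per account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_dt TEXT, account TEXT, status TEXT, ts TEXT);
-- Made-up rows: A1 has no event on 05-04, A2 has none on 05-03.
INSERT INTO events VALUES
  ('2018-05-02', 'A1', 'Registered',  '2018-05-02 09:00'),
  ('2018-05-02', 'A2', 'Registered',  '2018-05-02 10:00'),
  ('2018-05-03', 'A1', 'active_acct', '2018-05-03 09:00'),
  ('2018-05-04', 'A2', 'suspended',   '2018-05-04 09:00');
""")

rolling = conn.execute("""
WITH last_per_day AS (
    -- Latest event per account per day.
    SELECT account, event_dt, ts,
           ROW_NUMBER() OVER (PARTITION BY account, event_dt
                              ORDER BY ts DESC) AS seqnum
    FROM events
),
grid AS (
    -- Every (date, account) pair, with the most recent timestamp seen so far.
    -- MAX() skips NULLs, so days without an event inherit the earlier value.
    SELECT d.event_dt, ac.account,
           MAX(l.ts) OVER (PARTITION BY ac.account
                           ORDER BY d.event_dt) AS last_ts
    FROM (SELECT DISTINCT event_dt FROM events) d
    CROSS JOIN (SELECT DISTINCT account FROM events) ac
    LEFT JOIN last_per_day l
           ON l.event_dt = d.event_dt
          AND l.account = ac.account
          AND l.seqnum = 1
)
-- Join back to recover the status at that timestamp, then pivot.
SELECT g.event_dt,
       SUM(CASE WHEN e.status = 'Registered'  THEN 1 ELSE 0 END) AS registered,
       SUM(CASE WHEN e.status = 'active_acct' THEN 1 ELSE 0 END) AS active_acct,
       SUM(CASE WHEN e.status = 'suspended'   THEN 1 ELSE 0 END) AS suspended
FROM grid g
LEFT JOIN events e ON e.ts = g.last_ts AND e.account = g.account
GROUP BY g.event_dt
ORDER BY g.event_dt
""").fetchall()

for row in rolling:
    print(row)
```

On 2018-05-03, account A2 has no event of its own, so it is still counted under its 2018-05-02 status; that carried-over row is what makes the count "rolling".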
I have a SQL database with people, from which I want to see how much experience each person has in a specific department.
In the current query I have the following code:
SELECT [PERSON_ID]
,sum(case when [DEPARTMENT] = 'Marketing' then 1 else 0 end) as Exp_Marketing
,sum(case when [FUNCTION_DESC] = 'Finance' then 1 else 0 end) as Exp_Finance
FROM [xxxx].[xxxx].[xxxx]
GROUP BY [PERSON_ID]
Each person has one row per month of service, so a person with 12 months of experience in Finance has a value of 12 in the Exp_Finance column.
The issue, however, is that the result now shows everyone, including people who have already left the organization. How can I make sure the result only shows the historical information for people who are currently part of the organization? In other words, the ones who have a row with "2018M06" as the value of the Date column.
You can use a having clause:
SELECT [PERSON_ID],
sum(case when [DEPARTMENT] = 'Marketing' then 1 else 0 end) as Exp_Marketing,
sum(case when [FUNCTION_DESC] = 'Finance' then 1 else 0 end) as Exp_Finance
FROM [xxxx].[xxxx].[xxxx]
GROUP BY [PERSON_ID]
HAVING MAX([DATE]) = '2018M06';
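Your month format (a four-digit year followed by a zero-padded month) sorts chronologically as a string, so MAX() returns each person's latest month. This assumes 2018M06 is the most recent month in the table.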
You should add an EXISTS within a WHERE clause so you only include people meeting your criteria.
SELECT [PERSON_ID]
,sum(case when [DEPARTMENT] = 'Marketing' then 1 else 0 end) as Exp_Marketing
,sum(case when [FUNCTION_DESC] = 'Finance' then 1 else 0 end) as Exp_Finance
FROM [xxxx].[xxxx].[xxxx] A
WHERE EXISTS (SELECT * FROM [xxxx].[xxxx].[xxxx] B
WHERE A.PERSON_ID = B.PERSON_ID AND B.[DATE] = '2018M06')
GROUP BY [PERSON_ID]
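As a quick sanity check, the sketch below runs the HAVING MAX() approach and the EXISTS approach side by side on a tiny in-memory SQLite table. The table name person_months and the rows are hypothetical; only the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person_months (
    PERSON_ID INTEGER, DEPARTMENT TEXT, FUNCTION_DESC TEXT, "DATE" TEXT
);
-- Made-up rows: person 1 is still present in 2018M06, person 2 left after 2018M05.
INSERT INTO person_months VALUES
  (1, 'Marketing', 'Sales',   '2018M05'),
  (1, 'Marketing', 'Finance', '2018M06'),
  (2, 'Marketing', 'Finance', '2018M04'),
  (2, 'Marketing', 'Finance', '2018M05');
""")

# Approach 1: keep groups whose latest month is 2018M06.
with_having = conn.execute("""
SELECT PERSON_ID,
       SUM(CASE WHEN DEPARTMENT = 'Marketing'  THEN 1 ELSE 0 END) AS Exp_Marketing,
       SUM(CASE WHEN FUNCTION_DESC = 'Finance' THEN 1 ELSE 0 END) AS Exp_Finance
FROM person_months
GROUP BY PERSON_ID
HAVING MAX("DATE") = '2018M06'
""").fetchall()

# Approach 2: keep rows of people who have any 2018M06 row.
with_exists = conn.execute("""
SELECT A.PERSON_ID,
       SUM(CASE WHEN A.DEPARTMENT = 'Marketing'  THEN 1 ELSE 0 END) AS Exp_Marketing,
       SUM(CASE WHEN A.FUNCTION_DESC = 'Finance' THEN 1 ELSE 0 END) AS Exp_Finance
FROM person_months A
WHERE EXISTS (SELECT 1 FROM person_months B
              WHERE A.PERSON_ID = B.PERSON_ID AND B."DATE" = '2018M06')
GROUP BY A.PERSON_ID
""").fetchall()

print(with_having)
print(with_exists)
```

The two agree here. They would diverge if the table held months later than 2018M06: MAX() would then drop anyone with a later row, while EXISTS would still keep them.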
I'm really struggling with this and just can't seem to figure it out. I've got the concept in my head, but I don't know how to turn my plain-language understanding of the solution into the correct syntax.
Here is the question.
Give me a list of all donors and their addresses categorized by whether they donated art, money, or both.
Here is the set up for the tables.
CareTakers: CareTakerID, CareTakerName
Donations: DonationID, DonorID, DonatedMoney, ArtName, ArtType, ArtAppraisedPrice, ArtLocationBuilding, ArtLocationRoom, CareTakerID
Donors: DonorID, DonorName, DonorAddress
Here is what I have for my code so far.
SELECT
DISTINCT(DonorName), DonorAddress
FROM
Donors JOIN Donations ON Donors.DonorID = Donations.DonorID
GROUP BY
DonatedMoney
HAVING
DonatedMoney = 'Y' OR DonatedMoney = 'N' OR DonatedMoney = 'Y' AND ArtName IS NOT NULL
Any help would be highly appreciated!
Why would you use a having clause? The question specifies no filtering. The following summarizes the donations to get what you need and then joins the results back to the donors table:
select d.*, don.DonationType
from donors d join
(select don.donorid,
(case when sum(case when donatedmoney = 'Y' then 1 else 0 end) > 0 and
sum(case when artname is not null then 1 else 0 end) > 0
then 'Both'
when sum(case when donatedmoney = 'Y' then 1 else 0 end) > 0
then 'Money'
when sum(case when artname is not null then 1 else 0 end) > 0
then 'Art'
else 'Neither'
end) as DonationType
from donations don
group by don.donorid
) don
on d.donorid = don.donorid
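Here is a minimal runnable sketch of that summarize-then-join pattern, using SQLite from Python. The donors and donations are invented, only the columns needed for the categorization are created, and each SUM is compared against 0 explicitly in every branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Donors (DonorID INTEGER, DonorName TEXT, DonorAddress TEXT);
CREATE TABLE Donations (DonationID INTEGER, DonorID INTEGER,
                        DonatedMoney TEXT, ArtName TEXT);
-- Made-up rows: Ann gave money and art, Bob only money, Cy only art.
INSERT INTO Donors VALUES
  (1, 'Ann', '1 Elm St'), (2, 'Bob', '2 Oak Ave'), (3, 'Cy', '3 Pine Rd');
INSERT INTO Donations VALUES
  (1, 1, 'Y', NULL), (2, 1, 'N', 'Sunrise'),
  (3, 2, 'Y', NULL), (4, 3, 'N', 'Dusk');
""")

categorized = conn.execute("""
SELECT d.DonorName, d.DonorAddress, t.DonationType
FROM Donors d
JOIN (
    -- One row per donor: count money donations and art donations,
    -- then turn the two counts into a label.
    SELECT DonorID,
           CASE WHEN SUM(CASE WHEN DonatedMoney = 'Y' THEN 1 ELSE 0 END) > 0
                 AND SUM(CASE WHEN ArtName IS NOT NULL THEN 1 ELSE 0 END) > 0
                THEN 'Both'
                WHEN SUM(CASE WHEN DonatedMoney = 'Y' THEN 1 ELSE 0 END) > 0
                THEN 'Money'
                WHEN SUM(CASE WHEN ArtName IS NOT NULL THEN 1 ELSE 0 END) > 0
                THEN 'Art'
                ELSE 'Neither'
           END AS DonationType
    FROM Donations
    GROUP BY DonorID
) t ON d.DonorID = t.DonorID
ORDER BY d.DonorID
""").fetchall()

for row in categorized:
    print(row)
```

Because this is an inner join, donors with no rows in Donations do not appear at all; a LEFT JOIN from Donors would be needed to list them.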
I am just learning SQL and have run into a problem creating a custom report. I am working with school attendance data. I want to create a report that gives membership days and number of days for each absence type.
I have successfully created a report for these separately.
Membership Days (calculated by counting days school was in session between the student's entry date and the current date. Membership days does not exist as a field on its own)
SELECT sum(case when cd.DATE_VALUE >= s.ENTRYDATE and cd.DATE_VALUE <= current_timestamp THEN cd.INSESSION ELSE 0 END), s.LASTFIRST
FROM CALENDAR_DAY cd,STUDENTS s
WHERE cd.SCHOOLID = 405
GROUP BY s.LASTFIRST
Count per absence type
SELECT s.STUDENT_NUMBER, s.LASTFIRST,SUM(CASE WHEN a.ATTENDANCE_CODEID = 2 THEN 1 ELSE 0 END),SUM(CASE WHEN a.ATTENDANCE_CODEID = 4 THEN 1 ELSE 0 END),SUM(CASE WHEN a.ATTENDANCE_CODEID = 3 THEN 1 ELSE 0 END),SUM(CASE WHEN a.ATTENDANCE_CODEID = 51 THEN 1 ELSE 0 END)
FROM ATTENDANCE a
INNER join STUDENTS s
ON a.STUDENTID = s.ID
WHERE a.att_date between '%param1%' and '%param2%'
GROUP BY s.STUDENT_NUMBER, s.LASTFIRST
The problem is that if I try to put these in the same report, the membership days are multiplied by the number of times the student appears in the attendance table due to joining student and attendance. My thought on a solution was to then divide this line
sum(case when cd.DATE_VALUE >= s.ENTRYDATE and cd.DATE_VALUE <= current_timestamp THEN cd.INSESSION ELSE 0 END)
by the number of times the student showed up in the attendance table to counteract the student information existing on every line. I can't figure out how to do that. I don't know much about these types of problems, so hopefully I've just gone off on the wrong track and there is an easy solution. Thanks.
This is a common problem: trying to summarize along two dimensions at the same time without using subqueries. You want to write this query as two aggregation subqueries that are joined afterwards. Something like this:
SELECT *
FROM (SELECT sum(case when cd.DATE_VALUE >= s.ENTRYDATE and cd.DATE_VALUE <= current_timestamp
THEN cd.INSESSION
ELSE 0
END) AS MEMBERSHIP_DAYS, s.STUDENT_NUMBER
FROM CALENDAR_DAY cd CROSS JOIN
STUDENTS s
WHERE cd.SCHOOLID = 405
GROUP BY s.STUDENT_NUMBER
) sc JOIN
(SELECT s.STUDENT_NUMBER, s.LASTFIRST,
SUM(CASE WHEN a.ATTENDANCE_CODEID = 2 THEN 1 ELSE 0 END) AS CODE_2_DAYS,
SUM(CASE WHEN a.ATTENDANCE_CODEID = 4 THEN 1 ELSE 0 END) AS CODE_4_DAYS,
SUM(CASE WHEN a.ATTENDANCE_CODEID = 3 THEN 1 ELSE 0 END) AS CODE_3_DAYS,
SUM(CASE WHEN a.ATTENDANCE_CODEID = 51 THEN 1 ELSE 0 END) AS CODE_51_DAYS
FROM ATTENDANCE a INNER join
STUDENTS s
ON a.STUDENTID = s.ID
WHERE a.att_date between '%param1%' and '%param2%'
GROUP BY s.STUDENT_NUMBER, s.LASTFIRST
) sa
on sc.STUDENT_NUMBER = sa.STUDENT_NUMBER;
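To see the multiplication and the fix side by side, here is a small sketch in SQLite via Python. All rows are invented, the tables only carry the columns the queries touch, the column aliases are my own, and the att_date parameter filter is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENTS (ID INTEGER, STUDENT_NUMBER INTEGER,
                       LASTFIRST TEXT, ENTRYDATE TEXT);
CREATE TABLE CALENDAR_DAY (DATE_VALUE TEXT, INSESSION INTEGER, SCHOOLID INTEGER);
CREATE TABLE ATTENDANCE (STUDENTID INTEGER, ATTENDANCE_CODEID INTEGER, ATT_DATE TEXT);
-- Made-up data: three in-session days, two students, three absence rows.
INSERT INTO STUDENTS VALUES
  (1, 1001, 'Doe, Jane', '2024-01-02'),
  (2, 1002, 'Roe, Sam',  '2024-01-03');
INSERT INTO CALENDAR_DAY VALUES
  ('2024-01-02', 1, 405), ('2024-01-03', 1, 405), ('2024-01-04', 1, 405);
INSERT INTO ATTENDANCE VALUES
  (1, 2, '2024-01-03'), (1, 4, '2024-01-04'), (2, 3, '2024-01-04');
""")

MEMBERSHIP = """SUM(CASE WHEN cd.DATE_VALUE >= s.ENTRYDATE
                          AND cd.DATE_VALUE <= CURRENT_TIMESTAMP
                         THEN cd.INSESSION ELSE 0 END)"""

# Naive version: joining ATTENDANCE repeats every calendar row once per
# attendance row, so membership days are inflated.
naive = conn.execute(f"""
SELECT s.STUDENT_NUMBER, {MEMBERSHIP} AS membership_days
FROM CALENDAR_DAY cd
CROSS JOIN STUDENTS s
JOIN ATTENDANCE a ON a.STUDENTID = s.ID
WHERE cd.SCHOOLID = 405
GROUP BY s.STUDENT_NUMBER
ORDER BY s.STUDENT_NUMBER
""").fetchall()

# Fixed version: aggregate each dimension on its own, then join the results.
fixed = conn.execute(f"""
SELECT sc.STUDENT_NUMBER, sc.membership_days, sa.code_2, sa.code_4, sa.code_3
FROM (SELECT s.STUDENT_NUMBER, {MEMBERSHIP} AS membership_days
      FROM CALENDAR_DAY cd CROSS JOIN STUDENTS s
      WHERE cd.SCHOOLID = 405
      GROUP BY s.STUDENT_NUMBER) sc
JOIN (SELECT s.STUDENT_NUMBER,
             SUM(CASE WHEN a.ATTENDANCE_CODEID = 2 THEN 1 ELSE 0 END) AS code_2,
             SUM(CASE WHEN a.ATTENDANCE_CODEID = 4 THEN 1 ELSE 0 END) AS code_4,
             SUM(CASE WHEN a.ATTENDANCE_CODEID = 3 THEN 1 ELSE 0 END) AS code_3
      FROM ATTENDANCE a JOIN STUDENTS s ON a.STUDENTID = s.ID
      GROUP BY s.STUDENT_NUMBER) sa
  ON sc.STUDENT_NUMBER = sa.STUDENT_NUMBER
ORDER BY sc.STUDENT_NUMBER
""").fetchall()

print(naive)
print(fixed)
```

Student 1001 has three membership days and two attendance rows, so the naive join reports six; pre-aggregating each side first brings it back to three. Students with no attendance rows drop out of the inner join, so a LEFT JOIN from sc may be preferable if they should still be listed.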