SQL: Need to SUM on results that meet a HAVING statement

I have a table where we record per user values like money_spent, money_spent_on_candy and the date.
So the columns in this table (let's call it MoneyTable) would be:
UserId
Money_Spent
Money_Spent_On_Candy
Date
My goal is to SUM the total amount of money_spent -- but only for those users who spent more than 10% of their total for the date range on candy.
What would that query be?
I know how to select the Users that have this -- and then I can output the data and sum that by hand but I would like to do this in one single query.
Here would be the query to pull the sum of Spend per user for only the users that have spent > 10% of their money on candy.
SELECT
    UserId,
    SUM(Money_Spent),
    SUM(Money_Spent_On_Candy) / SUM(Money_Spent) AS PercentCandySpend
FROM MoneyTable
WHERE Date >= '2010-01-01'
GROUP BY UserId
HAVING PercentCandySpend > 0.1;

You couldn't do this with a single flat query: you'd need a query that could reach back in time and retroactively filter the source table down to only the users with more than 10% candy spending. Luckily, that's more or less what sub-queries do:
SELECT SUM(spent) FROM (
    SELECT SUM(Money_Spent) AS spent
    FROM MoneyTable
    WHERE Date >= '2010-01-01'
    GROUP BY UserId
    HAVING SUM(Money_Spent_On_Candy) / SUM(Money_Spent) > 0.1
) per_user;
The inner query does the heavy lifting of figuring out what the "10%" users spent, and then the outer query uses the sub-query as a virtual table to sum up the per-user Money_Spent sums.
Of course, this only works if you need ONLY the global total Money_Spent. If you end up needing the per-user sums as well, then you'd be better off just running the inner query and doing the global total in your application.
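That said, if you do want the per-user rows and the grand total in one result set, here is a rough sketch using MySQL's WITH ROLLUP (the per_user alias is mine; on SQL Server you'd write GROUP BY ROLLUP(UserId) instead). It re-groups the filtered per-user sums and appends a total row with a NULL UserId:
SELECT UserId, SUM(spent) AS Money_Spent
FROM (
    SELECT UserId, SUM(Money_Spent) AS spent
    FROM MoneyTable
    WHERE Date >= '2010-01-01'
    GROUP BY UserId
    HAVING SUM(Money_Spent_On_Candy) / SUM(Money_Spent) > 0.1
) per_user
GROUP BY UserId WITH ROLLUP;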

You can use common table expressions. Like this:
WITH temp AS (
    SELECT
        UserId,
        SUM(Money_Spent) AS MoneySpent,
        SUM(Money_Spent_On_Candy) / SUM(Money_Spent) AS PercentCandySpend
    FROM MoneyTable
    WHERE Date >= '2010-01-01'
    GROUP BY UserId
    HAVING SUM(Money_Spent_On_Candy) / SUM(Money_Spent) > 0.1
)
SELECT SUM(MoneySpent)
FROM temp;

Or you can use a derived table:
SELECT SUM(Total_Money_Spent)
FROM ( SELECT UserId,
              Total_Money_Spent = SUM(Money_Spent),
              PercentCandySpend = SUM(Money_Spent_On_Candy) / SUM(Money_Spent)
       FROM MoneyTable
       WHERE Date >= '2010-01-01'
       GROUP BY UserId
       HAVING SUM(Money_Spent_On_Candy) / SUM(Money_Spent) > 0.1 ) x;

Related

PL-SQL query to calculate customers per period from start and stop dates

I have a PL-SQL table with a structure as shown in the example below:
I have customers (customer_number) with insurance cover start and stop dates (cover_start_date and cover_stop_date). I also have dates of accidents for those customers (accident_date). These customers may have more than one row in the table if they have had more than one accident. They may also have no accidents. And they may also have a blank entry for the cover stop date if their cover is ongoing. Sorry I did not design the data format, but I am stuck with it.
I am looking to calculate the number of accidents (num_accidents) and number of customers (num_customers) in a given time period (period_start), and from that the number of accidents-per-customer (which will be easy once I've got those two pieces of information).
Any ideas on how to design a PL-SQL function to do this in a simple way? Ideally with the time periods not being fixed to monthly (for example, weekly or fortnightly too)? Ideally I will end up with a table like the one shown below:
Many thanks for any pointers...
You seem to need a list of dates. You can generate one in the query and then use correlated subqueries to calculate the columns you want:
select d.*,
       (select count(distinct t.customer_number)
        from t
        where t.cover_start_date <= d.dte and
              (t.cover_stop_date > d.dte + interval '1' month or t.cover_stop_date is null)
       ) as num_customers,
       (select count(*)
        from t
        where t.accident_date >= d.dte and
              t.accident_date < d.dte + interval '1' month
       ) as num_accidents,
       (select count(distinct t.customer_number)
        from t
        where t.accident_date >= d.dte and
              t.accident_date < d.dte + interval '1' month
       ) as num_customers_with_accident
from (select date '2020-01-01' as dte from dual union all
      select date '2020-02-01' as dte from dual union all
      . . .
     ) d;
If you want to do arithmetic on the columns, you can use this as a subquery or CTE.
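For instance, a rough sketch of that as a CTE (Oracle-style syntax as above, period list shortened to two months; table t and the column names are the same placeholders used above, and the ratio column is the accidents-per-customer figure the question mentions):
with periods as (
    select date '2020-01-01' as dte from dual union all
    select date '2020-02-01' as dte from dual
),
period_stats as (
    select p.dte,
           (select count(*)
            from t
            where t.accident_date >= p.dte
              and t.accident_date < p.dte + interval '1' month) as num_accidents,
           (select count(distinct t.customer_number)
            from t
            where t.cover_start_date <= p.dte
              and (t.cover_stop_date > p.dte + interval '1' month
                   or t.cover_stop_date is null)) as num_customers
    from periods p
)
select dte as period_start,
       num_accidents,
       num_customers,
       num_accidents / nullif(num_customers, 0) as accidents_per_customer
from period_stats;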

I am trying to use the AVG() function in a subquery after using a COUNT() in the inner query, but I cannot seem to get it to work in SQL

My table is named CustomerDetails and it has the following columns:
customer_id, login_id, session_id, login_date
I am trying to write a query that calculates the average number of customers logging in per day.
I tried this:
select avg(session_id)
from CustomerDetails
where exists (select count(session_id) from CustomerDetails as 'no_of_entries')
But then I realized it was just going straight to the column and calculating the average of that column, which is not what I want to do. Can someone help me?
Thanks
The first thing you need to do is get logins per day:
SELECT login_date, COUNT(*) AS loginsPerDay
FROM CustomerDetails
GROUP BY login_date
Then you can use that to get average logins per day:
SELECT AVG(loginsPerDay)
FROM (
    SELECT login_date, COUNT(*) AS loginsPerDay
    FROM CustomerDetails
    GROUP BY login_date
) t
If your login_date is a DATE type you're all set. If it has a time component then you'll need to truncate it to date only:
SELECT AVG(loginsPerDay)
FROM (
    SELECT CAST(login_date AS DATE) AS login_day, COUNT(*) AS loginsPerDay
    FROM CustomerDetails
    GROUP BY CAST(login_date AS DATE)
) t
I am trying to write a query that calculates the average number of customers logging in per day.
Count the number of customers. Divide by the number of days. I think that is:
select count(*) * 1.0 / count(distinct cast(login_date as date))
from customerdetails;
I understand that you want to count the number of visitors per day, not the number of visits. So if a customer logged in twice on the same day, you want to count them only once.
If so, you can use distinct and two levels of aggregation, like so:
select avg(cnt_visitors) avg_cnt_visitors_per_day
from (
    select count(distinct customer_id) cnt_visitors
    from CustomerDetails
    group by cast(login_date as date)
) t
The inner query computes the count of distinct customers for each day, and the outer query gives you the overall average.

Account for missing values in group by month

I'm trying to retrieve the average number of records added to the database each month. However, for months in which no records were added, the row is missing and therefore not included in the average.
Here is the query:
SELECT AVG(a.count) AS AVG
FROM ( SELECT COUNT(*) AS count, MONTH(InsertedTimestamp) AS Month
FROM Certificates
WHERE InsertedTimestamp >= '9/19/2014'
AND InsertedTimestamp <= '7/1/2015'
GROUP BY MONTH(InsertedTimestamp)
) AS a
When I run just the inner query, only results from months 9,10,11 are showing, because there are no records for months 12,1,2,3,4,5,6,7. How can I add these missing rows to the table in order to get the correct monthly average?
Thanks!
This is easy enough to fix: instead of averaging the per-month counts, divide the total count by the number of months in the range:
SELECT COUNT(*) / (TIMESTAMPDIFF(month, '2014-09-19', '2015-07-01' ) + 1)
FROM Certificates
WHERE InsertedTimestamp >= '2014-09-19' AND
InsertedTimestamp <= '2015-07-01' ;
You don't even need the subquery.
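If you do want to materialize the missing months instead, a rough sketch (MySQL syntax assumed; the month list is written out by hand for this fixed range) is to LEFT JOIN the certificates onto a month list so that empty months count as zero:
SELECT AVG(a.cnt) AS monthly_avg
FROM (
    SELECT m.mon, COUNT(c.InsertedTimestamp) AS cnt
    FROM (SELECT '2014-09-01' AS mon UNION ALL SELECT '2014-10-01' UNION ALL
          SELECT '2014-11-01' UNION ALL SELECT '2014-12-01' UNION ALL
          SELECT '2015-01-01' UNION ALL SELECT '2015-02-01' UNION ALL
          SELECT '2015-03-01' UNION ALL SELECT '2015-04-01' UNION ALL
          SELECT '2015-05-01' UNION ALL SELECT '2015-06-01') m
    LEFT JOIN Certificates c
           ON c.InsertedTimestamp >= m.mon
          AND c.InsertedTimestamp < m.mon + INTERVAL 1 MONTH
    GROUP BY m.mon
) a;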

Using SQL to compute average daily unique usage

I have a MySQL table "stats", which is a list of entries for each login into a website. Each entry has a "userId" string, a "loginTime" timestamp and other fields. There can be more than one entry for each user - one for each login that he makes. I want to write a query that will calculate the average of unique daily logins over, say, 30 days.
Any ideas?
/*
This should give you one row for each date and the number of unique visitors on that date
*/
SELECT DATE(loginTime) AS LoginDate, COUNT(DISTINCT userID) AS UserCount
FROM stats
WHERE DATE(loginTime) BETWEEN [start date] AND [end date]
GROUP BY DATE(loginTime)
Note: It will be more helpful if you can provide some sample data with the result you are looking for.
I'm probably wrong, but if you did select count(distinct userid) from stats where logintime between the start and end of :day, for each of those 30 days, you could fetch those 30 counts (which could be pre-calculated and cached, since you presumably don't have users logging in at past times) and then just average them in the programming language you're executing the query from.
I read http://unganisha.org/home/pages/Generating_Sequences_With_SQL/index.html while looking around, and thought: if you had a table of, say, the numbers 0 to 30 (let's name it offsets for this example):
select avg(userstoday)
from (select count(distinct userid) as userstoday, offsets.day
      from stats join offsets on date(stats.loginTime) = current_date - interval offsets.day day
      group by offsets.day) t
And as I noted, the userstoday value could be pre-calculated and stored in a table.
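In case it helps, here is one possible way to build that 0-30 offsets helper table (a sketch; the table and column names are just the ones assumed above):
CREATE TABLE offsets (day INT PRIMARY KEY);
INSERT INTO offsets (day)
SELECT a.n + b.n * 10
FROM (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
      SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL
      SELECT 8 UNION ALL SELECT 9) a
CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3) b
WHERE a.n + b.n * 10 <= 30;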
Thanks everyone, eventually I used:
SELECT SUM(uniqueUsers) / 30 AS DAU
FROM (
    SELECT DATE(loginTime) AS DATE, COUNT(DISTINCT userID) AS uniqueUsers
    FROM user_requests
    WHERE DATE(loginTime) > DATE_SUB(CURDATE(), INTERVAL 30 DAY)
    GROUP BY DATE(loginTime)
) AS daily_users
I use a SUM and divide by 30 instead of an average, because on some days I may not have any logins and I want to account for that. But on any heavy-traffic website, simply using AVG will give the same result.

SQL for counting events by date

I feel like I've seen this question asked before, but neither the SO search nor google is helping me... maybe I just don't know how to phrase the question. I need to count the number of events (in this case, logins) per day over a given time span so that I can make a graph of website usage. The query I have so far is this:
select
count(userid) as numlogins,
count(distinct userid) as numusers,
convert(varchar, entryts, 101) as date
from
usagelog
group by
convert(varchar, entryts, 101)
This does most of what I need (I get a row per date as the output containing the total number of logins and the number of unique users on that date). The problem is that if no one logs in on a given date, there will not be a row in the dataset for that date. I want it to add in rows indicating zero logins for those dates. There are two approaches I can think of for solving this, and neither strikes me as very elegant.
1. Add a column to the result set that lists the number of days between the start of the period and the date of the current row. When I'm building my chart output, I'll keep track of this value and if the next row is not equal to the current row plus one, insert zeros into the chart for each of the missing days.
2. Create a "date" table that has all the dates in the period of interest and outer join against it. Sadly, the system I'm working on already has a table for this purpose that contains a row for every date far into the future... I don't like that, and I'd prefer to avoid using it, especially since that table is intended for another module of the system and would thus introduce a dependency on what I'm developing currently.
Any better solutions or hints at better search terms for google? Thanks.
Frankly, I'd do this programmatically when building the final output. You're essentially trying to read something from the database which is not there (data for days that have no data). SQL isn't really meant for that sort of thing.
If you really want to do that, though, a "date" table seems your best option. To make it a bit nicer, you could generate it on the fly, using e.g. your DB's date functions and a derived table.
I had to do exactly the same thing recently. This is how I did it in T-SQL (YMMV on speed, but I've found it performant enough over a couple million rows of event data):
DECLARE @DaysTable TABLE ( [Year] INT, [Day] INT )
DECLARE @StartDate DATETIME
SET @StartDate = whatever
WHILE (@StartDate <= GETDATE())
BEGIN
    INSERT INTO @DaysTable ( [Year], [Day] )
    SELECT DATEPART(YEAR, @StartDate), DATEPART(DAYOFYEAR, @StartDate)
    SELECT @StartDate = DATEADD(DAY, 1, @StartDate)
END
-- This gives me a table of all days since whenever
-- (you could select @StartDate as the minimum date of your usage log)
SELECT days.Year, days.Day, events.NumEvents
FROM @DaysTable AS days
LEFT JOIN (
    SELECT
        COUNT(*) AS NumEvents,
        DATEPART(YEAR, LogDate) AS [Year],
        DATEPART(DAYOFYEAR, LogDate) AS [Day]
    FROM LogData
    GROUP BY
        DATEPART(YEAR, LogDate),
        DATEPART(DAYOFYEAR, LogDate)
) AS events ON days.Year = events.Year AND days.Day = events.Day
Create a memory table (a table variable) where you insert your date ranges, then outer join the logins table against it. Group by your start date, then you can perform your aggregations and calculations.
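For example, a minimal sketch of that idea (T-SQL; the @Ranges variable, the sample dates, and the half-open range comparison are my assumptions):
DECLARE @Ranges TABLE (RangeStart DATE, RangeEnd DATE)
INSERT INTO @Ranges VALUES ('2023-01-01', '2023-01-02'), ('2023-01-02', '2023-01-03')
SELECT r.RangeStart,
       COUNT(u.userid) AS numlogins,
       COUNT(DISTINCT u.userid) AS numusers
FROM @Ranges AS r
LEFT JOIN usagelog AS u
       ON u.entryts >= r.RangeStart AND u.entryts < r.RangeEnd
GROUP BY r.RangeStart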
The strategy I normally use is to UNION with the opposite of the query, generally a query that retrieves data for rows that don't exist.
If I wanted to get the average mark for a course, but some courses weren't taken by any students, I'd need to UNION with those not taken by anyone to display a row for every class:
SELECT AVG(mark), course FROM `marks`
GROUP BY course
UNION
SELECT NULL, course FROM courses WHERE course NOT IN
(SELECT course FROM marks)
Your query will be more complex, but the same principle should apply. You may indeed need a table of dates for your second approach.
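Applied to the login counts, a rough sketch of the same UNION idea (assuming a helper table of calendar days called dates with a thedate column, which is my naming) would look something like:
SELECT convert(varchar, entryts, 101) AS date,
       COUNT(userid) AS numlogins,
       COUNT(DISTINCT userid) AS numusers
FROM usagelog
GROUP BY convert(varchar, entryts, 101)
UNION
SELECT convert(varchar, d.thedate, 101), 0, 0
FROM dates AS d
WHERE convert(varchar, d.thedate, 101) NOT IN
      (SELECT convert(varchar, entryts, 101) FROM usagelog)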
Option 1
You can create a temp table, insert the dates in the range, and do a left outer join with the usagelog.
Option 2
You can programmatically insert the missing dates while evaluating the result set to produce the final output.
WITH q(n) AS
(
SELECT 0
UNION ALL
SELECT n + 1
FROM q
WHERE n < 99
),
qq(n) AS
(
SELECT 0
UNION ALL
SELECT n + 1
FROM qq
WHERE n < 99
),
dates AS
(
SELECT q.n * 100 + qq.n AS ndate
FROM q, qq
)
SELECT COUNT(userid) as numlogins,
COUNT(DISTINCT userid) as numusers,
CAST('2000-01-01' AS DATETIME) + ndate as date
FROM dates
LEFT JOIN
usagelog
ON entryts >= CAST('2000-01-01' AS DATETIME) + ndate
AND entryts < CAST('2000-01-01' AS DATETIME) + ndate + 1
GROUP BY
ndate
This will select up to 10,000 dates constructed on the fly, which should be enough for almost 30 years.
SQL Server has a default limit of 100 recursion levels per CTE, which is why the inner queries return at most 100 rows each.
If you need more than 10,000, just add a third CTE qqq(n) and cross-join with it in dates.