Simpler way to do a SUM with a fanout on a join - sql

Note: SQL backend does not matter, any mainstream relational DB is fine (postgres, mysql, oracle, sqlserver)
There is an interesting article from Looker that describes the technique they use to produce correct totals when a JOIN results in a fanout, along the lines of:
# In other words, using a hash to remove any potential duplicates (assuming a Primary Key).
SUM(DISTINCT big_unique_number + total) - SUM(DISTINCT big_unique_number)
A good way to simulate the fanout is just doing something like this:
WITH Orders AS (
SELECT 10293 AS id, 2.5 AS rate UNION ALL
SELECT 210293 AS id, 3.5
),
Other AS (
SELECT 1 UNION ALL SELECT 2
)
SELECT SUM(rate) FROM Orders CROSS JOIN Other
-- Returns 12.0 instead of 6.0
Their example does something like this, which I think is just a long-form way of grabbing md5(PK) with all the fancy footwork to get around the 8-byte limitation (so they do a LEFT(...) and then a RIGHT(...)):
(COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
*(1000000*1.0)) AS DECIMAL(38,0))) +
CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) )
- SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) )
AS DOUBLE PRECISION)
/ CAST((1000000*1.0) AS DOUBLE PRECISION), 0)
Is there another general-purpose way to do this? Perhaps using a correlated subquery or something else? Or is the above way the best known way to do this?
Two related answers:
https://stackoverflow.com/a/14140884/651174
https://stackoverflow.com/a/3333574/651174
Without worrying about a general-purpose hashing function (for example, that may take strings), the following works:
WITH Orders AS (
SELECT 10293 AS id, 2.5 AS rate UNION ALL
SELECT 210293 AS id, 3.5
),
Other AS (
SELECT 1 UNION ALL SELECT 2
)
SELECT SUM(DISTINCT id + rate) - SUM(DISTINCT id) FROM Orders CROSS JOIN Other
-- 6.0
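Since no particular backend is required, both the fanout and the SUM(DISTINCT ...) fix can be verified with Python's built-in sqlite3 module (a sketch only; the data is the two-row example from above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
query = """
    WITH Orders(id, rate) AS (VALUES (10293, 2.5), (210293, 3.5)),
         Other(n) AS (VALUES (1), (2))
    SELECT SUM(rate),                                  -- inflated by the fanout
           SUM(DISTINCT id + rate) - SUM(DISTINCT id)  -- de-duplicated total
    FROM Orders CROSS JOIN Other
"""
inflated, corrected = con.execute(query).fetchone()
print(inflated, corrected)  # 12.0 6.0
```

Note the trick relies on id + rate being distinct per row, which holds here because id is a primary key and the rates are small relative to the gaps between ids.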
But this still raises the question: is there another / better way to do this in a very general-purpose manner?

A typical example of a join mutilating the aggregation is this:
select
posts.id,
count(likes.id) as likes_total,
count(dislikes.id) as dislikes_total
from posts
left join likes on likes.post_id = posts.id
left join dislikes on dislikes.post_id = posts.id
group by posts.id;
where both counts result in the same number, because each gets multiplied by the other. With 2 likes and 3 dislikes, both counts are 6.
The simple solution is: Aggregate before joining. If you want to know the likes and dislikes counts per post, join the likes and dislikes counts to the posts.
select posts.id, l.likes_total, d.dislikes_total
from posts
left join
(
select post_id, count(*) as likes_total
from likes
group by post_id
) l on l.post_id = posts.id
left join
(
select post_id, count(*) as dislikes_total
from dislikes
group by post_id
) d on d.post_id = posts.id;
Use COALESCE if you want to see zeros instead of nulls.
Don't try to muddle through with tricks. Just aggregate, then join. You can of course replace the joins with lateral joins (which are correlated subqueries), if the DBMS supports them. Or for single aggregates as in the example even move the correlated subqueries to the select clause. That's mainly personal preference, but depending on the DBMS's optimizer one solution may be faster than the other. (Ideally the optimizer would come up with the same execution plan for all those queries of course.)
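The difference the aggregate-then-join rewrite makes can be sketched with sqlite3 (the posts/likes/dislikes data is invented to match the 2-likes/3-dislikes example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE likes (id INTEGER PRIMARY KEY, post_id INTEGER);
    CREATE TABLE dislikes (id INTEGER PRIMARY KEY, post_id INTEGER);
    INSERT INTO posts VALUES (1);
    INSERT INTO likes (post_id) VALUES (1), (1);          -- 2 likes
    INSERT INTO dislikes (post_id) VALUES (1), (1), (1);  -- 3 dislikes
""")
# Naive double join: each like row pairs with each dislike row (2 * 3 = 6).
naive = con.execute("""
    SELECT COUNT(likes.id), COUNT(dislikes.id)
    FROM posts
    LEFT JOIN likes ON likes.post_id = posts.id
    LEFT JOIN dislikes ON dislikes.post_id = posts.id
    GROUP BY posts.id
""").fetchone()
# Aggregate before joining: the two counts stay independent.
fixed = con.execute("""
    SELECT COALESCE(l.n, 0), COALESCE(d.n, 0)
    FROM posts
    LEFT JOIN (SELECT post_id, COUNT(*) AS n FROM likes GROUP BY post_id) l
           ON l.post_id = posts.id
    LEFT JOIN (SELECT post_id, COUNT(*) AS n FROM dislikes GROUP BY post_id) d
           ON d.post_id = posts.id
""").fetchone()
print(naive)  # (6, 6)
print(fixed)  # (2, 3)
```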

Use a larger datatype to shift the values out of the way. This is similar to the first example without the potential for collisions. It probably also has minor performance benefits in not having to execute two different distinct sums.
sum(distinct id * 1000000000 + value) % 1000000000
The principle is to package up the values into a single unit. For the most flexibility you'd want to convert to something like a wide decimal type in order to accommodate the full range. With strings it's easy to generate a new surrogate id via dense_rank(). That would also let you collapse the key width according to the number of expected key values.
Ultimately, though, I think the answer is no. There's no one-size-fits-all approach, especially across the spectrum of the various aggregate functions, to say nothing of variations in mixed data types.
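The shift-and-modulo idea can be checked with sqlite3. Here the rates are scaled to integer tenths (an assumption I'm adding, since the snippet above leaves the datatype open) so the modulo arithmetic stays exact:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# rate10 holds tenths: 25 = 2.5, 35 = 3.5. Each id is shifted left of the
# value by a base larger than any possible sum of values (1,000,000 here).
result = con.execute("""
    WITH Orders(id, rate10) AS (VALUES (10293, 25), (210293, 35)),
         Other(n) AS (VALUES (1), (2))
    SELECT (SUM(DISTINCT id * 1000000 + rate10) % 1000000) / 10.0
    FROM Orders CROSS JOIN Other
""").fetchone()[0]
print(result)  # 6.0
```

The modulo recovers the value portion only while the sum of values stays below the shift base, so the base has to be chosen against the worst-case total.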

I think the best option is always to SUM the data before joining, via a subquery or a CTE:
WITH
Orders AS (
SELECT 10293 AS id, 2.5 AS rate
UNION ALL
SELECT 210293 AS id, 3.5
),
Other AS (
SELECT 1 other
UNION ALL
SELECT 2
)
select *
from (
SELECT SUM(rate) rate
FROM Orders
) OrdersSummed
CROSS JOIN Other
or
WITH
Orders AS (
SELECT 10293 AS id, 2.5 AS rate
UNION ALL
SELECT 210293 AS id, 3.5
),
Other AS (
SELECT 1 other
UNION ALL
SELECT 2
),
OrdersSummed AS (
SELECT SUM(rate) rate
FROM Orders
)
select *
from OrdersSummed
CROSS JOIN Other

-- Approaching the solution by treating the fanout phenomenon as a natural consequence of the cross join.
;WITH Orders AS (
SELECT 10293 AS id, 2.5 AS rate UNION ALL
SELECT 210293 AS id, 3.5
), Other AS (
SELECT 1 as oth_id UNION ALL SELECT 2 as oth_id
)
, FanDepth AS (
SELECT count(*) as depth from Other
)
SELECT SUM(rate) / depth
FROM
Orders CROSS JOIN Other CROSS JOIN FanDepth
GROUP BY depth
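This divide-by-depth approach can be tried with sqlite3 as well. It relies on the fanout factor being uniform, which a CROSS JOIN guarantees (a join with uneven match counts would not divide out cleanly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Every Orders row is duplicated exactly `depth` times by the cross join,
# so dividing the inflated sum by depth restores the true total.
result = con.execute("""
    WITH Orders(id, rate) AS (VALUES (10293, 2.5), (210293, 3.5)),
         Other(oth_id) AS (VALUES (1), (2)),
         FanDepth(depth) AS (SELECT COUNT(*) FROM Other)
    SELECT SUM(rate) / depth
    FROM Orders CROSS JOIN Other CROSS JOIN FanDepth
    GROUP BY depth
""").fetchone()[0]
print(result)  # 6.0
```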

Related

Sql max trophy count

I created a database in SQL about basketball. My teacher gave me a task: I need to print out the basketball players from my database with the max trophy count. So, I wrote this little bit of code:
select surname ,count(player_id) as trophy_count
from dbo.Players p
left join Trophies t on t.player_id = p.id
group by p.surname
SQL gave me every player with a trophy count, but I want it to print only the player with the maximum count. I read up on selects within selects, but I don't understand how it works; I tried, but it doesn't work.
Use TOP:
SELECT TOP 1 surname, COUNT(player_id) AS trophy_count -- or TOP 1 WITH TIES
FROM dbo.Players p
LEFT JOIN Trophies t
ON t.player_id = p.id
GROUP BY p.surname
ORDER BY COUNT(player_id) DESC;
If you want to get all ties for the highest count, then use SELECT TOP 1 WITH TIES.
;WITH CTE AS
(
select surname, count(t.player_id) as trophy_count
from dbo.Players p
left join Trophies t on t.player_id = p.id
group by p.surname
)
select *
from CTE
where trophy_count = (select max(trophy_count) from CTE)
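The CTE-plus-MAX approach can be demonstrated with sqlite3 (the players and trophies here are invented sample data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Players (id INTEGER PRIMARY KEY, surname TEXT);
    CREATE TABLE Trophies (id INTEGER PRIMARY KEY, player_id INTEGER);
    INSERT INTO Players VALUES (1, 'Jordan'), (2, 'Bird'), (3, 'Ewing');
    INSERT INTO Trophies (player_id) VALUES (1), (1), (1), (2), (2);  -- Ewing: none
""")
rows = con.execute("""
    WITH CTE AS (
        SELECT surname, COUNT(t.player_id) AS trophy_count
        FROM Players p
        LEFT JOIN Trophies t ON t.player_id = p.id
        GROUP BY p.surname
    )
    SELECT * FROM CTE
    WHERE trophy_count = (SELECT MAX(trophy_count) FROM CTE)
""").fetchall()
print(rows)  # [('Jordan', 3)]
```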
While SELECT TOP 1 WITH TIES works (and is probably more efficient), I would say this code is probably more useful in the real world, as it can be adapted to find the max, min, or a specific trophy count with a very simple modification.
This basically does your GROUP BY first, then lets you specify what results you want back. In this instance you can use:
max(trophy_count) - get the maximum
min(trophy_count) - get the minimum
# i.e. - where trophy_count = 3 - to get a specific trophy count
avg(trophy_count) - get the average trophy_count
There are many others. Google "SQL Aggregate functions"
You will eventually go down the rabbit hole of needing to subsection this (for example, by week or by league). Then you are going to want to use window functions with a CTE or subquery.
For your example:
;with cte_base as
(
-- Set your detail here (this step is only needed if you are looking at aggregates)
select surname, count(*) as Ct
from dbo.Players p
left join Trophies t on t.player_id = p.id
group by p.surname
),
cte_ranked as
(
-- Dense_rank is chosen because of ties
-- Add to the partition to break out your detail, e.g. by league
select *
, dr = DENSE_RANK() over (order by Ct desc)
from cte_base
)
select *
from cte_ranked
where dr = 1 -- Bring back only the #1 of each partition
This is overkill by far, but it helps you lay the foundation to handle much more complicated queries. Tim Biegeleisen's answer is more than adequate to answer your question.
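The DENSE_RANK variant also runs under sqlite3 (window functions need SQLite 3.25+, bundled with recent Pythons; the sample data is invented and includes a tie to show why DENSE_RANK is used):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Players (id INTEGER PRIMARY KEY, surname TEXT);
    CREATE TABLE Trophies (id INTEGER PRIMARY KEY, player_id INTEGER);
    INSERT INTO Players VALUES (1, 'Jordan'), (2, 'Bird'), (3, 'Ewing');
    INSERT INTO Trophies (player_id) VALUES (1), (1), (2), (2);  -- two-way tie
""")
rows = con.execute("""
    WITH cte_base AS (
        SELECT surname, COUNT(t.player_id) AS ct
        FROM Players p
        LEFT JOIN Trophies t ON t.player_id = p.id
        GROUP BY p.surname
    ),
    cte_ranked AS (
        SELECT *, DENSE_RANK() OVER (ORDER BY ct DESC) AS dr
        FROM cte_base
    )
    SELECT surname, ct FROM cte_ranked WHERE dr = 1
""").fetchall()
print(sorted(rows))  # [('Bird', 2), ('Jordan', 2)]
```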

Redshift: Cross join makes data disappear

I have a weird issue in Redshift with a cross join.
I am generating days and want to join them with some ids.
The sample query is this:
with ids as (
Select number as id
from models.number_10000
limit 10
),
day as (
SELECT
TO_CHAR(DATEADD(day,num.number,CAST(DATEADD(day,-463,GETDATE()) AS DATE)),'YYYY-MM-DD') as date_string
FROM
(Select * from models.number_10000 limit 463)
as num
)
SELECT
id,date_string
from ids,day
Everything is working fine so far.
However, if I add a GROUP BY, I get no results.
with ids as (
Select number as id
from models.number_10000
limit 10
),
day as (
SELECT
TO_CHAR(DATEADD(day,num.number,CAST(DATEADD(day,-463,GETDATE()) AS DATE)),'YYYY-MM-DD') as date_string
FROM
(Select * from models.number_10000 limit 463)
as num
)
SELECT
id,date_string
from ids,day
group by 1,2
How is this happening? I have never faced something similar. I guess it's something with the cross join and the group by but it seems very strange.
Any thoughts?
I'd start with the following:
State your JOIN explicitly (ANSI92 Style).
State the names of items you want to be grouped explicitly.
Moreover, remove your GROUP BY clause (as you do not have any aggregate functions) and use a DISTINCT in your SELECT statement instead.

SQL Query: Looking to calculate difference between two columns both in different tables?

I'm looking to calculate the difference between the sum of two different columns in two different tables. Here's what I have:
SELECT sum(amount)
FROM variable_in
where user_id='111111'
minus
SELECT sum(amount)
FROM variable_out
where user_id='111111'
When I do this, I just get the output of the first query. How do I have it execute both queries (for the in and out tables) and subtract the variable_out total from the variable_in total for the amount column? Both totals will be positive integers.
Thanks in advance! Most of the other tips I've seen have been overly complex compared to my issue.
it's very simple...
select
(select sum(amount) from variable_in where user_id='111111')
-
(select sum(amount) from variable_out where user_id='111111')
as amount;
How about moving the queries to the from clause and using -:
SELECT in_amount - out_amount
FROM (SELECT sum(amount) as in_amount
FROM variable_in
WHERE user_id = '111111'
) i CROSS JOIN
(SELECT sum(amount) as out_amount
FROM variable_out
WHERE user_id = '111111'
) o;
Your query is confusing the set operation "minus" with the numerical operator -. Admittedly, they do have the same name. But minus works with sets, not numbers.
I should point out that you can also use the nested queries directly in the SELECT clause, treating the results like numbers ("scalar subqueries"):
SELECT ((SELECT sum(amount) as in_amount
FROM variable_in
WHERE user_id = '111111'
) -
(SELECT sum(amount) as out_amount
FROM variable_out
WHERE user_id = '111111'
)
) as diff
FROM dual;
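The scalar-subquery form can be tried with sqlite3 (no FROM dual needed there; the amounts are invented sample data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE variable_in  (user_id TEXT, amount INTEGER);
    CREATE TABLE variable_out (user_id TEXT, amount INTEGER);
    INSERT INTO variable_in  VALUES ('111111', 100), ('111111', 50);
    INSERT INTO variable_out VALUES ('111111', 30);
""")
# Each parenthesized subquery collapses to a single number, so plain
# arithmetic `-` works, unlike the set operator MINUS.
diff = con.execute("""
    SELECT (SELECT SUM(amount) FROM variable_in  WHERE user_id = '111111')
         - (SELECT SUM(amount) FROM variable_out WHERE user_id = '111111')
""").fetchone()[0]
print(diff)  # 120
```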

clubbing multiple "With" clauses in sql

I am using Oracle Database 10g and trying to compute the upper control limit (UCL) and lower control limit (LCL) for a data set. Though it seems useless for phone-number values, I am just using it as a learning exercise. The output should be row-wise, with entries for:
salutation, zip, LCL and UCL value
which would allow a better understanding of the data.
with q as(
select student_id,salutation,zip,first_name,last_name from tempTable)
with r as(
select avg(phone) as average,stddev(phone) as sd from tempTable)
select salutation,zip,average-3*sd as"lcl",average+3*sd as"UCL"
from
q ,r
The error given is "select statement missing". Please tell me what is wrong; I am a SQL newbie and can't figure it out myself.
When using stacked CTEs, you don't need the WITH keyword for any but the first CTE; instead, use a comma before each subsequent CTE name. Try this syntax:
WITH q
AS (SELECT student_id,
salutation,
zip,
first_name,
last_name
FROM temptable),
r
AS (SELECT Avg(phone) AS average,
STDDEV(phone) AS sd
FROM temptable)
SELECT salutation,
zip,
average - 3 * sd AS "lcl",
average + 3 * sd AS "UCL"
FROM q Cross Join r;
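The comma-separated stacked-CTE syntax can be demonstrated with sqlite3 (SQLite has no built-in STDDEV, so this sketch keeps only the AVG part; the table and data are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tempTable (student_id INTEGER, salutation TEXT, zip TEXT, phone INTEGER)")
con.executemany("INSERT INTO tempTable VALUES (?, ?, ?, ?)",
                [(1, 'Mr', '10001', 5551000), (2, 'Ms', '10002', 5553000)])
# One WITH keyword, CTEs separated by a comma.
rows = con.execute("""
    WITH q AS (SELECT student_id, salutation, zip FROM tempTable),
         r AS (SELECT AVG(phone) AS average FROM tempTable)
    SELECT salutation, zip, average FROM q CROSS JOIN r
""").fetchall()
print(sorted(rows))  # [('Mr', '10001', 5552000.0), ('Ms', '10002', 5552000.0)]
```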
I don't think you need a WITH clause at all to run such a query. It might be better to use the AVG() and STDDEV() functions as window functions (analytic functions in Oracle lingo):
SELECT temp1.*, average - 3 * sd AS lcl, average + 3 * sd AS ucl
FROM (
SELECT student_id, salutation, zip, first_name, last_name
, AVG(phone) OVER ( ) AS average, STDDEV(phone) OVER ( ) AS sd
FROM tempTable
) temp1
You don't even need the subquery but it helps save some keystrokes. See this SQL Fiddle demo with dummy data from DUAL.
P.S. You do need the alias (in this case, temp1) for the subquery if you want to use * to get all the columns selected in the subquery - it won't work otherwise. Alternatively, you could name the columns explicitly, which is good practice anyway.

Compare SQL groups against each other

How can one filter a grouped resultset for only those groups that meet some criterion compared against the other groups? For example, only those groups that have the maximum number of constituent records?
I had thought that a subquery as follows should do the trick:
SELECT * FROM (
SELECT *, COUNT(*) AS Records
FROM T
GROUP BY X
) t HAVING Records = MAX(Records);
However the addition of the final HAVING clause results in an empty recordset... what's going on?
In MySQL (which I assume you are using, since you posted SELECT *, COUNT(*) FROM T GROUP BY X, which would fail in every other RDBMS I know of), you can use:
SELECT T.*
FROM T
INNER JOIN
( SELECT X, COUNT(*) AS Records
FROM T
GROUP BY X
ORDER BY Records DESC
LIMIT 1
) T2
ON T2.X = T.X
This has been tested in MySQL and removes the implicit grouping/aggregation.
If you can use window functions, plus either TOP/LIMIT ... WITH TIES or common table expressions, it becomes even shorter:
Windowed function + CTE: (MS SQL-Server & PostgreSQL Tested)
WITH CTE AS
( SELECT *, COUNT(*) OVER(PARTITION BY X) AS Records
FROM T
)
SELECT *
FROM CTE
WHERE Records = (SELECT MAX(Records) FROM CTE)
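The CTE variant also runs under sqlite3 (window functions need SQLite 3.25+; table T and its rows are invented, with one group deliberately larger than the other):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE T (x TEXT, payload INTEGER);
    INSERT INTO T VALUES ('a', 1), ('a', 2), ('a', 3), ('b', 4), ('b', 5);
""")
# Window count keeps every row while attaching its group size.
rows = con.execute("""
    WITH CTE AS (
        SELECT *, COUNT(*) OVER (PARTITION BY x) AS records
        FROM T
    )
    SELECT x, payload FROM CTE
    WHERE records = (SELECT MAX(records) FROM CTE)
""").fetchall()
print(sorted(rows))  # the three rows of the largest group, 'a'
```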
Windowed Function with TOP (MS SQL-Server Tested)
SELECT TOP 1 WITH TIES *
FROM ( SELECT *, COUNT(*) OVER(PARTITION BY X) [Records]
FROM T
) t
ORDER BY Records DESC
Lastly, I have never used Oracle, so apologies for not adding a solution that works on Oracle...
EDIT
My solution for MySQL did not take ties into account, and my suggested fix steps on the toes of what you have said you want to avoid (duplicate subqueries), so I am not sure I can help after all. However, just in case it is preferable, here is a version that will work as required on your fiddle:
SELECT T.*
FROM T
INNER JOIN
( SELECT X
FROM T
GROUP BY X
HAVING COUNT(*) =
( SELECT COUNT(*) AS Records
FROM T
GROUP BY X
ORDER BY Records DESC
LIMIT 1
)
) T2
ON T2.X = T.X
For the exact question you give, one way to look at it is that you want the group of records where there is no other group that has more records. So if you say
SELECT taxid, COUNT(*) as howMany
FROM flats
GROUP BY taxid
You get all counties and their counts
Then you can treat that expression as a table by making it a subquery and giving it an alias. Below I assign two "copies" of the query the names X and Y and ask for taxids that don't have a higher count in the other copy. If two have the same number, I'd get two or more rows. Different databases have proprietary syntax, notably TOP and LIMIT, that makes this kind of query simpler and easier to understand.
SELECT taxid FROM
(select taxid, count(*) as HowMany from flats
GROUP by taxid) as X
WHERE NOT EXISTS
(
SELECT * from
(
SELECT taxid, count(*) as HowMany FROM
flats
GROUP by taxid
) AS Y
WHERE Y.howmany > X.howmany
)
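This self-comparison with NOT EXISTS works on essentially any backend; here is a sqlite3 sketch (the flats table and taxid values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flats (id INTEGER PRIMARY KEY, taxid INTEGER);
    INSERT INTO flats (taxid) VALUES (7), (7), (7), (9), (9);
""")
# Keep a group only if no other group has a strictly higher count.
rows = con.execute("""
    SELECT taxid FROM
        (SELECT taxid, COUNT(*) AS howmany FROM flats GROUP BY taxid) AS X
    WHERE NOT EXISTS (
        SELECT * FROM
            (SELECT taxid, COUNT(*) AS howmany FROM flats GROUP BY taxid) AS Y
        WHERE Y.howmany > X.howmany
    )
""").fetchall()
print(rows)  # [(7,)]
```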
Try this:
SELECT * FROM (
SELECT *, MAX(Records) OVER () AS max_records FROM (
SELECT *, COUNT(*) AS Records
FROM T
GROUP BY X
) t
) t2 WHERE Records = max_records
I'm sorry that I can't test the validity of this query right now.