Rewriting the HAVING clause and COUNT function - sql

I'm trying to conceptually understand how to rewrite the HAVING clause and COUNT function.
I was asked "Find the names of all reviewers who have contributed three or more ratings. (As an extra challenge, try writing the query without HAVING or without COUNT.)" in relation this this simple database: http://sqlfiddle.com/#!5/35779/2/0
The query with HAVING and COUNT is easy. Without, I'm having difficulty.
Help would be very much appreciated. Thank you.

One option would be to use SUM(1) in place of COUNT in a subquery, and using WHERE instead of HAVING:
SELECT b.name
FROM (SELECT rID,SUM(1) Sum1
FROM rating
GROUP BY rID
)a
JOIN reviewer b
ON a.rID = b.rID
WHERE Sum1 >= 3
Demo: SQL Fiddle
Update: Some explanation of SUM(1):
Adding a constant to a SELECT statement will result in that value being repeated for every row returned, for example:
SELECT rID
,1 as Col1
FROM rating
Returns:
| rID | Col1 |
|-----|------|
| 201 | 1 |
| 201 | 1 |
| 202 | 1 |
| 203 | 1 |
| 203 | 1 |
......
SUM(1) is applying a constant 1 to every row and aggregating it.

Related

Calculate overall percentage of Access Query

I have an MS Access Query which returns the following sample data:
+-----+------+------+
| Ref | ANS1 | ANS2 |
+-----+------+------+
| 123 | A | A |
| 234 | B | B |
| 345 | C | C |
| 456 | D | E |
| 567 | F | G |
| 678 | H | I |
+-----+------+------+
Is it possible to have Access return the overall percentage where ANS1 = ANS2?
So my new query would return:
50
I know how to get a count of the records returned by the original query, but not how to calculate the percentage.
Since you're looking for a percentage of some condition being met across the entire dataset, the task can be reduced to having a function return either 1 (when the condition is validated), or 0 (when the condition is not validated), and then calculating an average across all records.
This could be achieved in a number of ways, one example might be to use a basic iif statement:
select avg(iif(t.ans1=t.ans2,1,0)) from YourTable t
Or, using the knowledge that a boolean value in MS Access is represented using -1 (True) or 0 (False), the expression can be reduced to:
select -avg(t.ans1=t.ans2) from YourTable t
In each of the above, change YourTable to the name of your table.
If you know how to get a count, then apply that same knowledge twice:
SELECT Count([ANS1]) As MatchCount FROM [Data]
WHERE [ANS1] = [ANS2]
divided by the total count
SELECT Count([ANS1]) As AllCount FROM [Data]
To combine both of these in a basic SQL query, one needs a "dummy" query since Access doesn't allow selection of only raw data:
SELECT TOP 1
((SELECT Count([ANS1]) As MatchCount FROM [Data] WHERE [ANS1] = [ANS2])
/
(SELECT Count([ANS1]) As AllCount FROM [Data]))
AS MatchPercent
FROM [Data]
This of course assumes that there is at least one row... so it doesn't divide by zero.

Novice seeking help, Max Aggregate not returning expected results

I'm still very new to MS-SQL. I have a simple table and query that that is getting the best of me. I know it will something fundamental I'm overlooking.
I've changed the field names but the idea is the same.
So the idea is that every time someone signs up they get a RegID, Name, and Team. The names are unique, so for below yes John changed teams. And that's my trouble.
Football Table
+------------+----------+---------+
| Max_RegID | Name | Team |
+------------+----------+---------+
| 100 | John | Red |
| 101 | Bill | Blue |
| 102 | Tom | Green |
| 103 | John | Green |
+------------+----------+---------+
With the query at the bottom using the Max_RegID, I was expecting to get back only one record.
+------------+----------+---------+
| Max_RegID | Name | Team |
+------------+----------+---------+
| 103 | John | Green |
+------------+----------+---------+
Instead I get back below, Which seems to include Max_RegID but also for each team. What am I doing wrong?
+------------+----------+---------+
| Max_RegID | Name | Team |
+------------+----------+---------+
| 100 | John | Red |
| 103 | John | Green |
+------------+----------+---------+
My Query
SELECT
Max(Football.RegID) AS Max_RegID,
Football.Name,
Football.Team
FROM
Football
GROUP BY
Football.RegID,
Football.Name,
Football.Team
EDIT* Removed the WHERE statement
The reason you're getting the results that you are is because of the way you have your GROUP BY clause structured.
When you're using any aggregate function, MAX(X), SUM(X), COUNT(X), or what have you, you're telling the SQL engine that you want the aggregate value of column X for each unique combination of the columns listed in the GROUP BY clause.
In your query as written, you're grouping by all three of the columns in the table, telling the SQL engine that each tuple is unique. Therefore the query is returning ALL of the values, and you aren't actually getting the MAX of anything at all.
What you actually want in your results is the maximum RegID for each distinct value in the Name column and also the Team that goes along with that (RegID,Name) combination.
To accomplish that you need to find the MAX(ID) for each Name in an initial data set, and then use that list of RegIDs to add the values for Name and Team in a secondary data set.
Caveat (per comments from #HABO): This is premised on the assumption that RegID is a unique number (an IDENTITY column, value from a SEQUENCE, or something of that sort). If there are duplicate values, this will fail.
The most straight forward way to accomplish that is with a sub-query. The sub-query below gets your unique RegIDs, then joins to the original table to add the other values.
SELECT
f.RegID
,f.Name
,f.Team
FROM
Football AS f
JOIN
(--The sub-query, sq, gets the list of IDs
SELECT
MAX(f2.RegID) AS Max_RegID
FROM
Football AS f2
GROUP BY
f2.Name
) AS sq
ON
sq.Max_RegID = f.RegID;
EDIT: Sorry. I just re-read the question. To get just the single record for the MAX(RegID), just take the GROUP BY out of the sub-query, and you'll just get the current maximum value, which you can use to find the values in the rest of the columns.
SELECT
f.RegID
,f.Name
,f.Team
FROM
Football AS f
JOIN
(--The sub-query, sq, now gets the MAX ID
SELECT
MAX(f2.RegID) AS Max_RegID
FROM
Football AS f2
) AS sq
ON
sq.Max_RegID = f.RegID;
Use row_number()
select * from
(SELECT
Football.RegID AS Max_RegID,
Football.Name,
Football.Team, row_number() over(partition by name order by Football.RegID desc) as rn
FROM
Football
WHERE
Football.Name = 'John')a
where rn=1
simply you can edit your query below way
SELECT *
FROM
Football f
WHERE
f.Name = 'John' and
Max_RegID = (SELECT Max(Football.Max_RegID) where Football.Name = 'John'
)
or
if sql server simply use this
select top 1 * from Football f
where f.Name = 'John'
order by Max_RegID desc
or
if mysql then
select * from Football f
where f.Name = 'John'
order by Max_RegID desc
Limit 1
You need self join :
select f1.*
from Football f inner join
Football f1
on f1.name = f.name
where f.Max_RegID = 103;
After re-visit question, the sample data suggests me subquery :
select f.*
from Football f
where name = (select top (1) f1.name
from Football f1
order by f1.Max_RegID desc
);

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez#gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille#gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the 'stuff 'function. The only drawback here is that, since your ids are ints, placing a comma between values will involve a work around (would need to be a string). Below is the method I can think of using a work around.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
#> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
#> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I got, looks like this :
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

prevent from double/triple SUMing when JOINing

i am joining two tables: accn_demographics and accn_payments. The relationship between the two tables is one to many between accn_demographics.accn_id and accn_payments.accn_id
My question is when I am summing the PAID_AMT and COPAY_AMT, I am getting double/triple/quadrouple the number that I should be getting.
Is there an obvious problem with my join condition?
select sum(p.paid_amt) as SumPaidAmount
, sum(p.copay_amt) as SumCoPay
, p.pmt_date
, d.load_Date
, p.ACCN_ID
from accn_payments p
join
(
select distinct load_date, accn_id
from accn_demographics
) d
on p.ACCN_ID=d.ACCN_ID
where p.POSTED='Y'
and p.pmt_date between '20120701' and '20120731'
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
thanks so much for your guidance.
You need to do the summation in a subquery:
select sum(p.SumPaidAmount) as SumPaidAmount, sum(p.SumCoPay) as SumCoPay,
p.pmt_date, d.load_Date, p.ACCN_ID
from (select accn_id, p.pmt_date, sum(paid_amt) as SumPaidAmt,
sum(copay_amt) as SumCoPay
from accn_payments p
where p.POSTED='Y' and
p.pmt_date between '20120701' and '20120731'
group by accn_id, pmt_date
) p join
(select distinct load_date, accn_id from accn_demographics) d
on p.ACCN_ID=d.ACCN_ID
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
Question: do you really intend for pmt_date to be in the final results? It looks like you want to remove it from both the outer SELECT and the subquery.
The only thing I can see if that (select distinct load_date, accn_id from accn_demographics) might return several matches. Look at your data and run a separate query
select distinct load_date, accn_id from accn_demographics WHERE accn_id=SomeID
where SomeID is one of the result accounts that is returning double/triple values. That should pinpoint your problem.
Yes, but it's not so obvious for beginners. What happens is that for every accn_payments record, you're matching on ONLY the accn_id, which means if there are multiple records in accn_demographics for that particular accn_id, then you will get duplicate accn_payment records due to the join. Is there another limiting field on accn_demographics to join back to the payments?
Ultimately, think of it this way:
accn_payments (p):
accn_id | paid_amt | copay_amt | ...
----------------------------------------------------
1 | 100.00 | 20.00 | ...
accn_demographics (d):
accn_id | load_date | ...
------------------------------------
1 | 2012/01/01 | ...
1 | 2012/03/05 | ...
1 | 2012/06/23 | ...
After joining, your results will look like this:
p.accn_id | p.paid_amt | p.copay_amt | p... | d.accn_id | d.load_date | d...
----------------------------------------------------------------------------
1 | 100.00 | 20.00 | .... | 1 | 2012/01/01 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/03/05 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/06/21 | ....
As you can see, the same row from accn_payments gets replicated for every matching accn_demographics record, since you specified only the accn_id column to be the join criteria. It can't limit the results any further, so it the DB engine says "Hey, look, this p record matches for all these d records, this must be what he was asking for!" Obviously not what was intended, as when you sum on the p.paid_amt and p.copay_amt, it performs a sum for ALL ROWS (even though they are duplicated).
Ultimately, see if you can limit the join criteria for accn_demographics even further (by some date, perhaps), that way you limit the number of duplicate payment records during the join.

MIN() Function in SQL

Need help with Min Function in SQL
I have a table as shown below.
+------------+-------+-------+
| Date_ | Name | Score |
+------------+-------+-------+
| 2012/07/05 | Jack | 1 |
| 2012/07/05 | Jones | 1 |
| 2012/07/06 | Jill | 2 |
| 2012/07/06 | James | 3 |
| 2012/07/07 | Hugo | 1 |
| 2012/07/07 | Jack | 1 |
| 2012/07/07 | Jim | 2 |
+------------+-------+-------+
I would like to get the output like below
+------------+------+-------+
| Date_ | Name | Score |
+------------+------+-------+
| 2012/07/05 | Jack | 1 |
| 2012/07/06 | Jill | 2 |
| 2012/07/07 | Hugo | 1 |
+------------+------+-------+
When I use the MIN() function with just the date and Score column I get the lowest score for each date, which is what I want. I don't care which row is returned if there is a tie in the score for the same date. Trouble starts when I also want name column in the output. I tried a few variation of SQL (i.e min with correlated sub query) but I have no luck getting the output as shown above. Can anyone help please:)
Query is as follows
SELECT DISTINCT
A.USername, A.Date_, A.Score
FROM TestTable AS A
INNER JOIN (SELECT Date_,MIN(Score) AS MinScore
FROM TestTable
GROUP BY Date_) AS B
ON (A.Score = B.MinScore) AND (A.Date_ = B.Date_);
Use this solution:
SELECT a.date_, MIN(name) AS name, a.score
FROM tbl a
INNER JOIN
(
SELECT date_, MIN(score) AS minscore
FROM tbl
GROUP BY date_
) b ON a.date_ = b.date_ AND a.score = b.minscore
GROUP BY a.date_, a.score
SQL-Fiddle Demo
This will get the minimum score per date in the INNER JOIN subselect, which we use to join to the main table. Once we join the subselect, we will only have dates with names having the minimum score (with ties being displayed).
Since we only want one name per date, we then group by date and score, selecting whichever name: MIN(name).
If we want to display the name column, we must use an aggregate function on name to facilitate the GROUP BY on date and score columns, or else it will not work (We could also use MAX() on that column as well).
Please learn about the GROUP BY functionality of RDBMS.
SELECT Date_,Name,MIN(Score)
FROM T
GROUP BY Name
This makes the assumption that EACH NAME and EACH date appears only once, and this will only work for MySQL.
To make it work on other RDBMSs, you need to apply another group function on the Date column, like MAX. MIN. etc
SELECT T.Name, T.Date_, MIN(T.Score) as Score FROM T
GROUP BY T.Date_
Edit: This answer is not corrected as pointed out by JNK in comments
SELECT Date_,MAX(Name),MIN(Score)
FROM T
GROUP BY Date_
Here I am using MAX(NAME), it will pick one name if two names were found with the same goal numbers.
This will find Min score for each day (no duplicates), scored by any player. The name that starts with Z will be picked first than the name that starts with A.
Edit: Fixed by removing group by name