Query optimisation for count comparisons - sql

I have the following query I am trying to optimize.
EXPLAIN
select clb.f_name, clb.l_name, noofbooks
from (
select f_name, l_name, count(*) as noofbooks from
customer natural join loaned_book
group by f_name, l_name
) as clb
where 3 > (
select count(*) from (
select f_name, l_name, count(*) as noofbooks
from customer natural join loaned_book
group by f_name, l_name
) as clb1
where clb.noofbooks<clb1.noofbooks
)
order by noofbooks desc;
Essentially this query is trying to find the "top three" counts (including ties i.e. not limited to 3) of the no. of books loaned by a customer.
The problem is related to the amount of counts that must be made in the query.
Is it possible to use the count values from the first query to reduce selected rows in the second query without recounting all of the rows?
This is a homework task so I am not expecting a direct answer. Any pointers would be appreciated.

Have a look at the dense_rank window function. You want all the rows that have a dense rank of 3 or less.

Related

SQL What do I need to group by?

I am trying to get a better understanding of group by and count in SQL and tried to find the student who has been studying for the second longest time.
I need to also group by s.semester for it to work, just group by s.name alone (which is what I had done initially) does not work - why is this? I know this is right, but am trying to understand why for future practice questions.
select s.name
from students s
group by s.name, s.semester
having 1 = (select count (gold.name)
from students gold
where gold.name <> s.name
and gold.semester > s.semester
)
Thanks in advance!
To solve this task, it is not enough to use grouping. You need to use ranking functions.
You will receive the rank of each subsequent student or students, based on the number of semesters.
You can get students by rank using "where".
select
*
from(
select
row_number() over (partition by name_students, semestr order by semestr_count
asc) r_n,
name_students,
semestr
from (
select name_students, semestr, count(semestr_count) from studies
group by name_students, semestr
) agregate_studies
where r_n = 1
Using r_n for get some students top _ n : 2..3

SQL: Select inside select?

I have a table of car accident in a major city, and the structure is like:
accident_table has the following columns:
id, caseno, date_of_occurrence, street, iucr, primary_type,
description, district, community_area, year, updated_on
I want to write a query that finds the street which has the most accidents for each district(I think the street count for each street is the number of accident that happened on that street).
Here is what I have:
SELECT DISTINCT on (street)
street,
district
FROM
(
SELECT
count(street) as street_cnt,
street,
district
FROM accident_table
)
WHERE street_count = (SELECT max(street_cnt))
It did not give me syntax error, but timed out, so I guess it took too long to run.
What's wrong and how to fix it?
Thanks,
Philip
First aggregate to get the count of accidents for each street. Then use the rank() window function to rank the streets within a district by the count of accidents in them. Then only select the ones that were ranked at the top.
SELECT x.district,
x.street,
x.accidents
FROM (SELECT a.district,
a.street,
count(*) accidents,
rank() OVER (PARTITION BY a.district
ORDER BY count(*) DESC) r
FROM accident_table a
GROUP BY a.district,
a.street) x
WHERE x.r = 1;
Your code looks like Postgres. In that database, you can express this without a subquery:
SELECT DISTINCT ON (a.district)
a.district, a.street, COUNT(*) as accidents
FROM accident_table a
GROUP BY a.district, a.street
ORDER BY a.district, COUNT(*) DESC;
That said, your problem is performance, which is probably not affected by subqueries. An index on accident_table(district, street) might help performance.

How to work with problems correlated subqueries that reference other tables, without using Join

I am trying to work on public dataset bigquery-public-data.austin_crime.crime of the BigQuery. My goal is to get the output as three column that shows the
discription(of the crime), count of them, and top district for that particular description(crime).
I am able to get the first two columns with this query.
select
a.description,
count(*) as district_count
from `bigquery-public-data.austin_crime.crime` a
group by description order by district_count desc
and was hoping I can get that done with one query and then I tried this in order to get the third column showing me the Top district for that particular description (crime) by adding the code below
select
a.description,
count(*) as district_count,
(
select district from
( select
district, rank() over(order by COUNT(*) desc) as rank
FROM `bigquery-public-data.austin_crime.crime`
where description = a.description
group by district
) where rank = 1
) as top_District
from `bigquery-public-data.austin_crime.crime` a
group by description
order by district_count desc
The error i am getting is this. "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."
I think i can do that by joins. Can someone has better solution possibly to do that using without join.
Below is for BigQuery Standard SQL
#standardSQL
SELECT description,
ANY_VALUE(district_count) AS district_count,
STRING_AGG(district ORDER BY cnt DESC LIMIT 1) AS top_district
FROM (
SELECT description, district,
COUNT(1) OVER(PARTITION BY description) AS district_count,
COUNT(1) OVER(PARTITION BY description, district) AS cnt
FROM `bigquery-public-data.austin_crime.crime`
)
GROUP BY description
-- ORDER BY district_count DESC

Trying to figure out how to join these queries

I have a table named grades. A column named Students, Practical, Written. I am trying to figure out the top 5 students by total score on the test. Here are the queries that I have not sure how to join them correctly. I am using oracle 11g.
This get's me the total sums from each student:
SELECT Student, Practical, Written, (Practical+Written) AS SumColumn
FROM Grades;
This gets the top 5 students:
SELECT Student
FROM ( SELECT Student,
, DENSE_RANK() OVER (ORDER BY Score DESC) as Score_dr
FROM Grades )
WHERE Student_dr <= 5
order by Student_dr;
The approach I prefer is data-centric, rather than row-position centric:
SELECT g.Student, g.Practical, g.Written, (g.Practical+g.Written) AS SumColumn
FROM Grades g
LEFT JOIN Grades g2 on g2.Practical+g2.Written > g.Practical+g.Written
GROUP BY g.Student, g.Practical, g.Written, (g.Practical+g.Written) AS SumColumn
HAVING COUNT(*) < 5
ORDER BY g.Practical+g.Written DESC
This works by joining with all students that have greater scores, then using a HAVING clause to filter out those that have less than 5 with a greater score - giving you the top 5.
The left join is needed to return the top scorer(s), which have no other students with greater scores to join to.
Ties are all returned, leading to more than 5 rows in the case of a tie for 5th.
By not using row position logic, which varies from darabase to database, this query is also completely portable.
Note that the ORDER BY is optional.
With Oracle's PLSQL you can do:
SELECT score.Student, Practical, Written, (Practical+Written) as SumColumn
FROM ( SELECT Student, DENSE_RANK() OVER (ORDER BY Score DESC) as Score_dr
FROM VOTES ) as score, students
WHERE score.score_dr <= 5
and score.Student = students.Student
order by score.Score_dr;
You can easily include the projection of the first query in the sub-query of the second.
SELECT Student
, Practical
, Written
, tot_score
FROM (
SELECT Student
, Practical
, Written
, (Practical+Written) AS tot_score
, DENSE_RANK() OVER (ORDER BY (Practical+Written) DESC) as Score_dr
FROM Grades
)
WHERE Student_dr <= 5
order by Student_dr;
One virtue of analytic functions is that we can just use them in any query. This distinguishes them from aggregate functions, where we need to include all non-aggregate columns in the GROUP BY clause (at least with Oracle).

How to optimize an SQLite3 query

I'm learning SQLite3 by means of a book ("Using SQLite") and the Northwind database. I have written the following code to order the customers by the number of customers in their city, then alphabetically by their name.
SELECT ContactName, Phone, City as originalCity
FROM Customers
ORDER BY (
SELECT count(*)
FROM Customers
WHERE city=originalCity)
DESC, ContactName ASC
It takes about 50-100ms to run. Is there a standard procedure to follow to optimize this query, or more generally, queries of its type?
In the most general case, query optimization starts with reading the query optimizer's execution plan. In SQLite, you just use
EXPLAIN QUERY PLAN statement
In your case,
EXPLAIN QUERY PLAN
SELECT ContactName, Phone, City as originalCity
FROM Customers
ORDER BY (
SELECT count(*)
FROM Customers
WHERE city=originalCity)
DESC, ContactName ASC
You might also need to read the output of
EXPLAIN statement
which goes into more low-level detail.
In general (not only SQLite), it's better to do the count for all values (cities) at once, and a join to construct the query:
SELECT ContactName, Phone, Customers.City as originalCity
FROM Customers
JOIN (SELECT city, count(*) cnt
FROM Customers
GROUP BY city) Customers_City_Count
ON Customers.city = Customers_City_Count.city
ORDER BY Customers_City_Count.cnt DESC, ContactName ASC
(to prevent, like in your case, the count from being computed many times for the same value (city))