I have the following data:
ExamEntry   Student_ID   Grade
11          1            80
12          2            70
13          3            20
14          3            68
15          4            75
I want to find all the students that passed the exam. If one student attended the exam several times, I need to take only the last result.
So, in this case I'd get that all students passed.
Can I find this with one fast query? Currently I do it in two steps:
First, find the latest entries:
select max(ExamEntry) from data group by Student_ID
Then find the results:
select ExamEntry from data where ExamEntry in ( )
But this is VERY slow - I get around 1000 entries, and this 2-step process takes 10 seconds.
Is there a better way?
Thanks.
If your query is very slow with just 1000 records in your table, something is wrong. For a modern database system, a table containing 1,000 entries is considered very, very small.
Most likely, you did not provide a (primary) key for your table?
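For example, assuming MySQL and the table from the question, a primary key plus a supporting index might be added like this (a sketch; the index name is made up):
ALTER TABLE data ADD PRIMARY KEY (ExamEntry);
CREATE INDEX idx_student_exam ON data (Student_ID, ExamEntry);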
Assuming that a student passes if at least one of the grades is above the minimum needed, the appropriate query would be:
SELECT
    Student_ID,
    MAX(Grade) AS maxGrade
FROM table_name
GROUP BY Student_ID
HAVING maxGrade > MINIMUM_GRADE_NEEDED
If you really need the latest grade to be above the minimum:
SELECT
    Student_ID,
    Grade
FROM table_name
WHERE ExamEntry IN (
    SELECT MAX(ExamEntry)
    FROM table_name
    GROUP BY Student_ID
)
AND Grade > MINIMUM_GRADE_NEEDED
SELECT student_id, MAX(ExamEntry)
FROM data
WHERE Grade > :threshold
GROUP BY student_id
Like this?
I'll assume that you have a student table and a test table, and that the table you are showing us is the test_result table... (if you don't have a similar structure, you should revisit your schema)
select s.id, s.name, t.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
left outer join test t on r.test_id = t.id
group by s.id, s.name, t.name
All the ID fields should be indexed.
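For instance, assuming MySQL and the table names above, that advice might translate into something like this (a sketch; the index names are made up, and columns declared as primary keys are indexed automatically):
CREATE INDEX idx_test_result_student ON test_result (student_id);
CREATE INDEX idx_test_result_test ON test_result (test_id);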
If you really only have a single test (type) in your domain... then the query would be
select s.id, s.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
group by s.id, s.name
I've used the hints given here, and here is the query I found that runs almost 3 orders of magnitude faster than my first one (0.03 sec instead of 10 sec):
SELECT ExamEntry, Student_ID, Grade
FROM data,
     (SELECT MAX(ExamEntry) AS ExId FROM data GROUP BY Student_ID) AS newdata
WHERE `data`.`ExamEntry` = `newdata`.`ExId` AND Grade > 60;
Thanks All!
As mentioned, indexing is a powerful tool for speeding up queries. The order of the index, however, is fundamentally important.
An index in order of (ExamEntry) then (Student_ID) then (Grade) would be next to useless for finding exams where the student passed.
An index in the opposite order would fit perfectly, if all you wanted was to find what exams had been passed. This would enable the query engine to quickly identify rows for exams that have been passed, and just process those.
In MS SQL Server this can be done with...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Grade],
[Student_ID],
[ExamEntry]
)
ON [PRIMARY]
(I recommend reading more about indexes to see what other options there are, such as clustered indexes, etc.)
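For instance, a clustered index (there can be only one per table, since it dictates the physical order of the rows) could be declared like this in SQL Server (a sketch, reusing the same hypothetical [results] table):
CREATE CLUSTERED INDEX [IX_results_clustered] ON [dbo].[results]
(
    [Student_ID],
    [ExamEntry]
)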
With that index, the following query would be able to ignore the 'failed' exams very quickly, and just display the students who ever passed the exam...
(This assumes that if you ever get over 60, you're counted as a pass, even if you subsequently take the exam again and get 27.)
SELECT
Student_ID
FROM
[results]
WHERE
Grade >= 60
GROUP BY
Student_ID
Should you definitely need the most recent value, then you need to change the order of the index back to something like...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Student_ID],
[ExamEntry],
[Grade]
)
ON [PRIMARY]
This is because the first thing we are interested in is the most recent ExamEntry for any given student. Which can be achieved using the following query...
SELECT
*
FROM
[results]
WHERE
[results].ExamEntry = (
SELECT
MAX([student_results].ExamEntry)
FROM
[results] AS [student_results]
WHERE
[student_results].Student_ID = [results].student_id
)
AND [results].Grade > 60
Having a subquery like this can appear slow, especially since it appears to be executed for every row in [results].
This, however, is not the case...
- Both the main query and the subquery reference the same table
- The query engine scans through the index for every unique Student_ID
- The subquery is executed for that Student_ID
- The query engine is already in that part of the index
- So a new index lookup is not needed
EDIT:
A comment was made that at 1000 records indexes are not relevant. It should be noted that the question states that there are 1000 records returned, not that the table contains 1000 records. For a basic query to take as long as stated, I'd wager there are many more than 1000 records in the table. Maybe this can be clarified?
EDIT:
I have just investigated 3 queries against a table of 999 records (3 exam results for each of 333 students):
Method 1: WHERE a.ExamEntry = (SELECT MAX(b.ExamEntry) FROM results [b] WHERE b.Student_ID = a.Student_ID)
Method 2: WHERE a.ExamEntry IN (SELECT MAX(ExamEntry) FROM results GROUP BY Student_ID)
Method 3: using an INNER JOIN instead of the IN clause (a sketch of this form appears after the table below)
The following relative query costs were found:
Method   Query Cost (No Index)   Query Cost (With Index)
1        23%                     9%
2        38%                     46%
3        38%                     46%
So Method 1 is faster regardless of indexes, and indexes also make Method 1 substantially faster still.
The reason for this is that indexes allow lookups where otherwise you need a scan: the difference between a linear law and a square law.
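Method 3 above is only named, not shown; it presumably looked something like this sketch (the derived-table shape and aliases are my assumption):
SELECT r.*
FROM [results] r
INNER JOIN (
    SELECT Student_ID, MAX(ExamEntry) AS LastEntry
    FROM [results]
    GROUP BY Student_ID
) latest ON latest.Student_ID = r.Student_ID
        AND latest.LastEntry = r.ExamEntry
WHERE r.Grade > 60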
Thanks for the answers!!
I think that Dems is probably closest to what I need, but I will elaborate a bit on the issue.
Only the latest grade counts. If the student passed the first time, attended again and failed, he failed in total. He/she could have attended 3 or 4 exams, but still only the last one counts.
I use MySQL server. I experience the problem on both Linux and Windows installations.
My data set is around 2K entries now and grows at ~1K per new exam.
The query for a specific exam also returns ~1K entries; since ~1K is also the number of students who attended (obtained by SELECT DISTINCT Student_ID FROM results;), almost all students passed and some failed.
I perform the following query in my code:
SELECT ExamEntry, Student_ID FROM exams WHERE ExamEntry IN (SELECT MAX(ExamEntry) FROM exams GROUP BY Student_ID)
As the subquery returns ~1K entries, it appears that the main query scans them in a loop, making the whole query run for a very long time at 50% server load (100% on Windows).
I feel that there is a better way :-), I just can't find it yet.
select examentry, student_id, grade
from data
where examentry in
    (select max(examentry)
     from data
     where grade > 60
     group by student_id)
don't use
    where grade > 60
but
    where grade between 60 and 100
That should go faster.
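Putting the two together, the suggested query would read (a sketch; this assumes grades never exceed 100, so the range predicate is equivalent):
select examentry, student_id, grade
from data
where examentry in
    (select max(examentry)
     from data
     where grade between 60 and 100
     group by student_id)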
Related
I need to match up an employee with a task in a small Microsoft Access DB I built. Essentially, I have a list of 45 potential tasks, and I have 25 employees. What I need is:
Each employee to have at LEAST one task
No employee to have more than TWO
Be able to randomize the results every time I run the query (so the same people don't get consistently the same tasks)
My table structure is:
Employees - w/ fields: ID, Name
Tasks - w/ fields: ID, Location, Task Group, Task
I know this is a dumb question, but I truly am struggling. I have searched through SO and Google for help but have been unsuccessful.
I don't have a way to link employees to tasks, since each employee is capable of every task, so I was going to:
1. SELECT * from Employees
2. SELECT * from Tasks
3. Union
4. COUNT(Name) <= 2
But I don't know how to randomize those results so that folks are randomly matched up, with each person at least once and nobody more than twice.
Any help or guidance is appreciated. Thank you.
Consider a cross join with an aggregate query that randomizes the choice set. Currently, at 45 × 25, this yields a cartesian product of 1,125 records, which is manageable.
Select query (save as a query object, assumes Tasks has autonumber field)
SELECT cj.[Emp_Name], Max(cj.ID) As M_ID, Max(cj.Task) As M_Task
FROM
(SELECT e.[Emp_Name], t.ID, t.Task
FROM Employees e,
Tasks t) cj
GROUP BY cj.[Emp_Name], Rnd(cj.ID)
ORDER BY cj.[Emp_Name], Rnd(cj.ID)
However, the challenge here is this above query randomizes the order of all 45 tasks per each of the 25 employees whereas you need the top two tasks per employee. Unfortunately, MS Access does not have a row id like other DBMS to use to select top 2 per employee. And we cannot use a correlated subquery on Task ID per Employee since this will always return the highest two task IDs by their value and not random top two IDs.
Therefore to do so in Access, you will need a temp table regularly cleaned out prior to each allocation of employee tasks and use autonumber for selection via correlated subquery.
Create table (run once, autonumber field required)
CREATE TABLE CrossJoinRandomPicks (
ID AUTOINCREMENT PRIMARY KEY,
Emp_Name TEXT(255),
M_ID LONG,
M_Task TEXT(255)
)
Delete query (run regularly)
DELETE FROM CrossJoinRandomPicks;
Append query (run regularly)
INSERT INTO CrossJoinRandomPicks ([Emp_Name], [M_ID], [M_Task])
SELECT [Emp_Name], [M_ID], [M_Task]
FROM mySavedCrossJoinQuery;
Final query (selects top two random tasks for each employee)
SELECT c.Emp_Name, c.M_Task
FROM CrossJoinRandomPicks c
WHERE
    (SELECT Count(*) FROM CrossJoinRandomPicks sub
     WHERE sub.Emp_Name = c.Emp_Name
     AND sub.ID <= c.ID) <= 2;
I wrote several SQL queries and executed them against my table. Each individual query worked. I kept adding functionality until I got a really ugly working query. The problem is that I have to manually change a value every time I want to use it. Can you assist in making this query automatic rather than “manual”?
I am working with DB2.
Table below shows customers (cid) from 1 to 3. 'club' is a book seller, and 'qnty' is the number of books the customer bought from each 'club'. The full table has 45 customers.
Image below shows all the table elements for the first 3 users (cid=1 OR cid=2 OR cid=3). The final purpose of all my queries (once combined) is to find the single 'club' with the largest 'qnty' for each 'cid'. So for 'cid=1' the 'club' is Readers Digest with 'qnty' of 3. For 'cid=2' the 'club' is YRB Gold with 'qnty' of 5. On and on until cid 45 is reached.
To give you a background on what I did here are my queries:
(Query 1-starting point for cid=1)
SELECT * FROM yrb_purchase WHERE cid=1
(Query 2 - find the 'club' with the highest 'qnty' for cid=1)
SELECT *
FROM
(SELECT club,
sum(qnty) AS t_qnty
FROM yrb_purchase
WHERE cid=1
GROUP BY club)results
ORDER BY t_qnty DESC
(Query 3 – combine the record from the above query with its cid)
SELECT cid,
temp.club,
temp.t_qnty
FROM yrb_purchase AS p,
(SELECT *
FROM
(SELECT club,
sum(qnty) AS t_qnty
FROM yrb_purchase
WHERE cid=1
GROUP BY club)results
ORDER BY t_qnty DESC FETCH FIRST 1 ROWS ONLY) AS TEMP
WHERE p.cid=1
AND p.club=temp.club
(Query 4) make sure there is only one record for cid=1
SELECT cid,
temp.club,
temp.t_qnty
FROM yrb_purchase AS p,
(SELECT *
FROM
(SELECT club,
sum(qnty) AS t_qnty
FROM yrb_purchase
WHERE cid=1
GROUP BY club)results
ORDER BY t_qnty DESC FETCH FIRST 1 ROWS ONLY) AS TEMP
WHERE p.cid=1
AND p.club=temp.club FETCH FIRST ROWS ONLY
To get the 'club' with the highest 'qnty' for customer 2, I would simply change the text cid=1 to cid=2 in the last query above. My query seems to always produce the correct results. My question is: how do I modify my query to get the results for all 'cid's from 1 to 45 in a single table? How do I get a table with all the cid values along with the club that sold that cid the most books, and how many books were sold? Please keep in mind I am hoping you can modify my query as opposed to providing a better query.
If you decide that my query is way too ugly (I agree with you) and choose to provide another query, please be aware that I just started learning SQL and may not be able to understand your query. You should be aware that I already asked this question: For common elements, how to find the value based on two columns? SQL but I was not able to make the answer work (due to my SQL limitations - not because the answer wasn't good); and in the absence of a working answer I could not reverse engineer it to understand how it works.
Thanks in advance
****************************EDIT #1*******************************************
The results of the answer: [result screenshot not preserved]
You could use OLAP/Window Functions to achieve this:
SELECT
cid,
club,
qnty
FROM
(
SELECT
cid,
club,
qnty,
ROW_NUMBER() OVER (PARTITION BY cid order by qnty desc) as cid_club_rank
FROM
(
SELECT
cid,
club,
sum(qnty) as qnty
FROM yrb_purchase
GROUP BY cid, club
) as sub1
) as sub2
WHERE cid_club_rank = 1
The innermost statement (sub1) just grabs a total quantity for each cid/club combination. The middle statement (sub2) creates a row number for each cid/club combination, ordering by the quantity (top down). Then the outermost query chooses only the records where that row number is 1.
I have a very strange problem. When I execute a query like the one below:
with ap as (
    SELECT id from adress limit 1000
)
SELECT distinct house.id, house.date
FROM house
WHERE house.adressid in (select id from ap)
LIMIT 9999
I get the results within 100 ms.
But when I change the limit to 10, I get the result only after 20 s:
with ap as (
    SELECT id from adress limit 1000
)
SELECT distinct house.id, house.date
FROM house
WHERE house.adressid in (select id from ap)
LIMIT 10
Of course there is an index on adressid:
CREATE INDEX house_idx
ON house
USING btree
(adressid COLLATE pg_catalog."default");
In house there are about 9 million rows.
Does anyone have an idea how I can improve the performance? I've reduced the problem to this very simple one, but in reality the structure is much more complex; that's why I didn't provide the table definitions and query plans...
That is actually not surprising:
In the first case ap has up to 1000 rows and the result set may contain up to 9999, so the optimizer puts adress first. With the index on house, query performance is relatively high.
In the second case ap still has up to 1000 rows, but the result set may contain only up to 10, so the optimizer puts house first... and ends up with 10 table scans on adress of up to 1000 rows each. It probably won't even be able to use an index, since there is no ORDER BY clause anywhere.
This limit 1000 on adress looks really suspicious and can potentially lead to inconsistent results: without ORDER BY there is no guarantee which records from adress would be taken into account on each run.
I would use INNER JOIN to resolve the issue:
SELECT DISTINCT house.id, house.date
FROM house
INNER JOIN adress ON adress.id = house.adressid
ORDER BY house.date --< To add some consistency
LIMIT 10
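To verify which plan the optimizer actually picks in each case, both variants can be run under EXPLAIN ANALYZE (a sketch, assuming PostgreSQL):
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT house.id, house.date
FROM house
INNER JOIN adress ON adress.id = house.adressid
ORDER BY house.date
LIMIT 10;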
Following on from my prior question about joins, I'm now having trouble with joins and comparisons using the count function.
I have a table called subjects
subno   subname    quota
30006   Math       300
31445   Science    400
31567   Business   250
I also have a another table called enrollment
subno   sno
30009   980008
4134    988880
31567   900890
etc. (Converted to SQLFiddle here: http://sqlfiddle.com/#!12/dcd01 -- Craig)
How do I list the subject number and name where the quota is less than the average quota of subjects? This means I need to count the number of students in one table and compare with the other table, correct?
After finally determining the question (deduced from comments) to be:
List all subjects with vacancies
The query you need is:
select
subno,
subname,
quota,
quota - count(sno) as vacancies
from subjects s
left join enrollment e on e.subno = s.subno
group by 1, 2, 3
having quota - count(sno) > 0
I also added in a column vacancies, which displays the number of vacancies remaining.
Note: You have misspelled "enrolments" (correct spelling has only one L) - I recommend you rename your table to the correct spelling to avoid future confusion.
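For example, the rename could be done like this (a sketch; the SQLFiddle link suggests PostgreSQL):
ALTER TABLE enrollment RENAME TO enrolment;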
select a.subno, b.subname
from
    (select subno, count(sno) as cnt
     from enrollment
     group by 1
     having count(sno) < (select avg(quota) from subjects)
    ) as a
inner join
    (select * from subjects) as b
on a.subno = b.subno
I'm learning SQL (using SQLite 3 and its sqlite3 command-line tool) and I've noticed that I can do some things in several ways, and sometimes it is not clear which one is better. Here are three queries which do the same thing: one uses intersect, another inner join and distinct, and the last is similar to the second but does its filtering through where. (The first one was written by the author of the book I'm reading, and the others I wrote myself.)
The question is, which of these queries is better and why? And, more generally, how can I know when one query is better than another? Are there some guidelines I missed, or perhaps I should learn SQLite internals despite the declarative nature of SQL?
(In the following example, there are tables that describe food names that are mentioned in some TV series. Foods_episodes is a many-to-many linking table while the others describe food names and episode names together with season numbers. Note that the all-time top ten foods (based on the count of their appearances in all seasons) are being looked for, not just the top foods in seasons 3..5.)
-- task
-- find the all-time top ten foods that appear in seasons 3 through 5
-- schema
-- CREATE TABLE episodes (
-- id integer primary key,
-- season int,
-- name text );
-- CREATE TABLE foods(
-- id integer primary key,
-- name text );
-- CREATE TABLE foods_episodes(
-- food_id integer,
-- episode_id integer );
select f.* from foods f
inner join
(select food_id, count(food_id) as count
from foods_episodes
group by food_id
order by count(food_id) desc limit 10) top_foods
on f.id=top_foods.food_id
intersect
select f.* from foods f
inner join foods_episodes fe on f.id = fe.food_id
inner join episodes e on fe.episode_id = e.id
where
e.season between 3 and 5
order by
f.name;
select
distinct f.*
from
foods_episodes as fe
inner join episodes as e on e.id = fe.episode_id
inner join foods as f on fe.food_id = f.id
inner join (select food_id from foods_episodes
group by food_id order by count(*) desc limit 10) as lol
on lol.food_id = fe.food_id
where
e.season between 3 and 5
order by
f.name;
select
distinct f.*
from
foods_episodes as fe
inner join episodes as e on e.id = fe.episode_id
inner join foods as f on fe.food_id = f.id
where
fe.food_id in (select food_id from foods_episodes
group by food_id order by count(*) desc limit 10)
and e.season between 3 and 5
order by
f.name;
-- output (same for these three):
-- id name
-- ---------- ----------
-- 4 Bear Claws
-- 146 Decaf Capp
-- 153 Hennigen's
-- 55 Kasha
-- 94 Ketchup
-- 164 Naya Water
-- 317 Pizza
-- CPU Time: user 0.000000 sys 0.000000
Similar to MySQL, SQLite has an EXPLAIN command. Prepend your select with the EXPLAIN keyword and it will return information about the query, including the number of rows scanned and the indexes used.
http://www.sqlite.org/lang_explain.html
By running EXPLAIN on various selects you can determine which queries (and subqueries) are more efficient than others.
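For example, a sketch against the schema above (EXPLAIN QUERY PLAN gives a more readable per-table summary than the raw EXPLAIN opcode listing):
EXPLAIN QUERY PLAN
select f.* from foods f
inner join foods_episodes fe on f.id = fe.food_id
inner join episodes e on fe.episode_id = e.id
where e.season between 3 and 5;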
And here is a general overview of SQLite's query planner and optimization: http://sqlite.org/optoverview.html
SQLite 3 also supports a callback function to profile queries. You have to implement it yourself though: http://www.sqlite.org/c3ref/profile.html
Generally, there is more than 1 way to solve a problem. If you are getting correct answers, the only other question is whether the process/script/statement needs to be improved, or if it works well now.
In SQL generally, there may be a "best" way, but it's usually not the goal to find a canonical best way to do something - you want a way that efficiently balances your needs from the program and your time. You can spend months optimizing a process, but if the process is used only weekly and it only takes 5 minutes now, reducing it to 4 minutes isn't much help.
It's weird to transition from a context where there are correct answers (like school) to a context where the goal is to get something done well, and "works well enough" trumps perfect because there are time constraints. It's something that took me a while to appreciate, but I'm not sure there is a better answer. Hope the perspective helps a bit!