In my SQLite database, in each table, there is a sync_id column. I regularly want to retrieve the maximum sync_id for each table. Here's what I tried first:
SELECT
MAX(answer.sync_id),
MAX(community.sync_id),
MAX(question.sync_id),
MAX(topic.sync_id)
FROM
answer,
community,
question,
topic;
This query took forever; I never actually saw it finish.
Here's what I tried next:
SELECT "answer" AS name, MAX(answer.sync_id) AS max_sync_id FROM answer
UNION SELECT "community" AS name, MAX(community.sync_id) AS max_sync_id FROM community
UNION SELECT "question" AS name, MAX(question.sync_id) AS max_sync_id FROM question
UNION SELECT "topic" AS name, MAX(topic.sync_id) AS max_sync_id FROM topic;
This one is blazingly fast and gives me the results I expected.
I have 2 questions about this:
Why are the 2 queries so different? I'm guessing there's some SQL semantics that I'm not getting, some kind of implicit JOIN...
The 1st query returns the maximums as one row, with columns named after the tables. The 2nd query returns 1 maximum per row, and I had to create a name column to keep the context. Is there a way I could get the result set of the 1st query, with the speed of the 2nd query?
1/ Why are the queries so different
Because the first one builds the cartesian product of the 4 tables as one huge intermediate result before aggregating over it, while the second one runs 1 query per table and then stitches the 4 one-row results together. The execution plans of the two queries show this in detail.
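For example, in SQLite you can ask for the plan directly (the exact wording varies by version, but one SCAN line per table reveals the nested full scans of a cartesian product):
EXPLAIN QUERY PLAN
SELECT MAX(answer.sync_id), MAX(community.sync_id),
       MAX(question.sync_id), MAX(topic.sync_id)
FROM answer, community, question, topic;
-- Typical plan: SCAN answer / SCAN community / SCAN question / SCAN topic,
-- i.e. four nested scans whose row counts multiply together.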
2/ Is there a way to get the result set of the 1st query with the speed of the 2nd query?
Not via a join: it seems your 4 tables are not related in any way, so joining them only multiplies rows (that is the cartesian product above). The usual route is to make 4 queries and combine the results in your application.
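That said, if the one-row shape really matters, a sketch worth trying (same table names as the question) is to compute each maximum as an independent scalar subquery; each subquery costs about the same as the corresponding branch of the UNION version, and an index on each sync_id column would let every MAX() be answered by a single index lookup:
SELECT
    (SELECT MAX(sync_id) FROM answer)    AS answer,
    (SELECT MAX(sync_id) FROM community) AS community,
    (SELECT MAX(sync_id) FROM question)  AS question,
    (SELECT MAX(sync_id) FROM topic)     AS topic;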
Title pretty much says it all: using Oracle SQL, I'd like to get, as quickly as possible, three records that share an ID from a very large table. The rows are not duplicates; they share one ID (rID) but differ in another (mID).
One approach I know would work (though it would be very slow) is to load the first, say, 1000 records into a C# program, then execute a COUNT query per ID until I hit one with 3 records, and return that ID. I know this is a terrible approach, but it should give an idea of what I want to get out of this.
I've tried using GROUP BY, and this would work, but it would be unacceptably slow. I don't care about the state of the rest of the table; I just need a single ID that has three records. Ideally I'd do something like a GROUP BY that stops after finding the first ID with three or more records and just returns that one. There are over a million records in the table, so efficiency is important.
What you describe translates to:
select the_id
from the_table
group by the_id
having count(*) >= 3
fetch first row only;
This should be as fast as it gets. You can help Oracle by providing an index on the id. That's about it.
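For instance (the index name is made up; the_table and the_id are the placeholders from the query above):
CREATE INDEX the_table_the_id_ix ON the_table (the_id);
With that in place, Oracle can count entries per the_id by reading the index alone instead of the table.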
I'm not a database guru and feel like I'm missing some core SQL knowledge to grok a solution to this problem. Here's the situation as briefly as I can explain it.
Context:
I have a SQLite database table that contains timestamped user event records. The records can be uniquely identified by the combination of timestamp and user ID (i.e., when the event took place and who the event is about). I understand this situation is called a "composite primary key." The table looks something like this (with a bunch of other columns removed, of course):
sqlite> select Last_Updated,User_ID from records limit 4;
Last_Updated User_ID
------------- --------
1434003858430 1
1433882146115 3
1433882837088 3
1433964103500 2
Question: How do I SELECT a result set containing only the most recent record for each user?
Given the above example, what I'd like to get back is a table that looks like this:
Last_Updated User_ID
------------- --------
1434003858430 1
1433882837088 3
1433964103500 2
(Note that the result set only includes user 3's most recent record.)
In reality, I have approximately 2.5 million rows in this table.
Bonus: I've been reading answers about JOINs, de-dupe procedures, and a bunch more, and I've been googling for tutorials/articles in the hopes that I would find what I'm missing. I have an extensive programming background, so I could de-dupe this dataset in procedural code like I've done a hundred times before, but I'm tired of writing scripts to do what I believe should be possible in SQL. That's what it's for, right?
So, what do you think is missing from my understanding of SQL, conceptually, that I need in order to understand why the solution you've provided actually works? (A reference to a good article that explains the theory behind the practice would suffice.) I want to know WHY the solution works, not just that it does.
Many thanks for your time!
You could try this:
select user_id, max(last_updated) as latest
from records
group by user_id
This should give you the latest record per user. I assume you have an index on user_id and last_updated combined.
In the above query, generally speaking, we are asking the database to group the rows by user_id. If there is more than one record for user_id 1, they are all grouped together; from that group, the maximum last_updated is picked for output. Then the next group is processed the same way.
If you have a composite index, SQLite will likely just use the index, because it contains both fields addressed in the query. Indexes are smaller than the table itself, so scanning or seeking them is faster.
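A sketch of such an index (the name is made up):
CREATE INDEX idx_records_user_updated ON records (User_ID, Last_Updated);
With it, SQLite can answer MAX(Last_Updated) for each User_ID from the index alone, without touching the table rows.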
Well, in true "d'oh!" fashion, right after I ask this question, I find the answer.
For my case, the answer is:
SELECT MAX(Last_Updated),User_ID FROM records GROUP BY User_ID
I was making this more complicated than it needed to be by thinking I needed to use JOINs and stuff. Applying an aggregate function like MAX() collapses each group to a single row carrying the aggregate's result (and, in SQLite specifically, any other "bare" columns in the SELECT list are taken from the row that supplied the MAX). That means this statement…
SELECT MAX(Last_Updated),User_ID FROM records
…would therefore return a result set containing only 1 row, the most recent event.
By adding the GROUP BY clause, however, the result set contains a row for each "group" of results, i.e., for each user. My programmer-brain did not understand that GROUP BY is how we say "for each" in SQL. I think I get it now.
Note to self: keep it simple, stupid. :)
I have a table in MS Access 2010 that I'm trying to analyze, of people who belong to various groups and have completed various jobs. What I would like is the standard deviation of the number of jobs each person has completed, per group. That is, for each group I'd like a single number: the standard deviation of the per-person job counts.
The data is structured like this:
OldGroup, OldPerson, JobID
I know that I need to do a COUNT of the job IDs by Group and Person. I tried creating a subquery to work with, but that didn't work:
SELECT data.OldGroup, STDEV(
    SELECT COUNT(data.JobID)
    FROM data
    WHERE data.Classification = 1
    GROUP BY data.OldGroup, data.OldPerson
)
FROM data
GROUP BY data.OldGroup;
This returned an error "At most one record can be returned by this subquery," which I know is wrong, since when I tried to run the subquery as a standalone query it successfully returned more than one record.
Question:
How can I get the STDEV of a COUNT?
Subquestion: If this question can be answered by correcting incorrect syntax in my examples, please do so.
A minor change in strategy, which wouldn't work for all cases but did work for this one, took care of the problem: instead of sticking the subquery in the SELECT clause, I put it in FROM, mimicking a separate table.
As such, my code looks like:
SELECT OldGroup, STDEV(NumberJobs) AS JobsStDev
FROM (
    SELECT OldGroup, OldPerson, COUNT(JobID) AS NumberJobs
    FROM data
    WHERE data.Classification = 1
    GROUP BY OldGroup, OldPerson
) AS TempTable
GROUP BY OldGroup;
That seemed to get the job done.
Try doing a make-table query for "SELECT COUNT(data.JobID)...."
Then, for the 2nd query, use the new base table.
Sometimes it is just easier to do something in 2 or more queries.
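A hedged sketch of that two-step idea in Access SQL (JobCounts is a made-up name for the new table; SELECT ... INTO is Access's make-table form):
SELECT OldGroup, OldPerson, COUNT(JobID) AS NumberJobs
INTO JobCounts
FROM data
WHERE Classification = 1
GROUP BY OldGroup, OldPerson;

SELECT OldGroup, STDEV(NumberJobs) AS JobsStDev
FROM JobCounts
GROUP BY OldGroup;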
I have a query formed by a UNION ALL of two tables. The results have to be ordered and paginated (like the typical list of a web application).
The original query (simplified) is:
SELECT name, id
FROM _test1 -- conditions WHERE
UNION ALL
SELECT name, id
FROM _test2 -- conditions WHERE
ORDER BY name DESC LIMIT 10,20
The problem is that the 2 tables have more than 1 million rows each, and the query is very slow.
How can I get an optimized paginated list from a UNION ALL?
P.S.:
I've searched Stack Overflow and found some similar questions, but the answers were incorrect or the questions weren't exactly the same. Two examples:
Optimize a UNION mysql query
Combining UNION and LIMIT operations in MySQL query
I'm surprised that nobody on Stack Overflow has been able to answer this question. Maybe it is impossible to make this query more efficient? What could be a solution to this problem?
I would think that you could use something similar to the solution in your second link to at least help performance, but I doubt that you'll be able to get great performance on later pages. For example:
( SELECT name, id
FROM _test1 -- conditions WHERE
ORDER BY name DESC LIMIT 0, 30
)
UNION ALL
( SELECT name, id
FROM _test2 -- conditions WHERE
ORDER BY name DESC LIMIT 0, 30
)
ORDER BY name DESC
LIMIT 10, 20
You're basically limiting each subquery to the subset of rows that could possibly appear on the requested page. Since LIMIT 10, 20 needs rows 11 through 30 of the merged result, each subquery only has to supply its own first 30 rows (LIMIT 0, 30); in general, each subquery needs offset + page size rows. This way the server only retrieves and merges 30 rows from each table before determining which 20 to return. Otherwise it will potentially grab all of the rows from each table, order and merge them, then start trying to find the correct rows.
I don't use MySQL a lot though, so I can't guarantee that the engine will behave how I think it should :)
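One thing that should help regardless: an index on name in each table lets the per-table ORDER BY name DESC LIMIT be satisfied by walking the index instead of sorting. A sketch (the index names are made up, and if your WHERE conditions filter on other columns, those columns may need to lead the index instead):
CREATE INDEX idx_test1_name ON _test1 (name);
CREATE INDEX idx_test2_name ON _test2 (name);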
In any event, once you get to later pages you're still going to be merging larger and larger datasets. HOWEVER, I am of the strong opinion that a UI should NEVER allow a user to retrieve a set of records that lets them go to (for example) page 5000. That's simply too much data for a human mind to find useful all at once and should require further filtering. Maybe let them see the first 100 pages (or some other number), but otherwise they have to constrain the results better. Just my opinion though.
Say I want to check whether a record exists in a MySQL table. I'd run a query and check the number of rows returned: if 0 rows, do this; otherwise, do that.
SELECT * FROM table WHERE id=5
SELECT id FROM table WHERE id=5
Is there any difference at all between these two queries? Is effort spent in returning every column, or is effort spent in filtering out the columns we don't care about?
SELECT COUNT(*) FROM table WHERE id=5
Is a whole new question. Would the server grab all the values and then count the values (harder than usual), or would it not bother grabbing anything and just increment a variable each time it finds a match (easier than usual)?
I think I'm making a lot of false assumptions about how MySQL works, but that's the meat of the question! Where am I wrong? Educate me, Stack Overflow!
Optimizers are pretty smart (generally). They typically only grab what they need so I'd go with:
SELECT COUNT(1) FROM mytable WHERE id = 5
The most explicit way would be
SELECT CASE WHEN EXISTS (SELECT 1 FROM table WHERE id = 5) THEN 1 ELSE 0 END;
If there is an index on (or starting with) id, it will only search, with maximum efficiency, for the first entry in the index it can find with that value. It won't read the record.
If you SELECT COUNT(*) (or COUNT anything else) it will, under the same circumstances, count the index entries, but not read the records.
If you SELECT *, it will read all the records.
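In MySQL specifically, that CASE wrapper can be shortened to an EXISTS in the SELECT list, which returns the same 1-or-0 answer (backticks because the question's table is named with a reserved word):
SELECT EXISTS(SELECT 1 FROM `table` WHERE id = 5) AS row_exists;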
Limit your results to at most one row by appending LIMIT 1, if all you want to do is check the presence of a record.
SELECT id FROM table WHERE id=5 LIMIT 1
This will definitely ensure that no more than one row is returned or processed. In my experience, LIMIT 1 (or TOP 1, depending on the DB) to check for the existence of a row makes a big difference in terms of performance for large tables.
EDIT: I think I misread your question, but I'll leave my answer here anyway if it's of any help.
I would think this
SELECT null FROM table WHERE id = 5 LIMIT 1;
would be faster than this
SELECT 1 FROM table WHERE id = 5 LIMIT 1;
but the timer says the winner is "SELECT 1".
For the first two queries, most people will generally say: always specify exactly what you need and leave the rest. The effort isn't all server-side; bandwidth is also spent returning data that you aren't even going to do anything with.
As for counting, the previous answers will do for your result set, unless you're dealing with an interface that reports the number of affected or returned rows; that can sometimes tell you how many rows the last query produced without a separate COUNT. You'll need to look at your interface documentation for how to get that information.
The difference between your 3 queries depends on how you've built your index. Only returning the primary key is likely to be faster as MySQL will have your index in memory, and not have to hit disk. Adding the LIMIT 1 is also a good trick that will speed up the optimizer significantly in early 5.0.x branches and earlier.
Try EXPLAIN SELECT id FROM table WHERE id=5 and check the Extra column for the presence of "Using index". If it's there, then your query is coming straight from the index and is going to be much faster.
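For illustration, assuming a covering index named idx_id on the id column (both names made up), the check might look like this; the exact values depend on your schema and MySQL version:
EXPLAIN SELECT id FROM `table` WHERE id = 5;
-- Abridged, illustrative output:
-- id | select_type | table | type | key    | Extra
-- 1  | SIMPLE      | table | ref  | idx_id | Using index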