Postgres Skipping Duplicate Field in Select - sql

I want to select a limited number of items but only keeping ones with a distinct value for a specific field. I have tried using SELECT DISTINCT ON(field) as well as GROUP BY but they are both extremely slow because the table is very large. I assume this is because using DISTINCT will actually sort the table into distinct values before selecting at all.
SELECT DISTINCT ON(parent) id FROM posts WHERE sub = ? LIMIT 25
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all. Similar to selecting a value with a condition, which (without an index) will scan each row and check if it meets the condition before continuing, how can I use not having duplicate fields as a condition?
Another way to think about it is how do I do this:
SELECT DISTINCT ON (parent) post.id FROM
(SELECT id FROM posts WHERE sub = ? ORDER BY id LIMIT 25) AS post
While guaranteeing that there are 25 results. Here the result is very fast but it will usually have less results than required because multiple rows can have the same parent.

The way you're thinking may seem to make sense, but if you think a little deeper, you'll find that it cannot work that way. You want 25 unique result. To give you that, first it needs to go through the records and find the unique ones then return the first 25.
What you actually want is for it to go through the records one by one and check, do I already have similar value? If yes, discard it and continue, if no, add it to the results. Now check, do I already have 25 results? If no, continue, if yes, stop and return the results.
This is not a trivial task to do in a query. Your best bet is to do it in a stored procedure with a cursor. That will be much easier as you are in full control of the flow, just follow the steps as per the description above.
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all.
If you really know that your first 25 results will be found in the first xx records (say first 100), and that's all you care to achieve, then you can use a somewhat dumb query:
SELECT DISTINCT ON (parent) post.id
FROM (SELECT id FROM posts WHERE sub = ? ORDER BY id LIMIT 100) AS post
LIMIT 25
Change the 100 to whatever suits your needs.

When you use distinct on, you should use order by:
SELECT DISTINCT ON (parent) id
FROM posts
WHERE sub = ?
ORDER BY parent
LIMIT 25;
To optimize this query, you want an index on posts(sub, parent, id).

Related

Does a query goes through all data when you only select the last N?

I have a query that select the last 5($new) items from my database.
SELECT OvenRunData.dataId AS id, OvenRunData.data AS data
FROM ovenRuns INNER JOIN OvenRunData ON OvenRuns.id = OvenRunData.ovenRunId
WHERE OvenRunData.ovenRunId = (SELECT MAX(id) FROM OvenRuns)
ORDER BY id DESC LIMIT '$new'
I want to execute this query every 5 seconds with an AJAX request so I can update my table.
I know this query select the last 5 records but I want to know if the query runs through all records and then selects the last 5 or does it select only the last 5 without checking all the data?
I'm really worried that I'll have lag.
You need two indexes to make it fast enough:
create index ix_OvenRuns_id on OvenRuns(id)
create index ix_OvenRunData_ovenRunId on OvenRunData(ovenRunId)
you can even put OvenRunData.dataId OvenRunData.data into the second one, or create clustered index, however, these indexes definitely avoid full data scan.
That depends on the indexes.
In your case, you should have one on OverRuns(id).
More here: http://use-the-index-luke.com/sql/partial-results/top-n-queries
The LIMIT is applied after the ORDER BY, and the ORDER BY is applied to the entire result-set. So the answer to your question is, yes it must go through all of the records in your result-set determined by your WHERE clause before applying the LIMIT.

How to get a optimized paginated list from a query that has a UNION ALL?

I have a query formed by an UNION ALL from two tables. The results have to be ordered and paginated (like the typical list of a web application).
The original query (simplified) is:
SELECT name, id
FROM _test1 -- conditions WHERE
UNION ALL
SELECT name, id
FROM _test2 -- conditions WHERE
ORDER BY name DESC LIMIT 10,20
The problem is that the 2 tables have more than 1 million rows each, and the query is very slow.
How can I get an optimized paginated list from a UNION ALL?
Postdata:
I've used the search of Stack Overflow and I've found some similar questions of this, but the answer was incorrect or the question isn't exactly the same. Two examples:
Optimize a UNION mysql query
Combining UNION and LIMIT operations in MySQL query
I'm surprised that in Stack Overflow nobody could answered this question. Maybe it is impossible to do this query more efficiently? What could be a solution of this problem?
I would think that you could use something similar to the solution in your second link to at least help performance, but I doubt that you'll be able to get great performance on later pages. For example:
( SELECT name, id
FROM _test1 -- conditions WHERE
ORDER BY name DESC LIMIT 0, 30
)
UNION ALL
( SELECT name, id
FROM _test2 -- conditions WHERE
ORDER BY name DESC LIMIT 0, 30
)
ORDER BY name DESC
LIMIT 10, 20
You're basically limiting each subquery to the subset of possible rows that might be on the given page. In this way you only need to retrieve and merge 20 rows from each table before determining which 10 to return. Otherwise the server will potentially grab all of the rows from each table, order and merge them, then start trying to find the correct rows.
I don't use MySQL a lot though, so I can't guarantee that the engine will behave how I think it should :)
In any event, once you get to later pages you're still going to be merging larger and larger datasets. HOWEVER, I am of the strong opinion that a UI should NEVER allow a user to retrieve a set of records that let them go to (for example) page 5000. That's simply too much data for a human mind to find useful all at once and should require further filtering. Maybe let them see the first 100 pages (or some other number), but otherwise they have to constrain the results better. Just my opinion though.

When is LIMIT applied? Will it select all results before limiting?

I'm concerned about the performance of a query such as SELECT * FROM user LIMIT 5 on a very large user table. Will it select all records then limit to 5?
More specifically will the following query select all assetids before limiting...
SELECT * FROM assets WHERE asset_id IN(1,2,3,4,5,6,7,8,9,10) LIMIT 5
I realize it doesn't make sense to include all ids in the IN() clause if I'm limiting but I'd like to know how mysql behaves in this situation.
Thanks.
This depends on your query. See this page for more explanations of how LIMIT is applied:
http://dev.mysql.com/doc/refman/5.0/en/limit-optimization.html
For that specific query, the following would apply:
"As soon as MySQL has sent the required number of rows to the client, it aborts the query unless you are using SQL_CALC_FOUND_ROWS."
Hope that helps.
Your query will have to scan all rows by asset_id column, so you better have an index on it. In my experience, you would always want to set an order by clause also, since the result set will be internally (i.e. order unknown), and you would not know why the returned 5 results were the ones you actually wanted.

Semi-Distinct MySQL Query

I have a MySQL table called items that contains thousands of records. Each record has a user_id field and a created (datetime) field.
Trying to put together a query to SELECT 25 rows, passing a string of user ids as a condition and sorted by created DESC.
In some cases, there might be just a few user ids, while in other instances, there may be hundreds.
If the result set is greater than 25, I want to pare it down by eliminating duplicate user_id records. For instance, if there were two records for user_id = 3, only the most recent (according to created datetime) would be included.
In my attempts at a solution, I am having trouble because while, for example, it's easy to get a result set of 100 (allowing duplicate user_id records), or a result set of 16 (using GROUP BY for unique user_id records), it's hard to get 25.
One logical approach, which may not be the correct MySQL approach, is to get the most recent record for each for each user_id, and then, if the result set is less than 25, begin adding a second record for each user_id until the 25 record limit is met (maybe a third, fourth, etc. record for each user_id would be needed).
Can this be accomplished with a MySQL query, or will I need to take a large result set and trim it down to 25 with code?
I don't think what you're trying to accomplish is possible as a SQL query. Your desire is to return 25 rows, no matter what the normal data groupings are whereas SQL is usually picky about returning based on data groupings.
If you want a purely MySQL-based solution, you may be able to accomplish this with a stored procedure. (Supported in MySQL 5.0.x and later.) However, it might just make more sense to run the query to return all 100+ rows and then trim it programmatically within the application.
This will get you the most recent for each user --
SELECT user_id, create
FROM items AS i1
LEFT JOIN items AS i2
ON i1.user_id = i2.user_id AND i1.create > i2.create
WHERE i2.id IS NULL
his will get you the most recent two records for each user --
SELECT user_id, create
FROM items AS i1
LEFT JOIN items AS i2
ON i1.user_id = i2.user_id AND i1.create > i2.create
LEFT JOIN items IS i3
ON i2.user_id = i3.user_id AND i2.create > i3.create
WHERE i3.id IS NULL
Try working from there.
You could nicely put this into a stored procedure.
My opinion is to use application logic, as this is very much application layer logic you are trying to implement at the DB level, i.e. filtering down the results to make the search more useful to the end user.
You could implement a stored procedure (personally I would never do such a thing) or just get the application to decide which 25 results.
One approach would be to get the most recent item from each user, followed by the most recent items from all users, and limit that. You could construct pathological examples where this probably isn't what you want, but it should be pretty good in general.
Unfortunately, there is no easy way :( I had to do something similar when I built a report for my company that would pull up customer disables that were logged in a database. Only problem was that the disconnect is ran and logged every 30 minutes. Therefore, the rows would not be distinct since the timestamp was different in every disconnect. I solved this problem with sub queries. I don't have the exact code anymore, but I beleive this is how I implemented it:
SELECT CORP, HOUSE, CUST,
(
SELECT TOP 1 hsd
FROM #TempTable t2
WHERE t1.corp = t2.corp
AND t1.house = t2.house
AND t1.cust = t2.cust
) DisableDate
FROM #TempTable t1
GROUP BY corp, house, cust -- selecting distinct
So, my answer is to elimante the non-distinct column from the query by using sub queries. There might be an easier way to do it though. I'm curious to see what others post.
Sorry, i keep editing this, I keep trying to find ways to make it easier to show what I did.

What's the most efficient way to check the presence of a row in a table?

Say I want to check if a record in a MySQL table exists. I'd run a query, check the number of rows returned. If 0 rows do this, otherwise do that.
SELECT * FROM table WHERE id=5
SELECT id FROM table WHERE id=5
Is there any difference at all between these two queries? Is effort spent in returning every column, or is effort spent in filtering out the columns we don't care about?
SELECT COUNT(*) FROM table WHERE id=5
Is a whole new question. Would the server grab all the values and then count the values (harder than usual), or would it not bother grabbing anything and just increment a variable each time it finds a match (easier than usual)?
I think I'm making a lot of false assumptions about how MySQL works, but that's the meat of the question! Where am I wrong? Educate me, Stack Overflow!
Optimizers are pretty smart (generally). They typically only grab what they need so I'd go with:
SELECT COUNT(1) FROM mytable WHERE id = 5
The most explicit way would be
SELECT WHEN EXISTS (SELECT 1 FROM table WHERE id = 5) THEN 1 ELSE 0 END
If there is an index on (or starting with) id, it will only search, with maximum efficiency, for the first entry in the index it can find with that value. It won't read the record.
If you SELECT COUNT(*) (or COUNT anything else) it will, under the same circumstances, count the index entries, but not read the records.
If you SELECT *, it will read all the records.
Limit your results to at most one row by appending LIMIT 1, if all you want to do is check the presence of a record.
SELECT id FROM table WHERE id=5 LIMIT 1
This will definitely ensure that no more than one row is returned or processed. In my experience, LIMIT 1 (or TOP 1 depending in the DB) to check for existence of a row makes a big difference in terms of performance for large tables.
EDIT: I think I misread your question, but I'll leave my answer here anyway if it's of any help.
I would think this
SELECT null FROM table WHERE id = 5 LIMIT 1;
would be faster than this
SELECT 1 FROM table WHERE id = 5 LIMIT 1;
but the timer says the winner is "SELECT 1".
For the first two queries, most people will generally say, always specify exactly what you need and leave the rest. Effort isn't all specific as bandwidth could be spent in returning data that you aren't even going to do anything with.
As for the previous answer will do for your result set, unless you're dealing with a language that supports affected rows. This can sometimes work when getting data to collect information on how many rows were returned in the last query. You'll need to look at your interface documentation as to how to get that information.
The difference between your 3 queries depends on how you've built your index. Only returning the primary key is likely to be faster as MySQL will have your index in memory, and not have to hit disk. Adding the LIMIT 1 is also a good trick that will speed up the optimizer significantly in early 5.0.x branches and earlier.
try EXPLAIN SELECT id FROM table WHERE id=5 and check the Extras column for the presence of USING INDEX. If its there, then you're query is coming straight from the index, and is going to be much faster.