Does a GROUP BY on a UNIQUE key calculate all the groups before applying the LIMIT clause? - sql

If I GROUP BY on a unique key and apply a LIMIT clause to the query, will all the groups be calculated before the limit is applied?
If I have a hundred records in the table (each with a unique key), will there be 100 records in the temporary table created for the GROUP BY before the LIMIT is applied?
A case study why I need this:
Take Stack Overflow for example.
Each query you run to show a list of questions also shows the user who asked each question and the number of badges that user has.
So while user<->question is one-to-one, user<->badges is one-to-many.
The only way to do it in one query (rather than one query on questions, another on users, and then combining the results) is to group the query by the primary key (question_id) and join + GROUP_CONCAT to the user_badges table.
The same goes for the question TAGS.
Code example:
Table Questions:
question_id (int) (pk) | question_body (varchar)
Table tag_question:
question_id (int) | tag_id (int)
SELECT:
SELECT questions.question_id,
       questions.question_body,
       GROUP_CONCAT(tag_id, ' ') AS tags_ids
FROM questions
JOIN tag_question
  ON questions.question_id = tag_question.question_id
GROUP BY questions.question_id
LIMIT 15

Yes. The logical order in which a query is evaluated is:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT
LIMIT is the last thing applied, so your grouping will be just fine.
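To make that order concrete, here is a sketch against the questions/tag_question schema from the question, annotated with the logical order above (the WHERE and HAVING lines are placeholders only to show where each clause sits, and the optimizer may execute things in a different physical order):
SELECT q.question_id,                          -- 5. SELECT: compute output columns and aggregates
       GROUP_CONCAT(tq.tag_id SEPARATOR ' ')
FROM questions q                               -- 1. FROM/JOIN: build the row source
JOIN tag_question tq
  ON tq.question_id = q.question_id
WHERE q.question_id > 0                        -- 2. WHERE: filter individual rows (placeholder condition)
GROUP BY q.question_id                         -- 3. GROUP BY: collapse rows into groups
HAVING COUNT(*) > 0                            -- 4. HAVING: filter whole groups (placeholder condition)
ORDER BY q.question_id                         -- 6. ORDER BY: sort the result
LIMIT 15                                       -- 7. LIMIT: keep only the first 15 rows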
Now, looking at your rephrased question: you don't have just one row per group, but many. In the Stack Overflow case you'll have just one user per question but many badges, i.e.
(uid, badge_id, ...)
(1, 2, ...)
(1, 3, ...)
(1, 12, ...)
and all of those rows would be grouped together.
To avoid a full table scan, all you need are the right indexes. That said, if you need an aggregate over the whole table (a SUM, for example), you cannot avoid a full scan.
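For instance, an index along these lines (the name is made up) lets the join and the GROUP_CONCAT in the queries in this answer be served from the index:
-- Covering index for the join and GROUP_CONCAT; question_id first so groups are read in order.
CREATE INDEX idx_tq_question_tag ON tag_question (question_id, tag_id);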
EDIT:
You'll need something like this (look at the WHERE clause):
SELECT
    q1.question_id,
    q1.question_body,
    GROUP_CONCAT(tq.tag_id, ' ') AS tags_ids
FROM
    questions q1
JOIN tag_question tq
    ON q1.question_id = tq.question_id
WHERE
    q1.question_id IN (
        SELECT tq2.question_id
        FROM tag_question tq2
        JOIN tag t
            ON tq2.tag_id = t.tag_id
        WHERE t.name = 'the-misterious-tag'
    )
GROUP BY
    q1.question_id
LIMIT 15

LIMIT does get applied after GROUP BY.
Whether a temporary table is created or not depends on how your indexes are built.
If you have an index on the grouping field and don't order by the aggregate results, then an index scan for GROUP BY is applied and each aggregate is computed on the fly.
That means that if a group is never reached because of the LIMIT, its aggregate is never calculated.
But if you order by an aggregate, then, of course, all of them need to be calculated before they can be sorted.
That's why they are calculated first and then the filesort is applied.
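As an illustration of the two cases (a sketch against the tag_question table above, not taken from the original answer):
-- Case 1: group on an indexed column, no ordering by the aggregate.
-- MySQL can walk the index group by group and stop after 15 groups.
SELECT question_id, COUNT(*) AS tag_count
FROM tag_question
GROUP BY question_id
LIMIT 15;

-- Case 2: order by the aggregate.
-- Every group's COUNT(*) must be computed before the filesort, so LIMIT cannot cut the work short.
SELECT question_id, COUNT(*) AS tag_count
FROM tag_question
GROUP BY question_id
ORDER BY tag_count DESC
LIMIT 15;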
Update:
As for your query, see what EXPLAIN EXTENDED says for it.
Most probably, question_id is a PRIMARY KEY for your table, and most probably, it will be used in a scan.
That means no filesort will be applied, and the join itself will never run past the 15th row.
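For example (EXPLAIN EXTENDED is the older MySQL syntax; on 5.7+ a plain EXPLAIN reports the same information):
EXPLAIN EXTENDED
SELECT questions.question_id,
       questions.question_body,
       GROUP_CONCAT(tag_id, ' ') AS tags_ids
FROM questions
JOIN tag_question ON questions.question_id = tag_question.question_id
GROUP BY questions.question_id
LIMIT 15;

SHOW WARNINGS;  -- shows the query as rewritten by the optimizer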
To make sure, rewrite your query as follows:
SELECT question_id,
       question_body,
       (
           SELECT GROUP_CONCAT(tag_id, ' ')
           FROM tag_question t
           WHERE t.question_id = q.question_id
       ) AS tags_ids
FROM questions q
ORDER BY question_id
LIMIT 15
First, it is more readable,
Second, it is more efficient, and
Third, it will return even untagged questions (which your current query doesn't).

If the field you're grouping on is indexed, it shouldn't do a full table scan.

Related

Best approach to occurrences of ids on a table and all elements in another table

Well, the query I need is simple, and maybe it's answered in another question, but there is a performance aspect to what I need, so:
I have a users table with 10,000 rows; the table contains id, email and more data.
In another table called orders I have far more rows, maybe 150,000.
Each order stores the id of the user that made it, and also a status. The status can be a number from 0 to 9 (or NULL).
My final requirement is to get every user with their id, email, some other columns, and the number of orders with status 3 or 7. It doesn't matter whether it's 3 or 7, I just need the total count.
But I need to do this query in a low-impact (performant) way.
What is the best approach?
I need to run this in Redash against Postgres 10.
This sounds like a join and group by:
select u.*, count(*)
from users u join
orders o
on o.user_id = u.user_id
where o.status in (3, 7)
group by u.user_id;
Postgres is usually pretty good about optimizing these queries -- and the above assumes that users(user_id) is the primary key -- so this should work pretty well.
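If "every user" literally means including users with no status-3/7 orders, a LEFT JOIN variant along these lines should do it (a sketch, assuming the same column names as above):
select u.user_id, u.email, count(o.user_id) as orders_3_or_7
from users u left join
     orders o
     on o.user_id = u.user_id
    and o.status in (3, 7)  -- filter inside the join so unmatched users keep a row with count 0
group by u.user_id, u.email;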

How can I order by a specific order?

It would be something like:
SELECT * FROM users ORDER BY id ORDER("abc","ghk","pqr"...);
In my order clause there might be 1,000 values, and all of them are dynamic.
A quick Google search gave me the result below:
SELECT * FROM users ORDER BY case id
when "abc" then 1
when "ghk" then 2
when "pqr" then 3 end;
As I said, all my order-clause values are dynamic. So is there any suggestion for me?
Your example isn't entirely clear, as it appears that a simple ORDER BY would suffice to order your ids alphabetically. However, it appears you are trying to create a dynamic ordering scheme that may not be alphabetical. In that case, my recommendation would be to use a lookup table for the values you order by. This serves two purposes: first, it allows you to reorder the items easily without altering every entry in the users table, and second, it avoids (or at least reduces) the problems with typos and other issues that come with "magic strings."
This would look something like:
Lookup Table:
CREATE TABLE LookupValues (
    Id CHAR(3) PRIMARY KEY,
    SortOrder INT   -- ORDER is a reserved word, so a safer column name is used here
);
Query:
SELECT
    u.*
FROM
    users u
INNER JOIN
    LookupValues l
ON
    u.Id = l.Id
ORDER BY
    l.SortOrder
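For example, the lookup rows for the ordering in the question might look like this (the values are illustrative):
INSERT INTO LookupValues (Id, SortOrder) VALUES
    ('abc', 1),
    ('ghk', 2),
    ('pqr', 3);
-- Re-ordering later only means updating SortOrder here, not touching the users table or the query.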

Query to ORDER BY the number of rows returned from another SELECT

I'm trying to wrap my head around SQL and I need some help figuring out how to do the following query in PostgreSQL 9.3.
I have a users table, and a friends table that lists user IDs and the user IDs of friends in multiple rows.
I would like to query the user table, and ORDER BY the number of mutual friends in common to a user ID.
So, the friends table would look like:
user_id | friend_user_id
1 | 4
1 | 5
2 | 10
3 | 7
And so on; user 1 lists 4 and 5 as friends, and user 2 lists 10 as a friend. I want to sort the user_ids returned by the select by their friend count, highest first.
The Postgres way to do this:
SELECT *
FROM users u
LEFT JOIN (
    SELECT user_id, count(*) AS friends
    FROM friends
    GROUP BY user_id
) f USING (user_id)
ORDER BY f.friends DESC NULLS LAST, user_id  -- as tiebreaker
The keyword AS is just noise for table aliases. But don't omit it from column aliases. The manual on "Omitting the AS Key Word":
In FROM items, both the standard and PostgreSQL allow AS to be omitted
before an alias that is an unreserved keyword. But this is impractical
for output column names, because of syntactic ambiguities.
ISNULL() is a nonstandard extension of MySQL and SQL Server. Postgres uses the SQL-standard function COALESCE(). But you don't need either here. Use the NULLS LAST clause instead, which is faster and cleaner. See:
PostgreSQL sort by datetime asc, null first?
Multiple users will have the same number of friends. These peers would be sorted arbitrarily. Repeated execution might yield different sort order, which is typically not desirable. Add more expressions to ORDER BY as tiebreaker. Ultimately, the primary key resolves any remaining ambiguity.
If the two tables share the same column name user_id (like they should) you can use the syntax shortcut USING in the join clause. Another standard SQL feature. Welcome side effect: user_id is only listed once in the output for SELECT *, as opposed to when joining with ON. Many clients wouldn't even accept duplicate column names in the output.
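A minimal illustration of that side effect (a sketch, not from the original answer):
-- user_id appears only once in the output:
SELECT * FROM users u JOIN friends f USING (user_id) LIMIT 1;

-- u.user_id and f.user_id both appear in the output:
SELECT * FROM users u JOIN friends f ON f.user_id = u.user_id LIMIT 1;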
Something like this?
SELECT * FROM [users] u
LEFT JOIN (SELECT user_id, COUNT(*) AS friends FROM friends GROUP BY user_id) f
ON u.user_id = f.user_id
ORDER BY ISNULL(f.friends, 0) DESC

SQL Server - Speed up count on large table

I have a table with close to 30 million records and just a few columns. One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it. I need to be able to filter on that column and efficiently page through the results.
For now I have this (for example, if the year I'm searching for is '1970'; it is a parameter in my stored procedure):
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Every query of that sort (only Born parameter used) returns just over 1 million results.
I've noticed the biggest overhead is on the count used to return the total results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause the whole thing speeds up a lot.
Is there a way to speed up the count in that query? What I care about is getting the paged results together with the total count.
Updated following discussion in comments
The cause of the problem here is very low cardinality of the IX_Person_Born index.
SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.
You can think of it like the index of a phone book: if you want to find "Smith, John" you first find that there are lots of names that begin with S, then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.
This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.
Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.
A high cardinality index (say for full date of birth), rather than year would be much quicker.
This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.
In short: there's no significantly quicker way to get the count using T-SQL and indexes.
You have a couple of options:
1. Data Aggregation
Either OLAP/Cube rollups or do it yourself:
select Born, count(*)
from Person
group by Born
The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.
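A do-it-yourself version of that rollup could be as small as a summary table rebuilt on a schedule (the table name dbo.PersonCountsByBorn is made up for illustration):
-- Rebuild the precomputed counts, e.g. from a nightly job.
TRUNCATE TABLE dbo.PersonCountsByBorn;

INSERT INTO dbo.PersonCountsByBorn (Born, PersonCount)
SELECT Born, COUNT(*)
FROM dbo.Person
GROUP BY Born;

-- The paging query then reads the total from the small table instead of counting ~1 million rows.
SELECT PersonCount FROM dbo.PersonCountsByBorn WHERE Born = '1970';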
2. Parallel Queries
Split into two queries:
SELECT count(*)
FROM Person
WHERE Born = '1970'
SELECT TOP 30 *
FROM Person
WHERE Born = '1970'
Then run them either in parallel server-side, or issue them as separate calls from the user interface.
3. NoSQL
This problem is one of the big advantages NoSQL solutions have over traditional relational databases. In a NoSQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches, every server is checked at the same time.
At this point a technology change is probably out, but it may be worth investigating so I've included it.
I have had similar problems in the past with databases of this kind of size, and (depending on the context) I've used both options 1 and 2. If the total here is only for paging then I'd probably go with option 2 and an AJAX call to get the count.
DECLARE @TotalPeople int
--does this query run fast enough? If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.
One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it.
Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.
Here is a desperate measure that will force SQL Server's hand, at the potential cost of making writes more expensive (measure that) and of blocking schema changes to the Person table while the view exists.
CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born
GO
CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON BornCounts(Born)
By putting a clustered index on a view, you make it a system maintained copy. The size of this copy is much smaller than 30 Million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.
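If you do want to reference the view explicitly (for example on editions where the optimizer won't match it automatically), something like this should work:
SELECT NumRows
FROM dbo.BornCounts WITH (NOEXPAND)   -- NOEXPAND reads the indexed view directly
WHERE Born = '1970'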
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, max(Row) AS TotalPeople  -- the intended change
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Why not like that?
Here is a novel approach using system DMVs, if you can get by with a "good enough" count, don't mind creating an index for every distinct value of [Born], and don't mind feeling a little bit dirty inside.
Create a filtered index for each year:
--pick a column to index, it doesn't matter which.
CREATE INDEX IX_Person_filt_1970 on Person ( id ) WHERE Born = '1970'
CREATE INDEX IX_Person_filt_1971 on Person ( id ) WHERE Born = '1971'
CREATE INDEX IX_Person_filt_1972 on Person ( id ) WHERE Born = '1972'
Then use the [rows] column from sys.partitions to get a row count.
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *,
(
SELECT sum(rows)
FROM sys.partitions p
inner join sys.indexes i on p.object_id = i.object_id and p.index_id =i.index_id
inner join sys.tables t on t.object_id = i.object_id
WHERE t.name ='Person'
and i.name = 'IX_Person_filt_' + '1970' --or concatenate a parameter here instead of the literal
) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
sys.partitions isn't guaranteed to be accurate in 100% of cases (usually it is exact or really close), and this approach won't work if you need to filter on anything but [Born].

SQL aggregation question

I have three tables:
unmatched_purchases table:
unmatched_purchases_id --primary key
purchases_id --foreign key to purchases table
location_id --which store
purchase_date
item_id --item purchased
purchases table:
purchases_id --primary key
location_id --which store
customer_id
credit_card_transactions:
transaction_id --primary key
trans_timestamp --timestamp of when the transaction occurred
item_id --item purchased
customer_id
location_id
All three tables are very large. The purchases table has 590,130,404 records (yes, over half a billion). Unmatched_purchases has 192,827,577 records. Credit_card_transactions has 79,965,740 records.
I need to find out how many purchases in the unmatched_purchases table match up with entries in the credit_card_transactions table. I need to do this for one location at a time (i.e. run the query for location_id = 123, then run it for location_id = 456). "Match up" is defined as:
1) same customer_id
2) same item_id
3) the trans_timestamp is within a certain window of the purchase_date
(e.g. if the purchase_date is Jan 3, 2005
and the trans_timestamp is 11:14 PM on Jan 2, 2005, that's close enough)
I need the following aggregated:
1) How many unmatched purchases are there for that location
2) How many of those unmatched purchases could have been matched with credit_card_transactions for a location.
So, what is a query (or queries) to get this information that won't take forever to run?
Note: all three tables are indexed on location_id
EDIT: as it turns out, the credit_card_transactions table has been partitioned on location_id, so that will help speed this up for me. I'm asking our DBA whether the others could be partitioned as well, but the decision is out of my hands.
CLARIFICATION: I will only need to run this on a few of our many locations, not on all of them separately. I need to run it for 3 locations. We have 155 location_ids in our system, but some of them are not used in this part of the system.
Try this (I have no idea how fast it will be; that depends on your indexes):
Select Count(*) TotalPurchases,
       Sum(Case When c.transaction_id Is Not Null
                Then 1 Else 0 End) MatchablePurchases
From unmatched_purchases u
Join purchases p
    On p.purchases_id = u.purchases_id            -- join on the purchases_id foreign key
Left Join credit_card_transactions c
    On  c.customer_id = p.customer_id
    And c.item_id = u.item_id
    And c.trans_timestamp - u.purchase_date < #DelayThreshold   -- simplified window check
Where u.location_id = #Location
You'll also need more indexes. I propose at least the following:
an index on unmatched_purchases.purchases_id, one on purchases.location_id, and
another index on credit_card_transactions (location_id, customer_id, item_id, trans_timestamp).
Without those indexes, there is little hope IMO.
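In DDL terms that proposal is roughly the following (the index names are made up):
CREATE INDEX ix_up_purchases_id ON unmatched_purchases (purchases_id);
CREATE INDEX ix_p_location_id ON purchases (location_id);
CREATE INDEX ix_cct_loc_cust_item_ts
    ON credit_card_transactions (location_id, customer_id, item_id, trans_timestamp);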
I suggest you query ALL locations at once. It will cost you three full scans (each table once) plus sorting. I bet this will be faster than querying locations one by one.
But if you don't want to guess, you at least need to examine the EXPLAIN PLAN and a 10046 trace of your query...
The query ought to be straightforward, but the tricky part is getting it to perform. I'd question why you need to run it once for each location when it would probably be more efficient to run it for every location in a single query.
The join would be a big challenge, but the aggregation ought to be straightforward. I would guess that your best hope performance-wise for the join would be a hash join on the customer and item columns, with a subsequent filter operation on the date range. You might have to fiddle with putting the customer and item join in an inline view and then try to stop the date predicate from being pushed into the inline view.
The hash join would be much more efficient with tables that are being equi-joined both having the same hash partitioning key on all join columns, if that can be arranged.
Whether to use the location index or not ...
Whether the index is worth using or not depends on the clustering factor for the location index, which you can read from the user_indexes table. Can you post the clustering factor along with the number of blocks that the table contains? That will give a measure of the way that values for each location are distributed throughout the table. You could also extract the execution plan for a query such as:
select some_other_column
from my_table
where location_id in (:value1, :value2, :value3)
... and see if Oracle thinks the index is useful.
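To pull those numbers, a query against the data dictionary along these lines should work (a sketch; it assumes the table names are stored in upper case, as Oracle does by default):
SELECT i.index_name,
       i.clustering_factor,
       t.blocks,
       t.num_rows
FROM   user_indexes i
JOIN   user_tables t ON t.table_name = i.table_name
WHERE  i.table_name IN ('UNMATCHED_PURCHASES', 'PURCHASES', 'CREDIT_CARD_TRANSACTIONS');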