strange performance of postgres query

I've a very strange problem. When I execute a query like the one below:
with ap as (
SELECT id from adress limit 1000
)
SELECT distinct house.id, house.date
FROM house
WHERE house.adressid in (select id from ap)
LIMIT 9999
I get the results within 100 ms.
But when I change the limit to 10, I'm getting the result after 20 s:
with ap as (
SELECT id from adress limit 1000
)
SELECT distinct house.id, house.date
FROM house
WHERE house.adressid in (select id from ap)
LIMIT 10
Of course there is an index on adressid:
CREATE INDEX house_idx
ON house
USING btree
(adressid COLLATE pg_catalog."default");
In house there are about 9 million rows.
Does anyone have an idea how I can try to improve the performance? I've reduced the problem to this very simple one, but in reality the structure is much more complex; that's why I didn't provide the table definitions and query plans...

That is actually not surprising:
In the first case ap has up to 1000 rows and the result set may contain up to 9999 rows, so the optimizer puts adress first. With the index on house, query performance is relatively high.
In the second case ap still has up to 1000 rows, but the result set may contain only up to 10 rows, so the optimizer puts house first... and ends up with 10 scans of adress, up to 1000 rows each. It probably won't even be able to use an index, since there is no ORDER BY clause anywhere.
That LIMIT 1000 on adress looks really suspicious and can lead to inconsistent results: without an ORDER BY there is no guarantee which records from adress are taken into account on each run.
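To see which plan the planner actually chooses in each case, compare the two plans directly. A minimal sketch (the exact output depends on your PostgreSQL version and statistics):
EXPLAIN (ANALYZE, BUFFERS)
with ap as (
SELECT id from adress limit 1000
)
SELECT distinct house.id, house.date
FROM house
WHERE house.adressid in (select id from ap)
LIMIT 10;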
I would use INNER JOIN to resolve the issue:
SELECT DISTINCT house.id, house.date
FROM house
INNER JOIN adress ON adress.id = house.adressid
ORDER BY house.date --< To add some consistency
LIMIT 10


Does the order in the WHERE statement hit the performance? [duplicate]

This question already has answers here:
Does the order of conditions in a WHERE clause affect MySQL performance?
Does the order you specify when constructing your query affect the performance? Or does SQL do smart filtering?
For example assume I have a table Employee with 2 million records:
Employee( emp_id, name, dept_id, country_id )
Now let's say I want to get the id and name of the employees in country id 500 and dept id 17.
Let's say there are about 300k in that dept and about 1 million in that country; however, the number that meet both criteria is 50k.
Would it make a performance difference if I do:
SELECT *
FROM employees
where dept_id = 17 and country_id= 500
than if I do:
SELECT *
FROM employees
where country_id= 500 and dept_id = 17
I'm assuming the latter would cut the table down to 1 million rows and then apply the second filter from there, while
the first query would cut it down to 300k and apply the second filter from there.
However, as mentioned before, I'm unsure whether this is how the SQL engine actually handles the query.
It does affect performance, especially when you have residual predicates in your execution plan, but most of the time the query optimizer will do the job of reordering the predicates for you.
That assumes, of course, that your indexes and statistics are well-maintained and updated, so it's something to account for.
Further reading: http://sqlserverpedia.com/wiki/Index_Selectivity_and_Column_Order
Most modern RDBMSs won't have a problem with the order of predicates in the WHERE part of the statement; their query optimizers will in most cases reorder them the way you've described to maximize performance.
I know of a few older RDBMSs that actually are affected quite significantly if you choose the "wrong" order, but those should have been out of fashion for the last decade.
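A quick way to convince yourself is to compare the execution plans of both orderings; if the optimizer has normalized the predicates, the plans come out identical. A hedged sketch (EXPLAIN works in MySQL and PostgreSQL; in SQL Server you would inspect the graphical plan instead):
EXPLAIN SELECT * FROM employees WHERE dept_id = 17 AND country_id = 500;
EXPLAIN SELECT * FROM employees WHERE country_id = 500 AND dept_id = 17;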
For the above table:
If a non-clustered index exists on
(country_id, dept_id, employee_id),
then the query
SELECT *
FROM employees
where country_id= 500 and dept_id = 17
will have better performance.
If a non-clustered index exists on
(dept_id, country_id, employee_id),
then the query
SELECT *
FROM employees
where dept_id = 17 and country_id= 500
will have better performance.
If no non-clustered index exists, then the query
SELECT *
FROM employees
where dept_id = 17 and country_id= 500
will have better performance, because the subset left for the second filter is smaller.
Also worth mentioning: if both non-clustered indexes exist, then the query
SELECT *
FROM employees
where dept_id = 17 and country_id= 500
will have better performance, for the same reason.
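For illustration, such a covering index could be created like this. A hedged sketch in SQL Server syntax (the index name is an assumption; the columns come from the question):
CREATE NONCLUSTERED INDEX IX_employees_dept_country
ON employees (dept_id, country_id)
INCLUDE (emp_id, name);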

SQL JOIN returning multiple rows when I only want one row

I am having a slow brain day...
The tables I am joining:
Policy_Office:
PolicyNumber  OfficeCode
1             A
2             B
3             C
4             D
5             A
Office_Info:
OfficeCode  AgentCode  OfficeName
A           123        Acme
A           456        Acme
A           789        Acme
B           111        Ace
B           222        Ace
B           333        Ace
...         ...        ...
I want to perform a search to return all policies that are affiliated with an office name. For example, if I search for "Acme", I should get two policies: 1 & 5.
My current query looks like this:
SELECT
*
FROM
Policy_Office P
INNER JOIN Office_Info O ON P.OfficeCode = O.OfficeCode
WHERE
O.OfficeName = 'Acme'
But this query returns multiple rows, which I know is because there are multiple matches from the second table.
How do I write the query to only return two rows?
SELECT DISTINCT a.PolicyNumber
FROM Policy_Office a
INNER JOIN Office_Info b
ON a.OfficeCode = b.OfficeCode
WHERE b.officeName = 'Acme'
SQLFiddle Demo
To further gain more knowledge about joins, kindly visit the link below:
Visual Representation of SQL Joins
A simple join returns the Cartesian product of the two sets: you have 2 rows with A in the first table and 3 rows with A in the second table, so you get 6 results. If you want only the policy number, then you should do a DISTINCT on it.
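Alternatively, you can sidestep the multiplication entirely by testing for the office's existence instead of joining. A sketch equivalent to the DISTINCT query above, assuming the same table and column names:
SELECT p.PolicyNumber
FROM Policy_Office p
WHERE EXISTS (
    SELECT 1
    FROM Office_Info o
    WHERE o.OfficeCode = p.OfficeCode
      AND o.OfficeName = 'Acme'
)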
(using MS SQL Server)
I know this thread is 10 years old, but I don't like DISTINCT (in my head it means the engine gathers all matching rows, hashes the selected columns of each record, and adds them to a tree ordered by that hash; I may be wrong, but it seems inefficient).
Instead, I use a CTE and the window function row_number(). It may very well be a slower approach, but it's pretty, easy to maintain, and I like it:
Given are a person table and a telephone table tied together with a foreign key (in the telephone table). This construct means that a person can have several numbers, but I only want the first, so that each person appears only once in the result set (I ought to be able to concatenate multiple telephone numbers into one string instead - pivot, I think - but that's another issue; see the sketch after this answer).
; -- don't forget this one!
with telephonenumbers
as
(
select [id]
, [person_id]
, [number]
, row_number() over (partition by [person_id] order by [activestart] desc) as rowno
from [dbo].[telephone]
where ([activeuntil] is null or [activeuntil] > getdate())
)
select p.[id]
,p.[name]
,t.[number]
from [dbo].[person] p
left join telephonenumbers t on t.person_id = p.id
and t.rowno = 1
This does the trick (in fact the last line does), and the syntax is readable and easy to extend. The example is simple, but when creating large scripts that join tables left and right (literally), it is difficult to keep unwanted duplicates out of the result - and difficult to identify which tables create them. CTEs work great for me.
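As for concatenating the numbers into one string instead of picking the first: on SQL Server 2017 and later, STRING_AGG can do that directly. A hedged sketch using the same tables (older versions would need a FOR XML PATH workaround instead):
select p.[id]
      ,p.[name]
      ,string_agg(t.[number], ', ') as numbers
from [dbo].[person] p
left join [dbo].[telephone] t on t.person_id = p.id
     and (t.[activeuntil] is null or t.[activeuntil] > getdate())
group by p.[id], p.[name]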

SQL Server - Speed up count on large table

I have a table with close to 30 million records and just a few columns. One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it. I need to be able to filter on that column and efficiently page through the results.
For now I have this (the example assumes the year I'm searching for is '1970'; it is a parameter in my stored procedure):
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Every query of that sort (only Born parameter used) returns just over 1 million results.
I've noticed the biggest overhead is the count used to return the total number of results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause, the whole thing speeds up a lot.
Is there a way to speed up the count in that query? What I care about is getting the paged results and the total count.
Updated following discussion in comments
The cause of the problem here is very low cardinality of the IX_Person_Born index.
SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.
You can think of it as like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, and then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.
This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.
Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.
A high-cardinality index (say, on the full date of birth rather than the year) would be much quicker.
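For illustration only, such an index might look like this (a sketch; it assumes a full DateOfBirth column, which the question doesn't say exists):
CREATE INDEX IX_Person_DateOfBirth ON Person (DateOfBirth);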
This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.
In short: there's no significantly quicker way to get the count using T-SQL and indexes.
You have a couple of options:
1. Data Aggregation
Either OLAP/Cube rollups or do it yourself:
select Born, count(*)
from Person
group by Born
The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.
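If you do it yourself, the aggregate can be persisted in a small summary table and refreshed on a schedule. A minimal sketch (the PersonCountsByBorn table and the refresh strategy are assumptions):
TRUNCATE TABLE PersonCountsByBorn;
INSERT INTO PersonCountsByBorn (Born, TotalPeople)
SELECT Born, COUNT(*)
FROM Person
GROUP BY Born;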
2. Parallel Queries
Split into two queries:
SELECT count(*)
FROM Person
WHERE Born = '1970'
SELECT TOP 30 *
FROM Person
WHERE Born = '1970'
Then run these either in parallel server-side, or issue them as separate calls from the user interface.
3. No-SQL
This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches every server is checked at the same time.
At this point a technology change is probably out, but it may be worth investigating so I've included it.
I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging then I'd probably go with option 2 and an AJAX call to get the count.
DECLARE @TotalPeople int
--does this query run fast enough? If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.
One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it.
Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.
Here is a desperate measure that will force SQL Server's hand (at the potential cost of making writes very expensive - measure that - and of blocking schema changes to the Person table while the view exists).
CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born
GO
CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON dbo.BornCounts (Born)
By putting a clustered index on a view, you make it a system maintained copy. The size of this copy is much smaller than 30 Million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.
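For example, the total for a given year becomes a single-row seek against the view (note: on non-Enterprise editions of SQL Server the NOEXPAND hint is typically required for the optimizer to read the indexed view directly):
SELECT NumRows
FROM dbo.BornCounts WITH (NOEXPAND)
WHERE Born = '1970'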
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT max(Row) FROM PersonSubset) AS TotalPeople --< the changed line
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Why not like that?
Here is a novel approach using system DMVs, if you can get by with a "good enough" count, don't mind creating an index for every distinct value of [Born], and don't mind feeling a little bit dirty inside.
Create a filtered index for each year:
--pick a column to index, it doesn't matter which.
CREATE INDEX IX_Person_filt_1970 on Person ( id ) WHERE Born = '1970'
CREATE INDEX IX_Person_filt_1971 on Person ( id ) WHERE Born = '1971'
CREATE INDEX IX_Person_filt_1972 on Person ( id ) WHERE Born = '1972'
Then use the [rows] column from sys.partitions to get a rowcount.
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *,
(
SELECT sum(rows)
FROM sys.partitions p
inner join sys.indexes i on p.object_id = i.object_id and p.index_id =i.index_id
inner join sys.tables t on t.object_id = i.object_id
WHERE t.name ='Person'
and i.name = 'IX_Person_filt_' + '1970' --or use @p1
) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
sys.partitions isn't guaranteed to be accurate in 100% of cases (usually it is exact or really close). This approach won't work if you need to filter on anything but [Born].

How do I select 8 random songs from top 50, with unique user_id?

I am trying to get the top 50 downloads and then shuffle (randomize) 8 of the results. Also, the 8 results have to have unique user_ids. I came up with this so far:
Song.select('DISTINCT songs.user_id, songs.*').where(:is_downloadable => true).order('songs.downloads_count DESC').limit(50).sort_by{rand}.slice(0,8)
My only gripe with this is that the last part, .sort_by{rand}.slice(0,8), is done in Ruby. Is there any way I can do all this via Active Record?
I wonder how the column user_id ended up in the table songs? That means you have one row for every combination of song and user? In a normalized schema, that would be an n:m relationship implemented with three tables:
song(song_id, ...)
usr(usr_id, ...) -- "user" is a reserved word
download (song_id, user_id, ...) -- implementing the n:m relationship
The query in your question yields incorrect results. The same user_id can pop up multiple times. DISTINCT does not do what you seem to expect it to. You need DISTINCT ON or some other method like aggregate or window functions.
You also need to use subqueries or CTEs, because this cannot be done in one step. When you use DISTINCT you cannot at the same time ORDER BY random(), because the sort order cannot disagree with the order dictated by DISTINCT. This query is certainly not trivial.
Simple case, top 50 songs
If you are happy to just pick the top 50 songs (not knowing how many duplicate user_ids are among them), this "simple" case will do:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 50
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC -- pick most popular song per user
-- ORDER BY user_id, random() -- pick random song per user
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the 50 songs with the highest downloads_count. Users can show up multiple times.
Pick 1 song per user. Randomly or the most popular one; that's not defined in your question.
Randomly pick 8 of the remaining songs, which now have distinct user_ids.
You only need an index on songs.downloads_count for this to be fast:
CREATE INDEX songs_downloads_count_idx ON songs (downloads_count DESC);
Top 50 songs with unique user_id
WITH x AS (
SELECT DISTINCT ON (user_id) *
FROM songs
WHERE is_downloadable
ORDER BY user_id, downloads_count DESC
)
, y AS (
SELECT *
FROM x
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM y
ORDER BY random()
LIMIT 8;
Get the song with the highest downloads_count per user. Every user can only show up once, so it has to be the one song with the highest downloads_count.
Pick the 50 with highest downloads_count from that.
Pick 8 songs from that randomly.
With a big table, performance will suck, because you have to find the best row for every user before you can proceed. A multi-column index will help, but it will still not be very fast:
CREATE INDEX songs_u_dc_idx ON songs (user_id, downloads_count DESC);
The same, faster
If duplicate user_ids among the top songs are predictably rare, you can use a trick. Pick just enough of the top downloads, so that the top 50 with unique user_id are certainly among them. After this step, proceed like above. This will be much faster with big tables, because the top n rows can be read from the top of an index quickly:
WITH x AS (
SELECT *
FROM songs
WHERE is_downloadable
ORDER BY downloads_count DESC
LIMIT 100 -- adjust to your secure estimate
)
, y AS (
SELECT DISTINCT ON (user_id) *
FROM x
ORDER BY user_id, downloads_count DESC
)
, z AS (
SELECT *
FROM y
ORDER BY downloads_count DESC
LIMIT 50
)
SELECT *
FROM z
ORDER BY random()
LIMIT 8;
The index from the simple case above will suffice to make it almost as fast as the simple case.
This would fall short if fewer than 50 distinct users are among the top 100 "songs".
All queries should work with PostgreSQL 8.4 or later.
If it has to be faster yet, create a materialized view that holds the pre-selected top 50, and rewrite that table at regular intervals or triggered by events. If you make heavy use of this and the table is big, I would go for that. Otherwise it's not worth the overhead.
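A rough sketch of that approach (note: CREATE MATERIALIZED VIEW only arrived in PostgreSQL 9.3; on older versions you would maintain a plain table with the same contents):
CREATE MATERIALIZED VIEW top_songs AS
SELECT *
FROM   songs
WHERE  is_downloadable
ORDER  BY downloads_count DESC
LIMIT  50;

-- rerun at regular intervals or triggered by events
REFRESH MATERIALIZED VIEW top_songs;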
Generalized, improved solution
I later formalized and improved this approach further to be applicable to a whole class of similar problems under this related question at dba.SE.
You could use PostgreSQL's RANDOM() function in the order by, making it
___.order('songs.downloads_count DESC, RANDOM()').limit(8)
though this doesn't work, because with SELECT DISTINCT, PostgreSQL requires the expressions used in the ORDER BY to appear in the select list. You'll get an error like
ActiveRecord::StatementInvalid: PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
The only way to do what you're asking all in SQL (using PostgreSQL) is with a subquery, which may or may not be a better solution for you. If it is, your best bet is to write out the full query/subquery using find_by_sql.
I'm happy to help come up with the SQL, though now that you know about RANDOM(), it should be pretty trivial.
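A sketch of the subquery form this alludes to (it mirrors the Ruby version above, so it does not de-duplicate user_id; see the CTE-based answers above for that):
SELECT *
FROM (
    SELECT *
    FROM   songs
    WHERE  is_downloadable
    ORDER  BY downloads_count DESC
    LIMIT  50
) AS top_songs
ORDER BY random()
LIMIT 8;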

Fetch data with single and fast SQL query

I have the following data:
ExamEntry  Student_ID  Grade
11         1           80
12         2           70
13         3           20
14         3           68
15         4           75
I want to find all the students that passed the exam. If a student attended several exams, I need to find their last result.
So, in this case I'd get that all students passed.
Can I find this with one fast query? Currently I do it in two steps:
Find the list of entries:
select max(ExamEntry) from data group by Student_ID
Find the results:
select ExamEntry from data where ExamEntry in ( ).
But this is VERY slow - I get around 1000 entries, and this two-step process takes 10 seconds.
Is there a better way?
Thanks.
If your query is very slow with 1000 records in your table, there is something wrong.
For a modern database system, a table containing 1000 entries is considered very, very small.
Most likely you did not provide a (primary) key for your table?
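If the key really is missing, adding one plus a supporting index would be the first fix. A hedged sketch (MySQL is named later in the thread; the column choices are assumptions based on the sample data):
ALTER TABLE data ADD PRIMARY KEY (ExamEntry);
CREATE INDEX idx_data_student ON data (Student_ID, ExamEntry);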
Assuming that a student passes if at least one of the grades is above the minimum needed, the appropriate query would be:
SELECT
Student_ID
, MAX(Grade) AS maxGrade
FROM table_name
GROUP BY Student_ID
HAVING maxGrade > MINIMUM_GRADE_NEEDED
If you really need the latest grade to be above the minimum:
SELECT
Student_ID
, Grade
FROM table_name
WHERE ExamEntry IN (
SELECT
MAX(ExamEntry)
FROM table_name
GROUP BY Student_ID
)
AND Grade > MINIMUM_GRADE_NEEDED
SELECT student_id, MAX(ExamEntry)
FROM data
WHERE Grade > :threshold
GROUP BY student_id
Like this?
I'll assume that you have a student table and a test table, and that the table you are showing us is the test_result table... (if you don't have a similar structure, you should revisit your schema)
select s.id, s.name, t.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
left outer join test t on r.test_id = t.id
group by s.id, s.name, t.name
All the fields with id in it should be indexed.
If you really only have a single test (type) in your domain... then the query would be
select s.id, s.name, max(r.score)
from student s
left outer join test_result r on r.student_id = s.id
group by s.id, s.name
I've used the hints given here, and here is the query I found that runs almost 3 orders of magnitude faster than my first one (0.03 sec instead of 10 sec):
SELECT ExamEntry, Student_ID, Grade from data,
( SELECT max(ExamEntry) as ExId FROM data GROUP BY Student_ID ) as newdata
WHERE `data`.`ExamEntry`=`newdata`.`ExId` AND Grade > 60;
Thanks All!
As mentioned, indexing is a powerful tool for speeding up queries. The order of the index, however, is fundamentally important.
An index in order of (ExamEntry) then (Student_ID) then (Grade) would be next to useless for finding exams where the student passed.
An index in the opposite order would fit perfectly, if all you wanted was to find what exams had been passed. This would enable the query engine to quickly identify rows for exams that have been passed, and just process those.
In MS SQL Server this can be done with...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Grade],
[Student_ID],
[ExamEntry]
)
ON [PRIMARY]
(I recommend reading more about indexes to see what other options there are, such as clustered indexes, etc.)
With that index, the following query would be able to ignore the 'failed' exams very quickly, and just display the students who ever passed the exam...
(This assumes that if you ever get over 60, you're counted as a pass, even if you subsequently take the exam again and get 27.)
SELECT
Student_ID
FROM
[results]
WHERE
Grade >= 60
GROUP BY
Student_ID
Should you definitely need the most recent value, then you need to change the order of the index back to something like...
CREATE INDEX [IX_results] ON [dbo].[results]
(
[Student_ID],
[ExamEntry],
[Grade]
)
ON [PRIMARY]
This is because the first thing we are interested in is the most recent ExamEntry for any given student. Which can be achieved using the following query...
SELECT
*
FROM
[results]
WHERE
[results].ExamEntry = (
SELECT
MAX([student_results].ExamEntry)
FROM
[results] AS [student_results]
WHERE
[student_results].Student_ID = [results].student_id
)
AND [results].Grade > 60
Having a subquery like this can appear slow, especially since it appears to be executed for every row in [results].
This, however, is not the case...
- Both the main query and the subquery reference the same table
- The query engine scans through the index for every unique Student_ID
- The subquery is executed for that Student_ID
- The query engine is already in that part of the index
- So a new index lookup is not needed
EDIT:
A comment was made that at 1000 records indexes are not relevant. It should be noted that the question states that there are 1000 records returned, not that the table contains 1000 records. For a basic query to take as long as stated, I'd wager there are many more than 1000 records in the table. Maybe this can be clarified?
EDIT:
I have just investigated 3 queries, with 999 records in each (3 exam results for each of 333 students)
Method 1: WHERE a.ExamEntry = (SELECT MAX(b.ExamEntry) FROM results AS b WHERE b.Student_ID = a.Student_ID)
Method 2: WHERE a.ExamEntry IN (SELECT MAX(ExamEntry) FROM results GROUP BY Student_ID)
Method 3: using an INNER JOIN instead of the IN clause
The following times were found:
Method  QueryCost(No Index)  QueryCost(With Index)
1       23%                  9%
2       38%                  46%
3       38%                  46%
So, method 1 is faster regardless of indexes, and indexes make it substantially faster still.
The reason for this is that indexes allow lookups where otherwise you would need a scan: the difference between a linear law and a square law.
Thanks for the answers!!
I think that Dems is probably closest to what I need, but I will elaborate a bit on the issue.
Only the latest grade counts. If the student passed the first time, attended again and failed, he failed in total. He/she could've attended 3 or 4 exams, but still only the last one counts.
I use MySQL Server. I experience the problem on both Linux and Windows installations.
My data set is around 2K entries now and grows at a rate of ~1K per new exam.
A query for a specific exam also returns ~1K entries, since ~1K is the number of students who attended (obtained via SELECT DISTINCT Student_ID from results;): almost all have passed and some have failed.
I perform the following query in my code:
SELECT ExamEntry, Student_ID from exams WHERE ExamEntry in ( SELECT MAX(ExamEntry) from exams GROUP BY Student_ID ). As the subquery returns ~1K entries, it appears that the main query scans them in a loop, making the whole query run for a very long time at 50% server load (100% on Windows).
I feel that there is a better way :-), just can't find it yet.
select examentry,student_id,grade
from data
where examentry in
(select max(examentry)
from data
where grade > 60
group by student_id)
Don't use
where grade > 60
but
where grade between 60 and 100
That should go faster.