Select random rows according to given criteria in PostgreSQL - sql

I have a table user with ten million rows. It has fields: id int4 primary key, rating int4, country varchar(32), last_active timestamp. It has gaps in the identifiers.
The task is to select five random users for a given country who were active within the last two days and have a rating in a given range.
Is there a tricky way to select them faster than the query below?
SELECT id
FROM user
WHERE last_active > '2020-04-07'
AND rating between 200 AND 280
AND country = 'US'
ORDER BY random()
LIMIT 5
I thought about this query:
SELECT id
FROM user
WHERE last_active > '2020-04-07'
AND rating between 200 AND 280
AND country = 'US'
AND id > (SELECT random()*max(id) FROM user)
ORDER BY id ASC
LIMIT 5
but the problem is that there are lots of inactive users with small identifier values, and the majority of new users are at the end of the id range. So this query would select some users too often.

Based on the EXPLAIN plan, your table is large: about 2 rows per page. Either it is very bloated, or the rows themselves are very wide.
The key to getting good performance is probably to get it to use an index-only scan, by creating an index which contains all 4 columns referenced in your query. The column tested for equality should come first. After that, you have to choose between your two range-or-inequality queried columns ("last_active" or "rating"), based on whichever you think will be more selective. Then you add the other range-or-inequality and the id column to the end, so that an index-only scan can be used. So maybe create index on app_user (country, last_active, rating, id). That will probably be good enough.
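A minimal sketch of that index, assuming the table is actually named app_user as in the plan:
-- Equality column first, then the presumed-more-selective range column,
-- then the remaining range column and id, enabling an index-only scan
CREATE INDEX app_user_country_active_rating_id_idx
    ON app_user (country, last_active, rating, id);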
You could also try a GiST index on those same columns. This has the theoretical advantage that the two range-or-inequality restrictions can be used together in defining which index pages to look at. But in practice GiST indexes have very high overhead, and this overhead would likely exceed the theoretical benefit.
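If you do want to experiment with GiST anyway, a sketch; indexing plain scalar columns with GiST requires the btree_gist extension:
-- btree_gist lets GiST handle ordinary scalar types
CREATE EXTENSION IF NOT EXISTS btree_gist;
-- Both range restrictions can be combined when choosing index pages
CREATE INDEX app_user_gist_idx
    ON app_user USING gist (country, last_active, rating);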
If the above aren't good enough, you could try partitioning. But how exactly you do that should be based on a holistic view of your application, not just one query.

Related

how to create an index for a non-deterministic function

I have a table with a Date-Of-Birth column. I have defined a function, say FIND_AGE, which takes it as input and returns the age (it uses the system date in its calculations).
I want to optimize a query which returns all records having a certain age, say 30. I understand that we can't use non-deterministic functions (like FIND_AGE) while creating indexes.
Is there still a way I can create an index to optimize the query to fetch all records having age 30?
I would advise you to share the whole query if you have a performance issue. Generally, storing a date and having an index on it is enough to find particular records based on it.
For example, getting users born exactly 30 years before the current date:
-- the cast strips the time part so the equality can match a DATE column
SELECT *
FROM my_table
WHERE dob = CAST(DATEADD(year, -30, GETDATE()) AS date)
If you have billions of records, which is unusual for user data, I can accept that this is the cause of your performance issue.
If not, it would be better to check how the data from this table is read. You may currently have an index on this column which is ignored by the engine because the index is not covering. For example, if you are also reading the first and last names of the users, your index could be:
CREATE INDEX INX_my_table_DOB_I_FirstName_LastName ON my_table
(
DOB
)
INCLUDE (FirstName, LastName);
or, if you are also filtering by country code, the index would be:
CREATE INDEX INX_my_table_DOB_CountryCode_I_FirstName_LastName ON my_table
(
DOB
,CountryCode
)
INCLUDE (FirstName, LastName);
If your users table has many columns or large columns holding text, XML, BLOBs, etc., scanning the table instead of using the index can be the root of your issues.
If your table has an "updatedDate" column, you could take the opportunity to maintain an indexed column "ageAtUpdatedDate" at low cost, at the same time you update the "updatedDate" column: the people having age X now are obviously among the ones having "ageAtUpdatedDate <= X", so you can reduce the size of the data set of people to test for "age = X". But the reduction in data set size will depend on X relative to the histogram of ages of your population, so the improvement will be "random"...
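A rough sketch of that idea in the same dialect as above; the id column, parameter, and index name are assumptions, and FIND_AGE is the function from the question:
-- Index the derived column so the prefilter can seek (names are hypothetical)
CREATE INDEX INX_my_table_AgeAtUpdatedDate ON my_table (ageAtUpdatedDate);
-- Maintain the derived column at the same time updatedDate is set
UPDATE my_table
SET updatedDate = GETDATE(),
    ageAtUpdatedDate = dbo.FIND_AGE(DOB)
WHERE id = @someId;
-- People whose age is exactly 30 now are a subset of ageAtUpdatedDate <= 30,
-- so the indexed prefilter shrinks the set before the exact, non-indexable check
SELECT *
FROM my_table
WHERE ageAtUpdatedDate <= 30
  AND dbo.FIND_AGE(DOB) = 30;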

How does a query with multiple WHERE conditions work in PostgreSQL?

I have a table account_config where I keep key-value configs for accounts with columns:
id - pk
account_id - fk
key
value
The table may have configs for thousands of accounts, but for each account it has 10-20 configs max. I am using this query:
select id, key, value from account_config t where t.account_id = ? and t.key = ?;
I already have an index on the account_id field; do I need another index for the key field here? Will the second filter (key = ?) apply to the already filtered result set (account_id = ?), or will it scan the whole table?
Indexes are used when only a small percentage of the table's rows get accessed and the index helps finding those rows quickly.
You say there are thousands of accounts in your table, each with 10 to 20 rows.
Let's say there are 3,000 accounts and 45,000 rows in your table. Then accessing data via an index on the account ID means that with the index we touch only about 0.03% of the rows to find the ones in question. That makes it extremely likely that the index will be used.
Of course, if there were an index on (account_id, key), that index would be preferred, as we would only have to read one row from the table which the index points to.
So, yes, your index should suffice for the query shown, but if you want to get this faster, then provide the two-column index.
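A sketch of that two-column index in PostgreSQL; the index name is illustrative:
-- Matches both equality filters, so the planner can go straight
-- to the handful of rows for a given (account_id, key) pair
CREATE INDEX account_config_account_id_key_idx
    ON account_config (account_id, key);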

performance of select queries oracle

From millions of records in a table, if I want to select a few records depending on several WHERE conditions, are there any factors to be considered for the sequence of the WHERE conditions?
For example, in a table of students in a state, the WHERE conditions might include the following:
Institute Id,
DeptId (unique for each institute, i.e. a DeptId in one institute cannot be present in another institute.)
Now, if I have to select the list of students in a particular dept, DeptId is enough for the WHERE condition, because that DeptId will be present in only one particular institute.
But will it improve performance if I also include InstituteId before DeptId in the WHERE conditions, so that records can be filtered by institute first, and then by DeptId?
Will the order of the WHERE conditions have an impact on query performance? Thanks.
The order of the WHERE conditions won't have an impact on query performance, because your RDBMS will reorganize them depending on the best indexed columns.
Anyway, if you have indexes on both columns, you should use only DeptId. Otherwise the RDBMS will perform 2 filters instead of 1; in theory it could be slower to use more conditions (depending on the indexes).
But you can try both ways and check the execution time; many things can affect performance (especially with a huge bulk of data), so just test it.
If you create one index covering both columns at the same time, it could be worthwhile to use both conditions (depending on the RDBMS).
Not per se, but it depends on the index: if you have an index on (InstituteID, DeptID), then using only DeptID in the WHERE condition will not use the index and will perform a table scan (there is a lot more to this, but that's the basics). Always try to write a WHERE condition covered by the PK or another index on the table, using every column in that index; if you have an index on a, b and c and a WHERE on a and c, that will not make full use of the index.
The order of the columns in the WHERE condition will be re-organized by the DB to fit the index definition.
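For illustration, a hypothetical composite index (the students table and index names are assumed):
-- Queries filtering on InstituteId, or on InstituteId plus DeptId,
-- can seek into this index; filtering on DeptId alone generally cannot
CREATE INDEX idx_students_inst_dept
    ON students (InstituteId, DeptId);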
It all depends on how your indexes are set up. If you have no indexes on your table, then it doesn't make a bit of difference because every query is going to have to scan the entire table anyway.
http://use-the-index-luke.com is an excellent introduction to indexes and how they work and how they should be set up.

SELECT COUNT(*) with an ORDER BY

Will the following two queries be executed in the same way?
SELECT COUNT(*) from person ORDER BY last_name;
and
SELECT COUNT(*) from person;
Either way they should display the same results, so I was curious if the ORDER BY just gets ignored.
The reason I am asking is because I am displaying a paginated table where I will get 20 records at a time from the database and then firing a second query that counts the total number of records. I want to know if I should use the same criteria that the first query used, or if I should be removing all sorting from the criteria?
According to the execution plan, the two queries are different. For example, the query:
select count(*) from USER
Will give me:
INDEX (FAST FULL SCAN) 3.0 3 453812 3457 1 TPMDBO USER_PK FAST FULL SCAN INDEX (UNIQUE) ANALYZED
As you can see, we hit USER_PK which is the primary key of that table.
If I sort by a non-indexed column:
select count(*) from USER ORDER BY FIRSTNAME --No Index on FIRSTNAME
I'll get:
TABLE ACCESS (FULL) 19.0 19 1124488 3457 24199 1 TPMDBO USER FULL TABLE ANALYZED 1
Meaning we did a full table scan (MUCH higher node cost)
If I sort by the primary key (which is already indexed), Oracle is smart enough to use the index to do that sort:
INDEX (FAST FULL SCAN) 3.0 3 453812 3457 13828 1 TPMDBO USER_PK FAST FULL SCAN INDEX (UNIQUE) ANALYZED
Which looks very similar to the first execution plan.
So, the answer to your question is absolutely not - they are not the same. However, ordering by an index that Oracle is already seeking anyway will probably result in the same query plan.
Of course not. Unless last name is the primary key and you are already ordered by that.
The Oracle query optimizer actually does perform a sort for the first version (I verified this by looking at the explain plan), but since both queries only return one row, the performance difference will be very small.
EDIT:
Mike's answer is correct. The performance difference can possibly be significant.
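For the pagination scenario in the question, the usual pattern is to keep the ORDER BY only on the page query and drop it from the count query. A sketch using Oracle 12c+ row-limiting syntax (page size and offset are illustrative):
-- Page query: sorted, returns one page of 20 rows
SELECT *
FROM person
ORDER BY last_name
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY;
-- Count query: no ORDER BY needed
SELECT COUNT(*) FROM person;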

Indexes, EXPLAIN PLAN, and record access in Oracle SQL

I have been learning about indexes in Oracle SQL, and I wanted to conduct a small experiment with a test table to see how indexes really worked. As I discovered from an earlier post made here, the best way to do this is with EXPLAIN PLAN. However, I am running into something which confuses me.
My sample table contains attributes (EmpID, Fname, Lname, Occupation, ... etc.). I populated it with 500,000 records using a Java program I wrote (random names, occupations, etc.). Now, here are some sample queries with and without indexes:
NO INDEX:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
Now I create index:
CREATE INDEX occupation_idx
ON EMPLOYEE (Occupation);
WITH INDEX "occupation_idx":
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
So... the cost is STILL the same, 1169? Now I try this:
WITH INDEX "occupation_idx":
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
INDEX(RANGE SCAN) TEST.OCCUPATION_IDX ANALYZED 67
So, it appears that the index is only utilized when that column is the only one I'm pulling values from. But I thought that the point of an index was to unlock the entire record using the indexed column as the key? The search above is a pretty pointless one... it searches for values which you already know. The only worthwhile query I can think of which ONLY involves an indexed column's value (and not the rest of the record) would be an aggregate such as COUNT or something.
What am I missing?
Even with your index, Oracle decided to do a full scan for the second query.
Why did it do this? Oracle would have created two plans and come up with a cost for each:-
1) Full scan
2) Index access
Oracle selected the plan with the lower cost. Obviously it came up with the full scan as the lower cost.
If you want to see the cost of the index plan, you can do an explain plan with a hint like this to force the index usage:
SELECT /*+ INDEX(EMPLOYEE occupation_idx) */ Fname
FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
If you do an explain plan on the above, you will see that the cost is greater than the full scan cost. This is why Oracle did not choose to use the index.
A simple way to consider the cost of the index plan is:-
The blevel of the index (how many blocks must be read from top to bottom)
The number of table blocks that must be subsequently read for records matching in the index. This relies on Oracle's estimate of the number of employees that have an occupation of 'DOCTOR'. In your simple example, this would be:
number of rows / number of distinct values
More complicated considerations include the clustering factor and index cost adjustments, which both reflect the likelihood that a block being read is already in memory and hence does not need to be read from disk.
Perhaps you could update your question with the results from your query with the index hint and also the results of this query:-
SELECT COUNT(*), COUNT(DISTINCT( Occupation ))
FROM EMPLOYEE;
This will allow people to comment on the cost of the index plan.
I think I see what's happening here.
When you have the index in place, and you do:
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
The execution plan will use the index. This is a no-brainer, because all the data that's required to satisfy the query is right there in the index, and Oracle never even has to reference the table at all.
However, when you do:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
then, if Oracle uses the index, it will do an INDEX RANGE SCAN followed by a TABLE ACCESS BY ROWID to look up the Fname that corresponds to that Occupation. Now, depending on how many rows have DOCTOR for Occupation, Oracle will have to make one or more trips to the table, to look up the Fname. If, for example, you have a table, and all the employees have Occupation set to 'DOCTOR', the index isn't of much use, and Oracle will simply do a FULL TABLE SCAN of the table. If there are 10,000 employees, and only one is a DOCTOR, then again, it's a no-brainer, and Oracle will use the index.
But there are some subtleties, when you're somewhere between those two extremes. People like to talk about 'selectivity', i.e., how many rows are identified by the index, vs. the size of the table, when discussing whether the index will be used. But, that's not really true. What Oracle really cares about is block selectivity. That is, how many blocks does it have to visit, to satisfy the query? So, first, how "wide" is the RANGE SCAN? The more limited the range of values specified by the predicate values, the better. Second, when your query needs to do table lookups, how many different blocks will it have to visit to find all the data it needs? That is, how "random" is the data in the table relative to the index order? This is called the CLUSTERING_FACTOR. If you analyze the index to collect statistics, and then look at USER_INDEXES, you'll see that the CLUSTERING_FACTOR is now populated.
So, what's CLUSTERING_FACTOR? CLUSTERING_FACTOR is the "orderedness" of the table, with respect to the index's key column(s). The value of CLUSTERING_FACTOR will always be between the number of blocks in a table and the number of rows in a table. A low CLUSTERING_FACTOR, that is, one that is very near to the number of blocks in the table, indicates a table that's very ordered, relative to the index. A high CLUSTERING_FACTOR, that is, one that is very near to the number of rows in the table, is very unordered, relative to the index.
It's an important concept to understand that the CLUSTERING_FACTOR describes the order of data in the table relative to the index. So, rebuilding an index, for example, will not change the CLUSTERING_FACTOR. It's also important to understand that the same table could have two indexes, and one could have an excellent CLUSTERING_FACTOR, and the other could have an extremely poor CLUSTERING_FACTOR. The table itself can only be ordered in one way.
So, why have I spent so much time describing CLUSTERING_FACTOR? Because when you have an execution plan that does an INDEX RANGE SCAN followed by TABLE ACCESS BY ROWID, you can be sure that the CLUSTERING_FACTOR has been considered by Oracle's optimizer, to come up with the execution plan. For example, suppose you have a 10,000 row table, and suppose 100 of the rows have Occupation = 'DOCTOR'. You write the query above, asking for the Fname of the employees whose occupation is DOCTOR. Well, Oracle can very easily and efficiently determine the rowids of the rows where occupation is DOCTOR. But, how many table blocks will Oracle need to visit, to do the Fname lookup? It could be only 1 or 2 table blocks, if the data is clustered (ordered) by Occupation in the table. But, it could be as many as 100, if the data is very unordered in the table! So, again, 10,000 row table, and, let's assume, (for the purposes of illustration and simple math) that the table has 100 rows/block, and so, 100 blocks. Depending on table order (i.e. CLUSTERING_FACTOR), the number of table block visits could be as few as 1, or as many as 100.
So, I hope this helps you understand why the optimizer may be reluctant to use an index in some cases.
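To see these numbers for yourself, a sketch (Oracle; the index name comes from the question, and EXEC assumes SQL*Plus):
-- Gather index statistics, then inspect blevel and clustering factor
EXEC DBMS_STATS.GATHER_INDEX_STATS(USER, 'OCCUPATION_IDX');
SELECT index_name, blevel, clustering_factor
FROM user_indexes
WHERE index_name = 'OCCUPATION_IDX';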
An index is a copy of the table which stores only the following data:
Indexed field(s)
A pointer to the original row (rowid).
Say you have a table like this:
rowid id name occupation
[1] 1 John clerk
[2] 2 Jim manager
[3] 3 Jane boss
Then an index on occupation would look like this:
occupation rowid
boss [3]
clerk [1]
manager [2]
The records are sorted on occupation in a B-Tree.
As you can see, if you only select the indexed fields, you only need the index (the second table).
If you select anything other than occupation:
SELECT *
FROM mytable
WHERE occupation = 'clerk'
then the engine has to do two things: first, find the relevant records in the index; second, find the records in the original table by rowid. It's as if you joined the two tables on rowid.
Since the rowids in the index are not in order, the reads to the original table are not sequential and can be slow. It may be faster to read the original table in sequential order and just filter the records with occupation = 'clerk'.
The engine does not "unlock" the records: it just finds the rowid in the index, and if there are not enough data in the index itself, it looks up data in the original table by the rowid found.
As a WAG: analyze the table and the index, then see if the plan changes.
When you are selecting just the occupation, the entire query can be satisfied from the index. The index literally has a copy of the occupation. The moment you add an additional column to the select, Oracle has to go to the data record to get it. The optimizer chooses to read all of the data rows instead of all of the index rows plus the data rows. It's cheaper.
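One common fix, sketched as a hypothetical composite index using the names from the question:
-- With Fname added to the index, the Fname query can be answered
-- from the index alone, with no table access at all
CREATE INDEX occupation_fname_idx
    ON EMPLOYEE (Occupation, Fname);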