Say I have a table with a thousand users and 50 million user_actions. A few users have more than a million actions but most have thousands.
CREATE TABLE users (id, name)
CREATE TABLE user_actions (id, user_id, created_at)
CREATE INDEX index_user_actions_on_user_id ON user_actions(user_id)
Querying user_actions by user_id is fast, using the index.
SELECT *
FROM user_actions
WHERE user_id = ?
LIMIT 1
But I'd like to know the last action by a user.
SELECT *
FROM user_actions
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 1
This query throws out the index and does a table scan, running backwards until it finds an action. That's not a problem for users who have been active recently, but it's far too slow for users who haven't.
Is there a way to tune this index so Postgres keeps track of the last action by each user? (For bonus points, the last N actions!)
Or are there suggested alternate strategies? I suppose a materialized view over a window function would do the trick.
Create an index on (user_id, created_at)
This will allow PostgreSQL to do an index scan to locate the first record.
This is one of the cases where multi-column indexes make a big difference.
Note that user_id comes first because that allows us to efficiently narrow the index to the sub-portion we are interested in; from there it is just a quick traversal to the most recent created_at value, provided there aren't a lot of dead rows in that area.
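A minimal sketch against the schema above (the index name is arbitrary); PostgreSQL can walk the index backwards, so no DESC modifier is needed:
CREATE INDEX index_user_actions_on_user_id_and_created_at
    ON user_actions (user_id, created_at);
-- last action by a user
SELECT *
FROM user_actions
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 1;
-- bonus: the last N actions, e.g. N = 10
SELECT *
FROM user_actions
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 10;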
I have a table user with ten million rows. It has the fields: id int4 primary key, rating int4, country varchar(32), last_active timestamp. There are gaps in the identifiers.
The task is to select five random users for a given country who were active within the last two days and have a rating in a given range.
Is there a tricky way to select them faster than the query below?
SELECT id
FROM user
WHERE last_active > '2020-04-07'
AND rating between 200 AND 280
AND country = 'US'
ORDER BY random()
LIMIT 5
I thought about this query:
SELECT id
FROM user
WHERE last_active > '2020-04-07'
AND rating between 200 AND 280
AND country = 'US'
AND id > (SELECT random()*max(id) FROM user)
ORDER BY id ASC
LIMIT 5
but the problem is that there are lots of inactive users with small identifier values, while the majority of new users are at the end of the id range. So this query would select some users too often.
Based on the EXPLAIN plan, your table is large. About 2 rows per page. Either it is very bloated, or the rows themselves are very wide.
The key to getting good performance is probably to get it to use an index-only scan, by creating an index which contains all 4 columns referenced in your query. The column tested for equality should come first. After that, you have to choose between your two range-or-inequality queried columns ("last_active" or "rating"), based on whichever you think will be more selective. Then you add the other range-or-inequality and the id column to the end, so that an index-only scan can be used. So maybe create index on app_user (country, last_active, rating, id). That will probably be good enough.
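For example, assuming last_active is judged the more selective of the two (swap it with rating if not), the index might look like this (the name is arbitrary):
CREATE INDEX ix_app_user_country_active_rating_id
    ON app_user (country, last_active, rating, id);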
You could also try a GiST index on those same columns. This has the theoretical advantage that the two range-or-inequality restrictions can be used together in defining which index pages to look at. But in practice GiST indexes have very high overhead, and this overhead would likely exceed the theoretical benefit.
If the above aren't good enough, you could try partitioning. But how exactly you do that should be based on a holistic view of your application, not just one query.
I'm using will_paginate to get the top 10-20 rows from a table, but I've found that the simple query it produces is scanning the entire table.
sqlite> explain query plan
SELECT "deals".* FROM "deals" ORDER BY created_at DESC LIMIT 10 OFFSET 0;
0|0|0|SCAN TABLE deals (~1000000 rows)
0|0|0|USE TEMP B-TREE FOR ORDER BY
If I were using a WITH clause and indexes, I'm sure it would be different, but this is just displaying the newest posts on the top page of the site. I did find a post or two on here that suggested adding indexes anyway, but I don't see how they can help with the table scan.
sqlite> explain query plan
SELECT deals.id FROM deals ORDER BY id DESC LIMIT 10 OFFSET 0;
0|0|0|SCAN TABLE deals USING INTEGER PRIMARY KEY (~1000000 rows)
It seems like a common use case, so how is it typically done efficiently?
The ORDER BY created_at DESC requires the database to search for the largest values in the entire table.
To speed up this search, you need an index on the created_at column; the newest rows can then be read directly from the end of the index.
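A minimal sketch, assuming the deals table from the plan above (the index name is arbitrary):
CREATE INDEX index_deals_on_created_at ON deals (created_at);
With that index in place, EXPLAIN QUERY PLAN should report a scan using the index rather than a full table scan plus a temporary B-tree for the ORDER BY.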
I have a table in an Oracle database called my_table, for example. It is a log-type table. It has an incremental column named "id" and a "registration_number" which is unique to each registered user. Now I want to get the latest change for each registered user, so I wrote the queries below to accomplish this task:
First version:
SELECT t.*
FROM my_table t
WHERE t.id = (SELECT MAX(id)
              FROM my_table t_m
              WHERE t_m.registration_number = t.registration_number);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN (SELECT MAX(id) m_id
            FROM my_table
            GROUP BY registration_number) t_m
  ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? And the second: sometimes there are about 70,000 inserts into this table, but mostly the number of inserted rows varies between 0 and 2,000; is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
To check which query is faster, look at the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions make the query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning it and using local indexes. They will definitely help you write faster queries.
When you insert lots of rows, indexes slow down the insertion because the index also has to be updated with each insert (so I would not recommend an index on id). There are two solutions I can think of for this (a minimal sketch of both follows):
You can drop the index before the insertion and then recreate it after the insertion.
Use a reverse key index. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your queries a bit, so there is a trade-off.
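For example (the index names below are placeholders, not taken from the question):
-- 1) drop before a bulk load, recreate afterwards
DROP INDEX ix_my_table_reg_num;
-- ... bulk insert here ...
CREATE INDEX ix_my_table_reg_num ON my_table (registration_number);
-- 2) reverse key index, which spreads sequentially increasing values across leaf blocks
CREATE INDEX ix_my_table_id ON my_table (id) REVERSE;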
If you are looking for a faster solution and there is a real need to maintain the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
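Note that the UPDATE in write_log affects no rows for a registration_number that does not yet exist in last_user_action. A MERGE (shown only as a sketch, reusing the same parameter and variable names) could replace that UPDATE inside the procedure:
merge into last_user_action lua
using (select p_reg_num as registration_number from dual) src
on (lua.registration_number = src.registration_number)
when matched then
  update set lua.last_action = v_row_id
when not matched then
  insert (registration_number, last_action)
  values (src.registration_number, v_row_id);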
With such a schema you can simply query the last action for every user with good performance:
select
  lua.registration_number,
  l.*
from
  last_user_action lua,
  my_log l
where
  l.rowid (+) = lua.last_action
A rowid is a physical storage identifier that directly addresses a storage block, so you can't rely on it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to also add the id column from the my_log table to last_user_action, and use one or the other depending on requirements.
I have a little problem.
I have 2 tables:
events and multimedia.
The events table has the fields:
id
device_id
created_at
The primary key is id and there's an index on the device_id and created_at fields.
The multimedia table has the following fields:
id
device_id
created_at
data (this field is a blob field and contains a 20k string)
The primary key is id and there's an index on the device_id and created_by fields.
The problem comes when I want to delete the records with created_at before a given date.
the query:
DELETE FROM events WHERE device_id = #{dev[0]}
AND created_at <= '#{mm_critical_time.to_s}'
is OK; it deletes the records in 5 or 6 seconds.
The query
DELETE FROM multimedia WHERE device_id = #{dev[0]}
AND created_at <= '#{mm_critical_time.to_s}'
gives me some problems: the execution starts and never finishes.
What's the problem?
You probably need to create an index for the columns you are searching.
CREATE INDEX device_created_index
ON multimedia (device_id, created_at);
If you want to learn more about optimizing your queries, refer to the answer I gave here about using EXPLAIN SELECT: is there better way to do these mysql queries?
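As a quick check (the literal values below are placeholders, not from the question), you can run EXPLAIN on the equivalent SELECT to confirm the new index is used:
EXPLAIN SELECT *
FROM multimedia
WHERE device_id = 1                        -- placeholder for #{dev[0]}
  AND created_at <= '2013-01-01 00:00:00'; -- placeholder for #{mm_critical_time}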
The order of the conditions can be important. You haven't told us your database server, but at least in Oracle it is, so try reversing them like:
DELETE FROM multimedia WHERE
created_at <= '#{mm_critical_time.to_s}'
AND device_id = #{dev[0]}
or use an inner query on the fastest part:
DELETE FROM multimedia WHERE
created_at <= '#{mm_critical_time.to_s}'
AND device_id in (select device_id from multimedia where device_id = #{dev[0]})
Also, I always break slow queries up and test the parts for speed so that you know where the bottleneck is.
Some programs show you how long a query took, and in Ruby you could use Benchmark; you can substitute a SELECT for the DELETE while testing.
So test:
select * FROM multimedia WHERE created_at <= '#{mm_critical_time.to_s}'
and
select * from multimedia WHERE device_id = #{dev[0]}
It is quite naive to give solutions to performance problems in relational databases without knowing the whole story, since there are many variables involved.
For the data you provided, though, I would suggest you drop the primary keys and indexes and run:
CREATE UNIQUE CLUSTERED INDEX uc ON events (device_id, created_at);
CREATE UNIQUE CLUSTERED INDEX uc ON multimedia (device_id, created_at);
If you really need to enforce the uniqueness of the id field, create one unique nonclustered index for this column on each table (but it will cause the delete command to consume more time):
CREATE UNIQUE INDEX ix_id ON events (id);
CREATE UNIQUE INDEX ix_id ON multimedia (id);
Suppose I have a 2 column table (id, flag) and id is sequential.
I expect this table to contain a lot of records.
I want to periodically select the first row not flagged and update it. Some of the records on the way may have already been flagged, so I want to skip them.
Does it make more sense if I store the last id I flagged and use it in my select statement, like
select * from mytable where id > my_last_id order by id asc limit 1
or simply get the first unflagged row, like:
select * from mytable where flagged = 'F' order by id asc limit 1
Thank you!
If you create an index on flagged, retrieving an unflagged row should be pretty much an instant operation. If you always update them sequentially, then the first method is fine though.
Option two is the only one that makes sense unless you know that you're always going to process records in sequence!
Assuming MySQL, this one:
SELECT *
FROM mytable
WHERE flagged = 'F'
ORDER BY
flagged ASC, id ASC
LIMIT 1
will be slightly less efficient in InnoDB and of the same efficiency in MyISAM, if you have an index on (flagged, id).
InnoDB tables are clustered on the PRIMARY KEY, so fetching the first record in id order does not require a lookup of the table.
In MyISAM, tables are heap-organized, so the index used to police the PRIMARY KEY is stored separately from the table.
Note that the flagged in the ORDER BY clause may seem redundant, but it is required for MySQL to pick the correct index.
Also, the composite index should be on (flagged, id) even in InnoDB (which implicitly includes the PRIMARY KEY into each index).
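For reference, a sketch of that composite index (the name is arbitrary):
CREATE INDEX ix_mytable_flagged_id ON mytable (flagged, id);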
You could use
Select Min(Id) as 'Id'
From dbo.myTable
Where Flagged='F'
Assuming Flagged = 'F' means the row is not flagged.