Delete SQL query is very, very slow - sql

I have a little problem.
I have 2 tables: events and multimedia.
events has the following fields:
id
device_id
created_at
The primary key is id and there is an index on (device_id, created_at).
The multimedia table has the following fields:
id
device_id
created_at
data (a blob field containing a ~20k string)
The primary key is id and there is an index on (device_id, created_by).
The problem is when I want to delete the records with created_at before a given date.
The query:
DELETE FROM events WHERE device_id = #{dev[0]}
AND created_at <= '#{mm_critical_time.to_s}'
is OK; it deletes the records in 5 or 6 seconds.
The query
DELETE FROM multimedia WHERE device_id = #{dev[0]}
AND created_at <= '#{mm_critical_time.to_s}'
gives me problems: the execution starts and never finishes.
What is the problem?

You probably need to create an index for the columns you are searching.
CREATE INDEX device_created_index
ON multimedia (device_id, created_at);
If you want to learn more about optimizing your queries, refer to the answer I gave here about using EXPLAIN SELECT: is there better way to do these mysql queries?
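As a quick check (assuming MySQL, and with placeholder values in place of the interpolated #{dev[0]} and #{mm_critical_time}), you can run EXPLAIN on the equivalent SELECT to see whether the new index is actually used:
EXPLAIN SELECT id
FROM multimedia
WHERE device_id = 42
  AND created_at <= '2012-01-01 00:00:00';
If the key column of the output shows device_created_index, the DELETE should be able to use it as well.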

The order of the conditions may be important. You haven't told us which database server you use, but at least in Oracle it can matter, so try reversing them:
DELETE FROM multimedia WHERE
created_at <= '#{mm_critical_time.to_s}'
AND device_id = #{dev[0]}
or use an inner query on the fastest part:
DELETE FROM multimedia WHERE
created_at <= '#{mm_critical_time.to_s}'
AND device_id in (select device_id from multimedia where device_id = #{dev[0]})
Also, I always break slow queries up and test the parts for speed, so that you know where the bottleneck is.
Some tools show you how long a query took, and in Ruby you could use Benchmark; you can substitute a SELECT for the DELETE while testing.
So test:
select * FROM multimedia WHERE created_at <= '#{mm_critical_time.to_s}'
and
select * from multimedia WHERE device_id = #{dev[0]}
Success!

It is quite naive to give solutions to performance problems in relational databases without knowing the whole story, since there are many variables involved.
For the data you provided, though, I would suggest that you drop the primary keys and indexes and run:
CREATE UNIQUE CLUSTERED INDEX uc ON events (device_id, created_at);
CREATE UNIQUE CLUSTERED INDEX uc ON multimedia (device_id, created_at);
If you really need to enforce the uniqueness of the id field, create one unique nonclustered index for this column on each table (but it will cause the delete command to consume more time):
CREATE UNIQUE INDEX ix_id ON events (id);
CREATE UNIQUE INDEX ix_id ON multimedia (id);

Related

best way to store time series data with an identifier in sqlite3

Let's say there are a number of different sensors, all of which save data in a database as they measure it, and each sensor can have many entries. I'm looking for the best way to save this data so that SELECT queries can be as fast as possible later. Something like
"CREATE TABLE IF NOT EXISTS DataTable (sensor_id TEXT, measured_value REAL, time_of_measuring REAL)"
could basically work, but I imagine this wouldn't be very fast for selecting. I know about primary keys, but they prevent duplicates, so I can't just put sensor_id as a primary key. I'm basically looking for the SQLite equivalent of saving data like this, but in a single table, with one measurement being one row:
data = {"sensor1":[x1,x2,x3], "sensor2":[z1,z2,z3]...}
I imagine something like the statement below would work for inserting more than a single value for each sensor, but would that help at all with selecting?
"CREATE TABLE IF NOT EXISTS DataTable (sensor_id TEXT NOT NULL, measured_value REAL, time_of_measuring REAL NOT NULL, PRIMARY KEY(sensor_id, time_of_measuring ))"
For this time-series data, the relevant primary (or unique) key is probably (time_of_measuring, sensor_id). This is close to what you suggested at the end of your question, but the columns are in reverse order.
Technically, this prevents a sensor from logging two measurements at the same point in time, which seems like a relevant business rule for your data.
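A minimal sketch of that definition, reusing the names from the question:
CREATE TABLE IF NOT EXISTS DataTable (
    sensor_id TEXT NOT NULL,
    measured_value REAL,
    time_of_measuring REAL NOT NULL,
    PRIMARY KEY (time_of_measuring, sensor_id)
);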
When it comes to the speed of queries: it highly depends on the queries themselves. Say that you have a query like:
select sensor_id, measured_val, time_of_measuring
from data_table
where
sensor_id = ?
and time_of_measuring >= ?
and time_of_measuring < ?
order by sensor_id, time_of_measuring
This query would take advantage of the primary key index, since the columns are the same as those of the where and order by clauses. You could add the measured_val to the index to make the query even more efficient:
create index data_table_idx1
on data_table(sensor_id, time_of_measuring, measured_val);
As another example, consider this where clause:
where time_of_measuring >= ? and time_of_measuring < ?
No predicate on sensor_id, but time_of_measuring is the first column in the index, so the primary key index can be used.
As typical counter-examples, the following where clauses would not benefit from the index:
where sensor_id = ? -- need an index where `sensor_id` is first
where sensor_id = ? and measured_val >= ? -- needs an index on "(sensor_id, measured_val)"
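If those access patterns matter too, they would need their own index, for example (a sketch, with an assumed index name and the answer's table name):
-- covers equality on sensor_id alone, and equality on sensor_id plus a range on measured_val
create index data_table_idx2
on data_table(sensor_id, measured_val);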

Postgres index most recent by foreign key

Say I have a table with a thousand users and 50 million user_actions. A few users have more than a million actions but most have thousands.
CREATE TABLE users (id, name)
CREATE TABLE user_actions (id, user_id, created_at)
CREATE INDEX index_user_actions_on_user_id ON user_actions(user_id)
Querying user_actions by user_id is fast, using the index.
SELECT *
FROM user_actions
WHERE user_id = ?
LIMIT 1
But I'd like to know the last action by a user.
SELECT *
FROM user_actions
WHERE user_id = ?
ORDER BY created_at DESC
LIMIT 1
This query throws out the index and does a table scan, backwards until it finds an action. Not a problem for users that have been active recently, too slow for users that haven't.
Is there a way to tune this index so postgres keeps track of the last action by each user? (For bonus points the last N actions!)
Or, are there suggested alternate strategies? I suppose a materialized view over a window function would do the trick.
Create an index on (user_id, created_at)
This will allow PostgreSQL to do an index scan to locate the first record.
This is one of the cases where multi-column indexes make a big difference.
Note we put user_id first because that allows us to efficiently select the sub-portion of the index we are interested in; from there it is just a quick traversal to get the most recent created_at date, provided there are not a lot of dead rows in the area.
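A sketch of that index, with a name chosen to follow the question's naming convention:
CREATE INDEX index_user_actions_on_user_id_created_at
ON user_actions (user_id, created_at);
With this in place, the ORDER BY created_at DESC LIMIT 1 query can read the last matching index entry for the given user instead of scanning the table backwards.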

Cassandra modeling with key and indexes

I have a table of "Users"; each user has many "Projects" and each project has many "Clients", so it's many-to-many, and I keep track of client events in a different table.
The problem is that I can't figure out how to choose the key and the index so that the queries will have the best performance.
The table with Key:
CREATE TABLE project_clients_events(
id timeuuid,
user_id int,
project_id int,
client_id text,
event text,
PRIMARY KEY ((user_id, project_id), id, client_id)
);
Now there will be more than 100K events per (user_id, project_id), so I need to be able to paginate through the results:
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
How can I group the results and paginate?
Thanks!
Let me answer your question in two parts: first the pagination and then the partition key.
The Cassandra CQL driver supports automatic paging now, so you need not worry about designing a complex WHERE clause.
Statement stmt = new SimpleStatement("SELECT * FROM images");
stmt.setFetchSize(100);
ResultSet rs = session.execute(stmt);
// Iterate over the ResultSet here
This link will be helpful :
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
Deciding the partition key depends on the queries you may have. For example, if most of your queries use user_id and project_id (i.e. most of your queries fetch results based only on user_id and project_id), then it's better to have them as part of the partition key, as all those results will be placed in the same partition (on the same node) and fetched together.
Hence I would advise you to first decide the queries and select your partition key accordingly, as your performance will depend on what the queries are versus how the columns are stored in Cassandra.
This could help you http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (slides 45-70)
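As a rough sketch (using the table above, with placeholder values), a query that restricts on the full partition key is exactly the kind the driver can page through efficiently:
SELECT id, client_id, event
FROM project_clients_events
WHERE user_id = 123 AND project_id = 456;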

Selecting the most optimal query

I have a table in an Oracle database which is called my_table, for example. It is a kind of log table. It has an incremental column named "id" and a "registration_number" column which is unique per registered user. Now I want to get the latest changes for registered users, so I wrote the queries below to accomplish this task:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? And the second one: if sometimes there are about 70,000 inserts to this table, but mostly the number of inserted rows varies between 0 and 2,000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
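To check the plans, a sketch of the usual Oracle approach (shown here for the first version of the query; repeat for whichever variant you are comparing):
EXPLAIN PLAN FOR
SELECT t.*
FROM my_table t
WHERE t.id =
  (SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
  );

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);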
In order to check which query is faster, you should check the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytics make the query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning the table and using local indexes. They will definitely help you form faster queries.
In cases where you want to insert lots of rows, indexes slow down insertion, as with each insertion the index also has to be updated (I would not recommend an index on id). There are 2 solutions I can think of for this:
You can drop the index before insertion and then recreate it after insertion.
Use reverse key indexes (see the sketch after this list). Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your queries a bit, so there will be a trade-off.
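A sketch of the second option (Oracle syntax, with an assumed index name):
CREATE INDEX ix_my_table_id ON my_table (id) REVERSE;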
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (demo only, not checked for syntax validity; sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last actions for every user with good performance:
select *
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
A rowid is a physical storage identifier that directly addresses a storage block, and you can't rely on it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to add the id column from the my_log table to last_user_action too, and use one or the other depending on requirements.

Smart choice for primary key and clustered index on a table in SQL 2005 to boost performance of selecting single record or multiple records

EDIT: I have added "Slug" column to address performance issues on specific record selection.
I have following columns in my table.
Id Int - Primary key (identity, clustered by default)
Slug varchar(100)
...
EntryDate DateTime
Most of the time, I'm ordering the select statement by EntryDate like below.
Select T.Id, T.Slug, ..., T.EntryDate
From (
Select Id, Slug, ..., EntryDate,
Row_Number() Over (Order By EntryDate Desc, Id Desc) AS RowNum
From TableName
Where ...
) As T
Where T.RowNum Between ... And ...
I'm ordering it by EntryDate and Id in case there are duplicate EntryDates.
When I'm selecting a single record, I do the following.
Select Id, Slug, ..., EntryDate
From TableName
Where Slug = #slug And Year(EntryDate) = #entryYear
And Month(EntryDate) = #entryMonth
I have a unique key of Slug & EntryDate.
What would be a smart choice of keys and indexes in my situation? I'm facing performance issues, probably because I'm ordering by a column that doesn't have a clustered index.
Should I have Id set as non-clustered primary key and EntryDate as clustered index?
I appreciate all your help. Thanks.
EDIT:
I haven't tried adding a non-clustered index on EntryDate. Data is inserted from the back end, so insert performance isn't a big deal for me. Also, EntryDate is not always the date when the row is inserted; it can be a past date. The back-end user picks the date.
Based on the current table layout you want some indexes like this.
CREATE INDEX IX_YourTable_1 ON dbo.YourTable
(EntryDate, Id)
INCLUDE (SLug)
WITH (FILLFACTOR=90)
CREATE INDEX IX_YourTable_2 ON dbo.YourTable
(EntryDate, Slug)
INCLUDE (Id)
WITH (FILLFACTOR=80)
Add any other columns you are returning to the INCLUDE line.
Change your second query to something like this.
Select Id, Slug, ..., EntryDate
From TableName
Where Slug = #slug
AND EntryDate BETWEEN CAST(CAST(#EntryYear AS VARCHAR(4)) + RIGHT('0' + CAST(#EntryMonth AS VARCHAR(2)), 2) + '01' AS DATE)
                  AND DATEADD(mm, 1, CAST(CAST(#EntryYear AS VARCHAR(4)) + RIGHT('0' + CAST(#EntryMonth AS VARCHAR(2)), 2) + '01' AS DATE))
The way your second query is currently written, the index will never be used. If you can move the Slug column to a related table, it will increase your performance and decrease your storage requirements.
Have you tried simply adding a non-clustered index on the entrydate to see what kind of performance gain you get?
Also, how often is new data added? And will new data that is added always be >= the last EntryDate?
You want to keep ID as a clustered index, as you will most likely join to the table off your id, and not entry date.
A simple non-clustered index with just the date field would be fine to speed things up.
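A sketch of such an index (table name assumed, matching the TableName placeholder from the question; DESC matches the ORDER BY in the paging query):
CREATE NONCLUSTERED INDEX IX_TableName_EntryDate
ON dbo.TableName (EntryDate DESC);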
Clustering is a bit like "index paging": the index is "chunked" instead of simply being a long list. This is helpful when you've got a lot of data. The DB can search within cluster ranges, then find the individual record. It makes the index smaller, and therefore faster to search, but less specific. Once it finds the correct spot in the cluster, it then needs to search within the cluster.
It's faster with a lot of data, but slower with smaller data sets.
If you're not searching a lot using the primary key, then cluster the date and leave the primary key non-clustered. It really depends on how complex your queries are with joining other tables.
A clustered index will only make any difference at all if you are returning a bunch of records and some of the fields you return are not part of the index. Otherwise there's no benefit.
You first need to find out what the query plan tells you about why your current queries are slow. Without that, it's mostly idle speculation (which is usually counterproductive when optimizing queries).
I wouldn't try anything (suggested by me or anyone else) without having a solid query plan to compare against, to at least know if you're doing good or harm.