Which one is a faster/better SQL practice?

Suppose I have a 2 column table (id, flag) and id is sequential.
I expect this table to contain a lot of records.
I want to periodically select the first row not flagged and update it. Some of the records on the way may have already been flagged, so I want to skip them.
Does it make more sense if I store the last id I flagged and use it in my select statement, like
select * from mytable where id > my_last_id order by id asc limit 1
or simply get the first unflagged row, like:
select * from mytable where flagged = 'F' order by id asc limit 1
Thank you!

If you create an index on flagged, retrieving an unflagged row should be pretty much an instant operation. If you always update them sequentially, then the first method is fine though.
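A minimal sketch of that index, assuming MySQL and the two-column table from the question (the index name is made up):
-- an index on flagged lets the lookup avoid a full table scan
CREATE INDEX idx_mytable_flagged ON mytable (flagged);
-- with the index in place, this should find the first unflagged row quickly
SELECT * FROM mytable WHERE flagged = 'F' ORDER BY id ASC LIMIT 1;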

Option two is the only one that makes sense unless you know that you're always going to process records in sequence!

Assuming MySQL, this one:
SELECT *
FROM mytable
WHERE flagged = 'F'
ORDER BY
flagged ASC, id ASC
LIMIT 1
will be slightly less efficient in InnoDB and equally efficient in MyISAM, provided you have an index on (flagged, id).
InnoDB tables are clustered on the PRIMARY KEY, so fetching the first record in id does not require looking up the table.
In MyISAM, tables are heap-organized, so the index used to police the PRIMARY KEY is stored separately from the table.
Note that the flagged in the ORDER BY clause may seem redundant, but it is required for MySQL to pick the correct index.
Also, the composite index should be on (flagged, id) even in InnoDB (which implicitly includes the PRIMARY KEY into each index).
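For reference, a sketch of the composite index this answer calls for (the index name is made up):
CREATE INDEX ix_mytable_flagged_id ON mytable (flagged, id);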

You could use
Select Min(Id) as 'Id'
From dbo.myTable
Where Flagged='F'
Assuming the Flagged = 'F' means that it is not flagged.
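To fetch the whole row rather than just the id, the MIN can feed a subquery; a sketch using the same assumed names:
SELECT *
FROM dbo.myTable
WHERE Id = (SELECT MIN(Id) FROM dbo.myTable WHERE Flagged = 'F');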

Related

Auto Increment selection SQLite

I have a column named id in my SQLite database which is auto-increment, Primary Key, Unique.
Is the result of the following query guaranteed to be the smallest value of id in the database and does this correspond to the "oldest" (as in a FIFO) row to be inserted?
SELECT id FROM table LIMIT 1
The SQLite documentation is quite explicit:
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined. Or, if a SELECT statement does have an ORDER BY clause, then the list of expressions attached to the ORDER BY determine the order in which rows are returned to the user.
The LIMIT is applied after an ORDER BY would be, so I don't think it affects the application of this statement.
Hence, if you want the first row, use ORDER BY:
SELECT id
FROM table
ORDER BY id
LIMIT 1;
Note that if id is a primary key, this will add basically no overhead.
I should emphasize that in practice you are probably going to get the smallest id without the ORDER BY. However, it is a really, really bad idea to depend on behavior that directly contradicts the documentation.
Is the result of the following query guaranteed to be the smallest value of id in the database
Yes. However, if the table is empty or the id column contains NULLs, it could also return NULL.
and does this correspond to the "oldest" (as in a FIFO) row to be inserted?
No, there's no guarantee of that.
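If insertion order really matters, a common workaround (an illustration only; the table and column names here are made up, not from the question) is an explicit timestamp column:
-- created_at records insertion time explicitly; id breaks ties
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
payload TEXT
);
SELECT id FROM events ORDER BY created_at, id LIMIT 1;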

Selecting the optimal query

I have a table in an Oracle database which is called my_table, for example. It is a log-type table. It has an incremental column named "id" and a "registration_number" which is unique to each registered user. Now I want to get the latest changes for registered users, so I wrote the queries below to accomplish this task:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? And the second: given that there are sometimes about 70,000 inserts into this table, but mostly the number of inserted rows varies between 0 and 2,000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
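If you do experiment, a composite index on (registration_number, id) is worth trying, since it can satisfy the MAX(id)-per-user lookup from the index alone; a hypothetical sketch (the index name is made up):
CREATE INDEX my_table_reg_id_ix ON my_table (registration_number, id);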
To check which query is faster, you should look at the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions make the query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning the table and using local indexes. They will definitely help you get faster queries.
In cases where you want to insert lots of rows, indexes slow insertion down, since with each insert the index also has to be updated (I would not recommend an index on id). There are two solutions I can think of for this, sketched after this list:
You can drop the index before the insertion and then recreate it after the insertion.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your queries a bit, so there will be a trade-off.
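A hypothetical sketch of both options (the index names are made up):
-- option 1: drop the index, bulk insert, then rebuild it
DROP INDEX my_table_reg_ix;
-- ... perform the bulk insert here ...
CREATE INDEX my_table_reg_ix ON my_table (registration_number);
-- option 2: a reverse key index, which scatters sequential values
-- across leaf blocks and reduces insert contention
CREATE INDEX my_table_id_rix ON my_table (id) REVERSE;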
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
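A quick usage sketch (note the demo procedure assumes a last_user_action row already exists for the user; swapping the UPDATE for a MERGE would turn it into an upsert):
begin
write_log(12345, 'LOGIN');
end;
/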
With such a schema you can simply query the last action for every user with good performance:
select
lua.registration_number, l.action_id
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
A rowid is a physical storage identity that directly addresses a storage block, so you can't use it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to add the id column from the my_log table to last_user_action too, and use one or the other depending on requirements.

SQL: Remove rows whose associations are broken (orphaned data)

I have a table called "downloads" with two foreign key columns -- "user_id" and "item_id". I need to select all rows from that table and remove the rows where the User or the Item in question no longer exists. (Look up the User and if it's not found, delete the row in "downloads", then look up the Item and if it's not found, delete the row in "downloads").
It's 3.4 million rows, so all my scripted solutions have been taking 6+ hours. I'm hoping there's a faster, SQL-only way to do this?
Use two anti-joins and OR them together:
delete from your_table
where user_id not in (select id from users_table)
or item_id not in (select id from items_table)
Once that's done, consider adding two foreign keys, each with an ON DELETE CASCADE clause; they'll do this cleanup for you automatically. A sketch follows.
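A hypothetical sketch of those foreign keys, assuming the parent tables are named users and items with id primary keys:
ALTER TABLE downloads
ADD CONSTRAINT fk_downloads_user
FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE CASCADE;
ALTER TABLE downloads
ADD CONSTRAINT fk_downloads_item
FOREIGN KEY (item_id) REFERENCES items (id) ON DELETE CASCADE;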
delete from your_table where user_id not in (select id from users_table) or item_id not in (select id from items_table)
I don't think there is a faster solution when there are so many rows; on your server that works out to about 157 rows per second. You could check the user_id, and if mysql_num_rows returns 0, delete the download row, then check the item_id the same way. There was also a similar question about the performance of mysql_num_rows:
MySQL: Fastest way to count number of rows
Edit: I think the best option is to create triggers so the database server does the job for you. For the first pass I would currently use a cron job.
For future reference: for these kinds of long operations, it is possible to optimise the server independently of the SQL. For example, detach the SQL service, defrag the system disk, and, if you can, ensure the SQL log files are on a separate disk drive from the drive the database is on.
This will at least reduce the pain of these kinds of long operations.
I've found that in SQL 2008 R2, if your "in" clause contains a null value (perhaps from a table that has a nullable reference to this key), no records will be returned! To correct this, just add a clause to your selects in the union part:
delete from SomeTable where Key not in (
select SomeTableKey from TableB where SomeTableKey is not null
union
select SomeTableKey from TableC where SomeTableKey is not null
)
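An alternative that sidesteps the NULL pitfall entirely is NOT EXISTS; a sketch against the same hypothetical tables:
DELETE FROM SomeTable
WHERE NOT EXISTS (SELECT 1 FROM TableB WHERE TableB.SomeTableKey = SomeTable.Key)
AND NOT EXISTS (SELECT 1 FROM TableC WHERE TableC.SomeTableKey = SomeTable.Key);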

SQL get last rows in table WITHOUT primary ID

I have a table with 800,000 entries without a primary key. I am not allowed to add a primary key, and I can't sort with TOP 1 ... ORDER BY ... DESC because it takes hours to complete. So I tried this workaround:
DECLARE @ROWCOUNT int, @OFFSET int
SELECT @ROWCOUNT = (SELECT COUNT(field) FROM TABLE)
SET @OFFSET = @ROWCOUNT - 1
select TOP 1 FROM TABLE WHERE=?????NO PRIMARY KEY??? BETWEEN @OFFSET AND @ROWCOUNT
Of course this doesn't work.
Is there any way to use this code, or better code, to retrieve the last row in the table?
If your table has no primary key, or your primary key is not orderly, you can try the code below. If you want to see more than one last record, you can change the number in the code.
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
I assume that when you are saying 'last rows', you mean 'last created rows'.
Even if you had a primary key, it would still not be the best option for determining row creation order.
There is no guarantee that the row with the bigger primary key value was created after the row with a smaller primary key value.
Even if the primary key is on an identity column, you can still always override identity values on insert by using
SET IDENTITY_INSERT <table> ON.
It is a better idea to have a timestamp column, for example CreatedDateTime, with a default constraint.
You would have an index on this field. Then your query would be simple, efficient, and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
If you don't have a timestamp column, you can't determine the 'last rows'.
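A hypothetical sketch of that column and its index, assuming SQL Server (all names here are made up):
ALTER TABLE MyTable
ADD CreatedDateTime datetime2 NOT NULL
CONSTRAINT DF_MyTable_Created DEFAULT SYSUTCDATETIME();
CREATE INDEX IX_MyTable_Created ON MyTable (CreatedDateTime);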
If you need to select 1 column from a table of 800,000 rows where that column is the min or max possible value, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside, on the face of it reading all the rows of an 800,000 row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of work-arounds (indexes, views, indexed views, periodically indexed copies of the table, run the query once and store the result for some period of time before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change (call it an improvement when you discuss it with your project manager) to the database.
Can you add an index? Even if you don't have a primary key, an index will speed up the query considerably.
You say you don't have a primary key, but from your question I assume you have some kind of timestamp or something similar on the table; if you create an index on this column you will be able to execute a query like:
SELECT *
FROM table_name
WHERE timestamp_column_name=(
SELECT max(timestamp_column_name)
FROM table_name
)
If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)
I believe you could write it simply as
SELECT * FROM table ORDER BY rowid DESC LIMIT 1;
Hope it helps.

Optimizing "ORDER BY" when the result set is very large and it can't be ordered by an index

How can I make an ORDER BY clause with a small LIMIT (ie 20 rows at a time) return quickly, when I can't use an index to satisfy the ordering of rows?
Let's say I would like to retrieve a certain number of titles from a table 'node' (simplified below). I'm using MySQL by the way.
node_ID INT(11) NOT NULL auto_increment,
node_title VARCHAR(127) NOT NULL,
node_lastupdated INT(11) NOT NULL,
node_created INT(11) NOT NULL
But I need to limit the rows returned to only those a particular user has access to. Many users have access to large numbers of nodes. I have this information pre-calculated in a big lookup table (an attempt to make things easier) where the primary key covers both columns and the presence of a row means that the usergroup has access to that node:
viewpermission_nodeID INT(11) NOT NULL,
viewpermission_usergroupID INT(11) NOT NULL
My query therefore contains something like
FROM
node
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
... and I also use a GROUP BY or a DISTINCT so that a node is only returned once even if two of the user's 'usergroups' both have access to that node.
My problem is that there seems to be no way for an ORDER BY clause which sorts results by created or last updated date to use an index, because the rows being returned depend on values in the other viewpermission table.
Therefore MySQL would need to find all rows which match the criteria, then sort them all itself. If there are one million rows for a particular user, and we want to view, say, the latest 100 rows, or rows 100-200, when ordered by last update, the DB would need to figure out which one million rows the user can see and sort this whole result set itself before it can return those 100 rows, right?
Is there any creative way to get around this? I've been thinking along the lines of:
Somehow add dates into the viewpermission lookup table so that I can build an index containing the dates as well as the permissions. It's a possibility I guess.
Edit: Simplified question
Perhaps I can simplify the question by rewriting it like this:
Is there any way to rewrite this query or create an index for the following such that an index can be used to do the ordering (not just to select the rows)?
SELECT nodeid
FROM lookup
WHERE
usergroup IN (2, 3)
GROUP BY
nodeid
An index on (usergroup) allows the WHERE part to be satisfied by an index, but the GROUP BY forces a temporary table and filesort on those rows. An index on (nodeid) does nothing for me, because the WHERE clause needs an index with usergroup as its first column. An index on (usergroup, nodeid) forces a temporary table and filesort because the GROUP BY is not the first column of the index that can vary.
Any solutions?
Can I answer my own question?
I believe I have found that the only way to do what I describe is for my lookup table to have rows for every possible combination of usergroups a person may want to be a member of.
To pick a simplified example, instead of doing this:
SELECT id FROM ids WHERE groups IN(1,2) ORDER BY id
If you need to use the index both to select rows and to order them, you have to abstract that IN(1,2) so that it is constant rather than a range, ie:
SELECT id FROM ids WHERE grouplist='1,2' ORDER BY id
Of course instead of using the string '1,2' you could have a foreign key there, etc. The point being that you'd have to have a row not just for each group but for each combination of multiple groups.
So, there is my answer.
Anyway, for my application, I feel that maintaining a lookup for all possible combinations of usergroups for each node is not worth it. For my purposes, I predict that most nodes are visible to most users, so I feel it is acceptable simply to make the GROUP BY use the index, as the filtering doesn't need it so badly.
In other words, the approach I'll take for my original query may be something like:
SELECT
<fields>
FROM
node FORCE INDEX (node_created_and_node_ID)
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
GROUP BY
node_created, node_ID
GROUP BY can use an index if it starts at the leftmost column of the index and it is in the first non-const, non-system table to be processed. The join then deals with the entire list (which is already ordered), and only those rows not visible to the current user (which will be a small proportion) are removed by the INNER JOIN.
Copy the value you are going to order by into the viewpermission table and add it to your index.
You could use a trigger to maintain that value from the other table.
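A hypothetical sketch of that approach, assuming MySQL and the column names from the question (the new column, index, and trigger names are made up):
-- copy of the ordering value, plus an index serving both filter and sort
ALTER TABLE viewpermission ADD COLUMN node_lastupdated INT NOT NULL DEFAULT 0;
CREATE INDEX ix_vp_group_updated
ON viewpermission (viewpermission_usergroupID, node_lastupdated);
-- keep the copy in sync when a node changes
CREATE TRIGGER node_au AFTER UPDATE ON node
FOR EACH ROW
UPDATE viewpermission
SET node_lastupdated = NEW.node_lastupdated
WHERE viewpermission_nodeID = NEW.node_ID;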
select * from
(
select *
FROM node
INNER JOIN viewpermission
ON viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
) a
order by a.node_lastupdated desc
The inner query gives you the filtered subset, which I understand is substantially smaller than the whole set. Only the smaller set has to be sorted.
MySQL has problems when you use GROUP BY and ORDER BY in the same query. That causes a filesort, and that's probably the biggest penalty for performance.
You can eliminate the need for a DISTINCT (or GROUP BY) by using a non-correlated subquery instead of a JOIN.
SELECT * FROM node
WHERE node_id IN (
SELECT viewpermission_nodeID
FROM viewpermission
WHERE viewpermission_usergroupID IN ( <...usergroups...> )
)
ORDER BY node_lastupdated DESC
LIMIT 100;
There's no need to sort or do a DISTINCT on the subquery, since IN (1, 1, 2, 3) is the same as IN (1, 3, 2).
Note that MySQL can use only one index per table in a given query, so it'll try to make the best choice between an index on node_id and an index on node_lastupdated. It can't use both, and even if you made a compound index it wouldn't help in this case.
Remember to analyze different solutions with EXPLAIN.