SQL get last rows in table WITHOUT primary ID - sql

I have a table with 800,000 entries without a primary key. I am not allowed to add a primary key and I can't use TOP 1 ... ORDER BY ... DESC because it takes hours to complete. So I tried this workaround:
DECLARE @ROWCOUNT int, @OFFSET int
SELECT @ROWCOUNT = (SELECT COUNT(field) FROM TABLE)
SET @OFFSET = @ROWCOUNT-1
select TOP 1 FROM TABLE WHERE=?????NO PRIMARY KEY??? BETWEEN @OFFSET AND @ROWCOUNT
Of course this doesn't work.
Is there any way to use this code, or better code, to retrieve the last row in the table?

If your table has no primary key, or the primary key is not in order, you can try the code below. If you want to see more of the last records, change the number in the code.
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
Note that without an ORDER BY, SQL Server does not guarantee which rows TOP returns, so this relies on the physical scan order, and EXCEPT also collapses duplicate rows.

I assume that when you are saying 'last rows', you mean 'last created rows'.
Even if you had a primary key, it would still not be the best way to determine the order in which rows were created.
There is no guarantee that the row with the bigger primary key value was created after the row with a smaller primary key value.
Even if the primary key is on an identity column, you can always override identity values on insert by using
set identity_insert MyTable on.
It is a better idea to have a timestamp column, for example CreatedDateTime with a default constraint.
You would index this column. Then your query would be simple, efficient and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
If you don't have a timestamp column, you can't determine the 'last rows'.
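As a rough illustration of the timestamp-column approach, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The Payload column, the index name and the sample rows are assumptions made for the demo, and SQLite uses LIMIT 1 where T-SQL uses TOP 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CreatedDateTime gets a default so writers do not have to supply it.
# (SQLite syntax; in SQL Server this would be a DEFAULT constraint.)
conn.execute("""
    CREATE TABLE MyTable (
        Payload TEXT NOT NULL,
        CreatedDateTime TEXT NOT NULL
            DEFAULT (strftime('%Y-%m-%d %H:%M:%f', 'now'))
    )
""")
# The index is what keeps "latest row" from needing a full scan and sort.
conn.execute("CREATE INDEX IX_MyTable_Created ON MyTable (CreatedDateTime)")

# Explicit timestamps keep the demo deterministic; rows arrive out of order.
rows = [
    ("first", "2024-01-01 10:00:00.000"),
    ("third", "2024-01-01 12:00:00.000"),
    ("second", "2024-01-01 11:00:00.000"),
]
conn.executemany(
    "INSERT INTO MyTable (Payload, CreatedDateTime) VALUES (?, ?)", rows
)

# Equivalent of: select top 1 * from MyTable order by CreatedDateTime desc
latest = conn.execute(
    "SELECT Payload, CreatedDateTime FROM MyTable "
    "ORDER BY CreatedDateTime DESC LIMIT 1"
).fetchone()
print(latest)
```

The row returned is the one with the greatest timestamp, regardless of the order in which the rows were inserted.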

If you need to select 1 column from a table of 800,000 rows where that column is the min or max possible value, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside, on the face of it reading all the rows of an 800,000 row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of work-arounds (indexes, views, indexed views, periodically indexed copies of the table, run once, store the result and use it for a period T before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change--and call it an improvement, when you discuss it with your project manager--to the database.

You need to add an index. Can you?
Even if you don't have a primary key, an index will speed up the query considerably.
You say you don't have a primary key, but from your question I assume you have some type of timestamp or something similar on the table. If you create an index on this column, you will be able to execute a query like:
SELECT *
FROM table_name
WHERE timestamp_column_name=(
SELECT max(timestamp_column_name)
FROM table_name
)

If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)

I believe you could write it simply as
SELECT * FROM table ORDER BY rowid DESC LIMIT 1;
Hope it helps.

Related

SQL - Get specific row without a full table scan

I'm using Postgresql (cockroachdb) and I want to select a specific row. For example, there are thousands of records and I want to select row number 999.
In this case we would use LIMIT and OFFSET, SELECT * FROM table LIMIT 1 OFFSET 998;
However, using LIMIT and OFFSET can cause performance issues according to this post. So I'm wondering if there is a way to get a specific row without a full table scan.
I feel like it is possible because the database seems to sort data by primary key: when I do SELECT * FROM table; it always shows a sorted result. Since it is sorted by primary key, the database can use the index to access a specific row, right?
If you select rows based on the primary key (e.g. SELECT * FROM table WHERE <primary key> = <value>), no scans will be needed underneath the hood. The same is also true if you define a secondary index on the table and apply a WHERE clause that filters based on the column(s) in the secondary index.
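To make the difference concrete, here is a small sketch using Python's sqlite3 module rather than CockroachDB; the items table and its contents are made up for the demo. It fetches the same row once by OFFSET and once by primary key, then asks SQLite for the plan of the keyed lookup. Plan output is engine-specific, so treat it as an illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 2001)],
)

# Positional access: the engine has to step past 998 rows first.
by_offset = conn.execute(
    "SELECT id, name FROM items ORDER BY id LIMIT 1 OFFSET 998"
).fetchone()

# Keyed access: a direct lookup on the primary key.
by_key = conn.execute(
    "SELECT id, name FROM items WHERE id = ?", (999,)
).fetchone()

# The last column of EXPLAIN QUERY PLAN output is the human-readable detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id, name FROM items WHERE id = ?", (999,)
).fetchall()
plan_text = " ".join(row[3] for row in plan)

print(by_offset, by_key)
print(plan_text)
```

The two results coincide here only because the demo ids are dense (1 to 2000 with no gaps). With deleted rows, "row number 999" and "id = 999" are different things, so keyed access only helps when the key itself is what you are looking for.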

Implement a ring buffer

We have a table logging data. It is logging at say 15K rows per second.
Question: How would we limit the table size to the 1bn newest rows?
i.e. once 1bn rows is reached, it becomes a ring buffer, deleting the oldest row when adding the newest.
Triggers might load the system too much. Here's a trigger example on SO.
We are already using a bunch of tweaks to keep the speed up (such as stored procedures, Table Parameters etc).
Edit (8 years on) :
My recent question/answer here addresses a similar issue using a time series database.
Unless there is something magic about 1 billion, I think you should consider other approaches.
The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. This will result in about 15,000*60*60 = 54 million records in a partition. One billion rows is then roughly 19 such partitions, so once a partition is about 20 hours old, you can remove it.
One big advantage of partitioning is that the insert performance should work well and you don't have to delete individual records. There can be additional overheads depending on the query load, indexes, and other factors. But, with no additional indexes and a query load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second along with the inserts.
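The sizing arithmetic behind the hourly-partition idea can be checked in a few lines of Python. The one-hour partition width comes from the answer above; everything else follows from the figures in the question.

```python
import math

ROWS_PER_SECOND = 15_000        # logging rate from the question
TARGET_ROWS = 1_000_000_000     # rows to retain
SECONDS_PER_PARTITION = 60 * 60 # one partition per hour (assumed width)

rows_per_partition = ROWS_PER_SECOND * SECONDS_PER_PARTITION
# Whole partitions needed so that at least TARGET_ROWS are always retained.
partitions_needed = math.ceil(TARGET_ROWS / rows_per_partition)

print(rows_per_partition)  # rows written into each hourly partition
print(partitions_needed)   # partitions to keep before dropping the oldest
```

Because whole partitions are dropped, the table holds somewhat more than 1 billion rows most of the time, which is the trade-off for never deleting individual rows.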
I don't have a complete answer but hopefully some ideas to help you get started.
I would add some sort of numeric column to the table. This value would increment by 1 until it reached the number of rows you wanted to keep. At that point the procedure would switch to update statements, overwriting the previous row instead of inserting new ones. You obviously won't be able to use this column to determine the order of the rows, so if you don't already have one I would also add a timestamp column so you can order them chronologically later.
In order to coordinate the counter value across transactions you could use a sequence, then perform a modulo division to get the counter value.
In order to handle any gaps in the table (e.g. someone deleted some of the rows) you may want to use a merge statement. This should perform an insert if the row is missing or an update if it exists.
Hope this helps.
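A minimal sketch of the counter-plus-modulo idea, using Python's sqlite3 module as a stand-in for SQL Server. The log_ring table, the write_log helper and the tiny capacity are all hypothetical; SQLite's INSERT OR REPLACE plays the role of the merge statement, and an in-process counter stands in for a database sequence.

```python
import sqlite3
from itertools import count

CAPACITY = 10  # stand-in for the 1bn-row limit

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE log_ring (
        slot    INTEGER PRIMARY KEY,  -- position in the ring, 0..CAPACITY-1
        seq     INTEGER NOT NULL,     -- ever-increasing counter, used for ordering
        message TEXT NOT NULL
    )
""")

_sequence = count(0)  # assumption: replaces a real database sequence


def write_log(message):
    """Write one entry, overwriting the oldest slot once the ring is full."""
    seq = next(_sequence)
    slot = seq % CAPACITY
    # Insert if the slot is empty, overwrite if it is occupied.
    conn.execute(
        "INSERT OR REPLACE INTO log_ring (slot, seq, message) VALUES (?, ?, ?)",
        (slot, seq, message),
    )


for i in range(25):
    write_log(f"event {i}")

rows = conn.execute(
    "SELECT seq, message FROM log_ring ORDER BY seq"
).fetchall()
print(len(rows), rows[0], rows[-1])
```

The slot column cannot be used for ordering because it wraps around, which is why the sketch keeps the raw counter in seq; a timestamp column would serve the same purpose.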
Here's my suggestion:
Pre-populate the table with 1,000,000,000 rows, including a row number as the primary key.
Instead of inserting new rows, have the logger keep a counter variable that increments each time, and update the appropriate row according to the row number.
This is actually what you would do with a ring buffer in other contexts. You wouldn't keep allocating memory and deleting; you'd just overwrite the same array over and over.
Update: the update doesn't actually change the data in place, as I thought it did. So this may not be efficient.
Just an idea that is too complicated to write in a comment.
Create a few log tables, 3 as an example: Log1, Log2, Log3
CREATE TABLE Log1 (
Id int NOT NULL
CHECK (Id BETWEEN 0 AND 9)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log1] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log2 (
Id int NOT NULL
CHECK (Id BETWEEN 10 AND 19)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log2] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log3 (
Id int NOT NULL
CHECK (Id BETWEEN 20 AND 29)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log3] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
Then create a partitioned view
CREATE VIEW LogView AS (
SELECT * FROM Log1
UNION ALL
SELECT * FROM Log2
UNION ALL
SELECT * FROM Log3
)
If you are on SQL Server 2012 you can use a sequence
CREATE SEQUENCE LogSequence AS int
START WITH 0
INCREMENT BY 1
MINVALUE 0
MAXVALUE 29
CYCLE
;
And then start to insert values
INSERT INTO LogView (Id, Message)
SELECT NEXT VALUE FOR LogSequence
,'SomeMessage'
Now you just have to truncate the log tables on some kind of schedule.
If you don't have SQL Server 2012, you need to create the sequence some other way.
I'm looking for something similar myself (using a table as a circular buffer) but it seems like a simpler approach (for me) will be just to periodically delete old entries (e.g. the lowest IDs or lowest create/lastmodified datetimes or entries over a certain age). It's not a circular buffer but perhaps it is a close enough approximation for some. ;)

Selecting the most optimal query

I have a table in an Oracle database, called my_table for example. It is a kind of log table. It has an incremental column named "id" and a "registration_number" column that is unique for each registered user. Now I want to get the latest changes for registered users, so I wrote the queries below to accomplish this:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? The second is: if there are sometimes about 70,000 inserts into this table, but mostly the number of inserted rows varies between 0 and 2000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
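To check that the three formulations in this thread return the same rows, here is a small sketch using Python's sqlite3 module in place of Oracle. The sample data and the action column are invented for the demo, and the ROW_NUMBER() query assumes the bundled SQLite is version 3.25 or later. It says nothing about relative speed on Oracle; for that, compare execution plans on the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY,
        registration_number INTEGER NOT NULL,
        action TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO my_table (id, registration_number, action) VALUES (?, ?, ?)",
    [(1, 100, "a"), (2, 200, "b"), (3, 100, "c"), (4, 300, "d"), (5, 200, "e")],
)

# First version from the question: correlated subquery.
correlated = conn.execute("""
    SELECT t.id, t.registration_number
    FROM my_table t
    WHERE t.id = (SELECT MAX(id) FROM my_table t_m
                  WHERE t_m.registration_number = t.registration_number)
    ORDER BY t.id
""").fetchall()

# Second version from the question: join against grouped maxima.
joined = conn.execute("""
    SELECT t.id, t.registration_number
    FROM my_table t
    INNER JOIN (SELECT MAX(id) AS m_id FROM my_table
                GROUP BY registration_number) t_m
        ON t.id = t_m.m_id
    ORDER BY t.id
""").fetchall()

# Analytic version from the answer (needs window-function support).
windowed = conn.execute("""
    SELECT id, registration_number
    FROM (
        SELECT id, registration_number,
               ROW_NUMBER() OVER (PARTITION BY registration_number
                                  ORDER BY id DESC) AS IDRankByUser
        FROM my_table
    )
    WHERE IDRankByUser = 1
    ORDER BY id
""").fetchall()

print(correlated)
print(joined)
print(windowed)
```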
To find the faster query, check the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions usually make this kind of query run much faster.
If you expect this table to grow very big, I would suggest partitioning the table and using local indexes. They will help you write faster queries.
When you insert lots of rows, indexes slow down the insertion because the index also has to be updated with each insert [I would not recommend an index on ID]. There are two solutions I can think of for this:
You can drop the index before the insertion and then recreate it afterwards.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can affect your query a bit, so there is a trade-off.
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
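-- update the pointer, or create it on the user's first log entry
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
if sql%rowcount = 0 then
insert into last_user_action(registration_number, last_action)
values(p_reg_num, v_row_id);
end if;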
end;
/
With such a schema you can simply query the last action of every user with good performance:
select *
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
Rowid is a physical storage identifier that directly addresses a storage block, so you can't rely on it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to add the id column from the my_log table to last_user_action too, and use one or the other depending on requirements.
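The "pointer table" idea can be sketched end to end with Python's sqlite3 module standing in for Oracle. This version stores the log row's id instead of its rowid (the portable variant mentioned above); the last_log_id column name and the sample actions are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        registration_number INTEGER NOT NULL,
        action_id TEXT NOT NULL
    );
    CREATE TABLE last_user_action (
        registration_number INTEGER PRIMARY KEY,
        last_log_id INTEGER NOT NULL   -- id of the newest my_log row for this user
    );
""")


def write_log(reg_num, action_id):
    """Append a log row and repoint the user's 'last action' in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO my_log (registration_number, action_id) VALUES (?, ?)",
            (reg_num, action_id),
        )
        # Insert the pointer on first use, overwrite it afterwards.
        conn.execute(
            "INSERT OR REPLACE INTO last_user_action "
            "(registration_number, last_log_id) VALUES (?, ?)",
            (reg_num, cur.lastrowid),
        )


write_log(100, "login")
write_log(200, "login")
write_log(100, "logout")

last_actions = conn.execute("""
    SELECT lua.registration_number, l.action_id
    FROM last_user_action lua
    JOIN my_log l ON l.id = lua.last_log_id
    ORDER BY lua.registration_number
""").fetchall()
print(last_actions)
```

Reading the latest action per user then touches one small row per user plus one keyed lookup into the log, instead of aggregating over the whole log table.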

Get max value for identity column without a table scan

I have a table with an Identity column Id.
When I execute:
select max(Id) from Table
SQL Server does a table scan and stream aggregate.
My question is, why can it not simply look up the last value assigned to Id? It's an identity, so the information must be tracked, right?
Can I look this up manually?
You can use IDENT_CURRENT to look up the last identity value to be inserted, e.g.
IDENT_CURRENT('MyTable')
However, be cautious when using this function. A failed transaction can still increment this value, and, as Quassnoi states, this row might have been deleted.
It's likely that it does a table scan because it can't guarantee that the last identity value is the MAX value. For example the identity might not be a simple incrementing integer. You could be using a decrementing integer as your identity.
What if you have deleted the latest record?
The value of IDENTITY would not correspond to the actual data anymore.
If you want fast lookups for MAX(id), you should create an index on it (or probably declare it a PRIMARY KEY)
Is the table clustered on that column?
Can you use Top 1:
SELECT TOP 1 [ID]
FROM [Table]
order by ID desc
You can run the following statement to generate a query, then remove the last UNION ALL from its output. Run the generated query to get the current identity values.
SELECT
' SELECT '+char(39)+[name]+char(39)+' AS table_name, IDENT_CURRENT('+char(39)+[name]+char(39)+') AS currvalue UNION ALL'
AS currentIdentity
FROM sys.all_objects WHERE type = 'U'
Is the Id the primary key or indexed? Seems like it should do a seek in those cases.
I'm pretty sure you could set up an index on that field in descending order and it would use that to find the largest key. It should be fast.

which one is a faster/better sql practice?

Suppose I have a 2 column table (id, flag) and id is sequential.
I expect this table to contain a lot of records.
I want to periodically select the first row not flagged and update it. Some of the records on the way may have already been flagged, so I want to skip them.
Does it make more sense if I store the last id I flagged and use it in my select statement, like
select * from mytable where id > my_last_id order by id asc limit 1
or simply get the first unflagged row, like:
select * from mytable where flagged = 'F' order by id asc limit 1
Thank you!
If you create an index on flagged, retrieving an unflagged row should be pretty much an instant operation. If you always update them sequentially, then the first method is fine though.
Option two is the only one that makes sense unless you know that you're always going to process records in sequence!
Assuming MySQL, this one:
SELECT *
FROM mytable
WHERE flagged = 'F'
ORDER BY
flagged ASC, id ASC
LIMIT 1
will be slightly less efficient in InnoDB and of same efficiency in MyISAM, if you have an index on (flagged, id).
InnoDB tables are clustered on the PRIMARY KEY, so fetching the first record in id does not require looking up the table.
In MyISAM, tables are heap-organized, so the index used to police the PRIMARY KEY is stored separately from the table.
Note the flagged in the ORDER BY clause may seem to be redundant, but it is required for MySQL to pick the correct index.
Also, the composite index should be on (flagged, id) even in InnoDB (which implicitly includes the PRIMARY KEY into each index).
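For the "first unflagged row" variant with a composite index on (flagged, id), here is a small sketch using Python's sqlite3 module. It is SQLite rather than MySQL, so it does not reproduce the InnoDB/MyISAM differences described above; the flag_next helper, the index name and the sample data are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, "
    "flagged TEXT NOT NULL DEFAULT 'F')"
)
# Composite index: equality on flagged, then ordered by id within each flag value.
conn.execute("CREATE INDEX ix_mytable_flagged_id ON mytable (flagged, id)")
conn.executemany("INSERT INTO mytable (id) VALUES (?)", [(i,) for i in range(1, 11)])
# Some rows were already flagged out of sequence.
conn.execute("UPDATE mytable SET flagged = 'T' WHERE id IN (1, 2, 3, 5)")


def flag_next():
    """Flag the lowest-id unflagged row and return its id, or None if none remain."""
    with conn:  # one transaction per claim
        row = conn.execute(
            "SELECT id FROM mytable WHERE flagged = 'F' ORDER BY id ASC LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE mytable SET flagged = 'T' WHERE id = ?", (row[0],))
        return row[0]


processed = [flag_next() for _ in range(3)]
print(processed)
```

Because the query filters on the flag itself, rows that were flagged out of order are skipped without having to remember a last-processed id.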
You could use
Select Min(Id) as 'Id'
From dbo.myTable
Where Flagged='F'
Assuming the Flagged = 'F' means that it is not flagged.