I have a table of millions of rows that is constantly changing (new rows are inserted, updated, and some are deleted). I'd like to query 100 new rows every minute, but these rows can't be ones I've queried before. The table has about two dozen columns and a primary key.
Happy to answer any questions or provide clarification.
A simple solution is to have a separate table with just one row to store the last ID you fetched.
Let's say that's your "table of millions of rows":
-- That's your table with millions of rows
CREATE TABLE test_table (
    id serial unique,
    col1 text,
    col2 timestamp
);

-- Data sample
INSERT INTO test_table (col1, col2)
SELECT 'test', generate_series
FROM generate_series(now() - interval '1 year', now(), '1 day');
You can create the following table to store an ID:
-- Table to keep the last fetched id
CREATE TABLE last_query (
    last_query_id int references test_table (id)
);

-- Initial row
INSERT INTO last_query (last_query_id) VALUES (1);
Then the following query will always fetch 100 rows that have never been fetched from the original table, and it maintains the pointer in last_query:
WITH last_id AS (
    SELECT last_query_id FROM last_query
), new_rows AS (
    SELECT *
    FROM test_table
    WHERE id > (SELECT last_query_id FROM last_id)
    ORDER BY id
    LIMIT 100
), update_last_id AS (
    UPDATE last_query
    SET last_query_id = (SELECT MAX(id) FROM new_rows)
    WHERE EXISTS (SELECT 1 FROM new_rows)  -- don't null out the pointer when there are no new rows
)
SELECT * FROM new_rows;
Rows will be fetched by order of new IDs (oldest rows first).
You basically need a unique, sequential value that is assigned to each record in this table. That allows you to search for the next X records where the value of this field is greater than the last one you got from the previous page.
Easiest way would be to have an identity column as your PK, and simply start from the beginning and include a "where id > #last_id" filter on your query. This is a fairly straightforward way to page through data, regardless of underlying updates. However, if you already have millions of rows and you are constantly creating and updating, an ordinary integer identity is eventually going to run out of numbers (a bigint identity column is unlikely to run out of numbers in your great-grandchildren's lifetimes, but not all DBs support anything but a 32-bit identity).
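For instance, a minimal sketch of that filter (here :last_id is a placeholder for the last value you stored, and test_table stands in for your real table):

-- Fetch the next 100 never-seen rows by keyset (seek) pagination
SELECT *
FROM test_table
WHERE id > :last_id    -- :last_id is whatever you remembered from the previous batch
ORDER BY id
LIMIT 100;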
You can do the same thing with a "CreatedDate" datetime column, but these dates aren't 100% guaranteed to be unique: depending on how the date is set, you might have more than one row with the same creation timestamp, and if those records cross a "page boundary", you'll miss any that fall beyond the end of your current page.
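To illustrate the hazard, assume a hypothetical created_date column (not part of the original schema) and timestamp-based paging; rows that share the boundary timestamp but did not fit into the previous page are skipped:

-- Timestamp-only paging; created_date is an assumed column
SELECT *
FROM test_table
WHERE created_date > :last_created_date  -- leftover rows equal to :last_created_date never show up
ORDER BY created_date
LIMIT 100;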
Some SQL systems' GUID generators are guaranteed to be not only unique but sequential. You'll have to look into whether PostgreSQL's GUIDs work this way; if they're true V4 GUIDs, they'll be totally random except for the version identifier and you're SOL. If you do have access to sequential GUIDs, you can filter just like with an integer identity column, only with many more possible key values.
I've created 2 rows in a table in SQL (sqlite3 on cmd) and then deleted 1 of them.
CREATE TABLE sample1( name TEXT, id INTEGER PRIMARY KEY AUTOINCREMENT);
INSERT INTO sample1 VALUES ('ROY',1);
INSERT INTO sample1(name) VALUES ('RAJ');
DELETE FROM sample1 WHERE id = 2;
Later when I inserted another row, its id was given 3 by the system instead of 2.
INSERT INTO sample1 VALUES ('AMIE',NULL);
SELECT * FROM sample1;
How do I correct it so that the next rows are given the right ids automatically? Or how do I clear the SQL database cache to solve it?
The simplest fix to resolve the problem you describe is to omit AUTOINCREMENT.
The result of your test would then be as you wish.
However, the rowid (which the id column is an alias of when INTEGER PRIMARY KEY is specified, with or without AUTOINCREMENT) will still be generated and will probably be 1 higher than the highest existing id (alias of the rowid).
There is a subtle difference between using and not using AUTOINCREMENT.
without AUTOINCREMENT, the generated value of the rowid, and therefore its alias, will be the highest existing rowid for the table plus 1 (not absolutely guaranteed though).
with AUTOINCREMENT the generated value will be 1 plus the higher of:-
the highest existing rowid, or
the highest used rowid
the highest, in some circumstances, may have only existed briefly
In your example, as 2 had already been used, the result was 2 + 1 = 3, even though the row with id 2 had been deleted.
Using AUTOINCREMENT is less efficient because knowing the last used value requires a system table, sqlite_sequence, which has to be accessed both to store the latest id and to retrieve it.
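For example, you can look at the counter that AUTOINCREMENT maintains for the sample1 table above:

-- The bookkeeping row that AUTOINCREMENT keeps per table
SELECT name, seq FROM sqlite_sequence WHERE name = 'sample1';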
The SQLite AUTOINCREMENT documentation says this:-
The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed.
There are other differences. For example, with AUTOINCREMENT, if the id 9223372036854775807 has been reached, another insert will result in an SQLITE_FULL error; without AUTOINCREMENT, SQLite will instead pick an unused id (there would be one, as current-day storage devices could not hold that number of rows).
The intention of ids (rowids) is to uniquely identify a row and to allow such a row to be accessed efficiently when accessing it by its id. The intention is not for it to be used as a sequence/order number. Using it that way will invariably result in unanticipated sequences or inefficient overheads trying to maintain such a sequence/order.
You should always consider that rows are unordered unless specifically ordered by a clause that orders the output, such as an ORDER BY clause.
However, if you take your example a little further, omitting AUTOINCREMENT will still probably result in order/sequence issues: if, for example, the row with an id of 1 were deleted instead of 2, you would end up with ids of 2 and 3.
Perhaps consider the following, which shows a) how the limited issue you have posed is solved without AUTOINCREMENT, and b) that it is not the solution if it is not the highest id that is deleted:-
DROP TABLE IF EXISTS sample1;
CREATE TABLE IF NOT EXISTS sample1( name TEXT, id INTEGER PRIMARY KEY);
INSERT INTO sample1 VALUES ('ROY',1);
INSERT INTO sample1(name) VALUES ('RAJ');
DELETE FROM sample1 WHERE id = 2;
INSERT INTO sample1 VALUES ('AMIE',NULL);
/* Result 1 */
SELECT * FROM sample1;
/* BUT if a lower than the highest id is deleted */
DELETE FROM sample1 WHERE id=1;
INSERT INTO sample1 VALUES ('EMMA',NULL);
/* Result 2 */
SELECT * FROM sample1;
Result 1 (your exact issue resolved)
Result 2 (if not the highest id deleted)
I'm having an issue with sequences when inserting data into a Postgres table through SQLAlchemy.
All of the data is inserted fine, the id BIGSERIAL PRIMARY KEY column has all unique values which is great.
However, when I query the first 10/20 rows etc. of the table, the id values are not ascending in numeric order. There are gaps in the sequence, which is fine and to be expected, but the ids come back in a seemingly random, non-ascending order, like:
id
15
22
16
833
30
etc...
I've gone through plenty of SO and Postgres forum posts around this and have only found people talking about having huge serial gaps in their sequences, not about incorrect ascending order at creation time.
The table itself has been created through a standard DDL statement like so:
CREATE TABLE IF NOT EXISTS schema.table_name (
    id BIGSERIAL NOT NULL,
    col1 text NOT NULL,
    col2 JSONB[] NOT NULL,
    etc....
    PRIMARY KEY (id)
);
However when I query the first 10/20 rows etc. of the table
Your query has no order by clause, so you are not selecting the first rows of the table, just an undefined set of rows.
Use order by - you will find that sequence numbers are indeed assigned in ascending order (potentially with gaps):
select id from ht_data order by id limit 30
To actually check the ordering of the sequence, you would need another column that stores the timestamp when each row was created. You could then do:
select id from ht_data order by ts limit 30
In general, there is no defined "order" within a SQL table. If you want to view your data in a certain order, you need an ORDER BY clause:
SELECT *
FROM table_name
ORDER BY id;
As for gaps in the sequence, the contract of an auto-increment column generally only guarantees that each newly generated id value will be unique and, most of the time (but not necessarily always), increasing.
How could you possibly know if the values are "out of order"? SQL tables represent unordered sets. The only indication of ordering in your table is the serial value.
The query that you are running has no ORDER BY. The results are not guaranteed to be in any particular ordering. Period. That is a very simple fact about SQL. That you want the results of a SELECT to be ordered by the primary key or by insertion order is nice, but not how databases work.
The only way you could determine whether something were out of order would be if you had a column that separately specified the insert order -- you could have a creation timestamp, for instance.
All you have discovered is that SQL lives up to its promise of not guaranteeing ordering unless the query specifically asks for it.
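As a rough sketch of that idea (the created_at column name is illustrative, not part of your table):

-- Record the insertion time so there is something meaningful to order by
ALTER TABLE table_name ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

-- "First 30 rows" now has a defined meaning
SELECT * FROM table_name ORDER BY created_at LIMIT 30;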
I have a PostgreSQL 9.5 table that is set to cycle when the primary key ID hits the maximum value. For argument's sake, let's say the maximum ID value is 999,999. I'll add commas to make the numbers easier to read.
We run a job that deletes data from the table that is older than 45 days. Let's assume that the table now only contains records with IDs of 999,998 and 999,999.
The primary key ID cycles back to 1 and 20 more records have been written. I need to keep it generic so I won't make any assumptions about how many were written. In my real world needs, I don't care how many were written.
How can I select the records without getting duplicates with an ID of 999,998 and 999,999?
For example:
SELECT * FROM my_table WHERE ID >0;
Would return (in no particular order):
999,998
999,999
1
2
...
20
My real world case is that I need to publish every record that was written to the table to a message broker. I maintain a separate table that tracks the row ID and timestamp of the last record that was published. The pseudo-query/pseudo-algorithm to determine what new records to write is something like this. The IF statement handles when the primary key ID cycles back to 1 as I need to read the new record written after the ID cycled:
SELECT * from my_table WHERE id > last_written_id
PUBLISH each record
if ID of last record published == MAX_TABLE_ID (e.g 999,999):
??? What to do here? I need to get the newest records where ID >= 1 but less than the oldest record I have
I realise that the "code" is rough, but it's really just an idea at the moment so there's no code.
Thanks
Hmm, you can use the current value of the sequence to do what you want:
select t.*
from my_table t
where t.id > #last_written_id or
(currval(pg_get_serial_sequence('my_table', 'id')) < #last_written_id and
t.id <= currval(pg_get_serial_sequence('my_table', 'id'))
);
This is not a 100% solution. After all, 2,000,000 records could have been added, so the numbers would all be repeated, or the records deleted. It can also misbehave if inserts are happening while the query is running, particularly in a multithreaded environment.
Here is a completely different approach: You could completely fill the table, giving it a column for deletion time. So instead of deleting rows, you merely set this datetime. And instead of inserting a row you merely update the one that was deleted the longest time ago:
update my_table
set col1 = 123, col2 = 456, col3 = 'abc', deletion_datetime = null
where deletion_datetime =
(
select deletion_datetime
from my_table
where deletion_datetime is not null
order by deletion_datetime
limit 1
);
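Under the same assumptions, the matching "delete" is just another update that marks the row as reusable:

-- Logical delete: flag the row as free instead of physically removing it
UPDATE my_table
SET deletion_datetime = now()
WHERE id = :some_id;   -- :some_id is a placeholder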
I have got two databases with the same structure, but different data.
In both databases all rows have auto-increment IDs in the tables 'Tiles' and 'TilesData', and these IDs are related keys. I have to move rows from the first database to the other, but there is an exception with the id.
Is the description of my problem clear?
I tried this (but I'm a little afraid of the reliability of this solution; there will be a few million rows):
INSERT INTO DataBase_2.Tiles (X,Y,Zoom,Type,CacheTime)
SELECT X, Y, Zoom, Type, CacheTime FROM DataBase_1.Tiles;
INSERT INTO DataBase_2.TilesData(Tile)
SELECT Tile FROM DataBase_1.TilesData;
Could you help me or give me some tips? Is plain SQL enough?
When the autoincrementing columns are declared as INTEGER PRIMARY KEY (without AUTOINCREMENT), then new IDs will get the next value after the largest value already in the table.
Check if both tables have the same maximum ID value:
SELECT MAX(id) FROM DataBase_2.Tiles;
SELECT MAX(id) FROM DataBase_2.TilesData;
If they are equal, then corresponding rows will indeed get the same new ID. However, you should use ORDER BY to ensure that the rows are read/inserted in the same order:
INSERT INTO DataBase_2.Tiles (...)
SELECT ... FROM DataBase_1.Tiles ORDER BY id;
INSERT INTO DataBase_2.TilesData(Tile)
SELECT Tile FROM DataBase_1.TilesData ORDER BY id;
If the id columns are declared with AUTOINCREMENT, then you have to check (and if needed, adjust) the next ID values in the sqlite_sequence table.
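A sketch of that check and adjustment, assuming the databases are attached under the names used in your statements and that the auto-increment column is called id:

-- Current AUTOINCREMENT counters in the target database
SELECT name, seq
FROM DataBase_2.sqlite_sequence
WHERE name IN ('Tiles', 'TilesData');

-- If necessary, bump a counter so future inserts continue above the copied rows
UPDATE DataBase_2.sqlite_sequence
SET seq = (SELECT MAX(id) FROM DataBase_2.Tiles)
WHERE name = 'Tiles';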
I am developing an application that is required to store previous versions of database table rows to maintain a history of changes. I am recording the history in the same table but need the most current data to be accessible by a unique identifier that doesn't change with new versions. I have a few ideas on how this could be done and was just looking for feedback on the best way of doing this, or whether there is any reason not to use one of them:
Create a new row for each row version, with a field to indicate which row was the current row. The drawback of this is that the new version has a different primary key and any references to the old version will not return the current version.
When data is updated, the old row version is duplicated to a new row, and the new version replaces the old row. The current row can be accessed by the same primary key.
Add a second table with only a primary key, and add a column to the other table that is a foreign key to the new table's primary key. Use the same method as described in option 1 for storing multiple versions, and create a view which finds the current version by using the new table's primary key (see the sketch after this list).
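A rough sketch of option 3, in T-SQL (other answers here assume SQL Server); every name and column is purely illustrative:

-- Stable identity that never changes across versions
CREATE TABLE entity (
    entity_id int PRIMARY KEY
);

-- One row per version, pointing back at the stable id
CREATE TABLE entity_version (
    version_id int IDENTITY(1,1) PRIMARY KEY,
    entity_id  int NOT NULL REFERENCES entity (entity_id),
    is_current bit NOT NULL,        -- which version is the live one
    data       varchar(100)
);

-- Resolve the stable id to its current version
CREATE VIEW entity_current AS
SELECT v.entity_id, v.version_id, v.data
FROM entity_version v
WHERE v.is_current = 1;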
PeopleSoft uses (used?) "effective dated records". It took a little while to get the hang of it, but it served its purpose. The business key is always extended by an EFFDT column (effective date). So if you had a table EMPLOYEE[EMPLOYEE_ID, SALARY] it would become EMPLOYEE[EMPLOYEE_ID, EFFDT, SALARY].
To retrieve the employee's salary:
SELECT e.salary
FROM employee e
WHERE employee_id = :x
  AND effdt = (SELECT MAX(effdt)
               FROM employee
               WHERE employee_id = :x
                 AND effdt <= SYSDATE)
An interesting application was future-dating records: you could give every employee a 10% increase effective Jan 1 next year, and pre-populate the table a few months beforehand. When SYSDATE crosses Jan 1, the new salary comes into effect. It was also good for running historical reports: instead of using SYSDATE, you plug in a date from the past in order to see the salaries (or exchange rates or whatever) as they would have been reported if run at that time in the past.
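For instance, sticking with the EMPLOYEE example above (the dates and salary are made up):

-- Pre-populate next year's raise; it stays invisible until SYSDATE reaches the new EFFDT
INSERT INTO employee (employee_id, effdt, salary)
VALUES (:x, DATE '2025-01-01', 55000);

-- Historical report: swap SYSDATE for a date in the past
SELECT e.salary
FROM employee e
WHERE employee_id = :x
  AND effdt = (SELECT MAX(effdt)
               FROM employee
               WHERE employee_id = :x
                 AND effdt <= DATE '2023-06-30');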
In this case, records are never updated or deleted, you just keep adding records with new effective dates. Makes for more verbose queries, but it works and starts becoming (dare I say) normal. There are lots of pages on this, for example: http://peoplesoft.wikidot.com/effective-dates-sequence-status
#3 is probably best, but if you wanted to keep the data in one table, I suppose you could add a datetime column that has a now() value populated for each new row and then you could at least sort by date desc limit 1.
Overall though - handling multiple versions needs more info on what you want to do, effectively as much as programmatically.
R
Have you considered using AutoAudit?
AutoAudit is a SQL Server (2005, 2008) Code-Gen utility that creates Audit Trail Triggers with:
Created, CreatedBy, Modified, ModifiedBy, and RowVersion (incrementing INT) columns added to the table
Insert event logged to Audit table
Updates old and new values logged to Audit table
Delete logs all final values to the Audit table
view to reconstruct deleted rows
UDF to reconstruct Row History
Schema Audit Trigger to track schema changes
Re-code-gens triggers when Alter Table changes the table
For me, history tables are always separate, so I would definitely go with that; but why create some complex versioning scheme where you have to look at the current production record? In reporting, this results in nasty unions that are really unnecessary.
Table has a primary key and who cares what else.
TableHist has these columns: an incrementing int/bigint primary key, a history-written date/time, a history-written-by, a record type (I, U, or D for insert, update, delete), the PK from Table as an FK on TableHist, and then all of Table's remaining columns under the same names.
If you create this history table structure and populate it via triggers on Table, you will have all versions of every row in the tables you care about and can easily determine the original record, every change, and the deletion records as well. AND if you are reporting, you only need to use your historical tables to get all of the information you'd like.
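A minimal sketch of such a pair of tables in T-SQL; every name here is illustrative, and the parent key is kept as a plain column (rather than the FK suggested above) so the delete-history rows can outlive the parent row:

-- The live table
CREATE TABLE Widget (
    WidgetId int IDENTITY(1,1) PRIMARY KEY,
    Name     varchar(100) NOT NULL,
    Price    decimal(10,2) NOT NULL
);

-- One row per change, written by insert/update/delete triggers on Widget
CREATE TABLE WidgetHist (
    WidgetHistId  bigint IDENTITY(1,1) PRIMARY KEY,
    HistWrittenAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    HistWrittenBy sysname   NOT NULL DEFAULT SUSER_SNAME(),
    RecordType    char(1)   NOT NULL,   -- 'I', 'U' or 'D'
    WidgetId      int       NOT NULL,   -- the PK of the Widget row this version belongs to
    Name          varchar(100) NOT NULL,
    Price         decimal(10,2) NOT NULL
);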
create table table1 (
    Id int identity(1,1) primary key,
    [Key] varchar(max),
    Data varchar(max)
)
go
create view view1 as
with q as (
    -- number each [Key]'s rows, newest (highest Id) first
    select [Key], Data, row_number() over (partition by [Key] order by Id desc) as 'r'
    from table1
)
-- the view only ever exposes the most recent row per [Key]
select [Key], Data from q where r = 1
go
create trigger trigger1 on view1 instead of update, insert as begin
    -- inserts and updates against the view both become plain inserts into table1,
    -- so every previous version of a row is retained
    insert into table1
    select [Key], Data
    from (select distinct [Key], Data from inserted) a
end
go
insert into view1 values
('key1', 'foo')
,('key1', 'bar')
select * from view1
update view1
set Data='updated'
where [Key]='key1'
select * from view1
select * from table1
drop trigger trigger1
drop table table1
drop view view1
Results:

Key   Data
key1  foo

Key   Data
key1  updated

Id  Key   Data
1   key1  bar
2   key1  foo
3   key1  updated
I'm not sure if the distinct is needed.