Deleting millions of records in batches in PostgreSQL - sql

I have to delete rows from a table that has 120 million records.
The rows with the highest and second-highest entry_date must not be deleted.
The table has several constraints:
one PRIMARY KEY,
two FOREIGN KEYs,
and two indexes besides the index on the primary key.
I have already tried, successfully, the approach of creating a temp table, moving the data to keep into it, dropping the existing table, and then moving the filtered data back from the temp table into the main table. That worked fine.
But I need a way to delete the records in batches.
CREATE TABLE values
(
    value_id bigint NOT NULL,
    content_definition_id bigint NOT NULL,
    value_s text,
    value_n double precision,
    order integer,
    scope_id integer NOT NULL,
    answer boolean NOT NULL,
    date timestamp without time zone NOT NULL,
    entry_date timestamp without time zone NOT NULL,
    CONSTRAINT "value_PK" PRIMARY KEY (value_id),
    CONSTRAINT content_definition_id_fk FOREIGN KEY (content_definition_id)
        REFERENCES content_definition (content_definition_id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION,
    CONSTRAINT scope_fk FOREIGN KEY (scope_id)
        REFERENCES scopes (scope_id) MATCH SIMPLE
        ON UPDATE RESTRICT ON DELETE RESTRICT
)
-- Index: fki_content_definition_id_fk
-- Index: fki_value_value_scope_id
How can I delete the records in batches, e.g. first only 1 million rows should be deleted, then the next million, and so on?

Recent PostgreSQL allows you to use a CTE in a DELETE statement, i.e. you can:
WITH ids_to_delete AS (
    SELECT value_id FROM values
    WHERE ...
    LIMIT ...
)
DELETE FROM values WHERE value_id IN (SELECT value_id FROM ids_to_delete);
This assumes you have no conflicting locks. Note that index page locks may slow things down as well.
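Building on that, here is a hedged sketch of a batched delete that keeps the rows carrying the two most recent entry_date values: run it repeatedly (from a script or a small loop in your application) until it deletes zero rows. The batch size of one million and the double-quoting of the table name (values is a reserved word in PostgreSQL) are assumptions you may need to adjust.
WITH ids_to_delete AS (
    SELECT value_id
    FROM "values"
    WHERE entry_date < (
        -- cut-off just below the two most recent entry_date values
        SELECT min(entry_date)
        FROM (
            SELECT DISTINCT entry_date
            FROM "values"
            ORDER BY entry_date DESC
            LIMIT 2
        ) AS newest_two
    )
    LIMIT 1000000   -- batch size, adjust as needed
)
DELETE FROM "values" v
USING ids_to_delete d
WHERE v.value_id = d.value_id;
Each run is its own transaction, so you avoid one huge, long-running delete and autovacuum can keep up between batches.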

Can you try a MERGE with the conditions of your temp tables and use the DELETE part of it? That should give you good performance.
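If your PostgreSQL is version 15 or newer (MERGE was added in 15), a hedged sketch of that idea, assuming you have already filled a temp table named ids_to_delete (an invented name) with the value_ids to remove, could look like:
MERGE INTO "values" AS v
USING ids_to_delete AS d
ON v.value_id = d.value_id
WHEN MATCHED THEN DELETE;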

Related

Make a hint to SQLite that a particular column is always sorted

I have the following table in an SQLite database
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP,
`data` BLOB NOT NULL
) WITHOUT ROWID;
CREATE INDEX `time_index` ON `log`(`time`);
The index is created because the most frequent query is going to be
SELECT * FROM `log` WHERE `time` BETWEEN ? AND ?
Since the time is always going to be the current time when a new record is added, the index is not really required here. So I would like to "tell" the SQLite engine something like: "The rows are going to be added with the 'time' column always having an increasing value (similar to AUTO_INCREMENT), and if something goes wrong I will take all responsibility".
Is it possible at all?
You don't want a separate index. You want to declare the column to be the primary key:
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP PRIMARY KEY,
`data` BLOB NOT NULL
) WITHOUT ROWID;
This creates a single b-tree index for the log based on the primary key. In other databases, this structure would be called a "clustered index". You have probably already read the documentation but I'm referencing it anyway.
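If you want to confirm that the range query is actually served by that primary-key b-tree, you can ask SQLite for its query plan (the exact output text varies between versions):
EXPLAIN QUERY PLAN
SELECT * FROM `log` WHERE `time` BETWEEN ? AND ?;
-- Expect a SEARCH using the PRIMARY KEY rather than a full table SCAN.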
You would have an issue (or not, depending upon how you consider it) in that you cannot use :-
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP,
`data` BLOB NOT NULL
) WITHOUT ROWID;
because :-
Every WITHOUT ROWID table must have a PRIMARY KEY. An error is raised
if a CREATE TABLE statement with the WITHOUT ROWID clause lacks a
PRIMARY KEY.
Clustered Indexes and the WITHOUT ROWID Optimization
So you might as well make the time column the PRIMARY KEY.
But the problem is that the precision of REAL is not enough to handle microsecond resolution, and thus two adjacent records may have the same time value, which would violate the PRIMARY KEY constraint.
Then you could use a composite PRIMARY KEY where the precision required is satisfied by multiple columns (a second column would likely more than suffice) perhaps along the lines of :-
CREATE TABLE log (
    time_datepart INTEGER,
    time_microsecondpart INTEGER,
    data BLOB NOT NULL,
    PRIMARY KEY (time_datepart, time_microsecondpart)
) WITHOUT ROWID;
The time_microsecondpart column needn't necessarily hold microseconds; it could be a counter derived from another table, similar to how the sqlite_sequence table is utilised when AUTOINCREMENT is used (minus the need for the column that holds the name of the table a row is attached to).
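A minimal sketch of that counter idea, with invented table and column names: a single-row helper table is bumped on every insert and its value fills time_microsecondpart, so the composite key stays unique even when two rows share the same time_datepart.
CREATE TABLE log_counter (seq INTEGER NOT NULL);
INSERT INTO log_counter (seq) VALUES (0);

-- On every insert, advance the counter and reuse its new value:
UPDATE log_counter SET seq = seq + 1;
INSERT INTO log (time_datepart, time_microsecondpart, data)
VALUES (strftime('%s','now'), (SELECT seq FROM log_counter), x'00');
Wrapping the last two statements in a transaction keeps the counter and the log row consistent.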

How to ignore duplicate Primary Key in SQL?

I have an Excel sheet with several values which I imported into SQL (book1$), and I want to transfer the values into ProcessList. Several rows have the same primary key, which is the ProcessId, because the rows contain original and modified values, both of which I want to keep. How do I make SQL ignore the duplicate primary keys?
I tried IGNORE_DUP_KEY = ON, but for rows with a duplicated primary key only one row (the latest) shows up.
CREATE TABLE dbo.ProcessList
(
Edited varchar(1),
ProcessId int NOT NULL PRIMARY KEY WITH (IGNORE_DUP_KEY = ON),
Name varchar(30) NOT NULL,
Amount smallmoney NOT NULL,
CreationDate datetime NOT NULL,
ModificationDate datetime
)
INSERT INTO ProcessList SELECT Edited, ProcessId, Name, Amount, CreationDate, ModificationDate FROM Book1$
SELECT * FROM ProcessList
Also, if I have a row and I update the values of that row, is there any way to keep the original values of the row and insert a clone of that row below, with the updated values and creation/modification date updated automatically?
How do I make SQL ignore the duplicate primary keys?
Under no circumstances can a transaction be committed that results in a table containing two distinct rows with the same primary key. That is fundamental to the nature of a primary key. SQL Server's IGNORE_DUP_KEY option does not change that -- it merely affects how SQL Server handles the problem. (With the option turned on it silently refuses to insert rows having the same primary key as any existing row; otherwise, such an insertion attempt causes an error.)
You can address the situation either by dropping the primary key constraint or by adding one or more columns to the primary key to yield a composite key whose collective value is not duplicated. I don't see any good candidate columns for an expanded PK among those you described, though. If you drop the PK then it might make sense to add a synthetic, autogenerated PK column.
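A sketch of that last option: a new autogenerated surrogate key (RowId is an invented name) and ProcessId demoted to an ordinary, duplicable column.
CREATE TABLE dbo.ProcessList
(
    RowId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Edited varchar(1),
    ProcessId int NOT NULL,      -- no longer the primary key, duplicates allowed
    Name varchar(30) NOT NULL,
    Amount smallmoney NOT NULL,
    CreationDate datetime NOT NULL,
    ModificationDate datetime
)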
Also, if I have a row and I update the values of that row, is there any way to keep the original values of the row and insert a clone of that row below, with the updated values and creation/modification date updated automatically?
If you want to ensure that this happens automatically, however a row happens to be updated, then look into triggers. If you want a way to automate it, but you're willing to make the user ask for the behavior, then consider a stored procedure.
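For the trigger route, here is a rough sketch in T-SQL. It assumes ProcessId is no longer a single-column primary key (for instance after switching to the surrogate-key layout above), otherwise the copied row would be rejected; the 'N' value written to Edited is purely illustrative.
CREATE TRIGGER trg_ProcessList_KeepOriginal
ON dbo.ProcessList
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Re-insert the pre-update version of every updated row,
    -- so the original values survive alongside the new ones.
    INSERT INTO dbo.ProcessList (Edited, ProcessId, Name, Amount, CreationDate, ModificationDate)
    SELECT 'N', d.ProcessId, d.Name, d.Amount, d.CreationDate, d.ModificationDate
    FROM deleted AS d;
END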
try this
INSERT IGNORE INTO ProcessList SELECT Edited, ProcessId, Name, Amount, CreationDate, ModificationDate FROM Book1$
SELECT * FROM ProcessList
You drop the constraint. Something like this:
alter table dbo.ProcessList drop constraint PK_ProcessId;
You need to know the constraint name.
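If you don't know the name, SQL Server's catalog views can tell you:
SELECT name
FROM sys.key_constraints
WHERE [type] = 'PK'
  AND parent_object_id = OBJECT_ID('dbo.ProcessList');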
In other words, you can't ignore a primary key. It is defined as unique and not-null. If you want the table to have duplicates, then that is not the primary key.

How to constraint one column with values from a column from another table?

This isn't a big deal, but my OCD is acting up with the following problem in the database I'm creating. I'm not used to working with databases, but the data has to be stored somewhere...
Problem
I have two tables A and B.
One of the data fields is common to both tables - segments. There's a finite number of segments, and I want to write queries that connect values from A to B through their segment values, very much as if the following table structure were used:
However, as you can see, the table Segments is empty. There's nothing more I want to put into that table other than the ID to give the other tables as a foreign key. I want my tables to be as simple as possible, and therefore adding another one just seems wrong.
Note also that one of these tables (A, say) is actually the master, in the sense that you should be able to put any value for segment into A, but B should first check with A before inserting.
EDIT
I tried one of the answers below:
create table A(
id int primary key identity,
segment int not null
)
create table B(
id integer primary key identity,
segment int not null
)
--Andomar's suggestion
alter table B add constraint FK_B_SegmentID
foreign key (segment) references A(segment)
This produced the following error.
Msg 1776, Level 16, State 0, Line 11 There are no primary or candidate
keys in the referenced table 'A' that match the referencing column
list in the foreign key 'FK_B_SegmentID'. Msg 1750, Level 16, State 0,
Line 11 Could not create constraint. See previous errors.
Maybe I was somehow unclear: segment is not unique in A or B and can appear many times in both tables.
You can create a foreign key relationship directly from B.SegmentID to A.SegmentID. There's no need for the extra table.
Update: If the SegmentIDs aren't unique in TableA, then you do need the extra table to store the segment IDs, and create foreign key relationships from both tables to this table. This however is not enough to enforce that all segment IDs in TableB also occur in TableA. You could instead use triggers.
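A rough sketch of the trigger idea against the non-unique segment columns from the edit above (T-SQL; the trigger name and error text are invented):
CREATE TRIGGER trg_B_SegmentMustExistInA
ON B
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Reject any new or changed B row whose segment has no counterpart in A.
    IF EXISTS (
        SELECT 1
        FROM inserted AS i
        WHERE NOT EXISTS (SELECT 1 FROM A WHERE A.segment = i.segment)
    )
    BEGIN
        RAISERROR('segment value does not exist in table A.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END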
You can ensure the segment exists in A with a foreign key:
alter table B add constraint FK_B_SegmentID
foreign key (SegmentID) references A(SegmentID)
To avoid rows in B without a segment at all, make B.SegmentID not nullable:
alter table B alter column SegmentID int not null
There is no need to create a Segments table unless you want to associate extra data with a SegmentID.
As Andomar and Mark Byers wrote, you don't have to create an extra table.
You can also CASCADE UPDATEs or DELETEs on the master. Be very careful with ON DELETE CASCADE though!
For queries use a JOIN:
SELECT *
FROM A
JOIN B ON a.SegmentID = b.SegmentID
Edit:
You have to add a UNIQUE constraint on segment_id in the "master" table to avoid duplicates there, or else the foreign key is not possible. Like this:
ALTER TABLE A ADD CONSTRAINT UNQ_A_SegmentID UNIQUE (SegmentID);
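With that unique constraint in place the foreign key itself can be created, optionally with cascading behaviour (a sketch; as noted above, think twice before ON DELETE CASCADE):
ALTER TABLE B ADD CONSTRAINT FK_B_SegmentID
    FOREIGN KEY (SegmentID) REFERENCES A (SegmentID)
    ON UPDATE CASCADE
    ON DELETE CASCADE;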
If I've understood correctly, a given segment cannot be inserted into table B unless it has also been inserted into table A. In which case, table A should reference table Segments and table B should reference table A; it would be implicit that table B ultimately references table Segments (indirectly via table A) so an explicit reference is not required. This could be done using foreign keys (e.g. no triggers required).
Because table A has its own key, I assume a given segment_ID can appear in table A more than once; therefore, for B to be able to reference the segment_ID value in A, a superkey needs to be defined on the compound of A_ID and segment_ID. Here's a quick sketch:
CREATE TABLE Segments
(
segment_ID INTEGER NOT NULL UNIQUE
);
CREATE TABLE A
(
A_ID INTEGER NOT NULL UNIQUE,
segment_ID INTEGER NOT NULL
REFERENCES Segments (segment_ID),
A_data INTEGER NOT NULL,
UNIQUE (segment_ID, A_ID) -- superkey
);
CREATE TABLE B
(
B_ID INTEGER NOT NULL UNIQUE,
A_ID INTEGER NOT NULL,
segment_ID INTEGER NOT NULL,
FOREIGN KEY (segment_ID, A_ID)
REFERENCES A (segment_ID, A_ID),
B_data INTEGER NOT NULL
);
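For illustration, the insert order this design enforces would look like this (the values are made up):
INSERT INTO Segments (segment_ID) VALUES (7);
INSERT INTO A (A_ID, segment_ID, A_data) VALUES (1, 7, 100);
-- Only succeeds because the pair (segment_ID 7, A_ID 1) already exists in A:
INSERT INTO B (B_ID, A_ID, segment_ID, B_data) VALUES (1, 1, 7, 200);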

Foreign key null - performance degradation

I have a table folder where the column parent_id references id if the folder has a parent; if not, parent_id is NULL. Is that an OK solution, or do I need an extra table for this connection, or some other solution? Can a foreign key be NULL at all, and if it can, will this solution have a longer execution time?
table folder(
id int primary key, //primary key in my table
parent_id int references id, //foreign key on id column in same table
....
)
Yes, a foreign key can be made to accept NULL values:
CREATE TABLE folders (
id int NOT NULL PRIMARY KEY,
parent_id int NULL,
FOREIGN KEY (parent_id) REFERENCES folders (id)
) ENGINE=InnoDB;
Query OK, 0 rows affected (0.06 sec)
INSERT INTO folders VALUES (1, NULL);
Query OK, 1 row affected (0.00 sec)
Execution time is not affected if a foreign key is set to accept NULL values or not.
UPDATE: Further to comment below:
Keep in mind that B-tree indexes are most effective for high-cardinality data (i.e. columns with many possible values, where the data in the column is unique or almost unique). If you will be having many NULL values (or any other repeated value), the query optimizer might choose not to use the index to filter the records for your result set, since it would be faster not to. However this problem is independent of the fact that the column is a foreign key or not.
You can have NULL foreign keys. No problems. I would not put an extra table just for folders without a parent (root folders). It will make your design more complicated with no benefits.
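As a small illustration against the folders table above: a NULL parent simply marks a root folder and is never checked against the foreign key.
-- A child of folder 1; the FK validates only non-NULL parents.
INSERT INTO folders VALUES (2, 1);

-- Root folders are simply the rows whose parent_id IS NULL.
SELECT * FROM folders WHERE parent_id IS NULL;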

How to create a unique index on a NULL column?

I am using SQL Server 2005. I want to constrain the values in a column to be unique, while allowing NULLS.
My current solution involves a unique index on a view like so:
CREATE VIEW vw_unq WITH SCHEMABINDING AS
SELECT Column1
FROM MyTable
WHERE Column1 IS NOT NULL
CREATE UNIQUE CLUSTERED INDEX unq_idx ON vw_unq (Column1)
Any better ideas?
Using SQL Server 2008, you can create a filtered index.
CREATE UNIQUE INDEX AK_MyTable_Column1 ON MyTable (Column1) WHERE Column1 IS NOT NULL
Another option is a trigger to check uniqueness, but this could affect performance.
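A rough sketch of that trigger option (SQL Server; it checks only the values touched by the current statement and rolls back if a non-NULL duplicate appears):
CREATE TRIGGER trg_MyTable_UniqueColumn1
ON MyTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    IF EXISTS (
        SELECT t.Column1
        FROM MyTable AS t
        JOIN inserted AS i ON i.Column1 = t.Column1
        WHERE t.Column1 IS NOT NULL
        GROUP BY t.Column1
        HAVING COUNT(*) > 1
    )
    BEGIN
        RAISERROR('Duplicate non-NULL value in Column1.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END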
The calculated column trick is widely known as a "nullbuster"; my notes credit Steve Kass:
CREATE TABLE dupNulls (
pk int identity(1,1) primary key,
X int NULL,
nullbuster as (case when X is null then pk else 0 end),
CONSTRAINT dupNulls_uqX UNIQUE (X,nullbuster)
)
Pretty sure you can't do that, as it violates the purpose of uniques.
However, this person seems to have a decent work around:
http://sqlservercodebook.blogspot.com/2008/04/multiple-null-values-in-unique-index-in.html
It is possible to use filter predicates to specify which rows to include in the index.
From the documentation:
WHERE <filter_predicate> Creates a filtered index by specifying which
rows to include in the index. The filtered index must be a
nonclustered index on a table. Creates filtered statistics for the
data rows in the filtered index.
Example:
CREATE TABLE Table1 (
NullableCol int NULL
)
CREATE UNIQUE INDEX IX_Table1 ON Table1 (NullableCol) WHERE NullableCol IS NOT NULL;
Strictly speaking, a unique nullable column (or set of columns) can be NULL (or a record of NULLs) only once, since having the same value (and this includes NULL) more than once obviously violates the unique constraint.
However, that doesn't mean the concept of "unique nullable columns" is invalid; to actually implement it in any relational database, we just have to bear in mind that these databases are meant to be normalized to work properly, and normalization usually involves the addition of several (non-entity) extra tables to establish relationships between the entities.
Let's work a basic example considering only one "unique nullable column", it's easy to expand it to more such columns.
Suppose we have the information represented by a table like this:
create table the_entity_incorrect
(
id integer,
uniqnull integer null, /* we want this to be "unique and nullable" */
primary key (id)
);
We can do it by putting uniqnull apart and adding a second table to establish a relationship between uniqnull values and the_entity (rather than having uniqnull "inside" the_entity):
create table the_entity
(
id integer,
primary key(id)
);
create table the_relation
(
the_entity_id integer not null,
uniqnull integer not null,
unique(the_entity_id),
unique(uniqnull),
/* primary key can be both or either of the_entity_id or uniqnull */
primary key (the_entity_id, uniqnull),
foreign key (the_entity_id) references the_entity(id)
);
To associate a value of uniqnull to a row in the_entity we need to also add a row in the_relation.
For rows in the_entity where no uniqnull value is associated (i.e. the ones for which we would have put NULL in the_entity_incorrect), we simply do not add a row to the_relation.
Note that values for uniqnull will be unique across all of the_relation, and also notice that for each row in the_entity there can be at most one row in the_relation, since the primary and foreign keys on it enforce this.
Then, if a value of 5 for uniqnull is to be associated with an the_entity id of 3, we need to:
start transaction;
insert into the_entity (id) values (3);
insert into the_relation (the_entity_id, uniqnull) values (3, 5);
commit;
And, if an id value of 10 for the_entity has no uniqnull counterpart, we only do:
start transaction;
insert into the_entity (id) values (10);
commit;
To denormalize this information and obtain the data a table like the_entity_incorrect would hold, we need to:
select
id, uniqnull
from
the_entity left outer join the_relation
on
the_entity.id = the_relation.the_entity_id
;
The "left outer join" operator ensures all rows from the_entity will appear in the result, putting NULL in the uniqnull column when no matching columns are present in the_relation.
Remember, any effort spent for some days (or weeks or months) in designing a well normalized database (and the corresponding denormalizing views and procedures) will save you years (or decades) of pain and wasted resources.