Faster Sqlite insert from another table - sql

I have an SQLite DB which I am doing updates on and it's very slow. I am wondering whether I am doing it the best way or whether there is a faster way. My tables are:
create table files(
fileid integer PRIMARY KEY,
name TEXT not null,
sha256 TEXT,
created INT,
mtime INT,
inode INT,
nlink INT,
fsno INT,
sha_id INT,
size INT not null
);
create table fls2 (
fileid integer PRIMARY KEY,
name TEXT not null UNIQUE,
size INT not null,
sha256 TEXT not null,
fs2,
fs3,
fs4,
fs7
);
Table 'files' is actually in an attached DB named ttb. I am then doing this:
UPDATE fls2
SET fs3 = (
SELECT inode || 'X' || mtime || 'X' || nlink
FROM
ttb.files
WHERE
ttb.files.fsno = 3
AND
fls2.name = ttb.files.name
AND
fls2.sha256 = ttb.files.sha256
);
So the idea is, fls2 has values in 'name' which are also present in ttb.files.name. In ttb.files there are other columns whose values I want to copy into the corresponding rows in fls2. The query works, but I assume matching up the two tables is what takes the time, and I wonder if there's a more efficient way to do it. There are indexes on each column in fls2 but none on files. I am doing it as a transaction, with pragma journal = memory (although sqlite seems to be ignoring that, because a journal file is being created).
It seems slow: so far about 90 minutes for around a million rows in each table.
One CPU is pegged, so I assume it's not disk bound.
Can anyone suggest a better way to structure the query?
EDIT: EXPLAIN QUERY PLAN
|--SCAN TABLE fls2
`--CORRELATED SCALAR SUBQUERY 1
   `--SCAN TABLE files
Not sure what that means though. It carries out the SCAN TABLE files for each SCAN TABLE fls2 hit?
EDIT2:
Well blimey: I hit Ctrl-C on the query, which had been running for 2.5 hours at that point, exited SQLite, opened the files DB and created an index on (sha256, name), which took a minute or so. Then I exited, opened the main DB again, and EXPLAIN showed that the inner scan now uses the index. The update then took 150 seconds. Compared to more than 150 minutes, that's a heck of a speed-up. Thanks for the assistance.
TIA, Pete
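The fix described in EDIT2 can be sketched with Python's sqlite3 module. This is only an illustration under assumptions: both databases are in-memory stand-ins, the tables are cut down to the columns the UPDATE touches, the sample rows are invented, and the index name idx_files_sha_name is hypothetical.

```python
import sqlite3

# In-memory stand-ins for the main DB and the attached DB "ttb" (assumption).
con = sqlite3.connect(":memory:")
con.execute("ATTACH DATABASE ':memory:' AS ttb")
con.executescript("""
    CREATE TABLE ttb.files (
        fileid INTEGER PRIMARY KEY, name TEXT NOT NULL, sha256 TEXT,
        mtime INT, inode INT, nlink INT, fsno INT
    );
    CREATE TABLE fls2 (
        fileid INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE,
        sha256 TEXT NOT NULL, fs3
    );
    INSERT INTO ttb.files (name, sha256, mtime, inode, nlink, fsno) VALUES
        ('a.txt', 'aaa', 100, 11, 1, 3),
        ('b.txt', 'bbb', 200, 22, 2, 3),
        ('c.txt', 'ccc', 300, 33, 1, 7);
    INSERT INTO fls2 (name, sha256) VALUES
        ('a.txt', 'aaa'), ('b.txt', 'bbb'), ('c.txt', 'ccc');
    -- Composite index on the two columns the correlated subquery matches on.
    CREATE INDEX ttb.idx_files_sha_name ON files (sha256, name);
""")

update_sql = """
    UPDATE fls2
    SET fs3 = (
        SELECT inode || 'X' || mtime || 'X' || nlink
        FROM ttb.files
        WHERE ttb.files.fsno = 3
          AND fls2.name = ttb.files.name
          AND fls2.sha256 = ttb.files.sha256
    )
"""

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + update_sql)]
con.execute(update_sql)
con.commit()

fs3_values = con.execute("SELECT name, fs3 FROM fls2 ORDER BY name").fetchall()
for line in plan:
    print(line)
print(fs3_values)
```

With the index in place the inner step should be reported as a SEARCH using the index rather than a SCAN, which is the change the question observed. Rows with no fsno = 3 match end up with fs3 set to NULL, exactly as in the original statement.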

There are indexes on each column in fls2
Indexes are used for faster selection. They slow down inserts and updates. Maybe removing the one for fls2.fs3 helps?

Not an expert on SQLite, but on some databases it performs better to build the new rows in a temporary table, delete the originals, and then insert the rows back from the temp table.
-- tmptab must already exist with the same columns as fls2
INSERT INTO tmptab
SELECT fls2.fileid,
fls2.name,
fls2.size,
fls2.sha256,
fls2.fs2,
f.inode || 'X' || f.mtime || 'X' || f.nlink,
fls2.fs4,
fls2.fs7
FROM fls2
INNER JOIN ttb.files AS f ON
fls2.name = f.name
AND
fls2.sha256 = f.sha256
WHERE f.fsno = 3; -- same filter as the original UPDATE
-- remove the rows that are about to be replaced
DELETE FROM fls2
WHERE EXISTS (SELECT 1 FROM tmptab WHERE tmptab.fileid = fls2.fileid);
INSERT INTO fls2 SELECT * FROM tmptab;


UPDATE two columns with new value under large size table

We have table like :
mytable (pid, string_value, int_value)
This table has more than 20M rows in total. We now have a feature that needs to mark all the rows in this table as invalid. To do that we need to update two columns, string_value = NULL and int_value = 0, which together indicate an invalid row (we still want to keep the pid, as it is important to us).
So what is the best way?
I use the following SQL:
UPDATE Mytable
SET string_value = NULL,
int_value = 0;
but this query takes more than 4 minutes in my test env. Is there any better way we can improve it?
Updating all the rows can be quite expensive. Often, it is faster to empty the table and reload it.
In generic SQL this looks like:
create table mytable_temp as
select pid
from mytable;
truncate table mytable; -- back it up first!
insert into mytable (pid, string_value, int_value)
select pid, null, 0
from mytable_temp;
The creation of the temporary table may use different syntax, depending on your database.
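As a rough illustration of the empty-and-reload pattern, here is a sketch using Python's sqlite3 module. It is an assumption-laden stand-in: the question does not name its DBMS, SQLite has no TRUNCATE so an unqualified DELETE is used instead, and the sample rows are invented. Whether this beats a plain UPDATE depends on the database and should be measured.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE mytable (pid INTEGER PRIMARY KEY, string_value TEXT, int_value INT);
    INSERT INTO mytable VALUES (1, 'a', 10), (2, 'b', 20), (3, 'c', 30);
""")

with con:  # commits on success, rolls back if any statement raises
    # 1. Keep only the column that must survive.
    con.execute("CREATE TEMP TABLE mytable_temp AS SELECT pid FROM mytable")
    # 2. Empty the table (SQLite stand-in for TRUNCATE TABLE).
    con.execute("DELETE FROM mytable")
    # 3. Reload every pid with the "invalid" marker values.
    con.execute(
        "INSERT INTO mytable (pid, string_value, int_value) "
        "SELECT pid, NULL, 0 FROM mytable_temp"
    )
    con.execute("DROP TABLE mytable_temp")

rows = con.execute("SELECT * FROM mytable ORDER BY pid").fetchall()
print(rows)
```

Running the delete and reload inside one transaction means a failure part-way through leaves the original rows in place.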
Updates can take time to complete. Another way of achieving this is with the following steps:
Add new columns with the values you need set as the default value
Drop the original columns
Rename the new columns with the names of the original columns.
You can then drop the default values on the new columns.
This needs to be tested, as different DBMSs allow different levels of table alteration (i.e. not all DBMSs allow a drop default or a drop column).

Create virtual table with rowid only of another table

Suppose I have a table in sqlite as follows:
`name` `age`
"bob" 20 (rowid=1)
"tom" 30 (rowid=2)
"alice" 19 (rowid=3)
And I want to store the result of the following query using minimal storage space:
SELECT * FROM mytable WHERE name < 'm' ORDER BY age
How can I store a virtual table from this resultset that will just give me the ordered resultset. In other words, storing the rowid in an ordered way (in the above it would be 3,1) without saving all the data into a separate table.
For example, if I stored this information with just the rowid in a sorted order:
CREATE TABLE vtable AS
SELECT rowid from mytable WHERE name < 'm' ORDER BY age;
Then I believe every time I would need to query the vtable I would have to join it back to the original table using the rowid. Is there a way to do this so that the vtable "knows" the content that it has based on the external table (I believe this is referred to as external-content when creating an fts index -- https://sqlite.org/fts5.html#external_content_tables).
I believe this is referred to as external-content when creating an fts.
No. A virtual table is created using CREATE VIRTUAL TABLE ... USING module_name (module_parameters)
Virtual tables are tables that can call a module, thus the USING module_name(module_parameters) is mandatory.
For FTS (Full Text Search) you would have to read the documentation, but it could be something like
CREATE VIRTUAL TABLE IF NOT EXISTS bible_fts USING FTS3(book, chapter INTEGER, verse INTEGER, content TEXT)
You very likely don't need/want a VIRTUAL table.
CREATE TABLE vtable AS SELECT rowid from mytable WHERE name < 'm' ORDER BY age;
This would create a normal table (provided it didn't already exist) that would persist. If you wanted to use it, it would probably only be of use when joined with mytable. Effectively it would allow a snapshot, but at a cost of at least 4k for every snapshot.
I'd suggest a single table for all snapshots with two columns: a snapshot identifier and the rowid of the row in the snapshot. This would probably consume far less space.
Basic Example
Consider :-
CREATE TABLE IF NOT EXISTS mytable (
id INTEGER PRIMARY KEY, /* NOTE not using an alias of the rowid may present issues as the id's can change */
name TEXT,
age INTEGER
);
CREATE TABLE IF NOT EXISTS snapshot (id TEXT DEFAULT CURRENT_TIMESTAMP, mytable_map);
INSERT INTO mytable (name,age) VALUES('Mary',21),('George',22);
INSERT INTO snapshot (mytable_map) SELECT id FROM mytable;
SELECT snapshot.id,name,age FROM snapshot JOIN mytable ON mytable.id = snapshot.mytable_map;
Suppose the above is run 3 times, a few seconds apart so that the snapshot ids (the timestamps) differ.
You would then get 3 snapshots, each with a number of rows sharing the same value in the id column: the first with 2 rows, the second with 4 and the last with 6 (as each run adds 2 rows to mytable).
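The three runs described above can be reproduced with a short Python sqlite3 sketch. One deliberate change is an assumption of this sketch: it uses an explicit integer snapshot_id instead of the CURRENT_TIMESTAMP default, so the runs do not have to be spaced apart in time. The helper name take_snapshot is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    -- One shared table for all snapshots: which snapshot, and which mytable row.
    CREATE TABLE snapshot (snapshot_id INTEGER, mytable_map INTEGER);
""")

def take_snapshot(snapshot_id):
    """Add two rows to mytable, then record every current row id under snapshot_id."""
    con.execute("INSERT INTO mytable (name, age) VALUES ('Mary', 21), ('George', 22)")
    con.execute(
        "INSERT INTO snapshot (snapshot_id, mytable_map) SELECT ?, id FROM mytable",
        (snapshot_id,),
    )

for run in (1, 2, 3):
    take_snapshot(run)

# Rows per snapshot.
counts = con.execute(
    "SELECT snapshot_id, COUNT(*) FROM snapshot "
    "GROUP BY snapshot_id ORDER BY snapshot_id"
).fetchall()

# Reading a snapshot back means joining on the stored id.
first_snapshot = con.execute(
    "SELECT name, age FROM snapshot JOIN mytable ON mytable.id = snapshot.mytable_map "
    "WHERE snapshot_id = 1 ORDER BY mytable.id"
).fetchall()
print(counts, first_snapshot)
```

Each snapshot costs one small row per captured id rather than a whole table, which is the space saving the answer is after.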

SQLite slow select query - howto make it faster

In SQLite I have a large DB (~35Mb).
It contains a table with the following syntax:
CREATE TABLE "log_temperature" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
"date" datetime NOT NULL,
"temperature" varchar(20) NOT NULL
)
Now when I want to search for data within a period, it is too slow on an embedded system:
$ time sqlite3 sample.sqlite "select MIN(id) from log_temperature where
date BETWEEN '2019-08-13 00:00:00' AND '2019-08-13 23:59:59';"
331106
real 0m2.123s
user 0m0.847s
sys 0m0.279s
Note1: ids are running from 210610 to 331600.
Note2: if I run 'SELECT id FROM log_temperature ORDER BY id ASC LIMIT 1', it gives the exact same timing as with the 'MIN' function.
I want to have the 'real time of 0m2.123s' to be as close to 0m0.0s as possible.
What are my options for making this faster? (Without removing hundreds of thousands of rows?)
ps.: embedded system parameters are not important here. This shall be solved by optimizing the query or the underlying schema.
First, I would recommend that you write the query as:
select MIN(id)
from log_temperature
where date >= '2019-08-13' and date < '2019-08-14';
This doesn't impact performance, but it makes the query easier to write -- and no need to fiddle with times.
Then, you want an index on (date, id):
create index idx_log_temperature_date_id on log_temperature(date, id);
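I don't think id is strictly needed in the index if it is declared as the primary key of the table: in SQLite an INTEGER PRIMARY KEY is an alias for the rowid, which every index entry already stores.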
Can you create an index on the date?
CREATE INDEX index_name ON log_temperature(date);
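A small Python sqlite3 sketch of the suggestions above: an index on date plus the half-open range form of the query. The sample data (one reading per hour over three days) and the index name idx_log_temperature_date are assumptions for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE log_temperature (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        date datetime NOT NULL,
        temperature varchar(20) NOT NULL
    )
""")

# Invented data: hourly readings for 12, 13 and 14 August 2019.
readings = [
    ("2019-08-%02d %02d:00:00" % (day, hour), "21.5")
    for day in (12, 13, 14)
    for hour in range(24)
]
con.executemany(
    "INSERT INTO log_temperature (date, temperature) VALUES (?, ?)", readings
)

con.execute("CREATE INDEX idx_log_temperature_date ON log_temperature(date)")

query = (
    "SELECT MIN(id) FROM log_temperature "
    "WHERE date >= '2019-08-13' AND date < '2019-08-14'"
)

# The plan text varies between SQLite versions, so it is printed, not parsed.
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])

first_id = con.execute(query).fetchone()[0]
print(first_id)
```

The half-open range works because the dates are stored as ISO-8601 text, so string comparison matches chronological order.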

How Do I Deep Copy a Set of Data, and Change FK References to Point to All the Copies?

Suppose I have Table A and Table B. Table B references Table A. I want to deep copy a set of rows in Table A and Table B. I want all of the new Table B rows to reference the new Table A rows.
Note that I'm not copying the rows into any other tables. The rows in table A will be copied into table A, and the rows in table B will be copied into table B.
How can I ensure that the foreign key references get readjusted as part of the copy?
To clarify, I'm trying to find a generic way to do this. The example I'm giving involves two tables, but in practice the dependency graph may be much more complicated. Even a generic way to dynamically generate SQL to do the work would be fine.
UPDATE:
People are asking why this is necessary, so I'll give some background. It may be way too much, but here goes:
I'm working with an old desktop application that's been moved to a client-server model. But, the application still uses a rudimentary in-house binary file format for storing data for its tables. A data file is just a header followed by a series of rows, each of which is just the binary serialized field values, the order of which is determined by a schema text file. The only thing good about it is that it's very fast. It's terrible in every other respect. I'm moving the application to SQL Server and trying not to degrade the performance too badly.
This is a kind of scheduling application; the data's not critical to anybody, and there's no audit tracking, etc. necessary. It's not a supermassive amount of data, and we don't necessarily need to keep very old data around if the database grows too large.
One feature that they are accustomed to is the ability to duplicate entire schedules in order to create "what-if" scenarios that they can muck with. Any user can do this as many times as they want, as often as they want. In the old database, the data files for each schedule are stored in their own data folder, identified by name. So, copying a schedule was as simple as copying the data folder and renaming it.
I must be able to do effectively the same thing with SQL Server or the migration will not work. Maybe you're thinking that I can just only copy the data that actually gets changed in order to avoid redundancy; but that honestly sounds too complicated to be feasible.
To throw another wrench into the mix, there can be a hierarchy of schedule data folders. So, a data folder may contain a data folder, which may contain a data folder. And the copying can occur at any level.
In SQL Server, I'm implementing a nested set hierarchy to mimic this. I have a DATA_SET table like this:
CREATE TABLE dbo.DATA_SET
(
DATA_SET_ID UNIQUEIDENTIFIER PRIMARY KEY,
NAME NVARCHAR(128) NOT NULL,
LFT INT NOT NULL,
RGT INT NOT NULL
)
So, there's a tree structure of data sets. Each data set represents a schedule, and may contain child data sets. Every row in every table has a DATA_SET_ID FK reference, indicating which data set it belongs to. Whenever I copy a data set, I copy all the rows in the table for that data set, and every other data set, into the same table, but referencing new data sets.
So, here's a simple concrete example:
CREATE TABLE FOO
(
FOO_ID BIGINT PRIMARY KEY,
DATA_SET_ID BIGINT FOREIGN KEY REFERENCES DATA_SET(DATA_SET_ID) NOT NULL
)
CREATE TABLE BAR
(
BAR_ID BIGINT PRIMARY KEY,
DATA_SET_ID BIGINT FOREIGN KEY REFERENCES DATA_SET(DATA_SET_ID) NOT NULL,
FOO_ID BIGINT FOREIGN KEY REFERENCES FOO(FOO_ID) NOT NULL
)
INSERT INTO FOO
SELECT 1, 1 UNION ALL
SELECT 2, 1 UNION ALL
SELECT 3, 1
INSERT INTO BAR
SELECT 1, 1, 1 UNION ALL
SELECT 2, 1, 2 UNION ALL
SELECT 3, 1, 3
So, let's say I copy data set 1 into a new data set of ID 2. After I copy, the tables will look like this:
FOO
FOO_ID, DATA_SET_ID
1 1
2 1
3 1
4 2
5 2
6 2
BAR
BAR_ID, DATA_SET_ID, FOO_ID
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 2 6
As you can see, the new BAR rows are referencing the new FOO rows. It's not the rewiring of the DATA_SET_ID's that I'm asking about. I'm asking about rewiring the foreign keys in general.
So, that was surely too much information, but there you go.
I'm sure there are a lot of concerns about performance with the idea of bulk copying the data like this. The tables are not going to be huge. I'm not expecting more than 1000 records in any table, and most of the tables will be much much smaller than that. Old data sets can be deleted outright with no repercussions.
Thanks,
Tedderz
Here is an example with three tables that can probably get you started.
DB schema
CREATE TABLE users
(user_id int auto_increment PRIMARY KEY,
user_name varchar(32));
CREATE TABLE agenda
(agenda_id int auto_increment PRIMARY KEY,
`user_id` int, `agenda_name` varchar(7));
CREATE TABLE events
(event_id int auto_increment PRIMARY KEY,
`agenda_id` int,
`event_name` varchar(8));
An SP to clone a user with his agenda and events records
DELIMITER $$
CREATE PROCEDURE clone_user(IN uid INT)
BEGIN
DECLARE last_user_id INT DEFAULT 0;
INSERT INTO users (user_name)
SELECT user_name
FROM users
WHERE user_id = uid;
SET last_user_id = LAST_INSERT_ID();
INSERT INTO agenda (user_id, agenda_name)
SELECT last_user_id, agenda_name
FROM agenda
WHERE user_id = uid;
INSERT INTO events (agenda_id, event_name)
SELECT a3.agenda_id_new, e.event_name
FROM events e JOIN
(SELECT a1.agenda_id agenda_id_old,
a2.agenda_id agenda_id_new
FROM
(SELECT agenda_id, @n := @n + 1 n
FROM agenda, (SELECT @n := 0) n
WHERE user_id = uid
ORDER BY agenda_id) a1 JOIN
(SELECT agenda_id, @m := @m + 1 m
FROM agenda, (SELECT @m := 0) m
WHERE user_id = last_user_id
ORDER BY agenda_id) a2 ON a1.n = a2.m) a3
ON e.agenda_id = a3.agenda_id_old;
END$$
DELIMITER ;
To clone a user
CALL clone_user(3);
Here is SQLFiddle demo.
I recently found myself needing to solve a similar problem; that is, I needed to copy a set of rows in a table (Table A) as well as all of the rows in related tables which have foreign keys pointing to Table A's primary key. I was using Postgres, so the exact queries may differ, but the overall approach is the same. The biggest benefit of this approach is that it can be used recursively to go arbitrarily deep.
TLDR: the approach looks like this
1) find all the related table/columns of Table A
2) copy the necessary data into temporary tables
3) create a trigger and function to propagate primary key column
updates to related foreign keys columns in the temporary tables
4) update the primary key column in the temporary tables to the next
value in the auto increment sequence
5) Re-insert the data back into the source tables, and drop the
temporary tables/triggers/function
1) The first step is to query the information schema to find all of the tables and columns which are referencing Table A. In Postgres this might look like the following:
SELECT tc.table_name, kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON ccu.constraint_name = tc.constraint_name
WHERE constraint_type = 'FOREIGN KEY'
AND ccu.table_name='<Table A>'
AND ccu.column_name='<Primary Key>'
2) Next we need to copy the data from Table A and from any other tables which reference Table A. Let's say there is one, called Table B. To start this process, let's create a temporary table for each of these tables and populate it with the data that we need to copy. This might look like the following:
CREATE TEMP TABLE temp_table_a AS (
SELECT * FROM <Table A> WHERE ...
);
CREATE TEMP TABLE temp_table_b AS (
SELECT * FROM <Table B> WHERE <Foreign Key> IN (
SELECT <Primary Key> FROM temp_table_a
)
);
3) We can now define a function that will cascade primary key column updates out to the related foreign key columns, and a trigger that will execute it whenever the primary key column changes. For example:
CREATE OR REPLACE FUNCTION cascade_temp_table_a_pk()
RETURNS trigger AS
$$
BEGIN
UPDATE <Temp Table B> SET <Foreign Key> = NEW.<Primary Key>
WHERE <Foreign Key> = OLD.<Primary Key>;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_temp_table_a
AFTER UPDATE
ON <Temp Table A>
FOR EACH ROW
WHEN (OLD.<Primary Key> != NEW.<Primary Key>)
EXECUTE PROCEDURE cascade_temp_table_a_pk();
4) Now we just update the primary key column in <Temp Table A> to the next value of the sequence of the source table (<Table A>). This will activate the trigger, and the updates will be cascaded out to the foreign key columns in <Temp Table B>. In Postgres you can do the following:
UPDATE <Temp Table A>
SET <Primary Key> = nextval(pg_get_serial_sequence('<Table A>', '<Primary Key>'));
5) Insert the data from the temporary tables back into the source tables, then drop the temporary tables, trigger, and function.
INSERT INTO <Table A> (SELECT * FROM <Temp Table A>);
INSERT INTO <Table B> (SELECT * FROM <Temp Table B>);
DROP TRIGGER trigger_temp_table_a ON <Temp Table A>;
DROP FUNCTION cascade_temp_table_a_pk();
It is possible to take this general approach and turn it into a script that can be called recursively to go arbitrarily deep. I ended up doing just that using Python (our application was using Django, so I was able to use the Django ORM to make some of this easier).
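Both answers come down to keeping a mapping from old primary keys to new ones while copying. A minimal sketch of that idea, using Python's sqlite3 module and the FOO/BAR example from the question, might look like this. It is not either author's method: SQLite stands in for SQL Server, Postgres and MySQL, the DATA_SET table is left out, and the function name copy_data_set is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE FOO (FOO_ID INTEGER PRIMARY KEY, DATA_SET_ID INTEGER NOT NULL);
    CREATE TABLE BAR (
        BAR_ID INTEGER PRIMARY KEY,
        DATA_SET_ID INTEGER NOT NULL,
        FOO_ID INTEGER NOT NULL REFERENCES FOO(FOO_ID)
    );
    INSERT INTO FOO VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO BAR VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3);
""")

def copy_data_set(con, old_set, new_set):
    """Copy FOO and BAR rows of old_set into new_set, rewiring BAR.FOO_ID."""
    foo_map = {}  # old FOO_ID -> new FOO_ID
    with con:  # one transaction for the whole copy
        old_foo_ids = [
            row[0]
            for row in con.execute(
                "SELECT FOO_ID FROM FOO WHERE DATA_SET_ID = ? ORDER BY FOO_ID",
                (old_set,),
            )
        ]
        for old_id in old_foo_ids:
            cur = con.execute("INSERT INTO FOO (DATA_SET_ID) VALUES (?)", (new_set,))
            foo_map[old_id] = cur.lastrowid  # key generated for the copy

        old_bars = con.execute(
            "SELECT FOO_ID FROM BAR WHERE DATA_SET_ID = ? ORDER BY BAR_ID",
            (old_set,),
        ).fetchall()
        for (old_foo_id,) in old_bars:
            # Point the copied child at the copied parent, not the original.
            con.execute(
                "INSERT INTO BAR (DATA_SET_ID, FOO_ID) VALUES (?, ?)",
                (new_set, foo_map[old_foo_id]),
            )
    return foo_map

foo_map = copy_data_set(con, 1, 2)
new_foo = con.execute("SELECT * FROM FOO WHERE DATA_SET_ID = 2 ORDER BY FOO_ID").fetchall()
new_bar = con.execute("SELECT * FROM BAR WHERE DATA_SET_ID = 2 ORDER BY BAR_ID").fetchall()
print(foo_map, new_foo, new_bar)
```

For a deeper dependency graph the same pattern would be repeated table by table in dependency order, with one mapping per parent table. Row-by-row inserts are slow in general, but the question expects no more than about 1000 records per table.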

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records in a new table to analyze them and I'm thinking in creating a new primary key with the values from all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3*31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as PK, for what I need to do is to compare them with data from other DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more of background is needed. I'm sorry for not providing it before.
I have a database ( dm ) that is updated every day from another db ( original source ). It has records from the past two years.
Last month ( July ) the update process broke, and for a month no data was being loaded into the dm.
I manually created a table with the same structure in my Oracle XE and copied the records from the original source into my db ( myxe ). I copied only the records from July, to create a report needed by the end of the month.
Finally, on Aug 8, the update process got fixed, and the records which had been waiting to be migrated by this automatic process got copied into the database ( from originalsource to dm ).
This process cleans the data out of the original source once it has been copied ( into dm ).
Everything looked fine, but we have just realized that a number of the records got lost ( about 25% of July ).
So, what I want to do is to use my backup ( myxe ) and insert all the missing records into the database ( dm ).
The problem here are:
They don't have a well defined PK.
They are in separate databases.
So I thought that If I could create a unique pk from both tables which gave the same number I could tell which were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a , the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
Which returns all the rows that are present in both databases ( the intersection, I think ). I got 2,000 records.
What I'm going to do next is delete them all from the target db and then insert all the rows from my db into the target table.
I hope I don't end up with something worse :-S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
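To make the uniqueness concern concrete, here is a tiny Python sketch of the hash formula from the question. The function name naive_hash and the sample rows are made up for illustration.

```python
def naive_hash(column1, column2, column3):
    """The formula from the question: each column multiplied by 31, then summed."""
    return column1 * 31 + column2 * 31 + column3 * 31

# Two different rows whose values merely appear in a different order.
row_a = (1, 2, 3)
row_b = (3, 2, 1)

hash_a = naive_hash(*row_a)
hash_b = naive_hash(*row_b)
collision = hash_a == hash_b
print(hash_a, hash_b, collision)
```

Because the formula is just 31 times the sum of the columns, any two rows with the same column total collide, so it cannot serve as a primary key.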
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT;
UPDATE mytable
SET pk_col = rownum;
ALTER TABLE mytable MODIFY pk_col INT NOT NULL;
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col);
or this:
ALTER TABLE mytable ADD pk_col RAW(16);
UPDATE mytable
SET pk_col = SYS_GUID();
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL;
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col);
The latter uses GUIDs, which are unique across databases but consume more space and are much slower to generate (your INSERTs will be slower).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
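MERGE
INTO mytable v
USING (
-- ROW_NUMBER() numbers the rows after sorting; rownum is assigned before ORDER BY
SELECT rowid AS rid,
ROW_NUMBER() OVER (ORDER BY col1, col2, col3) AS rn
FROM mytable
) q
ON (v.rowid = q.rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = q.rn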
Note that tables should be identical up to a single row (i. e. have same number of rows with same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in mytable@dm.
Note that MINUS collapses duplicates, if there are any, into a single row.
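The set-difference idea can be tried out locally with Python's sqlite3 module. This is a sketch under assumptions: SQLite spells the operator EXCEPT rather than Oracle's MINUS, both tables live in one in-memory database instead of behind a database link, and the table names, the duration column and the sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- backup_rows plays the role of the myxe copy, dm_rows the incomplete dm table.
    CREATE TABLE backup_rows (idle INT, activity INT, duration INT, finishdate TEXT);
    CREATE TABLE dm_rows     (idle INT, activity INT, duration INT, finishdate TEXT);
    INSERT INTO backup_rows VALUES
        (1, 10, 100, '2000-07-01'),
        (2, 20, 200, '2000-07-02'),
        (3, 30, 300, '2000-07-03');
    INSERT INTO dm_rows VALUES
        (1, 10, 100, '2000-07-01'),
        (3, 30, 300, '2000-07-03');
""")

difference_sql = "SELECT * FROM backup_rows EXCEPT SELECT * FROM dm_rows"

# Rows present in the backup but absent from dm.
missing = con.execute(difference_sql).fetchall()

# Insert exactly those rows; no primary key is involved.
with con:
    con.execute("INSERT INTO dm_rows " + difference_sql)

total = con.execute("SELECT COUNT(*) FROM dm_rows").fetchone()[0]
print(missing, total)
```

Like MINUS, EXCEPT compares whole rows and returns distinct results, so genuine duplicate records in the backup would be restored only once.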
Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns as are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data; could you use that?