Is high read but absolutely no writing on SQLite query normal? - sql

I've been running a query to add a column to a table by matching IDs from another table. Both have about 600 million rows, so it's reasonable that this would take a while, but what is worrying me is that there are high read speeds (~500 MB/s) on the disk while SQLite is writing 0 B/s according to iotop. The file size of my .db file has not changed in hours, but adding a column to a 600 million row table should change AT LEAST one byte, no?
Is this normal behavior from SQLite? The machine is pretty powerful: Ubuntu 16 on a quad-core i7 with 64GB RAM and an NVMe SSD. The queries and table schema are below.
ALTER TABLE tableA ADD address TEXT;
UPDATE tableA SET address = (SELECT address FROM tx_out WHERE tableA.ID = tx_out.ID);
Table schema:
CREATE TABLE tableA (
ID TEXT,
column1 INT,
column2 TEXT
);
CREATE TABLE tx_out (
ID TEXT,
sequence INT,
address TEXT
);

Adding a column changes almost nothing on disk; a row with fewer values than the table has columns is assumed to have NULLs in the missing columns.
The UPDATE is extremely slow because the subquery has to scan through the entire tx_out table for every row in tableA.
You could speed it up greatly with an index on the tx_out.ID column.
When the database has to rewrite all rows anyway, and you have the disk space, it might be a better idea to create a new table:
INSERT INTO NewTable(ID, column1, column2, address)
SELECT ID, column1, column2, address
FROM tableA
JOIN tx_out USING (ID); -- tx_out.ID also needs an index to be fast
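For completeness, a minimal sketch of the supporting statements for that approach in SQLite; the NewTable name is illustrative and the column types are assumed from the schema above:
CREATE INDEX idx_tx_out_id ON tx_out(ID); -- avoids a full scan of tx_out for every tableA row
CREATE TABLE NewTable (
    ID TEXT,
    column1 INT,
    column2 TEXT,
    address TEXT
);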

Too big for a comment
I've had this run for days with no change... I think it is prone to locking itself up in one manner or another. I killed it after its 3rd day of what appeared to be no changes. I've encountered very similar issues when attempting to add a new index as well, but that one successfully completed in 2 days, before I hit my 3-day kill threshold ;) Possibly 3 days just wasn't enough?
My preference now is to create a second table that has the new column, load it with the data from the old table plus the new column, rename the old table to X_oldtablename, and rename the new table to the original name. Run tests and drop X_oldtablename once you are confident the new table is working.
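A minimal sketch of that swap in SQLite, assuming the new table (here called NewTable) has already been loaded; the x_ prefix is just a naming convention:
ALTER TABLE tableA RENAME TO x_tableA_old;
ALTER TABLE NewTable RENAME TO tableA;
-- run your tests, then once you are confident:
DROP TABLE x_tableA_old;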

Related

Improve insert performance when checking existing rows

I have this simple query that inserts rows from one table (sn_users_main) into another (sn_users_history).
To make sure sn_users_history only has unique rows, it checks whether the column query_time already exists, and if it does, it doesn't insert. query_time is a kind of session identifier that is the same for every row in sn_users_main.
This works fine, but since sn_users_history is reaching 50k rows, running this query takes more than 2 minutes, which is too much. Is there anything I can do to improve performance and get the same result?
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE NOT EXISTS(SELECT snh.query_time
FROM sn_users_history snh
WHERE snh.query_time = snm.query_time) -- Don't insert items into the history table if they already exist
I think you are missing an extra condition on user_id when you insert into the history table; you have to check the combination of user_id and query_time (see the sketch below).
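A hedged sketch of that combined check; the column list is abbreviated here, but it would be the same as in the original query:
INSERT INTO sn_users_history (query_time, user_id /* , ...remaining columns... */)
SELECT snm.query_time, snm.user_id /* , ...remaining columns... */
FROM sn_users_main snm
WHERE NOT EXISTS (SELECT 1
                  FROM sn_users_history snh
                  WHERE snh.query_time = snm.query_time
                    AND snh.user_id = snm.user_id);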
As for your question, I think you are trying to reinvent the wheel. SQL Server already has temporal tables to support keeping this kind of historical data. Read about SQL Server Temporal Tables.
If you still want to continue with this approach, I would suggest you do it in batches:
Create a configuration table to hold the last processed query_time:
CREATE TABLE HistoryConfig(HistoryConfigId int, HistoryTableName SYSNAME,
lastProcessedQueryTime DATETIME)
Then you can do incremental historical inserts:
DECLARE @lastProcessedQueryTime DATETIME = (SELECT MAX(lastProcessedQueryTime) FROM HistoryConfig);
INSERT INTO sn_users_history(query_time,user_id,sn_name,sn_email,sn_manager,sn_active,sn_updated_on,sn_last_Login_time,sn_is_vip,sn_created_on,sn_is_team_lead,sn_company,sn_department,sn_division,sn_role,sn_employee_profile,sn_location,sn_employee_type,sn_workstation) --- Columns of history table
SELECT snm.query_time,
snm.user_id,
snm.sn_name,
snm.sn_email,
snm.sn_manager,
snm.sn_active,
snm.sn_updated_on,
snm.sn_last_Login_time,
snm.sn_is_vip,
snm.sn_created_on,
snm.sn_is_team_lead,
snm.sn_company,
snm.sn_department,
snm.sn_division,
snm.sn_role,
snm.sn_employee_profile,
snm.sn_location,
snm.sn_employee_type,
snm.sn_workstation
---Columns of main table
FROM sn_users_main snm
WHERE snm.query_time > @lastProcessedQueryTime
Now you can update the configuration again:
UPDATE HistoryConfig
SET lastProcessedQueryTime = (SELECT MAX(query_time) FROM sn_users_history)
WHERE HistoryTableName = 'sn_users_history';
I would suggest creating a clustered index on (user_id, query_time) if possible (otherwise a non-clustered index), which will improve the performance.
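For example, a hedged sketch of the non-clustered variant of that index on the history table (the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_sn_users_history_user_query
    ON sn_users_history (user_id, query_time);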
Other approaches you could consider:
Create a clustered index on (user_id, query_time) in the historical table, also have (user_id, query_time) as the clustered index on the main table, and perform a MERGE operation.
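If you try the MERGE route, a minimal sketch might look like this (columns abbreviated, and it assumes the combination of user_id and query_time uniquely identifies a row):
MERGE sn_users_history AS tgt
USING sn_users_main AS src
    ON tgt.user_id = src.user_id AND tgt.query_time = src.query_time
WHEN NOT MATCHED BY TARGET THEN
    INSERT (query_time, user_id /* , ...remaining columns... */)
    VALUES (src.query_time, src.user_id /* , ...remaining columns... */);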

Add column to huge table in Postgresql without downtime

I have a very small table (about 1 million rows), and I'm going to drop constraints and add a new column. The query below hung for about 5 minutes; I had to roll back.
BEGIN WORK;
LOCK TABLE usertable IN SHARE MODE;
ALTER TABLE usertable ALTER COLUMN email DROP NOT NULL;
COMMIT WORK;
Another approach suggested in similar questions on the internet:
CREATE TABLE new_tbl
(
field1 int,
field2 int,
...
);
INSERT INTO new_tbl(field1, field2, ...)
(
SELECT FROM ... -- use your new logic here to insert the updated data
)
CREATE INDEX -- add your constraints and indexes to new_tbl
DROP TABLE tbl;
ALTER TABLE new_tbl RENAME TO tbl;
Create a new table
Insert records from the old table into the new one (takes less than a second)
Drop the old table - this query hangs for about 5 minutes. Had to roll back; it does not work for me.
Rename the newly created table to the old name
Dropping the old table simply hangs. However, when I try to drop the newly created table with 1 million rows, it works instantly. Why does dropping the old table take so much time?
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
I can see a lot of concurrent writes/reads that are waiting on my operation. Since I took a lock on the table, I don't really think that is the reason why I can't drop the old table.
I just ran VACUUM on the old table; it did not help.
Why can't I drop the old table, and why does it take so much time compared to dropping the recently created table?
I don't have a lot of experience with PostgreSQL, but my guess is that it keeps a bit to signify when a NULLable column is NULL (as opposed to empty) and when that column is marked as NOT NULL it no longer needs that bit. So, when you change that attribute on a column the system needs to go through the whole table and rearrange the data, moving lots of bits around so that the rows are all correctly structured.
This is much different from a DROP TABLE, which merely needs to mark the disk space as no longer in use and perhaps update a few metadata values.
In short, they're very different actions, so of course they take different amounts of time.
I was not able to drop/rename the original table because of foreign keys from other tables. Once I dropped those, the approach with renaming the table worked great.
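For that foreign-key issue, one hedged way to find which constraints still reference the old table before dropping it (the table name tbl is taken from the snippet above; the names in the ALTER TABLE line are illustrative):
SELECT conname, conrelid::regclass AS referencing_table
FROM pg_catalog.pg_constraint
WHERE contype = 'f'
  AND confrelid = 'tbl'::regclass;

-- then, for each constraint found:
ALTER TABLE referencing_table DROP CONSTRAINT some_fk_name;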

Copying timestamp columns within a Postgres table

I have a table with about 30 million rows in a Postgres 9.4 db. This table has 6 columns, the primary key id, 2 text, one boolean, and two timestamp. There are indices on one of the text columns, and obviously the primary key.
I want to copy the values in the first timestamp column, call it timestamp_a into the second timestamp column, call it timestamp_b. To do this, I ran the following query:
UPDATE my_table SET timestamp_b = timestamp_a;
This worked, but it took an hour and 15 minutes to complete, which seems like a really long time to me considering that, as far as I know, it's just copying values from one column to the next.
I ran EXPLAIN on the query and nothing seemed particularly inefficient. I then used pgtune to modify my config file, most notably it increased the shared_buffers, work_mem, and maintenance_work_mem.
I re-ran the query and it took essentially the same amount of time, actually slightly longer (an hour and 20 mins).
What else can I do to improve the speed of this update? Is this just how long it takes to write 30 million timestamps into Postgres? For context, I'm running this on a MacBook Pro, OS X, quad-core, 16 GB of RAM.
The reason this is slow is that internally PostgreSQL doesn't update the field. It actually writes new rows with the new values. This usually takes a similar time to inserting that many values.
If there are indexes on any columns, this can further slow the update down, even if they're not on the columns being updated, because PostgreSQL has to write a new row and write new index entries pointing to that row. HOT updates can help and will do so automatically where possible, but that generally only helps if the table is subject to lots of small updates. It's also disabled if any of the fields being updated are indexed.
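If you want to check whether HOT updates are happening at all, one hedged way is to look at the cumulative statistics for the table (my_table is the name used in the question):
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'my_table';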
Since you're basically rewriting the table, if you don't mind locking out all concurrent users while you do it you can do it faster with:
BEGIN
DROP all indexes
UPDATE the table
CREATE all indexes again
COMMIT
PostgreSQL also has an optimisation for writes to tables that've just been TRUNCATEd, but to benefit from that you'd have to copy the data to a temp table, then TRUNCATE and copy it back. So there's no benefit.
@Craig mentioned an optimization for COPY after TRUNCATE: Postgres can skip WAL entries because if the transaction fails, nobody will ever have seen the new table anyway.
The same optimization is true for tables created with CREATE TABLE AS:
What causes large INSERT to slow down and disk usage to explode?
Details are missing in your description, but if you can afford to write a new table (no concurrent transactions get in the way, no dependencies), then the fastest way might be (except if you have big TOAST table entries - basically big columns):
BEGIN;
LOCK TABLE my_table IN SHARE MODE; -- only for concurrent access
SET LOCAL work_mem = '???? MB'; -- just for this transaction
CREATE TABLE my_table2 AS
SELECT ..., timestamp_a, timestamp_a AS timestamp_b
-- columns in order, timestamp_a overwrites timestamp_b
FROM my_table
ORDER BY ??; -- optionally cluster table while being at it.
DROP TABLE my_table;
ALTER TABLE my_table2 RENAME TO my_table;
ALTER TABLE my_table
  ADD CONSTRAINT my_table_id_pk PRIMARY KEY (id);
-- more constraints, indices, triggers?
-- recreate views etc. if any
COMMIT;
The additional benefit: a pristine (optionally clustered) table without bloat. Related:
Best way to populate a new column in a large table?

Better Import Performance on Large Data Sets

I have a few jobs that insert large data sets from a text file. The data is loaded via .NET's SqlBulkCopy.
Currently, I load all the data into a temp table and then insert it into the production table. This was an improvement over importing straight into production; the resulting T-SQL insert was a lot faster. Data is only loaded via this method; there are no other inserts or deletes.
However, I'm back to timeouts because of locks while the job is executing. The job consists of the following steps:
load data into temp table
start transaction
delete current and future dated rows
insert from temp table
commit
This happens once every hour. This portion takes 70 seconds. I need to get that to the smallest number possible.
The production table has about 20 million records and each import is about 70K rows. The table is not accessed at night, so I use this time to do all required maintenance (rebuild stats, indexes, etc.). Of the 70K rows added, ~4K are kept from day to day - that is, the table grows by about 4K a day.
I'm thinking a 2 part solution:
The job will turn into a copy/rename job. I insert all current data into the temp table, create stats & index, rename tables, drop old table.
Create a history table to break out older data. The "current" table would have a rolling 6 months data, about 990K records. This would make the delete/insert table smaller and [hopefully] more performant. I would prefer not to do this; the table is well designed with the perfect indexes; queries are plenty fast. But eventually it might be required.
Edit: Using Windows 2003, SQL Server 2008
Thoughts? Other suggestions?
Well, one really sneaky way is to rename the current table to TableA and set up a second table, TableB, with the same structure and the same data. Then set up a view with the same name and the exact fields as TableA. Now all your existing code will use the view instead of the current table. The view starts out looking at TableA.
In your load process, load into TableB, then refresh the view definition so that it looks at TableB. Your users are down for less than a second. Then load the same data into TableA and record, somewhere in a database table, which table you should start with next time. Next time, load first into TableA, change the view to point to TableA, then reload TableB.
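A minimal sketch of that view swap, assuming the production table was originally called Production (all names here are illustrative, and in practice you would list the columns explicitly, as the answer suggests):
EXEC sp_rename 'Production', 'TableA';
GO
-- TableB is a copy with the same structure and data
CREATE VIEW Production AS SELECT * FROM TableA;
GO
-- after loading the next batch into TableB:
ALTER VIEW Production AS SELECT * FROM TableB;
GO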
The answer should be that your queries that read from this table use READ UNCOMMITTED, since your data load is the only place that changes data. With READ UNCOMMITTED, the SELECT queries won't take locks.
http://msdn.microsoft.com/en-us/library/ms173763.aspx
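For example (table and column names are placeholders), either at the session level or per query with a table hint:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT SomeColumn FROM ProductionTable;

-- or, per query:
SELECT SomeColumn FROM ProductionTable WITH (NOLOCK);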
You should look into using partitioned tables. The basic idea is that you can load all of your new data into a new table and then switch that table into the original as a new partition. This is orders of magnitude faster than inserting into the existing table.
Later on, you can merge multiple partitions into a single larger partition.
More information here: http://msdn.microsoft.com/en-us/library/ms191160.aspx
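A hedged sketch of the switch step (the partition number and table names are illustrative; the staging table must match the target partition's structure, indexes, and check constraints):
ALTER TABLE ImportStaging SWITCH TO ProductionTable PARTITION 5;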
Get better hardware. Using 3 threads and batches of 35,000 items, I import around 90,000 items per second with this approach.
Sorry, but at some point hardware decides insert speed. Important: SSDs for the logs, mirrored ;)
Another trick you could use is to have a delta table for the updates. You'd have two tables with a view over them to merge them. One table, TableBig, will hold the old data; the second table, TableDelta, will hold deltas that you add to rows in TableBig.
You create a view over them that adds them up. A simple example:
For instance, your old data in TableBig (20M rows, lots of indexes etc.):
ID Col1 Col2
123 5 10
124 6 20
And you want to update 1 row and add a new one so that afterwards the table looks like this:
ID Col1 Col2
123 5 10 -- unchanged
124 8 30 -- this row is updated
125 9 60 -- this one added
Then in the TableDelta you insert these two rows:
ID Col1 Col2
124 2 10 -- these will be added to give the right number
125 9 60 -- this row is new
and the view is
select ID,
sum(col1) col1, -- the old value and delta added to give the correct value
sum(col2) col2
from (
select id, col1, col2 from TableBig
union all
select id, col1, col2 from TableDelta
) t -- the derived table needs an alias
group by ID
At night you can merge TableDelta into TableBig, rebuild indexes, etc.
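A hedged sketch of that nightly merge (names taken from the example above):
-- apply the deltas to existing rows
UPDATE b
SET b.Col1 = b.Col1 + d.Col1,
    b.Col2 = b.Col2 + d.Col2
FROM TableBig b
JOIN TableDelta d ON d.ID = b.ID;

-- add rows that are brand new
INSERT INTO TableBig (ID, Col1, Col2)
SELECT d.ID, d.Col1, d.Col2
FROM TableDelta d
WHERE NOT EXISTS (SELECT 1 FROM TableBig b WHERE b.ID = d.ID);

-- start the next day with an empty delta table
TRUNCATE TABLE TableDelta;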
This way you can leave the big table alone completely during the day, and TableDelta will not have many rows, so overall query performance is very good. Getting the data from TableBig benefits from indexing, getting rows from TableDelta is no issue because it's small, and summing them up is cheap compared to looking for data on disk. Pumping data into TableDelta is very cheap because you can just insert at the end.
Regards Gert-Jan
Update for text columns:
You could try something similar with two tables, but instead of adding, you would substitute. Like this:
Select isnull(o.ID, b.ID) ID,
       isnull(o.Col1, b.Col1) Col1,
       isnull(o.Col2, b.Col2) Col2
From TableBig b
full join TableOverWrite o on b.ID = o.ID
The basic idea is the same: a big table with indexes and a small table for updates that doesn't need them.
Your solution seems sound to me. Another alternative to try would be to create the final data in a temp table, all outside of a transaction, and then, inside the transaction, truncate the target table and load it from your temp table... something along those lines might be worth trying too.

how to select the newly added rows in a table efficiently?

I need to periodically update a local cache with new additions to some DB table. The table rows contain an auto-increment sequential number (SN) field. The cache keeps this number too, so basically I just need to fetch all rows with SN larger than the highest I already have.
SELECT * FROM table where SN > <max_cached_SN>
However, the majority of the attempts will bring no data (I just need to make sure that I have an absolutely up-to-date local copy). So I wonder if this will be more efficient:
count = SELECT count(*) from table;
if (count > <cache_size>)
// fetch new rows as above
I suppose that selecting by an indexed numeric field is quite efficient, so I wonder whether using count has any benefit. On the other hand, this test/update will be done quite frequently and by many clients, so there is a motivation to optimize it.
this test/update will be done quite frequently and by many clients
This could lead to unexpected race conditions in cache generation.
I would suggest:
Upon each new addition to your table, add the newest id to a queue table.
Use something like cron to trigger the cache generation by checking the queue table.
Once the new cache is generated, delete the id from the queue table.
As you stress that the majority of the attempts will bring no data, the above will only trigger cache generation when there is a new addition.
The queue-table concept can even be expanded to cover updates and deletes.
I believe that
SELECT * FROM table where SN > <max_cached_SN>
will be faster, because SELECT COUNT(*) may cause a table scan. Just to clarify: do you ever delete rows from this table?
SELECT COUNT(*) may involve a scan (even a full scan), while SELECT ... WHERE SN > constant can effectively use an index on SN, so looking at very few index nodes may suffice. Don't count items if you don't need the exact total; it's expensive.
You don't need to use SELECT COUNT(*).
There are two solutions:
You can use a helper table with one field that contains the last count of your table, and create an AFTER INSERT trigger on your table that increments that field.
Or you can use a helper table with one field that contains the last SN of your table, and create an AFTER INSERT trigger on your table that updates that field (see the sketch below).
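A minimal sketch of the second variant in MySQL; the table my_table with auto-increment column SN and the helper/trigger names are illustrative:
CREATE TABLE sn_cache (last_sn BIGINT NOT NULL);
INSERT INTO sn_cache VALUES (0);

CREATE TRIGGER trg_my_table_ai
AFTER INSERT ON my_table
FOR EACH ROW
  UPDATE sn_cache SET last_sn = NEW.SN;
Clients can then read last_sn with a cheap single-row SELECT instead of counting.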
not much to this really
drop table if exists foo;
create table foo
(
foo_id int unsigned not null auto_increment primary key
)
engine=innodb;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;