How to make DELETE faster in a fast-changing (DELSERT) table in MonetDB? - indexing

I am using MonetDB (MDB) for OLAP queries. I store the source data in PostgreSQL (PGSQL) and sync it to MonetDB in batches using a Python script.
In PGSQL there is a wide table with an ID (non-unique) and a few columns. Every few seconds the Python script takes a batch of 10k records that changed in PGSQL and uploads them to MDB.
The process of uploading to MDB is as follows (sketched below):
Create a staging table in MDB.
Use the COPY command to upload the 10k records into the staging table.
DELETE from the destination table all IDs that are in the staging table.
INSERT into the destination table all rows from the staging table.
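A minimal SQL sketch of one batch (the table names, column types, and CSV path here are illustrative, and the exact COPY options depend on the file format):
CREATE TABLE stage_batch (id INT, payload VARCHAR(200));           -- 1. staging table
COPY INTO stage_batch FROM '/tmp/batch.csv' USING DELIMITERS ',';  -- 2. bulk load the 10k changed rows
DELETE FROM destination WHERE id IN (SELECT id FROM stage_batch);  -- 3. remove every ID present in the batch
INSERT INTO destination SELECT * FROM stage_batch;                 -- 4. re-insert the fresh rows
DROP TABLE stage_batch;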
So it is basically a DELETE & INSERT. I cannot use a MERGE statement, because I do not have a PK - one ID can have multiple rows in the destination. So I need to do a delete and a full insert for all IDs currently being synced.
Now to the problem: the DELETE is slow.
When I do a DELETE on the destination table, deleting 10k records from a table of 25M rows, it takes about 500 ms.
However! If I first run a simple SELECT * FROM destination WHERE id = 1 and THEN do the DELETE, the DELETE takes 2 ms.
I think it has something to do with the automatic creation of auxiliary indices, but this is where my knowledge ends.
I tried to solve this by doing the lookup myself as a "pre-heat", and it works - but only for the first DELETE after the pre-heat.
Once I do a DELETE and INSERT, the next DELETE is slow again, and pre-heating before every DELETE makes no sense because the pre-heat itself takes 500 ms.
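To make the pre-heat pattern concrete, the sequence I tried looks roughly like this (stage_batch is the staging table from the sketch above; the timings are the ones I measured):
SELECT * FROM destination WHERE id = 1;                            -- pre-heat, ~500 ms
DELETE FROM destination WHERE id IN (SELECT id FROM stage_batch);  -- fast now, ~2 ms
INSERT INTO destination SELECT * FROM stage_batch;
DELETE FROM destination WHERE id IN (SELECT id FROM stage_batch);  -- the next batch's DELETE is slow again, ~500 ms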
Is there any way to sync data into MDB without invalidating the auxiliary indices that have already been built? Or to make the DELETE faster without the pre-heat? Or should I use some different technique to sync data into MDB without a PK (does MERGE have the same problem?)?
Thanks!

Related

BigQuery, concurrent MERGE with Insert and Update -> insert duplicate

I'm contributing to a Kafka connector that loads data into BigQuery.
It has a temporary table (my_tmp_tmp) and a destination table (detionation_tbl).
The data is loaded into detionation_tbl through a MERGE:
https://github.com/confluentinc/kafka-connect-bigquery/blob/d5f4eaeffa683ad8813a337cfeb66b5344e6dd91/kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/MergeQueries.java#L216
The MERGE statement uses (see the illustrative sketch below):
deduplication
both inserts and updates
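Schematically, the generated statement follows this pattern (an illustration of the pattern only, not the connector's actual SQL; the column names record_key, payload and batch_ts are made up):
MERGE detionation_tbl AS dst
USING (
  -- dedup: keep only the newest version of each key within the batch
  SELECT record_key, payload
  FROM (
    SELECT record_key, payload,
           ROW_NUMBER() OVER (PARTITION BY record_key ORDER BY batch_ts DESC) AS rn
    FROM my_tmp_tmp
  )
  WHERE rn = 1
) AS src
ON dst.record_key = src.record_key
WHEN MATCHED THEN
  UPDATE SET payload = src.payload
WHEN NOT MATCHED THEN
  INSERT (record_key, payload) VALUES (src.record_key, src.payload);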
However:
on the first load, all of the requests contain only inserts (nothing is in the table yet),
and if two merges are run at the same time (many workers with retries),
the same records might be inserted twice. According to the BigQuery documentation, "A MERGE DML statement does not conflict with other concurrently running DML statements as long as the statement only inserts rows and does not delete or update any existing rows. This can include MERGE statements with UPDATE or DELETE clauses, as long as those clauses aren't invoked when the query runs." (source). I also see this happening in practice, leading to duplicates.
Duplicates are not wanted, since otherwise the whole point of running a MERGE is lost (compared to a solution that runs plain INSERTs and dedups later).
Since it is a live dataset (being queried by users), duplicates will break the integrity of the dataset, so we need to find a solution at the sink/BigQuery level.
Is it possible to make the MERGE statement always conflict with the others so this doesn't happen? Any other solution?

Can we do bulk write and read from same Postgres Table?

We run a cron job every 15 minutes which gets some data from somewhere and inserts it into the main table (basically appends), which is then used for reading in production.
To avoid reading and writing on the same table, we use two tables (temp and main).
First we get the data and insert it into the temp table, then we swap the names of the temp and main tables (using a third name), so temp becomes main and main becomes temp, and then we make temp equivalent to main again (basically truncating temp and inserting everything from main), so in the end we have identical data in both tables. The swap is sketched below.
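Roughly, the swap looks like this (table names are just examples):
BEGIN;
ALTER TABLE main RENAME TO swap;   -- the third name, so the two renames don't collide
ALTER TABLE temp RENAME TO main;
ALTER TABLE swap RENAME TO temp;
COMMIT;
-- then make temp identical to main again
TRUNCATE temp;
INSERT INTO temp SELECT * FROM main;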
I am not sure if this is correct, but is there a better way of doing this?
We are not reading and writing on the same table because a bulk write can happen (because of the cron job), and at that time our read performance could be affected.

How to delete all data from tables in a SQL Server 2014 database, but keep all the tables?

As stated, I need help deleting all the data from every table in a test database. There are 3477 tables, and some of the tables were created by a past employee, so I was unable to script out the schema of the DB and recreate it empty.
Is there a fast way to delete all of the data but keep all of the tables and their structure? Also, I noticed that when deleting data from the DB with DELETE table_name, the data file wasn't decreasing in size. Any reason why? Then I tried to just delete the data file to see what would happen, and it erased everything, so I had to restore the test database. Now I'm back at square one...
Any help or guidance would be appreciated... I've read a lot, and everything just says use DELETE or TRUNCATE, but I'd rather not do that by hand for 3477 tables.
The TRUNCATE TABLE command deletes the data inside a table, but not the table itself.
You have a lot of tables (more than 3000...), so take a look at the following link on how to truncate all tables:
Truncate all tables in a SQL Server database
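For example, one (untested) sketch is to generate the statements from the system catalog and run the output; note that TRUNCATE TABLE fails on tables referenced by foreign keys, so those would need a DELETE or temporarily dropped constraints:
SELECT 'TRUNCATE TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ';'
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id;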

Merge, Partition and Remote Database - Performance Tuning Oracle

I want to tune my merge query, which inserts into and updates a table in Oracle based on a source table in SQL Server. The table size is around 120 million rows, and normally around 120k records are inserted/updated daily. The merge takes around 1.5 hours to run. It uses a nested loop and the primary key index to perform the insert and update.
There is no record-update date in the source table to use, so all records are compared.
MERGE INTO abc tgt
USING
(
  select a, b, c
  from sourcetable@sqlserver_remote
) src
ON (tgt.ref_id = src.ref_id)
WHEN MATCHED THEN
  UPDATE SET
    .......
  WHERE
    decode(tgt.a, src.a, 1, 0) = 0
    or ......
WHEN NOT MATCHED THEN
  INSERT (....) VALUES (.....);
commit;
Since the table is huge and growing every day, I partitioned the table in DEV based on ref_id (10 groups) and created a local index on ref_id.
Now it uses a hash join and a full table scan, and it runs longer than the existing process.
When I changed from a local to a global index (on ref_id), it uses nested loops but still takes longer to run than the existing process.
Is there a way to performance-tune this process?
Thanks...
I'd be wary of joining/merging huge tables over a database link. I'd try to copy over the complete source table (for instance with a non-atomic mview, possibly compressed, possibly sorted, and certainly only the columns you'll need). After gathering statistics, I'd merge the target table with the local copy. Afterwards, the local copy can be truncated.
I wouldn't be surprised if partitioning sped up the merge from the local copy into your target table.
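A rough sketch of that approach (the src_copy table and the concrete column list are illustrative; I'm reusing the dblink name from your example):
CREATE TABLE src_copy COMPRESS AS          -- COMPRESS is optional
  SELECT ref_id, a, b, c
  FROM sourcetable@sqlserver_remote;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'SRC_COPY');

MERGE INTO abc tgt
USING src_copy src
ON (tgt.ref_id = src.ref_id)
WHEN MATCHED THEN
  UPDATE SET tgt.a = src.a, tgt.b = src.b, tgt.c = src.c
  WHERE decode(tgt.a, src.a, 1, 0) = 0
     OR decode(tgt.b, src.b, 1, 0) = 0
     OR decode(tgt.c, src.c, 1, 0) = 0
WHEN NOT MATCHED THEN
  INSERT (ref_id, a, b, c) VALUES (src.ref_id, src.a, src.b, src.c);
COMMIT;

TRUNCATE TABLE src_copy;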

Automatically dropping PostgreSQL tables once per day

I have a scenario where I have a central server and a node. Both the server and the node are capable of running PostgreSQL, but the storage space on the node is limited. The node collects data at high speed and writes it to its local DB.
The server needs to replicate the data from the node. I plan on accomplishing this with Slony-I or Bucardo.
The node needs to be able to delete all records from its tables at a set interval in order to minimize the disk space used. Should I use pgAgent with a job consisting of a script like
DELETE FROM tablex, tabley, tablez;
where the actual batch file to run the script would be something like
@echo off
C:\Progra~1\PostgreSQL\9.1\bin\psql -d database -h localhost -p 5432 -U postgres -f C:\deleteFrom.sql
?
I'm just looking for opinions on whether this is the best way to accomplish this task, or if anyone knows of a more efficient way to pull data from a remote DB and then clear that remote DB to save space on the remote node. Thanks for your time.
The most efficient command for you is the TRUNCATE command.
With TRUNCATE you can chain up tables, as in your example:
TRUNCATE tablex, tabley, tablez;
Here's the description from the postgres docs:
TRUNCATE quickly removes all rows from a set of tables. It has the same effect as an unqualified DELETE on each table, but since it does not actually scan the tables it is faster. Furthermore, it reclaims disk space immediately, rather than requiring a subsequent VACUUM operation. This is most useful on large tables.
You may also add CASCADE as a parameter:
CASCADE - Automatically truncate all tables that have foreign-key references to any of the named tables, or to any tables added to the group due to CASCADE.
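For example, if some other table happened to reference tablex via a foreign key (hypothetical here), you could write:
TRUNCATE tablex, tabley, tablez CASCADE;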
The two best options, depending on your exact needs and workflow, would be TRUNCATE, as @Bohemian suggested, or to create a new table, rename it into place, and then drop the old one.
We use something much like the latter create/rename/drop method in one of our major projects. It has an advantage when you need to be able to delete some data, but not all of the data, from a table very quickly. The basic workflow is:
Create a new table with a schema identical to the old one:
CREATE TABLE new_table (LIKE "table" INCLUDING ALL);
In a transaction, rename the old and new tables simultaneously:
BEGIN;
ALTER TABLE "table" RENAME TO old_table;
ALTER TABLE new_table RENAME TO "table";
COMMIT;
[Optional] Now you can do stuff with the old table, while the new table is happily accepting new inserts. You can dump the data to your centralized server, run queries on it, or whatever.
Delete the old table
DROP TABLE old_table;
This is an especially useful strategy when you want to keep, say, seven days of data around and only discard the eighth day's data all at once. Doing a DELETE in that case can be very slow. By storing the data in partitions (one for each day), it is easy to drop an entire day's data at once; a rough sketch follows.
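On PostgreSQL 9.1, as in the question, that kind of day-based layout would be built with table inheritance and CHECK constraints (newer versions have declarative partitioning); an illustrative sketch with made-up names:
CREATE TABLE samples (collected_at timestamp NOT NULL, payload text);
CREATE TABLE samples_day_07 (
    CHECK (collected_at >= '2012-01-07' AND collected_at < '2012-01-08')
) INHERITS (samples);
-- ...one child table per day, with inserts routed to the current child...
-- discarding an entire day is then a cheap metadata operation:
DROP TABLE samples_day_01;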