Truncate and Insert in ClickHouse Database - sql

I have a particular scenario where I need to truncate and batch insert into a Table in ClickHouse DBMS for every 30 minutes or so. I could find no reference of truncate option in ClickHouse.
However, I could find suggestions that we can indirectly achieve this by dropping the old table, creating a new table with same name and inserting data into it.
With respect to that, I have a few questions.
How is this achieved ? What is the sequence of steps in this process ?
What happens to other queries such as Select during the time when the table is being dropped and recreated ?
How long does it usually take for a table to be dropped and recreated in ClickHouse ?
Is there a better and clean way this can be achieved ?

How is this achieved ? What is the sequence of steps in this process ?
TRUNCATE is supported. There is no need to drop and recreate the table now.
What happens to other queries such as Select during the time when the table is being dropped and recreated ?
That depends on which table engine you use. For merge-tree family you get a snapshot-like behavior for SELECT.
How long does it usually take for a table to be dropped and recreated in ClickHouse ?
I would assume it relies on how fast the underlying file system can handle file deletions. For large tables it might contain millions of data part files which leads to slow truncation. However in your case I wouldn't worry much.
Is there a better and clean way this can be achieved ?
I suggest using partitons with a (DateTime / 60) column (per minute) along with a user script that constantly do partition harvest for out of date partitions.

Related

Automated way of deleting millions of rows from Postgres tables

Postgres Version: PostgreSQL 10.9 (Ubuntu 10.9-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
Before I ask my question, I would like to explain why I'm looking into this. we have a history table which has more than 5 million rows and growing every hour.
As the table length grows the select queries are becoming slower, even though we have a proper index. So ideally the first choice for us to delete the old records which are unused.
Approach #1
We tried deleting the records from the table using simple delete from table_name where created_date > certain_date where is_active = false
This took a very long time.
Approach #2
Create a script which would delete the rows with the cursor-based approach.
This also takes a very long time.
Approach #3
Created a new unlogged table.
Create an index on the new_table.
Copy the contents from the old table to a new table
Then set table is logged.
Rename the master table as a backup.
Issues with this approach, it requires some downtime.
On live productions instances, this would result in missing data / resulting in failures
Approach #4
On further investigation, the performant way to delete unused rows is if we create a table with partition https://www.postgresql.org/docs/10/ddl-partitioning.html - Which we could drop the entire partition immediately.
Questions with the above approach are
How can I create a partition on the existing table?
Will that require downtime?
How can we configure Postgres to create partition automatically, we can't really create partitions manually everyday right?
Any other approaches are also welcome, the thing is I really want this to be automated than manual because I would extend this to multiple tables.
Please let me know your thoughts, which would very helpful
I would go for approach 4, table partitioning.
Create partitions
New data goes directly to the correct partition
Move old data (manually / scripted) to the correct partition
Set a cron job to create partitions for the next X days, if they don't exists already
No downtime needed
We tried deleting the records from the table using simple delete from table_name where created_date > certain_date where is_active = false This took a very long time.
Surely you mean <, not > there? So what if it took a long time? How fast do you need it to be? Did it cause problems? How much data were you trying to delete at once?
5 millions rows is pretty small. I probably wouldn't use partitioning in the first place on something that size.
There is no easy and transparent way to migrate to having partitioned data if you don't already have it. The easiest way is to have a bit of downtime as you populate the new table.
If you do want to partition, your partitioning scheme would have to include is_active, not just created_date. And by day seems way too fine, you could do it by month and pre-create a few years worth up front.
Answering specifically to the below:
How can we configure Postgres to create partition automatically, we can't really create partitions manually everyday right?
As you are using Postgres 10, you could use https://github.com/pgpartman/pg_partman for auto management of partitions.

The best way to Update the database table through a pyspark job

I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing
table, and compare with the new data that comes in
IMHO entire data compare to load new data is not performant.
Option 1:
Instead you can create google-bigquery partition table and create a partition column to load the data and also while loading new data you can check whether the new data has same partition column.
Hitting partition level data in hive or bigquery is more useful/efficient than selecting entire data and comparing in spark.
Same is applicable for hive as well.
see this Creating partitioned tables
or
Creating and using integer range partitioned tables
Option 2:
Another alternative is with GOOGLE bigquery we have merge statement, if your requirement is to merge the data with out comparision, then you can go ahead with MERGE statement .. see doc link below
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, We can get performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement to update changes in the target table.
There are many ways this problem can be solved, one of the less expensive, performant and scalable way is to use a datastore on the file system to determine true new data.
As data comes in for the 1st time write it to 2 places - database and to a file (say in s3). If data is already on the database then you need to initialize the local/s3 file with table data.
As data comes in 2nd time onwards, check if it is new based its presence on local/s3 file.
Mark delta data as new or updated. Export this to database as insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won’t be coming. Regularly truncate this file to keep data within that time range.
You can also bucket and partition this data. You can use deltalake to maintain it too.
One downside is that whenever database is updated this file may need to be updated based on relevant data is being Changed or not. You can maintain a marker on the database table to signify sync date. Index that column too. Read changed records based on this column and update the file/deltalake.
This way your sparl app will be less dependent on a database. The database operations are not very scalable so keeping them away from critical path is better
Shouldnt you have a last update time in you DB? The approach you are using doesnt sound scalable so if you had a way to set update time to each row in the table it will solve the problem.

avoiding write conflicts while re-sorting a table

I have a large table that I need to re-sort periodically. I am partly basing this on a suggestion I was given to stay away from using cluster keys since I am inserting data ordered differently (by time) from how I need it clustered (by ID), and that can cause re-clustering to get a little out of control.
Since I am writing to the table on a hourly I am wary of causing problems with these two processes conflicting: If I CTAS to a newly sorted temp table and then swap the table name it seems like I am opening the door to have a write on the source table not make it to the temp table.
I figure I can trigger a flag when I am re-sorting that causes the ETL to pause writing, but that seems a bit hacky and maybe fragile.
I was considering leveraging locking and transactions, but this doesn't seem to be the right use case for this since I don't think I'd be locking the table I am copying from while I write to a new table. Any advice on how to approach this?
I've asked some clarifying questions in the comments regarding the clustering that you are avoiding, but in regards to your sort, have you considered creating a nice 4XL warehouse and leveraging the INSERT OVERWRITE option back into itself? It'd look something like:
INSERT OVERWRITE INTO table SELECT * FROM table ORDER BY id;
Assuming that your table isn't hundreds of TB in size, this will complete rather quickly (inside an hour, I would guess), and any inserts into the table during that period will queue up and wait for it to finish.
There are some reasons to avoid the automatic reclustering, but they're basically all the same reasons why you shouldn't set up a job to re-cluster frequently. You're making the database do all the same work, but without the built in management of it.
If your table is big enough that you are seeing performance issues with the clustering by time, and you know that the ID column is the main way that this table is filtered (in JOINs and WHERE clauses) then this is probably a good candidate for automatic clustering.
So I would recommend at least testing out a cluster key on the ID and then monitoring/comparing performance.
To give a brief answer to the question about resorting without conflicts as written:
I might recommend using a time column to re-sort records older than a given time (probably in a separate table). While it's sorting, you may get some new records. But you will be able to use that time column to marry up those new records with the, now sorted, older records.
You might consider revoking INSERT, UPDATE, DELETE privileges on the original table within the same script or procedure that performs the CTAS creating the newly sorted copy of the table. After a successful swap you can re-enable the privileges for the roles that are used to perform updates.

No waiting while Truncate Table

I have an SSIS package that runs repeatedly after 1 hour. This package first truncates a table and then populate that table with new data. And this process takes 15-20 minutes. When this package runs, data is not available to the users. So they have to wait until package runs completely. Is there any way to handle this situation so users don't have to wait?
Do not truncate the table. Instead, add a audit column with date data type, partition the table with hourly partitions on this audit column, drop the old partition once the new partition is loaded with new data.
Make sure the users query are directed to the proper partition with the help of the audit column
You can do an 'A-B flip'.
Instead of truncating the client-facing table and reloading it, you could use two tables to do the job.
For example, if the table in question is called ACCOUNT:
Load the data to a table called STG_ACCOUNT
Rename ACCOUNT to ACCOUNT_OLD
Rename STG_ACCOUNT to ACCOUNT
Rename ACCOUNT_OLD to STG_ACCOUNT
By doing this, you minimize the amount of time the users have an empty table.
It's very dangerous practice but you can change isolation levels of your transactions (I mean users queries) from ReadCommitted/Serializable to ReadUncommitted. But the behavior of this queries is very hard to predict. If your table is under using of SSIS package (insert/delete/update) and end users do some uncommitted reads (like SELECT * FROM Table1 WITH (NOLOCK) ), some rows can be counted several times or missed.
If users want to read only 'new-hour-data' you can try to change isolation levels to 'dirty read', but be careful!
If they can work with data from previous hour, the best solution is described by Arnab, but partitions are available only in Enterprise edition. Use rename in another SQL Server editions as Zak said.
[Updated] If the main lag (tens of minutes, as you said) is caused by complex calculations (and NOT because of amount of loaded rows!), you can use another table like a buffer. Store there several rows (hundreds, thousands etc.) and then reload them to the main table. So new data will be available in portions without 'dirty read' tricks.

How to quickly delete a column from a table containing 600 million rows?

I have a large table upon which I wish to run this query:
ALTER TABLE Employee
DROP COLUMN Zipcode
Zipcode data type is varchar.
This approach is not suitable as it's taking a long time and increasing the log space, and it goes on for hours. Can anyone guide me a unique and more sophisticated approach which will not affect the performance of database?
Did you actually test this? Dropping a column is instant. You don't have a problem to solve.
Ok, so it looks like you have a blocking issue. You have not yet followed up on my comment regarding this. You should.
If there's a transaction running that is using the table you want to alter then the alter must wait. Find out who is blocking access to that table. Run sp_who2 and look at the blocked-by column.
Truth be told I don't think you have any choice: ALTER TABLE statements like dropping a column can take a very long time on very large tables so I think you'll have to identify a windows of opportunity when you can get some decent down-time and go for it then. You might consider changing your database log model from FULL to SIMPLE or BULK LOGGED for the duration to help reduce log writes and possibly some of the time taken, but that definitely requires a time when you're not worrying about live transactional traffic.