Can I copy data table folders in QuestDB to another instance? - backup

I am running QuestDB on a production server that constantly writes data to a table, 24x7. The table is partitioned by day.
I want to copy the data to another instance and update it there incrementally, since data from old days never changes. Sometimes the copy works, but sometimes the data gets corrupted, reading from the second instance fails, and I have to retry copying all the table data, which is huge and takes a lot of time.
Is there a way to back up / restore QuestDB without interrupting continuous data ingestion?

QuestDB appends data in the following sequence:
1. Append to column files inside the partition directory
2. Append to symbol files inside the root table directory
3. Mark the transaction as committed in the _txn file
There is no ordering between 1 and 2, but 3 always happens last. To incrementally copy data to another box, you should copy in the opposite order:
1. Copy the _txn file first
2. Copy the root symbol files
3. Copy the partition directories
Do this while your second QuestDB server is down; then, on start, the table should have data up to the point when you started copying the _txn file.
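As a minimal sketch of that reverse-order copy, assuming the table is called trades, QuestDB's data root is /var/lib/questdb/db on both machines, and rsync over SSH is available (table name and paths are assumptions, adjust to your setup):

#!/usr/bin/env bash
# Hypothetical paths: a table named "trades" under an assumed QuestDB data root.
SRC=/var/lib/questdb/db/trades
DST=user@replica:/var/lib/questdb/db/trades

# 1. Copy the _txn file first, so the replica never sees a committed row count
#    newer than the column data copied in the next two steps.
rsync -a "$SRC/_txn" "$DST/"

# 2. Copy the files in the table root directory (symbol files, metadata),
#    excluding partition subdirectories and the already-copied _txn.
rsync -a --exclude='*/' --exclude='_txn' "$SRC/" "$DST/"

# 3. Copy the partition directories, still excluding _txn so a newer commit
#    written by the live primary does not overwrite the snapshot from step 1.
rsync -a --exclude='_txn' "$SRC/" "$DST/"

Because rsync only transfers changed files, repeating this periodically copies just the recent partitions, which fits the "old days never change" pattern in the question.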

Related

loading data from external stage - only truncate + load when there's a new file

I'm loading data from a named external stage (S3) using COPY INTO, and this S3 bucket keeps all the old files.
Here's what I want:
When a new file comes in, truncate the table and load only the new file; if no new file comes in, just keep the old data without truncating.
I understand that I can set an option like FORCE = FALSE to avoid loading old files again, but how do I truncate the table only when a new file comes in?
I would likely do this a bit differently, since there isn't a way to truncate/delete records in the target table from the COPY command. This will be a multi-step process, but it can be automated via Snowflake:
1. Create a transient table. For the sake of description, I'll just call this STG_TABLE. You will also maintain your existing target table, called TABLE.
2. Modify your COPY command to load into STG_TABLE.
3. Create a STREAM called STR_STG_TABLE over STG_TABLE.
4. Create a TASK called TSK_TABLE with the following statement.
This statement will execute only if your COPY command actually loaded any new data:
CREATE OR REPLACE TASK TSK_TABLE
WAREHOUSE = warehouse_name
WHEN SYSTEM$STREAM_HAS_DATA('STR_STG_TABLE')
AS
INSERT OVERWRITE INTO TABLE (fields)
SELECT fields FROM STR_STG_TABLE;
The other benefit of using this method is that your transient table will have the full history of your files, which can be nice for debugging issues.

Best way to duplicate a ClickHouse replica?

I want to create another replica from an existing one by copying it.
I made a snapshot in AWS and created a new server, so all my data has a copy on the new server.
I fixed the replica macro in the config.
When I start the server, it throws "No node" in the error log for the first table it finds and stalls, repeating the same error once in a while.
<Error>: Application: Coordination::Exception: No node, path: /clickhouse/tables/0/test_pageviews/replicas/replica3/metadata: Cannot attach table `default`.`pageviews2` from metadata file . . .
I suspect this is because the node for this replica does not exist in Zookeeper (obviously, it was not created, because I did not run the CREATE TABLE for this replica as it is just a duplicate of another replica).
What is the correct way to do a duplication of a replica?
I mean, I would like to avoid copying the data and have the replica pull only the data that was added after the moment the snapshot was created.
I mean, I would like to avoid copying the data,
It's not possible.

How do I force Hive to always create a consistent filename like 000000_0?

I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times I notice that it creates files with other names like 000003_0, etc. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent filename like 000000_0? Below is an example of what my code looks like, where tab_content is a Hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
Better not to do this; instead, modify your program to accept any number of files in the directory.
Each reducer (or mapper, if the job is map-only) creates its own file. The reducers know nothing about each other; they are named during creation. Files are named 000001_0, 000002_0, etc. But a file can also be 000001_1 if attempt number 0 failed and attempt number 1 succeeded. Also, if the table is partitioned and there is no distribute by partition key at the end, each reducer will create its own file in each partition.
You can force the query to run on a single final reducer (for example, by adding an order by clause or by setting set mapred.reduce.tasks = 1;). But bear in mind that this solution is not scalable, because too much data will cause performance problems on a single reducer. Also, what happens if attempt 0 fails, it is restarted, and attempt 1 succeeds? It will create 000001_1 instead of 000001_0.

How to split a big SQL dump file into small chunks and keep each record in its original file despite later deletions of other records

Here's what I want to do (MySQL example):
1. dump only the structure - structure.sql
2. dump all table data - data.sql
3. split data.sql and put each table's data into separate files - table1.sql, table2.sql, table3.sql ... tablen.sql
4. split each table into smaller files (1k lines per file)
5. commit all files in my local git repository
6. copy the whole directory out to a remote secure server
I have a problem with step #4.
For instance, I split table1.sql into 3 files: table1_a.sql, table1_b.sql and table1_c.sql.
If a new dump has new records, that is fine - they are just added to table1_b.sql.
But if records that were in table1_a.sql are deleted, all following records shift, and git treats table1_b.sql and table1_c.sql as changed, and that is not OK.
Basically, it destroys the whole idea of keeping an SQL backup in SCM.
My question: how do I split a big SQL dump file into small chunks and keep each record in its original file despite later deletions of other records?
To split SQL dumps into files of 5000 lines each, execute in your terminal:
$ split -l 5000 hit_2017-09-28_20-07-25.sql dbpart-
Don't split them at all. Or split them by ranges of PK values (see the sketch below). Or split them right down to 1 database row per file (and name the file after the table name plus the content of the primary key).
(That is apart from the even more obvious XY-problem answer, which was my instinctive reaction.)
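A rough sketch of the "split by ranges of PK values" idea, assuming a MySQL database mydb with a table table1 keyed by an integer column id (database, table, column and chunk size are all hypothetical):

#!/usr/bin/env bash
# Dump table1 in fixed PK ranges, one file per range, so a deleted row only
# changes the single file that owns its id range.
CHUNK=1000
MAX_ID=$(mysql -N -e "SELECT COALESCE(MAX(id), 0) FROM mydb.table1")

for ((start = 0; start <= MAX_ID; start += CHUNK)); do
  end=$((start + CHUNK - 1))
  mysqldump --no-create-info --skip-extended-insert --skip-dump-date \
    --where="id BETWEEN $start AND $end" \
    mydb table1 > "table1_${start}_${end}.sql"
done

--skip-extended-insert writes one INSERT per row and --skip-dump-date drops the changing timestamp comment, both of which help keep git diffs limited to rows that actually changed.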

SSIS package which runs biweekly but there is no reverse out plan if it fails

Step 1 - This is the X job that creates the (b) job.dat file
Step 2 - This is an SSIS package that splits the output .dat file into 4 different files to send to the destination
Step 3 - Moves the four files from the work area to another location where MOVEIT can pick them up
***Step two is not restartable
***There is no reversing out if any of the steps fails
Note: what if I add an exception handler, or should I add a conditional split... any other ideas?
Batch Persistence
One thing you can do for starters is to append the file names with a timestamp that includes the date and time of the last record processed (if timestamps do not apply, you can use an incrementing primary key value). The batch identifiers could also be stored in a database. If your SSIS package can smartly name the files in chronological sequence, then the third step can safely ignore files that it has already processed (a rough sketch of the idea follows below). Actually, you could do that at each step. This would give you the ability to start the whole process from scratch, if you must do it that way.
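Not SSIS itself, but a small shell sketch of that naming/skip idea, assuming files are named like job_20240131235959.dat (timestamp of the last record they contain) and a watermark file remembers the newest batch already handled; all names and paths are hypothetical:

#!/usr/bin/env bash
# Process only batches newer than the recorded watermark, then advance it.
WORKAREA=/data/workarea
WATERMARK=/data/state/last_processed     # holds e.g. 20240131235959

last=$(cat "$WATERMARK" 2>/dev/null || echo 0)
for f in "$WORKAREA"/job_*.dat; do
  [[ -e "$f" ]] || continue               # nothing matched the glob
  stamp=${f##*_}; stamp=${stamp%.dat}     # job_20240131235959.dat -> 20240131235959
  if [[ "$stamp" > "$last" ]]; then       # lexical compare works for zero-padded timestamps
    echo "processing $f"                  # the split/move/send step would go here
    last=$stamp
  fi
done
echo "$last" > "$WATERMARK"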
Ignorant and Hassle Free Dumping
Another suggestion would be to dump all the data each day. If the files do not get super large, then just dump all the data for whatever it is you are dumping. This way each step would not have to maintain state, and the process could start/stop at any time.