Incrementally importing data to a PostgreSQL database


Situation:
I have a PostgreSQL-database that is logging data from sensors in a field-deployed unit (let's call this the source database). The unit has a very limited hard-disk space, meaning that if left untouched, the data-logging will cause the disk where the database is residing to fill up within a week. I have a (very limited) network link to the database (so I want to compress the dump-file), and on the other side of said link I have another PostgreSQL database (let's call that the destination database) that has a lot of free space (let's just, for argument's sake, say that the source is very limited with regard to space, and the destination is unlimited with regard to space).
I need to take incremental backups of the source database, append the rows that have been added since last backup to the destination database, and then clean out the added rows from the source database.
Now, the source database might or might not have been cleaned since the last backup was taken, so the destination database needs to be able to import only the new rows in an automated (scripted) process. However, pg_restore fails miserably when trying to restore from a dump that contains the same primary key values as the destination database.
So the question is:
What is the best way to restore only the rows from a source that are not already in the destination database?
The only solution that I've come up with so far is to pg_dump the database and restore the dump to a new secondary-database on the destination-side with pg_restore, then use simple sql to sort out which rows already exist in my main-destination database. But it seems like there should be a better way...
(extra question: Am I completely wrong in using PostgreSQL in such an application? I'm open to suggestions for other data-collection alternatives...)

A good way to start would probably be to use the --inserts option to pg_dump. From the documentation (emphasis mine) :
Dump data as INSERT commands (rather than COPY). This will make
restoration very slow; it is mainly useful for making dumps that can
be loaded into non-PostgreSQL databases. However, since this option
generates a separate command for each row, an error in reloading a row
causes only that row to be lost rather than the entire table contents.
Note that the restore might fail altogether if you have rearranged
column order. The --column-inserts option is safe against column order
changes, though even slower.
I don't have the means to test it right now with pg_restore, but this might be enough for your case.
You could also use the fact that, as of version 9.5, PostgreSQL provides ON CONFLICT DO ... for INSERTs. Use a simple scripting language to add this clause to the INSERT statements in the dump and you should be fine. Unfortunately, I haven't found an option for pg_dump to add it automatically.
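To illustrate the "simple scripting language" idea, here is a minimal Python sketch that appends ON CONFLICT DO NOTHING to the INSERT statements of a dump made with --inserts. The function name add_on_conflict is made up for this example, and it assumes that every INSERT statement sits on a single line (so no embedded newlines in the values); treat it as a starting point rather than a robust SQL rewriter. As far as I know, newer pg_dump releases (PostgreSQL 12 and later) also accept an --on-conflict-do-nothing option together with --inserts, which would make this step unnecessary; check the documentation of your version.

```python
import re

# Matches one single-line INSERT statement as written by pg_dump --inserts.
# Assumption: each statement fits on one line (no newlines inside values).
_INSERT_LINE = re.compile(r"^(INSERT INTO .*\))\s*;\s*$")


def add_on_conflict(dump_sql: str) -> str:
    """Append ON CONFLICT DO NOTHING to every single-line INSERT statement."""
    out = []
    for line in dump_sql.splitlines():
        match = _INSERT_LINE.match(line)
        if match:
            # Keep the statement text and re-terminate it with the clause.
            line = match.group(1) + " ON CONFLICT DO NOTHING;"
        out.append(line)
    return "\n".join(out) + "\n"
```

Lines that are not INSERT statements pass through unchanged, and a line that already ends with the clause no longer matches the pattern, so running the function twice does no harm.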

You might google "sporadically connected database synchronization" to see related solutions.
It's not a neatly solved problem as far as I know - there are some common work-arounds, but I am not aware of a database-centric out-of-the-box solution.
The most common way of dealing with this is to use a message bus to move events between your machines. For instance, if your "source database" is just a data store, with no other logic, you might get rid of it, and use a message bus to say "event x has occurred", and point the endpoint of that message bus at your "destination machine", which then writes that to your database.
You might consider Apache ActiveMQ or read "Enterprise Integration Patterns".

#!/bin/sh
PSQL=/opt/postgres-9.5/bin/psql
TARGET_HOST=localhost
TARGET_DB=mystuff
TARGET_SCHEMA_IMPORT=copied
TARGET_SCHEMA_FINAL=final
SOURCE_HOST=192.168.0.101
SOURCE_DB=slurpert
SOURCE_SCHEMA=public
########
create_local_stuff()
{
${PSQL} -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG0
CREATE SCHEMA IF NOT EXISTS ${TARGET_SCHEMA_IMPORT};
CREATE SCHEMA IF NOT EXISTS ${TARGET_SCHEMA_FINAL};
CREATE TABLE IF NOT EXISTS ${TARGET_SCHEMA_FINAL}.topic
( topic_id INTEGER NOT NULL PRIMARY KEY
, topic_date TIMESTAMP WITH TIME ZONE
, topic_body text
);
CREATE TABLE IF NOT EXISTS ${TARGET_SCHEMA_IMPORT}.tmp_topic
( topic_id INTEGER NOT NULL PRIMARY KEY
, topic_date TIMESTAMP WITH TIME ZONE
, topic_body text
);
OMG0
}
########
find_highest()
{
${PSQL} -q -t -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG1
SELECT MAX(topic_id) FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic;
OMG1
}
########
fetch_new_data()
{
watermark=${1-0}
echo ${watermark}
${PSQL} -h ${SOURCE_HOST} -U postgres ${SOURCE_DB} <<OMG2
\COPY (SELECT topic_id, topic_date, topic_body FROM ${SOURCE_SCHEMA}.topic WHERE topic_id >${watermark}) TO '/tmp/topic.dat';
OMG2
}
########
insert_new_data()
{
${PSQL} -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG3
DELETE FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic WHERE 1=1;
COPY ${TARGET_SCHEMA_IMPORT}.tmp_topic(topic_id, topic_date, topic_body) FROM '/tmp/topic.dat';
INSERT INTO ${TARGET_SCHEMA_FINAL}.topic(topic_id, topic_date, topic_body)
SELECT topic_id, topic_date, topic_body
FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic src
WHERE NOT EXISTS (
SELECT *
FROM ${TARGET_SCHEMA_FINAL}.topic nx
WHERE nx.topic_id = src.topic_id
);
OMG3
}
########
delete_below_watermark()
{
watermark=${1-0}
echo ${watermark}
${PSQL} -h ${SOURCE_HOST} -U postgres ${SOURCE_DB} <<OMG4
-- delete not yet activated; COUNT(*) instead
-- DELETE
SELECT COUNT(*)
FROM ${SOURCE_SCHEMA}.topic WHERE topic_id <= ${watermark}
;
OMG4
}
######## Main
#create_local_stuff
watermark="`find_highest`"
echo 'Highest:' ${watermark}
fetch_new_data ${watermark}
insert_new_data
echo 'Delete below:' ${watermark}
delete_below_watermark ${watermark}
# Eof
This is just an example. Some notes:
I assume a non-decreasing serial PK for the table; in most cases it could also be a timestamp.
For simplicity, all the queries are run as user postgres; you might need to change this.
The watermark method ensures that only new records are transmitted, minimising bandwidth usage.
The method is crash-safe: if the script dies midway, nothing is lost and it can simply be run again.
Only one table is fetched here, but you could add more.
Because I'm paranoid, I use a different name for the staging table and put it into a separate schema.
The whole script runs two queries on the remote machine (one for the fetch, one for the delete); you could combine these. Only one script, executed from the local (= target) machine, is involved.
The DELETE is not yet active; it only does a COUNT(*).
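The watermark logic of the script above can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for the two PostgreSQL servers, the function name sync_once is made up, the topic_date column is left out for brevity, and it reads the watermark from the final table rather than from the staging table, which is a variation of mine and not what the shell script does.

```python
import sqlite3  # stand-in only; the real setup would use two PostgreSQL connections


def sync_once(source, target):
    """Copy rows above the target's watermark, then prune the source."""
    # Watermark: highest id already stored on the target (0 if it is empty).
    watermark = target.execute(
        "SELECT COALESCE(MAX(topic_id), 0) FROM topic").fetchone()[0]
    rows = source.execute(
        "SELECT topic_id, topic_body FROM topic WHERE topic_id > ?",
        (watermark,)).fetchall()
    with target:  # one transaction: staging load plus anti-join insert
        target.execute("DELETE FROM tmp_topic")
        target.executemany("INSERT INTO tmp_topic VALUES (?, ?)", rows)
        target.execute("""
            INSERT INTO topic (topic_id, topic_body)
            SELECT topic_id, topic_body FROM tmp_topic src
            WHERE NOT EXISTS (
                SELECT 1 FROM topic nx WHERE nx.topic_id = src.topic_id)""")
    # Delete on the source only what the target is now known to hold.
    new_mark = target.execute(
        "SELECT COALESCE(MAX(topic_id), 0) FROM topic").fetchone()[0]
    with source:
        source.execute("DELETE FROM topic WHERE topic_id <= ?", (new_mark,))
    return len(rows)
```

Because the delete uses a watermark that is read back from the target after the insert has been committed, a crash between the two steps leaves rows on the source that are simply skipped by the NOT EXISTS filter on the next run. Like the script, this assumes that ids only ever increase.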

Related

Importing MySQL tables from other database in live site with mysqldump can cause trouble?

Scenario: I want to replicate MySQL tables from one database to other database.
Probably the best solution: use the MySQL replication feature.
Current solution: I'm using mysqldump as a workaround, because I can't spend time learning about replication before the current deadline.
So currently I'm using command like this:
mysqldump -u user1 -ppassword1 --single-transaction SourceDb TblName | mysql -u user2 -ppassword2 DestinationDB
Based on some tests, it seems to be working fine.
While running the above command, I ran the ab command with 1000 requests against the destination site and also tried accessing the site from a browser.
My concern is the live destination site: we import the whole table, which internally drops the existing table and creates a new one with the new data.
Can I be sure that the live site won't break during this process, or is there a risk?
If there is a risk, can it be resolved?

As you already acknowledged, replication is the best solution here, and I agree.
You said you have 1000 requests on the "Destination" side? Are these 1000 connections to Destination read-only?
Of course, dropping and recreating the table isn't the right choice when there are active connections.
I can suggest one improvement: instead of loading directly into the table, load into a different database and swap the tables. This should be quicker as far as connections to the Destination database/tables are concerned.
Create the new table in a different database:
mysqldump -u user1 -ppassword1 --single-transaction -hSOURCE_HOST SourceDb TblName | mysql -uuser2 -ppassword2 -hDESTINATION_HOST DB_New
(Are you sure you don't have "-h " here?)
Swap the tables:
rename table DestinationDB.TblName to DestinationDB.old_TblName, DB_New.TblName to DestinationDB.TblName;
If you're on the same host (which I don't think you are), you might want to use pt-online-schema-change and swap tables!

Exporting from one schema and importing to another with pg_dump

I have a table called units, which exists in two separate schemas within the same database (we'll call them old_schema and new_schema). The structure of the table is identical in both schemas. The only difference is that the units table in new_schema is presently empty.
I am attempting to export the data from this table in old_schema and import it into new_schema. I used pg_dump to handle the export, like so:
pg_dump -U username -p 5432 my_database -t old_schema.units -a > units.sql
I then attempted to import it using the following:
psql -U username -p 5432 my_database -f units.sql
Unfortunately, this appeared to try to insert the data back into old_schema. Looking at the generated SQL file, there is a line which I think is causing this:
SET search_path = mysql_migration, pg_catalog;
I can, in fact, alter this line to read
SET search_path = public;
And this does prove successful, but I don't believe this is the "correct" way to accomplish this.
Question: When importing data via a script generated by pg_dump, how can I specify into which schema the data should go without altering the generated file?

There are two main issues here based on the scenario you described.
The difference in the schemas, to which you alluded.
The fact that by dumping the whole table via pg_dump, you're dumping the table definition also, which will cause issues if the table is already present in the destination schema.
To dump only the data, if the table already exists in the destination database (which appears to be the case based on your scenario above), you can dump the table using pg_dump with the --data-only flag.
Then, to address the schema issue, I would recommend doing a search/replace (sed would be a quick way to do it) on the output sql file, replacing old_schema with new_schema.
That way, it will apply the data (which is all that would be in the file, not the table definition itself) to the table in new_schema.
If you need a solution on a broader level to support, say, dynamically named schemas, you can use the same search/replace trick with sed, but instead of replacing it with new_schema, replace it with some placeholder text, say, $$placeholder_schema$$ (something highly unlikely to appear as a token elsewhere in the file). Then, when you need to apply that file to a particular schema, use the original file as a template, copy it, and modify the copy using sed or similar, replacing the placeholder token with the desired on-the-fly schema name.
You can set some psql variables on the command line, such as --set AUTOCOMMIT=off; however, a similar approach with SEARCH_PATH does not appear to have any effect, because search_path is a server setting rather than a psql variable.
Instead, it has to be set with the SQL command SET search_path TO <path>, which can be passed with the -c option, but, at least in older psql versions, not in combination with -f (it's either or).
Given that, I think modifying the file with sed is probably the best all around option in this case for use with -f.
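The placeholder idea described above does not have to be done with sed. As a hedged sketch, the same two steps in Python could look like this; the function names make_template and render are made up for this example, and the code assumes the schema name appears in the dump as a whole word. Like the sed approach, it would also rewrite the schema name if it happened to occur as a whole word inside the table data, so check the result before loading it.

```python
import re

# Placeholder token taken from the answer above.
PLACEHOLDER = "$$placeholder_schema$$"


def make_template(dump_sql: str, old_schema: str) -> str:
    """Replace every whole-word occurrence of old_schema with the placeholder."""
    pattern = re.compile(r"\b" + re.escape(old_schema) + r"\b")
    # A function is used as the replacement so the token is inserted literally.
    return pattern.sub(lambda _match: PLACEHOLDER, dump_sql)


def render(template_sql: str, schema: str) -> str:
    """Fill the placeholder with the schema the data should be loaded into."""
    return template_sql.replace(PLACEHOLDER, schema)
```

The template is created once from the dump, and render is then called for each target schema, with the result written to a file that is passed to psql -f.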

Redis move all keys from one database to another

Is there a command to move Redis keys from one database to another, or is it possible only with Lua scripting?
This type of question has been asked previously (redis move all keys), but the answers are not appropriate or convincing for a beginner like me.

You can use MOVE to move one key to another Redis database.
The text below is from redis.io:
MOVE key db
Move key from the currently selected database (see SELECT) to the specified destination database. When key already exists in the destination database, or it does not exist in the source database, it does nothing. It is possible to use MOVE as a locking primitive because of this.
Return value
Integer reply, specifically:
1 if key was moved.
0 if key was not moved.

I think this will do the job:
redis-cli keys '*' | xargs -I % redis-cli move % 1 > /dev/null
(1 is the new database number, and redirection to /dev/null is in order to avoid getting millions of '1' lines - since it will move the keys one by one and return 1 each time)
Beware that redis might run out of connections and then will display tons of such errors:
Could not connect to Redis at 127.0.0.1:6379: Cannot assign requested address
So it could be better (and much faster) to just dump the database and then import it into the new one.

If you have a big database with millions of keys, you can use the SCAN command to select all the keys (without blocking like the dangerous KEYS command that even Redis authors do not recommend).
SCAN gives you the keys by "pages" one by one and the idea is to start at page 0 (formally called CURSOR 0) and then continue with the next page/cursor until you hit the end (the stop signal is when you get the CURSOR 0 again).
You may use any popular language for this, such as Ruby or Scala. Here is a draft using Bash scripting:
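#!/bin/bash -e
REDIS_HOST=localhost
PAGE_SIZE=10000
KEYS_TO_QUERY="*"
SOURCE_DB=0
TARGET_DB=1
TOTAL=0
# CURSOR starts unset; the loop ends when SCAN hands back cursor 0 again
while [[ "$CURSOR" != "0" ]]; do
    CURSOR=${CURSOR:-0}
    >&2 echo "$TOTAL:$CURSOR"
    KEYS=$(redis-cli -h "$REDIS_HOST" -n "$SOURCE_DB" scan "$CURSOR" match "$KEYS_TO_QUERY" count "$PAGE_SIZE")
    unset CURSOR
    # The first word of the reply is the next cursor, the rest are keys
    # (word splitting means keys containing whitespace are not handled)
    for KEY in $KEYS; do
        if [[ -z $CURSOR ]]; then
            CURSOR=$KEY
        else
            TOTAL=$((TOTAL + 1))
            redis-cli -h "$REDIS_HOST" -n "$SOURCE_DB" move "$KEY" "$TARGET_DB"
        fi
    done
done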
IMPORTANT: As usual, please do not copy and paste scripts without understanding what they do, so here are some details:
The while loop selects the keys page by page with the SCAN command and runs the MOVE command on every key.
The SCAN command returns the next cursor on the first line, and the remaining lines are the keys found. The while loop starts with the variable CURSOR undefined and defines it in the first iteration (this is a small trick so that the loop stops only when the next CURSOR 0 signals the end of the scan).
PAGE_SIZE: how many keys each SCAN query asks for. Lower values put very little load on the server but are slow; bigger values make the server "sweat" but are faster. The network is affected here, so try to find a sweet spot around 10000 or even 50000 (ironically, values of 1 or 2 may also stress the server, because of the network overhead of each query).
KEYS_TO_QUERY: a pattern for the keys you want to query; for example, "*balance*" will select the keys that include balance in their name (don't forget the quotes, to avoid syntax errors). Additionally, you can do the filtering on the script side: query all the keys with "*" and add a Bash if conditional. This will be slower, but it helps if you cannot find a pattern for your key selection.
REDIS_HOST: localhost by default; change it to any server you like (if you are using a port other than the default 6379, add the redis-cli -p option to the calls, since -h takes only the host name).
SOURCE_DB: the database ID you want to move the keys from (by default 0)
TARGET_DB: the database ID you want to move the keys to (by default 1)
You can use this script to execute other commands or checks on the keys; just replace the MOVE command call with anything you may need.
NOTE: To move keys from one Redis server to another Redis server (this is not only moving between internal databases) you can use redis-utils-cli from the NPM packages here -> https://www.npmjs.com/package/redis-utils-cli
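For comparison, the same cursor loop can be sketched in Python. The function below is only an illustration: move_all_keys is a made-up name, and it expects a client object that offers scan(cursor=..., match=..., count=...) returning a (next_cursor, keys) pair and move(key, db), which is how I understand the redis-py client to behave. Nothing here imports redis, so the client is passed in by the caller.

```python
def move_all_keys(client, target_db, match="*", page_size=10000):
    """Move every key matching `match` from the client's current db to target_db.

    `client` is assumed to behave like a redis-py client connected to the
    source database (scan returns (next_cursor, keys), move returns truthy
    when the key was moved).
    """
    moved = 0
    cursor = 0
    while True:
        # SCAN returns the next cursor plus one page of keys.
        cursor, keys = client.scan(cursor=cursor, match=match, count=page_size)
        for key in keys:
            # MOVE does nothing if the key already exists in the target db.
            if client.move(key, target_db):
                moved += 1
        if int(cursor) == 0:  # cursor 0 again means the scan is complete
            break
    return moved
```

With redis-py installed, a call would look roughly like move_all_keys(redis.Redis(host="localhost", db=0), 1). Keys are passed to MOVE as values and never through a shell, so keys containing spaces are not a problem here.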

Purging an SQL table

I have an SQL table which is used for logging purposes (there are lakhs of records in the table). I need to purge the table (take a backup of the data and then clear the table).
Is there a standard way of doing this that I can automate?

You can do this within SQL Server Management Studio, by:
right clicking Database > Tasks > Generate Script
You can then select the table you wish to script out and also choose to include any associated objects, such as constraints and indexes.
The Stack Overflow question "Table-level backup" will give you more insight on this.
As for your automation requirement:
You can use the bcp utility, which copies data between an instance of Microsoft SQL Server and a data file in a user-specified format.
Sample syntax to export:
bcp "select * from [MyDatabase].dbo.Customer " queryout "Customer.bcp" -N -S localhost -T -E
You can automate this command with any scheduling mechanism (a UNIX scheduler, etc.).

We can simply create a job that runs once a month and
--> backs up the data into another table, such as an archive table
--> then deletes the data from the main table
It's a primitive form of partitioning, I guess. It is also more flexible when you need to query the old data, which now lives in the archive table.

Backup MySQL database

I have a MySQL Database of about 1.7GB. I usually back it up using mysqldump and this takes about 2 minutes. However, I would like to know the answers to the following questions:
Does mysqldump block read and/or write operations to the database? Because in a live scenario, I would not want to block users from using the database while it is being backed up.
It would be ideal for me to only backup the WHOLE database once in, say, a week, but in the intermediate days only one table needs to be backed up as the others won't change. Is there a way to achieve this?
Is mysqlhotcopy a better alternative for these purposes?

mysqlhotcopy does not work in certain cases where the read lock is lost, and it does not work with InnoDB tables.
mysqldump is more widely used because it can back up all kinds of tables.
From the MySQL documentation:
mysqlhotcopy is a Perl script that was originally written and contributed by Tim Bunce. It uses LOCK TABLES, FLUSH TABLES, and cp or scp to make a database backup quickly. It is the fastest way to make a backup of the database or single tables, but it can be run only on the same machine where the database directories are located. mysqlhotcopy works only for backing up MyISAM and ARCHIVE tables. It runs on Unix and NetWare
The mysqldump client is a backup program originally written by Igor Romanenko. It can be used to dump a database or a collection of databases for backup or transfer to another SQL server (not necessarily a MySQL server). The dump typically contains SQL statements to create the table, populate it, or both. However, mysqldump can also be used to generate files in CSV, other delimited text, or XML format.
Bye.

1) mysqldump locks tables only through certain options: --lock-tables (part of the --opt group, which is enabled by default) and --lock-all-tables take read locks, while --single-transaction reads a consistent snapshot of InnoDB tables without blocking other sessions. If you want your backup to be consistent, use --single-transaction or --lock-all-tables, or you might get an inconsistent database snapshot. Note: --single-transaction works only for InnoDB.
2) sure, just enumerate the tables you want to be backed up after the database name:
mysqldump OPTIONS DATABASE TABLE1 TABLE2 ...
Alternatively you can exclude the tables you don't want:
mysqldump ... --ignore-table=DATABASE.TABLE1 --ignore-table=DATABASE.TABLE2 .. DATABASE
So you can do a whole database dump once a week and backup only the changing tables once a day.
3) mysqlhotcopy only works on MyISAM tables, and in most applications you are better off with InnoDB. There are commercial (quite expensive) tools for hot backup of InnoDB tables. More recently there is also an open-source tool for this purpose: Xtrabackup.
Also, to automate the process you can use astrails-safe. It supports database backup with mysqldump and filesystem with tar. +encryption +upload to S3, +many other goodies. There is no xtrabackup support yet, but it should be easy to add if this is what you need.
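The schedule from point 2 (whole database once a week, only the changing tables on the other days) could be scripted along these lines. This is a rough Python sketch under assumptions: dump_command is a made-up helper, Sunday is picked arbitrarily as the full-dump day, --single-transaction assumes InnoDB tables, and credentials are expected to come from an option file such as ~/.my.cnf rather than the command line.

```python
from datetime import date


def dump_command(database, day, changing_tables, full_dump_weekday=6):
    """Return the mysqldump argument list for `day`.

    On the weekly full-dump day (date.weekday() numbering, 6 = Sunday) the
    whole database is dumped; on other days only `changing_tables`.
    """
    cmd = ["mysqldump", "--single-transaction", database]
    if day.weekday() != full_dump_weekday:
        # Tables listed after the database name restrict the dump to them.
        cmd += list(changing_tables)
    return cmd
```

The returned list can be handed to subprocess.run(cmd, stdout=outfile, check=True) from a daily cron job. Building the command as a list avoids shell quoting problems with table names.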

Adding a MySQL slave to your setup would allow you to take consistent backups without locking the production database.
Adding a slave also gives you a binary log of changes. A dump is a snapshot of the database at the time you took the dump. The binary log contains all statements that modified the data, along with a timestamp.
If you have a failure in the middle of the day and you're only taking backups once a day, you've lost half a day's worth of work. With binary logs and mysqldump, you could restore from the previous day and 'play' the logs forward to the point of failure.
http://dev.mysql.com/doc/refman/5.0/en/binary-log.html
If you're running MySQL on a Linux server with LVM disks or a Windows server with VSS, you should check out Zmanda.
It takes binary diffs of the data on disk, which is much faster to read and restore than a text dump of the database.

1) Locking is controlled by the --lock-tables option; note that it is part of the --opt group, which is enabled by default, and can be switched off with --skip-lock-tables.
2) If you don't specify any tables then the whole DB is backed up, or you can specify a list of tables:
mysqldump [options] db_name [tables]
3) I haven't used mysqlhotcopy, sorry. However, I run a number of MySQL DBs, some bigger and some smaller than 1.7 GB, and I use mysqldump for all my backups.

Maatkit dump might be useful.
http://www.maatkit.org/doc/mk-parallel-dump.html

For MySQL and PHP, try this.
This will also remove files after n days.
<?php
// Dump the database, then remove dump files older than $remove_days days.
$dbhost = 'localhost';
$dbuser = 'xxxxx';
$dbpass = 'xxxxx';
$dbname = 'database1';
$folder = 'backups/'; // Name of folder you want to place the file
$filename = $dbname . date("Y-m-d-H-i-s") . ".sql";
$remove_days = 7; // Number of days that the file will stay on the server
// escapeshellarg() keeps special characters in the values from breaking the command
$command = "mysqldump --host=" . escapeshellarg($dbhost)
    . " --user=" . escapeshellarg($dbuser)
    . " --password=" . escapeshellarg($dbpass)
    . " " . escapeshellarg($dbname)
    . " > " . escapeshellarg($folder . $filename);
system($command);
$files = glob($folder . "*.sql");
foreach ($files as $file) {
    if (is_file($file)
        && time() - filemtime($file) >= $remove_days * 24 * 60 * 60) { // e.g. 2 days = 2*24*60*60
        unlink($file);
        echo "$file removed \n";
    } else {
        echo "$file was last modified: " . date("F d Y H:i:s.", filemtime($file)) . "\n";
    }
}
