How does '--checksum-seed' work in rsync backups?

I use rsync a lot for my backups.
Generally I run it with --archive and --delete-after to make the backups.
After the backup finishes, I run the same command again, but with --checksum, to verify the integrity of all the copied data. With that option, rsync takes a long time 'sending incremental file list'; I suppose that's because computing checksums for every file is slow.
Well, I read somewhere that one way to speed up the checksum process is to use '--checksum-seed' with a number, for example --checksum-seed=1234.
I tried it, and it works; the process is much faster.
However, I don't understand how it works, and I don't know what number I should use as the value for --checksum-seed. Also, I'm not sure whether relying on --checksum-seed is safe.
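For reference, the commands I run look roughly like this (the source and destination paths are just placeholders):

# initial backup
rsync --archive --delete-after /source/ /backup/

# second pass to verify the copied data, with the seed I read about
rsync --archive --delete-after --checksum --checksum-seed=1234 /source/ /backup/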
Thanks for your time!

Related

How to quickly revert data changes in a SQL Server database made by an end-to-end test?

Provided we have a suite of end-to-end automated tests, the goal is that every test starts with the same (initial) set of data in the database, to get reliable results.
Has anybody found a good solution for quickly reverting changes that were made in the SQL Server database during test execution?
There is always the possibility of truncating all the tables and re-importing the initial data through SQL, but I'm wondering if there is something more elegant, such as reverting to a snapshot.
Has somebody tried the
RESTORE DATABASE FROM DATABASE_SNAPSHOT
SQL Server feature? Is it fast enough? Does its speed depend on the amount of data in the database when the snapshot is created, or rather on how many changes have been made since the snapshot was created?
Thank you very much for any opinions on this.
The snapshot feature works by "capturing" the original database pages to the snapshot before they are first modified. It's a kind of copy-on-write. The original page goes to the snapshot, then the modification proceeds normally.
When you revert a snapshot, those changed pages are written back to the source database, so the time it takes is proportional to how much data you have changed. More precisely, it is proportional to how many database pages have been changed. Say you changed 100 records but they happened to be on 100 different pages; that's 100 pages to restore. So data locality matters.
Also, since you are making a copy of each changed page before writing to it, your modifications are expected to take a bit longer.
In my experience it's fast enough, but it depends on how much data you are churning.
I probably wouldn't do a truncate/import, as I find it to be much more work. If you change a lot of data during your tests, restoring a full backup might be easier/faster. You have to check what works best for you.
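For reference, a rough sketch of the create/revert cycle (the database name, logical file name, and snapshot path below are made up):

-- create a snapshot of the test database before the test run
CREATE DATABASE TestDb_Snapshot
ON (NAME = TestDb_Data, FILENAME = 'C:\Snapshots\TestDb_Snapshot.ss')
AS SNAPSHOT OF TestDb;

-- ... run the tests, modify data ...

-- revert: only the pages changed since snapshot creation are copied back
-- (requires exclusive access to TestDb and that no other snapshots of it exist)
RESTORE DATABASE TestDb FROM DATABASE_SNAPSHOT = 'TestDb_Snapshot';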

Can Npgsql dump/restore an entire database?

Is it possible to use Npgsql in a way that basically mimics pg_dumpall to a single output file without having to iterate through each table in the database? Conversely, I'd also like to be able to take such output and use Npgsql to restore an entire database if possible.
I know that with more recent versions of Npgsql I can use the BeginBinaryExport, BeginTextExport, or BeginRawBinaryCopy methods to export from the database to STDOUT or to a file. On the other side of the process, I can use the BeginBinaryImport, BeginTextImport, or BeginRawBinaryCopy methods to import from STDIN or an existing file. However, from what I've been able to find so far, these methods use the COPY SQL syntax, which (AFAIK) is limited to a single table at a time.
Why am I asking this question? I currently have an old batch file that I use to export my production database to a file (using pg_dumpall.exe) before importing it back into my testing environment (using psql.exe with the < operation). This has been working pretty much flawlessly for quite a while now, but we've recently moved the server to an off-site hosted environment, which is causing a delay that prevents the batch file from completing successfully. Because of the potential for other connectivity/timeout issues, I'm thinking of moving the batch file's functionality to a .NET application, but this part has got me a bit stumped.
Thanks for your help and let me know if you need any further clarification.
This has been asked for in https://github.com/npgsql/npgsql/issues/1397.
Long story short, Npgsql doesn't have any sort of support for dumping/restoring entire databases. Implementing that would be a pretty significant effort that would pretty much duplicate all the pg_dump logic, and the danger of subtle omissions and bugs would be considerable.
If you just need to dump data for some tables, then as you mentioned, the COPY API is pretty good for that. If, however, you also need to save the schema itself as well as other, non-table entities (sequence state, extensions...), then the only current option AFAIK is to execute pg_dump as an external process (or use one of the other backup/restore options).
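If you do end up shelling out, the external-process route comes down to essentially the same two commands the existing batch file runs (host, user, and file names below are placeholders):

pg_dumpall.exe -h prod-host -U postgres -f dump.sql
psql.exe -h test-host -U postgres -d postgres -f dump.sql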

Is it possible to run the query COPY TO STDOUT WITH BINARY and stream the results with node-postgres?

I'm worried about data-type coercion: will I get a nice Buffer or Uint8Array? Can I get it in chunks (streaming)?
Delving into npm I found: https://www.npmjs.com/package/pg-copy-streams -- this is the answer I was looking for.
Here is a bit more information (copied from the README) so you can avoid traversing the link:
pg-copy-streams
COPY FROM / COPY TO for node-postgres. Stream from one database to
another, and stuff.
how? what? huh?
Did you know the all powerful PostgreSQL supports streaming binary
data directly into and out of a table? This means you can take your
favorite CSV or TSV or whatever format file and pipe it directly into
an existing PostgreSQL table. You can also take a table and pipe it
directly to a file, another database, stdout, even to /dev/null if
you're crazy!
What this module gives you is a Readable or Writable stream directly
into/out of a table in your database. This mode of interfacing with
your table is very fast and very brittle. You are responsible for
properly encoding and ordering all your columns. If anything is out of
place PostgreSQL will send you back an error. The stream works within
a transaction so you wont leave things in a 1/2 borked state, but it's
still good to be aware of.
If you're not familiar with the feature (I wasn't either) you can read
this for some good helps:
http://www.postgresql.org/docs/9.3/static/sql-copy.html
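To make that concrete, a minimal sketch of streaming a table out in binary with pg-copy-streams might look like this (the table and output file names are made up, and connection settings are assumed to come from the usual PG* environment variables):

const { Client } = require('pg')
const copyTo = require('pg-copy-streams').to
const fs = require('fs')

const client = new Client() // connection details come from the environment
client.connect(err => {
  if (err) throw err
  // client.query() hands back the readable stream created by copyTo();
  // it emits raw Buffer chunks straight off the COPY protocol, so no data-type coercion happens
  const stream = client.query(copyTo('COPY my_table TO STDOUT WITH (FORMAT binary)'))
  stream.pipe(fs.createWriteStream('my_table.bin'))
  stream.on('error', err => { console.error(err); client.end() })
  stream.on('end', () => client.end())
})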

Optimizing MSSQL for High Volume Single Record Inserts

We have a C# application that receives a file each day with ~35,000,000 rows. It opens the file, parses each record individually, formats some of the fields and then inserts one record at a time into a table. It's really slow, which is expected, but I've been asked to optimize it.
I have been instructed that any optimizations must be confined to SQL only, i.e., there can be no changes to the process or the C# code. I'm trying to come up with ideas on how I can speed up this process while being limited to SQL modifications only. I have a couple of ideas I want to try, but I'd also like feedback from anyone who has found themselves in this situation before.
Ideas:
1. Create a clustered index on the table so inserts always occur at the tail end of the table. The records in the file are ordered by date/time and the current table has no clustered index, so this seems like a valid approach.
2. Somehow reduce the logging overhead. This data is volatile in nature, so being able to roll back is not a big deal. Even if the process blew up halfway through, they would just restart it.
3. Change the isolation level. Perhaps there is an isolation level that is more suited to sequential single-record inserts.
4. Reduce connection time. The C# app is opening/closing a connection for each insert. We can't change the C# code, though, so perhaps there is a trick to reducing the overhead/time of making a connection.
I appreciate anyone taking the time to read my post and throw out any ideas they feel would be worth it.
Thanks,
Dean
I would suggest the following -- if possible.
Load the data into a staging table.
Do the transformations in SQL.
Bulk insert the data into the final table.
The second suggestion would be:
Modify the C# code to write the data into a file.
Bulk insert the file, either into a staging table or the final table.
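Either way, the heavy lifting becomes one bulk load plus one set-based transform; a rough sketch (table names, columns, and the file path are all hypothetical):

-- 1. bulk load the raw file into a staging table
--    (can be minimally logged with TABLOCK under the simple or bulk-logged recovery model)
BULK INSERT dbo.DailyFile_Staging
FROM 'D:\imports\daily_file.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- 2. transform and insert into the final table in a single statement
INSERT INTO dbo.FinalTable (RecordDate, Field1, Field2)
SELECT CONVERT(datetime, RawDate, 112),
       LTRIM(RTRIM(RawField1)),
       UPPER(RawField2)
FROM dbo.DailyFile_Staging;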
Unfortunately, your problem is 35 million round trips from C# to the database. I doubt there is any database optimization that can fix that performance problem. In other words, you need to change the C# code to fix the performance issue. Anything else is probably just a waste of your time.
You can minimize logging either by using simple recovery or writing to a temporary table. Either of those might help. However, consider the second option, because it would be a minor change to the C# code and could result in big improvements.
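The recovery-model change mentioned above, if you go that route, is a single setting (the database name is a placeholder):

ALTER DATABASE MyImportDb SET RECOVERY SIMPLE;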
Or, if you have to do the best in a really bad situation:
Run the C# code and database on the same server. Be sure it has lots of processors.
Attach lots of SSD or memory for the database (if you are not already using it).
Load the data into table spaces that are only on SSD or in memory.
Copy the data from the local database to the remote one.

How to create a "live" feed for two separate Postgres Instances?

So I feel pretty confident inside of Postgres, but I have what I think is an interesting problem.
I have my local Postgres instance and a remote Postgres instance. My remote instance is read only as it is a production server. I need to be able to pull records and generate views/tables/reports/whatever.
How can I accomplish that?
Currently I am using dblink, running every 15 minutes, pretty much resetting my local instance by dropping all objects and using pgAgent jobs to rebuild all the objects ready for the next cycle. It is really labor-intensive to make changes and even worse to troubleshoot.
My eventual solution was to make views through dblink. It is slightly clunky, but the speed increase is substantial and worth the more restrictive coding requirements for the connection.
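As a rough illustration, the dblink-backed views look something like this (the connection string, remote query, and column definitions are made up):

CREATE VIEW remote_orders AS
SELECT *
FROM dblink('host=prod.example.com dbname=prod user=readonly password=secret',
            'SELECT order_id, customer_id, total FROM orders')
     AS t(order_id integer, customer_id integer, total numeric);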