Fastest way to clear the content out of many tables - sql

Right now we're using TRUNCATE to clear out the contents of 798 tables in Postgres (isolated test runs). Where possible we use transactions. However, in places where that's not possible, we'd like the fastest way to reset the state of the DB.
We're working towards only calling TRUNCATE on the tables that have actually been modified (for any given test only a few of the 798 tables will be touched).
What is the fastest way to delete all of the data from many PostgreSQL tables?
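(For reference, TRUNCATE accepts a list of tables, so the "only the modified tables" approach can be done in a single statement; the table names below are hypothetical.)
-- Clear just the tables touched by the test, reset their sequences, and
-- follow foreign keys in one statement
TRUNCATE TABLE users, orders, order_items RESTART IDENTITY CASCADE;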

Two things come to mind:
Set up the clean DB as a template and createdb a copy of it before each test.
Set up the clean DB as the default schema, but run the TransactionTests in a different schema (SET search_path TO %s).
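A rough sketch of both options, assuming the pristine database is named clean_template and the throwaway test schema is test_run_1 (both names hypothetical):
-- Option 1: clone the template database before each run (file-level copy, fast)
CREATE DATABASE test_run_1 TEMPLATE clean_template;

-- Option 2: run the test inside a throwaway schema and drop it afterwards
CREATE SCHEMA test_run_1;
SET search_path TO test_run_1, public;
-- ... run the test ...
DROP SCHEMA test_run_1 CASCADE;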

Related

How to sync table schemas without dropping the tables?

Not:
DROP -> CREATE
I need:
COMPARE -> ALTER
I have a test and a production database; the data within the two differs, but the schemas should be the same.
I need something like a production script, a tool, or a method that compares the two database schemas and syncs them. I'm coding in Node.js and haven't used tools like an ORM or db-migrate; I created the database using MySQL Workbench, and writing every ALTER query by hand costs a lot. There must be an easier way.

Backing up portion of data in SQL

I have a huge schema containing billions of records. I want to purge data older than 13 months from it and keep that data as a backup in such a way that it can be recovered whenever required.
What is the best way to do this in SQL - can we create a separate copy of this schema and add a delete trigger on all tables, so that when the trigger fires the purged data gets inserted into this new schema?
Will there be only one record per DELETE statement if we use triggers, or will all the deleted records be inserted?
Can we somehow use bulk copy?
I would suggest this is a perfect use case for the Stretch Database feature in SQL Server 2016.
More info: https://msdn.microsoft.com/en-gb/library/dn935011.aspx
The cold data can be moved to the cloud based on your date criteria without any applications or users being aware of it when querying the database. No backups are required and it is very easy to set up.
There is no need for triggers; you can use a job that runs every day and moves outdated data into archive tables.
The best way, I guess, is to create a copy of the current schema. In the main part, delete everything older than 13 months; in the archive part, delete everything from the last 13 months.
Then create a stored procedure (or several) that collects the data, puts it into the archive, and deletes it from the main table, and run it from a daily job.
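A minimal sketch of such a procedure, assuming a hypothetical dbo.Orders table with an OrderDate column and an archive.Orders table of identical structure:
-- DELETE ... OUTPUT moves the old rows into the archive in one statement,
-- with no trigger involved
CREATE PROCEDURE dbo.usp_ArchiveOldOrders
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @cutoff date = DATEADD(MONTH, -13, GETDATE());

    DELETE FROM dbo.Orders
    OUTPUT DELETED.* INTO archive.Orders
    WHERE OrderDate < @cutoff;
END;
Scheduled from a daily SQL Server Agent job, this keeps the main table trimmed without row-by-row trigger overhead.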
The cleanest and fastest way to do this (with billions of rows) is to create a partitioned table, probably partitioned by month on a date column. Moving the data in a given partition is a metadata operation and is extremely fast (if the partition scheme and function are set up properly). I have managed 300GB tables using partitioning and it has been very effective. Be careful with the partition function so that dates at each edge are handled correctly.
Some of the other proposed solutions involve deleting millions of rows, which could take a long, long time to execute. Model the different solutions using Profiler and/or Extended Events to see which is the most efficient.
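A rough sketch of the partition-switch approach on SQL Server, assuming dbo.Orders is partitioned by month on a date column and archive.Orders_Staging is an empty table of identical structure on the same filegroup (all names hypothetical):
-- Switching a partition out is a metadata-only operation, so it completes
-- almost instantly no matter how many rows the partition holds
ALTER TABLE dbo.Orders
SWITCH PARTITION 1 TO archive.Orders_Staging;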
I agree with the above: do not create a trigger. Triggers fire on every insert/update/delete, which makes them very slow.
You may be best served by a data-archive stored procedure.
Consider using multiple databases: the current database holds your current data, and one or more archive databases hold the records you move out of it with some sort of nightly or monthly stored-procedure process.
The archive databases can use the exact same schema as your production system.
If the data is already in the database, there is no need for a bulk copy. From there you can back up the archive database so it is off the SQL Server, and restore it whenever the data needs to be made available again. This is much faster and more manageable than bulk copy.
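A minimal sketch of that backup/restore cycle, with a hypothetical ArchiveDB name and backup path:
-- Back up the archive database so the file can be moved off the SQL Server
BACKUP DATABASE ArchiveDB TO DISK = N'D:\Backups\ArchiveDB.bak' WITH COMPRESSION;

-- Later, restore it whenever the archived data is needed again
RESTORE DATABASE ArchiveDB FROM DISK = N'D:\Backups\ArchiveDB.bak' WITH RECOVERY;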
According to Microsoft's documentation on Stretch DB (found here - https://learn.microsoft.com/en-us/azure/sql-server-stretch-database/), you can't update or delete rows that have been migrated to cold storage or rows that are eligible for migration.
So while Stretch DB does look like a capable technology for archiving, the implementation in SQL Server 2016 does not appear to support an archive-and-purge workflow.

Updating/Inserting tables in one database from another database

How can I sync two databases and do a manual refresh of the entities in either database whenever I want?
Let's say I have two databases, DB1 (prod) and DB2 (dev). I want to update/insert only a few tables from the prod DB into the dev DB. How could I achieve this? Is this possible without a DB link, since I do not have privileges to create one?
If you only want to do a manual refresh, set up an import/export/Data Pump script to copy the data across if there is not too much data involved. If there is a large amount of data, you could write some PL/SQL as described above to move only the new/changed rows; this will be easier if your data has fields such as created/updated_on.
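A rough sketch of that merge step, assuming the prod rows have been loaded into a hypothetical staging_orders table on dev and both tables carry an updated_on column:
-- Upsert only new or changed rows into the dev table
MERGE INTO orders dst
USING staging_orders src
ON (dst.order_id = src.order_id)
WHEN MATCHED THEN
  UPDATE SET dst.status     = src.status,
             dst.updated_on = src.updated_on
  WHERE src.updated_on > dst.updated_on
WHEN NOT MATCHED THEN
  INSERT (order_id, status, updated_on)
  VALUES (src.order_id, src.status, src.updated_on);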

Migrating two columns in SQL script to the db of my rails app

I have a SQL script that has four columns and about 100 records. I only need two columns. I want to transfer the two columns into my seeds.rb file so that I can have these records in my db when I deploy my app. What would be the easiest way possible to do this? How would it look in my seeds.rb file?
The first thing to do is get the database into the format you want, and then create a database dump of some sort. MySQL makes this easier than SQLite. Put the INSERT statements into your seeds.rb file like this:
ActiveRecord::Base.connection.execute("INSERT INTO `example` (`abbreviation`,`name`)
VALUES
('ABC', 'Alphabet Broadcasting Company'),
('DEF', 'Denver Echo Factory'),
('GHI', 'Gimbal Helper Industries')
")
Although seeds.rb is a handy way of pre-populating certain critical things, like essential administrators or lookup tables for countries, it does become difficult to maintain over time as seeds.rb must always conform to the latest schema.
It may be easier to simply deploy a seed SQLite file and migrate that instead. With MySQL you typically deploy and load a seed database dump to get things started, then migrate and enhance as required.

SSIS storing logging variables in a derived column

I am developing SSIS packages that consist of 2 main steps:
Step 1: Grab all sorts of data from existing legacy systems and dump them into a series of staging tables in my database.
Step 2: Move the data from my staging tables into a more relational set of tables that I'm using specifically for my project.
In step 1 I'm just doing a bulk SELECT and a bulk INSERT; however, in step 2 I'm doing row-by-row inserts into my tables using OLEDB Command tasks so that I can log very specific row-level activity of everything that's happening. Here is my general layout for step 2 processes.
[screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_1.png]
You'll notice 3 OLEDB tasks: 1 for the actual INSERT, and 2 for success/fail INSERTs into our logging table.
The main thing I'm logging is source table/id and destination table/id for each row that passes through this flow. I'm storing this stuff in variables and adding them to the data flow using a Derived Column so that I can easily map them to the query parameters of the stored procedures.
[screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_3.png]
I've decided to store these logging values in variables instead of hard-coding them in the SqlCommand field on the task, because I'm pretty sure you CAN'T put variable expressions in that field (i.e. exec storedproc @[User::VariableName], ... , ... , ...). So this is the best solution I've found.
[screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_2.png]
Is this the best solution? Probably not.
Is it good performance wise to add 4 logging columns to a data flow that consists of 500,000 records? Probably not.
Can you think of a better way?
I really don't think calling an OLE DB Command 500,000 times is going to be performant.
If you are already going through staging tables, load it all into a staging table and take it from there in T-SQL or even another data flow (or into a raw file and then something else, depending on your complete operation). A bulk insert is going to be hugely more efficient.
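A rough sketch of that set-based alternative, assuming hypothetical staging.Orders, dbo.Orders, and dbo.LoadLog tables: one INSERT ... SELECT replaces half a million OLE DB Command calls, and the row-level source/destination logging can be captured with an OUTPUT clause:
-- Load everything from staging in one statement and log each row's
-- source/destination identifiers as a side effect
INSERT INTO dbo.Orders (OrderId, CustomerId, Amount)
OUTPUT 'staging.Orders', INSERTED.OrderId,
       'dbo.Orders',     INSERTED.OrderId, SYSDATETIME()
  INTO dbo.LoadLog (SourceTable, SourceId, DestTable, DestId, LoadedAt)
SELECT OrderId, CustomerId, Amount
FROM staging.Orders;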
To add to Cade's answer: if you truly need the logging info on a row-by-row basis, your best bet is to use the OLE DB Destination and one or both of the following transformations to add columns to the data flow:
Derived Column Transformation
Audit Transformation
This should be your best bet and shouldn't add much overhead.