PDI or mysqldump to extract data without blocking the database or getting inconsistent data? - pentaho

I have an ETL process that will run periodically. I was using Kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with Table input and Table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know whether the source database would be blocked. This would be a problem if the extraction takes several minutes (and it will). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction would let me get the data in a consistent way without blocking the source database (all the tables are InnoDB). The disadvantage is that I would pull unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I don't think xtrabackup is a good option.
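For reference, a mysqldump invocation along the lines described above might look like this; host, credentials, schema and table names are all placeholders, and the --where filter only makes sense if every dumped table has such a timestamp column:

    # Consistent, non-blocking dump of two specific InnoDB tables (names are
    # placeholders). --single-transaction reads everything from one snapshot;
    # --where can emulate the timestamp-based incremental extract, but note
    # that mysqldump applies it to every table being dumped.
    mysqldump --single-transaction --host=source_host --user=etl_user -p \
        --where="updated_at > '2015-01-01 00:00:00'" \
        source_db customers orders > stage_extract.sql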

However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think "Table Input" step doesn't take into account any modifications that are happening when you are reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. While in the middle of data load, insert few records in the source database. You will find that those records are not read into the target table. (note i tried with postgresql db and the number of rows read is : 1000000)
Now for your question, i suggest you using PDI since it gives you more control on the data in terms of versioning, sequences, SCDs and all the DWBI related activities. PDI makes it easier to load to the stage env. rather than simply dumping the entire tables.
Hope it helps :)

Interesting point. If you do all the table inputs in one transformation, then at least they all start at the same time; the result is likely to be consistent, but it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk extracts and inserts it's nearly always better to use the vendor-provided tools.

Related

Checking of replicated data Pentaho

I have about 100 tables to which we replicate data, e.g. from an Oracle database.
I would like to quickly check that the data replicated to the tables in DB2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data with a Table input step (sql_db2, sql_source, table_name) and write it to a Copy rows to result step. Next I read a single record and put it into a loop.
But here a problem arose: I don't know how to compare the data dynamically across the tables, because each table has different columns.
I don't know if this is even possible.
You can inject metadata (in this case, your metadata would be the column and table names) into a lot of steps in Pentaho. You create one transformation that collects the metadata and injects it into another transformation containing only the steps and some basic information; the bulk of the information about the columns affected by each step lives in the transformation doing the injecting.
Check Pentaho's official documentation about Metadata Injection (MDI) and the basic metadata injection sample available in your PDI installation.
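As a sketch of the driver-table idea that feeds such a loop (all names here are invented, and the per-table queries are just row-count/key-range probes):

    -- Hypothetical control table: one row per replicated table. A first
    -- transformation reads it and copies the rows to the result; a looping
    -- job then hands each pair of queries to the injected transformation.
    CREATE TABLE replication_checks (
        table_name  VARCHAR(128)  NOT NULL,
        sql_source  VARCHAR(2000) NOT NULL,  -- query run against Oracle
        sql_db2     VARCHAR(2000) NOT NULL   -- equivalent query run against DB2
    );

    INSERT INTO replication_checks VALUES
    ('CUSTOMERS',
     'SELECT COUNT(*), MIN(CUST_ID), MAX(CUST_ID) FROM APP.CUSTOMERS',
     'SELECT COUNT(*), MIN(CUST_ID), MAX(CUST_ID) FROM APP.CUSTOMERS');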

SSIS : Huge Data Transfer from Source (SQL Server) to Destination (SQL Server)

Requirements:
1. Transfer millions of records from the source (SQL Server) to the destination (SQL Server).
2. The structure of the source tables is different from that of the destination tables.
3. Refresh the data in the destination server once per week.
4. Minimum amount of time for the processing.
I am looking for an optimized approach using SSIS. I was thinking of these options:
Create an SQL dump from the source server and import that dump into the destination server.
Directly copy the tables from the source server to the destination server.
There are lots of issues to consider here, such as whether the servers are in the same domain, on the same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records, but in smaller amounts. An SSIS package can handle that batching logic for you, but you can also build it yourself, and smaller batches make it easier to iterate over the changes. Sometimes this is a reason to push changes more often rather than waiting an entire week, since smaller syncs are easier to manage and cause less downtime.
Another consideration is to be sure you understand your deltas and that you capture ALL of the changes. For this reason I would generally suggest using a staging table on the destination server. By moving changes to staging and then loading into the final table, you can more easily verify that changes are applied correctly. Think of the scenario of an increment arriving out of order (identity insert), datetimes ordered incorrectly, or one chunk failing. With a staging table you don't have to rely solely on the id/date; you can actually join on primary keys to look for changes.
Linked servers, as proposed by Alex K., can be a great fit, but you will need to pay close attention to a couple of things. Always run the transfer from the destination server so that it is a PULL, not a push. Linked servers are fast at querying data but horrible at bulk updates/inserts. Tables with XML columns cannot be queried over a linked server at all. And you may need to set some specific properties for distributed transactions.
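A minimal sketch of that pull pattern, run on the destination server ([SRCSRV] and all object names are placeholders):

    -- Run on the DESTINATION server so the linked-server read is a pull.
    INSERT INTO staging.Orders (OrderID, CustomerID, OrderDate)
    SELECT o.OrderID, o.CustomerID, o.OrderDate
    FROM [SRCSRV].SourceDb.dbo.Orders AS o
    WHERE o.OrderDate >= DATEADD(DAY, -7, GETDATE());  -- last week's delta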
I have done this task both ways, and I would say that SSIS has a bit of an advantage over linked servers because of its robust error handling, threading logic, and ability to use different adapters (OLE DB, ODBC, etc.; they have different performance, so do a search and you will find some results). But the key to your requirement #4 (minimum processing time) is to work in smaller chunks from a staging table, and if you can run the sync more often, it is less likely to have an impact. E.g. daily batches would already be about 1/7th the size of weekly ones, assuming an even daily distribution of changes.
Take 10,000,000 records changed per week:
Once weekly = 10 million
Once daily = ~1.4 million
Once hourly = ~59K records
Once every 5 minutes = fewer than 5K records
And if it has to be once a week, still think about doing it in small chunks so that each insert has a smaller effect on your transaction logs, actual lock time on the production table, etc. (a sketch follows below). Be sure you never allow partially staged/transferred data to be loaded, otherwise identifying deltas could get messed up and you could end up missing changes.
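A rough sketch of that chunking (the batch size and key column are assumptions, not recommendations):

    -- Move rows from staging to the production table in 50,000-row batches,
    -- so each transaction stays small in the log and locks are short-lived.
    DECLARE @batch INT = 50000;
    WHILE 1 = 1
    BEGIN
        INSERT INTO dbo.Orders (OrderID, CustomerID, OrderDate)
        SELECT TOP (@batch) s.OrderID, s.CustomerID, s.OrderDate
        FROM staging.Orders AS s
        WHERE NOT EXISTS (SELECT 1 FROM dbo.Orders AS o
                          WHERE o.OrderID = s.OrderID)
        ORDER BY s.OrderID;

        IF @@ROWCOUNT = 0 BREAK;  -- nothing left to move
    END;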
One other thought, if this is something like a reporting instance and you have enough server resources: you could bring your entire table over from production into a staging copy at the destination, then simply drop the current table and rename the staging table. This is an extreme approach and not one I generally like, but it is possible, and the actual impact on users would be very nominal.
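A sketch of that swap, assuming hypothetical staging and retired schemas exist (sp_rename cannot move a table between schemas, so ALTER SCHEMA ... TRANSFER does the exchange):

    -- Promote the freshly loaded copy in one short metadata-only operation.
    BEGIN TRAN;
        ALTER SCHEMA retired TRANSFER dbo.Report;      -- park the live table
        ALTER SCHEMA dbo     TRANSFER staging.Report;  -- promote the new copy
    COMMIT;
    DROP TABLE retired.Report;  -- or keep it for one cycle as a fallback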
I think SSIS is good at transferring data. My approach here:
1. Create a package with one Data Flow Task to transfer the data. If the structure of the two tables is different, that's okay; just map the columns.
2. Create a SQL Server Agent job to run your package every weekend.
Also, the Track Data Changes feature (SQL Server) is worth a look. You can configure when you want to sync the data, and it performs well too.
With SQL Server versions later than 2005, it has been my experience that dumping to a file and importing it is equal to or slower than transferring data directly from table to table with SSIS.
That said, and in addition to the excellent points @Matt makes, this is the usual pattern I follow for this sort of transfer:
Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
I typically put these into their own database schema so their purpose is clear.
I also typically use the SSIS OLE DB Destination package's "New" button to create the tables.
Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
This can be one package or many, depending on how many tables you're pulling over.
Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables.
Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
Use SSIS Execute SQL tasks to call the stored procedures.
Use parallel processing via Sequence Containers where possible to save time.
This can be one package or many, depending on how many tables you're transforming.
(Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want to use the stored procedures to land the data in your ultimate destination tables, or if you want to have the procedures write to intermediate tables, and then move the transformed data directly into the final tables. Using intermediate tables minimizes down time on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).
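As an illustration of the replica-then-transform steps of this pattern (every object name here is invented):

    -- Replica table filled by the extraction data flow; same shape as the source.
    CREATE TABLE replica.Customer (
        CustomerId INT           NOT NULL,
        FullName   NVARCHAR(200) NULL
    );
    GO

    -- Server-side transform, called from an Execute SQL task.
    CREATE PROCEDURE etl.LoadDimCustomer
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName)
        SELECT c.CustomerId,
               UPPER(LTRIM(RTRIM(c.FullName)))  -- the reshaping happens server-side
        FROM replica.Customer AS c
        WHERE NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                          WHERE d.CustomerKey = c.CustomerId);
    END;
    GO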

How to create a Temporary Table using (Select * into ##temp from table) syntax (for MS SQL) using Pentaho Data Integration

When I use the above syntax in an "Execute row script" step, it shows success but the temporary table does not get created. Please help me out with this.
Yes, the behavior you're seeing is exactly what I would expect. It works fine from the T-SQL prompt, throws no error in the transform, but the table is not there after the transform completes.
The problem here is the execution model of PDI transforms. When a transform is run, each step gets its own thread of execution. At startup, any step that needs a DB connection is given its own unique connection. After processing finishes, all steps disconnect from the DB. This includes the connection that defined the temp table. Once that happens (the defining connection goes out of scope), the temp table vanishes.
Note that this means that in a transform (as opposed to a Job), you cannot assume a specific order of completion of anything (without Blocking Steps).
We still don't have many specifics about what you're trying to do with this temp table and how you're using its data, but I suspect you want its contents to persist outside your transform. In that case you have some options, but a global temp table like this simply won't work.
Options that come to mind:
Convert the temp table to a permanent table. This is the simplest solution; you're basically making a staging table, loading it with a Table Output step (or whatever), and then reading it with Table Input steps in other transforms.
Write the table contents to a temp file with something like a Text File Output or Serialize to File step, then read it back in from the other transforms.
Store the rows in memory. This involves wrapping your transforms in a Job, and using the Copy Rows to Results and Get Rows from Results steps.
Each of these approaches has its own pros and cons. For example, storing rows in memory will be faster than writing to disk or network, but memory may be limited.
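A minimal T-SQL sketch of the first option, assuming a staging schema you control (the table and filter are made up):

    -- A permanent table survives PDI's per-step connections, unlike ##temp.
    IF OBJECT_ID('stage.work_orders', 'U') IS NOT NULL
        DROP TABLE stage.work_orders;

    SELECT *
    INTO stage.work_orders
    FROM dbo.orders
    WHERE status = 'OPEN';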
Another step it sounds like you might need, depending on what you're doing, is the ETL Metadata Injection step. In many cases it lets you dynamically move metadata from one transform to another. See the docs for descriptions of how each of these works.
If you'd like further assistance here, or I've made a wrong assumption, please edit your question and add as much detail as you can.

SQL, moving millions of records from one database to another

I am a C# developer and I am not really good with SQL, so I have a simple question here. I need to move more than 50 million records from one database to another. I tried to use the import function in MS SQL, but it got stuck because the log was full (I got the error message: The transaction log for database 'mydatabase' is full due to 'LOG_BACKUP'). The database recovery model was set to simple. My friend said that importing millions of records using Tasks -> Import Data will cause the log to become massive, and told me to use a loop to transfer the data instead. Does anyone know how and why? Thanks in advance.
If you are moving the entire database, use backup and restore, it will be the quickest and easiest.
http://technet.microsoft.com/en-us/library/ms187048.aspx
If you are just moving a single table, read about and use the bcp command-line utility for this many records:
The bcp utility bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files. Except when used with the queryout option, the utility requires no knowledge of Transact-SQL. To import data into a table, you must either use a format file created for that table or understand the structure of the table and the types of data that are valid for its columns.
http://technet.microsoft.com/en-us/library/ms162802.aspx
The fastest and probably most reliable way is to bulk copy the data out via SQL Server's bcp.exe utility. If the schema on the destination database is exactly identical to that on the source database, including nullability of columns, export it in "native format":
http://technet.microsoft.com/en-us/library/ms191232.aspx
http://technet.microsoft.com/en-us/library/ms189941.aspx
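A sketch of that native-format round trip (server, database, and table names are placeholders):

    rem Export on the source side in native format (-n), trusted connection (-T).
    bcp SourceDb.dbo.BigTable out BigTable.dat -n -S SOURCESRV -T

    rem Import on the destination; -E keeps identity values, -b commits in batches.
    bcp DestDb.dbo.BigTable in BigTable.dat -n -S DESTSRV -T -E -b 50000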
If the schema differs between source and target, you will encounter...interesting (yes, interesting is a good word for it) problems.
If the schemas differ or you need to perform any transforms on the data, consider using text format. Or another format (BCP lets you create and use a format file to specify the format of the data for export/import).
You might consider exporting data in chunks: if you encounter problems it gives you an easier time of restarting without losing all the work done so far.
You might also consider zipping the exported data files up to minimize time on the wire.
Then FTP the files over to the destination server.
Then bcp them in. You can use the bcp utility on the destination server or the BULK INSERT statement in SQL Server to do the work; it makes no real difference.
The nice thing about using BCP to load the data is that the load is what is described as a 'non-logged' transaction, though it's really more like a 'minimally logged' transaction.
If the tables on the destination server have IDENTITY columns, you'll need to use SET IDENTITY_INSERT to allow explicit values into the identity column on the table(s) involved for the nonce (don't forget to turn it back off). After your data is imported, you'll need to run DBCC CHECKIDENT to get things back in sync.
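A sketch of that identity handling (the table name is invented; note that bcp's -E switch and BULK INSERT's KEEPIDENTITY option cover the bulk-load case, while SET IDENTITY_INSERT applies to plain INSERTs):

    -- Allow explicit identity values for the duration of the load.
    SET IDENTITY_INSERT dbo.BigTable ON;
    -- ... load the rows here ...
    SET IDENTITY_INSERT dbo.BigTable OFF;

    -- Reseed so the next generated value follows the imported maximum.
    DBCC CHECKIDENT ('dbo.BigTable', RESEED);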
And depending on what you're doing, it can sometimes be helpful to put the database in single-user mode or dbo-only mode for the duration of the surgery: http://msdn.microsoft.com/en-us/library/bb522682.aspx
Another approach I've used to great effect is Perl's DBI/DBD modules (which provide access to the bulk copy interface): write a Perl script to pull the data out of the source server, transform it, and bulk load it directly into the destination server, without having to save it to disk and move it. This also means you can trap errors and design things for recovery and restart right at the point of failure.
Use BCP to migrate data.
Another approach I have used in the past is to take a backup of the transaction log and shrink the log prior to the migration. Split the migration script into parts and run the log backup / shrink / migrate iteration a few times.

SSIS storing logging variables in a derived column

I am developing SSIS packages that consist of 2 main steps:
Step 1: Grab all sorts of data from existing legacy systems and dump them into a series of staging tables in my database.
Step 2: Move the data from my staging tables into a more relational set of tables that I'm using specifically for my project.
In step 1 I'm just doing a bulk SELECT and a bulk INSERT; however, in step 2 I'm doing row-by-row inserts into my tables using OLEDB Command tasks so that I can log very specific row-level activity of everything that's happening. Here is my general layout for step 2 processes.
[screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_1.png]
You'll notice 3 OLEDB tasks: 1 for the actual INSERT, and 2 for the success/failure INSERTs into our logging table.
The main thing I'm logging is source table/id and destination table/id for each row that passes through this flow. I'm storing this stuff in variables and adding them to the data flow using a Derived Column so that I can easily map them to the query parameters of the stored procedures.
[screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_3.png]
I've decided to store these logging values in variables instead of hard-coding them in the SqlCommand field of the task, because I'm pretty sure you CAN'T put variable expressions in that field (i.e. exec storedproc @[User::VariableName], ..., ..., ...). So this is the best solution I've found.
[screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_2.png]
Is this the best solution? Probably not.
Is it good performance wise to add 4 logging columns to a data flow that consists of 500,000 records? Probably not.
Can you think of a better way?
I really don't think calling an OLEDBCommand 500,000 times is going to be performant.
If you are already going through staging tables, load it all into a staging table and take it from there in T-SQL, or even another data flow (or to a raw file and then something else, depending on your complete operation). A bulk insert is going to be hugely more efficient.
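For instance, a set-based insert can still log per-row details via the OUTPUT clause instead of 500,000 stored-procedure calls (all names here are invented; INSERT ... OUTPUT can only reference the inserted columns, so a source-row id would have to be carried through the target row or captured with MERGE):

    -- One set-based insert that also writes one log row per inserted row.
    INSERT INTO dbo.TargetTable (TargetCol1, TargetCol2)
    OUTPUT 'StagingTable',       -- source table name
           'TargetTable',        -- destination table name
           inserted.TargetId,    -- destination id (assumed identity column)
           GETDATE()
    INTO dbo.EtlRowLog (SourceTable, TargetTable, TargetId, LoadedAt)
    SELECT s.Col1, s.Col2
    FROM staging.StagingTable AS s;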
To add to Cade's answer: if you truly need the logging info on a row-by-row basis, your best bet is to keep the OLE DB destination and use one or both of the following transformations to add columns to the data flow:
Derived Column Transformation
Audit Transformation
This should be your best bet and shouldn't add much overhead.