Is there a way to check if both data sets are the same (Pentaho & OracleDB) - sql

I am using Pentaho (Spoon PDI) and have created input and output transformations. When running the SQL in OracleDB I get back all the rows and data; in Pentaho I can generate a preview of the data set (generally 1000 rows).
Is there a tool where I can verify that all the returned values are the same between the two? I have tried comparing the returned rows in each data set as a start, but it seems like an inefficient way to go about this. Any input would be appreciated.
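If both result sets are reachable from the database (for example, if the Pentaho output lands in a table Oracle can see), a set-difference query is one low-effort check. A minimal sketch in Oracle SQL, with hypothetical source_data and target_data tables that share the same column list:
-- Rows present in the source but missing from the target:
SELECT * FROM source_data
MINUS
SELECT * FROM target_data;
-- Rows present in the target but missing from the source:
SELECT * FROM target_data
MINUS
SELECT * FROM source_data;
-- Zero rows from both queries means the two data sets match.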

Related

Checking of replicated data Pentaho

I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data from a Table input step (sql_db2, sql_source, table_name) and write it to Copy rows to result. Next I read a single record and put it into a loop.
But here a problem arises, because I don't know how to dynamically compare the data for the tables: each table has different columns.
Is this even possible?
You can inject metadata (in this case your metadata would be the column and table names) into a lot of steps in Pentaho. You create one transformation to collect the metadata and inject it into another transformation that contains only the steps and some basic information; the bulk of the information about the columns affected by the different steps lives in the transformation that injects the metadata.
Check the official Pentaho documentation about Metadata Injection (MDI) and the basic metadata injection sample available in your PDI installation.
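For the "keep the queries in a table" idea from the question, a minimal sketch of such a driver table (table and column names are hypothetical; adjust to your schema):
-- One row per table to verify; a Table input step reads this and
-- passes the rows to "Copy rows to result" for the loop.
CREATE TABLE replication_checks (
  table_name VARCHAR(128),
  sql_source VARCHAR(4000),   -- query to run against the Oracle source
  sql_db2    VARCHAR(4000)    -- matching query to run against DB2
);

INSERT INTO replication_checks VALUES (
  'CUSTOMERS',
  'SELECT COUNT(*) FROM customers',
  'SELECT COUNT(*) FROM customers'
);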

Newly inserted or updated row count in Pentaho Data Integration

I am new to Pentaho Data Integration; I need to integrate one database to another location as an ETL job. I want to count the number of inserts/updates during the ETL job and insert that count into another table. Can anyone help me with this?
I don't think there is built-in functionality for returning the number of affected rows of an Insert/Update step in PDI to date.
Nevertheless, most database vendors are able to provide you with the ability to get the number of affected rows from a given operation.
In PostgreSQL, for instance, it would look like this:
/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ...
    VALUES
        ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ...
    SET ...
    WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;
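Filled in with a hypothetical target table, the INSERT variant could look like this:
-- Hypothetical table etl_target(id, val): RETURNING emits one row per
-- inserted record, and the outer SELECT counts them.
WITH inserted_rows AS (
    INSERT INTO etl_target (id, val)
    VALUES (1, 'a'), (2, 'b')
    RETURNING 1
)
SELECT count(*) AS rows_inserted FROM inserted_rows;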
However, you're aiming to do that from within a PDI job, so I suggest that you try to get to a point where you control the SQL script.
Suggestion: save the source data in a file on the target DB server, then use it, perhaps with bulk loading functionality, to insert/update, and then save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step in the Job's scope.
EDIT: the implementation is a matter of chosen design, so the suggested solution is one of many. On a very high level, you could do something like the following.
Transformation I - extract data from source
Get the data from the source, be it a database or anything else
Prepare it for output in a way that it fits the target DB's structure
Save a CSV file using the text file output step on the file system
Parent Job
If the PDI server is the same as the target DB server:
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
If the PDI server is NOT the same as the target DB server:
Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table
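As a hedged sketch of what that Execute SQL Script step might contain (PostgreSQL syntax; etl_target, etl_audit, the file path, and the unique id column are all assumptions, not part of the original answer):
-- Bulk-load the uploaded CSV into a staging table.
CREATE TEMP TABLE staging (LIKE etl_target);
COPY staging FROM '/tmp/source_data.csv' WITH (FORMAT csv, HEADER true);

-- Upsert into the target (assumes a unique constraint on id) and
-- record the affected-row count together with a time-stamp.
WITH affected AS (
    INSERT INTO etl_target
    SELECT * FROM staging
    ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val
    RETURNING 1
)
INSERT INTO etl_audit (run_ts, rows_affected)
SELECT now(), count(*) FROM affected;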
EDIT 2: another suggested solution
As suggested by @user3123116, you can use the Compare Fields step (if it is not part of your environment, check the marketplace for it).
The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.
Eventually, the comparison and counting part could be wired together as follows.
Also note that you can split the input of the source data stream (COPY, not DISTRIBUTE) and do your insert/update, but this stream must wait for the field comparison stream to finish its query on the target database; otherwise you might end up with the wrong statistics.
The Compare Fields step takes 2 streams as input for comparison, and its output is 4 distinct streams for "Identical", "Changed", "Added", and "Removed" records. You can count those 4 and then process the "Changed", "Added", and "Removed" records with an Insert/Update step.
You can do it from the Logging option inside the Transformation settings. Please follow the steps below:
Click on Edit menu --> Settings
Switch to Logging Tab
Select Step from the left menu
Provide the Log Connection & Log table name (say StepLog)
Select the required fields for logging (LINES_OUTPUT for the inserted count & LINES_UPDATED for the updated count)
Click on SQL button and create the table by clicking on the Execute button
Now all the steps will be logged into the log table (StepLog), and you can use it for further actions.
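For example, once the transformation has run, the counts can be read back from the log table like this (column names follow PDI's default step log layout; verify them against the table the SQL button generated):
-- Latest batch first; LINES_OUTPUT holds inserts, LINES_UPDATED updates.
SELECT ID_BATCH, TRANSNAME, STEPNAME, LINES_OUTPUT, LINES_UPDATED
FROM StepLog
ORDER BY ID_BATCH DESC;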
Enjoy

SSRS Complex Query Report

I am developing a report in SSRS.
My report has around 50 row headers. The data for each row header is the result of a complex query to the database.
Two row headers may or may not have data that relates to one another.
In this case what would be the best way to create the report?
-- Do we create a procedure that gets all the data into a temporary table and then generates the report using this temp table?
-- Do we create multiple datasets for this report?
Please advise on what would be the best way to proceed.
I read somewhere about using a linked server, wherein data is retrieved from the PostgreSQL database (the project uses a PostgreSQL db) into the local SQL Server that SSRS provides.
The report then retrieves data from the local SQL Server to generate the report.
Thoughts?
You are best off using a stored procedure as the source. It is easy to optimize the stored procedure for the best performance, so the report runs fast.
Assembling all your data so that you can use a single dataset to represent it would be the way to go.
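A minimal sketch of that pattern (T-SQL, with hypothetical object names; the same UNION ALL shape works in PostgreSQL): each row header's complex query contributes its rows to one combined result set, so the report binds to a single dataset.
-- One procedure, one result set: each UNION ALL branch stands in for
-- one row header's complex query.
CREATE PROCEDURE dbo.usp_ComplexReportData
AS
BEGIN
    SET NOCOUNT ON;

    SELECT 'Row header 1' AS RowHeader, SUM(Amount) AS Value
    FROM dbo.SomeFactTable
    UNION ALL
    SELECT 'Row header 2', COUNT(*)
    FROM dbo.SomeOtherTable;
END;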

Pentaho ETL Tool data migration

I am migrating data through Pentaho. A problem occurs when the number of rows is more than 4 lakhs (400,000): the transaction fails partway through. How can we migrate large data volumes with the Pentaho ETL tool?
As basic debugging, do the following:
If your output is a text file or Excel file, make sure that you check the size of the string/text columns. By default, the Text file output step will take the maximum string length, and when you start writing it can throw heap errors. So reduce the size and re-run the .ktr files.
If the output is a Table output step, then again check the column datatypes and the maximum column sizes defined in your output table.
Kindly share the error logs if you think there is something else going on. :)

Using SSIS to denormalize data from an XML file and load it into SQL Server

I am new to SSIS. I am trying to load data from an XML file into a SQL Server table. I have created the project and can transform and load the data into the table, but with one issue. Below is a sample of the XML data:
<EventLocationInfo>
  <Facility>NY 31</Facility>
  <Direction>eastbound</Direction>
  <City>Cicero</City>
  <County>Onondaga</County>
  <State>NY</State>
  <LocationDetails>
    <LocationItem>
      <Intersections>
        <PrimaryLoc>I-81</PrimaryLoc>
        <Article>area of</Article>
      </Intersections>
    </LocationItem>
    <LocationItem>
      <PointCoordinates Datum="NAD83">
        <Lat>43.1755981445313</Lat>
        <Lon>-76.1159973144531</Lon>
      </PointCoordinates>
    </LocationItem>
    <LocationItem>
      <AssociatedCities>
        <PrimaryCity>Cicero</PrimaryCity>
        <Article>area of</Article>
      </AssociatedCities>
    </LocationItem>
  </LocationDetails>
</EventLocationInfo>
The result I am getting splits this into three different rows, one per LocationItem.
Is it possible to generate only one row instead of three different rows? If possible, what Data Flow transformation can I use to get this result?
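For illustration, if I first loaded the three rows into a staging table, the single-row shape I am after could be produced afterwards with a grouped aggregation; a hypothetical sketch (staging table name, event_id key, and flattened columns are assumed):
-- Collapse the three LocationItem rows into one row per event;
-- MAX() picks the non-NULL value each row contributes.
SELECT
    event_id,
    MAX(PrimaryLoc)  AS PrimaryLoc,
    MAX(Lat)         AS Lat,
    MAX(Lon)         AS Lon,
    MAX(PrimaryCity) AS PrimaryCity
FROM staging_location_items
GROUP BY event_id;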
Please help, thanks in advance.
Brijesh