Informatica coding - SQL

I am currently working on a scenario in Informatica PowerCenter Designer where the situation is as follows:
SQ1: I am pulling employee records according to the criteria of having a layer of employees based on their hierarchy (client relation directors). This is the first source qualifier, in which I am doing a SQL override to extract data from 3 tables,
and for those selected employees I have to pull some other information, for example:
SQ2: which client relations they are handling, which is in a separate source qualifier, and
SQ3: some of the personal information from their profile, which is in a third source qualifier.
I have a single mapping with the three source qualifiers described above, and in all of them I am using a SQL override. My question is this: the data I pull in the first qualifier is a subset of the total employee records, but in source qualifiers 2 and 3 I have to pull all employee data and then join on employee_id in two joiners to finally collect the data for the layer of employees coming from source qualifier 1. What I want is to somehow save the employee IDs from SQ1 and use them in SQ2 and SQ3 so that I pull data for only that subset of employees. The problem is that I can't split the mapping, and I can't repeat the SQ1 subset-selection code in SQ2 and SQ3 because that would be a repetition of code and would take a long time to run; the number of records is also about one million. I can't find a way to do this, which is why I am asking for help here.
I am pulling data from DB2 and working in PowerCenter Designer 9.5.1.
I will be thankful for any guidance regarding the above issue.

What you can do, if all the tables are in the same database, is pull the source tables into one source qualifier, then override the SQL and create the join there.
So the point is that instead of 3 different source qualifiers you can have one source qualifier.
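For example, a single SQL override for that combined source qualifier could look roughly like this; the table names, column names and the WHERE clause are hypothetical stand-ins for your three DB2 tables and the SQ1 hierarchy criteria:
-- Sketch only: one override that joins all three sources in the database,
-- so only the selected layer of employees ever reaches the mapping.
SELECT e.employee_id,
       cr.client_id,
       cr.relation_type,
       p.first_name,
       p.last_name
FROM   employees        e
JOIN   client_relations cr ON cr.employee_id = e.employee_id
JOIN   employee_profile p  ON p.employee_id  = e.employee_id
WHERE  e.role = 'CLIENT_RELATION_DIRECTOR'
The ports of the source qualifier would then need to match this select list.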

I assume you are using three separate source qualifiers because the data is present in different databases. If not, doing an application join from three different source qualifiers (you will have to use 2 joiners) is very expensive. There are a couple of ways you can do this:
Split the mapping to stage the data first, then use this staging layer as the source to perform the more complex operations.
Identify your driving table. Since the record counts in SQ2 and SQ3 are bigger, I am assuming they can be the driving table. Use a lookup for SQ1 (since it is the smaller table, the cache time would not be very big).
I would still suggest you use a staging layer to extract and stage the data, then transform it. Try to perform database joins (or lookups) as much as you can instead of joining at the application layer.

Consider using a pipeline lookup as a query for your SQ1 and use it in the pipeline that joins SQ2 and SQ3.
Usage for the pipeline lookup can be found at:
https://marketplace.informatica.com/solutions/performance_tuning_pipeline_lookup
Let me know if it helps.

Related

Understanding a table's structure/schema in SQL

I wanted to reach out to ask if there is a practical way of finding out a given table's structure/schema, e.g., the column names and example row data inserted into the table (like the head function in Python), if you only have the table name. I have access to several tables in my current role; however, the person who developed the tables left the team I am on. I was interested in examining the tables closer via SQL Assistant in Teradata (these tables often contain hundreds of thousands of rows, hence there are issues of hitting CPU exception criteria errors).
I have tried the following select statement, but there is an issue of hitting internal CPU exception criteria limits.
SELECT TOP 10 * FROM dbc.table1
Thank you in advance for any tips/advice!
You can use one of these commands to get a table's structure details in Teradata:
SHOW TABLE Database_Name.Table_Name;
or
HELP TABLE Database_Name.Table_Name;
Either command shows the table structure details.
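If you also want a few example rows (the head-style view mentioned in the question), Teradata's SAMPLE clause is one option; Database_Name.Table_Name is just a placeholder here:
-- Returns 10 arbitrary rows from the table (SAMPLE picks rows at random,
-- unlike TOP, which simply returns the first 10 rows it finds)
SELECT * FROM Database_Name.Table_Name SAMPLE 10;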

Google Big Query - Date-Partitioned Tables with Eventual Data

Our use case for BigQuery is a little unique. I want to start using Date-Partitioned Tables but our data is very much eventual. It doesn't get inserted when it occurs, but eventually when it's provided to the server. At times this can be days or even months before any data is inserted. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question is: is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still have the benefits of a Date-Partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using Date-Partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you need to do is insert/load each data batch into its respective partition by referencing not just the table name but the table with a partition decorator - like yourtable$20160718.
This way you can load data into the partition that it belongs to.
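Once each batch has been loaded through its partition decorator, queries can prune to just that partition, e.g. in legacy SQL (mydataset.yourtable is a placeholder):
-- Scans only the 2016-07-18 partition instead of the whole table
SELECT *
FROM [mydataset.yourtable]
WHERE _PARTITIONTIME = TIMESTAMP('2016-07-18')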

join 78 specific codes from created table to linked table. Can't use IN() function (character limit), Can't do RI

I have a DB (Access 2010) that I am pulling data from, but I am trying to make it easier to pull specific cases instead of mucking about in Excel.
We have about 78 product type codes that we classify as a certain account type. Unfortunately I can't use an IN() function because there are too many characters (there is the 1024-character limit). I looked online for help, and it was suggested that I make a table to inner join on the product codes that I want.
I created a table with the codes I want to pull, then joined on the productcodetype field in the linked database table. Unfortunately, when I run the SQL nothing shows up, just blank. I tried different join combinations to no avail, read up further, and found that you can't enforce referential integrity on linked DB tables from non-linked DB tables.
I think this is my problem, but I'm not sure, and I don't know if I'm using the right language; I can't find a similar issue to mine, so I'm hoping it's an easy fix and I'm just not thinking about it the right way.
Is there any way to select certain cases (78 product type codes) from a large database using something like IN() or a reference table when I can't create a new table in the linked db?
Thank you,
K
You need to use the two tables and build a query that joins them. If your join doesn't return any results, make sure the joined fields are of the same data type and really share the same values.
If your data source is Excel, make sure there aren't any trailing blanks or other 'invisible' characters.
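For example, a join along these lines is usually all that is needed in Access SQL; ProductCodes (your local table of 78 codes) and ProductTable (the linked table) are made-up names:
-- Only rows whose code appears in the local ProductCodes table come through
SELECT t.*
FROM ProductTable AS t
INNER JOIN ProductCodes AS c
    ON t.ProductCodeType = c.ProductCode;
If the data types don't match or one side carries trailing blanks, the join will silently return nothing, which matches the blank result you are seeing.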

How to compare rows in source and destination tables dynamically in SQL Server

We receive a data feed from our customers and we get roughly the same schema each time, though it can change on the customer end as they are using a 3rd-party application. When we receive the data files, we import the data into a staging database with a table for each data file (students, attendance, etc.).

We then want to compare that data to the data that already exists in the database for that customer and see what has changed (either a column has changed or the whole row was possibly deleted) since the previous run. We then want to write the updated values or deleted rows to an audit table so we can go back and see what data changed from the previous data import. We don't want to update the data itself; we only want to record what's different between the two datasets. We will then delete all the data from the customer database and import the data exactly as is from the new data files without changing it (this directive has been handed down and cannot change).

The big problem is that I need to do this dynamically, since I don't know exactly what schema I'm going to be getting from our customers, as they can make customizations to their tables. I need to be able to dynamically determine what tables there are in the destination, and their structure, and then look at the source and compare the values to see what has changed in the data.
Additional info:
There are no ID columns in the source, though there are several columns that can be used as a surrogate key that would make up a distinct row.
I'd like to be able to do this generically for each table without having to hard-code values in, though I might have to do that for the surrogate keys for each table in a separate reference table.
I can use SSIS, SPs, triggers, etc., whichever would make more sense. I've looked at them all, including tablediff, and none seem to have everything I need, or the logic starts to get extremely complex once I get into them.
Of course any specific examples anyone has of something like this they have already done would be greatly appreciated.
Let me know if there's any other information that would be helpful.
Thanks
I've worked on a similar problem and used a series of metadata tables to dynamically compare datasets. These metadata tables described which datasets needed to be staged and which combination of columns (and their data types) served as the business key for each table.
This way you can dynamically construct a SQL query (e.g., with an SSIS script component) that performs a full outer join to find the differences between the two.
You can join your own metadata with SQL Server's metadata (using sys.* or INFORMATION_SCHEMA.*) to detect whether the columns still exist in the source and whether the data types are as you anticipated.
Redirect unmatched metadata to an error flow for evaluation.
This way of working is very risky, but it can be done if you maintain your metadata well.
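As a rough sketch of that schema check, assuming a hypothetical metadata table dbo.TableBusinessKeys (TableName, ColumnName, ExpectedDataType), you could flag missing or retyped source columns like this:
-- Rows returned here are columns that disappeared or changed type in the feed
SELECT m.TableName,
       m.ColumnName,
       m.ExpectedDataType,
       c.DATA_TYPE AS ActualDataType
FROM dbo.TableBusinessKeys AS m
LEFT JOIN INFORMATION_SCHEMA.COLUMNS AS c
       ON c.TABLE_NAME  = m.TableName
      AND c.COLUMN_NAME = m.ColumnName
WHERE c.COLUMN_NAME IS NULL
   OR c.DATA_TYPE <> m.ExpectedDataType;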
If you want to compare two tables to see what is different, the keyword is EXCEPT:
select col1,col2,... from table1
except
select col1,col2,... from table2
This gives you everything in table1 that is not in table2.
select col1,col2,... from table2
except
select col1,col2,... from table1
This gives you everything in table2 that is not in table1.
Assuming you have some kind of useful, durable primary key on the two tables: a key that appears in both result sets is a change, a key that appears only in the first set is an insert, and a key that appears only in the second set is a delete.
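Putting the two EXCEPT sets together on that key classifies every difference; staging.Students, dbo.Students and StudentID below are purely illustrative names and assume both tables share the same column list:
-- n: rows in the new feed that differ from the existing data
-- o: rows in the existing data that differ from the new feed
SELECT COALESCE(n.StudentID, o.StudentID) AS StudentID,
       CASE WHEN o.StudentID IS NULL THEN 'insert'
            WHEN n.StudentID IS NULL THEN 'delete'
            ELSE 'update'
       END AS ChangeType
FROM (SELECT * FROM staging.Students EXCEPT SELECT * FROM dbo.Students) AS n
FULL OUTER JOIN
     (SELECT * FROM dbo.Students EXCEPT SELECT * FROM staging.Students) AS o
       ON o.StudentID = n.StudentID;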

SSIS Moving data from one place to another

I was asked by the company I work for to create an SSIS package that will take data from a few tables in one data source, change a few things in the data, and then put it in a few tables in the destination.
The main entity is "Person". In the people table, each person has a PersonID.
I need to loop over these records and, for each person, take his orders from the orders table and other data from a few other tables.
I know how to take data from one table and just move it to a different table in the destination. What I don't know is how to manipulate the data before dumping it into the destination. Also, how can I get data from a few tables for each PersonID?
I need to be done with this very fast, so if someone can tell me which items in SSIS I need to use and how, that would be great.
Thanks
Microsoft has a few tutorials.
Typically it is easiest to simply do your joins in SQL before extracting and use that query as the source for extraction. You can also do data modification in that query; see the sketch below.
I would recommend using code in SSIS tasks only for things where SQL is problematic - custom scalar functions (which can be quicker in the scripting runtime) and handling disparate data sources.
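For instance, the extraction query could already bring the per-person data together and do simple manipulation in SQL; all table and column names below are made up for illustration:
-- One row per person/order, with the name manipulation done in the query itself
SELECT p.PersonID,
       LTRIM(RTRIM(p.FirstName)) + ' ' + LTRIM(RTRIM(p.LastName)) AS FullName,
       o.OrderID,
       o.OrderDate,
       a.City
FROM dbo.People AS p
LEFT JOIN dbo.Orders AS o
       ON o.PersonID = p.PersonID
LEFT JOIN dbo.Addresses AS a
       ON a.PersonID = p.PersonID;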
I would start with the Data Flow Task.
Use the OLE DB Source to execute a stored procedure that will read, manipulate, and return the data you need.
Then you can pass that to either an OLE DB Destination or an OLE DB Command that will move it to the destination.