SSIS Check Excel source rows redirect rows to another table on 'x' number of field matches - sql

I work in a sales-based environment and our data consists of 'leads'.
Let's say we record CompanyName, PhoneNumber, Address1 and PostCode (ZIP). These rows are seeded with a unique ID in the schema.
The leads come in from various sources, are compiled onto a spreadsheet, and are then imported into SQL Server 2012 using SSIS.
After a validation check to see if a file exists, we use a simple data flow consisting of an Excel Source, Derived Column, Data Conversion and finally an OLE DB Destination.
I'm sure my requirement has a relatively simple solution, and I understand the first step of what I need to achieve: I need to take a sample of data from the last rolling two months, and if 2 or more fields in the source Excel file match the corresponding fields in the destination SQL table, I want to redirect the row to another table.
I am unsure which combination of components I could use to achieve this. I believe Fuzzy Lookup is not what I am looking for, as I need exact field matches; I have looked at the Lookup component, but I am unsure if this is the way to go.
Could anyone please advise on how I can best achieve this, as simply as possible?

You can use the Lookup component to check for matches in your existing table. However, implementing the requirement of checking for any two or more matching fields will be fairly complicated. Your expression would be long and complex, basically consisting of:
(using pseudo code for readability)
IIF((source.a = lookup.a AND source.b = lookup.b) OR (source.a = lookup.a AND source.c = lookup.c) OR (source.b = lookup.b AND source.c = lookup.c) OR ...)
...and so on, for every combination of two columns you want to test.
I would do this by importing the entire spreadsheet into a staging table, and doing the existing-rows check in a SQL stored procedure that moves the data to the desired destination table.
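A minimal sketch of what that check could look like inside the stored procedure, assuming a staging table dbo.LeadStaging, a destination table dbo.Lead with a CreatedDate column to provide the rolling two-month window, and a redirect table dbo.LeadDuplicate (all object names here are illustrative, not from the question):

-- Redirect staged rows where 2 or more fields match a lead
-- loaded within the last rolling two months
INSERT INTO dbo.LeadDuplicate (CompanyName, PhoneNumber, Address1, PostCode)
SELECT s.CompanyName, s.PhoneNumber, s.Address1, s.PostCode
FROM dbo.LeadStaging AS s
WHERE EXISTS (
    SELECT 1
    FROM dbo.Lead AS d
    WHERE d.CreatedDate >= DATEADD(MONTH, -2, GETDATE())
      AND (CASE WHEN d.CompanyName = s.CompanyName THEN 1 ELSE 0 END
         + CASE WHEN d.PhoneNumber = s.PhoneNumber THEN 1 ELSE 0 END
         + CASE WHEN d.Address1    = s.Address1    THEN 1 ELSE 0 END
         + CASE WHEN d.PostCode    = s.PostCode    THEN 1 ELSE 0 END) >= 2
);
-- A second INSERT with NOT EXISTS over the same predicate then moves
-- the remaining rows to dbo.Lead.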

Related

PDI/Kettle - Passing data from previous hop to database query

I'm new to PDI and Kettle, and what I thought would be a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record (i.e. a vendor) exists. I would like to get the name of the vendor by reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, or how to pass the desired value to that variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice here is a Stream lookup: for each record in the main flow (VendorRatings), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, or name, or firstname+lastname).
First "hurdle" : Once the path of the csv file defined, press the Get field button.
It will take the first line as header to know the field names and explore the first 100 (customizable) record to determine the field types.
If the name is not on the first line, uncheck the Header row present, press the Get field button, and then change the name on the panel.
If there is more than one header row or other complexities, use the Text file input.
The same is valid for the lookup step: use the Get lookup field button and delete the fields you do not need.
Given that:
- there is at most one VendorRating per vendor, and
- you have to do something if there is no match,
I suggest the following flow:
Read the CSV and, for each row, look up in the table (i.e. the lookup table is the SQL table rather than the CSV file), putting a default value on rows with no match. I suggest something really visible like "--- NO MATCH ---".
Then a filter redirects the no-match rows to the alternative action (here: an insert into the SQL table), after which the two flows are merged back into the downstream flow.
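For reference, here is the same lookup-with-default logic expressed as SQL (a sketch only, in T-SQL syntax; the staging table and column names are assumptions):

-- Rows where MatchedVendor comes back as the default are the ones
-- the filter step would route to the insert branch
SELECT c.Vendor,
       ISNULL(v.Vendor, '--- NO MATCH ---') AS MatchedVendor
FROM dbo.CsvStaging AS c
LEFT JOIN dbo.VendorRatings AS v
       ON v.Vendor = c.Vendor;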

Merging database data from two different spreadsheets

Newbie here, working on something a bit complicated. Not sure how to start or what the best way is; looking for some advice and tips.
So, we have 2 systems running MS Dynamics POS 2009 and have extracts of all data (inventory/stock) in spreadsheets. Both databases have pretty much the same items, but because they have been run separately, all the naming and Part Numbers are in different formats.
I need to create one database (one Excel file) from both, where a partial match on Part Number will be identified and "merged" (keeping the Part Number and Description from sheet1, and updating Stock to sheet1 stock + sheet2 stock).
The problem is that the Part Numbers are written in completely different styles (by different people) and can only be matched by some partial match (I guess the last 3-6 characters of the Part Number).
I am not an Excel expert, so any advice and tips would be appreciated.
I have also thought of loading those Excel sheets into 2 separate SQL databases and doing it from SSMS, as I am not sure Excel can cope with this.
Thanks
I'm not 100% sure of the source data, but based on the available information here are some possible steps:
-Create a new database in SSMS.
-Load the data from your Excel extracts with the Import Data tool (right-click on your newly-created database, Tasks, Import Data). This will pull up a wizard that will transform your Excel spreadsheet into a table in SQL Server. Do this for all spreadsheets.
http://searchsqlserver.techtarget.com/feature/The-SQL-Server-Import-and-Export-Wizard-how-to-guide
-You may be able to do some matching based on the start/end characters and use a MERGE statement to get unique data. The MERGE statement allows you to set a match criterion and then take certain actions depending on a positive or negative match. For example, if your different POS systems have two different spreadsheets of products where there is some overlap, but also some products that are unique to each system, you could start with a source table from the first system and insert into it only the products that are unique to the other system; if there is a match, do nothing. Something like:
MERGE ProductA AS A
USING ProductB AS B
    ON RIGHT(A.ProductID, 5) = RIGHT(B.ProductID, 5)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductID, Description)
    VALUES (B.ProductID, B.Description);
https://www.simple-talk.com/sql/learn-sql-server/the-merge-statement-in-sql-server-2008/
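Since the question also asks for stock to be combined when part numbers do match, a WHEN MATCHED clause can do that in the same statement. A sketch, assuming both tables have a Stock column (an assumption), and noting that MERGE will raise an error if the last-5-character match is not unique:

MERGE ProductA AS A
USING ProductB AS B
    ON RIGHT(A.ProductID, 5) = RIGHT(B.ProductID, 5)
WHEN MATCHED THEN
    -- combine the stock counts from both systems
    UPDATE SET A.Stock = A.Stock + B.Stock
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductID, Description, Stock)
    VALUES (B.ProductID, B.Description, B.Stock);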

SQL Split functions and imported data

OK, I've got a database table where data gets dumped by this horrid little program that I despise but can't change at the moment. It has merchant data in there: names, addresses, and a set of categories that are pipe-delimited. What I need is a clean way to split these out, so I have one row for each merchant/category pair. From there, I can easily get it into the new data structure. This will need to be a repeatable process for a short period of time. I realize the optimal solution is to rid myself of this structure, but I've wracked my brain trying to figure out how to do this cleanly in SQL.
I already have a function in the database that will split a delimited string and return a table.
This is in SQL Server 2008, btw.
Edit (for clarity)
Basically, the following might be a merchant (with the categories attached; other fields redacted for simplicity, and using commas as field delimiters here):
Jimbo's Bait Shoppe, Bait|Sports Gear|Sandwiches
What I need is:
Jimbo's Bait Shoppe, Bait
Jimbo's Bait Shoppe, Sports Gear
Jimbo's Bait Shoppe, Sandwiches
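For reference, if the existing split function is a table-valued function, CROSS APPLY produces exactly this shape. A sketch, assuming the function is called dbo.SplitString(@list, @delimiter) and returns a single Value column (both the name and the shape are assumptions about your function):

SELECT m.MerchantName,
       s.Value AS Category
FROM dbo.MerchantDump AS m
CROSS APPLY dbo.SplitString(m.Categories, '|') AS s;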
If you have already written a function that splits the string and returns a table, you can use a trigger.
Create an INSERT trigger on the table where the "horrid" program spits out the data. The trigger then takes the unformatted data and populates two clean tables (I think in your case you should have two tables: one for merchants and one for categories, linked in a one-to-many relationship via a MerchantID).
In this case the table with the unformatted data acts as a "dirty" staging table, which you can cleanse straight after the "horrid" program has imported a file.
Please comment if you need help with the triggers.
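A minimal sketch of such a trigger, reusing the CROSS APPLY pattern shown above (again, dbo.SplitString and all table names are assumptions):

CREATE TRIGGER trg_MerchantDump_Insert
ON dbo.MerchantDump
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- populate the clean merchant table from the freshly dumped rows
    INSERT INTO dbo.Merchant (MerchantName)
    SELECT i.MerchantName
    FROM inserted AS i;

    -- one row per merchant/category pair, via the existing split function
    INSERT INTO dbo.MerchantCategory (MerchantID, Category)
    SELECT m.MerchantID, s.Value
    FROM inserted AS i
    JOIN dbo.Merchant AS m ON m.MerchantName = i.MerchantName
    CROSS APPLY dbo.SplitString(i.Categories, '|') AS s;
END;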

Splitting data by column value into an indefinite number of tables using an ETL tool

I'm trying to split a table into multiple tables based on the value of a given column using Talend Open Studio. Let's say this column can contain any of the integer values 1, 2, 3, etc.; according to this value, the rows should go to table_1, table_2, table_3, etc.
It would be best if I could solve this when the number of different values in that column is not known in advance, but for now we can assume that all the output tables already exist. The bottom line is that the number of different values, and therefore the number of different tables, is high enough that setting up the individual filters manually is not an option.
Is it possible to solve this using Talend Open Studio or a similar open source ETL tool like Pentaho Kettle?
Of course, I could just write a simple script myself, but I would prefer to use a proper ETL tool since the complete ETL process is quite complex.
In PDI (Pentaho Kettle) you could do this with partitioning (a right-click option on the step, IIRC). Partitioning in PDI is designed for exactly this sort of problem.
Yes, it's possible to split the data into different tables on the basis of a single column, but for that you need to create the table name dynamically:
tFileInputDelimited -> tFlowToIterate -> tFixedFlowInput -> then you can use globalMap() to get the column values and use them to separate the data into different tables, by using globalMap(column used to separate the data) in the table name.
The first solution that came to my mind was using the replicator to transport the current row to three filters, which act as guards and only let through rows with either 1, 2 or 3 in the given column. Pic: http://i.imgur.com/FmvwU.png
But you could also build the table name dynamically, if that is what you want. Pic: http://i.imgur.com/8LR7Q.png
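If the rows end up in SQL Server anyway, the same routing can also be done after loading, outside the ETL tool. A T-SQL sketch, assuming the data lands in dbo.SourceTable, the routing column is SplitCol, and the target tables table_1, table_2, ... already exist (all names are assumptions):

DECLARE @val INT, @sql NVARCHAR(MAX);

DECLARE vals CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT SplitCol FROM dbo.SourceTable;

OPEN vals;
FETCH NEXT FROM vals INTO @val;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- build the target table name dynamically, e.g. table_1, table_2, ...
    SET @sql = N'INSERT INTO dbo.table_' + CAST(@val AS NVARCHAR(10))
             + N' SELECT * FROM dbo.SourceTable WHERE SplitCol = @v;';
    EXEC sp_executesql @sql, N'@v INT', @v = @val;

    FETCH NEXT FROM vals INTO @val;
END;

CLOSE vals;
DEALLOCATE vals;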

SSIS - Column names as Variable/Changed Programmatically

I'm hoping someone might be able to help me out with this one. I have 24 files in CSV format; they all have the same layout and need to be joined onto some pre-existing data. Each file has a single column that needs to be joined onto the rest of the data, but those columns all have the same names in the original files. I need the columns automatically renamed to the filename as part of the join.
The final name of the column needs to be: Filename - data from another column.
My current approach is to use a Foreach Loop container and use the variable generated by the container to name the column, but there's nowhere I can input that value in the join, and even if I could, it would mess up the output mappings, because the column names would be different.
Does anyone have any thoughts on how to get around these issues? Whoever has an idea will be saving my neck!
EDIT: In case some more detail helps... The SSIS version is 2008 and there are only a few hundred rows per file. It's basically a one-time task to collect a full billing history from several bills which are issued monthly.
The source data has three columns: the product number, the product type and the cost.
The destination needs to have 24*3 columns, each of which holds a monthly cost for a given product category. There are three product categories and 24 bills (in separate files), hence 24*3.
So, hopefully I'm being a bit clearer: all I really need to know is how to change the name of a column using a variable passed in from the Foreach File container.
I think the easiest approach is to create a tmp database (aka a staging db), load the data from the files into it, and define stored procedures to which you can pass parameters (e.g. file names) so you can build your own logic.
Cheers, Mario
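To make that concrete, a sketch of the staging approach, assuming all 24 files are loaded into one shared staging table with an extra FileName column (populated, for example, by a Derived Column fed from the Foreach Loop variable); every name below is illustrative:

CREATE TABLE dbo.BillStaging (
    FileName      NVARCHAR(100),
    ProductNumber NVARCHAR(50),
    ProductType   NVARCHAR(50),
    Cost          DECIMAL(18, 2)
);

-- One output column per file/category pair (24 files x 3 categories);
-- the CASE expressions can be generated dynamically from
-- SELECT DISTINCT FileName, ProductType FROM dbo.BillStaging.
SELECT ProductNumber,
       MAX(CASE WHEN FileName = 'Bill01.csv' AND ProductType = 'A'
                THEN Cost END) AS [Bill01 - A],
       MAX(CASE WHEN FileName = 'Bill01.csv' AND ProductType = 'B'
                THEN Cost END) AS [Bill01 - B]
FROM dbo.BillStaging
GROUP BY ProductNumber;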