Reading metadata CSV from a datalake, too big for a lookup activity

I need to create a pipeline that reads CSVs from a folder and loads the data, starting at row 8, into an Azure SQL table. The first 5 rows go into a different table ([tblMetadata]).
So far I have done this with a Lookup activity. It works fine, but one of the files is bigger than 6 MB and the Lookup fails on it.
I checked all the options in Lookup and read everything about the Copy activity (which I am using to load the main data, skipping 7 rows). The pipeline was created using the GUI.
The output from the Lookup is used as parameters for a stored procedure that inserts into tblMetadata.
Can someone advise me how to deal with this? At the moment I am in training, and no one on site can help me.

You could probably do this with a single Data Flow activity that has a couple of transformations.
You would use a Source transformation that reads from a folder using folder paths and wildcards, then add a conditional split transformation to send different rows to different sinks.
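The row-routing logic that the conditional split would apply can be sketched outside Data Factory. The Python below is only an illustration of that logic, not Data Flow syntax; the function name and the sample data are hypothetical, and it assumes rows 1-5 are metadata, rows 6-7 are discarded, and data starts at row 8, as described in the question.

```python
import csv
import io


def split_metadata_and_data(text, metadata_rows=5, data_start_row=8):
    """Return (metadata, data) using 1-based row numbers."""
    metadata, data = [], []
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if row_number <= metadata_rows:
            metadata.append(row)   # would go to the tblMetadata sink
        elif row_number >= data_start_row:
            data.append(row)       # would go to the main table sink
        # rows in between (6 and 7 by default) are dropped
    return metadata, data


# Hypothetical ten-line file standing in for one CSV from the folder.
sample = "\n".join(f"line{i},value{i}" for i in range(1, 11))
metadata, data = split_metadata_and_data(sample)
```

Because the rows are streamed one at a time, a split like this does not depend on the whole file fitting into a single activity output.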

I worked around it in a different way: I modified the CSVs being imported so that the whole metadata is in the first row (this was part of a different project of mine), and then used "First row only" in the Lookup.

Related

Update multiple Excel sheets of one document within one Pentaho Kettle transformation

I am researching a standard sample from the Pentaho DI package: GetXMLData - Read parent children rows. It reads parent rows and children rows separately from the same XML input. I need to do the same and update two different sheets of the same MS Excel document.
My understanding is that the normal way to achieve this is to put the first sequence in one transformation file with XML Output or Writer, the second sequence in a second file, and then create a job that chains from Start through the 1st and 2nd transformations.
My problems are:
When I try to chain the above sequences, I lose the content of the first updated Excel sheet in the final document;
I need to end up with just one file, either a job or a transformation, without dependencies (in the scenario proposed above I would have 1 KJB job + 2 KTR transformation files).
Questions are:
Is it possible to join the 2 sequences from the above sample with some wait node before starting to update the 2nd Excel sheet?
If the above doesn't work: is it possible to embed transformations in the job instead of referencing them from external files?
And an extra question: which is better to use, Excel Output or Excel Writer?
=================
UPDATE:
Based on @AlainD's proposal I have tried putting a Block node in between. Here is the result:
It looks like the Block step can be an option, but somehow it doesn't work as expected with the Excel Output / Writer nodes (or I am doing something wrong). What I have observed is that Pentaho tries to execute the steps after the Block step before the Excel file has been closed properly by the previous step. That leads to one of the following: I either get an Excel file with one empty sheet, or the generated result file is malformed.
My input XML file (from Pentaho distribution) & test playground transformation are: HERE
NOTE: While playing do not forget to remove generated MS Excel files between runs.
Screenshot:
Any suggestions how to fix my transformation?

The pattern goes as follows:
Read the data: 1 row per child, with the parent data in one or more columns.
Group the data: 1 row per parent; drop the children and keep the parent data. Transform and save as needed.
Go back to the original data and, for each row (child), look up the parent in the grouped data flow.
The result is one row per child plus the needed columns of the transformed parent. Transform and save as needed.
It is a pattern; you may want to change the flow and/or sort to speed it up. But it will not lock up or exhaust memory: the Group By and lookup steps are pretty reliable.
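As a rough illustration of the group-then-lookup pattern above, here is a small Python sketch. It is not Pentaho code; the field names (parent_id, parent_name, child) and the upper-casing "transformation" are assumptions made up for the example.

```python
# Flat input: one row per child, with the parent data repeated on each row.
rows = [
    {"parent_id": "P1", "parent_name": "alpha", "child": "a1"},
    {"parent_id": "P1", "parent_name": "alpha", "child": "a2"},
    {"parent_id": "P2", "parent_name": "beta", "child": "b1"},
]


def group_parents(rows):
    """Group step: keep one (transformed) record per parent, keyed by parent_id."""
    parents = {}
    for row in rows:
        parents.setdefault(
            row["parent_id"],
            {"parent_id": row["parent_id"], "parent_label": row["parent_name"].upper()},
        )
    return parents


def attach_parent(rows, parents):
    """Lookup step: for each child row, fetch the needed column of its parent."""
    return [
        {"child": row["child"], "parent_label": parents[row["parent_id"]]["parent_label"]}
        for row in rows
    ]


parents = group_parents(rows)            # would feed the parent sheet
children = attach_parent(rows, parents)  # would feed the children sheet
```

The parent result and the child result are built from the same input in two passes, which mirrors writing the parent sheet first and the children sheet second.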

Question 1: Yes, the step you are looking for is named Block this step until steps finish, or Blocking Step (until all rows are processed).
Question 2: Yes, you can pass the rows from one transformation to another via the job. But it would be wiser to first produce the parent sheet and, when that is finished, read it again in the second transformation. You can also pass the rows to a sub-transformation, or use other architecture strategies...
Question 3: (Short answer) The Excel Writer appends data (a new sheet or new rows) to an existing Excel file, while the Excel Output creates and feeds a one-sheet Excel file.

Output Multiple flat files to multiple SQL tables

I have multiple flat files, and I need to load each flat file into a different table using SSIS. I created a Foreach File Enumerator to pick up every source file, but it loads all of them into the same table, which then throws an error because they have different fields.
How can I configure a package to output to different tables?

You cannot, at least within a single data flow, have different source metadata. DTS supported this but SSIS does not. The number and type of columns in an SSIS data flow must be fixed.
You can have multiple data flows within your ForEach loop and then enable/disable them based on the file name or some other criteria to support loading different sources and destinations.
Some might suggest you read each row in as a single column, then use a conditional split based on file type, and then use a derived column to split it out into specific columns. That works, but it is a maintenance nightmare I would not wish on my most hated enemy.
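The "pick a data flow based on the file name" idea boils down to a routing table. The sketch below shows that decision in Python rather than in SSIS expressions; the file-name patterns and table names are hypothetical, and in a real package the same test would sit in precedence-constraint expressions inside the ForEach loop.

```python
from fnmatch import fnmatch

# Hypothetical routing table: file-name pattern -> destination table.
ROUTES = [
    ("customers_*.csv", "dbo.Customers"),
    ("orders_*.csv", "dbo.Orders"),
]


def target_table(file_name):
    """Return the destination table for a file, or None if no route matches."""
    for pattern, table in ROUTES:
        # Lower-case the name so matching behaves the same on every OS.
        if fnmatch(file_name.lower(), pattern):
            return table
    return None
```

A file that matches no pattern returns None, which is the case where you would skip the file or raise an alert rather than load it into the wrong table.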

Get list of columns of source flat file in SSIS

We get weekly data files (flat files) from our vendor to import into SQL, and at times the column names change or new columns are added.
What we have currently is an SSIS package to import columns that have been defined. Since we've assigned the mapping, SSIS only throws up an error when a column is absent. However when a new column is added (apart from the existing ones), it doesn't get imported at all, as it is not named. This is a concern for us.
What we'd like is to get the list of all the columns present in the flat file so that we can check whether any new columns are present before we import the file.
I am relatively new to SSIS, so detailed help would be much appreciated.
Thanks!

Exactly how to code this will depend on the rules for the flat file layout, but I would approach this by writing a script task that reads the flat file using the file system object and a StreamReader object, and looks at the columns, which are hopefully named in the first line of the file.
However, about all you can do if the columns have changed is send an alert. I know of no way to dynamically change your data transformation task to accommodate new columns. It will have to be edited to handle them. And frankly, if all you're going to do is send an alert, you might as well just use the error handler to do it, and save yourself the trouble of pre-reading the column list.
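The pre-read itself is small. Here is a sketch of the comparison in Python (an SSIS script task would be written in C# or VB, so treat this only as an outline of the logic); it assumes a comma-delimited file with the column names on the first line, and the function name is made up for the example.

```python
import csv
import io


def compare_header(first_line, expected_columns, delimiter=","):
    """Return (new, missing): columns not expected, and expected columns not present."""
    actual = next(csv.reader(io.StringIO(first_line), delimiter=delimiter))
    actual = [name.strip() for name in actual]
    new = [name for name in actual if name not in expected_columns]
    missing = [name for name in expected_columns if name not in actual]
    return new, missing
```

If either list is non-empty, the package can send the alert and skip the load.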

I agree with the answer provided by @TabAlleman. SSIS can't natively handle dynamic columns (and neither can your SQL destination).
May I propose an alternative? You can detect a change in headers without using a C# Script Task. One way to do this would be to create a flat file connection that reads the entire row as a single column. Use a Conditional Split to discard anything other than the header row. Save that row to a RecordSet object. Any change? Send an email.
The "Get Header Row" DataFlow would look like this. Add a Row Number if needed.
The Control Flow level would look like this. Use a ForEach ADO RecordSet enumerator to assign the header row value to an SSIS variable, CurrentHeader.
Above, the precedence constraints (fx icons) of
[@ExpectedHeader] == [@CurrentHeader]
[@ExpectedHeader] != [@CurrentHeader]
determine whether you load data or send email.
Hope this helps!

I have worked for banking clients, and for banks, randomly adding columns to a database is not possible due to fed requirements and rules. That said, I get that yours is not a fed-regulated business. So here are some steps.
This is not a code issue but more a matter of soft skills and working with other teams (yours and your vendor's).
Steps you can take are:
(1) Agree on a solid column structure that you always require, because for newer columns, older data rows will carry NULL.
(2) If a new column is going to be sent by the vendor, you or your team need to make the DDL/DML changes to the table where the data will be inserted, with the correct data type, of course.
(3) Document this change in the data dictionary, as over time you or another team member will analyze this data and will want to know what each attribute or column is for.
(4) Long term, you do not want to keep changing the table structure monthly because one of your many vendors decided to change the way they send you data. Some clients push back very aggressively, others not so much.

If a third-party tool is an option for you, check out CozyRoc's Data Flow Task Plus. It handles variable columns in sources.

SSIS cannot make the columns dynamic.

One thing I always do is use a script task to read the first and last lines of a file.
If the first line is not the expected list of CSV columns, I mark the file as errored and continue or fail as required.
Headers are obviously important, but so are footers. Files can, through any unknown issue, be only partially built. Requesting that the header also be placed at the end of the file gives you a double check.
I also do not know if SSIS can do this dynamically, but it never ceases to amaze me how people add columns or change their order and assume things will still work.
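A sketch of that first-and-last-line check, in Python rather than an SSIS script task. It assumes the vendor repeats the header as the final line of the file, which is the arrangement suggested above; the function names and the sample file are hypothetical.

```python
import os
import tempfile


def first_and_last_lines(path):
    """Return the first and last non-empty lines of a text file."""
    first = last = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            if first is None:
                first = line
            last = line
    return first, last


def file_looks_complete(path, expected_header):
    """True when the file starts with the expected header and ends with the same line."""
    first, last = first_and_last_lines(path)
    return first == expected_header and last == expected_header


# Demo with a throwaway file; real files would come from the vendor drop folder.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write("id,name\n1,a\n2,b\nid,name\n")
complete = file_looks_complete(tmp.name, "id,name")
os.remove(tmp.name)
```

A truncated file fails the second half of the check even when its header is correct, which is the point of asking for a footer.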

1 - SSIS does not provide dynamic source and destination mapping, but some third-party components, such as Data Flow Task Plus, support this feature.
2 - We can achieve this using an SSIS script task.
3 - If the header is correct, proceed with the migration; otherwise fail the package before the DFT executes.
4 - Read the header line using the script task and store it in an array or list object.
5 - Then compare those array values to user-defined variables, declared earlier, that contain the default column names.
6 - If the values match exactly, proceed; otherwise fail the package.

SSIS Pass Datasource Between Control Flow Tasks

I'm having trouble solving this little problem; hopefully someone can help me.
In my SSIS package I have a data flow task.
There are several different transforms, merges and conversions that happen in here.
At the end of the data flow task, there are two datasets: one that contains two numbers that need to be compared, and another that contains a bunch of records.
Ideally, I would like to have these passed on to a whole new data flow task (in a separate sequence container) where I can do some validation work on them and separate the logic.
I can't for the life of me figure out how to do it. I've tried looking into scripting and storing the datasets as variables, but I'm not sure this is the right way to do it.
The next step is to export the large dataset as a spreadsheet, but before this happens I need to compare the two numbers from the other dataset and ensure they're correct.

To pass data from one data flow to another, you have to have a temporary location.
This means that you have to write the data to a destination in one data flow and then read that data in another data flow.
You can put the data in a number of destinations:
database table
raw file
flat file
dataset variable (recordset destination)
any other destination component that you can read from with a corresponding source component, by writing a script, or whatever
Raw files are meant to be used for cases like this. They are binary, and as such they are extremely fast to write to and read from.
If you insist on using a recordset destination, take a look at http://consultingblogs.emc.com/jamiethomson/archive/2006/01/04/SSIS_3A00_-Recordsets-instead-of-raw-files.aspx because there is no recordset source component.
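The shape of "write to a temporary location, then read it back in a second step" can be sketched in Python. This is not SSIS code: a temporary CSV file stands in for the raw file or staging table, the two numbers are passed in directly, and all names are hypothetical.

```python
import csv
import os
import tempfile


def stage_rows(rows, path):
    """First step: write the records to a temporary location."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def read_staged(path):
    """Second step: read the staged records back."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def export_if_valid(staged_path, number_a, number_b, export_path):
    """Export the staged rows only when the two control numbers agree."""
    if number_a != number_b:
        return False
    stage_rows(read_staged(staged_path), export_path)
    return True


workdir = tempfile.mkdtemp()
staged = os.path.join(workdir, "staged.csv")
export = os.path.join(workdir, "export.csv")
never = os.path.join(workdir, "never.csv")
stage_rows([["1", "a"], ["2", "b"]], staged)
exported = export_if_valid(staged, 2, 2, export)  # numbers match: export happens
blocked = export_if_valid(staged, 2, 3, never)    # mismatch: nothing is written
```

Keeping the comparison outside the step that produces the data is what lets the validation logic live in its own task.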

A Data Flow Task needs to have a destination; a Data Flow Task likewise is NOT a destination. Otherwise the data doesn't go anywhere in the pipeline. From my experience, your best bets are to:
1) Pump the data into staging tables in SQL Server, and then pick up the validations from there.
2) Do the validations in the same Data Flow task.

Help Importing CSV file with Variable Columns per Row into SQL Table using Import tool or SSIS

I am stuck with a CSV file with over 100,000 rows that contains product images from a provider. Here are the details of the issue, I would really appreciate some tips to help resolve this. Thanks.
The File has 1 Row per product and the following 4 columns.
ID,URL,HEIGHT,WIDTH
example: 1,http://i.img.com,100,200
Problem starts when a product has multiple images.
Instead of having 1 row per image the file has more columns in same row.
example:
1,http://i.img.com,100,200,//i.img.com,20,100,//i.img.com,30,50
Note that only first image has "http://" remaining images start with "//"
There is no telling how many images per product hence no way to tell how many total columns per row or max columns.
How can I import this using SSIS or the SQL import wizard?
Also I need to do this on regular intervals.
Thank you for your help.

I don't think that you can use any standard SSIS task or wizard to do this. You're going to have to write some custom code which parses each line. You can do this in SSIS using VB code or you can import the file into a staging table that's just a single column to hold each row and do the parsing in SQL. SSIS will probably be faster for this kind of operation.
Another possibility is to preprocess the file using regex or a search-and-replace command. Try to get double quotes around the image list; then you should be able to import the whole file fine, with the quoted part going into a single column. Catching the start of the string should be easy enough, given the "http://" for which you can search. Determining where the end quote goes might be more of a problem.
A third potential solution would be to get the source to fix the data. Even if you can't get the images in separate rows (or another file with separate rows, which would be ideal), maybe you can get the double-quotes added from the source as part of the export. This would likely be less error-prone than using the search-and-replace method.
Good luck!
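To give an idea of what the custom parsing could look like, here is a Python sketch (inside SSIS the same loop would go in a script component). It assumes every row is an ID followed by one or more URL,HEIGHT,WIDTH groups, as in the question's example, and turns each group into its own record; the function name is made up.

```python
import csv
import io


def parse_image_rows(text):
    """Turn 'ID,URL,H,W[,URL,H,W...]' lines into one (id, url, height, width) tuple per image."""
    records = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        product_id, rest = fields[0], fields[1:]
        if len(rest) % 3 != 0:
            raise ValueError(
                f"product {product_id}: expected URL,HEIGHT,WIDTH groups, got {len(rest)} fields"
            )
        for start in range(0, len(rest), 3):
            url, height, width = rest[start:start + 3]
            records.append((product_id, url, int(height), int(width)))
    return records


sample = (
    "1,http://i.img.com,100,200,//i.img.com,20,100,//i.img.com,30,50\n"
    "2,http://i.img.com,10,20\n"
)
records = parse_image_rows(sample)
```

The output has a fixed four-column shape, so it can be loaded into a normal table with one row per image regardless of how many images a product has.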
