How to handle file inputs with changing schemas in Talend

Questions: How do I continue to process files that differ substantially from a base schema and that trigger tSchemaComplianceCheck errors?
Background
Suppose I have a folder of Customer xls files called file1, file2, ..., file1000. Assume I have imported the file schema into the Talend repository and called it 6Columns, and I have the Talend job configured to iterate through each of the files and process them:
1-tFileInput ->2-tSchemaCompliance-6Columns -> 3-tMap ->4-FurtherProcessing
1. Read each Excel file
2. Compare it to the schema 6Columns
3. Format the output (rename columns)
4. Take the collection of Customer data and process it further
While processing, I notice that the schema compliance check is generating errors (errorCode 16). These point to a number of files (200) with a different schema, 13Columns, but there is no way to identify those files in advance and filter them into a subjob.
How do I amend my processing to correctly integrate the files with the 13Columns schema into the process (what is the recommended way of handling this), and how do I design it in case other schema changes occur?
1-tFileInput ->2-tSchemaCompliance-6Columns -> 3-tMap ->4-FurtherProcessing
|
|Reject Flow (ErrorCode 16)
|Schema-13Columns
|
|-> ??
Current Thinking When ErrorCode 16 detected
Option 1 Parallel. Take the file path for the current file and process it against 13Columns using a new FileInput before merging the 2 flows back into 1
Option 2 Serial. Collect the list of files that triggered the error and process them after I've finished with the compliance files?

You could try something like below :
tFileList - Read your input repository
tFileInput "schema6" - tSchemaComplianceCheck : read files as 6-columns schema
tMap_1 : further processing
In the reject part :
tMap after reject link : add a new column containing the filepath that has been rejected
tFlowToIterate : used to get an iterate link, acceptable input for tFileInputDelimited that follows.
tFileInput : read data as 13-columns schema. Following components are the same as in part 1.
After that, you can push your data to tHashOutput, in order to read them further in another subjob.

Repast: how to add and set a new parameter directly from the code instead of GUI

I want to create a parameter that contains a list of strings (a list of hub codes). The list is created by reading an external CSV file, so it can contain different codes depending on the hub codes in that file.
What I want is an easy, automatic way to perform batch runs for each hub code in the list.
So this question is:
1) how to add and set a new parameter directly from the code (during the initialization when reading the CSV) instead of GUI parameter panel?
2) how to avoid manual configuration of hub list in the batch run configuration
Something like this for adding the parameters should work in your ContextBuilder.
Parameters params = RunEnvironment.getInstance().getParameters();
((DefaultParameters)params).addParameter("foo", "Big Foo", Integer.class, 3, false);
You would read the csv file to get the parameter name and value.
I'm not sure I completely understand the batch run configuration question, but each batch run has a run number associated with it, which you can get with:
RunState.getInstance().getRunInfo().getRunNumber()
If you can associate line numbers in your csv parameter file with run number (e.g. run number 1 should use line 1, and so on), then each batch run would use a different parameter line.

How to create a http request that contains multiple FileHeaders?

I am trying to test an upload service that supports uploading multiple files, and I found this:
golang POST data using the Content-Type multipart/form-data
It shows how to create a request that uploads a single file, but I need to upload multiple files. Is there a simple way to create this kind of request?
update:
Please check lines 38 and 39 in the post "to support html5 multiple files uploading":
line 38: files := m.File["myfiles"]
line 39: for i, _ := range files {
It seems that a single name has to be set for multiple file headers in order to simulate HTML5 multiple-file uploading.
For each file, call CreateFormFile to create the header for the file. Call Write on the writer returned from CreateFormFile one or more times to write data to the file. When done with all files, close the multipart writer.
The top answer in the linked question uploads two files, one named "image" and one named "key". The data for the "image" is copied from a file. The data for "key" is simply the bytes "KEY".
The field name is the first argument to CreateFormFile. If you want to upload multiple files with the same name, use the same name each time you call CreateFormFile.
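As a rough sketch of the steps above, the following builds one request that carries several files under the same field name, then parses it the way a handler would to confirm that all of them arrive in m.File["myfiles"]. The upload type, the helper names and the example.invalid URL are made up for illustration; the file contents come from in-memory strings rather than from disk.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"strings"
)

// upload is a hypothetical in-memory file used only for this example.
type upload struct {
	filename string
	content  string
}

// newMultiFileRequest builds a POST request whose body has one part per
// upload, all sharing the same form field name.
func newMultiFileRequest(url, field string, uploads []upload) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	for _, u := range uploads {
		// Same field name on every call: the server sees them as one list.
		part, err := w.CreateFormFile(field, u.filename)
		if err != nil {
			return nil, err
		}
		if _, err := io.Copy(part, strings.NewReader(u.content)); err != nil {
			return nil, err
		}
	}
	// Close writes the trailing boundary; without it the body is incomplete.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

// countFiles parses the request as a server handler would and returns the
// number of file headers stored under field.
func countFiles(req *http.Request, field string) (int, error) {
	if err := req.ParseMultipartForm(1 << 20); err != nil {
		return 0, err
	}
	return len(req.MultipartForm.File[field]), nil
}

func main() {
	req, err := newMultiFileRequest("http://example.invalid/upload", "myfiles",
		[]upload{{"a.txt", "alpha"}, {"b.txt", "beta"}})
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	n, err := countFiles(req, "myfiles")
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("files under myfiles:", n)
}
```

To send real files, open each one with os.Open and pass it to io.Copy in place of the strings.NewReader call.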

How to fix source is empty error in XML source while using Foreach loop container in SSIS 2012?

I have an issue with a very simple task in SSIS 2012.
I have a for-each container that runs in FOR-EACH-FILE Enumerator mode. I want to read a target folder with XML files. The path to the folder is correctly configured. The files field is set to *.xml
The variable mapping is defined with the following variable: User::FileVar, Index 0.
Now I add a simple data flow task inside the container. The dataflow task only has a XML-Data Source task, that's it. For the XML Data source task, the XSD location is set. When I click choose columns, I can see the columns from the XSD schema.
BUT: When I save the XML task , I always get the error message: The Property XMLDataVariable is empty. I tried both data Access modes, XML file from variable and XML data from variable. The error message remains, I cannot run the package.
I don't use any expressions, neither at the foreach loop container nor at the data flow task.
I don't know what's wrong here; I followed the steps exactly as shown in some tutorials for older versions of SSIS.
Do you have any ideas?
The issue is that the XML Source is trying to validate the existence of the given file during the design time. However, you will know the file name only during runtime when the Foreach loop container executes and loops through every XML file available in a given folder.
I recreated an SSIS 2012 package using my answer to one of my other SO questions:
SSIS reading multiple xml files from folder
I was able to reproduce the error: The property "XMLDataVariable" on the XML Source was empty.
On the XML source, I set the property ValidateExternalMetadata to False. Setting this to false will force the package not to verify the existence of the xml file path during design time.
I was successfully able to execute the package.
Hope that helps.

SSIS Connection Error - File name not valid

I'm seeing an issue with an SSIS (SQL Server 2005) job where I'm getting the following error:
The file name "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\UNC\FOLDERS\filename.xls;Extended Properties="EXCEL 8.0;HDR=YES";" specified in the connection was not valid.
My searching around this site and others indicates that the most common cause of this is a permissions error but I don't believe that's the case in this situation since any number of files have successfully been processed through this implementation.
Here's an overview of the setup:
Vendors FTP files to us on a daily basis. A Windows service picks them up, copies them to a temporary directory and then calls SSIS jobs on those files. There are two SSIS jobs for each vendor: one for a snapshot data feed and one for a transaction listing.
There are currently over 50 different SSIS jobs in the overall process. All of them work except for one specific transaction job which fails with the above error in a script task step. Files come in at least daily with unique file names so I grab the job, determine the vendor based off the source directory and then the file type based off indicators in the file name to determine which SSIS job to call. Since file names change every day, when the service calls the SSIS job, I pass in a series of parameters including the vendor file name so it can properly connect to the file.
Each job begins with a script task that sets necessary variable values for the rest of the job. For example, since the vendor file name changes with each run, I pass in the vendor file name through the SSIS variables collection then set the connection string of a datasource using that file name as the DataSource in the string. It is at that point of the script task that the above error occurs. Here's the task script code where the error occurs:
Dts.Connections("Transactions File").ConnectionString = _
Dts.Variables("ConnectionString").Value.ToString().Replace("##FILE_PATH##", sourceFilePath)
The ConnectionString value is: Provider=Microsoft.Jet.OLEDB.4.0;Data Source=##FILE_PATH##;Extended Properties="EXCEL 8.0;HDR=YES";
The sourceFilePath is the full UNC path to the vendor file in the processing directory
I don't believe it's a permissions error since all the other files going through this process (using the same holding directory for processing) are working. It shouldn't be an issue of the file not existing since again it follows the same process as every other file and I have verified the file properly ends up in the correct directory. I also considered that the connection string might be too long, but the filepath ends up at 109 characters and even with a shorter (<90) full path, the same error occurs.
Is there anything else you can you think of for me to look at? Thanks for any help.
Based on the information presented, you are doing everything correctly. If you're new to SSIS, one thing I'd suggest is that you get a copy of the excellent add-in BIDSHelper. It has great features that can really save you time, especially with regard to configurations and expressions.
I created a reference package that had an Excel Connection Manager pointing to C:\ssisdata\so_paulsmithjr.xls and wired everything up.
At this point, I know things are working so it was time to make the package move. I created the following variables and their values
CurrentFile - C:\ssisdata\so_paulsmithjr.xls
PlaceHolder - ##FILE_PATH##
TemplateConnection - Provider=Microsoft.Jet.OLEDB.4.0;Data Source=##FILE_PATH##;Extended Properties="Excel 8.0;HDR=YES";
A fourth variable is set to be an expression (Right click on variable, properties window. Set Evaluate as Expression = True & Expression is below)
CurrentConnection - REPLACE(@[User::TemplateConnection], @[User::PlaceHolder], @[User::CurrentFile])
I compared the CurrentConnection value to the ReferenceConnection (which is the original value of the Excel Connection Manager's connection string) and things were a match. At this point, if I were to change the value of CurrentFile to C:\ssisdata\so_paulsmithjr - Copy.xls, that would automatically be reflected in the value of CurrentConnection.
The final trick is to use an Expression on the Excel Connection Manager. Again, right click on the CM and under Properties there will be Expressions. It won't expand, as there is nothing under it. Instead, click the ellipsis, select the ConnectionString property, click the ellipsis again and this time drag down the @[User::CurrentConnection] variable. Click OK twice and your connection manager is now set to use whatever the CurrentConnection variable specifies.
Does that work any better?

SSIS 2005 flat file source - partial row which isn't actually a partial row

I'm currently working on an SSIS package to load mainframe logs from multiple server/file sources into a database.
As it stands at the moment I'm using a foreach loop container to loop through a recordset containing filenames and load the files using a Data Flow task from a Flat File Source and File connection to an OLE DB Destination through a Derived column.
I've built in error handling on the Data Flow task to allow for the fact that there won't always be a log file in the location specified (ie. because the server was down for maintenance during a specific period as the files are generated on an hourly basis), but the problems start after it finishes handling these errors.
If the file immediately following an attempt to load a file that wasn't found exists it begins to load it but then throws the following warning message: [Message Log File Source (NORDXSL) [57]] Warning: There is a partial row at the end of the file., and doesn't load all of the records in that file.
However, when I remove the files I know won't exist from the recordset (so that it only attempts to load files that do exist, including the one with the alleged "partial row"), everything works fine and all files/rows are loaded without a problem. It just seems unable to load the first file correctly after it has failed on a missing file, and I can't for the life of me work out why.
I've tried calling Dispose() and ReleaseConnection() on the file connection after the Data Flow task has finished processing but this makes no difference and I'm now completely out of ideas.
Any help would be really appreciated as this is the last bug in this project and I want to get it out the door. PLEASE!!
Thanks,
James
I've now found a workaround for this problem...
I've added a Script Task before the Data Flow Task to load the files that checks to see if the file I want to read exists:
If (System.IO.File.Exists(Dts.Variables("MQLogMessagePath").Value.ToString)) Then
    Dts.TaskResult = Dts.Results.Success
Else
    Dts.TaskResult = Dts.Results.Failure
End If
If it doesn't exist it fails the iteration of the Foreach Loop container and continues onto the next file.
BINGO!