Talend - Read XML schema of LDAP from an XML file

What am I looking for?
I want to read the schema of a tLDAPInput component from an XML file.
Info:
The user will define the attributes they want in the XML file.
The job should retrieve from the LDAP directory only those attributes that are defined in the XML file. How can I do that?
I am new to Talend and I can't find any question on this on SO.

Honestly, this is very painful to do properly, and I'd seriously reconsider whether you really need to limit the columns returned by the LDAP service rather than simply ignoring the extraneous ones.
First of all, you need to parse your XML input to get the requested columns, put them into a list, and store that list in the globalMap.
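Here is a minimal standalone sketch of that parsing step, assuming the XML file lists the requested attributes in <column> elements (the element name, file name and class name are made up for illustration); in a Talend job you would run the equivalent code in a tJava/tJavaFlex and put the resulting list into the globalMap:
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class RequestedColumns {
    // Reads the attribute names the user requested from the XML file.
    public static List<String> read(String xmlPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlPath);
        NodeList nodes = doc.getElementsByTagName("column"); // hypothetical element name
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            columns.add(nodes.item(i).getTextContent().trim());
        }
        return columns;
    }

    public static void main(String[] args) throws Exception {
        // In a Talend tJava you would do something like:
        // globalMap.put("requestedColumns", read(context.columnsFile));
        System.out.println(read("requested_columns.xml"));
    }
}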
Next, read the entire output, with all of its columns, from a correctly configured tLDAPInput component whose schema is set to a single dynamic column.
From there, use a tJavaRow/tJavaFlex component to loop through the list of expected columns from your XML input, retrieve each column's name from the dynamic column's metadata, and, whenever a name matches one of the values from your XML input, copy that value into an output column.
The output schema of your tJavaRow/tJavaFlex will need to contain every column the LDAP service could possibly return, with only the matched columns populated. Alternatively, you could output another dynamic column, which means you don't need fixed schema columns, but then you'd have to add a meta column (a column inside the dynamic column) for each matching column name.
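To illustrate the matching logic outside of Talend, here is a minimal standalone sketch in which a Map stands in for the dynamic column's metadata and values; inside a real tJavaRow you would read the column names from the dynamic column's metadata instead, so treat this purely as a model of the filtering:
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DynamicColumnFilter {
    // Keeps only the entries whose name appears in the requested-column list,
    // mimicking what the tJavaRow would do for each incoming row.
    public static Map<String, Object> filterRow(Map<String, Object> ldapRow,
                                                List<String> requestedColumns) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : requestedColumns) {
            if (ldapRow.containsKey(name)) {
                out.put(name, ldapRow.get(name));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A fake LDAP row; in the job this data comes from the dynamic column.
        Map<String, Object> ldapRow = new LinkedHashMap<>();
        ldapRow.put("cn", "John Doe");
        ldapRow.put("mail", "john@example.com");
        ldapRow.put("telephoneNumber", "12345");
        // Only the attributes named in the XML file survive.
        System.out.println(filterRow(ldapRow, List.of("cn", "mail")));
    }
}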

Related

Dynamic filename in Mapping Data Flow sink without the column in file

The way I understand it, if you want dynamic filenames when writing to blob storage from a mapping data flow, the solution is to set "As data in column" in the file name options on the sink. This then uses the contents of a column as the filename for each row. To set the filename on each row, you can have a derived column that contains the filename expression.
With auto mapping enabled on the sink this then results in having a column in the file containing the filename.
With auto mapping turned off, I could map all columns except for this one, but as I also want schema drift enabled on the source and want to keep any extra columns in the destination, I can't have a fixed set of output columns.
How can I dynamically set the filename that gets generated without including it as a column in the file?
Or, if we assume every row will have the same filename, is there another way to dynamically set a filename? I've struggled to find any documentation on the file name options, but "Pattern" looks like it just adds a number and "Output to single file" looks like a fixed value.
When you choose the 'Output to single file' option, you can create a parameter in the Data Flow and use it as the file name. Then pass the value from the pipeline to the Data Flow like this:
My test:
1. Add a parameter in the Data Flow.
2. Use that parameter as the file name.
3. Pass the value to the parameter from the pipeline.

Is there any way to exclude columns from a source file/table in Pentaho using "like" or any other function?

I have a CSV file with more than 700 columns. I want just 175 of those columns to be inserted into an RDBMS table or a flat file using Pentaho (PDI). The source CSV file has variable columns, i.e. columns can keep getting added or deleted, but they contain some specific keywords that remain constant throughout. I have the list of keywords present in the column names that have to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_")
Any column whose name contains one of the above keywords has to be excluded and should not be passed on to my RDBMS table or flat file. Is there any way to remove these columns by writing some SQL query in PDI? Selecting the specific 175 columns is not possible as they are variable in nature.
I think your use case is a good fit for metadata injection; you can refer to the example shared below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
There are two things you need to be careful about:
Maintain the list of columns you need to push through.
Since the column names change, you may also face issues with valid columns that you do want to import or work with. To avoid this, make sure you generate the metadata file every time, so you are sure about the column names you want to push out from the flat file.
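As a rough standalone sketch of the logic that would drive the injection, assuming the exclusions are the prefixes listed in the question (the file name and class name are hypothetical): read the CSV header once, drop any column whose name starts with an excluded prefix, and feed the surviving names into the metadata injection step:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class ColumnFilter {
    // Prefixes whose columns must be excluded (taken from the question).
    private static final String[] EXCLUDED_PREFIXES = {
            "avgbal_", "emi_", "delinq_prin_",
            "total_utilization_", "min_overdue_", "payment_received_"
    };

    // Reads the CSV header and returns only the columns to keep; this list is
    // what you would feed into the metadata injection step.
    public static List<String> columnsToKeep(String csvPath) throws Exception {
        List<String> keep = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return keep; // empty file
            }
            for (String column : headerLine.split(",")) {
                String name = column.trim();
                boolean excluded = false;
                for (String prefix : EXCLUDED_PREFIXES) {
                    if (name.startsWith(prefix)) {
                        excluded = true;
                        break;
                    }
                }
                if (!excluded) {
                    keep.add(name);
                }
            }
        }
        return keep;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(columnsToKeep("source.csv")); // hypothetical file name
    }
}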

Create table schema and load data in BigQuery table using source Google Drive

I am creating a table using Google Drive as the source and Google Sheets as the format.
I have selected "Drive" as the value for "Create table from". For "File format", I selected "Google Sheet".
I also selected "Auto detect" for the schema and input parameters.
It creates the table, but the first row of the sheet is loaded as data instead of being used as the table's field names.
Kindly tell me what I need to do to get the first row of the sheet used as the table's column names rather than as data.
It would have been helpful if you could include a screenshot of the top few rows of the file you're trying to upload at least to see the data types you have in there. BigQuery, at least as of when this response was composed, cannot differentiate between column names and data rows if both have similar datatypes while schema auto detection is used. For instance, if your data looks like this:
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
BigQuery would not be able to detect the column names (at least automatically using the UI options alone) since all the headers and row data are Strings. The "Header rows to skip" option would not help with this.
Schema auto detection should be able to detect and differentiate column names from data rows when you have different data types for different columns though.
You have the option to skip header rows under Advanced options. Simply put 1 as the number of header rows to skip (your first row is where your header is). BigQuery will then skip the first row as data and use its values as your header.

PDI/Kettle - Passing data from previous hop to database query

I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, then how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
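For context, the ? is a standard JDBC prepared-statement placeholder; the answer below covers how to wire this up in PDI itself, but purely as an illustration of what that binding does, here is a plain-Java sketch (the connection details, class name and driver URL are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VendorLookup {
    // Checks whether a rating record exists for the given vendor name.
    public static boolean vendorExists(String jdbcUrl, String user, String password,
                                       String vendorName) throws Exception {
        String sql = "SELECT * FROM VENDORRATINGS WHERE VENDOR = ?";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, vendorName); // this is how the value of "?" gets assigned
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next(); // true if at least one matching record exists
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; replace with your own.
        boolean exists = vendorExists("jdbc:postgresql://localhost/ratings",
                                      "user", "secret", "ACME Corp");
        System.out.println(exists ? "Vendor found" : "No match");
    }
}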
The best practice is a Stream lookup. For each record in the main flow (VendorRatings), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, name, or firstname+lastname).
First "hurdle": once the path of the CSV file is defined, press the Get fields button.
It will take the first line as the header to learn the field names and scan the first 100 (customizable) records to determine the field types.
If the names are not on the first line, uncheck Header row present, press the Get fields button, and then change the names in the panel.
If there is more than one header row or other complexities, use the Text file input step.
The same is valid for the lookup step: use the Get lookup fields button and delete the fields you do not need.
Given that:
there is at most one vendor rating per vendor, and
you have to do something if there is no match,
I suggest the following flow:
Read the CSV and, for each row, look up in the table (i.e. the lookup table is the SQL table rather than the CSV file), with a default value when there is no match. I suggest something really visible, like "--- NO MATCH ---".
Then, in case of no match, a filter redirects the flow to the alternative action (here: inserting into the SQL table). Finally, the two flows are merged back into the downstream flow.
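To make the lookup-with-default idea concrete outside of PDI, here is a minimal standalone Java sketch in which an in-memory map stands in for the SQL table and the vendor names stand in for the CSV rows (all names are made up); in the actual transformation the lookup step and a Filter rows step do this for you:
import java.util.List;
import java.util.Map;

public class LookupWithDefault {
    private static final String NO_MATCH = "--- NO MATCH ---";

    public static void main(String[] args) {
        // Mock of the VendorRatings table, keyed by vendor name.
        Map<String, String> vendorRatings = Map.of(
                "ACME Corp", "A+",
                "Globex", "B");

        // Vendor names as read from the CSV.
        List<String> csvVendors = List.of("ACME Corp", "Initech", "Globex");

        for (String vendor : csvVendors) {
            String rating = vendorRatings.getOrDefault(vendor, NO_MATCH);
            if (NO_MATCH.equals(rating)) {
                // Filter branch: no match, so take the alternative action
                // (in the transformation: insert the vendor into the SQL table).
                System.out.println(vendor + " -> not found, inserting new record");
            } else {
                System.out.println(vendor + " -> rating " + rating);
            }
        }
    }
}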

How to map input to output fields from Excel to CSV in Pentaho?

How do I map input fields to output fields when going from Excel to CSV in Pentaho?
How do I transform this in Pentaho? Where do I map the values of the input columns to the output columns, given that the positions and names differ between input and output?
You can rename the fields right in your MS-Excel-Input step, and you can reorder the fields in Text-File-Output. Also, a Select-Values step allows you to rename and reorder fields in one sweep on the Select & Alter tab.
The Select Values step allows you to change the column names and position (as well as type).
Note that the column names in the Excel input are arbitrary and do not need to be related to the actual names in the Excel file, so you can rename them as you wish. You can even copy/paste the names into the field list.
Note also that the order of the columns in the CSV output file is defined in the Fields tab of the Text file output step. You can change it with the Ctrl+arrow keys.
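Conceptually, what Select values does to each row is just a rename-and-reorder. The standalone sketch below models that with a LinkedHashMap; the field names are made up purely for illustration:
import java.util.LinkedHashMap;
import java.util.Map;

public class RenameAndReorder {
    public static void main(String[] args) {
        // A row as read by the Excel input step (field names are hypothetical).
        Map<String, Object> excelRow = new LinkedHashMap<>();
        excelRow.put("CustName", "Jane Doe");
        excelRow.put("Amt", 120.5);
        excelRow.put("TxnDate", "2020-01-15");

        // Mapping of input name -> output name, in the desired output order
        // (this is what you configure on the Select & Alter tab).
        Map<String, String> selectValues = new LinkedHashMap<>();
        selectValues.put("TxnDate", "transaction_date");
        selectValues.put("CustName", "customer_name");
        selectValues.put("Amt", "amount");

        // Build the output row: renamed fields in the new order, ready for the CSV output.
        Map<String, Object> csvRow = new LinkedHashMap<>();
        selectValues.forEach((in, out) -> csvRow.put(out, excelRow.get(in)));

        System.out.println(csvRow);
    }
}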
If you need to industrialize the process and have the new column names and order stored in, for example, a set of files or a database table, then you need Metadata injection. Have a look at Diethard's or Jens' examples.