I have a CSV file which contains millions of records/rows. The header row is like:
<NIC,Name,Address,Telephone,CardType,Payment>
In my scenario I want to load only the rows where "CardType" is equal to "VIP". How can I perform this operation without loading all of the records in the file into a staging table?
I am not loading these records into a data warehouse; I only need to separate this data out of the CSV file.
The question isn't super-clear, but it sounds like you want to do some processing of the rows before outputting them back into another CSV file. If that's the case, then you'll want to make use of the various transforms available, notably Conditional Split. In there, you can look for rows where CardType == "VIP" and send those down one output (call it "Valid Rows"), and send the others to the default output. Connect your valid rows output up to your CSV destination and that should be it.
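If it turns out you do not need SSIS at all and just want to split the file on disk, the same filter can be done with a short script that streams the file instead of loading it all at once. A minimal sketch in Python (the file names are placeholders, adjust to your environment):

import csv

# stream the large CSV and write only the VIP rows to a new file,
# so the whole file is never held in memory (file names are placeholders)
with open("cards.csv", newline="", encoding="utf-8") as src, \
     open("cards_vip.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["CardType"] == "VIP":
            writer.writerow(row)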
I have a CSV file that goes something like this:
Report Name: Stackoverflow parse data
Date of Report: 31 October, 2022
Col1, Col2, Col3,...
Data, Data, Data, ...
The values before the headers (essentially data stating what the CSV is for and when it was created; it can contain multiple such lines, hence a dynamic number of rows) need to be removed from the CSV so I can parse it in Pentaho. The CSV files are in an S3 bucket and I am fetching them using the S3 CSV Input step, but I am not sure how to filter out the non-required rows so I can successfully parse the files.
You can read the complete file as a CSV with only one column, adding the row number to the output. Then you apply a filter to get rid of the first n rows, and then you use the Split fields step to separate each row into columns.
You'll need more steps to transform numbers and dates into the correct format (after Split fields you will get strings), and maybe further operations to format some other columns.
Or you could create a temporary copy of your S3 CSV file without the first n rows, and read that copy instead of the original one (see the sketch after the steps below).
Step 1: In the CSV input, just add the row number to the output.
Step 2: Use a Filter rows step to drop the leading rows.
Step 3: Add an output step such as a CSV or database output.
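If you prefer the temporary-copy approach, the pre-processing can happen before Pentaho ever reads the file. A minimal sketch in Python (not a PDI step), assuming the real header is the first line starting with "Col1"; the marker and file names are placeholders:

# copy a CSV while dropping the metadata lines that precede the real header
# (the "Col1" marker and file names are assumptions; adjust to your data)
def strip_preamble(src_path, dst_path, header_prefix="Col1"):
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        header_seen = False
        for line in src:
            if not header_seen and not line.lstrip().startswith(header_prefix):
                continue  # still inside the report preamble
            header_seen = True
            dst.write(line)

strip_preamble("report_raw.csv", "report_clean.csv")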
I have a list of 45,000 ids. For every id I need to generate data for every calendar day, which in the end will give me 27 million records. I can do it manually by passing the list of ids into the transformation and running it, but I wonder what the automated way to do it would be. Save the ids in an xls/txt file in batches of 1,000 records and have Pentaho read one file, run the transformation, save the output, open the next file, run the transformation, save the output, and so on?
I have a flat file that I'm loading into SQL, and that flat file has 2 different RecordTypes and 2 different file layouts based on the RecordType.
So I may have
000010203201501011 (RecordType 1)
00002XXYYABCDEFGH2 (RecordType 2)
So I want to immediately check for records of RecordType 1 and then send those records through [Derived Column] & [Data Conversion] & [Loading to SQL]
And I want to ignore all Records of RecordType2.
I tried a Conditional Split but it seems like the records of RecordType 2 are still trying to go through the [Derived Column] & [Data Conversion] steps.
It gives me a DataConversion error on the RecordType2 Records.
I have the Conditional Split set up as RecordType == 1 to go thru the process i have set up.
I guess Conditional Split isn't set up to be used this way?
Where in my process can I tell it to check for RecordType 1 and only send records past that point that are RecordType = 1?
It makes perfect sense that you are getting data type errors for RecordType 2 rows, since you have probably defined the columns and their data types based on RecordType 1 records. I see three options to achieve what you want to do:
1. Have a Script Task in the control flow copy only RecordType 1 records to a fresh file that is then used by the data flow you already have (Pro: you do not need to touch the data flow; Con: the file is read twice). A sketch of this copy-and-filter step follows this list. OR
2. In the existing data flow: instead of getting all the columns from the data source, read every line coming from the file as one big column, then use a Derived Column to extract RecordType, then a Conditional Split, then another Derived Column to re-create all the columns you had defined in the data source. OR
3. Ideal if you have another package processing RecordType 2 rows: dump the file into a database table in the staging area, then replace the data source in your data flow with an OLE DB source (or whatever you use) and obtain and filter the records with something like:
SELECT substring(rowdata,1,5) AS RecordType,
       substring(rowdata,6,...) AS Column2,
       ...
FROM STG.FileData
WHERE substring(rowdata,1,5) = '00001';
If you use this approach it would be better to have a dedicated column for RecordType.
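For option 1, the Script Task in SSIS would normally be written in C# or VB.NET; the sketch below just shows the same copy-and-filter logic in Python, assuming (as in the SQL above) that the record type is the first 5 characters of each line and that '00001' marks RecordType 1. The file names are placeholders.

# copy only RecordType 1 lines to a fresh file (a conceptual sketch only;
# inside SSIS this logic would live in a C# or VB.NET Script Task)
RECORD_TYPE_1 = "00001"  # assumption: the first 5 characters identify the record type

with open("input_flatfile.txt", "r", encoding="utf-8") as src, \
     open("recordtype1_only.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if line[:5] == RECORD_TYPE_1:
            dst.write(line)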
Let's say I have a "master" qvd file named salesHistory.qvd, and I want to append new monthly sales from file salesMarch.csv
How do I do that so the existing information is kept and only the new months are added?
Thanks for the help!
By default, QlikView automatically appends table loads to a previously loaded table if the fields are identical. You can use this to your advantage by using a script similar to the following:
SalesHistory:
LOAD
*
FROM
[salesHistory.qvd] (qvd);
LOAD
*
FROM
[salesMarch.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);
STORE SalesHistory INTO [salesHistory.qvd] (qvd);
This initially loads the contents of your salesHistory.qvd file into a table, and then loads the contents of salesMarch.csv and concatenates it into the SalesHistory table (which already contains the contents of salesHistory.qvd).
The final STORE step saves this concatenated table into the salesHistory.qvd file by overwriting it completely.
In the above example, we use * as a field specifier to load all fields from the source files. This means that this approach only works if your QVD file contains the same fields (and field names) as your CSV file.
Furthermore, because this script loads the full contents of the QVD file each time it is executed, it will start to duplicate data if it is run more than once per month, as there is no check for which months already exist in the QVD file. If you need to run it more than once per month (perhaps due to adjustments), you may wish to apply a WHERE clause to the load from salesHistory.qvd so that only data up to and including the previous month is loaded.
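For example, assuming the data has a date field named SaleDate (a placeholder name; use whatever your date field is actually called), the historical load could be restricted like this:

SalesHistory:
LOAD
*
FROM
[salesHistory.qvd] (qvd)
WHERE SaleDate < MonthStart(Today()); // keep only data before the current month

Note that adding a WHERE clause like this means the QVD is no longer read as an optimized load, so it will be somewhat slower on very large files.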
Finally, you may wish to alter the name of your CSV file so that it is always the same (e.g. salesCurrentMonth.csv) so that you do not have to change the filename in your script.
I have multiple flat files. I need to output each flat file to a different table using SSIS. I created a Foreach Loop with a file enumerator to pick up every source file, but it loads all of them into the same table, which then throws an error because the files have different fields.
How can I configure the package to output each file to a different table?
You cannot, at least within a single data flow, have different source metadata. DTS supported this but SSIS does not. The number and type of columns in an SSIS data flow must be fixed at design time.
You can have multiple data flows within your Foreach Loop and then enable/disable them based on the file name or some other criteria to support loading different sources and destinations.
Some might suggest you read every row as a single wide column and then use a Conditional Split based on file type, followed by a Derived Column to break it out into the specific columns. That works, but it is a maintenance nightmare I would not wish on my most hated enemy.