Processing CSV files with metadata in an S3 bucket in Pentaho

I have a CSV file that goes something like this:
Report Name: Stackoverflow parse data
Date of Report: 31 October, 2022
Col1, Col2, Col3,...
Data, Data, Data, ...
The lines before the headers, essentially metadata stating what the CSV is for and when it was created (it can contain multiple values, hence a dynamic number of rows), need to be removed from the CSV so I can parse it in Pentaho. The CSV files are on an S3 bucket and I am fetching them using the S3 CSV Input step, but I am not sure how to filter out the non-required data so I can successfully parse the files.

You can read the complete file as a CSV with only one column, adding the row number to the output. Then apply a filter to get rid of the first n rows, and use the Split fields step to separate each row into columns.
You'll need more steps to convert numbers and dates into the correct types (after Split fields you'll only have strings), and maybe more operations to preformat some other columns.
Alternatively, you could create a temporary copy of your S3 CSV file without the first n rows and read that file instead of the original one.
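Outside Pentaho, the same row-skipping logic looks roughly like this minimal Python sketch. The rule used to find the header row (a line starting with "Col1") is just an assumption taken from the sample in the question; adjust it to whatever reliably marks the header in your real files.

import csv
import io

def strip_metadata(raw_text, header_prefix="Col1"):
    # Drop the leading metadata lines before the real CSV header.
    # The header-detection rule (line starts with "Col1") is only an
    # assumption for this sketch; adapt it to your real files.
    lines = raw_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(header_prefix):
            # Everything from the header row onwards is the real CSV.
            return "\n".join(lines[i:])
    raise ValueError("Header row not found")

# Example with the sample from the question.
raw = (
    "Report Name: Stackoverflow parse data\n"
    "Date of Report: 31 October, 2022\n"
    "Col1, Col2, Col3\n"
    "Data, Data, Data\n"
)
rows = list(csv.reader(io.StringIO(strip_metadata(raw)), skipinitialspace=True))
print(rows[0])  # ['Col1', 'Col2', 'Col3']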

Step 1: In the CSV input, just add the row number.
Step 2: Use a Filter rows step to drop the unwanted leading rows.
Step 3: Add an output step, such as a CSV file or a database table.

Related

Trouble uploading single column csv to bigquery with split columns

I'm trying to upload a dataset to BigQuery so that I can query the data. The dataset is currently in a CSV, with all the data for each row in one column, split by commas. I want to have the data split into columns using the comma as a delimiter.
When trying to upload using auto-detect schema, the 10 columns have been detected, but they are called 'string_0, string_1, string_2, etc.' and the rows still have all the data in the first column.
When trying to upload by manually inputting the schema, I get these errors:
CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
CSV table references column position 9, but line starting at position:117 contains only 1 columns.
On both occasions I set header rows to skip = 1
Here's an image of the dataset.
Any help would be really appreciated!
I see three potential reasons for the error you're hitting:
Source data CSV file structural problem: the file does not meet the RFC 4180 specification prerequisites, e.g. it uses atypical line breaks (line delimiters);
BigQuery sink table schema mismatch: e.g. a dedicated column for a particular piece of input data is missing;
BigQuery schema type mismatch: a table column is parsed with a type that differs from the input one.
Please also look at the particularities of BigQuery's schema auto-detection method for loading CSV data, which may help you solve the above-mentioned issue.
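If the cause turns out to be the second or third point, one workaround is to skip auto-detect and state the schema explicitly. A rough sketch with the BigQuery Python client follows; the bucket, table and column names are placeholders, not taken from the question.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,            # header row
    field_delimiter=",",            # make the delimiter explicit
    schema=[
        # Placeholder names and types; list one SchemaField per
        # column actually present in the file.
        bigquery.SchemaField("col_1", "STRING"),
        bigquery.SchemaField("col_2", "STRING"),
    ],
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.csv",   # placeholder path
    "my_dataset.my_table",          # placeholder table id
    job_config=job_config,
)
load_job.result()  # wait for completion and raise on errors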

Pentaho: use an Excel input with previous fields

I'm working with excel files in pentaho.
I do some preprocessing on the directories because the information is stored this way:
/[year_dir]/[month_dir]/[store_id]_[day_of_month].xls
For example, /2017/01/4567_3.xls means the sales of store 4567 on 03/01/2017.
I pass the filename to an Excel input, but the year, day and store_id column names get added at the beginning, shifting the rest of the column names but not the data of the Excel sheet.
The easiest way is to include the filename (whole path) in your output data stream, then use a regex to split it into the various bits and pieces you need, extracting the date and store id from there.
You can later have a select values step to re-order the fields if order is important.
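For illustration, the regex part of that approach could look like the following Python sketch. The exact pattern is an assumption based on the example path in the question; inside Pentaho you would put an equivalent expression into a regex step after adding the filename to the stream.

import re
from datetime import date

# Pattern for paths like /2017/01/4567_3.xls
# (year dir / month dir / storeid_day.xls); purely illustrative,
# tighten it to match your real directory layout.
PATH_RE = re.compile(r"/(\d{4})/(\d{2})/(\d+)_(\d{1,2})\.xls$")

def parse_path(path):
    m = PATH_RE.search(path)
    if not m:
        raise ValueError("Unexpected path layout: " + path)
    year, month, store_id, day = m.groups()
    return store_id, date(int(year), int(month), int(day))

store_id, sale_date = parse_path("/2017/01/4567_3.xls")
print(store_id, sale_date)  # 4567 2017-01-03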

Google BigQuery How to Load custom columns from csv?

I have a DoubleClick CSV file with 20 columns (Timestamp, AdvertiserId, ActionName, Browser, OSID, ...) without any header. I would like to ingest only the first 3 columns into a BQ table. Is there any way to achieve that without mapping each and every column one by one manually in BQ's UI (create new_table -> "Schema" section)?
Fields in the CSV are comma separated and newlines are defined as a semicolon ';'.
There are two possible ways to do that; see BigQuery: Load from CSV, skip columns.
In your case I would probably suggest the second approach. Set the ignoreUnknownValues flag and pass in a schema with just the first three columns. For example:
bq load --ignore_unknown_values dataset.new_table gs://path/to/file.csv ~/path/to/schema.json
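For comparison, the same load expressed with the BigQuery Python client: pass a schema holding just the first three fields and set ignore_unknown_values so the remaining columns are dropped. The column types below are guesses based on the names, not confirmed by the question.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    ignore_unknown_values=True,  # drop the columns not listed in the schema
    schema=[
        # Types are assumptions based on the column names.
        bigquery.SchemaField("Timestamp", "TIMESTAMP"),
        bigquery.SchemaField("AdvertiserId", "INTEGER"),
        bigquery.SchemaField("ActionName", "STRING"),
    ],
)
client.load_table_from_uri(
    "gs://path/to/file.csv",
    "dataset.new_table",
    job_config=job_config,
).result()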

append new information in csv files to existing historical qvd

Let's say I have a "master" qvd file named salesHistory.qvd, and I want to append new monthly sales from file salesMarch.csv
How do I do that without replacing existing information, but adding new months?
Thanks for the help!
By default, QlikView automatically appends table loads to a previously loaded table if the fields are identical. You can use this to your advantage by using a script similar to the following:
SalesHistory:
LOAD
*
FROM
[salesHistory.qvd] (qvd);
LOAD
*
FROM
[salesMarch.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);
STORE SalesHistory INTO [salesHistory.qvd] (qvd);
This initially loads the contents of your salesHistory.qvd file into a table, and then loads the contents of salesMarch.csv and concatenates it into the SalesHistory table (which already contains the contents of salesHistory.qvd).
The final STORE step saves this concatenated table into the salesHistory.qvd file by overwriting it completely.
In the above example, we use * as a field specifier to load all fields from the source files. This means that this approach only works if your QVD file contains the same fields (and field names) as your CSV file.
Furthermore, as this script loads the contents of the QVD file each time it is executed, it will start to duplicate data if it is executed more than once per month as there is no determination of which months already exist in the QVD file. If you need to execute it more than once per month (perhaps due to adjustments) then you may wish to consider applying a WHERE clause to the load from salesHistory.qvd so that only data up to and including the previous month is loaded.
Finally, you may wish to alter the name of your CSV file so that it is always the same (e.g. salesCurrentMonth.csv) so that you do not have to change the filename in your script.

How to Load data in CSV file with a query in SSIS

I have a CSV file which contains millions of records/rows. The header row is like:
<NIC,Name,Address,Telephone,CardType,Payment>
In my scenario I want to load only the data where "CardType" is equal to "VIP". How can I perform this operation without loading all the records in the file into a staging table?
I am not loading these records into a data warehouse. I only need to separate this data in the CSV file.
The question isn't super-clear, but it sounds like you want to do some processing of the rows before outputting them back into another CSV file. If that's the case, then you'll want to make use of the various transforms available, notably Conditional Split. In there, you can look for rows where CardType == VIP and send those down one output (call it "Valid Rows"), and send the others into the default output. Connect up your valid rows output to your CSV destination and that should be it.
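For reference, the filter the Conditional Split applies is simple enough that the same separation can also be done as a plain streaming pass outside SSIS. A minimal Python sketch, with placeholder file names and assuming the header row names the columns as shown in the question:

import csv

# Stream the source file row by row and keep only CardType == "VIP";
# nothing is loaded into a staging table. File names are placeholders.
with open("input.csv", newline="") as src, \
        open("vip_rows.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["CardType"] == "VIP":
            writer.writerow(row)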