I am trying to import a certain .CSV file into my database using PDI (Kettle).
Normally this would be rather easy, as you could just link up a CSV file input step with a Table output step and be good to go. However, the problem is that I don't know which file I want to import in advance, as in before executing the job/transformation in PDI.
That is because I have many files in my import folder, which all share the same filename format: KeyDate_Filename_YYYYMMDD.CSV
The idea is to have the file with the newest YYYYMMDD imported for a given key date.
My theoretical approach to implement this would be:
Make the given key date available in PDI as a parameter (already done)
Read in the names of all files stored in the import folder
Filter said filenames for the given key date
Compare the YYYYMMDD of the remaining files and select the newest
Use selected filename as parameter in a CSV file input step (already done)
Import data via Table output step (already done)
Unfortunately I am fairly new to PDI and don't really have a compelling idea on how to implement the middle steps (reading, filtering, and comparing the filenames), or whether that approach as a whole is even viable.
Can anybody think of a way to get this done? Appreciate any feedback
Edit: Forgot to mention that I am using PDI 3.2.6.
In 4.x.x I could simply use a User Defined Java Class to get this done :/
There are various ways to achieve that result. Here's one:
Get filenames lists all files within a specific folder that match a given pattern. As ${KeyDate} is already defined as a parameter, the pattern could be
${KeyDate}_[^_]*_[0-9]*\.csv
(you can use a simpler regex, but this one will match only filenames in that exact format);
With a regex evaluation you get the date: ${KeyDate}_[^_]*_([0-9]*)\.csv. Remember to tick the "create fields for capture groups" checkbox;
Order rows by that newly created date field.
Group by (without a key field) and take first value of filename (if asc order) or last value (if desc order).
The output of the Group by step is a single row with the most recent filename that matches your pattern.
Now you pass it to the CSV file input, telling it to "accept filenames from previous step", and specifying which field to use (default is filename).
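For clarity, the selection logic those steps implement boils down to something like this (a minimal Python sketch, outside of PDI; the folder path and key date value are hypothetical stand-ins for the ${KeyDate} parameter and the folder configured in the Get filenames step):

import os
import re

# Hypothetical placeholders for the ${KeyDate} parameter and the import folder.
import_dir = "/data/import"
key_date = "20091231"

# Same idea as the Regex evaluation step: KeyDate_Filename_YYYYMMDD.csv
pattern = re.compile(rf"{key_date}_[^_]*_([0-9]{{8}})\.csv$", re.IGNORECASE)

newest_file, newest_stamp = None, ""
for name in os.listdir(import_dir):
    match = pattern.match(name)
    if match and match.group(1) > newest_stamp:
        newest_file, newest_stamp = name, match.group(1)

print(newest_file)  # the single filename that would be handed to the CSV file input step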
I have exported some data from Paddle in CSV format. Unfortunately, the date column isn't being recognized because it contains both the date and the time. I didn't manage to change the column in Numbers, nor in Google Drive, as the file is way too big. Is there maybe a way to change the format directly in GoodData? Thanks
No, this is not possible. The data have to be transformed before loading into GoodData; the format isn't recognized precisely because the column contains both the date and the time.
Here are the requirements for loading CSVs. There are some online tools that allow transforming the data - you can even use Google Sheets. Alternatively, as a workaround, you could load the time as an attribute - in case this column does not need to be treated as a date but still has to be in the platform.
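If you would rather split the column before loading, a small preprocessing script works too. A minimal Python sketch, assuming the export has a combined timestamp column named created_at in the format 2014-05-01 13:45:00 (both the column name and the format are guesses - adjust them to the actual Paddle export):

import csv
from datetime import datetime

src, dst = "paddle_export.csv", "paddle_export_clean.csv"  # placeholder file names
ts_column = "created_at"  # hypothetical name of the combined date+time column

with open(src, newline="", encoding="utf-8") as f_in, \
     open(dst, "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in)
    fieldnames = reader.fieldnames + ["created_time"]
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        ts = datetime.strptime(row[ts_column], "%Y-%m-%d %H:%M:%S")
        row[ts_column] = ts.strftime("%Y-%m-%d")       # date-only value for the GoodData date field
        row["created_time"] = ts.strftime("%H:%M:%S")  # time kept as a separate attribute
        writer.writerow(row)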
self-taught SAS user here.
I often work with datasets that I have little control over and are shared among several different users.
I generally have been reading in files as CSVs using an infile statement + defining the variables with blocks of informat, format, and input statements. During this process, can I go ahead and rename variables--provided that everything is renamed in the correct order--or do they have to match the original dataset and be renamed in a later data step?
For example, the variable name in the dataset is '100% Fully Paid Out.' I know SAS variables can't start with numbers and I'd also like to simplify variable names in general, so could I do something like the following:
infile statement...
informat Paid $3.;
format Paid $3.;
input Paid $;
run;
Or maybe I'm going about this very inefficiently. I've tried doing simple proc imports without this whole informat/format/input business, but I've found that trying to redefine variable types afterwards causes more of a headache for me (all datasets I work with have combinations of text, dollars, percentages, general numbers, dates...). In any case, other tips highly appreciated--thanks!
EDIT
Maybe the question I should ask is this: is there any way of keeping the format of the csv for dollars and percentages (through proc import, which seems to convert these to characters)? I know I can manually change the formats from dollars/percentages to "general" in Excel prior to importing the file, but I'd prefer avoiding additional manual steps and also because I actually do want to keep these as dollars and percentages. Or am I just better off doing the informat/format/input to specify data types for the csv, so that variables are read in exactly how I want them to be read in?
Note: I've been unable to proc import xls or xlsx files, either because I'm on a 64-bit computer or because I'm missing required drivers (or both). I was never able to do this on a 32-bit computer either.
CSV files do not contain any metadata about the variable types, as your note about trying to import them into Excel demonstrates. You can use PROC IMPORT to have SAS make an educated guess as to how to read them, but the answer could vary from file to file based on the particular data values that happen to appear.
If you have data in XLS or XLSX files you should be able to read them directly into SAS using a libname with the XLS or XLSX engine. That does not use Excel and so has no conflicts between 32-bit and 64-bit installations; in fact you don't even need Excel installed. SAS will do a better job of determining the variable types from Excel files than from CSV files, but since Excel is a free-form spreadsheet you still might not have consistent variable types for the same variable across multiple files. With an Excel spreadsheet you might not even have a consistent data type within a single column of a single sheet.
You are better off writing your own data step to read the file. That way you can enforce consistency.
What I typically do when given a CSV file is copy the names from the first row and use it to create a LENGTH statement. This will both define the variables and set the order of the variables. You could at this point give the variables new names.
length paid $3 date amount 8 ;
Then for variables that require an INFORMAT to be read properly I add an INFORMAT statement. Normally this is only needed for date/time variables, but it might also be needed if numeric values include commas or percent signs. The DOLLAR. informat is useful if your CSV file has numbers formatted with $ and/or thousands separators.
informat date mmddyy. amount dollar. ;
Then for variables that require a FORMAT to be displayed properly I add a FORMAT statement. Normally this is only needed for date/time variables. It is only required for character variables if you want to attach $CHAR. format in order to preserve leading spaces.
format date yymmdd10. ;
Then the INPUT statement is really easy since you can use a positional variable list. Note that there is no need to include informats or $ in the INPUT statement since the types are already defined by the LENGTH statement.
input paid -- amount ;
I'm trying to import a flat file into an OLE DB target in a SQL Server database.
One particular field is giving me trouble; its flat file connection properties are shown in the attached screenshots.
Here's the error message:
[Source - 18942979103_txt [424]] Error: Data conversion failed. The
data conversion for column "recipient-name" returned status value 4
and status text "Text was truncated or one or more characters had no
match in the target code page.".
What am I doing wrong?
Here is what fixed the problem for me. I did not have to convert to Excel; I just modified the DataType to "text stream" when choosing the data source (Figure 1). You can also check the "Edit Mappings" dialog to verify the change to the size (Figure 2).
Figure 1
Figure 2
After increasing the length and even changing the data type to text failed, I solved this by creating an XLSX file and importing it. It accurately detected the data types instead of setting all columns to varchar(50). It turns out nvarchar(255) for that column would have done it too.
I solved this problem by ORDERING my source data (xls, csv, whatever) such that the longest text values are at the top of the file. Excel is great: use the LEN() function on your challenging column, order by that length value with the longest value at the top of your dataset, save, and try the import again.
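If the file is too big to reorder comfortably in Excel, a small script can do the same shuffling. A minimal Python sketch, assuming the troublesome column is the "recipient-name" field from the error above and that the file has a header row (the file names are placeholders):

import csv

src = "18942979103.txt"          # placeholder input path
dst = "18942979103_sorted.csv"   # placeholder output path
column = "recipient-name"        # the column the wizard keeps truncating

with open(src, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

# Longest value first, so the import wizard's row sampling sees the worst case immediately.
rows.sort(key=lambda r: len(r.get(column) or ""), reverse=True)

with open(dst, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)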
SQL Server may be able to suggest the right data type for you (even when it does not choose the right type by default) - clicking the "Suggest Types" button (shown in your screenshot above) allows you to have SQL Server scan the source and suggest a data type for the field that's throwing an error. In my case, choosing to scan 20000 rows to generate the suggestions, and using the resulting suggested data type, fixed the issue.
While an approach proposed above (#chookoos, elsewhere in this Q&A: convert to an Excel workbook and import) resolves those kinds of issues, the solution linked from another Q&A is excellent because you can stay with your csv, tsv, or txt file and perform the necessary fine-tuning without creating a Microsoft-product-related solution.
I've resolved it by checking the 'Unicode' checkbox when choosing the data source.
You need to increase the column length for the particular column while importing the data.
Choose a data source >> Advanced >> increase the column length from the default 50 to 200 or more.
Not really a technical solution, but the SQL Server 2017 flat file import is totally revamped; it imported my large-ish file in 5 clicks and handled the encoding / field length issues without any input from me.
SQL Server Management Studio's data import looks at only the first few rows to determine the source data specs, so shift your records around so that the longest text is at the top.
None of the above worked for me. I SOLVED my problem by saving my source data (Save As) as a single-worksheet Excel 5.0/95 xls file and importing it without column headings. Also, I created the table in advance and mapped manually instead of letting SQL create the table.
I had a similar problem against 2 different databases (DB2 and SQL); I finally solved it by using CAST in the source query from DB2. I also took advantage of the query to adapt the source column to varchar and strip the useless blank spaces:
CAST(RTRIM(LTRIM(COLUMN_NAME)) AS VARCHAR(60) CCSID UNICODE
FOR SBCS DATA) COLUMN_NAME
The important issue here is the CCSID conversion.
It's usually because the column in the Connection Manager is still set to 50 characters. I resolved the problem by going to Connection Manager --> Advanced and changing the length to 100, or maybe 1000 if your data needs it.
I have around 1000 files that have seven columns. Some of these files have a few rows that have an eighth column (if there is data).
What is the best way to load this into BigQuery? Do I have to find and edit all these files to either
- add an empty eighth column in all files, or
- remove the eighth column from all files? (I don't care about the value in this column.)
Or is there a way to specify eight columns in the schema and supply a null value for the eighth column when there is no data available?
I am using BigQuery APIs to load data if that might help.
You can use the 'allowJaggedRows' argument, which will treat non-existent values at the end of a row as nulls. So your schema could have 8 columns, and all of the rows that don't have that value will be null.
This is documented here: https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowJaggedRows
I've filed a doc bug to make this easier to find.
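For reference, if you end up using the Python client library rather than the raw REST API, the same behaviour is exposed as allow_jagged_rows on the load job configuration. A sketch with placeholder bucket, table, and schema names (swap in your seven real columns plus the optional eighth):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder schema: seven ordinary columns plus the optional eighth.
schema = [bigquery.SchemaField(f"col{i}", "STRING") for i in range(1, 8)]
schema.append(bigquery.SchemaField("col8", "STRING", mode="NULLABLE"))

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=schema,
    allow_jagged_rows=True,  # missing trailing values are loaded as NULL instead of failing the row
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/logs/*.csv",        # placeholder source files
    "my-project.my_dataset.my_table",   # placeholder destination table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes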
If your logs are in JSON, you can define a nullable field, and if it does not appear in the record, it would remain null.
I am not sure how it works with CSV, but I think that you have to have all fields (even empty).
There is a possible solution here if you don't want to worry about changing the CSV values (which would be my recommendation otherwise).
If the number of rows with an eighth column is fairly small and you can afford to "sacrifice" those rows, then you can pass a maxBadRecords param with a reasonable number. In that case, all the "bad" rows (i.e. the ones not conforming to the schema) would be ignored and wouldn't be loaded.
If you are using bigquery for statistical information and you can afford to ignore those rows, it could solve your problem.
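In the Python client library the equivalent knob is max_bad_records on the load job config; a minimal sketch (the threshold and names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    max_bad_records=500,  # arbitrary threshold: rows not matching the 7-column schema are skipped
)
client.load_table_from_uri(
    "gs://my-bucket/logs/*.csv",        # placeholder source files
    "my-project.my_dataset.my_table",   # placeholder destination table
    job_config=job_config,
).result()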
Found a workable "hack".
I ran a job on each file with the seven-column schema and another job on all files with the eight-column schema; one of the two jobs would complete successfully for each file. That saved me the time of editing each file individually and re-uploading 1000+ files.
I'm generating a CSV which contains several rows and columns.
However, when I'm testing said CSV I feel like I am simply repeating the code that builds the file in the test as I'm checking each and every field is correct.
Question is, is this more sensible than it seems to me, or is there a better way?
A far simpler test is to just import the CSV into a spreadsheet or database and verify the data output is aligned to the proper fields. No extra columns or extra rows, data selected from the imported recordset is a perfect INTERSECT with the recordset from which the CSV was generated, etc.
More importantly, I recommend making sure your test data includes common CSV fail scenarios such as:
Field contains a comma (or whatever your separator character)
Field contains multiple commas (You might think it's the same thing, but I've seen one fail where the other succeeded)
Field contains the new-row character(s)
Field contains characters not in the code page of the CSV file
...to make sure your code is handling them properly.
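One way to keep such a test cheap is a round-trip check: feed the generator rows containing those troublesome values, then parse the output with an independent CSV reader and assert you get the same rows back. A minimal Python sketch, where generate_csv stands in for whatever code actually builds your file:

import csv
import io

# Values that commonly break naive CSV writers.
tricky_values = [
    "plain",
    "has, one comma",
    "has, two, commas",
    'embedded "quotes"',
    "line\nbreak",
    "naïve non-ASCII text",
]

def generate_csv(rows):
    # Stand-in for the code under test; replace with your real generator.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def test_round_trip():
    rows = [[value, "second field"] for value in tricky_values]
    parsed = list(csv.reader(io.StringIO(generate_csv(rows))))
    assert parsed == rows, "CSV output did not round-trip cleanly"

test_round_trip()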