In Kettle, using Text File Input to read a CSV file from a tar.gz file did not work. What might be wrong? - pentaho

I have a CSV file that is tarred and gzipped, so I have test.tar.gz.
I would like to read the CSV file through Text File Input.
I tried tar:gz:file://C:/test/test.tar.gz!/test.tar! with a wildcard like ".*\.csv".
But sometimes it fails to read the file.
It throws this exception:
org.apache.commons.vfs.FileNotFolderException:
Could not list the contents of
"tar:gz:file:///C:/test/test.tar.gz!/test.tar!/"
because it is not a folder.
I use Windows 8.1 and PDI 5.2.
What might be wrong?

When reading a compressed CSV file, the "Text File Input" step in Pentaho Kettle only supports the first file inside the compressed archive (either a Zip or a GZip file). Check the compression section of the Pentaho Wiki.
Now for your issue, try removing the wildcard entry, since only the first file inside the zip/gzip file will be read (as explained above).
I have placed sample code that reads both zip and gzip files. Check it here.
Hope it helps :)
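As a sanity check outside Kettle, it can help to confirm what the archive actually contains. The following is a minimal Python sketch, not part of the original answer, that reads every .csv member of a .tar.gz archive with the standard library only. The function name read_csv_members and the UTF-8 encoding are assumptions made for illustration.

```python
import csv
import io
import tarfile


def read_csv_members(archive_path):
    """Yield (member_name, rows) for every .csv member of a .tar.gz archive.

    Hypothetical helper; assumes the CSV members are UTF-8 encoded.
    """
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a .csv file.
            if not (member.isfile() and member.name.lower().endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield member.name, list(csv.reader(text))
```

If this yields nothing, the archive holds no .csv member at all, which would point at the archive rather than at the Kettle step.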

Related

trouble with utf-8 with julia and jupyterlab

I'm reading the csv file at https://github.com/VinitaSilaparasetty/julia-beginners/blob/master/data/nba/nba19-20.csv
I get a DataFrame and save it as XLSX. When I try to open it in JupyterLab, I get an error saying the file is not UTF-8 encoded, and the file is not read.
This is my code:
using HTTP, XLSX, CSV, DataFrames
df = CSV.read(HTTP.get("https://raw.githubusercontent.com/VinitaSilaparasetty/julia-beginners/master/data/nba/nba19-20.csv").body)
# first(df,5) # first shows the top five rows ok
XLSX.writetable("data/nba/nba19-20.XLSX", collect(eachcol(df)), names(df), overwrite = true)
The file is saved in my data folder. When I try to open it with JupyterLab, I get a pop-up saying the file is not UTF-8 encoded, and the file is not opened.
When I try to open the file in Ubuntu (with LibreOffice) I do not see anything suspicious.
As I'm new to Julia I'm struggling to understand where the problem lies or how to fix it.
I tried to see if I could encode the dataframe in UTF-8 (after saving the file to disk) with
data = DataFrame(CSV.File(open(read,"data/nba/nba19-20.csv", enc"utf-8")))
But I did not see any change. Any suggestion is welcome.
Do you have the jupyterlab-spreadsheet plugin installed? JupyterLab by default doesn't support opening xlsx files (it isn't mentioned in the file formats list here for example).
See also this similar question involving Python pandas (which says pretty much the same thing).

How to extract .sql file that seems to be a .zip

I have received a file from a customer. The file is said to be
SQL code (application/sql)
However, this has turned out to be wrong: nothing could open it. It turns out it was secretly a .zip file. By renaming it to '.zip' and manually extracting it I was able to get the files contained in it. I would like to do a similar process in Python.
So far I've renamed the file:
file_name_zip = file_name.replace('.sql', '.zip')
os.rename(file_name, file_name_zip)
And I've tried extracting it:
zip_ref = zipfile.ZipFile(file_name_zip, 'r')
zip_ref.extractall(extracted_file)
However, this failed because
zipfile.BadZipFile: File is not a zip file
I've googled, and apparently this can sometimes be fixed using:
file_name_zip_2 = file_name_zip.replace('.zip', '2.zip')
os.system(f'zip -FF {file_name_zip} --out {file_name_zip_2}')
This required me to put in a bunch of settings, which I wasn't able to figure out. There must be a better way to go about this.
Does anybody know how to parse such an .sql file?
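One way to narrow this down, offered as a sketch and not a definitive fix: instead of trusting the extension, read the first few bytes of the file and compare them with well-known container signatures. If the file starts with something other than the zip signature, a desktop tool may have recognised another format by content, which would explain the BadZipFile error. The helper name sniff_container and the choice of formats are assumptions for illustration.

```python
# Well-known leading byte signatures of common archive formats.
SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"PK\x05\x06": "zip (empty archive)",
    b"\x1f\x8b": "gzip",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"Rar!\x1a\x07": "rar",
    b"BZh": "bzip2",
}


def sniff_container(path):
    """Return a format name based on the file's first bytes, or None.

    Hypothetical helper: it inspects only the header, so it does not
    prove that the rest of the archive is intact.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None
```

Depending on the result, the matching standard-library module (zipfile, gzip, bz2, tarfile) or an external tool can then be used, without renaming the file first.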

Regular expression for getting a file

I have to get a file through PDI based on the filename, and I want to select files whose name ends with the pattern eligible_for_push. The file can be .txt or .csv.
Please Help
Thanks
There are two parts to your query:
1. Finding all files ending with "eligible_for_push":
You cannot use a regex to find this sort of pattern (at least I am not aware of a way). As an alternative, do the following:
Get all the files in the path using the "Get File Names" step. Then use a Modified Java Script Value step to find the files ending with the above pattern. Check the JS file below.
2. Files can be ".txt" or ".csv":
You can use the regex/wildcard below to choose between .txt and .csv:
.*\.txt|.*\.csv
Note: Use this code once you have filtered out the files ending with "eligible_for_push". The JS above ignores all the file patterns. After that, use the second step to sort out all the .txt or .csv files.
Hope it helps :)
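For comparison, a single pattern may cover both conditions at once. The sketch below uses Python's re module to test the idea; the pattern and the helper name is_eligible are assumptions, and whether the same expression behaves identically in the PDI wildcard field has not been verified here.

```python
import re

# Name must end with "eligible_for_push" followed by .txt or .csv.
PATTERN = re.compile(r".*eligible_for_push\.(txt|csv)")


def is_eligible(filename):
    """Return True if the whole filename matches the pattern (hypothetical helper)."""
    return PATTERN.fullmatch(filename) is not None
```

fullmatch requires the entire name to match, so a file such as eligible_for_push_old.csv is rejected because extra text sits between the pattern and the extension.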

How can I check whether an input file is compressed (ZIP) or not?

How can I check whether an input file is compressed (ZIP) or not?
Is the solution to read the file info using the "Get File Names" step and check the extension field?
1. Use the "file" command if you're on Unix.
2. If not, install Cygwin and go to 1.
If this is related to your other question about conditionally reading different files then I would consider getting your files into a consistent format first. i.e. all compressed.
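Where a scripting step is an option, the same check can be sketched in Python with the standard library instead of the file command. zipfile.is_zipfile inspects the file content, so a wrong or missing extension does not matter. The helper name split_by_zip is an assumption for illustration.

```python
import zipfile


def split_by_zip(paths):
    """Split paths into (zipped, plain) based on content, not extension.

    Hypothetical helper built on zipfile.is_zipfile.
    """
    zipped, plain = [], []
    for path in paths:
        (zipped if zipfile.is_zipfile(path) else plain).append(path)
    return zipped, plain
```

The two lists can then be routed to different readers, or the plain files can be compressed first so that everything arrives in one consistent format.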

How can I determine if a file upload is a valid CSV file - or at least text - in ColdFusion 8?

I have a form which allows a user to upload a file to the server. How can I validate that the uploaded file is in fact the expected format (CSV, or at least validate that it is a text file) in ColdFusion 8?
For simple formats like CSV, just check yourself, for example via regex.
<cffile action="read" file="#uploadedFile#" variable="contents" charset="UTF-8">
<cfset LooksLikeCSV = REFind("^([^;]*;)+[^;]*$", contents)>
You can place additional checks with regard to file size limits or forbidden characters.
For other file formats, you can check for header signatures that occur in the first few bytes of the file.
You could even write a full parser for your expected file format. For CSV validation, you could do a ListToArray() at CR/LF and check each line individually against a regex. XML should be pretty straightforward as well: just try to pass it to XmlParse(). Binary formats like images are a little more difficult, but libraries exist there as well.
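The line-by-line idea can be sketched as follows. This is Python, not ColdFusion, and only illustrates the logic: decode the upload as text, then require every non-empty row to have the same number of fields. The function name looks_like_csv, the semicolon delimiter (taken from the regex above), and the "more than one column" rule are assumptions.

```python
import csv
import io


def looks_like_csv(data: bytes, delimiter: str = ";") -> bool:
    """Heuristic check that uploaded bytes are consistent delimited text."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8 text
    if "\x00" in text:
        return False  # NUL bytes suggest binary content
    try:
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
        rows = [row for row in reader if row]  # ignore blank lines
    except csv.Error:
        return False
    if not rows:
        return False
    width = len(rows[0])
    # Require at least two columns and the same column count on every row.
    return width > 1 and all(len(row) == width for row in rows)
```

A check like this only says the content is plausibly CSV; it does not validate the meaning of the values.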
I don't know if it will help you, but Ben Nadel wrote excellent posts about CSV:
http://www.bennadel.com/blog/483-Parsing-CSV-Data-Using-ColdFusion.htm
http://www.bennadel.com/blog/976-Regular-Expressions-Make-CSV-Parsing-In-ColdFusion-So-Much-Easier-And-Faster-.htm
http://www.bennadel.com/blog/501-Parsing-CSV-Values-In-To-A-ColdFusion-Query.htm
I think it's as simple as specifying the accept value in cffile. Unfortunately, the CF8 docs don't specify the value as part of the info for cffile. It's under file management.
<cffile action="upload" filefield="filename" destination="#destination#" accept="text/csv">
CF8 » Controlling the type of file uploaded