How can I check whether an input file is compressed (ZIP) or not? - pentaho

How can I check whether an input file is compressed (ZIP) or not?
Is the solution to read the file info using the "Get File Names" step and check the extension field?

1. Use the "file" command if you're on Unix.
2. If not, install Cygwin and go to 1.
If this is related to your other question about conditionally reading different files, then I would consider getting your files into a consistent format first, i.e. all compressed.
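If the "file" command is not available, the same kind of check can be approximated by looking at the first bytes of the file rather than its extension. The following is a minimal Python sketch, not a PDI feature: the function name detect_compression is made up for illustration, and it only recognises the standard ZIP and gzip signatures.

```python
# Leading-byte signatures: ZIP local header, empty ZIP archive, spanned ZIP archive.
ZIP_MAGIC = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")
GZIP_MAGIC = b"\x1f\x8b"


def detect_compression(path):
    """Return 'zip', 'gzip', or None, judged from the file's first bytes.

    Hypothetical helper for illustration; the file extension is never consulted.
    """
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    return None
```

Calling detect_compression on each incoming file before the transformation runs would let you route compressed and plain files differently, even when an extension is missing or wrong.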

Related

Hive output to xlsx

I am not able to open an .xlsx file. Is this the correct way to output the result to an .xlsx file?
hive -f hiveScript.hql > output.xlsx
hive -S -f hiveScript.hql > output.xls
This will work.
There is no easy way to create an Excel (.xlsx) file directly from Hive. You could write your query's output to an older Excel format (.xls) as shown in the other answers, and it would open in Excel properly (with an initial warning in the latest versions of Office), but in essence it is just a text file with an .xls extension. If you open this file with any text editor, you will see the contents of the query output.
Take any .xlsx file on your system and open it with a text editor and see what you get. It will be all junk characters since that is not a simple text file.
Having said that, there are many programming languages that allow you to read a text file and convert it to xlsx. Since no information on this was provided or requested, I will not go into details. However, you may use pandas in Python to create Excel files.
I output a CSV or TSV file and then used Python (the pandas library) to do the conversion.
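The conversion mentioned in these answers could look roughly like the sketch below. It assumes the Hive CLI output is tab-separated with a header row as its first line (headers are only printed when enabled in Hive), and that pandas plus an Excel writer such as openpyxl are installed. The function names are made up for illustration.

```python
import csv
import io


def parse_hive_tsv(text):
    """Split tab-separated query output into rows of strings, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def write_xlsx(rows, xlsx_path):
    """Write rows (first row assumed to be the header) to a real .xlsx workbook."""
    # Imported here so the parsing helper works without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(rows[1:], columns=rows[0])
    frame.to_excel(xlsx_path, index=False)
```

A typical use would be to redirect the query to a plain text file first (for example hive -S -f hiveScript.hql > output.tsv), read that file, and pass its contents through parse_hive_tsv and write_xlsx. Every value is written as text in this sketch; converting column types is left out.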
I am away from my setup right now so really cannot test this. But you can give this a try in your hive shell:
hive -f hiveScript.hql >> output.xls

In Kettle, reading a CSV file from a tar.gz file with Text File Input didn't work. What might be wrong?

I have a CSV file that is tarred and gzipped, so I have test.tar.gz.
I would like to read the CSV file through Text File Input.
I tried tar:gz:file://C:/test/test.tar.gz!/test.tar! with a wildcard like ".*\.csv".
But sometimes it cannot read the file successfully.
It throws this exception:
org.apache.commons.vfs.FileNotFolderException:
Could not list the contents of
"tar:gz:file:///C:/test/test.tar.gz!/test.tar!/"
because it is not a folder.
I use Windows 8.1 and PDI 5.2.
What might be wrong?
When reading a compressed CSV file, the "Text File Input" step in Pentaho Kettle only supports the first file inside the compressed archive (either a Zip or GZip file). Check the compression section of the Pentaho Wiki.
Now, for your issue, try removing the wildcard entry, since only the first file inside the zip/gzip file will be read (as explained above).
I have posted sample code that reads both zip and gzip files. Check it here.
Hope it helps :)
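Since only the first entry is said to be read, it may help to confirm what the archive actually contains and in which order. This is a small Python sketch for checking that outside PDI; the helper name list_csv_members is made up, and it assumes a gzip-compressed tar file like the test.tar.gz from the question.

```python
import tarfile


def list_csv_members(archive_path):
    """Return the names of regular .csv files in a .tar.gz archive, in archive order."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".csv")
        ]
```

If the CSV you expect is missing from the list, or sits in a subfolder, the path after the "!" in the VFS URL is the first thing to double-check.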

Regular expression for getting a file

I have to get a file through PDI based on its filename. I want to select files whose name matches the pattern eligible_for_push, which has to be at the end of the name. The file can be .txt or .csv.
Please help.
Thanks
There are two parts to your query:
1. Finding all files ending with "eligible_for_push":
I am not aware of a way to find this sort of pattern with a regex alone, so as an alternative do the following:
Get all the files in the path using the "Get File Names" step. Use a "Modified Java Script Value" step to find the files ending with the above pattern. Check the JS file below.
2. Files can be ".txt" or ".csv":
You can use the regex/wildcard below to choose between .txt and .csv:
.*\.txt|.*\.csv
Note: Use this regex once you have selected the files ending with "eligible_for_push". The JS above ignores all the file patterns. After that, use the second step to pick out the .txt or .csv files.
Hope it helps :)
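For comparison, both conditions can also be written as one regular expression. The Python sketch below only illustrates the pattern; it assumes "at the end" means the base name ends with eligible_for_push just before the extension, and it has not been tried in the wildcard field of a PDI step. The names PATTERN and select_files are made up.

```python
import re

# Assumed rule: anything, then "eligible_for_push", then ".txt" or ".csv" and nothing else.
PATTERN = re.compile(r".*eligible_for_push\.(txt|csv)")


def select_files(names):
    """Keep only the file names that match the whole pattern."""
    return [name for name in names if PATTERN.fullmatch(name)]
```

With this rule, a name such as daily_eligible_for_push.csv is kept, while eligible_for_push_old.csv and report.txt are dropped.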

load script from other file extension?

Is it possible to load a module from a file with an extension other than .lua?
require("grid.txt") results in:
module 'grid.txt' not found:
no field package.preload['grid.txt']
no file './grid/txt.lua'
no file '/usr/local/share/lua/5.1/grid/txt.lua'
no file '/usr/local/share/lua/5.1/grid/txt/init.lua'
no file '/usr/local/lib/lua/5.1/grid/txt.lua'
no file '/usr/local/lib/lua/5.1/grid/txt/init.lua'
no file './grid/txt.so'
no file '/usr/local/lib/lua/5.1/grid/txt.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
no file './grid.so'
no file '/usr/local/lib/lua/5.1/grid.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
I suspect that it's somehow possible to load the script into package.preload['grid.txt'] (whatever that is) before calling require?
It depends on what you mean by load.
If you want to execute the code in a file named grid.txt in the current directory, then just do dofile"grid.txt". If grid.txt is in a different directory, give a path to it.
If you want to use the path search that require performs, then add a template for .txt to package.path with the correct path (for example, package.path = package.path .. ";./?.txt") and then do require"grid". Note the absence of a suffix: require loads modules identified by names, not by paths.
If you want require("grid.txt") to work should someone try that then yes, you'll need to manually loadfile and run the script and put whatever it returns (or whatever require is documented to return when the module doesn't return anything) into package.loaded["grid.txt"].
Alternatively, you could write your own loader just for entries like this and set it into package.preload["grid.txt"], where it finds and loads/runs the file. More generically, you could write yourself a loader function, insert it into package.loaders (package.searchers in Lua 5.2 and later), and then let it do its job whenever it sees a "*.txt" module come its way.

How can I determine if a file upload is a valid CSV file - or at least text - in ColdFusion 8?

I have a form which allows a user to upload a file to the server. How can I validate that the uploaded file is in fact the expected format (CSV, or at least validate that it is a text file) in ColdFusion 8?
For simple formats like CSV, just check yourself, for example via regex.
<cffile action="read" file="#uploadedFile#" variable="contents" charset="UTF-8">
<cfset LooksLikeCSV = REFind("^([^;]*;)+[^;]*$", contents)>
You can place additional checks with regard to file size limits or forbidden characters.
For other file formats, you can check for header signatures that occur in the first few bytes of the file.
You could even write a full parser for your expected file format. For CSV validation, you could do a ListToArray() at CR/LF and check each line individually against a regex. XML should be pretty straightforward as well: just try to pass it to XmlParse(). Binary formats like images are a little more difficult, but libraries exist for those as well.
I don't know if it will help you, but Ben Nadel wrote some excellent posts about CSV:
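The line-by-line idea can be sketched as follows. This is a Python illustration of the technique rather than ColdFusion code: it reuses the semicolon pattern from the answer above, and the helper name invalid_lines is made up.

```python
import re

# Same shape as the regex in the answer: at least one ';'-terminated field, then a last field.
LINE_PATTERN = re.compile(r"([^;]*;)+[^;]*")


def invalid_lines(contents):
    """Return the 1-based numbers of non-empty lines that do not look like CSV rows."""
    bad = []
    for number, line in enumerate(contents.splitlines(), start=1):
        if line and not LINE_PATTERN.fullmatch(line):
            bad.append(number)
    return bad
```

An empty result means every non-empty line contained at least one semicolon-separated field pair. It does not check quoting or that all rows have the same number of fields.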
http://www.bennadel.com/blog/483-Parsing-CSV-Data-Using-ColdFusion.htm
http://www.bennadel.com/blog/976-Regular-Expressions-Make-CSV-Parsing-In-ColdFusion-So-Much-Easier-And-Faster-.htm
http://www.bennadel.com/blog/501-Parsing-CSV-Values-In-To-A-ColdFusion-Query.htm
I think it's as simple as specifying the accept value in cffile. Unfortunately, the CF8 docs don't list that value in the reference for cffile; it's covered under file management.
<cffile action="upload" filefield="filename" destination="#destination#" accept="text/csv">
CF8 » Controlling the type of file uploaded