I found several defintions for a gzipped tarbar (.tgz or .tar.gz), which is the correct one, is there one?
application/x-gtar (Wikipedia, Some bugtracker)
application/x-tar-gz (Forum, Python)
Didn't found a match in the official list.
Try following RegExp for all MIME type compressed files
application/(zip|gzip|tgz|tar|tar+gzip|x-tgz|x-tar|x-gtar|x-tar-gz|x-gzip|x-tar+gzip|x-xz-compressed-tar)
Related
I would like to use pandas.read_csv to open a gzip file (.asc.gz) within a zipped directory (.zip). Is there an easy way to do this?
This code doesn't work:
csv = pd.read_csv(r'C:\folder.zip\file.asc.gz') // can't find the file
This code does work (however, it requires me to unzip the folder, which I want to avoid because my dataset currently contains thousands of zipped folders):
csv = pd.read_csv(r'C:\folder\file.asc.gz')
Is there an easy way to do this? I have tried using a combination of zipfile.Zipfile and read_csv, but have been unsuccessful (I think partly due to the fact that this is an ascii file as well)
Maybe the followings might help.
df = pd.read_csv('filename.gz', compression='gzip')
OR
import gzip
file=gzip.open('filename.gz','rb')
content=file.read()
I have a csv file that is tared and zipped. So I have test.tar.gz.
I would like, through text file input, read csv file.
I try this tar:gz:file://C:/test/test.tar.gz!/test.tar! use wildcard like ".*\.csv".
But it sometime can't read success.
It throws Exception
org.apache.commons.vfs.FileNotFolderException:
Could not list the contents of
"tar:gz:file:///C:/test/test.tar.gz!/test.tar!/"
because it is not a folder.
I use windows8.1, pdi 5.2
Where it might be wrong?
For a compressed file csv reading, "Text File Input" step in Pentaho Kettle only supports the first files inside the compressed folder(either in Zip/GZip file). Check the Pentaho Wiki in the compression section.
Now for your issue, try removing the wildcard entry since only the first file inside the zip/gzip file will be read. (as explained above)
I have placed a sample code containing both reading zip and gzip files. Check it here.
Hope it helps :)
I have just installed a new Exist-db and I'm willing to use it to parse XML files that are actually compressed in gzip.
It is my understanding that exist-db has the cappability to perform this kind of operations, but I keep getting the error MIME type invalid.
I've added a new MIME type in the mime-types.xml file with the following parameters:
<mime-type name="application/zip" type="binary">
<description>GZIP archive</description>
<extensions>.gz</extensions>
</mime-type>
But I keep getting the same reading error.
Could somebody point me in the right direction? Am I missing something?
Thanks!
G.
eXist-db can only work on XML data that has been parsed and processed (and indexed) into the eXist-db internal storage format. This means that the data needs to be decompressed before it can be queried; A GZIPped XML document stored in the database is considered to be "a binary blob' and cannot be queried.
When the GZIP file is stored in the database, you can use the compression:unzip() function (link) to uncompress the document. The document can then be stored in the database.
How do I can check an input file is compressed (ZIP) or not ?.
Is the solution to read the file info using "Get File Names" step and check the extension field ?
Use the "file" command if you're on Unix.
If not install cygwin and goto 1.
If this is related to your other question about conditionally reading different files then I would consider getting your files into a consistent format first. i.e. all compressed.
I have a form which allows a user to upload a file to the server. How can I validate that the uploaded file is in fact the expected format (CSV, or at least validate that it is a text file) in ColdFusion 8?
For simple formats like CSV, just check yourself, for example via regex.
<cffile action="read" file="#uploadedFile#" variable="contents" charset="UTF-8">
<cfset LooksLikeCSV = REFind("^([^;]*;)+[^;]*$", contents)>
You can place additional checks with regard to file size limits or forbidden characters.
For other file formats, you can check for header signatures that occur in the first few bytes of the file.
You could even write a full parser for your expected file format - for CSV validation, you could do a ListToArray() at CR/LF and check each line individually against a regex. XML should work pretty straightforward as well - just try to pass it to XmlParse(). Binary formats like images are a little more difficult, but libraries exist there as well.
I dont know if it can help you but Ben Nadel wrote excellents posts about CSV:
http://www.bennadel.com/blog/483-Parsing-CSV-Data-Using-ColdFusion.htm
http://www.bennadel.com/blog/976-Regular-Expressions-Make-CSV-Parsing-In-ColdFusion-So-Much-Easier-And-Faster-.htm
http://www.bennadel.com/blog/501-Parsing-CSV-Values-In-To-A-ColdFusion-Query.htm
I think it's as simple as specifying the accept value in cffile ...Unfortunately the CF8 docs don't specify the value as part of the info for cffile ... It's under file management ...
<cffile action=”upload” filefield=”filename” destination=”#destination#” accept=”text/csv”>
CF8 » Controlling the type of file uploaded