File MIME Type Detection - file-upload

I need to identify the MIME type of an uploaded file. The file is uploaded through the browser and read by the server asynchronously in chunks. Is it possible to detect the file's MIME type using Apache Tika from the first few chunks, rather than needing the whole file's data?
Is there any other tool that can detect the MIME type from the first few chunks?
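
Detection from the leading bytes is usually feasible, since most detectors rely on magic numbers near the start of the file; Tika's detect() methods even accept a byte-array prefix of the document for exactly this case. As a rough sketch of the "other tool" route, assuming the python-magic binding to libmagic and an arbitrary 8 KB first chunk:

    import magic  # python-magic, a binding to libmagic

    def detect_mime(first_chunk: bytes) -> str:
        # libmagic matches magic numbers, so a few KB from the start of
        # the upload is normally enough to identify the MIME type.
        return magic.from_buffer(first_chunk, mime=True)

    # Stand-in for the first chunk received from the browser upload.
    with open("upload.pdf", "rb") as f:
        print(detect_mime(f.read(8192)))  # e.g. "application/pdf"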

Related

Consume gzip files with Databricks Auto Loader

I am currently unable to find a direct way to load .gz files via Auto Loader. I can load the files as binary content, but I cannot extract the compressed XML files and process them further in a streaming way.
Therefore, I would like to know whether there is a way to consume the content of a gzip file via Databricks Auto Loader.
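
One workaround is to let Auto Loader ingest the raw bytes and decompress them yourself inside the stream. A minimal sketch, assuming a Databricks notebook (where spark is predefined) and a hypothetical /mnt/raw landing path:

    import gzip
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Auto Loader in binaryFile mode streams each matching file's
    # raw bytes in a "content" column.
    raw = (spark.readStream
           .format("cloudFiles")
           .option("cloudFiles.format", "binaryFile")
           .option("pathGlobFilter", "*.gz")
           .load("/mnt/raw/"))

    @udf(StringType())
    def gunzip(content):
        # Decompress the gzipped payload and decode the embedded XML as text.
        return gzip.decompress(content).decode("utf-8")

    xml = raw.withColumn("xml", gunzip("content"))

From there, the XML column can be parsed with whatever downstream logic the stream needs.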

Amazon S3 PDF files have wrong content type after WinSCP upload

Yesterday I transferred ~1500 files to our Amazon S3 bucket using WinSCP. I have a problem when downloading PDF files, since they are returned without an extension. The cause seems to be the wrong content type being present on the file in the S3 bucket. When I update the content type manually, the file is downloaded correctly. Since I have so many files, manually changing the metadata is not an option.
The interesting part is that Excel files do have the correct content type. Looking into this difference, it seems the machine I upload from understands the Excel file format, since the file is shown in WinSCP with an Excel icon. The PDF files have just a plain white icon.
Does anyone know how to solve the content type problem? Does it depend on the host computer understanding the file type so it can set the correct content type automatically?
After installing a PDF reader client, in my case Adobe Acrobat Reader, and uploading the files again, the content type is set correctly.
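
If re-uploading everything is not practical, the content type can also be repaired in place. A sketch using boto3 (bucket name hypothetical): copy_object with MetadataDirective="REPLACE" rewrites each object's metadata server-side, without re-transferring the data.

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"  # hypothetical bucket name

    # Walk the bucket and rewrite the Content-Type of every PDF in place.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.lower().endswith(".pdf"):
                s3.copy_object(
                    Bucket=bucket,
                    Key=key,
                    CopySource={"Bucket": bucket, "Key": key},
                    ContentType="application/pdf",
                    MetadataDirective="REPLACE",  # required to change metadata on copy
                )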

Proper MIME media type for TAR files

When working with TAR files, I've run across the MIME type
application/x-tar
However, I am not sure it is correct, because
MIME types in the x- namespace are considered experimental
which seems odd for something as venerable as TAR.
Are there any recent RFCs about MIME and TAR?
And what about widespread HTTP server and browser support?
The MIME type for an uncompressed TAR file, as mentioned on Wikipedia, is application/x-tar.
Refer:
https://en.wikipedia.org/wiki/List_of_archive_formats
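
Common tooling agrees with the x- prefixed name; Python's standard mimetypes module, for instance, maps the .tar extension to exactly this type:

    import mimetypes

    # The standard library's extension map uses the x- prefixed type.
    print(mimetypes.guess_type("archive.tar"))  # ('application/x-tar', None)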

Verify MIME type in PDF file upload in DRUPAL

How do I verify the MIME type of a PDF file in a Drupal file upload?
Issue: anyone can upload a script file just by renaming it or adding an extension (e.g. script.php.pdf).
I have implemented the MIME type check for image uploads (as it is a separate module), but I can't figure out where to validate the MIME type of PDF files.
Code for the image MIME type check:

    // Allowed image MIME types for the upload field.
    $supported_mime = array('image/jpg', 'image/jpeg', 'image/png', 'image/gif');
    // The validator takes the allowed types joined with '::'.
    $elements[$delta]['#upload_validators']['file_validate_mime_type'][0] = implode('::', $supported_mime);
Code for additional validation should be placed in your hook_file_validate() function:
https://api.drupal.org/api/drupal/modules%21system%21system.api.php/function/hook_file_validate/7.x
However, it seems very unlikely that just renaming files (hiding the real extension) can do the trick and fool Drupal. Even if a PHP file is uploaded with a .pdf extension, it is not going to be executed.
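
That said, if you do want to reject renamed files at upload time, a content-based test is what catches them; for PDFs the leading magic bytes are enough. A language-independent sketch of the idea (shown in Python for illustration):

    def looks_like_pdf(first_chunk: bytes) -> bool:
        # Genuine PDFs begin with the "%PDF-" magic bytes, so a renamed
        # script.php.pdf fails this check even though its extension matches.
        return first_chunk.startswith(b"%PDF-")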

How can I upload a gzipped JSON file to BigQuery via the HTTP API?

When I try to upload an uncompressed JSON file, it works fine; but when I try a gzipped version of the same JSON file, the job fails with a lexical error caused by a failure to parse the JSON content.
I gzipped the JSON file with the gzip command on Mac OS X 10.8, and I have set the sourceFormat to "NEWLINE_DELIMITED_JSON".
Did I do something incorrectly, or should gzipped JSON files be processed differently?
I believe that it is not possible to submit binary data (such as the compressed file) using a multipart/related request. However, if you don't want to use uncompressed data, you may be able to use resumable upload.
What language are you coding in? The Python jobs.insert() API takes a media upload parameter, which you should be able to give a filename in order to do a resumable upload (which sends your job metadata and new table data as separate streams). I was able to use this to upload a compressed file.
This is what bq.py uses, so you could look at its source code.
If you aren't using Python, the Google APIs client libraries for other languages should have similar functionality.
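
As a rough illustration with the current google-cloud-bigquery Python client (table and file names hypothetical), load_table_from_file streams a local file into a load job over a resumable upload:

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )

    # Stream the local gzipped file; the client performs a resumable upload.
    with open("data.json.gz", "rb") as f:
        job = client.load_table_from_file(
            f, "my_dataset.my_table", job_config=job_config
        )
    job.result()  # block until the load job finishes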
You can upload gzipped files to Google Cloud Storage, and BigQuery will be able to ingest them with a load job:
https://developers.google.com/bigquery/loading-data-into-bigquery#loaddatagcs
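
With the same client, the GCS route is a one-call load job; BigQuery infers the gzip compression from the .gz source object (URI and table name hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )

    # BigQuery detects the gzip compression of the source object itself.
    job = client.load_table_from_uri(
        "gs://my-bucket/data.json.gz",
        "my_dataset.my_table",
        job_config=job_config,
    )
    job.result()  # block until the load job finishes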