I have a requirement to compress Data Lake files to GZ (gzip). I have seen that outputters in U-SQL do not directly support this. In my earlier post, How to preprocess and decompress .gz file on Azure Data Lake store?, Michael Rys mentioned that automatic compression capability is currently on the roadmap. Does anyone have an idea on implementing custom code to achieve this?
Compressed output is now available in preview:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_Summer/USQL_Release_Notes_2017_Summer.md#automatic-gzip-compression-on-output-statement-is-now-in-preview-opt-in-statement-is-provided
I have a task at hand where I am supposed to create a Python-based HTTP API connector for Airbyte. The connector will return a response which will contain links to some zip files.
Each zip file contains a CSV file, which is supposed to be uploaded to BigQuery.
So far I have made the connector, which is returning the URL of the zip file.
The main question is how to send the underlying CSV file to BigQuery.
I can certainly unzip or even read the CSV file in the Python connector, but I am stuck on the part of sending it to BigQuery.
P.S. If you can also tell me about sending the CSV to Google Cloud Storage, that would be awesome too.
When you are building an Airbyte source connector with the CDK, your connector code must output records that will be sent to the destination, BigQuery in your case. This decouples extraction logic from loading logic and makes your source connector destination-agnostic.
I'd suggest this high-level logic in your source connector's implementation (a rough sketch follows the list):
Call the source API to retrieve the zip files' URLs
Download + unzip the zip
Parse the CSV file with Pandas
Output parsed records
This is under the assumption that all CSV files have the same schema; if not, you'll have to declare one stream per schema.
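To make that concrete, here is a minimal, self-contained sketch of steps 1-4. The endpoint URL, the response field holding the zip links, and the CSV layout are placeholders, since I don't know your API; inside a CDK stream you would yield these dicts from read_records/parse_response so each one is wrapped in an AirbyteRecordMessage.
import io
import zipfile
from typing import Any, Dict, Iterable

import pandas as pd
import requests

API_URL = "https://example.com/api/exports"  # placeholder endpoint
ZIP_LINKS_FIELD = "links"                    # placeholder field holding the zip URLs

def extract_records() -> Iterable[Dict[str, Any]]:
    """Download each zip, read the CSV inside it, and yield one dict per row."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    for zip_url in response.json().get(ZIP_LINKS_FIELD, []):
        archive_bytes = requests.get(zip_url, timeout=120).content
        with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
            for name in archive.namelist():
                if not name.lower().endswith(".csv"):
                    continue
                with archive.open(name) as csv_file:
                    df = pd.read_csv(csv_file)
                # Each dict becomes the `data` payload of an AirbyteRecordMessage.
                yield from df.to_dict(orient="records")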
A great guide, with more details on how to develop a Python connector, is available here.
Once your source connector outputs AirbyteRecordMessages, you'll be able to connect it to the BigQuery destination and choose the best loading method according to your needs (Standard or GCS staging).
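On the P.S.: the GCS staging loading method of the BigQuery destination handles the upload to Google Cloud Storage for you, but if you ever need to push a CSV to GCS yourself, a minimal sketch with the google-cloud-storage client looks like this (bucket and object names are placeholders):
from google.cloud import storage  # pip install google-cloud-storage

def upload_csv_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload a local CSV file to a GCS bucket."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path, content_type="text/csv")

upload_csv_to_gcs("export.csv", "my-staging-bucket", "exports/export.csv")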
I have clickstream data in blob storage, with file sizes of about 800 MB on average, and when I open a file it defaults to a text file. How do I open and read the data, possibly in JSON or columnar format? I would also like to understand whether I can build an API to consume that data. I recently built an Azure Function app with an HTTP trigger, but the file is too large to open and the function times out. Any suggestions on those two points would be appreciated.
Thank you
I would like to read a file from a blob that is first compressed (gz) and then encrypted. The encryption is done using the Azure SDK when the file is uploaded to the blob (a BlobEncryptionPolicy is passed to the CloudBlockBlob.UploadFromStreamAsync method).
The blob file has a .gz extension, so U-SQL tries to decompress it but fails because the file is encrypted.
Is it possible to set up my U-SQL script to handle the decryption and decompression automatically, the same as is done by the Azure SDK (for instance by CloudBlockBlob.BeginDownloadToStream)?
If not, and I need to use a custom extractor, how can I prevent U-SQL from trying to decompress the stream automatically?
The decompression is automatically triggered by the ".gz" extension, so you would have to rename the document. Also, please note that you cannot call out to any external resource to decrypt from within your user code; you will have to pass all keys as parameters to the custom extractor.
Finally, if you store the data in ADLS, you get transparent encryption of the data and it makes the whole thing a lot easier. Why are you storing it in Windows Azure Blob Storage instead?
According to the BigQuery federated source documentation:
[...]or are compressed must be less than 1 GB each.
This would imply that compressed files are supported types for federated sources in BigQuery.
However, I get an error when trying to query a gz file in GCS. I tested with an uncompressed file and it works fine. Are compressed files supported as federated sources in BigQuery, or have I misinterpreted the documentation?
Compression mode defaults to NONE and needs to be explicitly specified in the external table definition.
At the time of the question, this couldn't be done through the UI. This is now fixed and compressed data should be automatically detected.
For more background information, see:
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query
The interesting parameter is "configuration.query.tableDefinitions.[key].compression".
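One way to set it programmatically is with the Python client library; here is a minimal sketch (bucket, file, and table names are placeholders):
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# External (federated) table definition pointing at a gzipped CSV in GCS.
ext = bigquery.ExternalConfig("CSV")
ext.source_uris = ["gs://my-bucket/data.csv.gz"]  # placeholder URI
ext.compression = "GZIP"   # defaults to NONE, so .gz files fail without it
ext.autodetect = True
ext.options.skip_leading_rows = 1

# Query the file directly, the client-library equivalent of setting
# configuration.query.tableDefinitions.[key].compression in the REST API.
job_config = bigquery.QueryJobConfig(table_definitions={"my_gz_table": ext})
rows = client.query("SELECT COUNT(*) FROM my_gz_table", job_config=job_config).result()
print(list(rows))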
I would like to generate a big file (several TB) in a special format using my C# logic and persist it to S3. What is the best way to do this? I could launch a node in EC2, write the big file to EBS, and then upload the file from EBS to S3 using the S3 .NET client library.
Can I stream the file content as I am generating it in my code and send it directly to S3 until the generation is done, especially for such a large file and to avoid out-of-memory issues? I can see that this code helps with a stream, but it sounds like the stream should already have been filled up. I obviously cannot put such an amount of data in memory, and I also do not want to save it to disk as a file first.
// Uploads a single object from an in-memory stream (ms), so the whole payload
// has to be buffered in memory before PutObject is called.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file and stream it to S3 as I am generating it?
You can certainly upload any file up to 5 TB; that's the limit. I recommend using the streaming and multipart put operations (see the sketch after the link below). Uploading a 1 TB file could easily fail partway through and you'd have to do it all over, so break it up into parts when you're storing it. Also be aware that if you need to modify the file, you would need to download it, modify it, and re-upload it. If you plan on modifying the file at all, I recommend trying to split it up into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html
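To make the multipart idea concrete, here is a minimal sketch, in Python with boto3 purely for illustration (bucket name, key, and the chunk generator are placeholders): it uploads data part by part as it is generated, without ever holding the whole file in memory. The AWS SDK for .NET exposes the same low-level operations (InitiateMultipartUpload, UploadPart, CompleteMultipartUpload) as well as a higher-level TransferUtility that can stream for you.
import boto3  # pip install boto3

BUCKET = "my-bucket"          # placeholder
KEY = "generated/huge.bin"    # placeholder
PART_SIZE = 64 * 1024 * 1024  # every part except the last must be at least 5 MB

def generate_chunks():
    """Placeholder for the generation logic: yield bytes as they are produced."""
    for _ in range(10):
        yield bytes(PART_SIZE)

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []
try:
    for number, chunk in enumerate(generate_chunks(), start=1):
        resp = s3.upload_part(
            Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
            PartNumber=number, Body=chunk,
        )
        parts.append({"ETag": resp["ETag"], "PartNumber": number})
    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # Abort so the partially uploaded parts do not keep accruing storage charges.
    s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"])
    raise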