How to read encrypted and gzipped blob data from u-sql - azure-data-lake

I would like to read file from a blob that is first compressed (gz) and then encrypted. The encryption done using Azure SDK when file uploaded to Blob (BlobEncryptionPolicy passed to CloudBlockBlob.UploadFromStreamAsync method).
There blob file have .gz extension so U-SQL trying to decompress but fails as the file is encrypted.
Is it possible to set my u-sql script to handle the decompression automatically same as done by Azure SDK (for instance by CloudBlockBlob.BeginDownloadToStream)?
If not and I need to use custom extractor, how can I prevent the U-SQL from trying to decompress the stream automatically?

The decompression is automatically triggered by the ".gz" extension. So you would have to rename the document. Also, please note that you cannot call to any external resource to decrypt from within your user-code. You will have to pass all keys as parameters to the custom extractor.
Finally, if you store the data in ADLS, you get transparent encryption of the data and it makes the whole thing a lot easier. Why are you storing it in Windows Azure Blob Storage instead?

Related

airbyte ETL ,connection between http API source and big query

i have a task in hand, where I am supposed to create python based HTTP API connector for airbyte. connector will return a response which will contain some links of zip files.
each zip file contains csv file, which is supposed to be uploaded to the bigquery
now I have made the connector which is returning the URL of the zip file.
The main question is how to send the underlying csv file to the bigquery ,
i can for sure unzip or even read the csv file in the python connector, but i am stuck on the part of sending the same to the bigquery.
p.s if you guys can tell me even about sending the CSV to google cloud storage, that will be awesome too
When you are building an Airbyte source connector with the CDK your connector code must output records that will be sent to the destination, BigQuery in your case. This allows to decouple extraction logic from loading logic, and makes your source connector destination agnostic.
I'd suggest this high level logic in your source connector's implementation:
Call the source API to retrieve the zip's url
Download + unzip the zip
Parse the CSV file with Pandas
Output parsed records
This is under the assumption that all CSV files have the same schema. If not you'll have to declare one stream per schema.
A great guide, with more details on how to develop a Python connector is available here.
Once your source connector outputs AirbyteRecordMessages you'll be able to connect it to BigQuery and chose the best loading method according to your need (Standard or GCS staging).

Download big number of files (400k) from S3 bucket into Azure Datalake Gen2 using Azure Data Factory

I need to download a big number of files (around 400k) files from an S3 bucket. I have the paths stored in a csv file. Some of the paths may not exist.
The two options i see are:
Use the foreach activity and somehow pass the contents of the file there. But i think that this would flood my monitor pane with a huge number of runs, and it feels like it is meant to be for smaller pipelines.
Use the listOfFiles option which is supported in the S3 source. The problem with this approach is that the list must be in the S3 bucket and cannot be loaded from Azure Datalake Gen2 (anybody knows why, please let me know as well).
I have tried using the listOfFiles way, but the pipeline fails once it finds the first missing file. The fault tolerance options contain a "skip missing file" option but it is defined as "Skip the files if it is being deleted from source store during the data movement", so it is of no use to me.
I don't want to download more files than needed, so copying the bucket as-is is not an option. How can i approach this issue with ADF? I'm looking for a solution that uses the predefined transformations, ideally i would like to not involve Azure Batch or Azure Functions for such a simple task.

What is the best way or tool to open to read a raw clickstream blob storage data in azure

I have clickstream blob storage about 800mb average file sizes and when i open the file it defaults to text file. How do i open and read the data possibly json format or column format. I would also like to understand if i can build an API to consume that data. I recently built an azure function app http trigger but the file is too large to open it up and the function times out. So any suggestion on those two would be appreciated
Thank you

Generate A Large File Inside s3 with .NET

I would to generate a big file (several TB) with special format using my C# logic and persist it to S3. What is the best way to do this. I can launch a node in EC2 and then write the big file into EBS and then upload the file from the EBS into S3 using the S3 .net Clinent library.
Can I stream the file content as I am generating in my code and directly stream it to S3 until the generation is done specially for such large file and out of memory issues. I can see this code help with stream but it sounds like the stream should have already filled up with. I obviously can not put such a mount of data to memory and also do not want to save it as a file to the disk first.
PutObjectRequest request = new PutObjectRequest();
request.WithBucketName(BUCKET_NAME);
request.WithKey(S3_KEY);
request.WithInputStream(ms);
s3Client.PutObject(request);
What is my best bet to generate this big file ans stream it to S3 as I am generating it?
You certainly could upload any file up to 5 TB that's the limit. I recommend using the streaming and multipart put operations. Uploading a file 1TB could easily fail in the process and you'd have to do it all over, break it up into parts when you're storing it. Also you should be aware that if you need to modify the file you would need to download the file, modify the file and re-upload. If you plan on modifying the file at all i recommend trying to split it up into smaller files.
http://docs.amazonwebservices.com/AmazonS3/latest/dev/UploadingObjects.html

query regarding cloud file storage services- can i append data to an existing file

I am working to create an application where some files will be stored in Amazon S3/Rackspace Cloud Files/other similar cloud file storage providers.
There are a couple of scenarios where it would be easier for me, if I could append data to an existing file... Is this possible? Or do I have to download the file from Amazon S3, then append data to it, and finally upload the modified file back to Amazon S3?
There is no way to append anything to existing files in S3.
You will have to download it and upload it again after modifying.
If you wish though, you can always upload the new data with a tag (a timestamp or a counter), e.g. file_201201011344. So when reading files, you get all files mactching your pattern and append them on the client side.