Copy $Logs container data to another Blob location - azure-storage

We have a requirement to merge the $logs containers of all our storage accounts into a single blob location.
I have heard we can use AzCopy, but is there any other simple way to achieve this?

You can use Azure Data Factory to achieve this.
The $logs container, which is created automatically when Storage Analytics is enabled for a storage account, is not displayed when a container listing operation is performed through the Data Factory UI. The file path must be provided directly for Data Factory to consume files from the $logs container.
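The same point applies outside of Data Factory: $logs is not returned by container listings, but it can be opened directly by name. Below is a minimal sketch using the azure-storage-blob Python package; the account names, keys, and destination container are placeholders, and the download/re-upload approach is just one simple way to merge the logs, not a prescribed solution.

from azure.storage.blob import BlobServiceClient

# Placeholder source accounts and keys (assumptions for illustration).
source_accounts = {
    "account1": "<account1-key>",
    "account2": "<account2-key>",
}

# Single destination container that will hold the merged logs (placeholder).
dest = BlobServiceClient(
    account_url="https://destaccount.blob.core.windows.net",
    credential="<dest-key>",
)
dest_container = dest.get_container_client("mergedlogs")

for account, key in source_accounts.items():
    svc = BlobServiceClient(
        account_url=f"https://{account}.blob.core.windows.net",
        credential=key,
    )
    # $logs is hidden from container listings, but can be addressed by name.
    logs = svc.get_container_client("$logs")
    for blob in logs.list_blobs():
        data = logs.get_blob_client(blob.name).download_blob().readall()
        # Prefix the blob name with the account so merged logs stay distinguishable.
        dest_container.upload_blob(
            name=f"{account}/{blob.name}", data=data, overwrite=True
        )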

Related

Query blob storage with Get-AzDataLakeGen2ChildItem?

Our PowerShell test harness used to use Get-AzDataLakeGen2ChildItem to list blobs in non-Data Lake storage accounts. Today I updated the PowerShell and Az module versions they were locked to, and now when issuing the command (specifying a filesystem container and a context), the following error is returned:
Get-AzDataLakeGen2ChildItem: Input string was not in a correct format.
I'm assuming something has changed and this cmdlet can no longer handle results from non-Data Lake storage.
For one reason or another, we moved away from Get-AzStorageBlob a while back, so I'm interested to know whether there is any way to keep working with this call rather than having to deviate from Get-AzDataLakeGen2ChildItem where required.
One workaround is to list the subdirectories and files in a directory or filesystem of an Azure storage account with Get-AzDataLakeGen2ChildItem, which requires the storage account to have Hierarchical Namespace enabled; with it enabled, the cmdlet returns the child items as expected.
NOTE: If you are using an existing storage account that does not have Hierarchical Namespace enabled, you need to upgrade that storage account to enable it first.
For more information, please refer to the Microsoft documentation for Get-AzDataLakeGen2ChildItem and Get-AzStorageBlob, and to the Stack Overflow thread covering a similar issue.
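For comparison, the same kind of child-item listing is available outside PowerShell once Hierarchical Namespace is enabled. Below is a minimal sketch using the azure-storage-file-datalake Python package; the account name, key, and filesystem name are placeholders.

from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account details; requires Hierarchical Namespace to be enabled.
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential="<account-key>",
)
filesystem = service.get_file_system_client("myfilesystem")

# Recursively enumerate directories and files (the equivalent child-item listing).
for path in filesystem.get_paths(path=None, recursive=True):
    kind = "dir " if path.is_directory else "file"
    print(kind, path.name)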

With Azure table storage, how can I fall back to a secondary storage account?

I am developing an Azure application (C#, .NET 6, ASP.NET Core) that uses Azure blob storage as well as table storage.
I have geo-redundancy enabled on my storage account (RA_GRS) so that if my main storage account goes down, a read-only copy will be available in another Azure region.
When reading from blob storage, as far as I understand, I should be able to get it to automatically fall back to the secondary address by setting the GeoRedundantSecondaryUri property like this (using the Azure.Storage.Blobs NuGet, version 12.8.4):
return new BlobServiceClient(
    new Uri($"https://{accountName}.blob.core.windows.net/"),
    sharedKeyCredential,
    new BlobClientOptions
    {
        GeoRedundantSecondaryUri = new Uri($"https://{accountName}-secondary.blob.core.windows.net/")
    });
Can I do something similar when reading from table storage?
The classes I am using are CloudStorageAccount, CloudTableClient and CloudTable (from the Microsoft.Azure.Cosmos.Table NuGet, version 1.0.8). None of them seem to have a property similar to BlobClientOptions.GeoRedundantSecondaryUri. I don't know whether there is another set of classes I ought to use instead.
Is there any easy way to make Azure table storage fall back automatically, or will I have to implement it myself?
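For illustration only, here is roughly what implementing the fallback by hand might look like. This sketch uses the azure-data-tables Python package rather than Microsoft.Azure.Cosmos.Table, and the account name, key, and table name are placeholders; the idea is simply to retry reads against the -secondary endpoint when the primary fails.

from azure.core.credentials import AzureNamedKeyCredential
from azure.core.exceptions import AzureError
from azure.data.tables import TableClient

ACCOUNT = "myaccount"  # placeholder
CREDENTIAL = AzureNamedKeyCredential(ACCOUNT, "<account-key>")
PRIMARY = f"https://{ACCOUNT}.table.core.windows.net"
SECONDARY = f"https://{ACCOUNT}-secondary.table.core.windows.net"

def get_entity_with_fallback(table_name, partition_key, row_key):
    # Try the primary endpoint first; on a service error, retry the read
    # against the RA-GRS secondary endpoint. The secondary is read-only,
    # so this pattern only makes sense for reads.
    for endpoint in (PRIMARY, SECONDARY):
        client = TableClient(
            endpoint=endpoint, table_name=table_name, credential=CREDENTIAL
        )
        try:
            return client.get_entity(partition_key=partition_key, row_key=row_key)
        except AzureError:
            if endpoint == SECONDARY:
                raise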

Download a large number of files (400k) from an S3 bucket into Azure Data Lake Gen2 using Azure Data Factory

I need to download a large number of files (around 400k) from an S3 bucket. I have the paths stored in a CSV file, and some of the paths may not exist.
The two options I see are:
Use the ForEach activity and somehow pass the contents of the file to it. But I think this would flood my monitoring pane with a huge number of runs, and it feels like it is meant for smaller pipelines.
Use the listOfFiles option, which is supported by the S3 source. The problem with this approach is that the list must be in the S3 bucket and cannot be loaded from Azure Data Lake Gen2 (if anybody knows why, please let me know as well).
I have tried the listOfFiles approach, but the pipeline fails as soon as it hits the first missing file. The fault-tolerance options include a "skip missing file" setting, but it is defined as "Skip the files if it is being deleted from source store during the data movement", so it is of no use to me.
I don't want to download more files than needed, so copying the bucket as-is is not an option. How can I approach this with ADF? I'm looking for a solution that uses the predefined transformations; ideally I would like not to involve Azure Batch or Azure Functions for such a simple task.
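One possible pre-processing step, outside of ADF, is to filter the CSV before the pipeline runs so the list only contains keys that actually exist. The sketch below uses boto3 with placeholder bucket, file, and key names: it checks each path with head_object and writes the cleaned list back to the S3 bucket, where the S3 source's listOfFiles setting could reference it. This is an assumption-laden sketch, not a verified ADF-integrated solution.

import csv
import boto3
from botocore.exceptions import ClientError

BUCKET = "my-source-bucket"            # placeholder
PATHS_CSV = "paths.csv"                # local CSV with one key per row (assumption)
CLEAN_LIST_KEY = "filelists/existing-files.txt"  # placeholder

s3 = boto3.client("s3")

existing = []
with open(PATHS_CSV, newline="") as f:
    for row in csv.reader(f):
        key = row[0].strip()
        try:
            s3.head_object(Bucket=BUCKET, Key=key)   # cheap existence check
            existing.append(key)
        except ClientError:
            pass                                     # skip missing files

# Upload the cleaned list into the bucket so listOfFiles can point at it.
s3.put_object(
    Bucket=BUCKET,
    Key=CLEAN_LIST_KEY,
    Body="\n".join(existing).encode("utf-8"),
)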

Snowflake and S3 metadata

I have custom metadata properties on my s3 files such as:
x-amz-meta-custom1: "testvalue"
x-amz-meta-custom2: "whoohoo"
When these files are loaded into Snowflake, how do I access the custom properties associated with the files? Google and the Snowflake documentation haven't turned up any gems yet.
Based on the docs, I think the only metadata you can access via the stage is the filename and row number. https://docs.snowflake.com/en/user-guide/querying-metadata.html
You could possibly write something custom that picks up the S3 metadata, writes out the S3 filename along with that metadata, and then ingests it back into another Snowflake table.
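As a sketch of that custom route (bucket name, metadata keys, and output file are placeholders), boto3's head_object exposes the x-amz-meta-* values, which can be written out alongside the filename and then staged and loaded into a separate Snowflake table.

import csv
import boto3

BUCKET = "my-data-bucket"          # placeholder
OUTPUT = "file_metadata.csv"       # file to stage/load into a Snowflake table

s3 = boto3.client("s3")

with open(OUTPUT, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "custom1", "custom2"])
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            head = s3.head_object(Bucket=BUCKET, Key=obj["Key"])
            # head_object returns custom metadata with the x-amz-meta- prefix stripped.
            meta = head.get("Metadata", {})
            writer.writerow(
                [obj["Key"], meta.get("custom1", ""), meta.get("custom2", "")]
            )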

How to upload multiple files to google cloud storage bucket as a transaction

Use Case:
Upload multiple files into a Cloud Storage bucket, and then use that data as a source for a BigQuery import. Use the name of the bucket as the metadata that drives which sharded table the data should go into.
Question:
In order to prevent a partial import into the BigQuery table, ideally I would like to do the following:
Upload the files into a staging bucket
Verify all files have been uploaded correctly
Rename the staging bucket to its final name (for example, gs://20130112)
Trigger the BigQuery import to load the bucket into a sharded table
Since gsutil does not seem to support bucket rename, what are the alternative ways to accomplish this?
Google Cloud Storage does not support renaming buckets, or more generally an atomic way to operate on more than one object at a time.
If your main concern is that all objects were uploaded correctly (as opposed to needing to ensure the bucket content is only visible once all objects are uploaded), gsutil cp supports that -- if any object fails to upload, it will report the number that failed to upload and exit with a non-zero status.
So, a possible implementation would be a script that runs gsutil cp to upload all your files, and then checks the gsutil exit status before creating the BigQuery table load job.
Mike Schwartz, Google Cloud Storage team
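A minimal sketch of that script idea, with placeholder local directory, bucket, and table names (assumes gsutil and the google-cloud-bigquery Python package are available): upload, check the exit status, and only then start the load job.

import subprocess
from google.cloud import bigquery

BUCKET = "gs://my-staging-20130112"    # placeholder staging bucket
LOCAL_DIR = "data"                     # placeholder local directory of CSV files
TABLE = "my_dataset.events_20130112"   # placeholder sharded table

# Upload everything; gsutil exits non-zero if any object failed to copy.
result = subprocess.run(["gsutil", "-m", "cp", "-r", LOCAL_DIR, BUCKET + "/"])
if result.returncode != 0:
    raise SystemExit("Some uploads failed; not starting the BigQuery load.")

# All objects uploaded successfully: create the BigQuery load job.
client = bigquery.Client()
job = client.load_table_from_uri(
    f"{BUCKET}/{LOCAL_DIR}/*",
    TABLE,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV, autodetect=True
    ),
)
job.result()  # wait for completion; raises if the load fails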
Object names are actually flat in Google Cloud Storage; from the service's perspective, '/' is just another character in the name. The folder abstraction is provided by clients, like gsutil and various GUI tools. Renaming a folder requires clients to request a sequence of copy and delete operations on each object in the folder. There is no atomic way to rename a folder.
Mike Schwartz, Google Cloud Storage team
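To make the copy-and-delete pattern concrete, here is a minimal sketch using the google-cloud-storage Python client (bucket names are placeholders): every object in a staging bucket is copied to the final bucket and then deleted, which is what a bucket or folder "rename" amounts to.

from google.cloud import storage

client = storage.Client()
staging = client.bucket("my-staging-bucket")   # placeholder
final = client.bucket("20130112")              # placeholder final bucket

# "Rename" = copy each object to the destination, then delete the original.
# This is a sequence of per-object operations, not an atomic rename.
for blob in client.list_blobs(staging):
    staging.copy_blob(blob, final, new_name=blob.name)
    blob.delete()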