Azure Blob Storage: Copying blobs with access tier ARCHIVE within the same Azure storage account is not working - azure-storage

I'm using the startCopy API from the azure-storage Java SDK version 8.6.5 to copy blobs between containers within the same storage account. As per the docs, it will copy a block blob's contents, properties, and metadata to a new block blob. Does this also mean the source and destination access tiers will match?
String copyJobId = cloudBlockBlob.startCopy(sourceBlob);
If the source blob's access tier is ARCHIVE, I get the following exception:
com.microsoft.azure.storage.StorageException: This operation is not permitted on an archived blob.
at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:196) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.blob.CloudBlob.startCopy(CloudBlob.java:791) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.blob.CloudBlockBlob.startCopy(CloudBlockBlob.java:302) ~[azure-storage-8.6.5.jar:?]
at com.microsoft.azure.storage.blob.CloudBlockBlob.startCopy(CloudBlockBlob.java:180) ~[azure-storage-8.6.5.jar:?]
I used the startCopy API to copy all blobs in container02 (source) to container03 (destination). The blob with access tier ARCHIVE failed to copy, and test1.txt's access tier in the destination does not match the source.
I just want to confirm whether this is expected, or whether I am not using the right API and need to set these properties explicitly if I want the source and destination to look the same.
Thanks in advance!

1. Blob with access tier ARCHIVE failed
You cannot execute this startCopy operation while the source blob's access tier is ARCHIVE.
Please refer to this official documentation:
While a blob is in archive storage, the blob data is offline and can't be read, overwritten, or modified. To read or download a blob in archive, you must first rehydrate it to an online tier. You can't take snapshots of a blob in archive storage. However, the blob metadata remains online and available, allowing you to list the blob, its properties, metadata, and blob index tags. Setting or modifying the blob metadata while in archive is not allowed; however you may set and modify the blob index tags. For blobs in archive, the only valid operations are GetBlobProperties, GetBlobMetadata, SetBlobTags, GetBlobTags, FindBlobsByTags, ListBlobs, SetBlobTier, CopyBlob, and DeleteBlob.
2. test1.txt blob's access tier is not the same as the source.
The access tier of the copied blob is not carried over from the source; it likely comes from the storage account's default access tier unless you set a tier explicitly.
Solution:
You may need to rehydrate the files from the archive tier to the hot or cool access tier first. Alternatively, you can use this startCopy overload and specify standardBlobTier and rehydratePriority:
public final String startCopy(final CloudBlockBlob sourceBlob, String contentMd5, boolean syncCopy, final StandardBlobTier standardBlobTier, RehydratePriority rehydratePriority, final AccessCondition sourceAccessCondition, final AccessCondition destinationAccessCondition, BlobRequestOptions options, OperationContext opContext)
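As a minimal sketch of calling that overload (assuming sourceBlob and destinationBlob are CloudBlockBlob references you already hold, with the destination pointing at the target container and blob name; the HOT tier and STANDARD priority are example choices, not requirements), it could look roughly like this:
import com.microsoft.azure.storage.OperationContext;
import com.microsoft.azure.storage.blob.BlobRequestOptions;
import com.microsoft.azure.storage.blob.CloudBlockBlob;
import com.microsoft.azure.storage.blob.RehydratePriority;
import com.microsoft.azure.storage.blob.StandardBlobTier;

// Sketch: copy an archived blob and ask the service to rehydrate the copy to the HOT tier.
static String copyArchivedBlob(CloudBlockBlob sourceBlob, CloudBlockBlob destinationBlob) throws Exception {
    return destinationBlob.startCopy(
            sourceBlob,
            null,                        // contentMd5 - no MD5 check on the copy
            false,                       // syncCopy - start an asynchronous copy
            StandardBlobTier.HOT,        // access tier to assign to the destination blob
            RehydratePriority.STANDARD,  // priority for rehydrating the archived source
            null,                        // sourceAccessCondition
            null,                        // destinationAccessCondition
            new BlobRequestOptions(),
            new OperationContext());
}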

Related

How to rename a file in Blob Storage by using the Azure Data Lake Gen2 REST API

I've tried to follow the instructions in this document: LINK
I used SAS authentication and added the "x-ms-rename-source" request header, but I kept getting the error "403 - AuthorizationPermissionMismatch". I'm doing fine with all the other API methods, but this one seems really tricky. Has anyone successfully renamed a file or directory with it?
Instead of using SAS authentication, I used Shared Key authorization headers. You can check it here.
My request headers:
DateTime now = DateTime.UtcNow;
requestMessage.Headers.Add("x-ms-date", now.ToString("R", CultureInfo.InvariantCulture));
requestMessage.Headers.Add("x-ms-version", "2018-11-09");
// the source path you want to rename
requestMessage.Headers.Add("x-ms-rename-source", renameSourcePath);
// the rename operation only accepts Shared Key authorization, supplied via the Authorization header
requestMessage.Headers.Authorization = AzureStorageAuthenticationHelper.GetAuthorizationHeader(
    StorageGen2AccountName, StorageGen2AccountKey, now, requestMessage);
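For context, the rename itself is the Data Lake Storage Gen2 "Path - Create" operation: you send a PUT against the destination path on the dfs endpoint, and the service treats it as a rename because x-ms-rename-source is present. With placeholder names (not your real account, filesystem, or path), the request looks roughly like:
PUT https://{accountName}.dfs.core.windows.net/{filesystem}/{destinationPath}
The destination goes in the URL, and the existing file's path goes in the x-ms-rename-source header shown above.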
You can also try renaming the file in Blob Storage by using the Azure Storage Explorer tool.
Kindly let us know if the above helps or if you need further assistance on this issue.

How to push data from AWS IoT MQTT broker to a random file in S3 bucket

I have created a rule to forward all messages published to a topic (e.g. foo/bar) on my AWS IoT Core managed MQTT broker to a nested folder in an S3 bucket. For that, I am using the key section of the S3 action. I can send data to a nested folder like a/b/c. The problem is that it takes c as the destination file, and this file gets updated with new data as it arrives. Is there any configuration I can use to put the data into the bucket in a new file (with any random name) as it arrives, similar to how it happens when we forward data from Firehose to S3?
You can change your key to use the newuuid() function. e.g.
a/b/${newuuid()}
This will write the data to a file in the a/b folder with a filename that is a generated UUID.
The key in the AWS IoT S3 action allows you to use the IoT SQL Reference Functions to form the folder and filename.
The documentation for the key states:
The path to the file where the data is written. For example, if the value of this argument is "${topic()}/${timestamp()}", the topic the message was sent to is "this/is/my/topic,", and the current timestamp is 1460685389, the data is written to a file called "1460685389" in the "this/is/my/topic" folder on Amazon S3.
If you don't want to use a timestamp, you could form the file name using other functions such as a random float (rand()), a hash (md5()), a UUID (newuuid()), or the trace ID of the message (traceid()).
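For reference, the key expression sits in the S3 action of the topic rule. If you define the rule outside the console (for example as a topic rule payload), it could look roughly like the sketch below; the bucket name and role ARN are placeholders, not values from the question:
{
  "sql": "SELECT * FROM 'foo/bar'",
  "actions": [
    {
      "s3": {
        "bucketName": "your-bucket-name",
        "key": "a/b/${newuuid()}",
        "roleArn": "arn:aws:iam::123456789012:role/your-iot-to-s3-role"
      }
    }
  ]
}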

Auditing Azure File Storage Service

It is documented that Storage Analytics logging currently does not work for the File storage service.
Storage Analytics metrics are available for the Blob, Queue, Table,
and File services.
Storage Analytics logging is available for the Blob, Queue, and Table
services.
https://learn.microsoft.com/en-us/rest/api/storageservices/enabling-and-configuring-storage-analytics
Knowing this, I was hoping I could identify File service usage via the metrics; however, I wasn't able to isolate anything I could conclusively attribute to file usage. The capacity didn't seem to go up, and I couldn't isolate ingress/egress as being just for files.
How best to audit File usage?
There is a workaround for getting metrics/analytics on storage services, specifically Azure Files; it just isn't in Storage Analytics yet.
There is an option in the .NET SDK that allows you to view different metrics. You have to use the resource ID; this is done via Azure Storage metrics in Azure Monitor:
If you want to list the metric definitions for blob, table, file, or queue, you must specify different resource IDs for each service with the API.
Code Sample:
public static async Task ListStorageMetricDefinition()
{
    // Resource ID for the storage account
    var resourceId = "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Storage/storageAccounts/{storageAccountName}";
    var subscriptionId = "{SubscriptionID}";

    // How to identify the Tenant ID, Application ID and Access Key: https://azure.microsoft.com/documentation/articles/resource-group-create-service-principal-portal/
    var tenantId = "{TenantID}";
    var applicationId = "{ApplicationID}";
    var accessKey = "{AccessKey}";

    // Using metrics in Azure Monitor is currently free. However, if you use additional solutions that ingest metrics data, you may be billed by those solutions.
    // For example, you are billed by Azure Storage if you archive metrics data to an Azure Storage account, or by Operations Management Suite (OMS) if you stream metrics data to OMS for advanced analysis.
    MonitorClient readOnlyClient = AuthenticateWithReadOnlyClient(tenantId, applicationId, accessKey, subscriptionId).Result;

    IEnumerable<MetricDefinition> metricDefinitions = await readOnlyClient.MetricDefinitions.ListAsync(resourceUri: resourceId, cancellationToken: new CancellationToken());

    foreach (var metricDefinition in metricDefinitions)
    {
        // Enumerate metric definition properties:
        //   Id
        //   ResourceId
        //   Name
        //   Unit
        //   MetricAvailabilities
        //   PrimaryAggregationType
        //   Dimensions
        //   IsDimensionRequired
    }
}
Source: Azure Storage metrics in Azure Monitor
You can also view these metrics in the Azure portal, under the storage account's Metrics blade.

Accessing FlowFile content in the NiFi PutS3Object processor

I am new to NiFi and want to push data from Kafka to an S3 bucket. I am using the PutS3Object processor and can push data to S3 if I hard-code the Bucket value as mphdf/orderEvent, but I want to specify the bucket based on a field in the content of the FlowFile, which is JSON. So, if the JSON content is {"menu": {"type": "file","value": "File"}}, can I have the value for the Bucket property be mphdf/$.menu.type? I have tried to do this and get the error below. I want to know if there is a way to access the FlowFile content with the PutS3Object processor and make bucket names configurable, or will I have to build my own processor?
ERROR [Timer-Driven Process Thread-10]
o.a.nifi.processors.aws.s3.PutS3Object
com.amazonaws.services.s3.model.AmazonS3Exception: The XML you
provided was not well-formed or did not validate against our
published schema (Service: Amazon S3; Status Code: 400; Error Code:
MalformedXML; Request ID: 77DF07828CBA0E5F)
I believe what you want to do is use an EvaluateJSONPath processor, which evaluates arbitrary JSONPath expressions against the JSON content and extracts the results to flowfile attributes. You can then reference the flowfile attribute using NiFi Expression Language in the PutS3Object configuration (see your first property Object Key which references ${filename}). In this way, you would evaluate $.menu.type and store it into an attribute menuType in the EvaluateJSONPath processor, then in PutS3Object you would have Bucket be mphdf/${menuType}.
You might have to play around with it a bit but off the top of my head I think that should work.
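A rough sketch of the two processor configurations under that approach (property layout as I recall it; menuType is just the attribute name suggested above):
EvaluateJsonPath
  Destination: flowfile-attribute
  menuType:    $.menu.type        (dynamic property; the extracted value is written to the menuType attribute)
PutS3Object
  Bucket:      mphdf/${menuType}  (NiFi Expression Language reference to the attribute set above)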

Is there a fast way of accessing a line in an AWS S3 file?

I have a collection of JSON messages in a file stored on S3 (one message per line). Each message has a unique key as part of the message. I also have a simple DynamoDB table where this key is used as the primary key. The table contains the name of the S3 file where the corresponding JSON message is located.
My goal is to extract a JSON message from the file given the key. Of course, the worst case scenario is when the message is the very last line in the file.
What is the fastest way of extracting the message from the file using the boto library? In particular, is it possible to somehow read the file line by line directly? Of course, I can read the entire contents to a local file using boto.s3.key.get_file(), then open the file, read it line by line, and check for the ID to match. But is there a more efficient way?
Thanks much!
S3 cannot do this. That said, you have some other options:
Store the record's length and position (byte offset) instead of the line number in DynamoDB. This would allow you to retrieve just that record using the Range: header (see the sketch after this list).
Use a caching layer to store { S3 object key, line number } => { position, length } tuples. When you want to look up a record by { S3 object key, line number }, reference the cache. If you don't already have this data, you have to fetch the whole file as you do now -- but having fetched the file, you can calculate offsets for every line within it and save yourself work down the line.
Store the JSON record in DynamoDB directly. This may or may not be practical, given the 64 KB item limit.
Store each JSON record in S3 separately. You could then eliminate the DynamoDB key lookup, and go straight to S3 for a given record.
Which is most appropriate for you depends on your application architecture, the way in which this data is accessed, concurrency issues (probably not significant given your current solution), and your sensitivities for latency and cost.
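For the first option, the lookup is just a ranged GET on the object. A minimal sketch using the AWS SDK for Java (the same idea works in boto by sending a Range header on the GET); readRecord is a hypothetical helper, and the bucket, key, offset, and length are assumed to come from your DynamoDB item:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

// Sketch: fetch a single JSON record given the byte offset and length stored in DynamoDB.
static String readRecord(String bucket, String key, long offset, long length) throws Exception {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    GetObjectRequest request = new GetObjectRequest(bucket, key)
            .withRange(offset, offset + length - 1);   // inclusive byte range
    try (S3Object object = s3.getObject(request)) {
        return IOUtils.toString(object.getObjectContent());
    }
}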
You can use Node's built-in readline module with a stream:
const readline = require('readline');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const params = {Bucket: 'yourbucket', Key: 'somefile.txt'};
const readStream = s3.getObject(params).createReadStream();
const lineReader = readline.createInterface({
input: readStream,
});
lineReader.on('line', (line) => console.log(line));
You can use S3 Select to accomplish this. It also works on Parquet files.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html