Is there any way we can automate folder creation in ADLS Gen2 programmatically? It is required because this has to be done in a Production subscription with limited access from the Portal.
If there are any suggestions, please help with this.
For the benefit of a broader audience, re-sharing Ivan Yang's comment as an answer:
You can use the REST API here.
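For reference, creating a directory through the Data Lake Storage Gen2 REST API is a single Path Create call; a minimal sketch with curl, where the account, filesystem, directory name and SAS token are all placeholders:
# PUT <filesystem>/<dirname>?resource=directory creates the directory (empty request body)
curl -X PUT -H "Content-Length: 0" "https://<account>.dfs.core.windows.net/<filesystem>/<dirname>?resource=directory&<sas-token>"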
You could also use the Azure CLI for this, but the required storage-preview extension is still in preview.
# Add the storage-preview extension
az extension add --name storage-preview
# Create the directory only if it does not already exist
$dirCheck = az storage blob directory exists -c $filesystemName -d $dirname --account-name $storageaccountname --debug
$exists = $dirCheck | ConvertFrom-Json
If (-Not $exists.exists[0])
{
    az storage blob directory create -c $filesystemName -d $dirname --account-name $storageaccountname
}
I'm trying to get an ML job to run on AWS Batch. The job runs in a docker container, using credentials generated for a Task IAM Role.
I use DVC to manage the large data files needed for the task, which are hosted in an S3 repository. However, when the task tries to pull the data files, it gets an access denied message.
I can verify that the role has permissions to the bucket, because I can access the exact same files if I run an aws s3 cp command (as shown in the example below). But, I need to do it through DVC so that it downloads the right version of each file and puts it in the expected place.
I've been able to trace the problem down to s3fs, which DVC uses to integrate with S3. As I demonstrate in the example below, it gets an access denied message even when I use s3fs by itself, passing in the credentials explicitly. It seems to fail on this line, where it falls back to listing the key after failing to find the object via a head_object call.
I suspect there may be a bug in s3fs, or in the particular combination of boto, http, and s3 libraries. Can anyone help me figure out how to fix this?
Here is a minimal reproducible example:
Shell script for the job:
#!/bin/bash
AWS_CREDENTIALS=$(curl http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI)
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=$(echo "$AWS_CREDENTIALS" | jq .AccessKeyId -r)
export AWS_SECRET_ACCESS_KEY=$(echo "$AWS_CREDENTIALS" | jq .SecretAccessKey -r)
export AWS_SESSION_TOKEN=$(echo "$AWS_CREDENTIALS" | jq .Token -r)
echo "AWS_ACCESS_KEY_ID=<$AWS_ACCESS_KEY_ID>"
echo "AWS_SECRET_ACCESS_KEY=<$(cat <(echo "$AWS_SECRET_ACCESS_KEY" | head -c 6) <(echo -n "...") <(echo "$AWS_SECRET_ACCESS_KEY" | tail -c 6))>"
echo "AWS_SESSION_TOKEN=<$(cat <(echo "$AWS_SESSION_TOKEN" | head -c 6) <(echo -n "...") <(echo "$AWS_SESSION_TOKEN" | tail -c 6))>"
dvc doctor
# Succeeds!
aws s3 ls s3://company-dvc/repo/
# Succeeds!
aws s3 cp s3://company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2 mycopy.txt
# Fails!
python3 download_via_s3fs.py
download_via_s3fs.py:
import os
import s3fs
# Just to make sure we're reading the credentials correctly.
print(os.environ["AWS_ACCESS_KEY_ID"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
print(os.environ["AWS_SESSION_TOKEN"])
print("running with credentials")
fs = s3fs.S3FileSystem(
key=os.environ["AWS_ACCESS_KEY_ID"],
secret=os.environ["AWS_SECRET_ACCESS_KEY"],
token=os.environ["AWS_SESSION_TOKEN"],
client_kwargs={"region_name": "us-east-1"}
)
# Fails with "access denied" on ListObjectsV2
print(fs.exists("company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2"))
Terraform for IAM role:
data "aws_iam_policy_document" "standard-batch-job-role" {
# S3 read access to related buckets
statement {
actions = [
"s3:Get*",
"s3:List*",
]
resources = [
data.aws_s3_bucket.company-dvc.arn,
"${data.aws_s3_bucket.company-dvc.arn}/*",
]
effect = "Allow"
}
}
Environment
OS: Ubuntu 20.04
Python: 3.10
s3fs: 2023.1.0
boto3: 1.24.59
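Not part of the original example, but a further isolation step: issuing the same two S3 calls that s3fs makes, directly with the AWS CLI and the same task credentials, shows whether ListObjectsV2 itself is denied for that prefix.
# head-object is the first call s3fs makes; listing the key as a prefix is its fallback
aws s3api head-object --bucket company-dvc --key repo/00/0e4343c163bd70df0a6f9d81e1b4d2
aws s3api list-objects-v2 --bucket company-dvc --prefix repo/00/0e4343c163bd70df0a6f9d81e1b4d2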
I have one S3 bucket storing CSV files. New CSV files get added to this bucket at the beginning of each month, and I want these new files to be uploaded automatically to Azure Blob Storage at the beginning of each month.
The way I was thinking to do this is to create a script (bash/PowerShell) that pulls data from the AWS S3 bucket to Azure Blob Storage via the AzCopy command, and then plug this script into an Azure YAML pipeline that runs at the start of every month. However, I can't find a way to integrate this script into an Azure YAML pipeline. Is this command feasible with a YAML pipeline, or is there a simpler way to do this?
We can copy data from an S3 bucket to Azure Blob Storage using azcopy.
azcopy copy "<s3-bucket-uri>" "https://StorageAccountName.blob.core.windows.net/container-name/?sas-token" --recursive
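Note that for an S3 source, azcopy picks up the AWS credentials from environment variables, so something like the following is needed before running the copy (values are placeholders):
export AWS_ACCESS_KEY_ID="<aws-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<aws-secret-access-key>"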
We can integrate azcopy with a YAML pipeline.
First, install azcopy on the pipeline agent as below:
- task: Bash@3
  displayName: Install azcopy
  inputs:
    targetType: 'inline'
    script: |
      curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
      mkdir $(Agent.ToolsDirectory)/azcopy
      wget -O $(Agent.ToolsDirectory)/azcopy/azcopy_v10.tar.gz https://aka.ms/downloadazcopy-v10-linux
      tar -xf $(Agent.ToolsDirectory)/azcopy/azcopy_v10.tar.gz -C $(Agent.ToolsDirectory)/azcopy --strip-components=1
Then create an Azure CLI task in the pipeline that copies the data from the S3 bucket to Azure Blob Storage with azcopy.
Reference code:
- task: AzureCLI@2
  displayName: Download using azcopy
  inputs:
    azureSubscription: 'Service-Connection'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      end=`date -u -d "180 minutes" '+%Y-%m-%dT%H:%M:00Z'`
      $(Agent.ToolsDirectory)/azcopy/azcopy copy "<s3-bucket-uri>" "https://StorageAccountName.blob.core.windows.net/container-name/?sas-token" --recursive --check-md5=FailIfDifferent
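Since the requirement is to run at the beginning of each month, a scheduled trigger can be added at the root of the same pipeline YAML; a sketch, where the branch name is an assumption:
schedules:
- cron: "0 0 1 * *"        # 00:00 UTC on the 1st of every month
  displayName: Monthly S3-to-Blob copy
  branches:
    include:
    - main                 # assumed default branch
  always: true             # run even when there are no code changes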
Reference SO thread: Azure Pipelines - Download files with azcopy - Stack Overflow
Scaleway recently launched GLACIER-class storage, the "C14 Cold Storage Class".
They have a great plan of 75GB free, and I'd like to take advantage of it using the restic backup tool.
To get this working I have successfully followed the S3 instructions for repository creation and uploading, with one caveat: I cannot successfully pass the storage-class header as GLACIER.
Using awscliv2, I can successfully pass such a header from my local machine: aws s3 cp object s3://bucket/ --storage-class GLACIER
But with restic, having dug through some GitHub issues, I can see an option to pass a -o flag. The linked issue's resolution is not that clear to me, so I have tried the following restic commands without seeing the "GLACIER" storage class label next to the objects in the Scaleway bucket console:
restic -r s3:s3.fr-par.scw.cloud/restic-testing -o GLACIER --verbose backup ~/test.txt
restic -r s3:s3.fr-par.scw.cloud/restic-testing -o storage-class=GLACIER --verbose backup ~/test.txt
Can someone suggest another option?
I'm starting to use C14's GLACIER storage class with restic, and so far it seems to be working very well.
I suggest creating the repository in the usual way with restic -r s3:s3.fr-par.scw.cloud/test-bucket init, which will create the config file and keys in the STANDARD storage class.
For backups, I'm using the command:
$ restic backup -r s3:s3.fr-par.scw.cloud/test-bucket -o s3.storage-class=GLACIER --host host /path
similar to what you did, except the option is s3.storage-class rather than storage-class.
This way, the files in the data and snapshots directories are in the GLACIER storage class, and you can add backups with no problem.
I can also mount the repository while the data is in the GLACIER class (I suppose all the info is taken from the cache), so I can run restic mount /mnt/c14 and browse the files, although I cannot copy them or see their content.
If I need to restore files, I restore the whole bucket to the STANDARD class with s3cmd restore --recursive s3://test-bucket/ (see s3cmd), and I check that all files are back in the standard class with:
$ aws s3 ls s3://test-bucket --recursive | tr -s ' ' | cut -d' ' -f 4 | xargs -n 1 -I {} sh -c "aws s3api head-object --bucket test-bucket --key '{}' | jq -r .StorageClass" | grep --quiet GLACIER
which exits successfully (true) if at least one file is still in the GLACIER class, so you have to wait for this command to return false.
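A simple way to wait is to wrap that check in a polling loop; a sketch, where the 10-minute interval is arbitrary:
while aws s3 ls s3://test-bucket --recursive | tr -s ' ' | cut -d' ' -f 4 | \
      xargs -n 1 -I {} sh -c "aws s3api head-object --bucket test-bucket --key '{}' | jq -r .StorageClass" | \
      grep --quiet GLACIER
do
    sleep 600   # check again in 10 minutes
done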
Obviously a restore will take more time, but I'm using C14 Glacier as a second or third backup, while keeping another restic repository in Backblaze B2, which is warm storage.
In addition to vstefanoxx's answer, here is my workflow.
I set up the restic repository just like vstefanoxx.
Now, if you want to prune the repository... you cannot, because the files are in GLACIER and restic needs read-write access to the bucket to prune.
What is interesting about Scaleway is that transfers between the GLACIER and STANDARD classes are free. So let's move the data back to the standard class:
s3cmd restore --recursive s3://test-bucket
And wait until the end of the process using the command given by vstefanoxx. Once your data is in the standard class it costs you five times more, so we have to be efficient :-)
So we now prune the repository:
restic prune -r s3:s3.fr-par.scw.cloud/test-bucket
And once it is finished, move everything (in fact data, index and snapshots but not keys) back to glacier:
s3cmd cp s3://test-bucket/data/ s3://test-bucket/data/ --recursive --storage-class=GLACIER
s3cmd cp s3://test-bucket/index/ s3://test-bucket/index/ --recursive --storage-class=GLACIER
s3cmd cp s3://test-bucket/snapshots/ s3://test-bucket/snapshots/ --recursive --storage-class=GLACIER
We are now at a point where we have pruned the repository while trying to pay the least amount of money!
The chosen answer doesn't seem to work when doing incremental backups, so I went with a different solution.
I set up a normal bucket, initialized with the usual restic init. Then I set up the following lifecycle rule:
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>data-to-glacier</ID>
    <Filter>
      <Prefix>data/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>0</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
  </Rule>
</LifecycleConfiguration>
Days is set to 0, which means the rule applies to all files. Rules are not applied continuously, though; they run once a day at midnight UTC.
This rule will only apply to the files in data/, which are the big files.
This rule description is supposed to be used with s3cmd but you can also do it from the dashboard if you prefer a GUI.
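If you apply it with s3cmd, the command should look something like this (the local file name lifecycle.xml is a placeholder):
s3cmd setlifecycle lifecycle.xml s3://test-bucket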
I'm attempting to come up with commands to facilitate deployment to different environments (production, staging) in my GCP project using gsutil.
The following deploys to production without issue:
gsutil cp -r ./build/* gs://<production-project-name>/
I'd like to deploy to a bucket in another project. The gsutil help page alludes to a -p option for ls and mb used to change the project context of the gsutil command.
I'd like to use a command like this to deploy my app to a staging environment:
gsutil cp -r ./build/* gs://<existing-bucket-in-staging-project>/ -p <staging-project-name>
Alas, the -p option is not available for the cp command. I confirmed on the gsutil cp doc page.
What is the best way to deploy a build artifact to a Google Cloud Storage bucket in a project other than the one currently specified in the terminal environment?
The bucket namespace is global, so as long as the credentials you're using have permission to the other project, you shouldn't need a project parameter with the cp command. In other words, this command should work fine:
gsutil cp -r ./build/* gs://<bucket-in-staging-project>
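If the active credentials don't yet have access to the staging bucket, one option is to add an IAM binding on that bucket; a sketch, where the service account and role are assumptions:
gsutil iam ch serviceAccount:deployer@<staging-project-name>.iam.gserviceaccount.com:roles/storage.objectAdmin gs://<existing-bucket-in-staging-project>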
I am trying to import data from S3 using the script described below (which I sort of inherited). It's a bit long... The problem is that I keep receiving the following output:
The config profile (importer) could not be found
I am not a bash person, so be gentle, please. It seems some credentials are missing or something else is wrong with the configuration of "importer" on the local machine.
In the S3 configuration (the console) there is a user with the same name which, according to its permissions, can access the bucket and download data.
I have tried changing the access keys in the Amazon console for that user and creating a file named "credentials" in ~/.aws (there was no .aws folder in the home directory by default, so I created it), putting the new keys in the file, and upgrading the AWS CLI with pip - nothing helped.
Then I modified the "credentials" file, placing [importer] as the profile name, so it looked like:
[importer]
aws_access_key_id = xxxxxxxxxxxxxxxxx
aws_secret_access_key = xxxxxxxxxxxxxxxxxxx
It appears that I got past the misconfiguration:
A client error (InvalidAccessKeyId) occurred when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
Completed 1 part(s) with ... file(s) remaining
And here's the part where I am stuck... I placed the keys I obtained from Amazon into that config file. Double checked... Any suggestions? I can't create any more keys due to the AWS per-user quota. Below is part of the script:
#!/bin/sh
echo "\n$0 started at: `date`"
incomming='/Database/incomming'
IFS='
';
mkdir -p ${incomming}
echo "syncing files from arrivals bucket to ${incomming} incomming folder"
echo aws --profile importer \
s3 --region eu-west-1 sync s3://path-to-s3-folder ${incomming}
aws --profile importer \
s3 --region eu-west-1 sync s3://path-to-s3-folder ${incomming}
count=0
echo ""
echo "Searching for zip files in ${incomming} folder"
for f in `find ${incomming} -name '*.zip'`;
do
echo "\n${count}: ${f} --------------------"
count=$((count+1))
name=`basename "$f" | cut -d'.' -f1`
dir=`dirname "$f"`
if [ -d "${dir}/${name}" ]; then
echo "\tWarning: directory ${dir}/${name} already exists for file: ${f} ... - skipping - not imported"
continue
fi
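Not part of the original script, but a quick check that can help here: asking STS which identity the importer profile's keys resolve to distinguishes a missing or ignored profile from keys that AWS simply does not recognize.
aws sts get-caller-identity --profile importer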