Adding permissions to Azure Data Lake Gen1 - azure-data-lake

I have an instance of Azure Data Lake Storage Gen1 structured as:
folder structure
└── Tenants
    ├── Tenant-01
    │   ├── Product-A
    │   └── Product-B
    └── Tenant-02
        ├── Product-A
        └── Product-B
team structure
there is a one-to-one mapping between teams and products:
Team-A owns Product-A
Team-B owns Product-B
permissions structure
Ideally I would want to give write permissions to Product-X for all tenants
Team-A write access under Tenants/**/Product-A/
Team-B write access under Tenants/**/Product-B/
and then potentially read access, like
Team-A read under Tenants/**/Product-B/
questions
How can we achieve this without hitting the 32-entry ACL limit, and with a large number of Tenant-* folders, without writing out each ACL specifically?
Is this supported, and how can this setup be migrated to Azure Data Lake Storage Gen2?

You can leverage role-assignable groups in Azure AD: create groups for
Team-A and Team-B and then assign the required ACLs to those groups.
This way you do not need to add ACL entries for each user individually.
Please check the link below to learn how to create role-assignable groups
in Azure Active Directory - Create Groups
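With the groups in place, the per-tenant ACLs can be applied in a loop rather than written out one by one. Below is a minimal sketch using the Gen2 Python SDK (azure-storage-file-datalake), since the same approach carries over after migration; the account URL, filesystem name and group object ID are placeholders, not values from the question:

# Sketch: grant Team-A's AAD group write access on every Tenants/*/Product-A folder.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder
FILESYSTEM = "<filesystem>"                                     # placeholder
TEAM_A_GROUP_ID = "<aad-group-object-id>"                       # placeholder

service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
fs = service.get_file_system_client(FILESYSTEM)

# One access entry plus a matching default entry so new files/folders inherit it.
acl = f"group:{TEAM_A_GROUP_ID}:rwx,default:group:{TEAM_A_GROUP_ID}:rwx"

for tenant in fs.get_paths(path="Tenants", recursive=False):
    if not tenant.is_directory:
        continue
    product_dir = fs.get_directory_client(f"{tenant.name}/Product-A")
    # Merges these entries into the existing ACLs on the folder and everything under it.
    product_dir.update_access_control_recursive(acl=acl)

Note that the group also needs execute (--x) on the container root and on Tenants/ so it can traverse down to the product folders, and this only adds two ACL entries per folder, so the 32-entry limit is not a problem.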
You can opt to keep the POSIX access control lists (ACLs) along with
the data files when upgrading from Azure Data Lake Storage Gen1 to
Gen2 or when copying data between ADLS Gen2 accounts. See Access control in Azure
Data Lake Storage Gen1 and Access control in Azure Data Lake Storage
Gen2 for additional information on access control.
When you use ADF to preserve ACLs while copying from Data Lake Storage Gen1/Gen2 to
Gen2, the existing ACLs on the corresponding folders/files in the Gen2 sink
will be overwritten.
Preserve ACLs from Data Lake Storage Gen1/Gen2 to Gen2 | Docs
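For reference, the preservation is switched on through the copy activity's preserve setting. Here is a minimal sketch of the relevant typeProperties for a Gen1-to-Gen2 copy, written as a Python dict; treat the exact shape as illustrative rather than a complete pipeline definition:

# Sketch of an ADF copy activity's typeProperties that preserves ACLs
# when copying from ADLS Gen1 to ADLS Gen2.
copy_type_properties = {
    "source": {"type": "AzureDataLakeStoreSource", "recursive": True},
    "sink": {"type": "AzureBlobFSSink", "copyBehavior": "PreserveHierarchy"},
    "preserve": ["ACL", "Owner", "Permission"],  # sink-side ACLs are overwritten, as noted above
}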

Related

Always start with 1 folder under a container in Azure Data Lake Storage Gen2

Per the warning in this link, MS recommends that the directory structure under a container (i.e., Gen2) should begin with just 1 folder (instead of jumping straight to multiple folders) because some applications cannot mount the root of a container. I have never seen this happen. What are some examples of applications with this limitation? Is this a legitimate warning?

Where is the root Azure Storage instance?

I am trying to access logs from my Databricks notebook, which is run as a job. I would like to see these logs in an Azure storage account.
From the documentation: https://learn.microsoft.com/en-us/azure/databricks/administration-guide/workspace/storage#notebook-results
According to this, my results are stored in the workspace's root Azure Storage instance. However, I can't find any reference to this elsewhere online. How would I access this?
The documentation says:
Notebook results are stored in workspace system data storage, which is not accessible by users.
But you can retrieve these results via the UI, via the get-output endpoint of the Jobs REST API, or via the runs get-output command of the databricks CLI.
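For the REST route, a minimal sketch; the workspace URL, token and run ID are placeholders:

# Sketch: fetch a notebook job run's output via the Jobs REST API.
import requests

WORKSPACE_URL = "https://<workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                          # placeholder
RUN_ID = 12345                                             # placeholder

resp = requests.get(
    f"{WORKSPACE_URL}/api/2.1/jobs/runs/get-output",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"run_id": RUN_ID},
)
resp.raise_for_status()

# notebook_output holds whatever the notebook passed to dbutils.notebook.exit().
print(resp.json().get("notebook_output", {}))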

Attempting to read parquet files on linked storage in Azure Synapse

I am attempting to give access to parquet files on a Gen2 Data Lake container. I have owner RBAC on the container but would prefer to limit access in the container for other users.
My Query is very simple:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet',
FORMAT='PARQUET'
) AS [result]
When I run this I have no problems connecting. I have attempted to add ACL rights onto the files (and of course the containing folders 'Top' and 'Sub').
I've given RWX on the 'Top' folder using Storage Explorer, both as access and default ACLs, so that it cascades to the 'Sub' folder and the parquet files as I add them.
When my colleague attempts to run the SQL script they get the error message: Failed to execute query. Error: File 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet' cannot be opened because it does not exist or it is used by another process.
NB similar results are also experienced in Spark but with a 403 instead
SQL on-demand provides a link to the following help file after the error, it suggests:
If your query fails with the error saying 'File cannot be opened because it does not exist or it is used by another process' and you're sure both file exist and it's not used by another process it means SQL on-demand can't access the file. This problem usually happens because your Azure Active Directory identity doesn't have rights to access the file. By default, SQL on-demand is trying to access the file using your Azure Active Directory identity. To resolve this issue, you need to have proper rights to access the file. Easiest way is to grant yourself 'Storage Blob Data Contributor' role on the storage account you're trying to query.
I don't wish to grant Storage Blob Data Contributor or Storage Blob Data Reader as this gives access to every file on the container and not just those I want end users to be able to query. We have found the same experience occurs for SSMS connecting to parquet external tables.
So then in parts:
Is this the correct pattern using ACL to grant access, or should I use another method?
Are there settings on the Storage Account or within my query/notebook that I should be enabling to support ACL?*
Has ACL been implemented on Synapse Workspace to date given that we're still in preview?
*I have resisted pasting my entire settings as I really have no idea what is relevant and what entirely irrelevant to this issue but of course can supply.
It would appear that the ACL feature was not working correctly in Preview for Azure Synapse Analytics.
I have now managed to get it to work. At present I see that once Read|Execute is provided on a folder it allows access to the files contained within that folder and its subfolders. Access is available even when no specific ACL access is provided on a file in a subfolder. This is not quite what I expected; however, it provides enough for me to proceed: only giving access to the Gold folder allows for separation between the files I want to let users query and the working files that I want to keep hidden.
When you assign an ACL to a folder it is not propagated recursively to the files already inside the folder. Only new files inherit from the folder's default ACL.
You can see this here:
Go to Azure Storage Explorer, change the ACL permissions on the root folder, then right-click on your storage and click on "Propagate Access Control Lists".
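The same propagation can also be scripted. A minimal sketch with the azure-storage-file-datalake Python SDK, where the account, container and AAD object ID are placeholders:

# Sketch: push an ACL entry down to files/folders that already existed before
# the ACL was set (equivalent to Storage Explorer's "Propagate Access Control Lists").
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://<account>.dfs.core.windows.net", credential=DefaultAzureCredential()
)
top = service.get_file_system_client("<container>").get_directory_client("Top")

colleague_oid = "<aad-object-id>"  # placeholder
# Access entry for existing items plus a default entry so new items inherit it.
top.update_access_control_recursive(
    acl=f"user:{colleague_oid}:r-x,default:user:{colleague_oid}:r-x"
)

The identity querying the file also needs execute (--x) on every folder above it, down from the container root; otherwise the same "does not exist" error appears.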

Copy files between cloud storage providers

I need to upload a large number of files to one cloud storage provider and then copy those files to another cloud storage provider using software that I will write. I have looked at several cloud storage providers and I don't see an easy way to do what I need to do unless I first download the files and then upload them to the second storage provider. I want to copy directly using cloud storage provider API's. Any suggestions or links to storage providers that have API's that will allow copying from one provider to another would be most welcome.
There are several options you could choose. First, use a cloud transfer service such as MultCloud. I've used it to transfer from AWS S3 and Egnyte to Google Drive.
MultCloud https://www.multcloud.com is free for 30GB of data traffic per month.
Mountain Duck https://mountainduck.io/ - if connectors are available, you can mount each cloud service as a hard drive and move files easily.
I hope this helps.
If you want to write code for it, use Google's gsutil:
The gsutil cp command allows you to copy data between your local file
system and the cloud, copy data within the cloud, and copy data
between cloud storage providers.
You will find detailed info at this link:
https://cloud.google.com/storage/docs/gsutil/commands/cp
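For example, once the S3 keys are configured in ~/.boto, a provider-to-provider copy can be driven from code; the bucket names below are placeholders:

# Sketch: copy directly from S3 to Google Cloud Storage with gsutil.
import subprocess

subprocess.run(
    ["gsutil", "-m", "cp", "-r",
     "s3://example-source-bucket/*", "gs://example-destination-bucket/"],
    check=True,
)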
If you want a piece of software, use MultCloud: https://www.multcloud.com/
It can download directly from the web and it can also transfer files from one cloud storage service like Dropbox to another like Google Drive.
Cloud HQ, also available as a Chrome extension, is one of the best solutions to sync your data between clouds. You can check it out.

Import data from BigQuery to Cloud Storage in different project

I have two projects under the same account:
projectA with BQ and projectB with cloud storage
projectA has BQ with dataset and table - testDataset.testTable
projectB has Cloud Storage and a bucket - testBucket
I use Python and the Google Cloud REST API
account key credentials for each project, with different permissions: the projectA key has permissions only for BQ; the projectB key only for Cloud Storage
What I need:
import data from projectA testDataset.testTable to projectB testBucket
Problems
of course, I'm running into a Permission denied error while trying to do this, because apparently the projectA key does not have permissions for projectB's storage, etc.
another strange issue: since I have testBucket in projectB, I can't create a bucket with the same name in projectA, and I get
This bucket name is already in use. Bucket names must be globally
unique. Try another name.
So it looks like all accounts are connected, which I guess means it should be possible to import data from one project to another via the API.
What can I do in this case?
You have set this up wrong. You need to give a single user account access on both projects so it can work across them. So there needs to be one identity authorized to do the BQ work in one project and the Cloud Storage work in the other project.
Also, bucket names must be globally unique: that means the name cannot be created again anywhere; it's global (you reserve that name for the entire planet, not just for your project).
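Once a single service account (or user) has been granted a BigQuery role on projectA and write access to the bucket in projectB, the export itself is a single extract job. A minimal sketch, where the key file path is a placeholder:

# Sketch: export projectA's table to projectB's bucket using one identity
# that has been granted roles on both projects.
from google.cloud import bigquery
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("key.json")  # placeholder
client = bigquery.Client(project="projectA", credentials=creds)

extract_job = client.extract_table(
    "projectA.testDataset.testTable",   # source table in projectA
    "gs://testBucket/testTable-*.csv",  # destination bucket in projectB
)
extract_job.result()  # wait for the export to finish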