Restrict data movement from ADLS Gen2 to other platforms

We are looking for a way to restrict data movement out of ADLS Gen2. If we grant read-only access to a user or an SPN, they can copy the data from ADLS to any platform they wish. Is there a way to restrict data movement outside ADLS, or to generate an alert if any such movement is triggered?

Let's revisit the question. Say a user has read-only access on the storage account, so they can now view the data using the portal, Storage Explorer, etc. The user is planning to write an automation to copy the data from the account to some other account. Here are a few options they could try, and whether each one would work:
1. ADF: they cannot use it, as they do not have the account keys.
2. PowerShell/CLI: they can do this if the script runs under their own user context (see the sketch below).
3. Manually: the user can always open the file, save it locally, and then play around with the data.
So, to the extent I know, I don't think we can solve this in totality.
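To make option 2 concrete, here is a minimal sketch, assuming the Az PowerShell modules are installed; the account name ("contosoadls"), filesystem ("raw"), and file path are placeholders. A user with only read access can pull a file down under their own identity, with no account keys involved:

    # Sign in as the (read-only) user; no account keys are needed anywhere below.
    Connect-AzAccount

    # Build a storage context that uses the signed-in Azure AD identity.
    $ctx = New-AzStorageContext -StorageAccountName "contosoadls" -UseConnectedAccount

    # Download a file from the ADLS Gen2 filesystem to the local machine.
    Get-AzDataLakeGen2ItemContent -Context $ctx -FileSystem "raw" -Path "sales/2023/data.csv" -Destination "C:\temp\data.csv"

From the service's point of view this is just another authorised read, which is why it cannot be blocked without taking the read permission away.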

Related

Query blob storage with Get-AzDataLakeGen2ChildItem?

Our PowerShell test harness used to use Get-AzDataLakeGen2ChildItem to list blobs found in non-data-lake storage accounts. Today I updated the PowerShell and Az module versions they were locked at, and now, when issuing the command (specifying a filesystem container and a context), the following error is returned:
Get-AzDataLakeGen2ChildItem: Input string was not in a correct format.
I'm assuming something has changed and this function can no longer handle results from non-data-lake storage accounts.
For one reason or another, we switched away from Get-AzStorageBlob a while back, so I'm interested to know whether there's any way to keep working with this call, rather than having to deviate from Get-AzDataLakeGen2ChildItem where required.
One workaround is to list the sub-directories and files in a directory or filesystem of an Azure storage account using Get-AzDataLakeGen2ChildItem.
To do that, the storage account must have Hierarchical Namespace enabled.
Then you can list the child items with a call like the sketch below.
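A minimal sketch, assuming an HNS-enabled account and that you are already signed in with Connect-AzAccount; "contosoadls", the "raw" filesystem, and the "sales/" path are placeholder names:

    # Context based on the signed-in identity (an account key or SAS context would also work).
    $ctx = New-AzStorageContext -StorageAccountName "contosoadls" -UseConnectedAccount

    # Recursively list the children of a directory in the filesystem.
    Get-AzDataLakeGen2ChildItem -Context $ctx -FileSystem "raw" -Path "sales/" -Recurse |
        Select-Object Path, IsDirectory, Length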
NOTE: If you are using an existing storage account that does not have Hierarchical Namespace enabled, you will need to upgrade that storage account first (see the documentation linked below).
For more information, please refer to the MS docs for Get-AzDataLakeGen2ChildItem and Get-AzStorageBlob, and the SO thread covering a similar issue.

Azure Blob Storage - How to read source string: wasbs://training@dbtrainsouthcentralus.blob.core.windows.net

I am doing a lab for an Azure Data course and there was some code to run from within Azure Databricks.
I noticed that it seemed to mount something from the following location:
wasbs://training@dbtrainsouthcentralus.blob.core.windows.net
So I am trying to figure out how to deconstruct the above string
wasbs looks to mean "windows azure storage blob"
The string training@dbtrainsouthcentralus.blob.core.windows.net looks like it means "container name"@"account name", which I would think should be something in my Azure Data Lake.
I dug around in my ADLS and was not able to find anything related to "training@dbtrainsouthcentralus.blob.core.windows.net".
So I was wondering, where on earth did this come from? How can I trace back to where this path came from?
The URL is indeed constructed as follows:
wasbs://[container-name]@[storage-account-name].blob.core.windows.net/[directory-name] (source)
I dug around in my ADLS ...
You won't find it in ADLS; it is a separate resource in your subscription. There should be a storage account named dbtrainsouthcentralus.
Note: it could also be a publicly accessible storage account in some training subscription you do not have access to, provided by Microsoft for training purposes.
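If you want to confirm whether it is one of your own accounts, here is a quick, hedged check with Az PowerShell; it can only see subscriptions your signed-in identity has access to, while the DNS lookup works either way:

    # Look for the storage account across the subscriptions you can see.
    Connect-AzAccount
    Get-AzSubscription | ForEach-Object {
        Set-AzContext -Subscription $_.Id | Out-Null
        Get-AzStorageAccount | Where-Object StorageAccountName -eq "dbtrainsouthcentralus"
    }

    # The public blob endpoint resolves regardless of who owns the account (Windows DnsClient module).
    Resolve-DnsName "dbtrainsouthcentralus.blob.core.windows.net"

If the first part returns nothing, it is most likely the Microsoft-provided training account mentioned above.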

Is it possible to read a Google Drive folder (all files) as a BigQuery external data source?

I am using Google Drive as an external data source in BigQuery. I am able to access a single file, but unable to read a folder with multiple files.
Note:
I picked up the shareable link for the folder from Google Drive and used the "bq mk.." command referencing the link ID. It creates the table, but I am unable to pull any data from it.
I've not tried it with Drive, so I have no sense of how performant it is, but when defining an external table (or load job) you can specify the source data as a list of URIs, along the lines of the sketch below. My suspicion is that it's not particularly scalable and may run into limits in Drive, as that's not a typical access pattern. Google Cloud Storage is a much more suitable data source for this kind of thing.
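As a rough, untested sketch of that approach with the bq CLI the question already uses: FILE_ID_1 / FILE_ID_2 and the dataset and table names are placeholders, and it assumes each Drive file is referenced by its own URI rather than by a folder link:

    # Build an external table definition from an explicit, comma-separated list of Drive file URIs.
    bq mkdef --autodetect --source_format=CSV "https://drive.google.com/open?id=FILE_ID_1,https://drive.google.com/open?id=FILE_ID_2" > drive_files_def.json

    # Create the external table from that definition.
    bq mk --external_table_definition=drive_files_def.json mydataset.drive_files

Even then, as noted above, Cloud Storage is likely to be the more dependable home for a multi-file source like this.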

How to back up on-premises data to an AWS S3 bucket using a tool/service?

Let me explain a little bit: we keep users' data in centralized share folders (configured on a domain controller, with permissions set via NTFS and security groups), such as M-Marketing, S-Sales, T-Trading, etc. (these drives are mapped to the Windows login profile according to each user's work profile).
On-premises backup is already configured. Now I want to back up some of the important drives (like M, S, T) to AWS S3 to keep the data safe, and whenever the source data is unavailable for whatever reason, I must be able to map those drives according to each user's work profile.

Transferring Storage Accounts Table into Data Lake using Data Factory

I am trying to use Data Factory to transfer a table from Storage Accounts into Data Lake. Microsoft claims that one can "store files of arbitrary sizes and formats into Data Lake". I use the online wizard and try to create a pipeline. The pipeline gets created, but I then always get an error saying:
Copy activity encountered a user error: ErrorCode=UserErrorTabularCopyBehaviorNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=CopyBehavior property is not supported if the source is tabular data source.,Source=Microsoft.DataTransfer.ClientLibrary,'.
Any suggestions on what I can do to be able to use Data Factory to transfer data from a Storage Account table into Data Lake?
Thanks.
Your case is supported by ADF. As for the error you hit, there is a known defect where, in some cases, the copy wizard mis-generates a "CopyBehavior" property that is not applicable. We are fixing that now.
To work around it for now, go to the Azure portal -> Author and deploy -> select that pipeline -> find the "CopyBehavior": "MergeFiles" line under AzureDataLakeStoreSink (sketched below) and remove it -> then deploy and rerun the activity.
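For reference, a stripped-down sketch of roughly what that sink section looks like in the generated copy activity JSON; the surrounding properties are omitted, and the property may appear as either "copyBehavior" or "CopyBehavior" depending on how the wizard emitted it:

    "sink": {
        "type": "AzureDataLakeStoreSink",
        "copyBehavior": "MergeFiles"
    }

Only the copyBehavior line needs to go; the sink type itself stays as it is.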
If you happened to author a run-once pipeline, please re-author it as a scheduled one, since the former is hard to update using JSON.
Thanks,
Linda