Where is the root Azure Storage instance? - azure-storage

I am trying to access logs from my Databricks notebook which is run as a job. I would like to see these logs in an azure storage account.
From the documentation: https://learn.microsoft.com/en-us/azure/databricks/administration-guide/workspace/storage#notebook-results
According to this, my results are stored in the workspace's root Azure Storage instance. However, I can't find any reference to this elsewhere online. How would I access this?

The documentation says:
Notebook results are stored in workspace system data storage, which is not accessible by users.
But you can retrieve these results via UI, or via get-output command of Jobs REST API, or via runs get-output command of databricks-cli.

Related

How do I access files generated during Cloud Run function execution?

I'm running a very simple program getting screenshots of a page using Selenium in Cloud Run. I know that Cloud Run is stateless and I cannot access the screenshot that is generated after the program finishes executing, but I wanted to know where/how can I access these files right after the screenshot is taken and read them, so I can store a reference to them in my Cloud Storage bucket too
You have several solution:
Store the screenshot locally, and then upload them to Cloud Storage (you can create a script for that, use client libraries,...). A good evolution is to make a tar (optionally a gzip also) to upload only 1 file, it's faster.
Use Cloud Run execution runtime 2nd generation, and mount a bucket with GCSFuse into your Cloud Run instance. Like that, a file directly written in the mounted directory will be written on Cloud Storage. For that solution, and despite the good tutorial, it requires good skills in container.

What is the best approach to sync data from AWS 3 bucket to Azure Data Lake Gen 2

Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. Also, I would like to have my AWS profile credentials persist?
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command. So, developers like #raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM or could I use Azure Data Factory?
Use Azure Data Factory to copy data from AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I would want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from AWS S3 data source?
Use Azure Data Science Virtual Machine to: a) install the AWS CLI, 2) create my AWS profile to store the access credentials, and 3) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please do also look into Azcopy option . https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool which helps in syncing the data , more or less all will do the copy , I think you will have to implement that . Couple of quick thoughts .
#3 ) You can run this from a batch service . You can initate that from Azure data factory . Also since are talking about Python , you can also run that from Azure data bricks .
#4) ADF does not have any sync logic for the files to be deleted. We can implement that using the getMetadat activity . https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
AzReplciate is another option - especially for very large containers https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/

Attempting to Read parcquet files on linked storage in Azure Synapse

I am attempting to give access to parquet files on a Gen2 Data Lake container. I have owner RBAC on the container but would prefer to limit access in the container for other users.
My Query is very simple:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet',
FORMAT='PARQUET'
) AS [result]
When I run this I have no problems connecting. I have attempted to add ACL rights onto the files (and of course the containing folders 'Top' and 'Sub').
I've give RWX on the 'Top' folder using Storage Explorer and default so that it cascades to the 'Sub' folder and parquet files as I add them
When my colleague attempts to run the SQL script the get the error message. Failed to execute query. Error: File 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet' cannot be opened because it does not exist or it is used by another process.
NB similar results are also experienced in Spark but with a 403 instead
SQL on-demand provides a link to the following help file after the error, it suggests:
If your query fails with the error saying 'File cannot be opened because it does not exist or it is used by another process' and you're sure both file exist and it's not used by another process it means SQL on-demand can't access the file. This problem usually happens because your Azure Active Directory identity doesn't have rights to access the file. By default, SQL on-demand is trying to access the file using your Azure Active Directory identity. To resolve this issue, you need to have proper rights to access the file. Easiest way is to grant yourself 'Storage Blob Data Contributor' role on the storage account you're trying to query.
I don't wish to grant Storage Blob Data Contributor or Storage Blob Data Reader as this gives access to every file on the container and not just those I want end users to be able to query. We have found the same experience occurs for SSMS connecting to parquet external tables.
So then in parts:
Is this the correct pattern using ACL to grant access, or should I use another method?
Are there settings on the Storage Account or within my query/notebook that I should be enabling to support ACL?*
Has ACL been implemented on Synapse Workspace to date given that we're still in preview?
*I have resisted pasting my entire settings as I really have no idea what is relevant and what entirely irrelevant to this issue but of course can supply.
It would appear that the ACL feature was not working correctly in Preview for Azure Synapse Analytics.
I have now managed to get it to work. At present I see that once Read|Execute is provided to a folder it allows access to the files contained within that folder and sub folders. Access is available even when no specific ACL access is provided on a file in a sub folder. This is not quite what I expected however it provides enough for me to proceed: only giving access to the Gold folder allows for separation of access to the files I want to let users query and the working files that I want to keep hidden.
When you assign ACL to a folder it's not propagated recursively to all files inside the folder. Only new files inherit from the folder.
You can see this here
Go to azure storage explorer change ACL permissions in the route Folder and right click on your storage and click on "propogate access control lists"

AzureFileShareConfiguration mount drive disconnected

I am trying to create a Pool using Azure Batch . I have uploaded content to Azure Storage using File Shares.
I would like my Pool to mount this Azure File Share as virtual file system (ref: https://learn.microsoft.com/en-us/azure/batch/virtual-file-mount#mount-a-virtual-file-system-on-a-pool ).
I am creating AzureFileShareConfiguration object using code:
mount_configuration=batchmodels.MountConfiguration(azure_file_share_configuration=batchmodels.AzureFileShareConfiguration(
account_name="mystorage",
azure_file_url="https://mystorage.file.core.windows.net/my-share1",
account_key="mystorage/key==",
relative_mount_path="S"
)
)
Using this, I get "CMDKEY: Credentials added successfully" in fsmounts. But when I RDP to the node in the pool, the S drive appears "Disconnected".
My Azure batch package versions are:
azure-batch==8.0.0
azure-common==1.1.24
Can you please help diagnose the issue or suggest the right usage?
Thanks in Advance!
I think this is windows VM you are trying?, just by looking at the drive letter : ).
Here is the key issue with RDP permissions is different then your Batch level model when your code runs and mount.
At Batch level when you mount your Drive: and you can see it via your Start task then it is working. i.e. that a Batch level permissioning model and when you RDP into Node it will be as a "user" you are logged-in. If you want to see via UI RDP user you should re-run the command from your RDP login to update that you have key to see that drive.
Although having said that try it with /persistent:Yes as mount_options.
The best test is going to be -- You mount the drive and from your start task go to the mounted directory via : S:\\Whatever_file.txt or read the mounted file which will add the result in your stdout.txt of batch node or might be just dir it or something.
Rest extra stuff below
try with this mount_options value
Also specifically this will help for various SMB version et. al. support: https://learn.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-windows and I think this you already know : https://learn.microsoft.com/en-us/azure/batch/virtual-file-mount#azure-files-share
In order to use an Azure file share outside of the Azure region it is
hosted in, such as on-premises or in a different Azure region, the OS
must support SMB 3.0.
So add this to your API and give it a try:
MountOptions = "/persistent:Yes" i.e. mount_options = "/persistent:Yes"
Also: key needs to be Storage account Key, i.e. it should not start with mystorage/key :) but it could be you hiding it, so just a mention and fyi.
Sample code:
I think SDK you have is python?
mount_configuration=batchmodels.MountConfiguration(azure_file_share_configuration=batchmodels.AzureFileShareConfiguration(
account_name="mystorage",
azure_file_url="https://mystorage.file.core.windows.net/my-share1",
account_key="mystorage/key==",
relative_mount_path="S",
mount_options = "/persistent:Yes"
)
hope this helps!
relative_mount_path: The relative path on the compute node where the file system will be mounted. All file systems are mounted relative to the Batch mounts directory, accessible via the AZ_BATCH_NODE_MOUNTS_DIR environment variable.
Azure Files is the standard Azure cloud file system offering. To learn more about how to get any of the parameters in the mount configuration code sample, see Use an Azure Files share.

Instance Types missing while creating Server Group in Spinnaker

I have used AWS Community AMI for configuring Spinnaker. I am able to get the lists of ELB, AMI and Security Groups while creating Server Group. But, I am not getting the Instance types in the custom drop down list. Any idea about what could be going wrong?
Spinnaker Cluster Error
It looks like you are not having a correct IAM role assigned to the user whose access keys you are using for the spinnaker integration with AWS.
Mostly if you used the spinnaker.Check if you have enough rights in AWS.
If not then create a role and assign AWS POWER USER ACCESS to your user and then try to get the integration .
Spinnaker is a tool which would need AWS EC2 Full access atleast as it directly access EC2 spin up its server groups.
Instance types are cached in the browser's local storage. You can explicitly refresh the cache via the 'Refresh all caches' link:
If you show the network tab of your browser's console (prior to clicking 'Refresh all caches'), you should see a request to http://localhost:8084/instanceTypes.
If the response contains your instance types, you should be good to go.