always start with 1 folder under a container in azure data lake storage gen2 - azure-data-lake

Per the warning in this link, Microsoft recommends that the hierarchy under a Gen2 container begin with a single top-level folder (rather than placing multiple folders directly under the container root) because some applications cannot mount the root of a container. I have never seen this happen. What are some examples of applications that cannot mount a container root? Is this a legitimate warning?

Attempting to Read parquet files on linked storage in Azure Synapse

I am attempting to give access to parquet files on a Gen2 Data Lake container. I have owner RBAC on the container but would prefer to limit access in the container for other users.
My Query is very simple:
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet',
    FORMAT = 'PARQUET'
) AS [result]
When I run this I have no problems connecting. I have attempted to add ACL rights onto the files (and of course the containing folders 'Top' and 'Sub').
I've given RWX on the 'Top' folder using Storage Explorer, set as the default ACL so that it cascades to the 'Sub' folder and to the parquet files as I add them.
When my colleague attempts to run the SQL script, they get the error message: Failed to execute query. Error: File 'https://aztsworddataaipocacldl.dfs.core.windows.net/pocacl/Top/Sub/part-00006-c62926ba-c530-4ad8-87d1-cf38c67a2da3-c000.snappy.parquet' cannot be opened because it does not exist or it is used by another process.
NB: similar results also occur in Spark, but with a 403 error instead.
SQL on-demand provides a link to the following help file after the error; it suggests:
If your query fails with the error saying 'File cannot be opened because it does not exist or it is used by another process' and you're sure both file exist and it's not used by another process it means SQL on-demand can't access the file. This problem usually happens because your Azure Active Directory identity doesn't have rights to access the file. By default, SQL on-demand is trying to access the file using your Azure Active Directory identity. To resolve this issue, you need to have proper rights to access the file. Easiest way is to grant yourself 'Storage Blob Data Contributor' role on the storage account you're trying to query.
I don't wish to grant Storage Blob Data Contributor or Storage Blob Data Reader as this gives access to every file on the container and not just those I want end users to be able to query. We have found the same experience occurs for SSMS connecting to parquet external tables.
So then in parts:
Is this the correct pattern using ACL to grant access, or should I use another method?
Are there settings on the Storage Account or within my query/notebook that I should be enabling to support ACL?*
Has ACL been implemented on Synapse Workspace to date given that we're still in preview?
*I have resisted pasting my entire settings as I really have no idea what is relevant and what is entirely irrelevant to this issue, but of course I can supply them.
It would appear that the ACL feature was not working correctly in Preview for Azure Synapse Analytics.
I have now managed to get it to work. At present I see that once Read|Execute is granted on a folder, it allows access to the files contained within that folder and its subfolders, even when no specific ACL access is provided on a file in a subfolder. This is not quite what I expected; however, it gives me enough to proceed: granting access only to the Gold folder separates the files I want users to query from the working files I want to keep hidden.
When you assign an ACL to a folder, it is not propagated recursively to the files already inside that folder. Only new files inherit from the folder's default ACL.
You can see this here.
Go to Azure Storage Explorer, change the ACL permissions on the root folder, then right-click on your storage and click "Propagate Access Control Lists".
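If you prefer to script this rather than click through Storage Explorer, below is a minimal sketch using the azure-storage-file-datalake Python SDK. It assumes the account and container names from the URL in the question, a placeholder AAD object ID for the colleague, and a recent SDK version that provides update_access_control_recursive; it is an illustration of setting access and default ACLs and then propagating them, not a verified fix for the Synapse issue.

# Minimal sketch: grant a colleague read access to 'Top' and everything under it.
# The account/container/folder names come from the question; the object ID is a placeholder.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

COLLEAGUE_OBJECT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder AAD object ID

service = DataLakeServiceClient(
    account_url="https://aztsworddataaipocacldl.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
top = service.get_file_system_client("pocacl").get_directory_client("Top")

# Access ACL plus default ACL on 'Top': the 'default:' entries are what new
# children inherit; existing children are not changed by this call alone.
acl = (
    f"user::rwx,group::r-x,other::---,"
    f"user:{COLLEAGUE_OBJECT_ID}:r-x,"
    f"default:user::rwx,default:group::r-x,default:other::---,"
    f"default:user:{COLLEAGUE_OBJECT_ID}:r-x"
)
top.set_access_control(acl=acl)

# Push the same entries down onto existing subfolders and files, i.e. the
# scripted equivalent of Storage Explorer's "Propagate Access Control Lists".
top.update_access_control_recursive(acl=acl)

# Note: to read a file, the colleague's identity also needs execute (x) on every
# folder in the path, starting from the container root, not just on 'Top'.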

OperationalError: Attempt to Write A ReadOnly Database on Google Cloud Application

Recently, I have been trying to deploy an interactive Google App Engine app that writes to a SQLite database. It works fine when running the app locally, but when running it through the server, I receive the error:
OperationalError: attempt to write a readonly database
I tried changing the permissions on my .db and .sql files, but no luck.
Any advice would be greatly appreciated.
You can try changing the permissions of the directory and checking that the .sqlite file exists and is writable.
But generally speaking, it is not a good idea to rely on disk data when working on App Engine, as disk storage is ephemeral (unless you are using persistent disks on the flexible environment), and even then it's better to use a cloud database solution.
App Engine has a read-only file system, i.e. no files can be modified. It does, however, provide a /tmp/ folder for storing temporary files, as the name suggests. /tmp/ is backed by RAM, so it is not a good option if the database is huge.
On app startup you can copy your original database file to the /tmp/ folder and use it from there afterwards.
This works. However, all changes to the database are lost when the app's nodes scale to 0. Each node of the app has its own copy of the database, and the data is not shared between nodes. If you need the data to be shared between the app nodes, better to use Cloud SQL.
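To illustrate the copy-to-/tmp approach, here is a minimal sketch; the database file name and table are placeholders, and all the caveats above (per-instance copies, data lost on scale-down) still apply.

# Minimal sketch of copying a bundled SQLite file into /tmp on App Engine.
# 'app.db' and the 'notes' table are placeholders, not from the question.
import os
import shutil
import sqlite3

BUNDLED_DB = os.path.join(os.path.dirname(__file__), "app.db")  # read-only, deployed with the code
WRITABLE_DB = "/tmp/app.db"                                      # /tmp is writable (RAM-backed)

def get_connection():
    # Copy the bundled database into /tmp once per instance; anything written
    # here disappears when the instance shuts down or scales to zero.
    if not os.path.exists(WRITABLE_DB):
        shutil.copyfile(BUNDLED_DB, WRITABLE_DB)
    return sqlite3.connect(WRITABLE_DB)

conn = get_connection()
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()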

Redis Cache Share Across Regions

I've got an application using Redis for caching, and it works well so far. However, we need to spread our application across different regions (through a dynamic DNS dispatcher based on user location, so local users can reach the nearest server).
Considering the network limitations and bandwidth, it's not feasible to build a centralised Redis, so we have to assign a separate Redis instance to each region. The problem is how to handle the roaming case: a user opens the app in location 1, then continues using the app in location 2, without missing the cache from location 1.
You will have to use a tiered architecture. This is how most CDNs, like Akamai or Amazon CloudFront, work.
Simply put, this is how it works:
1. When an object is requested, see if it exists in the Redis cache server S1 assigned to location L1.
2. If it does not exist in S1, check whether it exists in the caching servers of the other locations, i.e. S2, S3, ..., SN.
3. If it is found in S2...SN, store the object in S1 as well, and serve the object.
4. If it is not found in S2...SN either, fetch the object fresh from the backend, and store it in S1.
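A minimal sketch of that lookup order using the redis-py client is below; the region names, host names, and the backend fetch function are placeholders, and real code would also need TTLs, timeouts, and error handling.

# Minimal sketch of the tiered lookup described above (redis-py).
# Hosts and the region list are placeholders.
import redis

LOCAL_REGION = "eu-west"
REGIONS = {
    "eu-west": redis.Redis(host="redis.eu-west.example.com", port=6379),
    "us-east": redis.Redis(host="redis.us-east.example.com", port=6379),
    "ap-south": redis.Redis(host="redis.ap-south.example.com", port=6379),
}

def get_cached(key, fetch_from_backend):
    local = REGIONS[LOCAL_REGION]

    # 1. Try the cache server assigned to the local region (S1).
    value = local.get(key)
    if value is not None:
        return value

    # 2. Check the caching servers of the other locations (S2...SN).
    for region, client in REGIONS.items():
        if region == LOCAL_REGION:
            continue
        value = client.get(key)
        if value is not None:
            # 3. Found remotely: backfill S1 so the next request is a local hit.
            local.set(key, value)
            return value

    # 4. Not cached anywhere: fetch fresh from the backend and store in S1.
    value = fetch_from_backend(key)
    local.set(key, value)
    return value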
If you are using memcached for caching, then Facebook's open-source mcrouter project will help, as it does centralized caching.

NFS file open in C code

If I open a file in my C/C++/Java code using a pathname that points to an NFS directory, how does the read and write syntax work, given that NFS is stateless and all? I have tried but can't find example code that accesses NFS-mounted files. My current understanding is that it is the job of the NFS client to keep state (like the read and write pointer) and the application uses the same syntax.
A related question regards VFS and UFS: are all files on a current Unix machine accessed through their vnodes first, and then (depending on local vs remote) through inode or rnode structures?
NFS (file locking aside) is no different from local storage to user-level applications. It might be slower, or it might drop out unexpectedly, but that can happen to local storage too. That's probably why you can't find specific NFS-centric example code.
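To make that concrete, here is a tiny illustration in Python (the question mentions C/C++/Java, but the point is the same in any language; the mount point path is a placeholder): the code is indistinguishable from ordinary local file access, because the NFS client in the kernel keeps the file-handle and offset state for you.

# Reading and writing a file on an NFS mount uses exactly the same calls as
# for a local file; /mnt/nfs_share is an assumed mount point.
path = "/mnt/nfs_share/data/example.txt"

with open(path, "r") as f:     # same call as for a local path
    first_line = f.readline()
    print(first_line)

with open(path, "a") as f:     # writes go through the same API too
    f.write("appended over NFS\n")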

AWS EC2- Synching source code files with S3 - is it a proper approach?

On an app server where a few source files change frequently, is the following approach recommended?
Use a cron job with s3tools to sync the source files with a private S3 bucket (every 15 minutes, for example).
On server start-up, use a user data script to sync with the sources bucket to retrieve the latest sources.
Advantages:
1. No need to attach EBS for app server just to save a few files
2. Similar setup to all app servers
3. Sources automatically backed up.
4. As a byproduct, distributes code to multiple app servers automatically.
Disadvantages:
keeping source code on S3
other?
What do you think about this methodology? Is this the right way to use EC2 when source code changes frequently (a few times a day)? Please recommend the best approach for running EC2 instances where sources change often.
I think you're better off using a proper source code repository, like Subversion or Git, rather than storing the source files on S3. That way you can have a central location for the source files while avoiding the update consistency problems that kdgregory mentioned.
You can put the source repository on one of your own servers outside of EC2, or host it on an EC2 instance (make sure the repository files are on an EBS volume in the latter case).
If you're going to be running a large number of EC2 instances, then it will be less effort to have them sync themselves from a central location (i.e., you sync to the private bucket, and the app servers sync from that bucket).
HOWEVER, recognize that updates to an S3 bucket are atomic only at the object level, and more importantly, are not guaranteed to be immediately consistent (although I recall seeing a recent note that the us-west endpoint does offer read-after-write consistency).
This means that your app-servers may load a set of new files that are internally inconsistent -- some will be old, some will be new. If this is a problem for you, then you should implement a scheme that uploads directly to the app-servers, and ensures changeset consistency (perhaps by uploading to a temporary directory that is then renamed).
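As a sketch of that last idea on the pull side, the snippet below downloads the whole prefix from the bucket into a brand-new release directory and then atomically swaps a 'current' symlink, so the running app never sees a half-updated mix of old and new files. It uses boto3; the bucket name, prefix, and paths are placeholders, and packaging each changeset as a single object (e.g. a tarball) on the upload side remains the simpler way to sidestep the per-object consistency issue.

# Minimal sketch: pull sources from S3 into a fresh directory, then swap a symlink.
# Bucket, prefix and paths are placeholders; error handling and cleanup of old
# releases are omitted.
import os
import time
import boto3

BUCKET = "my-private-sources-bucket"
PREFIX = "app/src/"
RELEASES_DIR = "/opt/app/releases"
CURRENT_LINK = "/opt/app/current"   # the app server reads its code via this symlink

def pull_release():
    s3 = boto3.client("s3")
    release_dir = os.path.join(RELEASES_DIR, time.strftime("%Y%m%d-%H%M%S"))

    # 1. Download the whole prefix into a brand-new directory, so the running
    #    application never works against a partially updated tree.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            rel_path = obj["Key"][len(PREFIX):]
            if not rel_path or rel_path.endswith("/"):
                continue  # skip the prefix itself and folder-marker objects
            dest = os.path.join(release_dir, rel_path)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], dest)

    # 2. Atomically repoint the 'current' symlink at the new release directory.
    tmp_link = CURRENT_LINK + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, CURRENT_LINK)  # atomic rename on POSIX

pull_release()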