How to have multiple users run the same pipeline in snakemake - metadata permissions issue

I have a pipeline that we want to store in a shared space so that all our analysts can access and use it without having to copy the repo / Singularity container. The problem is that when one person runs the pipeline, the files in .snakemake/metadata belong to that user. When the next user tries to run the pipeline, this metadata is accessed and causes an error.
I know I could delete the data in metadata at the end of the process, but ideally we want users to be able to run the pipeline in parallel (on different inputs / outputs). All users are in the same group, but this doesn't help.
Any suggestions?

Related

ADF: Using ForEach and Execute Pipeline with Pipeline Folder

I have a folder of pipelines, and I want to execute the pipelines inside the folder using a single pipeline. There will be times when there will be another pipeline added to the folder, so creating a pipeline filled with Execute Pipelines is not an option (well, it is the current method, but it's not very "automate-y" and adding another Execute Pipeline whenever a new pipeline is added is, as you can imagine, a pain). I thought of the ForEach Activity, but I don't know what the approach is.
I have not tried this approach, but I think we can use the ADF REST API to get the details of all the pipelines that need to be executed. Since the response is JSON, you can write it back to a temp blob and add a filter to focus on what you need.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/list-by-factory?tabs=HTTP
You can then use the Create Run API to trigger each pipeline.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/create-run?tabs=HTTP
As Joel called out, if different pipelines have different parameter counts, it will be a little messy to maintain.
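A minimal sketch of that approach in Python with the requests library, calling the two endpoints above (the subscription, resource group, factory, folder name and token below are all placeholders, and you would normally obtain the bearer token via azure-identity or a managed identity):

import requests

# --- All identifiers below are placeholders / assumptions ---
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
token = "<azure-ad-bearer-token>"

base = (f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
        f"/factories/{factory_name}")
headers = {"Authorization": f"Bearer {token}"}
api = {"api-version": "2018-06-01"}

# 1) Pipelines - List By Factory: returns every pipeline in the factory
pipelines = requests.get(f"{base}/pipelines", headers=headers, params=api).json()

# 2) Filter on the folder you care about, then trigger each pipeline (Create Run)
for p in pipelines.get("value", []):
    folder = p.get("properties", {}).get("folder", {}).get("name")
    if folder == "MyFolder":  # assumed folder name of interest
        run = requests.post(f"{base}/pipelines/{p['name']}/createRun",
                            headers=headers, params=api)
        print(p["name"], run.json().get("runId"))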
Folders are really just organizational structures for the code assets that describe pipelines (same for Datasets and Data Flows), they have no real substance or purpose inside the executing environment. This is why pipeline names have to be globally unique rather than unique to their containing folder.
Another problem you are going to face is that the "Execute Pipeline" activity is not very dynamic. The pipeline name has to be known at design time, and while parameter values are dynamic, the parameter names are not. For these reasons, you can't have a ForEach loop that dynamically executes child pipelines.
If I were tackling this problem, it would be through an external pipeline management system that you would have to build yourself. This is not trivial, and in your case would have additional challenges because of the folder level focus.

Multiple users executing the same workflow

Are there guidelines regarding how to share a Snakemake workflow among multiple users on the same data under Linux, or is the whole thing considered bad practice?
Let me explain in case it's not clear:
Suppose user A executes a workflow in directory dir/. Assume the workflow terminates successfully, and he/she then properly sets file/directory permissions recursively on all output and intermediate files and the .snakemake/ subdirectory for other users to read/write, of course.
User B subsequently navigates to dir/, adds input files to the workflow, then executes it. Can anything go wrong?
TL;DR: I'm asking about non-concurrent execution of the same workflow by distinct users on the same system, and on the same data on disk. Is Snakemake designed for such use cases?
It's possible to run snakemake --nolock, which prevents locking of the directory, so multiple runs can be made from inside the same directory. However, without the lock there's now an opening for errors caused by concurrent runs trying to modify the same files. It's probably OK if you are certain this will be avoided, e.g. if you are in constant communication with the other user about which files will be modified.
An alternative option is to create a third directory/path and put all the data there. This way you can work from separate directories/paths and avoid costly recomputes.
I would say that from the point of view of snakemake, and workflow management in general, it's ok for user B to add or update input files and re-run the pipeline. After all, one of the advantages of a workflow management system is to update results according to new input. The problem is that user A could find her results updated without being aware of it.
Off the top of my head and without more detail, this is what I would suggest: make Snakemake read the list of input files from a table (pandas comes in handy for this) or from some configuration file. Keep this sample sheet under version control (with git/GitHub) together with the Snakefile and other source code.
When users update the working directory with new files, they will also need to update the sample sheet in order for Snakemake to "see" the new input, and other users will know about it via version control. I prefer this setup over dumping files in a directory and letting Snakemake process whatever is in there.
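As a minimal sketch of that setup (the samples.tsv layout, the data/results paths and the map_and_sort command are all made up for illustration):

# Snakefile (sketch) -- the sample sheet lives under version control next to it
import pandas as pd

# samples.tsv is assumed to have a "sample" column, one row per input
samples = pd.read_csv("samples.tsv", sep="\t")["sample"].tolist()

rule all:
    input:
        expand("results/{sample}.bam", sample=samples)

# hypothetical processing rule; replace with your real steps
rule map_reads:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.bam"
    shell:
        "map_and_sort {input} > {output}"  # placeholder command

Adding a new input then means adding a row to samples.tsv and committing it, so the change is visible to everyone in the repository history.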

Pass output from one pipeline run and use as parameter in another pipeline

The way my ADF setup currently works is that I have multiple pipelines, each containing at least one activity. Then I have one big pipeline that sort of chains these pipelines together.
However, now in the big "master" pipeline, I would like to use the output of an activity from one pipeline and then pass it to another pipeline. All of this orchestrated from the "master" pipeline.
My "master" pipeline would look something like this:
What I have tried to do is adding a parameter to "Execute Pipeline2", and I have tried passing:
@activity('Execute Pipeline1').output.pipeline.runId.output.runOutput
@activity('Execute Pipeline1').output.pipelineRunId.output.runOutput
@activity('Execute Pipeline1').output.runOutput
How would one go about doing this?
Unfortunately, we don't have a way to pass the output of an activity across pipelines. Right now pipelines don't have outputs (only activities do).
We have a work item that will allow a user to choose what the output of a pipeline should be (imagine a pipeline with 40 activities; the user would be able to choose the output of activity 3 as the pipeline output). However, this work item is in very early stages, so don't expect to see it soon.
For now, the only way would be to save the output that you want to storage (blob, for example) and then read it and pass it to the other pipeline. Another method could be a Web Activity that gets the pipeline run (passing the run ID) and retrieves the output via the ADF SDK or REST API, which you then pass to the next Execute Pipeline activity.
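If you take the REST API route, a rough Python sketch of that second option: the Execute Pipeline activity exposes the child run ID as output.pipelineRunId, and the Activity Runs - Query By Pipeline Run endpoint returns each activity's output for that run (the factory details, token and activity name below are placeholders):

import requests

# Placeholders -- same management endpoint style as the other ADF REST calls
base = ("https://management.azure.com/subscriptions/<sub>/resourceGroups/<rg>"
        "/providers/Microsoft.DataFactory/factories/<factory>")
headers = {"Authorization": "Bearer <token>"}
api = {"api-version": "2018-06-01"}

run_id = "<pipelineRunId taken from the Execute Pipeline1 activity output>"

# Activity Runs - Query By Pipeline Run (the time window is required by the API)
body = {"lastUpdatedAfter": "2001-01-01T00:00:00Z",
        "lastUpdatedBefore": "2100-01-01T00:00:00Z"}
resp = requests.post(f"{base}/pipelineruns/{run_id}/queryActivityruns",
                     headers=headers, params=api, json=body).json()

# Pick the activity whose output you need and pass it on to the next pipeline
for act in resp.get("value", []):
    if act["activityName"] == "CopyData1":  # assumed activity name
        print(act.get("output"))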

How to find the role after logging in to bigquery?

I already have access to Google Analytics provided by my client, and BigQuery has been configured for the project. But I want to know if I can create jobs. How do I find the role assigned to my ID?
I want to know if I can create jobs
Below is a simple way to get this:
Just open the Web UI and try to switch to the project of your interest.
a. If you do have it in the list of available projects, just select it and then run (just in case) some simple query (SELECT 1).
If it runs successfully, you can create jobs in this project (because any query is in reality a job).
b. If it is not in the initial list, select "Display Project", enter the project of your interest, and also check the "Make this my current project" box. If this succeeds, most likely you are again lucky and can create jobs in this project (but still run some simple query to be 110% sure).
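The same "run a trivial query" check can also be scripted, e.g. with the BigQuery Python client (the project ID is a placeholder; credentials are assumed to come from the environment):

from google.cloud import bigquery

# If this succeeds, you were able to create a query job in the project
client = bigquery.Client(project="my-project-id")  # placeholder project
rows = client.query("SELECT 1").result()           # submits a job and waits for it
print(list(rows))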
How do I find the role assigned to my ID?
This would be more involved: you will need to use the respective IAM (Google Identity and Access Management) APIs.
For example, you can use the testIamPermissions() API, which allows you to test the Cloud IAM permissions of a user on a resource. It takes the resource URL and a set of permissions as input parameters, and returns the subset of those permissions that the caller is allowed.
The permission you should look for is bigquery.jobs.create, but you can pass this API a list of any permissions you want to check.
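A rough sketch of that check in Python, using the Cloud Resource Manager projects.testIamPermissions method via the Google API client library (the project ID is a placeholder and Application Default Credentials are assumed to be configured):

from googleapiclient import discovery

crm = discovery.build("cloudresourcemanager", "v1")

project_id = "my-project-id"                      # placeholder
body = {"permissions": ["bigquery.jobs.create"]}  # add any other permissions to check

granted = crm.projects().testIamPermissions(
    resource=project_id, body=body).execute()

# The response echoes back only the permissions the caller actually has
print(granted.get("permissions", []))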

Adding auxiliary DB data during deployment

My app consists of two containers: the app itself and a database. I'm planning to wrap the app into a chart, thus paving a way for easy reproducible deployment.
Apart from setting/reading environment variables (which Helm + Kubernetes seems to handle really well), part of the app's configuration is:
making sure the database is pre-filled with special auxiliary data (e.g. admin user exists, some user role names required to create new users are there, etc.).
I like the idea of having readable YAML files hold the entire configuration in a human-readable format. However, at a glance it doesn't seem that Helm would help in any way with this kind of configuration (DB records).
That being said, what is the best place to put code/configuration ensuring that the DB contains certain auxiliary records? A config YAML file? A container init script, written in bash?
You are right, Kubernetes or Helm cannot help with preparing your pre-filled database records/schema.
You should probably have your application initialize those pre-filled data. If you don't want to put this logic into your application, you can ship an initialization script and configure an init container with Kubernetes.
Kubernetes makes sure that every time your application Pod starts, the init container runs to completion before the app container starts. In the init container you can execute a bash/python/... script that makes sure the records you want are there.
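As a sketch of what such an init script could look like in Python (assuming a PostgreSQL database reached through psycopg2; the table, column and environment variable names are made up):

import os
import psycopg2

# Connection details are assumed to come from the same Secret/env vars the app uses
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

with conn, conn.cursor() as cur:
    # Idempotent inserts, so the script is safe to run on every Pod start
    cur.execute(
        "INSERT INTO roles (name) VALUES ('admin'), ('editor'), ('viewer') "
        "ON CONFLICT (name) DO NOTHING"
    )
    cur.execute(
        "INSERT INTO users (username, role) VALUES ('admin', 'admin') "
        "ON CONFLICT (username) DO NOTHING"
    )

conn.close()

You would typically ship this script in the image (or mount it from a ConfigMap) and point the init container's command at it.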