Is there a way to lazy evaluate variables in the shell directive of snakemake?
Our PBS computing cluster creates a node-local scratch directory for every submitted job and sets the value of $TMPDIR to the path to this directory during the execution. After the job ends, this temporary scratch directory gets deleted. Since snakemake is creating the jobscripts (which are later qsub'ed to PBS) on the login node and at a point in time where the temporary scratch is not even created, I'm unable to use the scratch directory for efficient job grouping. This way I have to read and write the files back and forth on the NFS.
I tried using shadow and setting the shadow-dir to $TMPDIR, but again, the correct value is only set after the jobscript got parsed by PBS.
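The timing problem described above can be reproduced outside Snakemake. The following is a minimal Python sketch, assuming a POSIX shell is available; the helper names `render_early` and `run_late` and the example paths are hypothetical, and this is not Snakemake's own code. It only contrasts expanding `$TMPDIR` when the script text is generated with leaving it for the shell to expand inside the job.

```python
import os
import subprocess


def render_early(command: str) -> str:
    """Expand $VARS immediately, as a script generator on the login node would."""
    return os.path.expandvars(command)


def run_late(command: str, env: dict) -> str:
    """Pass the command to the shell untouched; the shell expands $VARS at run time."""
    result = subprocess.run(
        command, shell=True, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical value visible on the login node.
    os.environ["TMPDIR"] = "/login/tmp"
    # Hypothetical value the scheduler sets inside the running job.
    job_env = {**os.environ, "TMPDIR": "/scratch/job42"}

    print(render_early("echo $TMPDIR"))       # path fixed before the job exists
    print(run_late("echo $TMPDIR", job_env))  # path resolved inside the job
```

Anything that is resolved on the login node (such as a shadow directory passed on the command line) behaves like `render_early`; only text that reaches the job's shell verbatim behaves like `run_late`.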
My original Python code restores the ckpt model file frequently, and reading the checkpoints again and again takes too much time. So I decided to keep the model in memory. A simple way is to create a RAMDisk and save the model on that disk. However, something unexpected happens.
I set up a 1 GB RAMDisk following the tutorial How to Create RAM Disk in Windows 10 for Super-Fast Read and Write Speeds. My system is Windows 11.
I made two attempts. In the first one, I copied my code to the RAMDisk E: and used tf.train.Saver().save(self.sess,'./') to save the model, but it fails with UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 114: invalid start byte. However, if I put the code in any normal folder, it runs successfully.
In the second attempt, I put the code under D: and changed the line to tf.train.Saver().save(self.sess,'E:\\'), and it fails with cannot create directory E: Permission Denied. Obviously, E:\ is not a directory that needs to be created, so I don't know how to handle this.
Your Jupyter/Python environment cannot go beyond the directory it was started from, which is why you get a permission denied error.
However, you can run shell commands from the jupyter notebook. If your user has write access to your destination, you can do the following.
model.save("my_model") # This will save the model to the current directory.
!mv "my_model" "E:\my_model" # This will move the model from the current directory to your required directory.
On a side note, when searching for tf.train.Saver().save(), I get this page as the only relevant result, which says it is used for saving checkpoints, not models. They also recommend switching to the newer tf.train.Checkpoint or tf.keras.Model.save_weights. Nonetheless, the above method should work as expected.
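The `!mv` line relies on a Unix-style `mv` being available to the notebook's shell. A hedged alternative is to do the move from Python with the standard library's `shutil.move`. The helper name `move_saved_model` and the paths are illustrative only, and the sketch assumes the process has write access to the destination.

```python
import shutil
from pathlib import Path


def move_saved_model(src: str, dest_dir: str) -> Path:
    """Move a saved model (file or directory) into dest_dir and return its new path."""
    src_path = Path(src)
    target = Path(dest_dir) / src_path.name
    if target.exists():
        # Refuse to overwrite an earlier save silently.
        raise FileExistsError(f"{target} already exists")
    return Path(shutil.move(str(src_path), str(target)))


# Usage (hypothetical paths), after model.save("my_model"):
# new_path = move_saved_model("my_model", "E:\\")
```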
Are there guidelines regarding how to share a Snakemake workflow among multiple users on the same data under Linux, or is the whole thing considered bad practice?
Let me explain in case it's not clear:
Suppose user A executes a workflow in directory dir/. Assume the workflow terminates successfully, and he/she then properly sets file/directory permissions recursively on all output and intermediate files and the .snakemake/ subdirectory for other users to read/write, of course.
User B subsequently navigates to dir/, adds input files to the workflow, then executes it. Can anything go wrong?
TL;DR: I'm asking about non-concurrent execution of the same workflow by distinct users on the same system, and on the same data on disk. Is Snakemake designed for such use cases?
It's possible to run snakemake --nolock, which prevents locking of the directory so that multiple runs can be made from inside the same directory. However, without the lock there is an opening for errors caused by concurrent runs trying to modify the same files. It's probably OK if you are certain that this will be avoided, e.g. if you are in constant communication with the other user about which files will be modified.
An alternative option is to create a third directory and put all the data there. This way each user can work from a separate directory and still avoid costly recomputes.
I would say that from the point of view of snakemake, and workflow management in general, it's ok for user B to add or update input files and re-run the pipeline. After all, one of the advantages of a workflow management system is to update results according to new input. The problem is that user A could find her results updated without being aware of it.
Off the top of my head, and without more detail, this is what I would suggest. Make snakemake read the list of input files from a table (pandas comes in handy for this) or from some configuration file. Keep this sample sheet under version control (with git/github) together with the Snakefile and other source code.
When users update the working directory with new files, they will also need to update the sample sheet in order for snakemake to "see" the new input and other users will know about it via version control. I prefer this setup over dumping files in a directory and letting snakemake process whatever is in there.
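A minimal sketch of such a sample sheet reader follows. It uses the standard library's `csv` module to stay dependency-free (`pandas.read_csv` would do the same job), and it assumes a tab-separated file with hypothetical column names `sample` and `path`; the file name `samples.tsv` is also just an example.

```python
import csv


def read_sample_sheet(sheet_path: str) -> dict:
    """Return {sample_name: input_path} from a tab-separated sample sheet."""
    samples = {}
    with open(sheet_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            name = row["sample"]
            if name in samples:
                # A duplicated name usually means two users added the same sample.
                raise ValueError(f"duplicate sample: {name}")
            samples[name] = row["path"]
    return samples


# Possible use inside a Snakefile (output pattern is hypothetical):
# SAMPLES = read_sample_sheet("samples.tsv")
# rule all:
#     input: expand("results/{sample}.txt", sample=SAMPLES)
```

Because the sheet is the only place where inputs are declared, a new file on disk has no effect until someone commits a new row, which is what makes the change visible to other users.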
I have a very large tar file(>1GB) that needs to be checked out and is a precondition for executing any tests.
I cannot have dedicated build server for my tests since tests are going to be executed on slave machines which are disposable.
Checking out a file (>1GB) every time is not optimal, since this precondition increases the test execution time. What is the best way of solving this problem?
I would dedicate a location on the slaves for that file.
Then in your tests, check if the file is in that location. If not, check it out and move it there. Since this location is outside your normal work area it won't get cleaned, and the file will stay there for the next test execution to use, and you won't need to check it out again.
Of course, if the file changes you have to clear those caches. A first option would be to do this manually. Alternatively, you can create a hash of the file and keep that hash both in the cache and in your version control. You would then compare only the hashes, and check out the file only if they differ.
Of course, this requires that you can check out all the rest of your code without the big file. How to do that obviously depends on the version control system in use.
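The hash comparison could look roughly like the sketch below. It assumes the expected SHA-256 is stored in version control and passed in by the test setup; the function names, the `.sha256` sidecar file, and the `fetch` callback (which stands in for whatever checkout command your version control system needs) are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a >1GB archive is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_cached(cache_dir: Path, name: str, expected_hash: str,
                  fetch: Callable[[Path], None]) -> Path:
    """Return the cached file, calling fetch(dest) only when the stored hash differs."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / name
    stamp = cache_dir / (name + ".sha256")  # hash kept next to the cached file

    if cached.exists() and stamp.exists() and stamp.read_text().strip() == expected_hash:
        return cached  # cache hit: only two short strings were compared

    fetch(cached)  # cache miss or stale file: check it out again
    actual = file_sha256(cached)
    if actual != expected_hash:
        raise ValueError(f"fetched file has hash {actual}, expected {expected_hash}")
    stamp.write_text(actual)
    return cached
```

Storing the hash in a sidecar file means the large file is hashed once per checkout rather than once per test run.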
Each transformation creates a csv file in a folder, and I want to upload all of them once the transformations are done. I added a Dummy step, but the process didn't work as I expected: the Hadoop Copy Files step is executed for each transformation. Why? And how should I design the flow? Thanks.
First of all, if possible, try launching the .ktr files in parallel (right click on the START Step > Click on Launch Next Entries in parallel). This will ensure that all the ktr files are launched in parallel.
Secondly, instead of the dummy step, you can choose any of the steps below, depending on what is feasible for you:
"Checks if files exist" Step: Before moving to the Hadoop step, you can check whether all the files have been properly created and then proceed with your execution.
"Wait For" Step: You can wait some time for all the steps to complete before moving to the next entry. I don't suggest this, since the time needed to write a csv file might vary, unless you are totally sure about the timing.
"Evaluate files metrics" : Check the count of the files before moving forward. In your case, check whether the file count is 9 or not.
The point is simply to do some sort of check on the files before you copy the data to HDFS.
Hope it helps :)
You cannot join the transformations like you do.
Each transformation, upon success, will follow to the Dummy step, so it'll be called for EVERY transformation.
If you want to wait until the last transformation finishes and run the Hadoop copy files step only once, you need to do one of two things:
Run the transformations in a sequence, where each ktr will be called upon success of the previous one (slower)
As suggested in another answer, launch the KTRs in parallel, but with one caveat: they need to be called from a sub-job. Here's the idea:
Your main job has a start, calls a sub-job and upon success, calls the Hadoop copy files step.
Your sub-job has a start, from which all transformations are called in different flows. You use the "Launch next entries in parallel" so all are launched at once.
The sub-job will keep running until the last transformation finishes and only then the flow is passed to the Hadoop copy files step, which will only be launched once.
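Kettle jobs are built in the graphical designer, so the following is only a Python analogy of the control flow described above, not Pentaho code. All names are hypothetical: each transformation is modelled as a callable, `run_sub_job` plays the role of the sub-job with parallel launch, and `copy_files` stands in for the Hadoop copy files step.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sub_job(transformations):
    """Run every transformation in parallel and return only when all have finished."""
    workers = max(1, len(transformations))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in transformations]
        # result() re-raises any failure, so the caller sees an unsuccessful sub-job.
        return [future.result() for future in futures]


def run_main_job(transformations, copy_files):
    """Main job: sub-job first, then the copy step exactly once."""
    outputs = run_sub_job(transformations)  # blocks until the last one finishes
    return copy_files(outputs)
```

The design point is the same as in the job: the copy step hangs off the single "sub-job succeeded" hop instead of off each transformation, so it cannot fire nine times.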
There is a set of log files named with the pattern xxxxxYYY, where xxxxx is some text and YYY is a sequence number that increases by one and wraps around. Only the last n files are available at a given time.
I would like to write a foolproof script that would make sure that all the log files are backed up in another server (via ssh/scp).
Can somebody please suggest a logic/code snippet(perl or shell) for it?
=> The script can run every few minutes to ensure bursts of traffic do not cause log files to miss getting backed up.
=> The roll over needs to be detected so that files are not overwritten in the destination server/directory.
-> I do not have superuser access on either the source or the destination box. The destination box does not have rsync installed, and it would take too long to get it installed.
-> Only one log file gets updated at a time.
I would look at having cron run an rsync --backup command.
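For the case stated in the question, where rsync is not available on the destination, the selection and roll-over logic could be sketched as below. This is one possible approach under stated assumptions, not a tested backup tool: it assumes the most recently modified file is the one still being written (per "only one log file gets updated at a time"), that a closed file's modification time no longer changes, and that the set of already copied destination names is persisted between cron runs. The function name and the `name.mtime` destination scheme are hypothetical; the actual transfer (e.g. with scp) is left out.

```python
def plan_backup(files, already_copied):
    """Decide which closed log files still need copying.

    files: list of (name, mtime) for the logs currently on disk.
    already_copied: set of destination names copied in earlier runs.
    Returns a list of (source_name, destination_name) pairs.
    """
    if not files:
        return []
    ordered = sorted(files, key=lambda item: item[1])
    closed = ordered[:-1]  # the newest file is assumed to be still open
    plan = []
    for name, mtime in closed:
        # Appending the mtime keeps a reused sequence number from
        # overwriting the file that carried it before the wrap-around.
        dest = f"{name}.{int(mtime)}"
        if dest not in already_copied:
            plan.append((name, dest))
    return plan
```

After each successful transfer the destination name would be added to `already_copied`, so a run that is interrupted simply retries the remaining files on the next cron tick.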