How does Snakemake know that files created through a Slurm job are incomplete when the job is canceled? - snakemake

For example, if I run a job that submits to Slurm, cancel the Slurm job via scancel -u username, and cancel Snakemake with Ctrl-C, then on the next run Snakemake says:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
How does it know this? Does it make its own "touch file" internally?

I don't have a conclusive answer, but maybe this points you in the right direction: Snakemake writes a hidden directory named .snakemake in the working directory (you need ls -a to see it).
Inside .snakemake there are various subdirectories, and the answer to your question may be there. In particular, there is a metadata directory with cryptic filenames; the files appear to be JSON with information about each job. One entry in these JSON files is "incomplete": true/false.
As an aside, it's good to know that .snakemake/log contains the logs of each job.
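To check this observation on your own working directory, here is a minimal sketch that scans .snakemake/metadata and lists the files whose JSON carries "incomplete": true. It assumes only what is described above (JSON files with an "incomplete" entry); the function name find_incomplete is made up for illustration, and the layout may differ between Snakemake versions.

```python
import json
from pathlib import Path


def find_incomplete(workdir="."):
    """Return names of metadata files under <workdir>/.snakemake/metadata
    whose JSON content has "incomplete": true.

    Assumption: each metadata file is a JSON object with an "incomplete"
    key, as observed in the answer above. This is not a documented API.
    """
    metadata_dir = Path(workdir) / ".snakemake" / "metadata"
    flagged = []
    if not metadata_dir.is_dir():
        return flagged
    for path in sorted(metadata_dir.rglob("*")):
        if not path.is_file():
            continue
        try:
            record = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # skip unreadable or non-JSON files
        if isinstance(record, dict) and record.get("incomplete") is True:
            flagged.append(path.name)
    return flagged


if __name__ == "__main__":
    for name in find_incomplete():
        print(name)
```

Run it from the directory where you launched Snakemake. If nothing is printed, your version may be tracking incomplete outputs elsewhere inside .snakemake.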

Related

How to update output files without changing them in Snakemake

I have changed an initial rule that adds links to database files. Snakemake typically requires all dependent rules to rerun, even though the database change is not really relevant. I tried to avoid this behaviour with
snakemake --touch
Unfortunately it does not work, and Snakemake still wants to rerun all the blast processes on top.
How can I check the dependency tree? How can I mark individual files as up to date?
snakemake --version 6.4.1
Best, Michael

Change the location of the .nextflow folder?

When I run Nextflow, I get a .nextflow folder, but I can't find a way to change its location (i.e. it isn't -work-dir). How can I change the location of the .nextflow folder?
I have looked at launchDir, but it seems to be a read-only implicit variable that cannot be overwritten from the CLI. Also, the --launchDir option is only valid for the k8s scope (see original chat in gitter).
I'm using Nextflow 20.10.0 build 5430.
Keeping things neat and tidy is admirable. From this comment, it looks like the only way (without doing crazy things...) is to change to the directory where you want your .nextflow cache directory to live and point all other options (i.e. -work-dir, -log etc.) to a separate directory:
If you want .nextflow in dir A and the pipeline work dir in B:
cd A
nextflow run -w B
The .nextflow has to be in the launching directory to properly maintain the history of the executions.

It seems the file stream is not closed when the read command is used

I use the read command to read some files in my scenarios.
While the tests are running, the files cannot be moved or deleted.
It seems that the file stream is not closed after the read command is used.
Can the files be closed declaratively?
Environment:
Karate 0.9.2
We are also writing log files and in some cases this can cause issues, see: https://github.com/intuit/karate/issues/661#issuecomment-458127918
But I think keeping a file lock is expected. Also we are closing files whenever we read them, but maybe we have a bug. So if you can create a sample project to replicate your problem, that would be great: https://github.com/intuit/karate/wiki/How-to-Submit-an-Issue

gsutil rsync only files matching a pattern

I need to rsync files from a bucket to a local machine every day, and the bucket contains 20k files. I need to download only the changed files that end with *some_naming_convention.csv.
What's the best way to do that? Using a wildcard in the download source gave me an error.
I don't think you can do that with rsync. As Christopher told you, you can skip files by using the "-x" flag, but not sync only those [1]. I created a public Feature Request on your behalf [2] so you can follow updates there.
As I say in the FR, IMHO this does not fit the purpose of rsync, which is to keep folders/buckets synchronised; synchronising only some of the files falls outside that purpose.
There is a possible "workaround" by using gsutil cp to copy files and -n to skip the ones that already exist. The whole command for your case should be:
gsutil -m cp -n <bucket>/*some_naming_convention.csv <directory>
Another option, maybe a little more far-fetched, is to copy or move those files to a folder and then rsync that folder.
I hope this works for you ;)
Original Answer
From here, you can do something like gsutil rsync -r -x '^(?!.*\.json$).*' gs://mybucket mydir to rsync all json files. The key is the ?! prefix to the pattern you actually want.
Edit
The -x flag excludes a pattern. The pattern ^(?!.*\.json$).* uses negative look-ahead to specify patterns not ending in .json. It follows that the result of the gsutil rsync call will get all files which end in .json.
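The effect of the negative look-ahead can be checked locally before touching the bucket. The sketch below assumes the -x pattern is a Python regular expression matched from the start of each path; the sample file names are invented for illustration.

```python
import re

# The exclusion pattern from the answer: matches every name that does NOT end in .json
EXCLUDE = re.compile(r"^(?!.*\.json$).*")


def is_excluded(name):
    """True if the pattern matches from the start of the name (file would be skipped)."""
    return EXCLUDE.match(name) is not None


# Hypothetical object names, for illustration only
names = ["data/a.json", "data/b.csv", "notes.json.bak", "c.json"]
kept = [n for n in names if not is_excluded(n)]
print(kept)  # ['data/a.json', 'c.json']
```

For the question's case, the same idea would use ^(?!.*some_naming_convention\.csv$).* as the exclusion pattern.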
Rsync lets you include and exclude files matching patterns.
For each file, rsync applies the first pattern that matches, so if you want to sync only selected files you need to include those and then exclude everything else.
Add the following to your rsync options:
--include='*some_naming_convention.csv' --exclude='*'
That's enough if all your files are in one directory. If you also want to search sub folders then you need a little bit more:
--include='*/' --include='*some_naming_convention.csv' --exclude='*'
This will duplicate all the directory tree, but only copy the files you want. If that leaves empty directories you don't want then add --prune-empty-dirs.
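The first-match-wins rule is why the order of the options matters. The following is a deliberately simplified model for the single-directory case only: it uses Python's fnmatch for the globbing, which is not identical to rsync's own pattern matching, and the function name is_transferred is made up for illustration.

```python
from fnmatch import fnmatchcase

# Rules in the same order as the command-line options above
RULES = [
    ("include", "*some_naming_convention.csv"),
    ("exclude", "*"),
]


def is_transferred(filename, rules=RULES):
    """Apply the first rule whose pattern matches; later rules are ignored."""
    for action, pattern in rules:
        if fnmatchcase(filename, pattern):
            return action == "include"
    return True  # a file matching no rule is transferred


print(is_transferred("2021_some_naming_convention.csv"))  # True
print(is_transferred("readme.txt"))                       # False
```

Swapping the two rules would make the '*' exclude match first for every file, so nothing would be copied.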

Mock filesystem in OCaml

I am writing code that creates a folder/file structure in OCaml, and I want to write some tests for it. I'd like to not have to create and delete files each time the tests are run, since they can be run many times.
What would be the best way to mock the filesystem? I'd be open to having a filesystem in memory or just mocking up functions.
Maybe you could use a Makefile to help you.
For instance, make test might start by compiling your program, then create the files and folders required for testing, launch your program, and then clean the test folder if need be (at that point, you might also want to check whether the state of the test folder is as expected).
On Linux:
mount -o size=50m -t tmpfs none ./ramdisk
will create a filesystem in RAM, 50 MB in size, mounted at ./ramdisk. Only root can mount it, but non-root users can then use it. It will show up in df and du. You can remove it with umount ./ramdisk.
Creation, usage and removal work just fine; the root requirement may be an obstacle.