Using GNU Make for Script Building - scripting

I have a script (the language is VBScript, but for the sake of the question, it's unimportant) which is used as the basis for a number of other scripts -- call it a "base script" or "wrapper script" for others. I would like to modularize my repository so that this base script can be combined with the functions unique to a specific script instance and then rebuilt later, should either one of the two change.
Example:
baseScript.vbs -- Logging, reporting, and other generic functions.
queryServerFunctions.vbs -- A script with specific, unique tasks (functions) that depend on functions in baseScript.vbs.
I would like to use make to combine the two (or any arbitrary number of files with script fragments) into a single script -- say, queryServer.vbs -- that is entirely self-contained. This self-contained script could then be rebuilt by make anytime either of its source scripts changes.
The question, then, is: Can I use make to manage script builds and, if so, what is the best or preferred way of doing so?
If it matters, my scripting environment is Cygwin running on Windows 7 x64.

The combination of VBScript and GNU make is unusual, so I doubt you'll find a "preferred way" of doing this. It is certainly possible. Taking your example, and adding another script called fooServer.vbs to show how the solution works for multiple scripts, here's a simple makefile:
# list all possible outputs here
OUTPUT := queryServer.vbs fooServer.vbs
# tell make that the default action should be to produce all outputs
.PHONY: all
all: $(OUTPUT)
# generic recipe for combining scripts together
$(OUTPUT):
cat $+ > $@
# note: The first character on the line above should be a tab, not a space
# now we describe the inputs for each output file
queryServer.vbs: baseScript.vbs queryServerFunctions.vbs
fooServer.vbs: baseScript.vbs fooFunctions.vbs
That will create the two scripts for you from their inputs, and if you touch, for example, queryServerFunctions.vbs, then only queryServer.vbs will be remade.
Why go to all that trouble, though?
The purpose of make is to "rebuild" things efficiently, by comparing file timestamps to judge when a build step can be safely skipped because the inputs have not changed since the last time the output file was updated. The assumption is that build steps are expensive, so it's worth skipping them where possible, and the performance gain is worth the risk of skipping too much due to bugs in the makefile or misleading file timestamps.
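In other words, for every target make asks whether the output file is missing or older than any of its inputs. Roughly the following check, shown as a Python sketch of the idea (not make's actual implementation):
# Rough Python sketch of the timestamp test make applies per target
# (illustrative only; real make also handles phony targets, pattern rules, etc.).
import os

def needs_rebuild(target, inputs):
    """Rebuild if the target is missing or older than any of its inputs."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in inputs)

# e.g. needs_rebuild("queryServer.vbs", ["baseScript.vbs", "queryServerFunctions.vbs"])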
For copying a few small files around, I would say a simple batch file like
copy /y baseScript.vbs+queryServerFunctions.vbs queryServer.vbs
copy /y baseScript.vbs+fooFunctions.vbs fooServer.vbs
would be a better fit.

Related

How to name the second element?

As a programmer with OCD (Obsessive-Compulsive Disorder), I often wonder how people in the programming world name the second element (variable, file name, etc.) in a series.
For example, I create a file with the name file. I do NOT expect there is another one in this series.
However, one day I got a second one. What do you usually name it?
For example, it can be file1, or file2, or file0, or file_b, or fileB, or file_, or file (1) ...
There could be a lot. Which one is better (for some reasons)?
I am mostly concerned about file2 VS file1, as elements start from 0 in the computer science world, whereas the real world starts from 1.
Depending on how exactly it should read, I think most people will do file_001 or file_002, but I've seen it done many different ways in professionally written code, though all numbering systems use numbers and not letters.
Also, always name your files with leading zeros so that they don't get out of order: file11 would sort before file2, so use something like file002 and file011 instead.
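A quick way to convince yourself of the sorting point (plain Python, purely illustrative):
# String sorting is character by character, so unpadded numbers get
# "out of order" while zero-padded ones do not.
print(sorted(["file2", "file11"]))     # ['file11', 'file2']
print(sorted(["file002", "file011"]))  # ['file002', 'file011']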
It's usually not a big deal, but open source projects may specify a way to name files in the readme. If file naming is important to you, it never hurts to explain how you name your files in your project readme.
As is often the case, it's better to refactor instead of patching up the current "code": rename the first file as well (to file1, file_01, file_a or whatever), unless that would cause too much trouble (but even in that case, it would make sense to consider using a "view": leave the current file but add a file_01 hardlink/softlink to it - or probably better, a softlink from file to file_01).
For filenames in particular, leaving file as is will be annoying, because it will usually get placed after the numbered files in directory listings.
And in the last paragraph, I imagine you meant file0 VS file1...?
If so, I'd say to go with 1, it's much more common in my experience.
And it's not true that elements start from 0 in the computer science world: that's indeed what most programming languages and just about all low-level stuff do, but it's not a must. From personal experience I can guarantee that, when you can do it without too much risk, starting from 1 often helps readability a lot; this base-0 thing is one of the many mantras that should be let go in software development.
But in any case for naming files and stuff in general (as opposed to array-indexing) it's more common to start from 1 (in my experience).

Elegantly handle samples with insufficient data in workflow?

I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.
Some of the tools in the pipeline will fail or error when samples with low amounts of data are delivered as inputs -- but this is sometimes not knowable from the raw input data, as intermediate filtering steps (such as adapter trimming and host genome removal) can remove varying numbers of reads.
Ideally, I would like to be able to handle these cases with some sort of check on certain input rules, which could evaluate the number of reads in an input file and choose whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?
Many thanks,
-jon
I'm not aware of any way to skip part of the workflow based on some computation happening inside the workflow. The rules to be executed are determined from the final required output, and the run will fail if that output cannot be generated.
One approach could be to catch the particular tool failure (a try ... except construct in a run section, or return-code handling in a shell section) and generate a dummy output file for the corresponding rule, then have the downstream rules "propagate" dummy file generation based on a test identifying the rule's input as such a dummy file.
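As a rough illustration of that first approach, a pair of helpers like these (plain Python, names invented) could be imported into the Snakefile: a rule would call write_dummy() when its tool fails, and downstream rules would call is_dummy() on their inputs before doing real work:
# Plain-Python helpers sketching the dummy-file idea.
import os

DUMMY_MARKER = b"# DUMMY OUTPUT: upstream step had insufficient data\n"

def write_dummy(path):
    """Create a placeholder output so Snakemake still sees the rule as done."""
    with open(path, "wb") as fh:
        fh.write(DUMMY_MARKER)

def is_dummy(path):
    """Return True if the file is one of our placeholder outputs."""
    if os.path.getsize(path) != len(DUMMY_MARKER):
        return False
    with open(path, "rb") as fh:
        return fh.read() == DUMMY_MARKER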
Another approach could be to pre-process the data outside of your snakemake workflow to determine which input to skip, and then use some filtering on the wildcards combinations as described here: https://stackoverflow.com/a/41185568/1878788.
I've been trying to find a solution to this issue as well.
Thus far I think I've identified a few potential solutions, but have yet to be able to correctly implement them.
I use seqkit stats to quickly generate a txt file and use the num_seqs column to filter with. You can write a quick pandas function to return a list of files which pass your threshold, and I use config.yaml to pass the minimum read threshold:
import pandas as pd

def get_passing_fastq_files(wildcards):
    # 'fastq.stats.txt' is the tab-separated table written by seqkit stats
    qc = pd.read_table('fastq.stats.txt').fillna(0)
    passing = list(qc[qc['num_seqs'] > config['minReads']]['file'])
    return passing
Trying to implement that as an input function in Snakemake has been an esoteric nightmare, to be honest. Probably my own lack of nuanced understanding of the Wildcards object.
I think the use of a checkpoint is also necessary in the process to force Snakemake to recompute the DAG after filtering samples out. Haven't been able to connect all the dots yet however, and I'm trying to avoid janky solutions that use token files etc.

Processing Files - Keeping Track

Currently we have an application that picks files out of a folder and processes them. It's simple enough but there are two pretty major issues with it. The processing is simply converting images to a base64 string and putting that into a database.
Problem
The problem is that after a file has been processed it won't need processing again, and for performance reasons we don't want it to be reprocessed.
Moving the files after processing is also not an option as these image files need to always be available in the same directory for other parts of the system to use.
This program must be written in VB.NET as it is an extension of a product already using this.
Ideal Solution
What we are looking for really is a way of keeping track of which files have been processed so we can develop a kind of ignore list when running the application.
For every image file Image0001.ext, once it has been processed, create a second file Image0001.ext.done. When looking for files to process, use a filter on the extension type of your images, and as each filename is found check for the existence of a .done file.
This approach will get incrementally slower as the number of files increases, but unless you move (or delete) files this is inevitable. On NTFS you should be OK until you get well into the tens of thousands of files.
EDIT: My approach would be to apply KISS:
Everything is in one folder, so there cannot be a huge number of images: I don't need to handle hundreds of files per hour, every hour of every day (the first run might be different).
Writing a console application to convert one file (passed on the command line) is easy. Left as an exercise.
There is no indication of any urgency to the conversion: it can be scheduled to run every 15 min (say). Also left as an exercise.
Use PowerShell to run the program for all images not already processed:
cd $TheImageFolder;
# .png assumed as image type. Can have multiple filters here for more image types.
Get-ChildItem -Filter *.png |
Where-Object { -not (Test-Path -Path ($_.FullName + '.done')) } |
Foreach-Object { ProcessFile $_.FullName; New-Item ($_.FullName + '.done') -ItemType file }
# ProcessFile stands for the console converter from step 2 above.
In a table, store the file name, file size, and file hash (if you need to be more sure about the file) for each file processed. Now, when you're taking a new file to process, you can compare it with your table entries (a simple query would do). Using hashes might degrade your performance, but you can be a bit more certain about an already processed file.
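That answer maps naturally onto a small tracking table. Here is a rough Python/SQLite sketch of the idea (the real application is VB.NET, and the table and column names here are made up for illustration):
# Sketch of a "processed files" tracking table using SQLite.
import hashlib
import os
import sqlite3

conn = sqlite3.connect("processed_files.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS processed (name TEXT PRIMARY KEY, size INTEGER, sha256 TEXT)"
)

def file_hash(path):
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def already_processed(path):
    row = conn.execute(
        "SELECT size, sha256 FROM processed WHERE name = ?", (os.path.basename(path),)
    ).fetchone()
    if row is None:
        return False
    # Size check is cheap; only fall back to the hash when sizes match.
    return row[0] == os.path.getsize(path) and row[1] == file_hash(path)

def mark_processed(path):
    conn.execute(
        "INSERT OR REPLACE INTO processed VALUES (?, ?, ?)",
        (os.path.basename(path), os.path.getsize(path), file_hash(path)),
    )
    conn.commit()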

how to look for the content of text file in pentaho?

I have an ETL which gives text file output, and I have to check whether that text content contains the word error or bad, using Pentaho.
Is there any simple way to find it?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or other matching step (RegEx, e.g.) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which it likely is if you used a text file output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in the line. Then you can use a Filter Rows step to choose just those rows that contain the words, then Select Values to choose just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed to be a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work. This solution has a smaller number of steps but is not very configurable or reusable.
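For comparison, the "do it all in one scripting step" route boils down to logic like the following rough Python sketch (outside PDI; the file pattern and search words are assumptions):
# List the files whose contents contain any of the search words.
import glob

SEARCH_WORDS = ("error", "bad")

def files_containing(pattern="output/*.txt", words=SEARCH_WORDS):
    hits = []
    for name in sorted(glob.glob(pattern)):
        with open(name, errors="ignore") as fh:
            if any(word in line.lower() for line in fh for word in words):
                hits.append(name)
    return hits

print(files_containing())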
If you're writing these files out yourself, then don't you already know the content? So scan the fields at the point at which you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, then your best bet is probably good old grep.

Sort sets by number of elements in Redis

I have a Redis database with a number of sets, all identified by a common key pattern, let's say "myset:".
Is there a way, from the command-line client, to sort all my sets by the number of elements they contain and return that information? The SORT command only takes single keys, as far as I understand.
I know I can do it quite easily with a programming language, but I prefer to be able to do it without having to install any driver, programming environment and so on on the server.
Thanks for your help.
No, there is no easy trick to do this.
Redis is a store, not really a database management system. It supports no query language. If you need some data to be retrieved, then you have to anticipate the access paths and design the data structure accordingly.
For instance in your example, you could maintain a zset while adding/removing items from the sets you are interested in. In this zset, the value will be the key of the set, and the score the cardinality of the set.
Retrieving the content of the zset by rank will give you the sets sorted by cardinality.
If you did not plan for this access path and still need the data, you will have no other choice than using a programming language. If you cannot install any Redis driver, then you could work from a Redis dump file (to be generated by the BGSAVE command), download this file to another box, and use the following package from Sripathi Krishnan to parse it and calculate the statistics you require.
https://github.com/sripathikrishnan/redis-rdb-tools
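For completeness, if installing a client library is an option after all, the programming-language route is short. Here is a rough sketch using the redis-py package (host, port, and key pattern are assumptions):
# Sort sets matching a pattern by cardinality, largest first.
# SCAN is used instead of KEYS so a large keyspace is not blocked.
import redis

r = redis.Redis(host="localhost", port=6379)

sizes = [(key, r.scard(key)) for key in r.scan_iter(match="myset:*")]

for key, cardinality in sorted(sizes, key=lambda kv: kv[1], reverse=True):
    print(cardinality, key.decode())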
Caveat: The approach in this answer is not intended as a general solution -- remember that use of the keys command is discouraged in a production setting.
That said, here's a solution which will output the set name followed by its length (cardinality), sorted by cardinality.
# Capture the names of the keys (sets)
KEYS=$(redis-cli keys 'myset:*')
# Paste each line from the key names with the output of `redis-cli scard key`
# and sort on the second key - the size - in reverse
paste <(echo "$KEYS") <(echo "$KEYS" | sed 's/^/scard /' | redis-cli) | sort -k2 -r -n
Note the use of the paste command above. I count on redis-cli to send me the results in order, which I'm pretty sure it will do. So paste will take one name from the $KEYS and one value from the redis output and output them on a single line.