Generating a parameterized number of output files for a Snakemake rule - snakemake

My workflow needs to be executed on two different clusters. The first cluster schedules jobs to nodes based on resource availability. The second cluster reserves entire nodes for a given job and asks its users to use those multiple cores efficiently within their job script. For the second cluster, it is accepted practice to submit a smaller number of jobs and stack processes in the background.
For a toy example, say I have four files I would like to create:
SAMPLES = [1, 2, 3, 4]

rule all:
    input:
        expand("sample.{sample}", sample=SAMPLES)

rule normal_create_files:
    output:
        # a wildcard output (rather than expand) gives one job per sample
        "sample.{sample}"
    shell:
        "touch {output}"
This can be run in parallel with one job per sample.
In addition to four jobs creating a single file each, I would like to be able to have two jobs creating two files each.
I've tried a few ideas, but have not gotten very far. The following workflow does the same as above, except it creates batches and launches the jobs as background processes within each batch:
rule all:
    input:
        expand("sample.{sample}", sample=SAMPLES)

rule stacked_create_files:
    output:
        "sample.{sample}"
    run:
        import subprocess as sp

        def chunks(l, n):
            # yield successive slices of l with at most n items each
            for i in range(0, len(l), n):
                yield l[i:i + n]

        pids = []
        for chunk in chunks(list(output), 2):
            # start every process of the batch in the background
            for sample in chunk:
                pids.append(sp.Popen(["touch", sample]))
            # wait for the batch to finish before starting the next one
            exit_codes = [p.wait() for p in pids]
However, this still creates four jobs!
I also came across Karel Brinda's response on the mailing list on a related topic. He pointed to his own project where he does dynamic rule creation in python. I will try something along these lines next.
The ideal solution would be a single rule that generates a set of output files, but is able to generate those files in batches. The number of batches would be set by a configuration parameter.
Has anyone here encountered a similar situation? Any thoughts or ideas would be greatly appreciated!

I think the true solution to your problem will be the ability to group Snakemake jobs together. This feature is currently in the planning phase (in fact I have a research grant about this).
Indeed, currently the only solution is to somehow encode this into the rules themselves (e.g. via code generation).
In the future, you will be able to specify how the DAG of jobs shall be partitioned/grouped. Each of the resulting groups of jobs is submitted to the cluster as one batch.
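Until job grouping exists, the batching itself can at least be computed in plain Python and then fed into generated rules. The following is a minimal sketch of that first step only; the helper names make_batches and batch_outputs are hypothetical, and how the resulting lists are wired into generated rules is left open because it depends on the code-generation approach chosen.

```python
def make_batches(samples, n_batches):
    """Split samples into at most n_batches contiguous groups of near-equal size."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    # never create more batches than there are samples (and at least one)
    n_batches = min(n_batches, len(samples)) or 1
    size, extra = divmod(len(samples), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # the first `extra` batches take one additional sample
        end = start + size + (1 if i < extra else 0)
        batches.append(samples[start:end])
        start = end
    return batches


def batch_outputs(batch, pattern="sample.{sample}"):
    """File names that one batched job would declare as its outputs."""
    return [pattern.format(sample=s) for s in batch]


SAMPLES = [1, 2, 3, 4]
# n_batches would come from a configuration parameter in a real workflow
for batch in make_batches(SAMPLES, n_batches=2):
    print(batch_outputs(batch))
```

With n_batches=4 this degenerates to one file per job, so the same workflow could serve both clusters by changing only the configuration value.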

Related

Python multiprocessing between ubuntu and centOS

I am trying to run some parallel jobs through Python multiprocessing. Here is an example code:
import multiprocessing as mp
import os

def f(name, total):
    print('process {:d} starting doing business in {:d}'.format(name, total))
    # there will be some unix command to run external program

if __name__ == '__main__':
    total_task_num = 100
    mp.Queue()
    all_processes = []
    for i in range(total_task_num):
        p = mp.Process(target=f, args=(i, total_task_num))
        all_processes.append(p)
        p.start()
    for p in all_processes:
        p.join()
I also set export OMP_NUM_THREADS=1 to make sure that each process uses only one thread.
My desktop has 20 cores. For 100 parallel jobs, I want it to run 5 cycles so that each core runs one job at a time (20*5=100).
I tried the same code on CentOS and Ubuntu. CentOS seems to split the jobs automatically: only 20 jobs run in parallel at any time. Ubuntu, however, starts all 100 jobs simultaneously, so each core is occupied by 5 jobs. This significantly increases the total run time because of the high load.
I wonder if there is an elegant way to make Ubuntu run only 1 job per core.
To make a process run on a specific CPU, you can use the taskset command on Linux. You can build a loop around "taskset -p [mask] [pid]" that assigns each process to a specific core.
Python also supports affinity control through sched_setaffinity, which confines a process to specific cores. A call to "os.sched_setaffinity(pid, mask)" takes the process ID pid and a mask giving the set of CPUs to which that process is confined.
There are also other Python tools, such as https://pypi.org/project/affinity/, that may be worth exploring.
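The affinity idea can be sketched as follows: instead of one process per task, start one worker per core, pin it, and let it run its share of the tasks in sequence. This is only a sketch under assumptions; assign_cores, pin_current_process and run_on_core are hypothetical helper names, and os.sched_setaffinity is only available on some platforms (Linux among them), which the code checks for.

```python
import os


def assign_cores(n_tasks, cores):
    """Round-robin plan: map each core id to the task ids it should run."""
    cores = sorted(cores)
    plan = {core: [] for core in cores}
    for task in range(n_tasks):
        plan[cores[task % len(cores)]].append(task)
    return plan


def pin_current_process(core):
    """Confine the calling process to one core; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return True


def run_on_core(core, task_ids, work):
    """Target for one worker process: pin itself, then run its tasks one by one."""
    pin_current_process(core)
    return [work(task) for task in task_ids]


# 100 tasks on 20 cores gives 20 workers with 5 tasks each
plan = assign_cores(100, range(20))
print(len(plan), len(plan[0]))
```

Each entry of the plan would then become one mp.Process(target=run_on_core, args=(core, task_ids, f)), so at most one job runs per core at any time.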

How to set up job dependencies in google bigquery?

I have a few jobs: one loads a text file from a Google Cloud Storage bucket into a BigQuery table, and another is a scheduled query that copies data from one table to another with some transformation. I want the second job to depend on the success of the first one. How can this be achieved in BigQuery, if it is possible at all?
Many thanks.
Best regards,
Right now a developer needs to put together the chain of operations.
It can be done either with Cloud Functions (supports Node.js, Go, Python) or with a Cloud Run container (supports the gcloud API and any programming language).
Basically you need to:
1. issue a job
2. get the job id
3. poll for the job id
4. once the job is finished, trigger the other steps
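The four steps above can be sketched as a small polling loop. This is not BigQuery client code: submit, get_state and next_step are hypothetical callables standing in for whatever API calls are used, and the state names "running", "succeeded" and "failed" are assumptions made for the sketch.

```python
import time


def wait_for_job(job_id, get_state, poll_seconds=5.0, max_polls=100, sleep=time.sleep):
    """Poll get_state(job_id) until it reports "succeeded" or "failed"."""
    for _ in range(max_polls):
        state = get_state(job_id)
        if state in ("succeeded", "failed"):
            return state
        sleep(poll_seconds)  # job still running: wait before asking again
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


def run_chain(submit, get_state, next_step, **poll_options):
    """Issue a job, wait for it, and trigger the next step only on success."""
    job_id = submit()
    state = wait_for_job(job_id, get_state, **poll_options)
    if state == "succeeded":
        return next_step(job_id)
    raise RuntimeError(f"job {job_id} failed; next step skipped")


# Demo with fake callables: the job reports "running" twice, then succeeds.
fake_states = iter(["running", "running", "succeeded"])
print(run_chain(
    submit=lambda: "load-job-1",
    get_state=lambda job_id: next(fake_states),
    next_step=lambda job_id: "copy query started after " + job_id,
    sleep=lambda seconds: None,  # skip real waiting in the demo
))
```

Passing sleep as a parameter keeps the loop testable and makes it easy to swap in a delayed re-invocation when a single function run cannot wait that long.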
If using Cloud Functions:
- place the file into a dedicated GCS bucket
- set up a GCF that monitors that bucket; when a new file is uploaded, it executes a function that imports the file into BigQuery and waits until the operation ends
- at the end of the GCF, you can trigger other functions for the next step
Another use case with Cloud Functions:
A: a trigger starts the GCF
B: the function executes the query (copy data to another table)
C: it gets a job id and fires another function with a bit of delay
I: a function gets a job id
J: it polls whether the job is ready
K: if not ready, it fires itself again with a bit of delay
L: if ready, it triggers the next step, which could be a dedicated function or a parameterized function
It is possible to address your scenario with either Cloud Functions (CF) or a scheduler (Airflow). The first approach is event-driven, so your data gets crunched immediately. With the scheduler, expect a delay in data availability.
As has been stated, once you submit a BigQuery job you get back a job ID, which needs to be checked until the job completes. Then, based on the status, you can handle the success or failure post-actions respectively.
If you were to develop a CF, note that there are certain limitations such as execution time (max 9 min), which you would have to address in case the BigQuery job takes more than 9 min to complete. Another challenge with CF is idempotency: if the same data file event arrives more than once, the processing should not result in duplicate data.
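The idempotency concern can be illustrated with a small sketch: derive a stable key from the upload event and skip events whose key has already been handled. The field names bucket, name and generation are assumed to be present in the storage event, handle_once and event_key are hypothetical helpers, and the in-memory set only stands in for a durable store, since function instances do not keep state between invocations.

```python
import hashlib


def event_key(bucket, name, generation):
    """Stable key for one uploaded object version; the same event gives the same key."""
    raw = f"{bucket}/{name}#{generation}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def handle_once(event, processed, load):
    """Run load(event) unless this exact event was already processed."""
    key = event_key(event["bucket"], event["name"], event["generation"])
    if key in processed:
        return False  # duplicate delivery: do nothing
    load(event)
    processed.add(key)  # stand-in for a durable, atomic "mark as done"
    return True


# Demo: the same event delivered twice is loaded only once.
loaded = []
seen = set()
event = {"bucket": "my-bucket", "name": "data/file.json", "generation": "1"}
print(handle_once(event, seen, loaded.append))
print(handle_once(event, seen, loaded.append))
print(len(loaded))
```

In a real deployment the check and the mark would have to be a single atomic operation in the external store, otherwise two concurrent deliveries could both pass the check.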
Alternatively, you can consider using an event-driven, serverless open-source project such as BqTail (Google Cloud Storage BigQuery Loader with post-load transformation).
Here is an example of a BqTail rule.
rule.yaml
When:
  Prefix: "/mypath/mysubpath"
  Suffix: ".json"
Async: true
Batch:
  Window:
    DurationInSec: 85
Dest:
  Table: bqtail.transactions
  Transient:
    Dataset: temp
    Alias: t
  Transform:
    charge: (CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)
  SideInputs:
    - Table: bqtail.fees
      Alias: f
      'On': t.fee_id = f.id
OnSuccess:
  - Action: query
    Request:
      SQL: SELECT
        DATE(timestamp) AS date,
        sku_id,
        supply_entity_id,
        MAX($EventID) AS batch_id,
        SUM( payment) payment,
        SUM((CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)) charge,
        SUM(COALESCE(qty, 1.0)) AS qty
        FROM $TempTable t
        LEFT JOIN bqtail.fees f ON f.id = t.fee_id
        GROUP BY 1, 2, 3
      Dest: bqtail.supply_performance
      Append: true
    OnFailure:
      - Action: notify
        Request:
          Channels:
            - "#e2e"
          Title: Failed to aggregate data to supply_performance
          Message: "$Error"
    OnSuccess:
      - Action: query
        Request:
          SQL: SELECT CURRENT_TIMESTAMP() AS timestamp, $EventID AS job_id
          Dest: bqtail.supply_performance_batches
          Append: true
  - Action: delete
You want to use an orchestration tool, especially if you want to set up these tasks as recurring jobs.
We use Google Cloud Composer, a managed service based on Airflow, for workflow orchestration, and it works great. It comes with automatic retries, monitoring, alerting, and much more.
You might want to give it a try.
Basically you can use Cloud Logging to observe almost all kinds of operations in GCP.
BigQuery is no exception. When a query job completes, you can find the corresponding log in the log viewer.
The next question is how to pin down the exact query you want. One way to achieve this is to use a labeled query (that is, to attach labels to your query) [1].
For example, you can use the bq command below to issue a query with the label foo:bar:
bq query \
  --nouse_legacy_sql \
  --label foo:bar \
  'SELECT COUNT(*) FROM `bigquery-public-data`.samples.shakespeare'
Then, when you go to the Logs Viewer and apply the log filter below, you will find the exact log entry generated by the query above.
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels.foo="bar"
The next question is how to emit an event based on this log for the next workload. This is where Cloud Pub/Sub comes into play.
There are 2 ways to publish an event based on a log pattern:
Log Routers: set a Pub/Sub topic as the destination [2]
Log-based Metrics: create an alert policy whose notification channel is Pub/Sub [3]
The next workload can then subscribe to the Pub/Sub topic and be triggered when the previous query has completed.
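On the subscriber side, the workload has to decide whether a received log entry belongs to the labeled query. Below is a minimal sketch of that check. It assumes the routed log entry arrives as base64-encoded JSON in the message data and that the label sits under the same field path as in the log filter above; decode_pubsub_data, job_labels and should_trigger are hypothetical helper names.

```python
import base64
import json

# Same field path as in the log filter shown above
LABEL_PATH = ("protoPayload", "serviceData", "jobCompletedEvent",
              "job", "jobConfiguration", "labels")


def decode_pubsub_data(data_b64):
    """Decode base64-encoded JSON message data into a dict."""
    return json.loads(base64.b64decode(data_b64).decode("utf-8"))


def job_labels(log_entry):
    """Walk LABEL_PATH through the log entry; return {} if any part is missing."""
    node = log_entry
    for key in LABEL_PATH:
        node = node.get(key, {}) if isinstance(node, dict) else {}
    return node if isinstance(node, dict) else {}


def should_trigger(log_entry, key="foo", value="bar"):
    """True if the completed job carries the expected label."""
    return job_labels(log_entry).get(key) == value


# Demo with a hand-built entry that mimics the assumed structure.
entry = {"protoPayload": {"serviceData": {"jobCompletedEvent": {
    "job": {"jobConfiguration": {"labels": {"foo": "bar"}}}}}}}
message_data = base64.b64encode(json.dumps(entry).encode("utf-8"))
print(should_trigger(decode_pubsub_data(message_data)))
```

Filtering again in the subscriber is redundant if the sink filter is already exact, but it guards against a broader filter being configured later.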
Hope this helps ~
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfiguration
[2] https://cloud.google.com/logging/docs/routing/overview
[3] https://cloud.google.com/logging/docs/logs-based-metrics

What is the best way to communicate among multiple processes in ubuntu

I have three different machine learning models in Python. To improve performance, I run them in parallel in different terminals. They communicate and share data with one another through files: each model creates batches of files to make available to the others. All the processes run in parallel but depend on data prepared by another process. Once process A has prepared a batch of data, it creates a file to signal to the other processes that the data is ready; process B then starts processing it while simultaneously watching for the next batch. How can this large amount of data be passed to the next process without creating files? Is there a better way to communicate among these processes in Python without creating and deleting temporary files?
Thanks
You could consider running up a small Redis instance... a very fast, in-memory data structure server.
It allows you to share strings, lists, queues, hashes, atomic integers, sets, ordered sets between processes very simply.
As it is networked, you can share all these data structures not only within a single machine, but across multiple machines.
As it has bindings for C/C++, Python, bash, Ruby, Perl and so on, it also means you can use the shell, for example, to quickly inject commands/data into your app to change its behaviour, or get debugging insight by looking at how variables are set.
Here's an example of how to do multiprocessing in Python3. Instead of storing results in a file the results are stored in a dictionary (see output)
from multiprocessing import Pool, cpu_count

def multi_processor(function):
    file_list = []
    # Test: put 6 strings in the list so the function runs six times,
    # in parallel if your CPU has enough cores
    file_list.append("test1")
    file_list.append("test2")
    file_list.append("test3")
    file_list.append("test4")
    file_list.append("test5")
    file_list.append("test6")
    # Use max number of system processors - 1 (but at least one)
    pool = Pool(processes=max(1, cpu_count() - 1))
    results = {}
    # for every item in the file_list, submit one task to the pool
    for aud_file in file_list:
        results[aud_file] = pool.apply_async(function, args=("arg1", "arg2"))
    # Wait for all tasks to finish before proceeding
    pool.close()
    pool.join()
    # Results and any errors are returned, keyed by item
    return {name: result.get() for name, result in results.items()}

def your_function(arg1, arg2):
    try:
        print("put your stuff in this function")
        your_results = ""
        return your_results
    except Exception as e:
        return str(e)

if __name__ == "__main__":
    some_results = multi_processor(your_function)
    print(some_results)
The output is
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
{'test1': '', 'test2': '', 'test3': '', 'test4': '', 'test5': '', 'test6': ''}
Try using a SQLite database to share the data.
I made this for this exact purpose:
https://pypi.org/project/keyvalue-sqlite/
You can use it like this:
from keyvalue_sqlite import KeyValueSqlite
DB_PATH = '/path/to/db.sqlite'
db = KeyValueSqlite(DB_PATH, 'table-name')
# Now use standard dictionary operators
db.set_default('0', '1')
actual_value = db.get('0')
assert '1' == actual_value
db.set_default('0', '2')
assert '1' == db.get('0')

How to parallelize a REST API crawler in http4s & fs2?

I wrote a sequential REST API crawler in http4s & fs2 here:
https://gist.github.com/NicolasRouquette/656ed7a2d6984ce0995fd78a3aec2566
This is to query a REST API service to get a starting set of IDs, fetch elements for a batch of IDs and continue based on the cross-reference IDs found in these elements until there are no new IDs to fetch and return a map of all elements fetched.
This works; however, the performance is inadequate -- too slow!
Since I don't have access to the server, I tried experimenting with varying batch sizes, from 10, 50, 100, 200, 500 and even batching all IDs in a single query. Query time increases significantly with batch size.
At large sizes (500 and all), I even got HTTP 500 responses from the server.
I would like to experiment with batching parallel queries in a load-balancing fashion using a pool of threads; however, it is unclear to me how to do this based on the fs2 docs.
Can someone provide suggestions how to achieve this?
Regarding using http4s & fs2: Well, I found this library fairly easy to use for simple client-side programming. Given the emphasis on supporting tasks, streams, etc..., I presume that batching parallel queries should be doable somehow.
fs2.concurrent.join will allow you to run multiple streams concurrently. The specific section in the guide is available at https://github.com/functional-streams-for-scala/fs2/blob/v0.9.7/docs/guide.md#concurrency
For your use case you could take your queue of ids, chunk them, create a http task and then wrap it in a stream. You would then run this stream of streams concurrently with join and combine the results.
def createHttpRequest(ids: Seq[ID]): Task[(ElementMap, Set[ID])] = ???

def fetch(queue: Set[ID]): Task[(ElementMap, Set[ID])] = {
  val resultStreams = Stream.emits(queue.toSeq)
    .vectorChunkN(batchSize)
    .map(createHttpRequest)
    .map(Stream.eval)
  val resultStream = fs2.concurrent.join(maxOpen)(resultStreams)
  resultStream.runFold((Map.empty[ID, Element], Set.empty[ID])) {
    case ((a, b), (_a, _b)) => (a ++ _a, b ++ _b)
  }
}

julia on PBS cluster: what to give to addprocs()?

I'm trying to set up a cluster across machines on a PBS-managed cluster. I'm perfectly able to compute within one node by running julia -p 12 (after having reserved one node with 12 CPUs).
I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE), but on this one something is going wrong.
You can see everything I'm doing, including submit scripts etc., on this branch of a GitHub repo.
To get a list of machines, I parse the PBS_NODEFILE, which for a submit script with the option
#PBS -l nodes=2:ppn=12 # give me 2 nodes with 12 processors each
looks something like this:
red0004
red0004
...
red0004
red0347
...
red0347
I parse this file with bind_pe_procs() in sge.jl in the repo and give a vector of machine names to addprocs. When I submit this, I get an SSH error, which I put up in a gist. I don't know what it means.
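For illustration, the parsing step can be sketched in Python (the repo itself does this in Julia, and this is not the code of bind_pe_procs()). The sketch only counts how many slots the node file lists per host; slots_per_host is a hypothetical helper name, and the sample text is shortened from the listing above.

```python
from collections import Counter


def slots_per_host(nodefile_text):
    """Count how often each host name appears, one line per reserved slot."""
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)


# Shortened stand-in for the contents of the file named by PBS_NODEFILE
sample = "red0004\nred0004\nred0004\nred0347\nred0347\n"
for host, slots in slots_per_host(sample).items():
    print(host, slots)
```

Whether the worker list should repeat a host once per slot or name it only once is exactly the open question, so the sketch stops at the counts.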
Does this have to do with a system setting, i.e. do I have to talk to the sysadmin about SSH between machines? What are the right questions to ask?
I am also unsure about what exactly I have to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself), so I exclude ENV["HOST"] = node001 from my list. But what about all the processors with the same name node002? Do I list all of those
machines = [ "red0347" for i=1:12]
or just once
machines = ["red0347"]
in addprocs(machines)
thanks!