Properly accessing cluster_config '__default__' values - snakemake

I have a cluster.json file that looks like this:
{
    "__default__": {
        "queue": "normal",
        "memory": "12288",
        "nCPU": "1",
        "name": "{rule}_{wildcards.sample}",
        "o": "logs/cluster/{wildcards.sample}/{rule}.o",
        "e": "logs/cluster/{wildcards.sample}/{rule}.e",
        "jvm": "10240m"
    },
    "aln_pe": {
        "memory": "61440",
        "nCPU": "16"
    },
    "GenotypeGVCFs": {
        "jvm": "102400m",
        "memory": "122880"
    }
}
In my Snakefile I have a few rules that try to access the cluster_config object in their params:
params:
    memory=cluster_config['__default__']['jvm']
But this gives me a KeyError:
KeyError in line 27 of home/bwubb/projects/Germline/S0330901/haplotype.snake:
'__default__'
Does this have something to do with '__default__' being a special object? It pprints as a visually appealing plain dictionary, whereas the other entries are labeled OrderedDict, but when I look at the JSON they all look the same.
If nothing is wrong with my JSON, should I refrain from accessing '__default__'?

The default values are accessed via the keyword "cluster", not "__default__".
See this example from the tutorial:
{
    "__default__": {
        "account": "my account",
        "time": "00:15:00",
        "n": 1,
        "partition": "core"
    },
    "compute1": {
        "time": "00:20:00"
    }
}
The JSON listed above (from the linked tutorial) is the one being accessed in this example; it's unfortunate they are not on the same page.
To access time, J.K. uses the following call.
#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)

# do something useful with the threads
threads = job_properties["threads"]

# access a property defined in the cluster configuration file (Snakemake >=3.6.0)
time = job_properties["cluster"]["time"]

os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))

Related

Pipeline generation - passing in simple datastructures like lists/arrays

For a code repository project in Palantir Foundry, I am struggling with re-using some of my transformation logic.
It seems almost trivial, but: is there a way to send an Input to a Transform that is not a dataset/dataframe reference?
In my case I want to pass in strings or lists/arrays.
This is my code:
from pyspark.sql import functions as F
from transforms.api import Transform, Input, Output


def my_computation(result, customFilter, scope, my_categories, my_mappings):
    scope_df = scope.dataframe()
    my_categories_df = my_categories.dataframe()
    my_mappings_df = my_mappings.dataframe()

    filtered_cat_df = (
        my_categories_df
        .filter(F.col('CAT_NAME').isin(customFilter))
    )
    # ... more logic


def generateTransforms(config):
    transforms = []
    for key, value in config.items():
        o = {}
        for outKey, outValue in value['outputs'].items():
            o[outKey] = Output(outValue)

        i = {}
        for inpKey, inpValue in value['inputs'].items():
            i[inpKey] = Input(inpValue)

        i['customFilter'] = Input(value['my_custom_filter'])

        transforms.append(Transform(my_computation, inputs=i, outputs=o))
    return transforms


config = {
    "transform_one": {
        "my_custom_filter": {
            "foo",
            "bar"
        },
        "inputs": {
            "scope": "/my-project/input/scope",
            "my_categories": "/my-project/input/my_categories",
            "my_mappings": "/my-project/input/my_mappings"
        },
        "outputs": {
            "result": "/my-project/output/result"
        }
    }
}

TRANSFORMS = generateTransforms(config)
The concrete question is: how can I send in the values from my_custom_filter into customFilter in the transformation function my_computation?
If I execute it like above, I get the error "TypeError: unhashable type: 'set'"
This looks like a Python issue; any chance you can point out which line is causing the error?
Reading through your code, I would guess it's this line:
i['customFilter'] = Input(value['my_custom_filter'])
Your Python logic is wrong; if we unpack your code, you're effectively making this call:
i['customFilter'] = Input({"foo", "bar"})
Edit, to answer the comment about how to create a Python transform that locks a variable in a closure:
from pyspark.sql import functions as F
from transforms.api import transform


def create_transform(inputs, outputs, my_other_var):
    # my_other_var is captured by the inner function's closure
    @transform(**inputs, **outputs)
    def compute(input_foo, input_bar, output_foobar, ctx):
        df = input_foo.dataframe()
        df = df.withColumn("mycol", F.lit(my_other_var))
        output_foobar.write_dataframe(df)
    return compute
And now you can call it like this:
transforms.append(create_transform(inputs, outputs, "foobar"))
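Tying this back to the question's generateTransforms loop: instead of wrapping the filter values in Input(), you can pass them as the closed-over variable. A rough sketch under the same assumptions; the helper name create_filter_transform and the body of compute are illustrative, and config is the dict from the question:
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output


def create_filter_transform(inputs, outputs, custom_filter):
    # custom_filter is a plain Python collection captured by the closure,
    # so it never has to be an Input()/dataset.
    @transform(**inputs, **outputs)
    def compute(result, scope, my_categories, my_mappings):
        filtered_cat_df = (
            my_categories.dataframe()
            .filter(F.col('CAT_NAME').isin(list(custom_filter)))
        )
        # ... more logic, then write the result
        result.write_dataframe(filtered_cat_df)
    return compute


TRANSFORMS = []
for key, value in config.items():  # config as defined in the question
    inputs = {k: Input(v) for k, v in value['inputs'].items()}
    outputs = {k: Output(v) for k, v in value['outputs'].items()}
    TRANSFORMS.append(
        create_filter_transform(inputs, outputs, value['my_custom_filter'])
    )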

Terraform 0.12: Output list of buckets, use as input for another module and iterate

I'm using Tf 0.12. I have an s3 module that outputs a list of buckets, that I would like to use as an input for a cloudfront module that I've got.
The problem I'm facing is that when I run terraform plan/apply I get the following error: count.index is 0 | var.redirect-buckets is tuple with 1 element
I've tried all kinds of splats and moving the count.index call around, to no avail. My sample code is below.
module.s3
resource "aws_s3_bucket" "redirect" {
count = length(var.redirects)
bucket = element(var.redirects, count.index)
}
module.s3.output
output "redirect-buckets" {
value = [aws_s3_bucket.redirect.*]
}
module.cdn.variables
...
variable "redirect-buckets" {
description = "Redirect buckets"
default = []
}
....
The error is thrown down here
module.cdn
resource "aws_cloudfront_distribution" "redirect" {
count = length(var.redirect-buckets)
default_cache_behavior {
// Line below throws the error, one amongst many
target_origin_id = "cloudfront-distribution-origin-${var.redirect-buckets[count.index]}.s3.amazonaws.com"
....
//Another error throwing line
target_origin_id = "cloudfront-distribution-origin-${var.redirect-buckets[count.index]}.s3.amazonaws.com"
Any help is greatly appreciated.
module.s3
resource "aws_s3_bucket" "redirects" {
for_each = var.redirects
bucket = each.value
}
Your variable definition for redirects needs to change to something like this:
variable "redirects" {
type = map(string)
}
module.s3.output:
output "redirect_buckets" {
value = aws_s3_bucket.redirects
}
module.cdn
resource "aws_cloudfront_distribution" "redirects" {
for_each = var.redirect_buckets
default_cache_behavior {
target_origin_id = "cloudfront-distribution-origin-${each.value.id}.s3.amazonaws.com"
}
Your variable definition for redirect-buckets needs to change to something like this (note the underscores; using kebab-case is going to behave strangely in some cases, not worth it):
variable "redirect_buckets" {
type = map(object(
{
id = string
}
))
}
root module
module "s3" {
source = "../s3" // or whatever the path is
redirects = {
site1 = "some-bucket-name"
site2 = "some-other-bucket"
}
}
module "cdn" {
source = "../cdn" // or whatever the path is
redirects_buckets = module.s3.redirect_buckets
}
From an example perspective, this is interesting, but you don't need to use outputs from S3 here since you could just hand the cdn module the same map of redirects and use for_each on those.
There is a tool called Terragrunt which wraps Terraform and supports dependencies.
https://terragrunt.gruntwork.io/docs/features/execute-terraform-commands-on-multiple-modules-at-once/#dependencies-between-modules

How to create single PredictRequest() with multiple model_spec?

I'm trying to add multiple model_specs and their respective inputs into a single predict_pb2.PredictRequest(), as follows:
from tensorflow import make_tensor_proto
from tensorflow_serving.apis import predict_pb2

tmp = predict_pb2.PredictRequest()
tmp.model_spec.name = '1'
tmp.inputs['tokens'].CopyFrom(make_tensor_proto([1, 2, 3]))
tmp.model_spec.name = '2'
tmp.inputs['tokens'].CopyFrom(make_tensor_proto([4, 5, 6]))
But I'm only getting 2's information:
>> tmp
model_spec {
  name: "2"
}
inputs {
  key: "tokens"
  value {
    dtype: DT_INT32
    tensor_shape {
      dim {
        size: 3
      }
    }
    tensor_content: "\004\000\000\000\005\000\000\000\006\000\000\000"
  }
}
How can I get a single PredictRequest() for multiple models with their respective inputs?
My aim is to create a single request and send it to the tensorflow serving which is serving two models. Is there any other way around this? Creating two separate requests for both models and getting results from tf_serving one after another works, but I'm wondering if I can just combine two requests into one.
I'm afraid it's not possible. In tensorflow_serving/apis/predict.proto, each PredictRequest has only one ModelSpec. You may try to add some code of your own to do this.
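Since each PredictRequest carries exactly one ModelSpec, the usual workaround is the one you already described: build one request per model and send them over the same gRPC channel, sequentially or concurrently. A minimal sketch, assuming a server at localhost:8500 serving models named '1' and '2' that both take a 'tokens' input (all names illustrative):
import grpc
from tensorflow import make_tensor_proto
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

def build_request(model_name, tokens):
    # One PredictRequest per model; its single model_spec selects the model.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.inputs['tokens'].CopyFrom(make_tensor_proto(tokens))
    return request

responses = [
    stub.Predict(build_request('1', [1, 2, 3]), timeout=10.0),
    stub.Predict(build_request('2', [4, 5, 6]), timeout=10.0),
]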
Did you try using a configuration file?
The contents of the config file can be as shown below:
model_config_list {
  config {
    name: 'my_first_model'
    base_path: '/tmp/my_first_model/'
  }
  config {
    name: 'my_second_model'
    base_path: '/tmp/my_second_model/'
  }
}
For more information, you can refer to the link below:
https://www.tensorflow.org/tfx/serving/serving_config

Test file structure in Groovy (Spock)

How can I test a created file tree against an expected file tree in Groovy (Spock)?
Right now I'm using a Set where I specify the paths I expect to get, and I collect the actual paths this way:
Set<String> getCreatedFilePaths(String root) {
    Set<String> createFilePaths = new HashSet<>()
    new File(root).eachFileRecurse {
        createFilePaths << it.absolutePath
    }
    return createFilePaths
}
But the readability of the test isn't so good.
Is it possible in Groovy to write the expected paths as a tree and then compare them with the actual ones?
For example, expected:
region:
    usa:
        new_york.json
        california.json
    europe:
        spain.json
        italy.json
And the actual paths would be converted to this kind of tree.
Not sure if you can do it with the built-in recursive methods. There certainly are powerful ones, but this is standard recursion code you can use:
def path = new File("/Users/me/Downloads")

def printTree(File file, Integer level) {
    println " " * level + "${file.name}:"
    file.eachFile {
        println " " * (level + 1) + it.name
    }
    file.eachDir {
        printTree(it, level + 1)
    }
}

printTree(path, 1)
That prints the format you describe.
You can either build your own parser or use Groovy's built-in JSON parser:
package de.scrum_master.stackoverflow

import groovy.json.JsonParserType
import groovy.json.JsonSlurper
import spock.lang.Specification

class FileRecursionTest extends Specification {
  def jsonDirectoryTree = """{
    com : {
      na : {
        tests : [
          MyBaseIT.groovy
        ]
      },
      twg : {
        sample : {
          model : [
            PrimeNumberCalculatorSpec.groovy
          ]
        }
      }
    },
    de : {
      scrum_master : {
        stackoverflow : [
          AllowedPasswordsTest.groovy,
          CarTest.groovy,
          FileRecursionTest.groovy,
          {
            foo : [
              LoginIT.groovy,
              LoginModule.groovy,
              LoginPage.groovy,
              LoginValidationPage.groovy,
              User.groovy
            ]
          },
          LuceneTest.groovy
        ],
        testing : [
          GebTestHelper.groovy,
          RestartBrowserIT.groovy,
          SampleGebIT.groovy
        ]
      }
    }
  }"""

  def "Parse directory tree JSON representation"() {
    given:
    def jsonSlurper = new JsonSlurper(type: JsonParserType.LAX)
    def rootDirectory = jsonSlurper.parseText(jsonDirectoryTree)

    expect:
    rootDirectory.de.scrum_master.stackoverflow.contains("CarTest.groovy")
    rootDirectory.com.twg.sample.model.contains("PrimeNumberCalculatorSpec.groovy")

    when:
    def fileList = objectGraphToFileList("src/test/groovy", rootDirectory)
    fileList.each { println it }

    then:
    fileList.size() == 14
    fileList.contains("src/test/groovy/de/scrum_master/stackoverflow/CarTest.groovy")
    fileList.contains("src/test/groovy/com/twg/sample/model/PrimeNumberCalculatorSpec.groovy")
  }

  List<File> objectGraphToFileList(String directoryPath, Object directoryContent) {
    List<File> files = []
    directoryContent.each {
      switch (it) {
        case String:
          files << directoryPath + "/" + it
          break
        case Map:
          files += objectGraphToFileList(directoryPath, it)
          break
        case Map.Entry:
          files += objectGraphToFileList(directoryPath + "/" + (it as Map.Entry).key, (it as Map.Entry).value)
          break
        default:
          throw new IllegalArgumentException("unexpected directory content value $it")
      }
    }
    files
  }
}
Please note:
I used new JsonSlurper(type: JsonParserType.LAX) in order to avoid having to quote each single String in the JSON structure. If your file names contain spaces or other special characters, you will have to use something like "my file name", though.
In rootDirectory.de.scrum_master.stackoverflow.contains("CarTest.groovy") you can see how you can nicely interact with the parsed JSON object graph in .property syntax. You might like it or not, need it or not.
Recursive method objectGraphToFileList converts the parsed object graph to a list of files (if you prefer a set, change it, but File.eachFileRecurse(..) should not yield any duplicates, so the set is not needed).
If you do not like the parentheses etc. in the JSON, you can still build your own parser.
You might want to add another utility method to create a JSON string like the given one from a validated directory structure, so you have less work when writing similar tests.
I modified Bavo Bruylandt's answer to collect the file tree paths, and sorted them so the order of files does not matter.
def "check directory structure"() {
expect:
String created = getCreatedFilePaths(new File("/tmp/region"))
String expected = new File("expected.txt").text
created == expected
}
private String getCreatedFilePaths(File root) {
List paths = new ArrayList()
printTree(root, 0, paths)
return paths.join("\n")
}
private void printTree(File file, Integer level, List paths) {
paths << ("\t" * level + "${file.name}:")
file.listFiles().sort{it.name}.each {
if (it.isFile()) {
paths << ("\t" * (level + 1) + it.name)
}
if (it.isDirectory()) {
collectFileTree(it, level + 1, paths)
}
}
}
And the expected files are put in the expected.txt file with tab ("\t") indentation like this:
region:
	usa:
		new_york.json
		california.json
	europe:
		spain.json
		italy.json

Using spark-kernel comm API

I started using spark-kernel recently.
Following the tutorial and sample code, I was able to set up a client and use it for executing code snippets on spark-kernel and retrieving the results, as given in this example code.
Now I need to use the comm API provided with spark-kernel. I tried this tutorial, but I am not able to make it work; in fact, I have no understanding of how to make it work.
I tried the following code, but when I run it, I get the error "Received invalid target for Comm Open: my_target" on the kernel.
package examples

import scala.runtime.ScalaRunTime._
import scala.collection.mutable.ListBuffer
import com.ibm.spark.kernel.protocol.v5.MIMEType
import com.ibm.spark.kernel.protocol.v5.client.boot.ClientBootstrap
import com.ibm.spark.kernel.protocol.v5.client.boot.layers.{StandardHandlerInitialization, StandardSystemInitialization}
import com.ibm.spark.kernel.protocol.v5.content._
import com.typesafe.config.{Config, ConfigFactory}
import Array._

object commclient extends App {
  val profileJSON: String = """
    {
      "stdin_port" : 48691,
      "control_port" : 44808,
      "hb_port" : 49691,
      "shell_port" : 40544,
      "iopub_port" : 43462,
      "ip" : "127.0.0.1",
      "transport" : "tcp",
      "signature_scheme" : "hmac-sha256",
      "key" : ""
    }
  """.stripMargin

  val config: Config = ConfigFactory.parseString(profileJSON)

  val client = (new ClientBootstrap(config)
    with StandardSystemInitialization
    with StandardHandlerInitialization).createClient()

  def printResult(result: ExecuteResult) = {
    println(s"${result.data.get(MIMEType.PlainText).get}")
  }

  def printStreamContent(content: StreamContent) = {
    println(s"${content.text}")
  }

  def printError(reply: ExecuteReplyError) = {
    println(s"Error was: ${reply.ename.get}")
  }

  client.comm.register("my_target").addMsgHandler {
    (commWriter, commId, data) =>
      println(data)
      commWriter.close()
  }

  // Initiate the Comm connection
  client.comm.open("my_target")
}
Can someone tell me how I should run this piece of code:
// Register the callback to respond to being opened from the client
kernel.comm.register("my target").addOpenHandler {
  (commWriter, commId, targetName, data) =>
    commWriter.writeMsg(Map("response" -> "Hello World!"))
}
I would really appreciate it if someone could point me to a complete working example of how to use the comm API.
Any help will be appreciated. Thanks.
You can use your client to run this server-side (kernel-side) registration once, in one program. Then your other programs can communicate with the kernel using this channel.
Here is how I ran my registration in the first program mentioned above:
client.execute(
  """
    // Register the callback to respond to being opened from the client
    kernel.comm.register("my target").
      addOpenHandler {
        (commWriter, commId, targetName, data) =>
          commWriter.writeMsg(org.apache.toree.kernel.protocol.v5.MsgData("response" -> "Toree Hello World!"))
      }.
      addMsgHandler {
        (commWriter, _, data) =>
          if (!data.toString.contains("closing")) {
            commWriter.writeMsg(data)
          } else {
            commWriter.writeMsg(org.apache.toree.kernel.protocol.v5.MsgData("closing" -> "done"))
          }
      }
  """.stripMargin
)