How to pass string arguments with spaces to SQL notebook in databricks? - sql

I have a SQL notebook (notebookA) to which I want to pass arguments from another notebook (notebookB).
---notebookA---
SELECT $v as $c
When I do this from notebookB, it gives me the result:
---notebookB---
%run ./notebookA $v='james' $c=name
But when there is a space in the value, it gives me an error like the one below:
---notebookB---
%run ./notebookA $v='james potter' $c=name
Failed to parse %run command: string matching regex `\$[\w_]+' expected but `p' found)
What would be the solution then?

Magic commands do not allow variables to be passed. Instead, you can use dbutils.
Python:
dbutils.notebook.run("notebookA", 60, {"v": "james potter", "c": "name"})
Reference: https://docs.databricks.com/user-guide/notebooks/notebook-workflows.html
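For the receiving side: parameters passed through dbutils.notebook.run() are exposed as widgets in the called notebook. A minimal sketch of how notebookA could read them if it is a Python notebook (the query rewrite is illustrative, not the asker's exact code):

# Hypothetical Python cell in notebookA: values passed via
# dbutils.notebook.run() arrive as widgets, with spaces intact.
v = dbutils.widgets.get("v")  # "james potter"
c = dbutils.widgets.get("c")  # "name"

# Build the query in Python so the multi-word value survives.
display(spark.sql(f"SELECT '{v}' AS {c}"))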

Related

How to pass `keys_and_args` to redis-py's `eval` function, and on to a Lua script

I wish to pass a number of keys and values from Python to a Lua script, via Redis's eval function, which is documented as:
eval(script, numkeys, *keys_and_args)
Execute the Lua script, specifying the numkeys the script will touch and the key names and argument values in keys_and_args. Returns the result of the script.
In practice, use the object returned by register_script. This function exists purely for Redis API completion.
I am following this answer as a starting point. That script increments the scores of all values in the specified sorted set by 1. As I wish to specify the values to update (key names) and the increment count for each (argument values), my script looks like this:
-- some logging
local loglist = "lualog"
redis.pcall("DEL", loglist)
local function logit(msg)
    redis.pcall("RPUSH", loglist, msg)
end
logit("started")
-- count & log the keys provided
local countofkeys = table.getn(KEYS)
logit(countofkeys)
-- loop through each key and increment
for n = 1, countofkeys do
    redis.call("zincrby", "test_set", ARGV[n], KEYS[n])
end
I can run this from the command line with:
$ redis-cli --eval script.lua apple orange , 1 1
Then in Python confirm that the values have incremented:
>>> r.zrange('test_set', start = 0, end = -1, withscores=True)
[(b'apple', 1.0), (b'orange', 1.0)]
However I don't know how to run this using eval:
>>> c.eval(script,1,{'orange':1,'apple':1})
redis.exceptions.DataError: Invalid input of type: 'dict'. Convert to a byte, string or number first.
>>> c.eval(script,2,'apple orange , 1 1')
redis.exceptions.ResponseError: Number of keys can't be greater than number of args
>>> c.eval(script,1,'apple orange , 1 1')
redis.exceptions.ResponseError: Error running script (call to f_aaecafd58b474f08bafa5d4fefe9db98a58b4084): #user_script:21:
#user_script: 21: Lua redis() command arguments must be strings or integers
The documentation isn't too clear on what keys_and_args should look like. Also, at the command line, numkeys isn't actually required, by the looks of things. Does anyone know what this should look like?
Bonus question: how to avoid hard-coding "test_set" into the Lua script?
*keys_and_args should be an iterable (e.g. a list) - the use of an asterisk as a prefix to the argument's name is the Pythonic way of saying that.
Bonus tip: look into redis-py's Script helper.
Bonus answer: Any key names touched by the script need to be provided via the KEYS table. Your script is doing it all wrong - read the documentation about EVAL.
Also, at the command line, numkeys isn't actually required, by the looks of things
This is only the case with the CLI when used in that fashion: the comma (',') delimits between key names and arguments.
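Putting the above together, a sketch of a corrected call, assuming a local Redis instance; the reworked Lua also addresses the bonus question by passing the sorted set's name in via KEYS[1] instead of hard-coding "test_set":

import redis

r = redis.Redis()

# Reworked script (sketch): the sorted set name arrives as KEYS[1];
# member/increment pairs are interleaved in ARGV, because member names
# are not Redis key names and therefore do not belong in KEYS.
script = """
local zset = KEYS[1]
for i = 1, #ARGV, 2 do
    redis.call("ZINCRBY", zset, ARGV[i + 1], ARGV[i])
end
return redis.status_reply("OK")
"""

# Plain eval: numkeys first, then key names, then arguments,
# all as separate positional values (never a dict or one big string).
r.eval(script, 1, "test_set", "apple", 1, "orange", 1)

# Or, per the bonus tip, via the Script helper:
zincr = r.register_script(script)
zincr(keys=["test_set"], args=["apple", 1, "orange", 1])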
The first argument in the documentation is numkeys and the rest of the arguments are termed *keys_and_args. The way to provide arguments is similar to argc and argv. So you would do something like this:
redis.eval(lua_script, 1, "BUCKET_SIZE", total_bucket_size, refill_size)
numkeys specifies how many of the leading values in *keys_and_args should be treated as keys; the values after those are the args. Here, the 1 says that only the first value ("BUCKET_SIZE") is a key.
eval() receives 3 parameters or, to make it easier to understand, "4" parameters:
script: string
number_of_keys: integer
key_list: unpacked iterable key objects, for example, *{1, 2, 3} => 1, 2, 3
We have 3 keys, so number_of_keys should be 3
argument_list: unpacked iterable argument objects, for example, *{'one', 'two', 'three'} => 'one', 'two', 'three'
If we want to access the 2nd element of the key list, use KEYS[2] in the Lua script.
If we want to access the 1st element of the argument list, use ARGV[1].
To return a list of KEYS[2] and ARGV[1]:
cache.eval('return {KEYS[2], ARGV[1]}', 3, 1, 2, 3, 'one', 'two', 'three')
So, coming back to the 3-parameter view, the last parameter is *keys_and_args: the unpacked iterable of keys followed by arguments.
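For reference, a sketch of what that call returns, assuming a default redis-py client (decode_responses=False, so replies come back as bytes):

result = cache.eval('return {KEYS[2], ARGV[1]}', 3, 1, 2, 3, 'one', 'two', 'three')
print(result)  # expected: [b'2', b'one']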

Snakemake - input function exception

I am trying to run Snakemake code using a .json file as input. While checking the dry run, I got the following error:
InputFunctionException in line 172 of /home/Snakefile_ChIPseq_pe:
KeyError: '130241_1'
Wildcards:
library=130241_1
This is the relevant part of the Snakemake code:
rule findPeaks:
    input:
        sample = os.path.join(HOMERTAG_DIR, "{library}"),
        input = lambda wildcards: os.path.join(HOMERTAG_DIR, config['lib_input'][wildcards.library])
    output:
        os.path.join(HOMERPEAK_DIR, "{library}.all.hpeaks")
    params:
        config['homer_findPeaks_params']
    shell:
        "findPeaks {input.sample} -i {input.input} {params} -o {output}"
There is a single quote around the input sample which is missing in the 'lib_input' part. How do I add that single quote ahead of the variable?
Also, the library names are like 12345_1, 12345_2, etc. I never had this problem before; however, for the first time I have libraries with an underscore in their names.
Snakemake will first try to interpret the given value as a number. Only if that fails will it interpret the value as a string. Here, it does not fail, because the underscore _ is interpreted as a thousands separator.
My guess is that in your json file the library IDs are not quoted. E.g. you have this:
{
    "lib_input": {1234_1: "input.txt"}
}
Instead of:
{
    "lib_input": {"1234_1": "input.txt"}
}
Or maybe library 130241_1 is not in the json at all?
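If fixing the JSON is not an option, a defensive variant of the rule (a sketch based on the question's code) is to normalize the config keys to strings once, since wildcard values are always strings by the time the input function runs:

# Sketch: coerce config keys to strings so the lookup cannot miss on type.
lib_input = {str(k): v for k, v in config['lib_input'].items()}

rule findPeaks:
    input:
        sample = os.path.join(HOMERTAG_DIR, "{library}"),
        input = lambda wildcards: os.path.join(HOMERTAG_DIR, lib_input[wildcards.library])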

Snakemake Using expand with dictionary

I am writing this rule:
rule process_files:
    input:
        dataout=expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_list[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        do something ...
Where expand will get a value from the dictionary my_dictionary (called my_list in the snippet above) based on the ref value. I used wildcards like this: my_dictionary[wildcards.ref]. But it ends up with this error: name 'wildcards' is not defined.
my_dictionary is something like:
{A: [1, 2, 3], B: [s1, s2, ...], ...}
I could use

def myfun(wildcards):
    return expand("{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv", name=my_dictionary[wildcards.ref])

and use myfun as input, but this does not answer why I cannot use expand in place directly.
Any suggestions on how to fix it?
As @dariober mentioned, there is the wildcards object, but it is only accessible in the run/shell portion; in input it can be accessed through an input function.
Here is an example implementation that expands the input based on wildcards.ref:
rule all:
    input:
        expand("{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv", dataset=["D1", "D2"], sample=["S1", "S2"], ref=["R1", "R2"], state=["STATE1", "STATE2"], case=["C1", "C2"])

my_list = {"R1": [1, 2, 3], "R2": ["s1", "s2"]}

rule process_files:
    input:
        lambda wildcards: expand(
            "{{dataset}}/{{sample}}.{{ref}}.{{state}}.{{case}}.myresult.{name}.tsv",
            name=my_list[wildcards.ref])
    output:
        "{dataset}/{sample}.{ref}.{state}.{case}.endresult.tsv"
    shell:
        "echo '{input}' > {output}"
If you implement it as the lambda function in the example above, it should resolve the issue you mention:
The function worked but it did not resolve the variables between double curly braces, so it will ask for input for {dataset}/{sample}.{ref}.{state}.{case} and raise an error.
Your question seems similar to "snakemake wildcards or expand command", and the bottom line is that wildcards is not defined in the input. So your solution of using an input function (or a lambda function) seems correct.
(As to why wildcards is not defined in input, I don't know...)

InputFunctionException with KeyError

Suppose I have Python code that generates a dictionary as its result. I need to write each element of the dictionary to a separate folder, which will later be used by other sets of rules in Snakemake.
I have written the code as follows, but it does not work!
simulation_index_dict = {1: 'test1', 2: 'test2'}

def indexer(wildcards):
    return simulation_index_dict[wildcards.simulation_index]

rule SimulateAll:
    input:
        expand("{simulation_index}/ProteinCodingGene/alfsim.drw", simulation_index=simulation_index_dict.keys())

rule simulate_phylogeny:
    output:
        ProteinCodingGeneParams=expand("{{simulation_index}}/ProteinCodingGene/alfsim.drw"),
        IntergenicRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_IR.dawg"),
        RNAGeneParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RG.dawg"),
        RepeatRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RR.dawg"),
    params:
        value=indexer,
    shell:
        """
        echo {params.value} > {output.ProteinCodingGeneParams}
        echo {params.value} > {output.IntergenicRegionParams}
        echo {params.value} > {output.RNAGeneParams}
        echo {params.value} > {output.RepeatRegionParams}
        """
The error it returns is:
InputFunctionException in line 14 of /$/test.snake:
KeyError: '1'
Wildcards:
simulation_index=1
It seems that the problem is with the params section of the rule, because deleting it eliminates the error, but I cannot figure out what is wrong with the params!
The solution: using strings as dictionary keys
One can guess from the error message (KeyError: '1') that some query in a dictionary went wrong on a key that is '1', which happens to be a string.
However, the dictionary used in the indexer "params" function has integers as keys.
Apparently, using strings instead of ints as keys to this simulation_index_dict dictionary solves the problem (see comments below the question).
The cause: loss of type information during workflow inference
The cause of the problem is likely that the integer nature (inherited from simulation_index_dict.keys()) of the value assigned to the simulation_index parameter of the expand in SimulateAll is "forgotten" in subsequent steps of the workflow inference.
Indeed, the expand results in a list of strings, which are then matched against the output of the other rules (which also consist of strings) to infer the values of the wildcards attributes (which are also strings). Therefore, when the indexer function is executed, wildcards.simulation_index is a string, and this causes a KeyError when looking it up in simulation_index_dict.
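Concretely, the minimal fix is to key the dictionary by strings from the start (a sketch of the two affected definitions from the question):

# String keys match the always-string wildcard values, so the lookup
# in indexer no longer raises a KeyError.
simulation_index_dict = {'1': 'test1', '2': 'test2'}

def indexer(wildcards):
    return simulation_index_dict[wildcards.simulation_index]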

How can I pass command-line parameters with whitespace to an apache pig script?

I want to write a Pig script that takes a filter condition as a command-line parameter. From the command line I want to type something like:
pig -p "MY_FILTER=field1 == 0 and field2 == 5" myscript.pig
In my script I have a line:
my_filtered_data = filter my_data by $MY_FILTER;
This works as expected when MY_FILTER has no spaces and I pass quotes around my value; so if I type MY_FILTER=\"field1==0\" at the command line, the shell passes the quotes with the value and Pig does the expansion I want. However, the parameter fails to expand if I supply it like MY_FILTER=\"field1 == 0\".
I've tried a bunch of different quoting techniques and even tried running the command directly from Python's subprocess module to ensure my shell wasn't doing something weird.
Which version of Pig do you use? I use 0.9.2 and the following command works for me:
pig -p "F='field1 == 3 AND field2 == 5'" test.pig
But it doesn't work with 0.8.1.
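If upgrading is not an option, one workaround worth trying (a sketch; Pig's parameter substitution can also read parameters from a file passed with -param_file, which keeps the value out of the shell's quoting entirely):

# myscript.params (hypothetical file name): one param=value per line; the
# single quotes keep the spaces inside the value, as in the 0.9.2 command.
MY_FILTER='field1 == 0 and field2 == 5'

Then invoke it with:

pig -param_file myscript.params myscript.pig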