How to pass `keys_and_args` to redis-py's `eval` function, and on to a Lua script - redis

I wish to pass a number of keys and values from Python to a Lua script, via Redis's eval function, which is documented as:
eval(script, numkeys, *keys_and_args)
Execute the Lua script, specifying the numkeys the script will touch and the key names and argument values in keys_and_args. Returns the result of the script.
In practice, use the object returned by register_script. This function exists purely for Redis API completion.
I am following this answer as a starting point. That script increments the scores of all members in the specified sorted set by 1. As I wish to specify the values to update (key names) and the increment count for each (argument values), my script looks like this:
-- some logging
local loglist = "lualog"
redis.pcall("DEL", loglist)
local function logit(msg)
  redis.pcall("RPUSH", loglist, msg)
end
logit("started")
-- count & log the keys provided
local countofkeys = table.getn(KEYS)
logit(countofkeys)
-- loop through each key and increment
for n = 1, countofkeys do
  redis.call("zincrby", "test_set", ARGV[n], KEYS[n])
end
I can run this from the command line with:
$ redis-cli --eval script.lua apple orange , 1 1
Then in Python confirm that the values have incremented:
>>> r.zrange('test_set', start = 0, end = -1, withscores=True)
[(b'apple', 1.0), (b'orange', 1.0)]
However, I don't know how to run this using eval:
>>> c.eval(script,1,{'orange':1,'apple':1})
redis.exceptions.DataError: Invalid input of type: 'dict'. Convert to a byte, string or number first.
>>> c.eval(script,2,'apple orange , 1 1')
redis.exceptions.ResponseError: Number of keys can't be greater than number of args
>>> c.eval(script,1,'apple orange , 1 1')
redis.exceptions.ResponseError: Error running script (call to f_aaecafd58b474f08bafa5d4fefe9db98a58b4084): #user_script:21:
#user_script: 21: Lua redis() command arguments must be strings or integers
The documentation isn't too clear on what keys_and_args should look like. Also, at the command line numkeys isn't actually required by the looks of things. Does anyone know what this should look like?
Bonus question: how to avoid hard-coding "test_set" into the Lua script?

*keys_and_args means the keys and arguments are passed as separate positional arguments - the asterisk prefix is the Pythonic way of saying the function takes a variable number of arguments, so pass a flat sequence of key names followed by argument values (or unpack a list with *).
Bonus tip: look into redis-py Script helper.
Bonus answer: Any key names touched by the script need to be provided via the KEYS table. Your script is doing it all wrong - read the documentation about EVAL.
Also, at the command line numkeys isn't actually required by the looks of things
This is only the case with redis-cli when used in that fashion - the comma (',') separates key names from arguments, so the client can work out numkeys for you.
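For the script as written (member names in KEYS, increments in ARGV), a minimal sketch of the call, assuming c is your redis.Redis client and script holds the Lua source, is to pass each key and each argument as its own positional parameter:
# a minimal sketch - keys first, then the arguments, with numkeys=2 splitting them
c.eval(script, 2, 'apple', 'orange', 1, 1)

# the register_script helper builds the same call and lets redis-py reuse the
# script's SHA via EVALSHA
incr_members = c.register_script(script)
incr_members(keys=['apple', 'orange'], args=[1, 1])
One way to address the bonus question (a sketch, not the only option) is to pass "test_set" as an additional first key and read it inside the script as KEYS[1], shifting the member names to KEYS[2] onwards and adjusting the loop accordingly.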

In the documentation, the first argument after the script itself is numkeys, and the rest of the arguments are termed *keys_and_args. The way to provide arguments is similar to argc and argv. So you would do something like this:
redis.eval(lua_script, 1, "BUCKET_SIZE", total_bucket_size, refill_size)
numkeys specifies how many of the values in *keys_and_args are treated as keys; here the 1 means only "BUCKET_SIZE" is a key and the remaining values become arguments.

eval() receives 3 parameters or, to make it easier to understand, "4" parameters:
script: string
number_of_keys: integer
key_list: unpacked iterable of key objects, for example, *[1, 2, 3] => 1, 2, 3
We have 3 keys, so number_of_keys should be 3.
argument_list: unpacked iterable of argument objects, for example, *['one', 'two', 'three'] => 'one', 'two', 'three'
If we want to access the 2nd element of the key list, use KEYS[2] in the Lua script.
If we want to access the 1st element of the argument list, use ARGV[1].
To return a list of KEYS[2] and ARGV[1]:
cache.eval('return {KEYS[2], ARGV[1]}', 3, 1, 2, 3, 'one', 'two', 'three')
So, going back to the 3-parameter view, the last one is *keys_and_args: the unpacked iterable of keys followed by the arguments.
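As a rough check of the call above (a sketch, assuming cache is a redis.Redis client), redis-py returns the Lua table as a Python list of byte strings:
# sketch only: KEYS[2] is the second key ('2') and ARGV[1] is the first argument
result = cache.eval('return {KEYS[2], ARGV[1]}', 3, 1, 2, 3, 'one', 'two', 'three')
print(result)  # expected: [b'2', b'one']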

Related

Nextflow: add unique ID, hash, or row number to tuple

ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A', 'B', 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])
process makeGrid {
    input:
    path input_file from ch_files
    each parameter from ch_parameters
    each samplesize from ch_samplesize

    output:
    tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid

    """
    echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
    """
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-digit alphanumeric code without a "meaning", because the order in the table doesn't matter. I could extract the last part of the working folder, which is seemingly unique per process, but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it provided as a runtime metadata variable (plus it's a bit long, but OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
The combinations are not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.
Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
Like you mentioned, there's unfortunately no way to access the unique task hash without some hack to parse $PWD. However, it might be possible to use BASH parameter substitution to avoid sed/awk/cut (assuming BASH is your shell, of course); you could try using: "${PWD##*/}"
You might instead prefer using ${task.index}, which is an index unique to each task within the same run. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted, for example:
process example {
    ...

    script:
    def idx = String.format("%06d", task.index)
    """
    echo "${idx}"
    """
}
Alternatively, create your own UUID. You might be able to take just the first N characters, but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter, though, for a small, finite set of inputs:
process example {
    ...

    script:
    def uuid = UUID.randomUUID().toString()
    """
    echo "${uuid}"
    echo "${uuid.take(6)}"
    echo "${uuid.takeBefore('-')}"
    """
}

How to pass string arguments with spaces to SQL notebook in databricks?

I have a SQL notebook (notebookA) to which I want to pass arguments from another notebook (notebookB).
---notebookA---
SELECT $v as $c
When I do this from notebookB, it gives me a result.
---notebookB---
%run ./notebookA $v='james' $c=name
But when there is a space in the value, it gives me an error like the one below:
---notebookB---
%run ./notebookA $v='james potter' $c=name
Failed to parse %run command: string matching regex `\$[\w_]+' expected but `p' found)
What would be the solution then?
Magic commands do not allow variables to be passed. Instead, you can use dbutils.
Python:
dbutils.notebook.run("notebookA", 60, {"v": "james potter", "c": "name"})
Reference: https://docs.databricks.com/user-guide/notebooks/notebook-workflows.html
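On the receiving side, a hedged sketch (assuming notebookA can include a Python cell): dbutils.notebook.run exposes the passed values as notebook widgets, so notebookA could read them like this:
# sketch only - read the values passed by dbutils.notebook.run as widgets
v = dbutils.widgets.get("v")   # "james potter"
c = dbutils.widgets.get("c")   # "name"
spark.sql(f"SELECT '{v}' AS {c}").show()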

How to get all keys/values from redis in order to insert them into SQL db?

I have a lot of analytics data that I'm adding to redis. I plan on incrementally moving the data out of redis and into my database.
I know I can use KEYS [the_key]:* to get all keys that match. For example, I can do that to get the following:
127.0.0.1:6379> KEYS c_Track:*
1) "c_Track:6c93a5c1-77e9-4c4a-9232-bf182713a02e"
2) "c_Track:2c9d99c2-af37-4de9-ac64-b48f339e97a9"
3) "c_Track:9e7fd190-86d9-4b4a-9a70-7bf4c7768eef"
4) "c_Track:7f2d2e98-7440-4fd7-a80a-2af309ab15a4"
Is there a recommended way to get these values easily? I can get the keys, but how can I get all the values as well? I can loop through the keys to get the values, but is there some one-shot method for doing this?
Also I know I shouldn't use keys, but this is just an example. Thanks
Also I know I shouldn't use keys
So don't. Use SCAN instead.
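From redis-py, a minimal client-side sketch (assuming r is a redis.Redis client and that the values are plain strings) is to iterate with scan_iter and fetch each value:
# sketch only: scan keys matching the pattern and fetch their string values
for key in r.scan_iter(match='c_Track:*'):
    value = r.get(key)   # assumes string values; adapt for hashes, sets, etc.
    print(key, value)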
is there some one-shot method for doing this?
No, not as a core Redis command, but given the need this is fairly simple to achieve with a server-side Lua script. For example, assuming that your values are strings, you could do something like the following:
local cursor = tonumber(ARGV[1])
local pattern = ARGV[2]
local scan = redis.call('SCAN', cursor, 'MATCH', pattern)
for i, v in ipairs(scan[2]) do
  local val = redis.call('GET', v)
  scan[2][i] = { v, val }
end
return scan
Assuming that this script is saved under "scan.lua", you can run it as follows:
$ redis-cli SET foo bar
OK
$ redis-cli SET baz qaz
OK
$ redis-cli --eval scan.lua , 0 "*"
1) "0"
2) 1) 1) "baz"
2) "qaz"
2) 1) "foo"
2) "bar"
To scan your entire keyspace, call the script with the returned cursor until it returns 0.
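Driving that loop from redis-py could look roughly like this (a sketch that assumes the script above is saved as scan.lua and r is a redis.Redis client):
# sketch only: register the Lua script and follow the cursor until it wraps to 0
with open('scan.lua') as f:
    scan_fetch = r.register_script(f.read())

cursor = 0
while True:
    cursor, pairs = scan_fetch(args=[cursor, '*'])
    for key, value in pairs:
        print(key, value)
    if int(cursor) == 0:
        break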
Notes:
1) If your keys are of different types, you should change the script accordingly (e.g. https://github.com/itamarhaber/redis-lua-scripts/blob/master/scanfetch.lua).
2) While this script goes against the common recommendation of generating key names inside a script, it is still safe to run as SCAN returns keys that are in the server's keyspace (whether single-instance or clustered).

InputFunctionException with KeyError

Suppose I have some Python code that generates a dictionary as its result. I need to write each element of the dictionary to a separate folder, which will later be used by another set of rules in Snakemake.
I have written the code as follows, but it does not work!
simulation_index_dict={1:'test1',2:'test2'}

def indexer(wildcards):
    return(simulation_index_dict[wildcards.simulation_index])

rule SimulateAll:
    input:
        expand("{simulation_index}/ProteinCodingGene/alfsim.drw",simulation_index=simulation_index_dict.keys())

rule simulate_phylogeny:
    output:
        ProteinCodingGeneParams=expand("{{simulation_index}}/ProteinCodingGene/alfsim.drw"),
        IntergenicRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_IR.dawg"),
        RNAGeneParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RG.dawg"),
        RepeatRegionParams=expand("{{simulation_index}}/IntergenicRegions/dawg_RR.dawg"),
    params:
        value=indexer,
    shell:
        """
        echo {params.value} > {output.ProteinCodingGeneParams}
        echo {params.value} > {output.IntergenicRegionParams}
        echo {params.value} > {output.RNAGeneParams}
        echo {params.value} > {output.RepeatRegionParams}
        """
The error it return is :
InputFunctionException in line 14 of /$/test.snake:
KeyError: '1'
Wildcards:
simulation_index=1
It seems that the problem is with the params section of the rule, because deleting it eliminates the error, but I cannot figure out what is wrong with the params!
The solution: using strings as dictionary keys
One can guess from the error message (KeyError: '1') that some query in a dictionary went wrong on a key that is '1', which happens to be a string.
However, the dictionary used in the indexer "params" function has integers as keys.
Apparently, using strings instead of ints as keys to this simulation_index_dict dictionary solves the problem (see comments below the question).
The cause: loss of type information during workflow inference
The cause of the problem is likely that the integer nature (inherited from simulation_index_dict.keys()) of the value assigned to the simulation_index parameter of the expand in SimulateAll is "forgotten" in subsequent steps of the workflow inference.
Indeed, the expand results in a list of strings, which are then matched against the output of the other rules (which also consist of strings) to infer the values of the wildcards attributes (which are also strings). Therefore, when the indexer function is executed, wildcards.simulation_index is a string, and this causes a KeyError when looking it up in simulation_index_dict.
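A minimal sketch of the fix, keeping the rest of the Snakefile unchanged, is either to key the dictionary by strings or to convert the wildcard back to an int inside indexer (the latter is just an alternative sketch, not taken from the original answer):
# option 1: use string keys, matching what the wildcard value will be
simulation_index_dict = {'1': 'test1', '2': 'test2'}

# option 2: keep integer keys and convert the wildcard value in the lookup
def indexer(wildcards):
    return simulation_index_dict[int(wildcards.simulation_index)]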

Question about reading and saving a large txt-file via {RSQLite} line by line into a DB

Since my hardware is very limited (a dual core with 32-bit Win7 and 4GB of RAM - I need to make the best of it...), I am trying to save a large text file (about 1.2GB) into a DB, which I can then query with SQL to do some analytics on particular subgroups.
To be honest, I'm not familiar with this area, and since I could not find help regarding my issues via "googling", I'll just quickly show what I came up with and how I thought things would work:
First I check how many columns my txt-file has:
k <- length(scan("data.txt", nlines=1, sep="\t", what="character"))
Then I open a connection to the text file so that it does not need to be opened again for every single line:
filecon<-file("data.txt", open="r")
Then I initialize a connection (dbcon) to an SQLite database
dbcon<- dbConnect(dbDriver("SQLite"), dbname="mydb.dbms")
I find out where the position of the first line is
pos<-seek(filecon, rw="r")
Since the first line contains the column-names I save them for later use
col_names <- unlist(strsplit(readLines(filecon, n=1), "\t"))
Next, I test reading the first 10 lines, line by line, and saving them into a DB table, which (should) contain k columns with column names = col_names.
for(i in 1:10) {
  # prints the iteration number in hundreds
  if(i %% 100 == 0) {
    print(i)
  }
  # read one line into a variable tt
  tt <- readLines(filecon, n=1)
  # parse tt into a variable tt2, since tt is a string
  tt2 <- unlist(strsplit(tt, "\t"))
  # Every line, read and parsed from the text file, is immediately saved
  # in the SQLite database table "results" using the command dbWriteTable()
  dbWriteTable(conn=dbcon, name="results", value=as.data.frame(t(tt2[1:k]), stringsAsFactors=T), col.names=col_names, append=T)
  pos <- c(pos, seek(filecon, rw="r"))
}
If I run this, I get the following error:
Warning messages:
1: In value[[3L]](cond) :
RS-DBI driver: (error in statement: table results has 738 columns but 13 values were supplied)
Why should I supply 738 columns? If I change k (which is 12) to 738, the code works, but then I need to query the columns by V1, V2, V3, ... and not by the column names I intended to supply:
res <- dbGetQuery(dbcon, "select V1, V2, V3, V4, V5, V6 from results")
Any help or even a small hint is very much appreciated!