Redis mass insertion: protocol vs inline commands

For my task I need to load a large amount of data into Redis as quickly as possible. It looks like this article describes exactly my case: https://redis.io/topics/mass-insert
The article starts by giving an example of using multiple inline SET commands with redis-cli. Then it proceeds to generating the Redis protocol and again feeding it to redis-cli, without explaining the reasons or benefits of using the Redis protocol.
Using the Redis protocol is a bit harder and it generates a bit more traffic. I wonder: what are the reasons to use the Redis protocol rather than simple one-line commands? Perhaps, even though the data is larger, it is easier (and faster) for Redis to parse it?

Good point. From the article:
Only a small percentage of clients support non-blocking I/O, and not all the clients are able to parse the replies in an efficient way in order to maximize throughput. For all these reasons the preferred way to mass import data into Redis is to generate a text file containing the Redis protocol, in raw format, in order to call the commands needed to insert the required data.
My understanding is that when you use the Redis protocol directly you are emulating such a client, and you get the benefits highlighted above.
Based on the docs you provided, I tried these scripts:
test.rb
def gen_redis_proto(*cmd)
  proto = ""
  proto << "*" + cmd.length.to_s + "\r\n"            # array header: number of arguments
  cmd.each { |arg|
    proto << "$" + arg.to_s.bytesize.to_s + "\r\n"   # argument length in bytes
    proto << arg.to_s + "\r\n"                       # the argument itself
  }
  proto
end

(0...100000).each { |n|
  STDOUT.write(gen_redis_proto("SET", "Key#{n}", "Value#{n}"))
}
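For illustration, the first command this script emits is the following raw RESP (an array header giving the number of arguments, then each argument prefixed by its byte length, every line terminated by \r\n):
*3
$3
SET
$4
Key0
$6
Value0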
test_no_protocol.rb
(0...100000).each { |n|
  STDOUT.write("SET Key#{n} Value#{n}\r\n")
}
ruby test.rb > 100k.txt
ruby test_no_protocol.rb > 100k_no_prot.txt
time cat 100k.txt | redis-cli --pipe
time cat 100k_no_prot.txt | redis-cli --pipe
I got these results:
teixeira: ~/stackoverflow $ time cat 100k.txt | redis-cli --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 100000
real 0m0.168s
user 0m0.025s
sys 0m0.015s
(5 file(s), 6.6 MB)
teixeira: ~/stackoverflow $ time cat 100k_no_prot.txt | redis-cli --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 100000
real 0m0.433s
user 0m0.026s
sys 0m0.012s


CONFIG GET command is not allowed from script

I am working in a scenario with one Redis master and several replicas, distributed across two cities for geographic redundancy. The point is that, for some reason, in a Lua script I need to know in which city the active instance is running. I thought it would be quite simple:
redis.call("config", "get", "bind"), and knowing the server IP I would determine the city. Not quite:
$ cat config.lua
redis.call("config", "get", "bind")
$ redis-cli --eval config.lua
(error) ERR Error running script (call to f_9bfbf1fd7e6331d0e3ec9e46ec28836b1d40ba35): #user_script:1: #user_script: 1: This Redis command is not allowed from scripts
Why is "config" not allowed from scripts? First, even though it's no deterministic, there are no write commands.
Second, I am using version 5.0.4, and from version 5 the replication is supposed to be done by effects and not by script propagation.
Any ideas?

Any specific problems running (linux) BCP on "too many" threads?

Are there any specific problems with running Microsoft's BCP utility (on CentOS 7, https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp?view=sql-server-2017) on multiple threads? Googling did not turn up much, but I am looking at a problem that seems to be related to just that.
I am copying a set of large TSV files from HDFS to a remote MSSQL Server with some code of the form:
bcpexport() {
    filename=$1
    TO_SERVER_ODBCDSN=$2
    DB=$3
    TABLE=$4
    USER=$5
    PASSWORD=$6
    RECOMMEDED_IMPORT_MODE=$7
    DELIMITER=$8

    echo -e "\nRemoving header from TSV file $filename"
    echo -e "Current head:\n"
    echo $(head -n 1 $filename)
    echo "$(tail -n +2 $filename)" > $filename
    echo "First line of file is now..."
    echo $(head -n 1 $filename)

    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below, throws error if timeout reached
    timeout 30 sleep 5

    echo -e "\nReplacing null literal values with empty chars"
    NULL_WITH_TAB="null\t" # WARN: assumes the first field is prime-key so never null
    TAB="\t"
    sed -i -e "s/$NULL_WITH_TAB/$TAB/g" $filename
    echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" $filename)"

    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below
    timeout 30 sleep 5

    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U $USER -P $PASSWORD \
        -d $DB \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e ${filename}.bcperror.log
}
export -f bcpexport
parallel -q -j 7 bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER \
::: $DATAFILES/$TARGET_GLOB
where $DATAFILES/$TARGET_GLOB constructs a glob that lists a set of files in a directory.
When running this code for a set of TSV files, I find that sometimes some (but not all) of the parallel BCP threads fail, i.e. some files successfully copy to the MSSQL Server:
Starting copy...
5397376 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 154902 Average : (34843.8 rows per sec.)
while others output an error message:
Starting copy...
BCP copy in failed
Usually I see this pattern: a few successful BCP copy-in operations returned by the first few threads, then a bunch of failing threads returning their output until we run out of files (GNU Parallel returns a job's output only when the whole job is done, so it appears the same as if run sequentially).
Notice in the code there is the -e option to produce an error file for each BCP copy-in operation (see https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017#e). When examining these files after observing the failing behavior, they are all blank, with no error messages.
I have only seen this with the number of threads >= 10 (and only for certain sets of data, so I assume it has something to do with the total number of files and the file sizes, and yet...); no errors have been seen so far when using ~7 threads, which further makes me suspect this has something to do with multi-threading.
Monitoring system resources (via free -mh) shows that generally ~13GB of RAM is always available.
It may be helpful to note that the data bcp is trying to copy in may be ~500000-1000000 records long, with an upper limit of ~100 columns per record.
Does anyone have any idea what could be going on here? Note that I am pretty new to using BCP as well as GNU Parallel and multi-threading.
No, there are no issues specific to the BCP program being run in multiple threads. You seem to be on the right track regarding what I would say your issue is: system resources. Have you monitored system resources while increasing the number of threads? If anything, there is likely an issue with BCP executing properly when memory/CPU/network resources are low. Regarding the "-e" option, it is meant to output data errors. Login errors, bad table names and many other errors are not reported in the file created with the -e option. When you do get output using the "-e" option, you'll see info like "value truncated" and such, with line numbers and sample data that were at issue.
TLDR: Adding more threads to run concurrently, each having bcp copy in a file of data, seems to have the effect of overwhelming the endpoint MSSQL Server with write instructions, causing the bcp threads to fail (maybe timing out?). How many threads becomes "too many" seems to depend on the size of the files being copied in by bcp (i.e. both the number of records in each file and the width of each record, i.e. the number of columns).
Long version (more reasons for my theory):
1. When running a larger number of bcp threads and looking at the processes started on the machine (https://clustershell.readthedocs.io/en/latest/tools/clush.html) with
ps -aux | grep bcp
I see a bunch of sleeping processes (notice the S state, see https://askubuntu.com/a/360253/760862) as shown below (newlines added for readability):
me 135296 14.5 0.0 77596 6940 ? S 00:32 0:01
/opt/mssql-tools/bin/bcp TABLENAME in /path/to/tsv/1_16_0.tsv -D -S MyMSSQLServer -U myusername -P -d myDB -c -t \t -e /path/to/logfile
These processes appear to sleep for a very long time. Further debugging into why they are sleeping suggests that they may in fact be doing their intended job (which would further imply that the problem may be coming from BCP itself, see https://stackoverflow.com/a/52748660/8236733). From https://unix.stackexchange.com/a/47259/260742 and https://unix.stackexchange.com/a/36200/260742:
A process in S state is usually in a blocking system call, such as reading or writing to a file or the network, or waiting for another called program to finish.
(eg. writing to the MSSQL Server endpoint destination given to bcp in the ODBCDSN)
Your process will be in S state when it is doing reads and possibly writes that are blocking. Can also happen while waiting on semaphores or other synchronization primitives... This is all normal and expected, and not usually a problem... you don't want it to waste CPU while it's waiting for user input.
2. When running different sets of files with varying numbers of records per file (e.g. ranges of 500000 - 1000000 rows/file) and record widths (~10 - 100 columns/row), I found that in cases with either very large data widths or amounts, running a fixed number of bcp threads would fail.
E.g. for a set of ~33 TSVs with ~500000 rows each, each row being ~100 columns wide, a set of 30 threads would write the first few OK, but then all the rest would start returning failure messages. Incorporating a bit from #jamie's answer, the fact that the failure messages returned are "BCP copy in failed" errors does not necessarily mean it has to do with the content of the data in question. No actual content is being written into the -e error-log files by my process, and #jamie's post says this:
Regarding the "-e" option, it is meant to output data errors. Login errors, bad table names and many other errors are not reported in the file created with the -e option. When you do get output using the "-e" option, you'll see info like "value truncated" and such, with line numbers and sample data that were at issue.
Meanwhile, a set of ~33 TSVs with ~500000 rows each, each row being only ~10 columns wide, and still using 30 bcp threads, would complete quickly and without error (it would also be faster when reducing the number of threads or the file set). The only difference here is the overall size of the data being bcp copied in to the MSSQL Server.
All this while
free -mh
still showed that the machine running the threads had ~15GB of free RAM remaining in each case (which is again why I suspect that the problem has to do with the remote MSSQL Server endpoint rather than with the code or the local machine itself).
3. When running some of the tests from (2), I found that manually killing the parallel process (via CTRL+C) and then trying to remotely truncate the testing table being written to, with /opt/mssql-tools/bin/sqlcmd -Q "truncate table mytable" from the local machine, would take a very long time (as opposed to manually logging into the MSSQL Server and executing truncate table mytable in the DB). Again, this makes me think that this has something to do with the MSSQL Server having too many connections and just being overwhelmed.
** Anyone with MSSQL Mgmt Studio experience reading this (I have basically none): if you see anything here that makes you think my theory is incorrect, please let me know your thoughts.
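Based on this theory, one mitigation to try (untested here; the -j and --delay values are illustrative, not tuned) is simply to throttle how many bcp jobs run at once and to stagger their starts:
parallel -q -j 5 --delay 2 bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER \
::: $DATAFILES/$TARGET_GLOB
Here -j caps the number of concurrent bcp processes and --delay spaces out job starts, so the server is not hit with all connections at the same moment.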

Redis benchmarking for HMSET, HGETALL with a data size

Can someone let me know how I can use redis-benchmark to benchmark HMSET and HGETALL with a fixed data size (the -d option in redis-benchmark)? I am using Redis 3.2.5.
I have gone through this answer and tried the command below:
root#cache-server1:~# redis-benchmark -h a.b.c.d -p XXXX hmset hgetall myhash rand_int rand_string -d 2048
====== hmset hgetall myhash rand_int rand_string -d 2048 ======
10000 requests completed in 0.11 seconds
50 parallel clients
3 bytes payload
keep alive: 1
99.64% <= 1 milliseconds
100.00% <= 1 milliseconds
89285.71 requests per second
But looking at the output, it seems it is using only a 3-byte payload.
If this is not possible via redis-benchmark, can someone suggest an alternative?
The payload is only 3 bytes (the default) because the -d is taken as part of the command. The command must be the last argument, and all switches must precede it.
Besides that, you can't use redis-benchmark to run two custom commands. Also, the -d option is only applicable to predefined tests (the ones that run by default or with the -t option) and has no meaning if the user specifies the command used in the benchmark.
If you have a specific benchmarking flow that you want to test, the best thing you can do is mock it with any client that you're comfortable with.
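For example, here is a minimal sketch of such a mock in Python with redis-py (assuming redis-py >= 3.5; the host, port, request count and field layout are placeholders), timing hash writes and HGETALL reads with a fixed 2048-byte value per field:
import time
import redis

r = redis.Redis(host="a.b.c.d", port=6379)        # placeholder connection details
payload = "x" * 2048                               # fixed payload size per field, like -d 2048
fields = {"field:%d" % i: payload for i in range(10)}
requests = 10000

start = time.perf_counter()
for i in range(requests):
    r.hset("myhash:%d" % i, mapping=fields)        # multi-field HSET (HMSET is deprecated)
write_secs = time.perf_counter() - start
print("HSET:    %.0f requests per second" % (requests / write_secs))

start = time.perf_counter()
for i in range(requests):
    r.hgetall("myhash:%d" % i)
read_secs = time.perf_counter() - start
print("HGETALL: %.0f requests per second" % (requests / read_secs))
Unlike redis-benchmark this runs a single blocking client, so the absolute numbers will be lower, but it lets you control the exact commands and payload size.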

In Lettuce(4.x) for Redis how to reduce round trips and use output of one command as input for another command, especially for Georadius

I have seen this: pass results to another command in redis,
and used via the command line this command works well:
src/redis-cli keys '*' | xargs src/redis-cli mget
However, how can we achieve the same effect via Lettuce? (I started trying out 4.0.2.Final.)
Also, a solution to this is particularly important in the following scenario:
Say we are using the geolocation capabilities, and we add a set of locations for "category-1"
using GEOADD:
GEOADD "category-1" 8.6638775 49.5282537 "location-id:1" 8.3796281 48.9978127 "location-id:2" 8.665351 49.553302 "location-id:3"
Next, say we do a GEORADIUS to get the locations within a 10 km radius of 8.6582361 49.5285495 for "category-1".
Now we get back "location-id:1" & "location-id:3".
Given that I have already set values for the keys "location-id:1" & "location-id:3",
I want to pipe commands so as to do the GEORADIUS as well as an MGET on all the matching results.
Does Redis provide a feature to do that?
And/or how can we achieve this via the Lettuce client library without first manually iterating through the results of GEORADIUS and then doing a manual MGET?
That would be more efficient for the program that uses it.
Does anyone know how we can do this?
Update
This is the piped command for the scenario I discussed above:
src/redis-cli GEORADIUS "category-1" 8.6582361 49.5285495 10 km | xargs src/redis-cli mget
Now we need to know how to do this via Lettuce
IMPORTANT: never use KEYS, always use SCAN instead if you must.
This isn't really a question about Lettuce nor Java so I can actually answer it :)
What you're trying to do is use the results of a read operation (GEORADIUS) as input (key names) for another read operation (MGET). This type of flow can't be pipelined, well, just because of that: pipelining means that you don't need the answers to operations right away, but in your case you do.
However.
Since you're reading String keys with MGET, you might as well just denormalize everything (remember, we're NoSQL) and store the contents of these keys in the Sorted Set's members, e.g.:
GEOADD "category-1" 8.6638775 49.5282537 "location-id:1:moredata:evenmoredata:{maybe a JSON document here}:orperhapsmsgpack"
This will allow you to get the locations and their "data" with one GEORADIUS call. Of course, any updates to location:1's data will need to be done across all categories.
A note about Lua scripts: while a Lua script could definitely save on the back and forth in this case, any such script will be against best practices/not cluster safe.
After digging around and studying Lua scripting, my conclusion is that removing round trips in such a way can only be done via Lua scripts, as suggested by Itamar Haber.
I ended up creating a Lua script file (myscript.lua) as below:
local locationKeys = redis.call('GEORADIUS', 'category-1', '8.6582361', '49.5285495', '10', 'km')
if unpack(locationKeys) == nil then
  -- GEORADIUS returned no members within the radius
  return nil
else
  -- fetch the values stored under the returned location keys in one round trip
  return redis.call('MGET', unpack(locationKeys))
end
** of course we should be sending in parameters to this... this is just a poc :)
Now you can execute it via the command:
src/redis-cli EVAL "$(cat myscript.lua)" 0
Then, to reduce the network overhead of sending the entire script to Redis for every execution, we can register the script with Redis.
Redis will give us back a SHA1 digest of the script, which can be used to refer to it in future calls.
This can be done as below:
src/redis-cli SCRIPT LOAD "$(cat myscript.lua)"
this should give back a SHA1 digest, something like: 49730aa2ed3034ee48f818e486b2bdf1b500b19e
Subsequent calls can then be made using this digest, e.g.:
src/redis-cli evalsha 49730aa2ed3034ee48f818e486b2bdf1b500b19e 0
The sad part, however, is that the SHA1 digest is remembered only as long as the Redis instance is running. If it is restarted, the digest is lost and you have to do the SCRIPT LOAD once again. If nothing changes in the script, the SHA1 digest will be the same.
Ideally, when going through a client API, we should first attempt EVALSHA; if that returns a "NOSCRIPT No matching script" error, then as a fallback do a SCRIPT LOAD, procure the SHA1 digest once again, cache it internally, and use it for further calls.
This can well be done via Lettuce; I could find the methods for it there. Hope this gives a good insight into a solution for the problem.
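To make the pattern concrete, here is a minimal sketch of that EVALSHA-with-fallback flow using Python and redis-py (not Lettuce; the connection details are placeholders). Lettuce exposes equivalent script-load and evalsha commands.
import redis

r = redis.Redis()                        # placeholder connection details

with open("myscript.lua") as f:
    script = f.read()

sha = r.script_load(script)              # register once; cache the digest client-side

try:
    result = r.evalsha(sha, 0)           # fast path: script already cached server-side
except redis.exceptions.NoScriptError:   # e.g. the server was restarted since SCRIPT LOAD
    sha = r.script_load(script)          # re-register and refresh the cached digest
    result = r.evalsha(sha, 0)

print(result)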

How to craft specific packets on the host of Mininet to generate massive Packet-In messages

I am wondering how to generate massive numbers of Packet-In messages to the controller, in order to test the response time of an SDN controller in a Mininet environment.
Can you give me some advice on this?
You could use iperf to send packets, like this (supplying the server address and a file to transmit):
$ iperf -c <server-ip> -F <file-name>
You could specify the amount of time:
$IPERF_TIME (-t, --time)
The time in seconds to transmit for. Iperf normally works by repeatedly sending an array of len bytes for time seconds. Default is 10 seconds. See also the -l and -n options.
Here is a nice reference for iperf: https://iperf.fr/.
If you would like to use Scapy, try this:
from scapy.all import IP, TCP, send
data = "University of Network blah blah"
a = IP(dst="129.132.2.21")/TCP()/data
send(a)
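To actually stress the controller with Packet-In messages, a rough sketch (addresses, ports, and counts are illustrative) is to send many packets to destinations the switch has no flow entry for, so each new flow is punted to the controller as a Packet-In:
from scapy.all import IP, TCP, send
import random

# Run this on a Mininet host; every packet to a not-yet-seen destination/port
# should miss the flow table and trigger a Packet-In at the controller.
for i in range(1000):
    dst = "10.0.%d.%d" % (random.randint(0, 255), random.randint(1, 254))
    pkt = IP(dst=dst) / TCP(dport=random.randint(1024, 65535)) / "test-payload"
    send(pkt, verbose=False)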