Redis mass insertions on a remote server

I have a remote server running Redis to which I want to push a lot of data from a Java application. Until now I have used Webdis to push one command at a time, which is not efficient, but I did not have any security issues: I could define which IPs were accepted as connections and which commands were authorized, while Redis itself was not accepting requests from outside (protected mode).
I want to try Jedis (the Java client) and its pipelining support for faster insertion, but that means I have to open my Redis instance to accept requests from outside.
My question is this: is it possible to use Webdis in a similar way (pipelined mass insertion)? And if not, what security configuration do I need to use something like Jedis over the internet?
Thanks in advance for any answer

IMO the security setup should be transparent to the Redis driver. No driver or password protection will be as secure as protocols or technologies designed specifically for security.
The simplest way I would handle the security is to keep Redis listening on 127.0.0.1:<some port> and use an SSH tunnel to the machine. At least this way you can test the performance against your current scenario.
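For example, a minimal sketch of such a tunnel (the user name, host and ports are placeholders to adjust for your setup):
$ # Forward local port 6379 to the Redis instance bound to 127.0.0.1 on the remote host
$ ssh -N -L 6379:127.0.0.1:6379 user@your-redis-host
Jedis (or any other client) can then connect to localhost:6379 and all traffic goes through the encrypted tunnel.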
You can also use IPsec or OpenVPN afterwards to set up a private network that can communicate with the Redis server.

This question is almost 4 years old so I hope its author has moved on by now, but in case someone else has the same issue I thought I might suggest a way to send data to Webdis more efficiently.
You can indeed make data ingest faster by batching your inserts, meaning you can use MSET to insert multiple keys in a single request (or HMSET for hashes, etc).
As an example, here's ApacheBench (ab) inserting one key 100,000 times using 100 clients:
$ ab -c 100 -n 100000 -k 'http://127.0.0.1:7379/SET/foo/bar'
[...]
Requests per second: 82235.15 [#/sec] (mean)
We're measuring 82,235 single-key inserts per second. Keep in mind that there's a lot more to HTTP benchmarking than just looking at averages (the latency distribution is still important, etc.) but this example is only about showing the difference that batching can make.
You can send commands to Webdis in one of three ways (documented here); a curl sketch of each follows the list:
GET /COMMAND/arg0/.../argN
POST / with COMMAND/arg0/.../argN in the HTTP body (demonstrated below)
PUT /COMMAND/arg0/.../argN-1 with argN in the HTTP body
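For reference, assuming Webdis on its default port 7379, these three forms map to curl roughly like this (a sketch, not copied from the Webdis docs):
$ # GET: command and arguments in the path
$ curl 'http://127.0.0.1:7379/SET/foo/bar'
$ # POST: command in the body (curl -d sends a POST by default)
$ curl -d 'SET/foo/bar' 'http://127.0.0.1:7379/'
$ # PUT: last argument in the body, the rest in the path
$ curl -X PUT -d 'bar' 'http://127.0.0.1:7379/SET/foo'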
If instead of inserting one key per request we create a file containing the MSET command to write 100 keys in a single request, we can significantly increase the write rate.
# first showing what the command looks like for 3 keys
$ echo -n 'MSET' ; for i in $(seq 1 3); do echo -n "/key-${i}/value-${i}"; done
MSET/key-1/value-1/key-2/value-2/key-3/value-3
# then saving the command to write 100 keys to a file:
$ (echo -n 'MSET' ; for i in $(seq 1 100); do echo -n "/key-${i}/value-${i}"; done) > batch-contents.txt
With this file, we can use ab to send this multi-insert file as a POST request (-p) to Webdis:
$ ab -c 100 -n 10000 -k -p ./batch-contents.txt -T 'application/x-www-form-urlencoded' 'http://127.0.0.1:7379/'
[...]
Requests per second: 18762.82 [#/sec] (mean)
This is showing 18,762 requests per second… with each request performing 100 inserts, for a total of 1,876,282 actual key inserts per second.
If you track the CPU usage of Redis while ab is running, you'll find that the MSET use case pegs it at 100% CPU, while sending individual SET commands does not.
Once again keep in mind that this is a rough benchmark, just enough to show that there is a significant difference when you batch inserts. This is true regardless of whether Webdis is used, by the way: batching inserts from a client connecting directly to Redis should also be much faster than individual inserts.
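For example, the same batching idea over a direct connection might look like this with redis-cli (the key names are just placeholders):
$ # One round trip writes three keys instead of three round trips writing one each
$ redis-cli MSET key-1 value-1 key-2 value-2 key-3 value-3
OK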
Note: I am the author of Webdis.

Related

JMeter - Execute SSH Commands in parallel

I need to simulate the below:
1. SSH (only once)
2. Execute a command on all the rows in a csv file at once.
The number of rows in the CSV file is dynamic. If there are 10, the command needs to be executed on all 10 rows in parallel.
I am not sure about using the SSH Command Sampler here: the SSH connection and the command are entered in the same sampler. How do I separate these, i.e. SSH only once and then execute the commands in parallel? Which JMeter components do I use here?
Note: increasing the number of threads is not an efficient option. When I do this, many sessions get created, which in turn hangs the terminal. That option works fine up to 10 users; I am not sure if there is a limit on the number of sessions.
Thanks for your support.
Regards,
Ajith
Why do you think that increasing the number of threads is not an efficient option?
I would suggest moving the "SSH (only once)" part to a setUp Thread Group and putting the "execute a command on all the rows in a CSV file at once" bit under the normal Thread Group.
If the number of rows in the CSV file is dynamic, you can make the number of threads dynamic as well using the __groovy() function, like:
${__groovy(new File('/path/to/your/file.csv').readLines().size,)}
If you want to execute all 10 requests (or however many lines there are) at exactly the same moment, you can add a Synchronizing Timer.

Import huge SQL file into SQL Server

I use the sqlcmd utility to import a 7 GB SQL dump file into a remote SQL Server. The command I use is this:
sqlcmd -S IP address -U user -P password -t 0 -d database -i file.sql
After about 20-30 min the server regularly responds with:
Sqlcmd: Error: Scripting error.
Any pointers or advice?
I assume file.sql is just a bunch of INSERT statements. For a large number of rows, I suggest using the bcp command-line utility. That will perform orders of magnitude faster than individual INSERT statements.
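For illustration only (the server, credentials, table and file names are placeholders, and the flags assume a plain character-format file with comma-separated fields):
$ # -c = character format, -t = field terminator, -b = commit in batches of 10000 rows
$ bcp database.dbo.MyTable in data.csv -S your_server_ip -U user -P password -c -t, -b 10000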
You could also bulk insert data using the T-SQL BULK INSERT command. In that case, the file path needs to be accessible to the database server (i.e. a UNC path, or the file copied to a drive on the server) and the server needs the required permissions on it. See http://msdn.microsoft.com/en-us/library/ms188365.aspx.
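A rough sketch of what that could look like, reusing sqlcmd from the question (the UNC path, table name and batch size are placeholders): put the statement in a file, say bulk-insert.sql,
BULK INSERT dbo.MyTable
FROM '\\fileserver\share\data.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 10000);
and run it the same way as the command in the question:
$ sqlcmd -S your_server_ip -U user -P password -d database -i bulk-insert.sql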
Why not use SSIS? While I am certified as a DBA, I always try to use the right tool for the job.
Here are some reasons to use SSIS.
1 - You can still use fast load (bulk copy). Make sure you set the batch size.
2 - Error handling is much better.
However, if you are using fast load, either the whole batch commits or it gets tossed.
If you are loading record by record, you can direct each error row to a separate destination.
3 - You can perform transformations on the source data before loading it into the destination.
In short: Extract, Transform, Load.
4 - SSIS loves memory and buffers. If you want to get really in depth, read some articles from Matt Mason or Brian Knight.
Last but not least, the LAN/WAN is always a factor if the job is not running on the target server with the input file on a local disk.
If you are on the same backbone with a good pipe, things go fast.
In summary, yeah you can use BCP. It is great for little quick jobs. Anything complicated with robust error handling should be done with SSIS.
Good luck,

mysql optimized query execution

As part of an ongoing research project, I am checking whether each URL exists using the cURL command. I have been running a shell script for a couple of days, and it does an update for each URL in my database. However, the script seems to update only around 100,000 rows per day.
I was thinking that if I could write the values to a file first and then do the updates, the execution might be faster.
I am connecting to the database using the command line.
mysql -h servername -u username -ppassword databasename -e "Update Query"
For example, instead of connecting to the database 2 million times like above from the command line and updating 2 million rows, I am planning to connect to the database only once from the command line and update 2 million rows from the file.
So is the second approach better than the first one or the time difference would be negligible?
Three approaches:
1. You could use LOAD DATA INFILE.
2. You could build up a .sql file with all of the updates you need and run it in a single session (sketched below).
3. You could use something other than a CLI to connect to the URLs and the DB. In other words, not the "curl" and "mysql" commands, but a real programming language with its libraries for checking URLs and updating databases.
Any of those would probably be faster. You will likely get an even bigger speedup by making the HTTP calls in parallel, which is easier to do from a real programming language.
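As a sketch of the second approach (the urls table, its id and status columns, and url_status.txt are made-up names for illustration), write every UPDATE into one file and then run the whole file over a single connection:
$ # url_status.txt is assumed to hold "id,http_status" lines produced by the curl checks
$ awk -F',' '{ printf "UPDATE urls SET status=%d WHERE id=%d;\n", $2, $1 }' url_status.txt > updates.sql
$ mysql -h servername -u username -ppassword databasename < updates.sql
Wrapping the generated statements in START TRANSACTION / COMMIT blocks can cut the per-statement commit overhead even further.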

Redis move all keys from one database to another

Is there a command to move Redis keys from one database to another, or is it possible only with Lua scripting?
This type of question has been asked previously (redis move all keys), but the answers are not appropriate or convincing for a beginner like me.
u can use "MOVE" to move one key to another redis database;
the text below is from redis.io
MOVE key db
Move key from the currently selected database (see SELECT) to the specified destination database. When key already exists in the destination database, or it does not exist in the source database, it does nothing. It is possible to use MOVE as a locking primitive because of this.
Return value
Integer reply, specifically:
1 if key was moved.
0 if key was not moved.
I think this will do the job:
redis-cli keys '*' | xargs -I % redis-cli move % 1 > /dev/null
(1 is the new database number, and redirection to /dev/null is in order to avoid getting millions of '1' lines - since it will move the keys one by one and return 1 each time)
Beware that Redis might run out of connections and then display tons of errors like this:
Could not connect to Redis at 127.0.0.1:6379: Cannot assign requested address
So it could be better (and much faster) to just dump the database and then import it into the new one.
If you have a big database with millions of keys, you can use the SCAN command to select all the keys without blocking, unlike the dangerous KEYS command that even the Redis authors do not recommend.
SCAN gives you the keys in "pages", one page at a time; the idea is to start at page 0 (formally, cursor 0) and then continue with the next page/cursor until you hit the end (the stop signal is getting cursor 0 again).
You can use any popular language for this, such as Ruby or Scala. Here is a draft using Bash scripting:
#!/bin/bash -e
REDIS_HOST=localhost
PAGE_SIZE=10000
KEYS_TO_QUERY="*"
SOURCE_DB=0
TARGET_DB=1
TOTAL=0
# Loop until SCAN hands back cursor 0 again, which marks the end of the keyspace
while [[ "$CURSOR" != "0" ]]; do
  CURSOR=${CURSOR:-0}
  >&2 echo $TOTAL:$CURSOR
  KEYS=$(redis-cli -h $REDIS_HOST -n $SOURCE_DB scan $CURSOR match "$KEYS_TO_QUERY" count $PAGE_SIZE)
  unset CURSOR
  for KEY in $KEYS; do
    if [[ -z $CURSOR ]]; then
      # The first line of the SCAN reply is the next cursor
      CURSOR=$KEY
    else
      TOTAL=$(($TOTAL + 1))
      redis-cli -h $REDIS_HOST -n $SOURCE_DB move $KEY $TARGET_DB
    fi
  done
done
IMPORTANT: as usual, please do not copy and paste scripts without understanding what they do, so here are some details:
The while loop selects the keys page by page with the SCAN command and then runs the MOVE command for every key.
The SCAN command returns the next cursor in the first line, and the rest of the lines are the keys found. The while loop starts with the CURSOR variable undefined and defines it on the first iteration (this is a small trick so the loop stops at the next cursor 0, which signals the end of the scan).
PAGE_SIZE controls how many keys each SCAN query asks for. Lower values put very little load on the server but are slow; bigger values make the server "sweat" but are faster. The network is also affected, so try to find a sweet spot around 10000 or even 50000 (ironically, values of 1 or 2 may also stress the server, but because of the network overhead wrapping each query).
KEYS_TO_QUERY: a pattern for the keys you want to query; for example "*balance*" will select the keys whose names include "balance" (don't forget the quotes to avoid syntax errors). Alternatively, you can do the filtering on the script side: query all keys with "*" and add a Bash if conditional. This is slower, but it helps if you cannot find a pattern for your key selection.
REDIS_HOST: localhost by default; change it to whatever server you like (if you are using a custom port other than the default 6379, add a -p <port> option to the redis-cli calls).
SOURCE_DB: the database ID you want to move the keys from (0 by default).
TARGET_DB: the database ID you want to move the keys to (1 by default).
You can use this script to execute other commands or checks against the keys; just replace the MOVE command call with whatever you need.
NOTE: to move keys from one Redis server to another (not just between internal databases), you can use redis-utils-cli from NPM: https://www.npmjs.com/package/redis-utils-cli

Database temporarily disconnected after a lot of transactions by pgbench

I am using PostgreSQL 9.2.1 and testing the database with pgbench.
pgbench -h 192.168.39.38 -p 5433 -t 1000 -c 40 -j 8 -C -U admin testdb
When I use the -C parameter (establish a new connection for each transaction), transactions are always lost after the 16381st transaction:
Connection to database "testdb" failed
could not connect to server: Can't assign requested address
Is the server running on host "192.168.39.38" and accepting
TCP/IP connections on port 5433?
Client 19 aborted in establishing connection.
Connection to database "testdb" failed
could not connect to server: Can't assign requested address
Is the server running on host "192.168.39.38" and accepting
TCP/IP connections on port 5433?
Client 19 aborted in establishing connection.
....
transaction type: TPC-B (sort of)
scaling factor: 30
query mode: simple
number of clients: 40
number of threads: 8
number of transactions per client: 1000
number of transactions actually processed: 16381/40000
tps = 1665.221801 (including connections establishing)
tps = 9487.779510 (excluding connections establishing)
And the number of transactions actually processed is always 16381 in each test.
However, pgbench succeeds and all transactions are processed when
-C is not used
or
the total number of transactions is less than 16381.
After these transactions are dropped, the database starts accepting connections again within a few seconds.
I wonder if I miss some configuration of PostgreSQL.
Thanks
Edit: I found that the client is blocked from connecting for a few seconds, while the other clients can still access the database. Does that mean the same client cannot send too many transactions in a short time?
I found the reason why it loses connections after about 16000 transactions: the TCP TIME_WAIT state is to blame. The following command will show the status of TCP connections:
$ netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
Nevertheless, it does NOT show TIME_WAIT on Mac OS X, so I missed it. After adjusting the TCP MSL (which determines how long sockets linger in TIME_WAIT) with the following command, pgbench works properly:
$ sudo sysctl -w net.inet.tcp.msl=1500
net.inet.tcp.msl: 15000 -> 1500
Thanks for helping.
There is indeed a limit on the maximum number of connections, imposed by the OS. Read up on max_connections in the documentation (relevant parts bolded):
Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections, but might be less if your kernel settings will not support it (as determined during initdb). This parameter can only be set at server start.
Increasing this parameter might cause PostgreSQL to request more System V shared memory or semaphores than your operating system's default configuration allows. See Section 17.4.1 for information on how to adjust those parameters, if necessary.
That you can open only 16381 connections can be explained by there being 2^14 (= 16384) possible connections at most, minus the 3 connections reserved by default for superuser connections (see the documentation).
It's interesting that 16381 is so close to a power of 2.
This is largely speculation:
I'm wondering whether it's an OS thing. Looking at the TPS figures, is a new connection being created for every transaction? [Edit yes, now that I read your question properly.]
Perhaps the OS has only so many connection resources it can use, and it cannot immediately create a new connection after having made 16381 (plus a few additional ones) in the recent past?
There may be an OS setting for specifying the number of connection resources to make available, which could allow more connections to be used. Can you add some OS details to the question?
In particular, I would suspect that the port number you connect from keeps increasing and you're hitting a limit. Try "lsof -i", see if you can catch a connection as it happens, and check whether the port number is going up.
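For example, on the client machine while pgbench is running (5433 is the port from the question):
$ # Each new connection consumes another local (ephemeral) source port
$ lsof -i -n | grep 5433 | head
$ # Count how many connections are open or lingering at once
$ lsof -i -n | grep 5433 | wc -l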
I solved it by adding the following to /etc/sysctl.conf:
net.ipv4.ip_local_port_range = 32768 65000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10