Does Redis automatically "memoize" or "recognize duplicates"? - redis

If you push two k,v pairs into Redis with the same value, that is
set(k1, v)
set(k2, v)
does Redis smarlty store v once behind the scenes and do something like:
set(somereference, v)
set(k1, #somereference)
set(k2, #somereference)
But still return the perception of (k1, v), (k2, v)?
I ask because Right now, from Python, I am pushing values into redis of the form:
pickle({"some sequence number" : xxx, "image-bytes" : some long bytestring})
And am wondering if it's worth restructuring how I'm doing this if two of these dicts actually contain the same image bytestring and redis would be able to only store the underlying value once.

No, redis doesn't deduplicate on its own:
redis$ du -h dump.rdb
4.0K dump.rdb
redis$ ipython3
In [1]: %paste
import os
from redis import StrictRedis
data = os.urandom(1024)
redis = StrictRedis()
for i in range(1000000):
redis.set(f'key{i}', data)
## -- End pasted text --
In [2]:
Do you really want to exit ([y]/n)?
redis$ du -h dump.rdb
633M dump.rdb
The database dump is compressed with LZW so it's a little smaller than the expected size.
As an aside, I've found msgpack to be much faster than pickle and pretty much everything else for packing and unpacking literals.

Related

How can I get the memory footprint of a specific key in redis?

I'm new to Redis. How can I get the memory footprint of a specific key in redis?
db0
1) "unacked_mutex"
2) "_kombu.binding.celery"
3) "_kombu.binding.celery.pidbox"
4) "_kombu.binding.celeryev"
I just want to get the memory footprint of one specific key like "_kombu.binding.celery" , or one specific db like db0 , how can I get it?
redis_version: 2.8.17
Any commentary is very welcome. great thanks.
You are running a very old version of redis. The MEMORY command is not available in that version, so there is no precise way of getting at this information. However, you can approximate this information using the DUMP command.
Simply call DUMP _kombu.binding.celery and save the results to a file. The result is some characters and escape sequences. When you load this file into an environment like node, you can look at the length of the string and multiply by 2 to get the number of bytes. This is not precise, but it will give you a generally close approximation.
Here's what you could do:
in Redis:
$ redis-cli
127.0.0.1:6379> hset c 123 456
(integer) 0
127.0.0.1:6379> dump c
"\r\x12\x12\x00\x00\x00\r\x00\x00\x00\x02\x00\x00\xfe{\x03\xc0\xc8\x01\xff\t\x00\x10\xd4L \x908\x8b2"
in Node:
$ node
> a="\r\x12\x12\x00\x00\x00\r\x00\x00\x00\x02\x00\x00\xfe{\x03\xc0\xc8\x01\xff\t\x00\x10\xd4L \x908\x8b2"
'\r\u0012\u0012\u0000\u0000\u0000\r\u0000\u0000\u0000\u0002\u0000\u0000þ{\u0003ÀÈ\u0001ÿ\t\u0000\u0010ÔL 82'
> a.length
30
This is close to half of the actual amount that redis provides with MEMORY USAGE:
127.0.0.1:6379> MEMORY USAGE c
(integer) 63
MEMORY USAGE _kombu.binding.celery would give you the number of bytes that a key and value require to be stored in RAM.
Here is the doc for the command.

Redis mass insertion: protocol vs inline commands

For my task I need to load a bulk of data into Redis as soon as possible. It looks like this article is right about my case: https://redis.io/topics/mass-insert
The article starts from giving an example of using multiple inline SET commands with redis-cli. Then they proceed to generating Redis protocol and again use it with redis-cli. They don't explain the reasons or benefits of using Redis protocol.
Using of Redis protocol is a bit harder and it generates a bit more traffic. I wonder, what are the reasons to use Redis protocol rather than simple one-line commands? Probably despite the fact the data is larger, it is easier (and faster) for Redis to parse it?
Good point.
Only a small percentage of clients support non-blocking I/O, and not
all the clients are able to parse the replies in an efficient way in
order to maximize throughput. For all this reasons the preferred way
to mass import data into Redis is to generate a text file containing
the Redis protocol, in raw format, in order to call the commands
needed to insert the required data.
What I understood is that you emulate a client when you use Redis protocol directly, which would benefit from the highlighted points.
Based on the docs you provided, I tried these scripts:
test.rb
def gen_redis_proto(*cmd)
proto = ""
proto << "*"+cmd.length.to_s+"\r\n"
cmd.each{|arg|
proto << "$"+arg.to_s.bytesize.to_s+"\r\n"
proto << arg.to_s+"\r\n"
}
proto
end
(0...100000).each{|n|
STDOUT.write(gen_redis_proto("SET","Key#{n}","Value#{n}"))
}
test_no_protocol.rb
(0...100000).each{|n|
STDOUT.write("SET Key#{n} Value#{n}\r\n")
}
ruby test.rb > 100k_prot.txt
ruby test_no_protocol.rb > 100k_no_prot.txt
time cat 100k.txt | redis-cli --pipe
time cat 100k_no_prot.txt | redis-cli --pipe
I've got these results:
teixeira: ~/stackoverflow $ time cat 100k.txt | redis-cli --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 100000
real 0m0.168s
user 0m0.025s
sys 0m0.015s
(5 arquivo(s), 6,6Mb)
teixeira: ~/stackoverflow $ time cat 100k_no_prot.txt | redis-cli --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 100000
real 0m0.433s
user 0m0.026s
sys 0m0.012s

In Lettuce(4.x) for Redis how to reduce round trips and use output of one command as input for another command, especially for Georadius

I have seen this pass results to another command in redis
and using via command line this command works well :
src/redis-cli keys '*' | xargs src/redis-cli mget
However how can we achieve the same effect via Lettuce (i started trying out 4.0.2.Final)
Also a solution to this is particularly important in the following scenario :
Say we are using geolocation capabilities, and we add a set of locations of "my-location-category"
using GEOADD
GEOADD "category-1" 8.6638775 49.5282537 "location-id:1" 8.3796281 48.9978127 "location-id:2" 8.665351 49.553302 "location-id:3"
Next, say we do a GeoRadius to get locations within 10 km radius of 8.6582361 49.5285495 for "category-1"
Now when we get "location-id:1" & "location-id:3"
Given that I already set values for above keys "location-id:1" & "location-id:3"
I want to pipe commands to do the GEORADIUS as well as do mget on all the matching results.
Does Redis provide feature to do that?
and / or how can we achieve this via the Lettuce client library without first manually iterating through results of GEORADIUS and then do manual mget.
That would be more efficient performance for the program that uses it.
Does anyone know how we can do this ?
Update
This is the piped command for the scenario I discussed above :
src/redis-cli GEORADIUS "category-1" 8.6582361 49.5285495 10 km | xargs src/redis-cli mget
Now we need to know how to do this via Lettuce
IMPORTANT: never use KEYS, always use SCAN instead if you must.
This isn't really a question about Lettuce nor Java so I can actually answer it :)
What you're trying to do is use the results from a read operation (GEORADIUS) as input (key names) for another read operation (MGET). This type of flow can't be pipelined, well, just because of that - pipelining means that you don't need the answers for operations right away but in you case you do.
However.
Since you're reading String keys with MGET, you might as well just denormalize everything (remember, we're NoSQL) and store the contents of these keys in the Sorted Set's members, e.g.:
GEOADD "category-1" 8.6638775 49.5282537 "location-id:1:moredata:evenmoredata:{maybe a JSON document here}:orperhapsmsgpack"
This will allow you to get the locations and their "data" with one GEORADIUS call. Of course, any updates to location:1's data will need to be done across all categories.
A note about Lua scripts: while a Lua script could definitely save on the back and forth in this case, any such script will be against best practices/not cluster safe.
After digging around and studying Lua script, my conclusion is that removing round-trips in such a way can only be done via Lua scripts as suggested by Itamar Haber.
I ended up creating a lua script file (myscript.lua) as below
local locationKeys = redis.call('GEORADIUS', 'category-1', '8.6582361', '49.5285495', '10', 'km' )
if unpack(locationKeys) == nil then
return nil
else
return redis.call('MGET', unpack(locationKeys))
end
** of course we should be sending in parameters to this... this is just a poc :)
now you can execute it via command
src/redis-cli EVAL "$(cat myscript.lua)" 0
Then to reduce the network-overhead of sending across the entire script to Redis for execution, we have the option of registering the script with Redis.
Redis will give us a sha1 digested code for future references for that script, which can be used for next calls to that script.
This can be done as below :
src/redis-cli SCRIPT LOAD "$(cat myscript.lua)"
this should give back a sha1 code something like this : 49730aa2ed3034ee48f818e486tpbdf1b500b19e
next calls can be done using this code
eg
src/redis-cli evalsha 49730aa2ed3034ee48f818e486b2bdf1b500b19e 0
The sad part however here is that the sha1 digest is remembered only so long as the instance of redis is running. If it is restarted, that the sha1 digest is lost. Then you do the SCRIPT LOAD once again. And if nothing changes in the script, then the sha1-digest code will be the same.
Ideally while using through client api, we should first attempt evalsha, if that returns a "No matching script" error, then as a fallback do script load, and procure the sha1 code once again, and create an internal map of that and use that sha1 code for further calls.
This can well be done via Lettuce. I could find the methods for those. Hope this gives a good insight into solution for the problem.

Hadoop jobs getting poor locality

I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
-files hdfs:///apps/local/count.pl \
-input /foo/data/bz2 \
-output /user/me/myoutput \
-mapper "cut -f4,8 -d," \
-reducer count.pl \
-combiner count.pl
The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.
The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node ] is relevant but not conclusive.
EDIT: at #jtravaglini's suggestion I tried the following variation and saw the same problem - all 45 map jobs running on a single node:
yarn jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar \
wordcount /foo/data/bz2 /user/me/myoutput
At the end of the output of that task in my shell, I see:
Launched map tasks=45
Launched reduce tasks=1
Data-local map tasks=18
Rack-local map tasks=27
which is the number of data-local tasks you'd expect to see on one node just by chance alone.

Hadoop S3 No Space Left On Device

I am running a map reduce job that takes a small input (~3MB, list of integers of size z),
with a sparse matrix cache of size n x m, and basically outputs z sparse vectors of dimension (n x 1). The output here is pretty big (~2TB). I am running 20 m1.small nodes on Amazon EC2 with S3 storage as inputs and output.
However, I am getting a IOException: No space left on device.
It seems like there are s3 bytes written on Hadoop logs, but no files are created.
When I used a smaller input (smaller z), the output is correctly there after the job is done.
Thus, I believe that it runs out on a temporary storage.
Is there way to check where this temporary storage is?
Also, funny thing is that the log is saying that all the bytes are written to s3, but I see no files and don't know where these bytes are being written.
Thank you for your help.
Example code (Have also tried to split into map and reduce job with same error)
public void map(LongWritable key, Text value,
Mapper<LongWritable, Text, LongWritable, VectorWritable>.Context context)
throws IOException, InterruptedException
{
// Assume the input is id \t number
String[] input = value.toString().split("\t");
int idx = Integer.parseInt(input[0]) - 1;
// Some operations to do, but basically outputting a vector
// Collect the output
context.write(new LongWritable(idx), new VectorWritable(matrix.getColumn(idx)));
};
Amazon EMR supports a couple of versions. These are the default values 0.20.205
hadoop.tmp.dir - /tmp/hadoop-${user.name} - A base for other temporary directories.
mapred.local.dir - ${hadoop.tmp.dir}/mapred/local - The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
mapred.temp.dir - ${hadoop.tmp.dir}/mapred/temp - A shared directory for temporary files.
Run the du --max-depth=7 /home/xyz | sort -n command on the hadoop.tmp.dir and check which directory is occupying the most space. Although hadoop.tmp.dir says temporary, it stores system and data files also.