Hash fields in Redis not ordered the same way as input - redis

I've got an associative array of the type date => data, e.g.:
[
'2015-11-18' => 'some_data',
'2015-11-17' => 'some_data',
'2015-11-16' => 'some_data'
]
and I push them into a hash, where the array key (the date) is the field of the hash and the value is the value... But in Redis they are not kept in the same order as they were input (and I need them to be). Furthermore, when I get all the fields (HKEYS), they come back in a completely different order from the one they are stored in within Redis.
Is there a way to keep them in the same order I inserted them, both when storing them and when getting the fields back?

You'll need to use two structures to implement an ordered associative array in Redis. One way to do it is to store the keys in order in a list, and also store the key => value mapping in a hash.
keys list:
[
'2015-11-18',
'2015-11-17',
'2015-11-16'
]
hash:
{
'2015-11-18' => 'some data',
'2015-11-16' => 'some data',
'2015-11-17' => 'some data'
}
You can use scripts to atomically update the two structures. An add operation script could look like:
eval "
-- append the key to the ordering list, then set it in the hash
-- (this pushes unconditionally; check HEXISTS first if a field may be written more than once)
redis.call('rpush', KEYS[1], ARGV[1]);
return redis.call('hset', KEYS[2], ARGV[1], ARGV[2]);
" 2 'keys' 'values' '2015-11-15' 'some data'
And a remove operation script could look like:
eval "
-- LREM needs a count argument; 0 removes all occurrences of the value
redis.call('lrem', KEYS[1], 0, ARGV[1]);
return redis.call('hdel', KEYS[2], ARGV[1]);
" 2 'keys' 'values' '2015-11-15'
A get-by-key operation is just a normal hash get:
hget 'values' '2015-11-15'
A get-by-index operation script could look like:
eval "
local k = redis.call('lindex', KEYS[1], ARGV[1]);
return redis.call('hget', KEYS[2], k);
" 2 'keys' 'values' 1
To get the keys in-order would be a simple lrange:
lrange 'keys' 0 -1
To get the values in-order, you could use:
eval "
local k = redis.call('lrange', KEYS[1], 0, -1);
return redis.call('hmget', KEYS[2], unpack(k));
" 2 'keys' 'values'

Redis Hashes do not maintain order, nor do they make any assurances with regard to the order of their output (a Redis Hash may undergo rehashing during its lifecycle). Consider using Redis' Sorted Sets instead.
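A minimal sketch of that approach, assuming the redis-py client and made-up key names dates and data: the date itself serves as the score, so ZRANGE returns the fields in chronological (or, with desc, reverse-chronological) order regardless of insertion order.
import datetime
import redis

r = redis.Redis(decode_responses=True)

def add_entry(date_str, data):
    # the date doubles as the score, keeping members chronologically ordered
    score = int(datetime.date.fromisoformat(date_str).strftime("%Y%m%d"))
    r.zadd("dates", {date_str: score})   # ordered index of dates
    r.hset("data", date_str, data)       # date -> payload

def entries_in_order(reverse=False):
    dates = r.zrange("dates", 0, -1, desc=reverse)
    return list(zip(dates, r.hmget("data", dates))) if dates else []

add_entry("2015-11-16", "some_data")
add_entry("2015-11-17", "some_data")
add_entry("2015-11-18", "some_data")
print(entries_in_order(reverse=True))   # newest first, as in the question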

Related

Efficient Redis SCAN of multiple key patterns

I'm trying to power some multi-selection query & filter operations with SCAN operations on my data and I'm not sure if I'm heading in the right direction.
I am using AWS ElastiCache (Redis 5.0.6).
Key design: <recipe id>:<recipe name>:<recipe type>:<country of origin>
Example:
13434:Guacamole:Dip:Mexico
34244:Gazpacho:Soup:Spain
42344:Paella:Dish:Spain
23444:HotDog:StreetFood:USA
78687:CustardPie:Dessert:Portugal
75453:Churritos:Dessert:Spain
If I want to power queries with complex multi-selection filters (for example, return all keys matching five recipe types from two different countries), which the SCAN glob-style MATCH pattern can't handle, what is the common way to go about this in a production scenario?
Assuming that I will calculate all possible patterns by taking the Cartesian product of the per-field alternatives and multi-field filters:
[[Guacamole, Gazpacho], [Soup, Dish, Dessert], [Portugal]]
*:Guacamole:Soup:Portugal
*:Guacamole:Dish:Portugal
*:Guacamole:Dessert:Portugal
*:Gazpacho:Soup:Portugal
*:Gazpacho:Dish:Portugal
*:Gazpacho:Dessert:Portugal
What mechanism should I use to implement this sort of pattern matching in Redis?
1) Do multiple SCANs, one per scannable pattern, sequentially, and merge the results?
2) A Lua script that applies improved pattern matching to each key while scanning, returning all matching keys in a single SCAN?
3) An index built on top of Sorted Sets supporting fast lookups of keys matching single fields, solving alternation within the same field with ZUNIONSTORE and intersection across different fields with ZINTERSTORE?
<recipe name>:: => key1, key2, keyN
:<recipe type>: => key1, key2, keyN
::<country of origin> => key1, key2, keyN
4) An index built on top of Sorted Sets supporting fast lookups of keys matching every dimensional combination, thereby avoiding unions and intersections but using more storage and extending my index keyspace footprint?
<recipe name>:: => key1, key2, keyN
<recipe name>:<recipe type>: => key1, key2, keyN
<recipe name>::<country of origin> => key1, key2, keyN
:<recipe type>: => key1, key2, keyN
:<recipe type>:<country of origin> => key1, key2, keyN
::<country of origin> => key1, key2, keyN
5) Leverage RediSearch? (Impossible for my use case, but see Tug Grall's answer, which looks like a very nice solution.)
Other?
I've implemented 1) and performance is awful.
private static HashSet<String> redisScan(Jedis jedis, String pattern, int scanLimitSize) {
    ScanParams params = new ScanParams().count(scanLimitSize).match(pattern);
    ScanResult<String> scanResult;
    List<String> keys;
    String nextCursor = "0";
    HashSet<String> allMatchedKeys = new HashSet<>();
    do {
        scanResult = jedis.scan(nextCursor, params);
        keys = scanResult.getResult();
        allMatchedKeys.addAll(keys);
        nextCursor = scanResult.getCursor();
    } while (!nextCursor.equals("0"));
    return allMatchedKeys;
}

private static HashSet<String> redisMultiScan(Jedis jedis, ArrayList<String> patternList, int scanLimitSize) {
    HashSet<String> mergedHashSet = new HashSet<>();
    for (String pattern : patternList)
        mergedHashSet.addAll(redisScan(jedis, pattern, scanLimitSize));
    return mergedHashSet;
}
For 2) I created a Lua script to do the SCAN server-side. Its performance is not brilliant, but it is much faster than 1), even taking into account that Lua patterns don't support alternation and I have to loop each key through a pattern list for validation:
local function MatchAny( str, pats )
    for pat in string.gmatch(pats, '([^|]+)') do
        local w = string.match( str, pat )
        if w then return w end
    end
end

-- ARGV[1]: Scan Count
-- ARGV[2]: Scan Match Glob-Pattern
-- ARGV[3]: Patterns
local cur = 0
local rep = {}
local tmp
repeat
    tmp = redis.call("SCAN", cur, "MATCH", ARGV[2], "count", ARGV[1])
    cur = tonumber(tmp[1])
    if tmp[2] then
        for k, v in pairs(tmp[2]) do
            local fi = MatchAny(v, ARGV[3])
            if (fi) then
                rep[#rep+1] = v
            end
        end
    end
until cur == 0
return rep
Called in such a fashion:
private static ArrayList<String> redisLuaMultiScan(Jedis jedis, String luaSha, List<String> KEYS, List<String> ARGV) {
    Object response = jedis.evalsha(luaSha, KEYS, ARGV);
    if (response instanceof List<?>)
        return (ArrayList<String>) response;
    else
        return new ArrayList<>();
}
For 3) I implemented and maintain a secondary index for each of the 3 fields using Sorted Sets, and query it with alternation on single fields and multi-field filters like this:
private static Set<String> redisIndexedMultiPatternQuery(Jedis jedis, ArrayList<ArrayList<String>> patternList) {
    ArrayList<String> unionedSets = new ArrayList<>();
    String keyName;
    Pipeline pipeline = jedis.pipelined();
    for (ArrayList<String> subPatternList : patternList) {
        if (subPatternList.isEmpty()) continue;
        keyName = "un:" + RandomStringUtils.random(KEY_CHAR_COUNT, true, true);
        pipeline.zunionstore(keyName, subPatternList.toArray(new String[0]));
        unionedSets.add(keyName);
    }
    String[] unionArray = unionedSets.toArray(new String[0]);
    keyName = "in:" + RandomStringUtils.random(KEY_CHAR_COUNT, true, true);
    pipeline.zinterstore(keyName, unionArray);
    Response<Set<String>> response = pipeline.zrange(keyName, 0, -1);
    pipeline.del(unionArray);
    pipeline.del(keyName);
    pipeline.sync();
    return response.get();
}
The results of my stress test cases clearly favor 3) in terms of request latency.
I would vote for option 3, but I will probably start to use RediSearch.
Also, have you looked at RediSearch? This module lets you create a secondary index and run complex queries and full-text search.
This may simplify your development.
I invite you to look at the project and its Getting Started guide.
Once it is installed, you will be able to achieve this with the following commands:
HSET recipe:13434 name "Guacamole" type "Dip" country "Mexico"
HSET recipe:34244 name "Gazpacho" type "Soup" country "Spain"
HSET recipe:42344 name "Paella" type "Dish" country "Spain"
HSET recipe:23444 name "Hot Dog" type "StreetFood" country "USA"
HSET recipe:78687 name "Custard Pie" type "Dessert" country "Portugal"
HSET recipe:75453 name "Churritos" type "Dessert" country "Spain"
FT.CREATE idx:recipe ON HASH PREFIX 1 recipe: SCHEMA name TEXT SORTABLE type TAG SORTABLE country TAG SORTABLE
FT.SEARCH idx:recipe "@type:{Dessert}"
FT.SEARCH idx:recipe "@type:{Dessert} @country:{Spain}" RETURN 1 name
FT.AGGREGATE idx:recipe "*" GROUPBY 1 @type REDUCE COUNT 0 as nb_of_recipe
I am not explaining all the commands in detail here, since you can find the explanations in the tutorial, but here are the basics:
use a hash to store the recipes
create a RediSearch index and index the fields you want to query
Run queries, for example:
To get all Spanish desserts: FT.SEARCH idx:recipe "@type:{Dessert} @country:{Spain}" RETURN 1 name
To count the number of recipes by type: FT.AGGREGATE idx:recipe "*" GROUPBY 1 @type REDUCE COUNT 0 as nb_of_recipe
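If you are driving this from application code rather than redis-cli, here is a hedged sketch using redis-py's generic execute_command (index and field names as above; a dedicated RediSearch client would work just as well):
import redis

r = redis.Redis(decode_responses=True)

# FT.* commands can be sent through execute_command when no dedicated client is used
spanish_desserts = r.execute_command(
    "FT.SEARCH", "idx:recipe", "@type:{Dessert} @country:{Spain}", "RETURN", "1", "name")
per_type_counts = r.execute_command(
    "FT.AGGREGATE", "idx:recipe", "*", "GROUPBY", "1", "@type",
    "REDUCE", "COUNT", "0", "AS", "nb_of_recipe")
print(spanish_desserts)
print(per_type_counts)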
I ended up using a simple strategy to update each secondary index for each field when the key is created:
protected static void setKeyAndUpdateIndexes(Jedis jedis, String key, String value, int idxDimSize) {
    String[] key_arr = key.split(":");
    Pipeline pipeline = jedis.pipelined();
    pipeline.set(key, value);
    for (int y = 0; y < key_arr.length; y++)
        pipeline.zadd(
                "idx:" +
                StringUtils.repeat(":", y) +
                key_arr[y] +
                StringUtils.repeat(":", idxDimSize - y),
                java.time.Instant.now().getEpochSecond(),
                key);
    pipeline.sync();
}
The search strategy for finding keys that match alternation patterns and multi-field filters looks like this:
private static Set<String> redisIndexedMultiPatternQuery(Jedis jedis, ArrayList<ArrayList<String>> patternList) {
    ArrayList<String> unionedSets = new ArrayList<>();
    String keyName;
    Pipeline pipeline = jedis.pipelined();
    for (ArrayList<String> subPatternList : patternList) {
        if (subPatternList.isEmpty()) continue;
        keyName = "un:" + RandomStringUtils.random(KEY_CHAR_COUNT, true, true);
        pipeline.zunionstore(keyName, subPatternList.toArray(new String[0]));
        unionedSets.add(keyName);
    }
    String[] unionArray = unionedSets.toArray(new String[0]);
    keyName = "in:" + RandomStringUtils.random(KEY_CHAR_COUNT, true, true);
    pipeline.zinterstore(keyName, unionArray);
    Response<Set<String>> response = pipeline.zrange(keyName, 0, -1);
    pipeline.del(unionArray);
    pipeline.del(keyName);
    pipeline.sync();
    return response.get();
}

SCAN command performance with phpredis

I'm replacing KEYS with SCAN using phpredis.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);
$it = NULL;
while ($arr_keys = $redis->scan($it, "mykey:*", 10000)) {
    foreach ($arr_keys as $str_key) {
        echo "Here is a key: $str_key\n";
    }
}
According to the Redis documentation, I use SCAN to paginate searches and avoid the disadvantages of KEYS.
But in practice, the code above is about 3 times slower than a single $redis->keys() call.
So I'm wondering whether I've done something wrong, or whether this is the price in speed I have to pay to avoid the dangers of KEYS.
Note that I have 400K+ keys in total in my db, and only 4 mykey:* keys.
A word of caution about using the example:
$it = NULL;
while ($arr_keys = $redis->scan($it, "mykey:*", 10000)) {
    foreach ($arr_keys as $str_key) {
        echo "Here is a key: $str_key\n";
    }
}
That can return an empty array if none of the 10000 keys scanned in an iteration matches, and the while loop will then give up before the cursor is exhausted, so you won't get all the keys you wanted! I would recommend doing something more like this:
$it = null;
do
{
    $arr_keys = $redis->scan($it, $key, 10000);
    if (is_array($arr_keys) && !empty($arr_keys))
    {
        foreach ($arr_keys as $str_key)
        {
            echo "Here is a key: $str_key\n";
        }
    }
} while ($arr_keys !== false);
As for why it takes so long: with 400k+ keys and a COUNT of 10000, that's roughly 40 SCAN requests to Redis, and if the server is not on the local machine you add a round trip of network latency to each of those 40 queries.
Using KEYS in production environments is effectively forbidden because it blocks the entire server while iterating the global keyspace, so there is no real discussion about whether or not to use KEYS.
On the other hand, if you want to speed things up, you should go further with Redis: index your data.
I doubt those 400K keys couldn't be categorized into sets or sorted sets, or even hashes, so that when you need a particular subset of your 400K-key database you run the scan-equivalent command against a set of about 1K items instead of 400K.
Redis is about indexing data; otherwise you're using it as just a simple key-value store.
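To illustrate the idea (the question uses phpredis, but the principle is the same in any client), a minimal sketch with redis-py and a made-up index set name my:keys:index: every write also records the key name in a small set, and lookups iterate that set instead of the whole keyspace.
import redis

r = redis.Redis(decode_responses=True)

def set_with_index(key, value):
    # maintain a small secondary set holding only the keys of this category
    pipe = r.pipeline()
    pipe.set(key, value)
    pipe.sadd("my:keys:index", key)
    pipe.execute()

def keys_matching(pattern):
    # SSCAN a ~1K-member set instead of SCANning 400K+ keys
    return list(r.sscan_iter("my:keys:index", match=pattern))

set_with_index("mykey:1", "a")
set_with_index("mykey:2", "b")
print(keys_matching("mykey:*"))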

redis hmget with wildcard fields

I have a hash in Redis like the one below.
"abcd" : {
"rec.number.984567": "value1",
"rec.number.973956": "value2",
"rec.number.990024": "value3",
"rec.number.910842": "value4",
"rec.number.910856": "...",
"other.abcd.efgh": "some value",
"other.xyza.blah": "some other value"
"..." : "...",
"..." : "...",
"..." : "...",
"..." : "..."
}
If I call hgetall abcd, it gives me all the fields in the hash. My objective is to get only those fields that begin with "rec.number". When I call something like
redis-cli hmget "abcd" "rec.number*"
it gives me a result like
1)
Is there a way to retrieve data for only those keys which start with my expected pattern? I want to retrieve only those keys because my dataset contains many other irrelevant fields.
HMGET does not support wildcards in field names. You can use HSCAN for that:
HSCAN abcd 0 MATCH rec.number*
More about the SCAN family of commands can be found in the official docs.
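From a client, the same thing without manual cursor handling: a sketch with redis-py's hscan_iter (hash name as in the question).
import redis

r = redis.Redis(decode_responses=True)

# iterate only the fields of hash "abcd" whose names match the glob pattern
rec_numbers = {field: value
               for field, value in r.hscan_iter("abcd", match="rec.number*")}
print(rec_numbers)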
Lua way
This script does it with Lua scripting:
local rawData = redis.call('HGETALL', KEYS[1]);
local ret = {};
for idx = 1, #rawData, 2 do
    if string.match(rawData[idx], ARGV[1]) then
        -- keep the matching field name and its value as a flat reply
        ret[#ret + 1] = rawData[idx];
        ret[#ret + 1] = rawData[idx + 1];
    end
end
return ret;
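A sketch of calling that script from redis-py; note that string.match expects a Lua pattern rather than a glob, so the literal dot is escaped with % (register_script wraps EVAL/EVALSHA):
import redis

r = redis.Redis(decode_responses=True)

filter_hash = r.register_script("""
local rawData = redis.call('HGETALL', KEYS[1]);
local ret = {};
for idx = 1, #rawData, 2 do
    if string.match(rawData[idx], ARGV[1]) then
        ret[#ret + 1] = rawData[idx];
        ret[#ret + 1] = rawData[idx + 1];
    end
end
return ret;
""")

# '^rec%.number' is a Lua pattern: anchored prefix with the dot escaped
flat = filter_hash(keys=["abcd"], args=["^rec%.number"])
print(dict(zip(flat[0::2], flat[1::2])))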
A nice intro to using redis-cli and Lua in Redis can be found in A Guide for Redis Users.

How to convert a list of attribute-value pairs into a flat table whose columns are attributes

I'm trying to convert a csv file containing 3 columns (ATTRIBUTE_NAME,ATTRIBUTE_VALUE,ID) into a flat table whose each row is (ID,Attribute1,Attribute2,Attribute3,....). The samples of such tables are provided at the end.
Either Python, Perl or SQL is fine. Thank you very much and I really appreciate your time and efforts!
In fact, my question is very similar to this post, except that in my case the number of attributes is pretty big (~300) and not consistent across each ID, so hard coding each attribute might not be a practical solution.
For me, the challenging/difficult parts are:
There are approximately 270 million lines of input; the total size of the input table is about 60 GB.
Some string values contain commas, and in that case the whole string is enclosed in double quotes (") to make the reader aware of it. For example "JPMORGAN CHASE BANK, NA, TX" in ID=53.
The set of attributes is not the same across IDs. For example, the number of attributes overall is 8, but ID=53, 17 and 23 have only 7, 6 and 5 respectively. ID=17 does not have the attributes string_country and string_address, so nothing is output after the comma.
The input attribute-value table looks like this. In this sample input and output there are 3 IDs, whose number of attributes differs depending on whether we could obtain those attributes from the server or not.
ATTRIBUTE_NAME,ATTRIBUTE_VALUE,ID
num_integer,100,53
string_country,US (United States),53
string_address,FORT WORTH,53
num_double2,546.0,53
string_acc,My BankAcc,53
string_award,SILVER,53
string_bankname,"JPMORGAN CHASE BANK, NA, TX",53
num_integer,61,17
num_double,34.32,17
num_double2,200.541,17
string_acc,Your BankAcc,17
string_award,GOLD,17
string_bankname,CHASE BANK,17
num_integer,36,23
num_double,78.0,23
string_country,CA (Canada),23
string_address,VAN COUVER,23
string_acc,Her BankAcc,23
The output table should look like this. (The order of attributes in the columns is not fixed. It can be sorted alphabetically or by order-of-appearance.)
ID,num_integer,num_double,string_country,string_address,num_double2,string_acc,string_award,string_bankname
53,100,,US (United States),FORT WORTH,546.0,My BankAcc,SILVER,"JPMORGAN CHASE BANK, NA, TX"
17,61,34.32,,,200.541,Your BankAcc,GOLD,CHASE BANK
23,36,78.0,CA (Canada),VAN COUVER,,Her BankAcc,,
This program will do as you ask. It expects the name of the input file as a parameter on the command line.
Update Looking more carefully at the data I see that not all of the data fields are available for every ID. That makes things more complex if the fields are to be kept in the same order as they appear in the file.
This program works by scanning the file and accumulating all the data for output into hash %data. At the same time it builds a hash %headers, that keeps the position each header appears in the data for each ID value.
Once the file has been scanned, the collected headers are sorted by finding the first ID for each pair that includes information for both headers. The sort order for that pair within the complete set must be the same as the order they appeared in the data for that ID, so it's just a matter of comparing the two position values using <=>.
Once a sorted set of headers has been created, the %data hash is dumped, accessing the complete list of values for each ID using a hash slice.
Update 2 Now that I realise the sheer size of your data I can see that my second attempt was also flawed, as it tried to read all of the information into memory before outputting it. That isn't going to work unless you have a monster machine with about 1TB of memory!
You may get some mileage from this version. It scans twice through the file, the first time to read the data so that the full set of header names can be created and ordered, then again to read the data for each ID and output it.
Let me know if it's not working for you, as there's still things I can do to make it more memory-efficient.
use strict;
use warnings;
use 5.010;

use Text::CSV;
use Fcntl 'SEEK_SET';

my $csv = Text::CSV->new;

open my $fh, '<', $ARGV[0] or die qq{Unable to open "$ARGV[0]" for input: $!};

# First pass: record, for each ID, the position at which every header appears
my %headers = ();
my $last_id;
my $header_num;
my $num_ids;

while (my $row = $csv->getline($fh)) {
    next if $. == 1;
    my ($key, $val, $id) = @$row;
    unless (defined $last_id and $id eq $last_id) {
        ++$num_ids;
        $header_num = 0;
        $last_id = $id;
        print STDERR "Processing ID $id\n";
    }
    $headers{$key}[$num_ids-1] = ++$header_num;
}

# Order headers by comparing their positions within the first ID that has both
sub by_position {
    for my $id (0 .. $num_ids-1) {
        my ($posa, $posb) = map $headers{$_}[$id], our $a, our $b;
        return $posa <=> $posb if $posa and $posb;
    }
    0;
}

my @headers = sort by_position keys %headers;
%headers = ();
print STDERR "List of headers complete\n";

# Second pass: emit one output row per ID
seek $fh, 0, SEEK_SET;
$. = 0;

$csv->combine('ID', @headers);
print $csv->string, "\n";

my %data = ();
$last_id = undef;

while (1) {
    my $row = $csv->getline($fh);
    next if $. == 1;
    if (not defined $row or defined $last_id and $last_id ne $row->[2]) {
        $csv->combine($last_id, @data{@headers});
        print $csv->string, "\n";
        %data = ();
    }
    last unless defined $row;
    my ($key, $val, $id) = @$row;
    $data{$key} = $val;
    $last_id = $id;
}
output
ID,num_integer,num_double,string_country,string_address,num_double2,string_acc,string_award,string_bankname
53,100,,"US (United States)","FORT WORTH",546.0,"My BankAcc",SILVER,"JPMORGAN CHASE BANK, NA, TX"
17,61,34.32,,,200.541,"Your BankAcc",GOLD,"CHASE BANK"
23,36,78.0,"CA (Canada)","VAN COUVER",,"Her BankAcc",,
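Since the question also allows Python, here is a hedged sketch of the same two-pass idea using the csv module (the input file name input.csv is an assumption). It keeps only one ID's attributes in memory at a time and relies on the rows for each ID being contiguous, as in the sample; headers are ordered by first appearance rather than by the per-ID merge the Perl version does.
import csv
import sys

INPUT = "input.csv"  # assumed file name

# First pass: collect attribute names in order of first appearance
headers = []
seen = set()
with open(INPUT, newline="") as fh:
    reader = csv.reader(fh)
    next(reader)  # skip the ATTRIBUTE_NAME,ATTRIBUTE_VALUE,ID header
    for name, _value, rec_id in reader:
        if name not in seen:
            seen.add(name)
            headers.append(name)

# Second pass: emit one wide row per ID, leaving missing attributes blank
writer = csv.writer(sys.stdout)
writer.writerow(["ID"] + headers)
with open(INPUT, newline="") as fh:
    reader = csv.reader(fh)
    next(reader)
    current_id, row = None, {}
    for name, value, rec_id in reader:
        if rec_id != current_id:
            if current_id is not None:
                writer.writerow([current_id] + [row.get(h, "") for h in headers])
            current_id, row = rec_id, {}
        row[name] = value
    if current_id is not None:
        writer.writerow([current_id] + [row.get(h, "") for h in headers])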
Use Text::CSV from CPAN:
#!/usr/bin/env perl

use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars );  # Avoids regex performance penalty

use Text::CSV;

my $col_csv     = Text::CSV->new();
my $id_attr_csv = Text::CSV->new({ eol => "\n" });

$col_csv->column_names( $col_csv->getline( *DATA ));

while ( my $row = $col_csv->getline_hr( *DATA )) {
    # do all the keys but skip if ID
    for my $attribute ( keys %$row ) {
        next if $attribute eq 'ID';
        $id_attr_csv->print( *STDOUT, [ $attribute, $row->{$attribute}, $row->{ID} ] );
    }
}
__DATA__
ID,num_integer,num_double,string_country,string_address,num_double2,string_acc,string_award,string_bankname
53,100,,US (United States),FORT WORTH,546.0,My BankAcc,SILVER,"JPMORGAN CHASE BANK, NA, TX"
17,61,34.32,,,200.541,Your BankAcc,GOLD,CHASE BANK
23,36,78.0,CA (Canada),VAN COUVER,,Her BankAcc,,

Return an array of uniq keys from a query of hstore data in rails

I would like to use hstore keys as table column headers. My approach is simply to map a Rails query that returns all keys from multiple records and then keep only the unique ones in an array.
I'll be building the table in Prawn, using both static and dynamic column headers... like this... but this doesn't work, of course.
[["DATE", "LOCATION", "DAY OFF", "START", "END" + @users_options.select("properties").map { |k,v| ",#{k}" }]]
How can I iterate over the user's logs and output only the unique keys?
I just tried this... it seems close... but it's not working yet:
a = []
user.useroptions.select(:properties).collect{ |k,v| a << k }
I created a helper method:
def keys(user)
  keys = []
  user.useroptions.select(:properties).each do |opt|
    a = opt.properties.keys
    keys << a
  end
  keys.flatten.uniq
end
This iterates through all the hstore records, grabs the keys, flattens the resulting array of arrays, and keeps only the unique values.
To finish this off, I moved the static array items into the helper using the 'unshift' method, so that I hand a single array to the Prawn table builder.
def keys(user)
  keys = []
  user.useroptions.select(:properties).each do |opt|
    a = opt.properties.keys
    keys << a
  end
  keys.flatten.uniq.unshift("DATE", "LOCATION", "DAY OFF", "START", "END")
end
Then I slip my helper into the Prawn table:
[keys(@users_logs)] +
....table rows