I'm trying to power some multi-selection query & filter operations with SCAN operations on my data and I'm not sure if I'm heading in the right direction.
I am using AWS ElastiCache (Redis 5.0.6).
Key design: <recipe id>:<recipe name>:<recipe type>:<country of origin>
Example:
13434:Guacamole:Dip:Mexico
34244:Gazpacho:Soup:Spain
42344:Paella:Dish:Spain
23444:HotDog:StreetFood:USA
78687:CustardPie:Dessert:Portugal
75453:Churritos:Dessert:Spain
If I want to power queries with complex multi-selection filters (for example, returning all keys matching any of five recipe types from two different countries), which the SCAN glob-style MATCH pattern can't handle, what is the common way to go about it in a production scenario?
Assuming that I will calculate all possible patterns by taking the Cartesian product of the alternating field patterns and multi-field filters:
[[Guacamole, Gazpacho], [Soup, Dish, Dessert], [Portugal]]
*:Guacamole:Soup:Portugal
*:Guacamole:Dish:Portugal
*:Guacamole:Dessert:Portugal
*:Gazpacho:Soup:Portugal
*:Gazpacho:Dish:Portugal
*:Gazpacho:Dessert:Portugal
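For reference, expanding such a nested alternative list into glob patterns is a small Cartesian-product loop; a minimal sketch (class and method names are mine, not from any library):
import java.util.ArrayList;
import java.util.List;

public class PatternExpander {
    // Expands e.g. [[Guacamole, Gazpacho], [Soup, Dish, Dessert], [Portugal]]
    // into the six "*:<name>:<type>:<country>" glob patterns listed above.
    static List<String> expand(List<List<String>> fields) {
        List<String> patterns = new ArrayList<>();
        patterns.add("*"); // wildcard for the leading recipe id
        for (List<String> alternatives : fields) {
            List<String> next = new ArrayList<>();
            for (String prefix : patterns)
                for (String alt : alternatives)
                    next.add(prefix + ":" + alt);
            patterns = next;
        }
        return patterns;
    }
}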
What mechanism should I use to implement this sort of pattern matching in Redis?
Multiple SCANs, one per scannable pattern, run sequentially, with the results merged?
A Lua script that applies improved pattern matching to each key while scanning, collecting all matching keys in a single SCAN pass?
An index built on top of sorted sets supporting fast lookups of keys matching a single field, solving alternation within a field with ZUNIONSTORE and intersection across fields with ZINTERSTORE?
<recipe name>:: => key1, key2, keyN
:<recipe type>: => key1, key2, keyN
::<country of origin> => key1, key2, keyN
An index built on top of sorted sets supporting fast lookups of keys matching every dimensional combination, thereby avoiding unions and intersections but wasting more storage and extending my index keyspace footprint?
<recipe name>:: => key1, key2, keyN
<recipe name>:<recipe type>: => key1, key2, keyN
<recipe name>::<country of origin> => key1, key2, keyN
:<recipe type>: => key1, key2, keyN
:<recipe type>:<country of origin> => key1, key2, keyN
::<country of origin> => key1, key2, keyN
Leverage RediSearch? (While not possible in my case, see Tug Grall's answer, which appears to be a very nice solution.)
Other?
I've implemented 1) and the performance is awful.
private static HashSet<String> redisScan(Jedis jedis, String pattern, int scanLimitSize) {
    ScanParams params = new ScanParams().count(scanLimitSize).match(pattern);
    ScanResult<String> scanResult;
    List<String> keys;
    String nextCursor = "0";
    HashSet<String> allMatchedKeys = new HashSet<>();
    do {
        scanResult = jedis.scan(nextCursor, params);
        keys = scanResult.getResult();
        allMatchedKeys.addAll(keys);
        nextCursor = scanResult.getCursor();
    } while (!nextCursor.equals("0"));
    return allMatchedKeys;
}
private static HashSet<String> redisMultiScan(Jedis jedis, ArrayList<String> patternList, int scanLimitSize) {
    HashSet<String> mergedHashSet = new HashSet<>();
    for (String pattern : patternList)
        mergedHashSet.addAll(redisScan(jedis, pattern, scanLimitSize));
    return mergedHashSet;
}
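Wiring this to the Cartesian expansion sketched earlier (the SCAN count of 1000 is an arbitrary choice of mine):
List<List<String>> fields = Arrays.asList(
        Arrays.asList("Guacamole", "Gazpacho"),
        Arrays.asList("Soup", "Dish", "Dessert"),
        Arrays.asList("Portugal"));
ArrayList<String> patternList = new ArrayList<>(PatternExpander.expand(fields));
HashSet<String> matchedKeys = redisMultiScan(jedis, patternList, 1000);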
For 2) I've created a Lua script to help with the server-side SCAN. The performance is not brilliant, but it is much faster than 1), even though Lua doesn't support alternation in matching patterns and I have to loop each key through a pattern list for validation:
local function MatchAny( str, pats )
    -- pats is a pipe-separated list of Lua patterns; return the first one that matches
    for pat in string.gmatch(pats, '([^|]+)') do
        local w = string.match( str, pat )
        if w then return w end
    end
end
-- ARGV[1]: Scan Count
-- ARGV[2]: Scan Match Glob-Pattern
-- ARGV[3]: Patterns
local cur = 0
local rep = {}
local tmp
repeat
    tmp = redis.call("SCAN", cur, "MATCH", ARGV[2], "COUNT", ARGV[1])
    cur = tonumber(tmp[1])
    if tmp[2] then
        for k, v in pairs(tmp[2]) do
            local fi = MatchAny(v, ARGV[3])
            if fi then
                rep[#rep+1] = v
            end
        end
    end
until cur == 0
return rep
Called in such a fashion:
private static ArrayList<String> redisLuaMultiScan(Jedis jedis, String luaSha, List<String> KEYS, List<String> ARGV) {
    Object response = jedis.evalsha(luaSha, KEYS, ARGV);
    if (response instanceof List<?>)
        return (ArrayList<String>) response;
    else
        return new ArrayList<>();
}
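The luaSha is obtained by loading the script once up front; the wiring could look like this (the glob and Lua patterns below are illustrative values of mine):
// luaScript holds the Lua source shown above
String luaSha = jedis.scriptLoad(luaScript);
List<String> keys = Collections.emptyList();
List<String> argv = Arrays.asList(
        "1000",                     // ARGV[1]: SCAN COUNT hint
        "*:Portugal",               // ARGV[2]: coarse server-side glob
        ":Guacamole:|:Gazpacho:");  // ARGV[3]: pipe-separated Lua patterns
ArrayList<String> matched = redisLuaMultiScan(jedis, luaSha, keys, argv);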
For 3) I've implemented and maintained a secondary index, updated for each of the 3 fields using sorted sets, and implemented querying with alternating matching patterns on single fields and multi-field matching patterns like this:
private static Set<String> redisIndexedMultiPatternQuery(Jedis jedis, ArrayList<ArrayList<String>> patternList) {
    ArrayList<String> unionedSets = new ArrayList<>();
    String keyName;
    Pipeline pipeline = jedis.pipelined();
    for (ArrayList<String> subPatternList : patternList) {
        if (subPatternList.isEmpty()) continue;
        // Union the alternatives within one field into a temporary set
        keyName = "un:" + RandomStringUtils.random(KEY_CHAR_COUNT, true, true);
        pipeline.zunionstore(keyName, subPatternList.toArray(new String[0]));
        unionedSets.add(keyName);
    }
    // Intersect the per-field unions to apply all filters at once
    String[] unionArray = unionedSets.toArray(new String[0]);
    keyName = "in:" + RandomStringUtils.random(KEY_CHAR_COUNT, true, true);
    pipeline.zinterstore(keyName, unionArray);
    Response<Set<String>> response = pipeline.zrange(keyName, 0, -1);
    pipeline.del(unionArray);
    pipeline.del(keyName);
    pipeline.sync();
    return response.get();
}
The results of my stress test cases clearly favor 3) in terms of request latency.
I would vote for option 3, but I will probably start to use RediSearch.
Also, have you looked at RediSearch? This module allows you to create secondary indexes and run complex queries and full-text search.
This may simplify your development.
I invite you to look at the project and its Getting Started guide.
Once installed, you will be able to achieve it with the following commands:
HSET recipe:13434 name "Guacamole" type "Dip" country "Mexico"
HSET recipe:34244 name "Gazpacho" type "Soup" country "Spain"
HSET recipe:42344 name "Paella" type "Dish" country "Spain"
HSET recipe:23444 name "Hot Dog" type "StreetFood" country "USA"
HSET recipe:78687 name "Custard Pie" type "Dessert" country "Portugal"
HSET recipe:75453 name "Churritos" type "Dessert" country "Spain"
FT.CREATE idx:recipe ON HASH PREFIX 1 recipe: SCHEMA name TEXT SORTABLE type TAG SORTABLE country TAG SORTABLE
FT.SEARCH idx:recipe "#type:{Dessert}"
FT.SEARCH idx:recipe "#type:{Dessert} #country:{Spain}" RETURN 1 name
FT.AGGREGATE idx:recipe "*" GROUPBY 1 #type REDUCE COUNT 0 as nb_of_recipe
I am not explaining all the commands in detail here, since you can find the explanations in the tutorial, but here are the basics:
Use a hash to store the recipes.
Create a RediSearch index and index the fields you want to query.
Run queries, for example:
To get all Spanish desserts: FT.SEARCH idx:recipe "@type:{Dessert} @country:{Spain}" RETURN 1 name
To count the number of recipes by type: FT.AGGREGATE idx:recipe "*" GROUPBY 1 @type REDUCE COUNT 0 as nb_of_recipe
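From Java, the same queries can be run with the JRediSearch client; a sketch (I'm assuming the io.redisearch client here, so check the API of the version you use):
import io.redisearch.Query;
import io.redisearch.SearchResult;
import io.redisearch.client.Client;

Client client = new Client("idx:recipe", "localhost", 6379);
SearchResult result = client.search(
        new Query("@type:{Dessert} @country:{Spain}").returnFields("name"));
result.docs.forEach(doc -> System.out.println(doc.get("name")));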
I ended up using a simple strategy to update each secondary index for each field when the key is created:
protected static void setKeyAndUpdateIndexes(Jedis jedis, String key, String value, int idxDimSize) {
    String[] key_arr = key.split(":");
    Pipeline pipeline = jedis.pipelined();
    pipeline.set(key, value);
    // One index entry per field: "idx:" plus the field value, padded with ":"
    // on both sides so the value's position in the key is preserved
    for (int y = 0; y < key_arr.length; y++)
        pipeline.zadd(
                "idx:" +
                StringUtils.repeat(":", y) +
                key_arr[y] +
                StringUtils.repeat(":", idxDimSize - y),
                java.time.Instant.now().getEpochSecond(),
                key);
    pipeline.sync();
}
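For the example key 13434:Guacamole:Dip:Mexico with idxDimSize = 3, this creates four index entries, one per field position:
idx:13434:::     => 13434:Guacamole:Dip:Mexico
idx::Guacamole:: => 13434:Guacamole:Dip:Mexico
idx:::Dip:       => 13434:Guacamole:Dip:Mexico
idx::::Mexico    => 13434:Guacamole:Dip:Mexico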
The search strategy to find multiple keys matching a pattern, including alternating patterns and multi-field filters, uses the redisIndexedMultiPatternQuery method shown above.
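For example, to find recipes of type Soup or Dessert from Spain or Portugal, the nested pattern list holds index key names, one inner list per field (key names follow the scheme above):
ArrayList<ArrayList<String>> patternList = new ArrayList<>();
patternList.add(new ArrayList<>(Arrays.asList("idx:::Soup:", "idx:::Dessert:")));   // recipe type alternatives (unioned)
patternList.add(new ArrayList<>(Arrays.asList("idx::::Spain", "idx::::Portugal"))); // country alternatives (unioned)
Set<String> matchedKeys = redisIndexedMultiPatternQuery(jedis, patternList);        // per-field unions intersected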
Related
I am able to add and get a particular user object from Redis. I am adding the object like this:
private static final String USER_PREFIX = ":USER:";

public void addUserToRedis(String serverName, User user) {
    redisTemplate.opsForHash().put(serverName + USER_PREFIX + user.getId(),
            Integer.toString(user.getId()), user);
}
If a userId is 100, I am able to get it by the key SERVER1:USER:100.
Now I want to retrieve all users as a Map<String, List<User>>.
For example, get all users by the key prefix SERVER1:USER:. Is that possible, or do I need to modify my addUserToRedis method? Please advise.
I would recommend not using the KEYS command in production, as it can severely impact Redis latencies (it can even bring down the cluster if you have a large number of keys stored).
Instead, you would want to use a different structure than plain GET/SET.
It would be better to use sets or hashes:
127.0.0.1:6379> sadd server1 user1 user2
(integer) 2
127.0.0.1:6379> smembers server1
1) "user2"
2) "user1"
127.0.0.1:6379>
Using sets, you can simply add your users to a server key and get the entire list of users on a server.
If you really need a map of <server, list<users>>, you can use hashes with stringified user data and then convert it to an actual User POJO at the application layer:
127.0.0.1:6379> hset server2 user11 name
(integer) 1
127.0.0.1:6379> hset server2 user13 name
(integer) 1
127.0.0.1:6379> hgetall server2
1) "user11"
2) "name"
3) "user13"
4) "name"
127.0.0.1:6379>
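In Java, the hash-based variant could look roughly like this (a minimal sketch assuming Jedis and Jackson; User is your POJO and getUsersForServer is a name I made up):
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.HashMap;
import java.util.Map;
import redis.clients.jedis.Jedis;

static Map<String, User> getUsersForServer(Jedis jedis, String serverKey) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    Map<String, User> users = new HashMap<>();
    // One HGETALL call instead of scanning the whole keyspace
    for (Map.Entry<String, String> entry : jedis.hgetAll(serverKey).entrySet())
        users.put(entry.getKey(), mapper.readValue(entry.getValue(), User.class));
    return users;
}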
Also note that keeping this much data in a single key is not an ideal thing to do.
I don't use Java, but here's how to use SCAN from Node.js with ioredis:
const Redis = require('ioredis')
const redis = new Redis()

async function main() {
    const stream = redis.scanStream({
        match: "*:user:*",
        count: 100,
    })
    stream.on("data", (resultKeys) => {
        for (let i = 0; i < resultKeys.length; i++) {
            // console.log(resultKeys[i])
            // do your things here
        }
    });
    stream.on("end", () => {
        console.log("all keys have been visited");
    });
}

main()
Finally, I came up with this solution using wildcard search and avoiding KEYS; here is my complete method:
public Map<String, User> getUserMapFromRedis(String serverName) {
    Map<String, User> users = new HashMap<>();
    RedisConnection redisConnection = null;
    try {
        redisConnection = redisTemplate.getConnectionFactory().getConnection();
        ScanOptions options = ScanOptions.scanOptions().match(serverName + USER_PREFIX + "*").build();
        Cursor<byte[]> scan = redisConnection.scan(options);
        while (scan.hasNext()) {
            byte[] next = scan.next();
            String key = new String(next, StandardCharsets.UTF_8);
            String[] keyArray = key.split(":");
            String userId = keyArray[2];
            User user = null; // get User by userId from Redis
            users.put(userId, user);
        }
        try {
            scan.close();
        } catch (IOException e) {
            // ignore close failures
        }
    } finally {
        redisConnection.close(); // ensure the connection is closed
    }
    return users;
}
Somehow I got a filename.log which looks, for example, like this (tab-separated):
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
Because the keys in each column may differ, I cannot define a schema when I load the text like this:
a = load 'filename.log' as (Name:chararray,Age:int);
Nor do I want to refer to columns by position, like this:
b = foreach a generate $0,$1;
What I want to do, from only that filename.log, is to make it possible to refer to each value by key, for example:
a = load 'filename.log' using PigStorage('\t');
b = group a by Name;
c = foreach b generate group, COUNT(a);
dump c;
For that purpose, I wrote a Java UDF which separates key:value and gets the value for every field in the tuple, as below:
public class SPLITALLGETCOL2 extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) {
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        ArrayList<String> mProtoTuple = new ArrayList<String>();
        Tuple output;
        String target = input.toString().substring(1, input.toString().length() - 1);
        String[] tokenized = target.split(",");
        try {
            for (int i = 0; i < tokenized.length; i++) {
                mProtoTuple.add(tokenized[i].split(":")[1]);
            }
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        } catch (Exception e) {
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        }
    }
}
How should I alter this method to get what I want? or How should I write other UDF to get there?
Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
However, it sounds like you really want a map:
myudf.py
@outputSchema('M:map[]')
def mapize(the_input):
    out = {}
    for kv in the_input.split(' '):
        k, v = kv.split(':')
        out[k] = v
    return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull values out of the map by the key you give. You can read more about maps in the Pig documentation.
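If you would rather stay in Java, the same idea as a map-returning EvalFunc could look like this (a sketch; Mapize is a hypothetical class name, and I assume the whole line arrives as a single chararray of tab-separated key:value pairs):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Mapize extends EvalFunc<Map<String, Object>> {
    @Override
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        Map<String, Object> out = new HashMap<String, Object>();
        // Each field is "key:value"; fields are tab-separated
        for (String kv : ((String) input.get(0)).split("\t")) {
            String[] parts = kv.split(":", 2);
            if (parts.length == 2)
                out.put(parts[0], parts[1]);
        }
        return out;
    }
}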
I have a problem with RavenDB indexing.
Simple query looks like this:
var values = myCollection.Query.Where(w =>
    w.MyId == MyId &&
    w.IsReady == false &&
    w.IsDeleted &&
    w.Rate > 0);
During execution, Raven creates a dynamic index:
from doc in docs.MyCollection
select new { Rate = doc.Rate, IsReady = doc.IsReady, IsDeleted = doc.IsDeleted, MyId = doc.MyId }
with extra options:
Field -> Rate;
Storage -> No;
Indexing -> Default;
Sort -> Double;
The Rate field has the decimal type.
Problem:
I wanted to add a static index, but when I specified the index like this:
public class MyIndex : AbstractIndexCreationTask<MyCollection> {
    public MyIndex() {
        Map = d => d.Select(s => new { Rate = s.Rate, IsReady = s.IsReady, IsDeleted = s.IsDeleted, MyId = s.MyId });
        Sort(x => x.Rate, SortOptions.Double);
    }
}
Raven creates a slightly different index:
from doc in docs.MyCollection
select new { Rate = (decimal)doc.Rate, IsReady = doc.IsReady, IsDeleted = doc.IsDeleted, MyId = doc.MyId }
with extra options:
Field -> Rate;
Storage -> No;
Indexing -> Default;
Sort -> Double;
The only difference is the cast in the static index, because my field type is decimal and I'm using the Double sort option.
Because of that, Raven does not use my static index but instead creates a dynamic one every time the query is executed.
I tried to do some casting inside Sort(), but then the index was not created at all. One way to overcome this is to manually modify the static index from the management console after it has been created, but that's not a good solution.
Any ideas how to deal with that?
Thanks.
Edit:
Another example: a field of type DateTime, queried with DateTime values as predicates (greater than / less than). During dynamic index creation Raven picks String as the SortOption, and when I try to prepare a static index I get a casting issue.
You can use the IDocumentSession.Query(string indexName, [bool isMapReduce]) or the IDocumentSession.Query<TResult, TIndexCreator>() overloads to explicitly specify a static index. So in your specific case, either IDocumentSession.Query<MyCollection, MyIndex>() or IDocumentSession.Query("MyIndex").
Suppose that I have two arrays:
Dim RoomName() As String = {"RoomA", "RoomB", "RoomC", "RoomD", "RoomE"}
Dim RoomType() As Integer = {1, 2, 2, 2, 1}
I want to get a value from the RoomName array based on a criterion from the RoomType array. For example, I want to get a RoomName with RoomType = 2, so the algorithm should pick a random index where RoomType is 2, i.e. a single value from indexes 1-3 only.
Is there any possible way to solve this using arrays, or is there a better way to do it? Thank you very much for your time :)
Note: the code examples below use C#, but hopefully you can read the intent for VB.NET.
Well, a simpler way would be to have a structure/class that contained both name and type properties e.g.:
public class Room
{
    public string Name { get; set; }
    public int Type { get; set; }

    public Room(string name, int type)
    {
        Name = name;
        Type = type;
    }
}
Then, given a set of rooms, you can find those of a given type using a simple LINQ expression:
var match = rooms.Where(r => r.Type == 2).Select(r => r.Name).ToList();
Then you can find a random entry from within the set of matching room names (see below).
However, assuming you want to stick with the parallel arrays, one way is to find the matching index values from the type array, then find the matching names, and then pick one of the matching values using a random function.
var matchingTypeIndexes = new List<int>();
int matchingTypeIndex = -1;
do
{
    matchingTypeIndex = Array.IndexOf(roomType, 2, matchingTypeIndex + 1);
    if (matchingTypeIndex > -1)
    {
        matchingTypeIndexes.Add(matchingTypeIndex);
    }
} while (matchingTypeIndex > -1);

List<string> matchingRoomNames = matchingTypeIndexes.Select(typeIndex => roomName[typeIndex]).ToList();
Then to find a random entry of those that match (from one of the lists generated above):
var posn = new Random().Next(matchingRoomNames.Count);
Console.WriteLine(matchingRoomNames[posn]);
If I have a field x that can contain a value of y or z etc., is there a way I can query so that I return only the values that have been indexed?
Example
x available settable values = test1, test2, test3, test4
Item 1 : Field x = test1
Item 2 : Field x = test2
Item 3 : Field x = test4
Item 4 : Field x = test1
Performing required query would return a list of:
test1, test2, test4
I've implemented this before as an extension method:
public static class ReaderExtentions
{
    public static IEnumerable<string> UniqueTermsFromField(
        this IndexReader reader, string field)
    {
        var termEnum = reader.Terms(new Term(field));
        do
        {
            var currentTerm = termEnum.Term();
            if (currentTerm.Field() != field)
                yield break;
            yield return currentTerm.Text();
        } while (termEnum.Next());
    }
}
You can use it very easily like this:
var allPossibleTermsForField = reader.UniqueTermsFromField("FieldName");
That will return you what you want.
EDIT: I was skipping the first term above, due to some absent-mindedness. I've updated the code accordingly to work properly.
TermEnum te = indexReader.Terms(new Term("fieldx"));
do
{
    Term t = te.Term();
    if (t == null || t.Field() != "fieldx") break;
    Console.WriteLine(t.Text());
} while (te.Next());
You can use facets to return the first N values of a field, provided the field is indexed as a string or indexed using KeywordTokenizer with no filters; this means the field is not tokenized but saved as it is.
Just set the following properties on a query:
facet=true
facet.field=fieldname
facet.limit=N //the number of values you want to retrieve
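For example, a complete facet request might look like this (core name and handler are placeholders):
http://localhost:8983/solr/mycore/select?q=*:*&rows=0&facet=true&facet.field=x&facet.limit=10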
I think a WildcardQuery searching on field 'x' and value of '*' would do the trick.
I once used Lucene 2.9.2, and there I used the FieldCache approach described in the book "Lucene in Action" (Manning):
String[] fieldValues = FieldCache.DEFAULT.getStrings(indexReader, fieldname);
The array fieldValues contains all values in the index for the field fieldname (example: ["NY", "NY", "NY", "SF"]), so it is up to you how to process the array. Usually you create a HashMap<String, Integer> that sums up the occurrences of each distinct value, in this case NY=3, SF=1.
Maybe this helps. It is quite slow and memory-consuming for very large indexes (1,000,000 documents), but it works.
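The counting step could be as simple as this (plain Java, nothing Lucene-specific):
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String value : fieldValues) {
    Integer n = counts.get(value);
    counts.put(value, n == null ? 1 : n + 1);
}
// counts now holds e.g. {NY=3, SF=1}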