I would like to check that all of my keys in Redis are correct.
I'm storing the keys in groups like so:
userid:fname
userid:lname
userid:age
...
I would like to iterate over them, grouping them by userid, and then check that each group contains fname, lname and age.
How can I do this?
ScanParams params = new ScanParams();
params.match("userid:fname*");
// Use "0" to do a full iteration of the collection.
ScanResult<String> scanResult = jedis.scan("0", params);
List<String> keys = scanResult.getResult();
Repeat the above code for lname and age. Alternatively, match on the userid and build the groups by filtering the keys with a regex as you iterate through them, as in the sketch below.
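As a rough illustration of the regex approach (the exact pattern depends on how your keys are really laid out, which isn't shown here), you could extract the user-identifying part of each key returned by SCAN and note which field it belongs to:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical key layout: "<userid>:<field>", e.g. "1234:fname".
// Adjust the pattern to match your actual keys.
Pattern keyPattern = Pattern.compile("^([^:]+):(fname|lname|age)$");
for (String key : keys) {
    Matcher m = keyPattern.matcher(key);
    if (m.matches()) {
        String userId = m.group(1);  // the grouping value
        String field = m.group(2);   // fname, lname or age
        // add (userId, field) to your grouping structure here
    }
}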
EDIT: For large collections (millions of keys), a single scan call will only return a few tens of elements. Adjust your code accordingly so that it continues scanning until the whole collection of keys has been covered:
ScanParams params = new ScanParams();
params.match("userid:fname*");
// An iteration starts at "0": http://redis.io/commands/scan
ScanResult<String> scanResult = jedis.scan("0", params);
List<String> keys = scanResult.getResult();
String nextCursor = scanResult.getStringCursor();
while (true) {
    for (String key : keys) {
        addKeyToProperGroup(key);
    }
    // An iteration also ends at "0"
    if (nextCursor.equals("0")) {
        break;
    }
    scanResult = jedis.scan(nextCursor, params);
    nextCursor = scanResult.getStringCursor();
    keys = scanResult.getResult();
}
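addKeyToProperGroup isn't shown above; a minimal hypothetical version (again assuming keys shaped like "<userid>:<field>") could bucket the keys into a map keyed by user id, with a separate check afterwards that every group has fname, lname and age:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: buckets "<userid>:<field>" keys and reports incomplete groups.
class KeyGrouper {
    private final Map<String, Set<String>> groups = new HashMap<>();

    void addKeyToProperGroup(String key) {
        int idx = key.indexOf(':');
        if (idx > 0) {
            String userId = key.substring(0, idx);
            String field = key.substring(idx + 1);
            groups.computeIfAbsent(userId, id -> new HashSet<>()).add(field);
        }
    }

    // Call once the scan has finished.
    void checkGroups() {
        Set<String> expected = Set.of("fname", "lname", "age");
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            if (!e.getValue().containsAll(expected)) {
                System.out.println("Incomplete group for user " + e.getKey() + ": " + e.getValue());
            }
        }
    }
}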
Alternatively, you can use the Redis-backed java.util.Iterator and java.lang.Iterable implementations offered by the Redisson Redis client.
Here is an example:
RedissonClient redissonClient = Redisson.create(config);
RKeys keys = redissonClient.getKeys();
// default batch size on each SCAN invocation is 10
for (String key : keys.getKeys()) {
    ...
}
// batch size of 250 keys on each SCAN invocation
for (String key : keys.getKeys(250)) {
    ...
}
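If you only need the keys for a single field at a time, RKeys also supports pattern-based iteration (still driven by SCAN under the hood); the pattern below assumes keys shaped like "<userid>:fname" and may need adjusting to your actual layout:

// Iterate only over keys matching a glob pattern.
for (String key : keys.getKeysByPattern("*:fname")) {
    // handle the key
}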
I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, where stripping out punctuation is inappropriate: much of it can be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of runs of word or digit characters, plus individual characters which don't match that pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
    public MyAnalyzer(String synlist) {
        if (synlist != null && !synlist.isEmpty()) {
            this.synlist = synlist;
            this.useSynonyms = true;
        }
    }
    public MyAnalyzer() {
        this.useSynonyms = false;
    }
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
        if (useSynonyms) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }
and here's my filter:
public class TechTokenFilter extends TokenFilter {
    private final CharTermAttribute termAttr;
    private final PositionIncrementAttribute posIncAttr;
    private final ArrayList<String> termStack;
    private AttributeSource.State current;
    private final TypeAttribute typeAttr;
    public TechTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
        termStack = new ArrayList<>();
        termAttr = addAttribute(CharTermAttribute.class);
        posIncAttr = addAttribute(PositionIncrementAttribute.class);
        typeAttr = addAttribute(TypeAttribute.class);
    }
    @Override
    public boolean incrementToken() throws IOException {
        if (this.termStack.isEmpty() && input.incrementToken()) {
            final String currentTerm = termAttr.toString();
            final int bufferLen = termAttr.length();
            if (bufferLen > 0) {
                if (termStack.isEmpty()) {
                    termStack.addAll(Arrays.asList(techTokens(currentTerm)));
                    current = captureState();
                }
            }
        }
        if (!this.termStack.isEmpty()) {
            String part = termStack.remove(0);
            restoreState(current);
            termAttr.setEmpty().append(part);
            posIncAttr.setPositionIncrement(1);
            return true;
        } else {
            return false;
        }
    }
    public static String[] techTokens(String t) {
        List<String> tokenlist = new ArrayList<String>();
        String[] tokens;
        StringBuilder next = new StringBuilder();
        String token;
        char minus = '-';
        char underscore = '_';
        char c, prec, subc;
        // Boolean inWord = false;
        for (int i = 0; i < t.length(); i++) {
            prec = i > 0 ? t.charAt(i - 1) : 0;
            c = t.charAt(i);
            subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
            if (Character.isLetterOrDigit(c) || c == underscore) {
                next.append(c);
                // inWord = true;
            }
            else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
                next.append(c);
            } else {
                if (next.length() > 0) {
                    token = next.toString();
                    tokenlist.add(token);
                    next.setLength(0);
                }
                if (Character.isWhitespace(c)) {
                    // shouldn't be possible because the input stream has been tokenized on
                    // whitespace
                } else {
                    tokenlist.add(String.valueOf(c));
                }
                // inWord = false;
            }
        }
        if (next.length() > 0) {
            token = next.toString();
            tokenlist.add(token);
            // next.setLength(0);
        }
        tokens = tokenlist.toArray(new String[0]);
        return tokens;
    }
}
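Since techTokens is a public static method, its behaviour can be checked in isolation. A small hypothetical driver like the one below shows the word runs and single punctuation characters the filter emits for typical input:

import java.util.Arrays;

public class TechTokensDemo {
    public static void main(String[] args) {
        // Expected output, given the logic above:
        // [event, ), ;]  and  [job, ., queue]
        System.out.println(Arrays.toString(TechTokenFilter.techTokens("event);")));
        System.out.println(Arrays.toString(TechTokenFilter.techTokens("job.queue")));
    }
}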
Examining the index, I can see that it contains the separate terms I expect, including the synonym values. For example, the text at the end of the first line has produced the terms
of
them
at, simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms
same
event
)
;
When the application performs a search it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-matches. I won't show the whole search method here, but will focus on the code which lists the matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
        Query query, int max, String synList) throws IOException {
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
    Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
    Analyzer analyzer;
    if (synList != null) {
        analyzer = new MyAnalyzer(synList);
    } else {
        analyzer = new MyAnalyzer();
    }
    // Collect all the docs
    TopDocs results = searcher.search(query, max);
    ScoreDoc[] hits = results.scoreDocs;
    int numTotalHits = Math.toIntExact(results.totalHits.value);
    System.out.println("\nQuery: " + query.toString());
    System.out.println("Matches: " + numTotalHits);
    // Collect matching terms
    HashSet<String> matchedWords = new HashSet<String>();
    int start = 0;
    int end = Math.min(numTotalHits, max);
    for (int i = start; i < end; i++) {
        int id = hits[i].doc;
        float score = hits[i].score;
        Document doc = searcher.doc(id);
        String docpath = doc.get("path");
        String doctext = doc.get("text");
        try {
            TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
            TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    String match = frag[j].toString();
                    addMatchedWord(matchedWords, match);
                }
            }
        } catch (InvalidTokenOffsetsException e) {
            System.err.println(e.getMessage());
        }
        System.out.println("matched file: " + docpath);
    }
    if (matchedWords.size() > 0) {
        System.out.println("matched terms:");
        for (String word : matchedWords) {
            System.out.println(word);
        }
    }
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to pass the list of matches through yet another filter to produce the correct results.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!
I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work, but once I had it right, all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
    public MyAnalyzer(String synlist) {
        if (synlist != null && !synlist.isEmpty()) {
            this.synlist = synlist;
            this.useSynonyms = true;
        }
    }
    public MyAnalyzer() {
        this.useSynonyms = false;
    }
    private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
        TokenStream result = new LowerCaseFilter(src);
        if (useSynonyms) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }
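As a quick sanity check of the regular expression itself, independent of Lucene, a small hypothetical driver shows how the pattern splits compound technical terms into words and individual punctuation characters:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(([\\w]+-)*[\\w]+)|[^\\w\\s]");
        // Prints each token on its own line: job . queue at the same time ;
        Matcher m = p.matcher("job.queue at the same time;");
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}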
I have multiple servers (for redundancy) sending data to clients. The clients need to process these messages in sequence and ignore duplicates.
We use external information to determine a special sequencing string that is deterministic across all our servers, as it would be too slow to keep the servers in sync.
The sequencing strings generated have remnants of top-secret information in them, and we can't reveal them to the clients.
Suppose the sequencing string just contains an integer. Is there a way of hashing this data such that the clients can order the messages without learning any additional information about its content?
Suppose a more complicated sequence string is used. The string is split into sub-sequences, and each sub-sequence is given a category, something like "a:12477/t:637" and "a:12477/e:456", where the comparison function between sequences is given below. Is it possible to hash the sequencing string in such a way that even a complicated function like this can operate on the data and nothing else?
JavaScript pseudo-code:
function compare(seq_a: string, seq_b: string) {
    function decode(seq) {
        return seq.split("/").map(segment => {
            let [category, sub_seq] = segment.split(":");
            return { category, sub_seq: Number(sub_seq) };
        });
    }
    let a = decode(seq_a);
    let b = decode(seq_b);
    for (let i = 0; i < Math.max(a.length, b.length); i++) {
        let segment_a = a[i] || { category: "empty", sub_seq: 0 };
        let segment_b = b[i] || { category: "empty", sub_seq: 0 };
        if (segment_a.category != segment_b.category) {
            return "UNKNOWN";
        }
        if (segment_a.sub_seq > segment_b.sub_seq) {
            return "A";
        } else if (segment_a.sub_seq < segment_b.sub_seq) {
            return "B";
        } else if (segment_a.sub_seq == segment_b.sub_seq) {
            continue;
        }
    }
    return "UNKNOWN";
}
I have very little knowledge of cryptography and zero-knowledge techniques, so I have not tried anything yet; the furthest I have gotten is working out the idea of what is needed.
I have some code that stores, in redis, a flag of whether a user is active, under a unique key per user.
class RedisProfileActiveRepo implements ProfileActiveRepo
{
    /** @var Redis */
    private $redis;
    public function __construct(Redis $redis)
    {
        $this->redis = $redis;
    }
    public function markProfileIsActive(int $profile_id)
    {
        $keyname = ProfileIsActiveKey::getAbsoluteKeyName($profile_id);
        // Set the user specific key for 10 minutes
        $result = $this->redis->setex($keyname, 10 * 60, 'foobar');
    }
    public function getNumberOfActiveProfiles()
    {
        $count = 0;
        $pattern = ProfileIsActiveKey::getWildcardKeyName();
        $iterator = null;
        while (($keys = $this->redis->scan($iterator, $pattern)) !== false) {
            $count += count($keys);
        }
        return $count;
    }
}
When I generate the keys from this code:
namespace ProjectName;
class ProfileIsActive
{
    public static function getAbsoluteKeyName(int $profile_id) : string
    {
        return __CLASS__ . '_' . $profile_id;
    }
    public static function getWildcardKeyName() : string
    {
        return __CLASS__ . '_*';
    }
}
This results in keys that look like ProjectName\ProfileIsActive_1234, and the SCAN command in Redis fails to match any keys.
When I strip the backslashes out of the key name instead:
class ProfileIsActive
{
    public static function getAbsoluteKeyName(int $profile_id) : string
    {
        return str_replace('\\', '', __CLASS__) . '_' . $profile_id;
    }
    public static function getWildcardKeyName() : string
    {
        return str_replace('\\', '', __CLASS__) . '_*';
    }
}
The code works as expected.
My question is: why does a SCAN with a backslash in the key name fail to behave as expected, and are there any other characters that should be avoided in key names to prevent similar problems?
Note first that redis-cli escapes backslashes when displaying key names, so a key set as this\test shows up as "this\\test":
127.0.0.1:6379> set this\test 1
OK
127.0.0.1:6379> keys this*
1) "this\\test"
Issue a MONITOR command in redis-cli before you run your PHP client code, and watch for the SCAN commands it sends. If your collection is big enough and your COUNT parameter is absent or too low, a single SCAN iteration may not return the matching key:
127.0.0.1:6379> scan 0 match this*
1) "73728"
2) (empty list or set)
127.0.0.1:6379> scan 0 match this* count 10000
1) "87704"
2) 1) "this\\test"
I am trying to select and update multiple rows from RavenDB, but it repeatedly updates the same rows, namely the first 100, and nothing actually changes.
Here is my code. How can I select some rows, update some fields on each of them, and repeat until the job is finished?
var currentEmailId = 100;
using (var session = store.OpenSession())
{
    var goon = true;
    while (goon)
    {
        var contacts = session.Query<Contacts>().Where(f => f.LastEmailId < currentEmailId).Take(100);
        if (contacts.Any())
        {
            foreach (var contact in contacts)
            {
                EmailOperation.Send(contact, currentEmailId);
                contact.LastEmailId = currentEmailId;
            }
            session.SaveChanges();
        }
        else
        {
            goon = false;
        }
    }
}
It's probably because you're doing a query immediately after saving changes, without letting the indexes update after SaveChanges. Thus, you're getting back the same items. To fix that, you can tell SaveChanges to wait until the indexes are updated. Your code would look something like this:
var goon = true;
var currentEmailId = 100;
while (goon)
{
    using (var session = store.OpenSession())
    {
        var contacts = session.Query<Contacts>()
            .Where(f => f.LastEmailId < currentEmailId)
            .Take(100);
        if (contacts.Any())
        {
            foreach (var contact in contacts)
            {
                EmailOperation.Send(contact, currentEmailId);
                contact.LastEmailId = currentEmailId;
            }
            // Wait for the indexes to update when calling SaveChanges.
            session.Advanced.WaitForIndexesAfterSaveChanges(TimeSpan.FromSeconds(30), false);
            session.SaveChanges();
        }
        else
        {
            goon = false;
        }
    }
}
If you're updating many contacts at once, you may wish to consider using streaming query results combined with BulkInsert to update many Contacts en masse.
I have a Lucene SpellChecker indexing implementation like so:
def buildAutoSuggestIndex(path: Path): SpellChecker = {
  val config = new IndexWriterConfig(new CustomAnalyzer())
  val dictionary = new PlainTextDictionary(path)
  val directory = FSDirectory.open(path.getParent)
  val spellChecker = new SpellChecker(directory)
  val jw = new JaroWinklerDistance()
  jw.setThreshold(jaroWinklerThreshold)
  spellChecker.setStringDistance(jw)
  spellChecker.indexDictionary(dictionary, config, true)
  spellChecker
}
I need to update these SpellChecker dictionaries, i.e. index new entries without reindexing the whole index. Is there any way to update a SpellChecker index?
SpellChecker.indexDictionary(...) already avoids reindexing terms right here:
terms: while ((currentTerm = iter.next()) != null) {
    String word = currentTerm.utf8ToString();
    int len = word.length();
    if (len < 3) {
        continue; // too short we bail but "too long" is fine...
    }
    if (!isEmpty) {
        for (TermsEnum te : termsEnums) {
            if (te.seekExact(currentTerm)) {
                continue terms;
            }
        }
    }
    // ok index the word
    Document doc = createDocument(word, getMin(len), getMax(len));
    writer.addDocument(doc);
seekExact returns true if the term is already in the index, in which case the document with the n-grams for that term is not added again (continue terms;); only genuinely new words get indexed.
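So, to add new entries without rebuilding everything, it should be enough to call indexDictionary again against the existing spell-check index with a dictionary containing only the new words; anything already indexed is skipped by the check above. A minimal Java sketch, assuming a plain-text file of new words and hypothetical paths (adjust the analyzer and paths to your setup):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class SpellIndexUpdate {
    public static void main(String[] args) throws Exception {
        // Open the existing spell-check index rather than creating a new one.
        try (SpellChecker spellChecker =
                 new SpellChecker(FSDirectory.open(Paths.get("/path/to/spellindex")))) {
            // A dictionary file containing only the new entries, one word per line.
            PlainTextDictionary newWords =
                new PlainTextDictionary(Paths.get("/path/to/new-words.txt"));
            // fullMerge = false: existing n-gram documents are left in place,
            // and words already present are skipped by indexDictionary().
            spellChecker.indexDictionary(newWords, new IndexWriterConfig(new StandardAnalyzer()), false);
        }
    }
}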