How do I drop an index using Lucandra?

I am using Lucandra and want to drop an entire index. Lucandra's IndexReader and IndexWriter don't have all methods implemented, so even iterating through calls to deleteDocument(int docNum) isn't possible.
Has anyone run up against this and figured out how to either
hack the Cassandra keyspace,
make additions to the Lucandra code, or
construct an iterator to delete all docs?

The current version of Lucandra doesn't store documents numbered 1..N, so deleteDocument(int) doesn't work.
What I've done is index a field with the same term in all documents, so you can match all documents and then delete them with a deleteDocuments(Term) call.
Another option (if you only have one index per keyspace) is to truncate the Cassandra column families.
The next version of Lucandra (in development) does store documents in 1..N fashion.
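A rough sketch of the delete-by-term workaround, assuming Lucandra's IndexWriter mirrors Lucene's API; the "all"/"all" catch-all field name and value are purely illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;

// At indexing time, add the same catch-all term to every document.
Document doc = new Document();
doc.add(new Field("all", "all", Field.Store.NO, Field.Index.NOT_ANALYZED));
// ...add the real fields, then write the document with the Lucandra IndexWriter...

// Later, to drop the whole index, delete everything matching that term:
indexWriter.deleteDocuments(new Term("all", "all"));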

Related

Are docids constant if the index is not manipulated in Lucene 8.6.1?

Say I update my index once a day, everyday, at the same time. During the time between updates (for 21 hours or so), will the docids remain constant?
As @andrewjames mentioned, docIds only change when a merge happens. The docId is basically the array index position of the doc in a particular segment.
A side effect of that is that if you have multiple segments, a given docId might be assigned to multiple docs: one in one segment, one in another segment, etc. If that's a problem, you can do a force merge once you are done building your index so that there is only a single segment. Then no two docs will have the same docId.
The docId for a given document will not change if a merge does not happen. And a merge won't happen unless you call force merge, add or delete documents, or upgrade your index.
So... if you build your index and don't add docs, delete docs, call force merge, or upgrade your index, then the docIds will be stable. But the next time you build your index, a given doc may receive a totally different docId. And as @andrewjames said, the docId assignments and the timing of those assignments are an internal affair in Lucene, so you should be cautious about relying on them even when you know when and how they are currently assigned.
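If you do want docIds to be unique across the whole index, a minimal sketch of the force-merge step mentioned above (assuming an already-configured Directory and IndexWriterConfig; the variable names are illustrative):

import org.apache.lucene.index.IndexWriter;

// After the index build is finished, merge down to a single segment so that
// no two live documents share a docId across segments.
try (IndexWriter writer = new IndexWriter(directory, indexWriterConfig)) {
    writer.forceMerge(1);  // at most one segment
    writer.commit();
}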

PostgreSQL with TimescaleDB only uses a single core during index creation

We have a PostgreSQL hypertable with a few billion rows, and we're trying to create a unique index on top of it like so:
CREATE UNIQUE INDEX device_data__device_id__value_type__timestamp__idx ON public.device_data(device_id, value_type, "timestamp" DESC);
We created the hypertable like this:
SELECT create_hypertable('device_data', 'timestamp');
Since we want to create the index as fast as possible, we'd like to parallelize the index creation, and followed this guide.
We tested various settings for work_mem, maintenance_work_mem, max_worker_processes, max_parallel_maintenance_workers, and max_parallel_workers. We also set the parallel_workers setting on our table: ALTER TABLE device_data SET (parallel_workers = 10);. But no matter what we do, the index creation only ever uses a single core (we have 16 available), so the creation takes a very long time.
Any idea what we might be missing here?
Our PostgreSQL version is 12.5 and the server runs Ubuntu 18.
Unfortunately, Timescale doesn't currently support parallel index creation. I would recommend filing a GitHub issue asking for it to be supported. It is a bit of a heavy lift and might not get prioritized terribly quickly. Another option that could be useful would be to take the transaction_per_chunk option (https://docs.timescale.com/latest/api#create_index) and allow the user to control how the indexes are created: a simple API that would create the index for all future chunks, but not on older chunks, and then allow you to call create_index(chunk_name, ht_index_name) on each of the existing chunks. You could then parallelize that operation in your own code. This ends up being a much simpler lift, because the transactionality of parallel index creation is the hardest part.
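For illustration only, here is a rough sketch of the "parallelize it in your own code" idea written against today's tools rather than the proposed API: it lists the chunk tables with show_chunks() and builds a plain PostgreSQL index on each chunk from a thread pool. The connection details are hypothetical, and indexes created directly on chunk tables this way are not registered as a hypertable index in TimescaleDB's catalog, so treat this as a sketch, not a drop-in replacement for CREATE INDEX on the hypertable:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelChunkIndexer {
    // Hypothetical connection details.
    private static final String URL = "jdbc:postgresql://localhost:5432/devices";

    public static void main(String[] args) throws Exception {
        // 1. Collect the chunk tables backing the hypertable.
        List<String> chunks = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(URL, "user", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT show_chunks('device_data')")) {
            while (rs.next()) {
                chunks.add(rs.getString(1)); // e.g. _timescaledb_internal._hyper_1_42_chunk
            }
        }

        // 2. Build one index per chunk, several chunks at a time.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (String chunk : chunks) {
            pool.submit(() -> {
                try (Connection conn = DriverManager.getConnection(URL, "user", "secret");
                     Statement st = conn.createStatement()) {
                    st.execute("CREATE UNIQUE INDEX ON " + chunk
                            + " (device_id, value_type, \"timestamp\" DESC)");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(12, TimeUnit.HOURS);
    }
}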

Redis write back cache still a manual task?

I am working on an assignment. The REST API (developed in Spring) has a method m() which simulates the cleaning of windows by a person. Towards the end, the cleaner has to write a unique phrase (a string) on the window. Phrases written by all cleaners are eventually saved in the MySQL DB. So each time m() is executed, a query is made to the DB to fetch all phrases written to the DB so far today. The cleaner method m() then generates a random string as a phrase, checks it against the queried phrases to make sure it's unique, and writes it to the DB. So there is one query per m() to fetch all phrases and one to write the new phrase. Both happen on the same table.
This is a scenario that can take advantage of caching, so I went with Redis. I also think a write-back cache is the best solution: every write goes to the cache instead of the DB, and every read comes from the cache as well. The cache can be copied to the DB in a new thread every hour (or something configurable). I was reading Can Redis write out to a database like PostgreSQL? and it seems that some years back you had to do this manually.
My questions:
Is doing this manually still the way to go? If not, can someone point me to a Redis resource I can make use of?
If manual is the way to go, this is how I plan to implement it. Is it ideal?
Phrases written each hour will be appended to a list of (userid, phrase) objects in Redis; the list for midnight to 1 am will be called phrases_1, the one for 1 to 2 am phrases_2, and so on. Each hour, a background thread will write the entire hour's list to the DB. Every time all phrases need to be fetched for the uniqueness check, I will load all of the day's lists from the cache (phrases_1, phrases_2, ...) in a loop and consolidate them. A rough sketch of this plan is shown below. (Later, when the number of users grows, I will have to shard, but that is not my immediate concern.)
Thanks.
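A minimal sketch of the hourly-list plan described above, assuming the Jedis client; the host, key names, and the userId:phrase encoding are illustrative, and the hourly flush thread to MySQL is omitted:

import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;
import redis.clients.jedis.Jedis;

public class PhraseCache {
    private final Jedis jedis = new Jedis("localhost", 6379); // hypothetical Redis host

    // phrases_1 covers midnight-1am, phrases_2 covers 1-2am, and so on.
    private static String keyForHour(int hour) {
        return "phrases_" + (hour + 1);
    }

    // Append a (userid, phrase) entry to the current hour's list.
    public void addPhrase(long userId, String phrase) {
        jedis.rpush(keyForHour(LocalTime.now().getHour()), userId + ":" + phrase);
    }

    // Load and consolidate all of today's hourly lists for the uniqueness check.
    public List<String> allPhrasesToday() {
        List<String> all = new ArrayList<>();
        for (int hour = 0; hour <= LocalTime.now().getHour(); hour++) {
            all.addAll(jedis.lrange(keyForHour(hour), 0, -1));
        }
        return all;
    }
}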
Check https://github.com/RedisGears/rgsync (and https://redislabs.com/solutions/use-cases/caching/), which tries to address both the write-back and write-through cases.
I have yet to do a functionality test.
It is also interesting to note that a 2020 CMU paper (https://www.pdl.cmu.edu/PDL-FTP/Storage/2020.apocs.writeback.pdf) claims that "writeback-aware caching is NP-complete and Max-SNP hard".
Instead of going to Redis for uniqueness of data, you should create a unique index on the field you want to be unique, and MySQL will take care of the rest for you.
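A small sketch of that approach, assuming a hypothetical phrases table with a unique column and plain JDBC; on a duplicate, the insert fails and the caller can retry with a new random phrase:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class PhraseWriter {
    // Assumes: CREATE TABLE phrases (id BIGINT AUTO_INCREMENT PRIMARY KEY,
    //                                user_id BIGINT, phrase VARCHAR(255) UNIQUE)
    public boolean tryInsert(Connection conn, long userId, String phrase) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO phrases (user_id, phrase) VALUES (?, ?)")) {
            ps.setLong(1, userId);
            ps.setString(2, phrase);
            ps.executeUpdate();
            return true;   // phrase was unique and is now stored
        } catch (SQLIntegrityConstraintViolationException dup) {
            return false;  // duplicate phrase: generate a new one and retry
        }
    }
}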

Obtain all keys of a Neo4j index

I have a Neo4j database whose content is generated dynamically from a big dataset.
All “entry points” nodes are indexed on a named index (IndexManager.forNodes(…)). I can therefore look up a particular “entry point” node.
However, I would now like to enumerate all those specific nodes, but I can't know on which key they were indexed.
Is there any way to enumerate all keys of a Neo4j Index?
If not, what would be the best way to store those keys, a data type that is eminently non-graph-oriented?
UPDATE (thanks for asking for details :) ): the list would contain more than 2 million entries. The main use case would be to never update it after an initialization step, but other use cases might need it, so it has to be somewhat scalable.
Also, I would really prefer avoiding killing my current resilience abilities, so storing all keys at once, as opposed to adding them incrementally, would be a last-resort solution.
I would either use a different data store to supplement Neo4j (I like Redis) or try @MattiasPersson's suggestion and store the list on a node.
Is it just one list of keys or is it a list per node? You could store such a list on a specific node, say the reference node.
Instead of using a different storage, which increases complexity, you could try again with:
Lucene indices. Normally Lucene is able to handle this easily, especially now that MatchAllDocsQuery is better, but one problem is that the Neo4j guys are using a very old Lucene version.
A special "reference" field in every node, especially for this key-traversal case, linking to the next node, where you easily get ALL properties :)
If you want to get all nodes that were indexed in a particular index, you can just do:
// graphDb is a GraphDatabaseService; <INDEX_NAME> is the name of your node index.
IndexHits<Node> hits = graphDb.index().forNodes(<INDEX_NAME>).query("*:*");
try {
    while (hits.hasNext()) {
        Node n = hits.next();
        // ...process the node...
    }
} finally {
    hits.close(); // always release the index hits
}

How to deal with constantly changing data and SOLR indexes?

Afternoon guys,
I'm using a SOLR index for searching through items on my site. The search results contain an average rating of the item and the number of comments the item has. The results can be sorted by both rating and number of comments.
But obviously, with the Solr index, these numbers aren't updated until the DB (~2 million rows) is reindexed (done nightly, probably).
What would you guys think is the best way to approach this?
Well, I think you should change your DB-index sync policy:
First approach: when committing database changes, also post the changes (a batch of them) to the indexes. You should write a mapper tier to map your domain objects to Solr docs (remember: persist first, and if that goes OK, then index; this works fine for us ;-)). If you want to achieve near-real-time index updates, you should look at solutions like Zoie (LinkedIn's Lucene-based search framework). A rough SolrJ sketch of this approach is shown after the second one below.
Second approach: take a look at delta import (and schedule more frequent index updates).
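As promised above, a minimal SolrJ sketch of the first approach; the core URL and field names are hypothetical, and note that re-adding a document replaces it in the index, so either send all stored fields or use atomic updates:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ItemIndexer {
    // Hypothetical Solr core URL.
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();

    // Called right after the item's rating/comment count is committed to the DB.
    public void pushUpdate(String itemId, double avgRating, int numComments) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", itemId);
        doc.addField("avg_rating", avgRating);
        doc.addField("num_comments", numComments);
        solr.add(doc);
        solr.commit(); // or rely on autoSoftCommit for near-real-time visibility
    }
}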