I'm running CouchDB (1.2.1) + Lucene on Linux (https://github.com/rnewson/couchdb-lucene/), and I have a few questions.
I index everything - one index for all documents. I've got around 20,000,000 documents.
How fast are puts/deletes done on the index? I have about 10-50 puts/deletes per second.
Is there a rule of thumb, e.g. that after 10,000 updates you have to optimize the index?
Are changes to documents immediately visible in the index? If not, is there a delay or a temporary table for these updates/deletes?
Thanks in advance - Brandon
Use a profiler to measure the put/delete performance. That's the only way you'll get reasonably accurate numbers.
Optimization depends on how quickly the index is changing -- again, you will need to experiment and profile.
Changes are immediately visible in the index, but not to already-open IndexReaders.
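Note that if you search through your own long-lived IndexReader rather than through couchdb-lucene's HTTP API, you have to reopen it to see new changes. A minimal sketch, assuming a Lucene 4.x-style API (DirectoryReader.openIfChanged):

import org.apache.lucene.index.DirectoryReader;

// 'reader' is an already-open DirectoryReader over the index directory
DirectoryReader newer = DirectoryReader.openIfChanged(reader);
if (newer != null) {
    // new commits exist; switch over and release the stale reader
    reader.close();
    reader = newer;
}
// searches through 'reader' now see the latest puts/deletes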
I have some very basic conceptual questions about the functioning of Neo4j.
1. My first question is about the import tool. I am importing around 150 million nodes and a similar number of relationships. During the upload, the output on the command terminal prints the number of nodes uploaded and then "prepare node index". What is this node index? Where is it actually used? I see that the created index information is present under graph_db => schema => label. What is this index and where is it actually used? Running a Cypher query does not show that the index is being used anywhere.
2. My second question is about the heap memory size of Neo4j. What I understood is that while running Cypher queries, results are stored in the heap. Once the heap is full, a garbage collection happens. What if I run a Cypher statement that produces a result that cannot be kept in the heap, i.e. the result of the query is bigger than the heap size? Would Neo4j switch to disk, or would it produce an error?
Thanks in advance for clearing up these questions.
Best,
What is this node index? Where is it actually used?
The index is just that - a database index. A database index is what's used to help you look up nodes really quickly. Say you put 1 million :Person nodes into a database, then 1 million :Location nodes into the same database. When you MATCH (p:Person { last_name: "Smith" }), you want the database to search through only the :Person nodes, and not all 2 million. The index is what makes that happen.
Read up on indexes in Neo4j.
What is this index and where is it actually used?
The index by label is basically a searchable collection of nodes categorized by label (in this case :Person and :Location) that the database engine uses to speed up lookups. This is a greatly simplified answer, but basically accurate. This is a very good thing; you definitely want it. Performance of getting data out of the database would be quite bad without it.
Indexes are all about trading computation time and storage for better performance. Basically, the database pre-orders all of the nodes in a certain way (which costs you up-front computation time, and also a small amount of storage on disk) in exchange for having a nice data structure in place that makes queries very fast. Generally in database terms, you'll find that if you do a lot of read-only queries (fetching data) you really, really want indexes. If your workload is mostly just adding stuff (not lookups), they're not as good.
Running a Cypher query does not show that the index is being used anywhere.
Yes, it's invisible, but when you search for something in Cypher using a label, Neo4j is exploiting that index. It may be invisible, but it's being used to optimize your query.
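If you want to see it for yourself, create an index and then inspect the query plan. A small sketch, assuming Neo4j 2.x schema-index syntax and a hypothetical last_name property on :Person:

CREATE INDEX ON :Person(last_name);

// PROFILE prints the execution plan alongside the results; with the
// index in place you should see a NodeIndexSeek operator instead of
// a full NodeByLabelScan
PROFILE MATCH (p:Person { last_name: "Smith" }) RETURN p;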
What I understood is that while running Cypher queries, results are stored in the heap
Well, that's only partially true; in some sense everything in Java is stored in the heap. But results stream back from the database. If you issue a query that produces 1 million results, it is not the case that all 1 million go into the heap immediately. They get pulled in a block at a time (I don't know how many at a time; the db engine handles that). At any given time, what's in the heap is the set you need right now, not everything.
What if I run a Cypher statement that produces a result that cannot be kept in the heap, i.e. the result of the query is bigger than the heap size
See the earlier answer. You can do this without problem, because the entire set generally isn't in the heap. In database terms, we'd say you get a "cursor" back that lets you iterate through results; you do not get a huge result set back. The gotcha here is that if you have 1 million results, you can iterate through them once. Need to run through them a second time? Avoid doing that, or issue the query again.
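Here is roughly what that looks like in the embedded Java API (a sketch only; I'm assuming Neo4j 2.2+, where GraphDatabaseService.execute returns a streaming Result):

import java.util.Map;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;

// 'db' is an already-started GraphDatabaseService
try (Transaction tx = db.beginTx();
     Result result = db.execute("MATCH (p:Person) RETURN p.last_name AS name")) {
    while (result.hasNext()) {
        // each next() pulls one row; the full result set is never
        // materialized in the heap at once
        Map<String, Object> row = result.next();
        System.out.println(row.get("name"));
    }
}
// the "cursor" is now exhausted; to iterate again, re-run the query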
Would Neo4j switch to disk?
No. If any swapping to disk happened, it would be an operating system decision about your main memory. It's possible it would happen, but that wouldn't have much to do with Neo4j.
or would it produce an error
Nope, Neo4j doesn't care how big your result set is. With the "cursor" concept, you can get 1 result or 10 billion results; both will work.
The relevant part of my database schema is a table called User, which has a boolean field Admin. There was an index on this Admin field.
The day before, I had restored my full production database onto my development machine and then made only very minor changes to the database, so the two should have been very similar.
When I ran the following command on my development machine, I got the expected result:
EXPLAIN SELECT * FROM user WHERE admin IS TRUE;
Index Scan using index_user_on_admin on user (cost=0.00..9.14 rows=165 width=3658)
Index Cond: (admin = true)
Filter: (admin IS TRUE)
However, when I ran the exact same command on my production machine, I got this:
Seq Scan on user (cost=0.00..620794.93 rows=4966489 width=3871)
Filter: (admin IS TRUE)
So instead of using the exact index that was a perfect match for the query, it was using a sequential scan of almost 5 million rows!
I then tried to run EXPLAIN ANALYZE SELECT * FROM user WHERE admin IS TRUE; with the hope that ANALYZE would make Postgres realize a sequential scan of 5 million rows wasn't as good as using the index, but that didn't change anything.
I also tried to run REINDEX INDEX index_user_on_admin in case the index was corrupted, without any benefit.
Finally, I called VACUUM ANALYZE user and that resolved the problem in short order.
My main understanding of vacuum is that it is used to reclaim wasted space. What could have been going on that would cause my index to misbehave so badly, and why did vacuum fix it?
It was most likely the ANALYZE that helped, by updating the data statistics used by the planner to determine what would be the best way to run a query. VACUUM ANALYZE just runs the two commands in order, VACUUM first, ANALYZE second, but ANALYZE itself would probably be enough to help.
The ANALYZE option to EXPLAIN has nothing at all to do with the ANALYZE command. It just causes Postgres to run the query and report the actual run times, so that they can be compared with the planner's predictions (EXPLAIN without ANALYZE only displays the query plan and what the planner thinks it will cost, but does not actually run the query). So EXPLAIN ANALYZE did not help because it did not update the statistics. ANALYZE and EXPLAIN ANALYZE are two completely different actions that just happen to use the same word.
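That means the plain command is all you need (a sketch; I'm quoting the table name here because user is a reserved word in Postgres):

-- refresh the planner's statistics for the table; no query is executed
ANALYZE "user";

-- then re-check the plan: with fresh statistics and few matching rows,
-- the planner should go back to the index
EXPLAIN SELECT * FROM "user" WHERE admin IS TRUE;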
PostgreSQL keeps a number of advanced statistics about the table condition, index condition, data distribution, and so on. These can sometimes get out of sync. Running VACUUM will correct the problem.
It is likely that when you reloaded the table from scratch on development, it had the same effect.
Take a look at this:
http://www.postgresql.org/docs/current/static/maintenance.html#VACUUM-FOR-STATISTICS
A partial index seems like a good solution for your issue:
CREATE INDEX admin_users_ix ON users (admin)
WHERE admin IS TRUE;
It makes no sense to index a lot of tuples on an identical field value.
Here is what I think is the most likely explanation.
Your index is useful only when a very small number of rows are returned (by the way, I don't like to index booleans for this reason; you might consider using a partial index instead, i.e. adding a WHERE admin IS TRUE clause, since that will restrict your index to the cases where it is likely to be usable anyway).
If more than around 10% (IIRC) of the pages in the table are to be retrieved, the planner is likely to choose a lot of sequential disk I/O over a smaller amount of random disk I/O, because that way you don't have to wait for the platters to turn. Seek speed is a big issue there, and PostgreSQL will try to balance that against the amount of actual data to be retrieved from the relation.
You had gathered statistics which indicated either that the table was smaller than it actually was, or that admins made up a larger portion of the users than they actually did, and so the planner used bad information to make its decision.
VACUUM ANALYZE does three things. First, it freezes tuples visible to all transactions so that transaction wraparound is not an issue. Then it marks tuples visible to no transaction as free space. Neither of these affected your issue. However, the third is that it analyzes the tables and gathers statistics on them. Keep in mind this is a random sampling and can therefore sometimes be off. My guess is that on the previous run it grabbed pages with lots of admins and thus grossly overestimated the number of admins in the system.
This is probably a good time to double-check your autovacuum settings, because it is also possible that the statistics are very much out of date elsewhere, though that is far from certain. In particular, cost-based vacuum settings have defaults that sometimes prevent vacuum from ever fully catching up.
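A quick way to review those settings and see when your tables were last vacuumed and analyzed (standard catalog views, shown as a sketch):

-- current autovacuum-related settings
SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';

-- last manual/automatic vacuum and analyze times per table
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;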
Our index is growing relatively fast; we add 2,000-3,000 documents a day.
We are running an optimize every night.
The problem is that Solr needs double the disk space while optimizing. Currently the index has a size of 44 GB, which fits on a 100 GB partition for the next few months.
That means 50% of the disk space is unused for 90% of the day and is only needed during the optimize.
Next thing: we have to add more space to that partition periodically, which is always a painful discussion with the guys from the storage department (because we have more than one index...).
So the question is: is there a way to optimize an index without blocking additional 100% of the index size on disk?
I know that multiple cores and distributed search are an option, but that is only a fallback solution, because it would mean fundamentally changing the application.
Thank you!
There is continuous merging going on under the hood in Lucene. Read up on the merge factor, which can be set in solrconfig.xml. If you tweak this setting, you probably won't have to optimize at all.
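For example (a sketch for the 1.x/3.x-style solrconfig.xml; the value 5 is only an illustration - lower values mean more aggressive background merging and fewer segments, at the cost of more merge I/O):

<indexDefaults>
  <!-- fewer live segments are kept around between merges -->
  <mergeFactor>5</mergeFactor>
</indexDefaults>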
You can try a partial optimize by passing the maxSegments parameter.
This will reduce the index to the specified number of segments.
I suggest you do it in batches (e.g. if there are 50 segments, first reduce to 30, then to 15, and so on).
Here's the URL:
host:port/solr/CORE_NAME/update?optimize=true&maxSegments=N&waitFlush=false
where N is the number of segments you want to reduce the index to.
I am implementing Solr for a free text search for a project where the records available to be searched will need to be added and deleted on a large scale every day.
Because of the scale I need to make sure that the size of the index is appropriate.
On my test installation of Solr, I index a set of 10 documents. Then I make a change to one of the documents and want to replace the document with the same ID in the index. This works correctly and behaves as expected when I search.
I am using this code to update the document:
getSolrServer().deleteById(document.getIndexId());
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
What I noticed, though, is that when I look at the stats page for the Solr server, the figures are not what I expect.
After the initial index, numDocs and maxDocs both equal 10 as expected. When I update the document however, numDocs is still equal to 10 (expected) but maxDocs equals 11 (unexpected).
When reading the documentation I see that
maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index.
So the question is, how do I remove logically deleted documents from the index?
If these documents still exist in the index, do I run the risk of performance penalties when this runs with a very large volume of documents?
Thanks :)
You have to optimize your index.
Note that an optimize is expensive; you probably should not do it more than daily.
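With the SolrJ handle from your snippet that is a one-liner (optimize implies a commit, so none is needed afterwards):

// merge segments and expunge the logically deleted documents;
// maxDocs should drop back to match numDocs afterwards
getSolrServer().optimize();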
Here is some more info on optimize:
http://www.lucidimagination.com/search/document/CDRG_ch06_6.3.1.3
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations
Afternoon guys,
I'm using a Solr index for searching through items on my site. The search results contain an item's average rating and the number of comments the item has. The results can be sorted by both rating and number of comments.
But obviously with the Solr index, these numbers aren't updated until the DB (~2 million rows) is reindexed (probably done nightly).
What would you guys think is the best way to approach this?
Well, I think you should change your DB-to-index sync policy:
First approach: when committing database changes, also post the changes (a batch of them) to the indexes. You should write a mapper tier to map your domain objects to Solr docs (remember: persist first, and only if that goes OK, index; this works fine for us ;-)). If you want to achieve near real-time index updates, you should look at solutions like Zoie (LinkedIn's Lucene-based real-time search library).
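A sketch of that write-through approach in SolrJ (Item, dao, and mapper are hypothetical placeholders for your own tiers):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// persist first; index only if the database write succeeded
public void save(Item item) throws Exception {
    dao.persist(item);                              // hypothetical DB tier
    SolrInputDocument doc = mapper.toSolrDoc(item); // hypothetical mapper tier
    solrServer.add(doc);
    // commit in batches (or rely on autoCommit) rather than per document
}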
Second approach: take a look at delta imports (and schedule more frequent index updates).
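If you use the DataImportHandler, a delta import is triggered with a URL like this (CORE_NAME and the handler path depend on your configuration):
http://host:port/solr/CORE_NAME/dataimport?command=delta-import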