In Elasticsearch, is a collection queryable while it's being added to the index? - NEST

I'm using NEST to insert a list of 60k+ objects into Elasticsearch, specifically calling client.IndexMany(list, indexName).
As the list is inserted, is it queryable? Or is it only queryable
after the complete list is indexed?
If it's the former, is there a way to force it to only be queryable after the list is fully indexed?

Ad 1. The answer is no: a document isn't immediately available for search after indexing.
The Definitive Guide has a really nice chapter on why Elasticsearch works this way. You should have a look at these answers for a quick explanation.
Ad 2. To refresh your index, call elasticClient.Refresh().
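For example (a minimal sketch built from the calls named above; exact method signatures vary between NEST versions):

    // "settings", "list", and "indexName" are assumed from the question;
    // this is a sketch, not a drop-in implementation.
    var client = new ElasticClient(settings);

    // Bulk-index the documents. They only become visible to search after a
    // refresh, which Elasticsearch otherwise performs on an interval
    // (1 second by default).
    client.IndexMany(list, indexName);

    // Force a refresh so the whole batch is searchable from this point on.
    client.Refresh();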
Hope this helps you.

Related

Managing the neo4j index's life cycle (CRUD)

I have limited (and disjointed) experience with databases, and nearly none with indexes. Based on web searches, reading books, and working with ORMs, my understanding can be summed up as follows:
An index in a database is similar to a book index in that it lists "stuff" that's in the book and tells you where to find it. This helps with lookup efficiency (though that is probably not the only benefit).
In (at least some) RDBMSs, primary key fields get automatically indexed, so you never have to directly manipulate them.
I'm tinkering with Neo4j, and it seems you have to be deliberate about indexes, so now I need to understand them. But I cannot find clear answers to:
How are indexes managed in Neo4j?
I know there's automatic indexing; how does it work?
If you choose to manually manage your own indexes, what can you control about them? Perhaps the index name, etc.?
Would appreciate answers or pointers to answers, thanks.
Neo4j uses Apache Lucene under the covers to provide index-engine-like capabilities for your data. You can index nodes and/or relationships; an index helps you look up a particular instance/set of nodes or relationships.
Manual Indexing:
You can create as many node/relationship indexes as you want, and you can specify a name for each index. The config can also be controlled, i.e. whether you want exact matching (the default) or Lucene's full-text indexing support. Once you have the index, you simply add nodes/relationships to it along with the key/value you want indexed. You do, however, need to take care of "updating" data in the index yourself if you make changes to the node properties.
Auto-Indexing:
Here you get one index for nodes and one index for relationships if you turn them on in the neo4j.properties file. You may specify which properties are to be indexed, and from the point of turning them on, the index is automatically managed for you, i.e. any nodes created after this point are added to the index and updated/removed automatically; an example config is sketched below.
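For instance, the pre-2.0 auto-indexing settings in neo4j.properties look roughly like this (the property keys come from the Neo4j docs; the indexed property names are only examples):

    # Enable node auto-indexing and choose which properties get indexed
    node_auto_indexing=true
    node_keys_indexable=name,age

    # The same pair of settings exists for relationships
    relationship_auto_indexing=true
    relationship_keys_indexable=since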
More reading:
http://docs.neo4j.org/chunked/stable/indexing.html
The above applies to versions < 2.0.
2.0 adds more around the concept of indexing itself; you might want to go through:
http://www.neo4j.org/develop/labels
http://blog.neo4j.org/2013/04/nodes-are-people-too.html
Hope that helps.

Can a RavenDB collection be forced to stay in memory?

Can I force a RavenDB collection to stay in memory so that queries are fast? I read about aggressive caching, but the documentation only talks about request caching. If I have sharding enabled, can I force all the shards to cache the collection in memory?
Any help is appreciated,
Thanks
RavenDB doesn't really have "Collections" in the sense you are thinking. The only thing that collections are used for is to filter documents by their Raven-Entity-Name metadata. This serves a few purposes:
The Raven Studio UI can group things to make them easier to find.
Indexes can use a shortcut form of docs.EntityName instead of having a where clause against the metadata in every index (see the sketch below).
But that's pretty much it. They aren't isolated on disk. For example, when Raven indexes documents, every index considers all documents. Docs get discarded quickly if they don't pass the collection filter, but they are still put through the pipeline.
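For illustration, a static index that uses the shortcut might look like this (a sketch only; the Product class and its Name property are hypothetical):

    using System.Linq;
    using Raven.Client.Indexes;

    public class Product { public string Name { get; set; } }

    // The strongly-typed equivalent of "from doc in docs.Products
    // select new { doc.Name }": the Products part filters on the
    // Raven-Entity-Name metadata rather than on a separate store.
    public class Products_ByName : AbstractIndexCreationTask<Product>
    {
        public Products_ByName()
        {
            Map = products => from p in products
                              select new { p.Name };
        }
    }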
You can read more about collections here.
Also: as long as you are still in a learning phase, you may want to post this style of question on the RavenDB Google Group instead; you will get a much better response. You won't get much of a rating on Stack Overflow when you are asking non-code "can X do Y?" questions. Come back here when you have written some code. See the ravendb tag for other questions that have been answered, and you'll get a feel for what Stack Overflow is for. Thanks.
You don't need to do that.
RavenDB will automatically detect usage patterns and keep frequently requested documents in memory.
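If you still want to lean on caching explicitly, the aggressive caching the question mentions is scoped on the client. A sketch assuming the RavenDB .NET client, with a hypothetical Product class and document id ("store" is an existing IDocumentStore):

    using System;
    using Raven.Client;

    // Inside this scope, repeated requests for the same data can be served
    // from the client-side cache without a round trip to the server.
    using (store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
    using (var session = store.OpenSession())
    {
        var product = session.Load<Product>("products/1"); // hypothetical id
    }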

Lucene: Getting terms from an unstored field

Is there any way to retrieve all the terms in a particular field which is, unfortunately, not stored? I cannot rebuild the index. Position-based information is not necessary; I just need the list of terms.
UPDATE
I've built a sample index with one stored and one unstored field and tested it with Luke. I was wondering whether I could get access to all the terms just like Luke does. This may not be the brightest idea, but it might work.
Lucene uses two different concepts: indexing and storing. If you want to extract the terms, you don't need to store anything. You can use Luke, as well as iterate over the terms through the API. For the Java API, see [1]: How can I get the list of unique terms from a specific field in Lucene?
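For example, with the .NET port the enumeration looks roughly like this (a sketch assuming Lucene.Net 3.x; the Java API is analogous, and the field name is illustrative):

    using System;
    using System.IO;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    // Walk the term dictionary for one field; no stored values are needed,
    // because the terms live in the inverted index itself.
    static void DumpTerms(string indexPath, string field)
    {
        var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
        using (var reader = IndexReader.Open(dir, true /* read-only */))
        using (var terms = reader.Terms(new Term(field, "")))
        {
            do
            {
                var term = terms.Term;
                if (term == null || term.Field != field) break; // past the field
                Console.WriteLine(term.Text);
            } while (terms.Next());
        }
    }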
Luke is Open Source, so just look at how Luke does it.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is to break down content by word count and/or create a complex set of search rules to give one piece of content priority over another. This might be more trouble than it's worth, but I was wondering if anybody knows a way or a product out there that would be able to help me.
To further support Ivan's answer above: Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of it too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters, you can just dump all of your text into the index and let the engine search on it. However, I'd recommend adding fixed fields to your index, which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting a search to those documents that have that website in that field; a sketch follows below.
The other approach is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written; it may simply allow you to collect your searches into more logical units, which helps with your solution.
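Here's what the website-field idea might look like (a sketch assuming Lucene.Net 3.x; field names and the site value are illustrative):

    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // Index time: store the source website as a single exact (NOT_ANALYZED)
    // term next to the analyzed page text.
    static Document BuildDocument(string site, string pageText)
    {
        var doc = new Document();
        doc.Add(new Field("website", site, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("content", pageText, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    // Query time: restrict any user query to documents from one website.
    static Query PartitionBySite(Query userQuery, string site)
    {
        var query = new BooleanQuery();
        query.Add(userQuery, Occur.MUST);
        query.Add(new TermQuery(new Term("website", site)), Occur.MUST);
        return query;
    }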
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server, then its full-text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or search online for specifics.

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is whether you want one index that covers all of your tables or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from; a sketch of both ideas follows.
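A sketch of both points, assuming SQL Server, Lucene.Net 3.x, and a hypothetical pages table with a last_modified column:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using Lucene.Net.Documents;

    // Identify rows changed since the last indexing run and turn each one into
    // a Lucene document; the "table" field records the source table so a single
    // shared index can serve every table.
    static IEnumerable<Document> FetchChangedPages(SqlConnection conn, DateTime lastRun)
    {
        using (var cmd = new SqlCommand(
            "SELECT id, title, body FROM pages WHERE last_modified > @since", conn))
        {
            cmd.Parameters.AddWithValue("@since", lastRun);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    var doc = new Document();
                    doc.Add(new Field("table", "pages", Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("id", reader["id"].ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("body", (string)reader["body"], Field.Store.NO, Field.Index.ANALYZED));
                    yield return doc;
                }
            }
        }
    }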
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach; I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well, but there'll be some DRY violations between the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
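A minimal data-config.xml has this shape (the JDBC details, table, and column names here are placeholders):

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb"
                  user="db_user" password="db_pass"/>
      <document>
        <entity name="page" query="SELECT id, title, body FROM pages">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>

Once that is wired into solrconfig.xml, hitting /dataimport?command=full-import on the core runs the import.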
As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.