How to delete data from Lucene's index exactly by document id? - lucene

I have a collection of documents in MongoDb (url: String, title: String, content: String). url is a unique field and contains something like server://aaa/bbb/1.html.
I would like to index data with Lucene, not Mongo (I can change storage). I'm going to store url in Lucene's index. When user searchs something by keywords, I'll perform query with Lucene, read url field and go to Mongo to extract doc by the url. It works well.
But I can't delete data from Lucene's index by url because it contains a lot of not allowed symbols. I use following settins for url field:
store = true
analyzed = false
indexed = true
(Should I index this field? What if I don't index this field? Will Lucene do a full scan? Collection can contain millions of documents)
If I want to have good performance should I create secondary index (Int or Long) and don't search by url?
I use latest versions of JVM, Lucene, Ubuntu and Mongo.

You need to properly encode your URL in a query, it should help.
E.g. in your case some.url/foo should be passed in a query as some.url%2Ffoo. You could try decoding/encoding online here - http://www.url-encode-decode.com/
For more info about escaping chars in Solr query take a look here - https://wiki.apache.org/solr/SolrQuerySyntax#NOTE:_URL_Escaping_Special_Characters

Related

Sitecore: Full text search using lucene

I'm using sitecore 8 and I'm looking for a way to run a full text search for all my sitecore content. I have a solution in place, but I feel there's got to be a better way to do this.
My approach:
i have a computed field that merges all text fields into a single computed field. Before I execute a search I tokenize my search text and build a ORed predicate to match on the field.
I do not like this approach because it gets really complicated if I need to boost items that match the title vs the body i.e. i loose the field level boosting.
FYI: my code is very similar to this so post.
Thanks
Sitecore already maintains a full text field, _content, that contains all the text fields. You can run your search against that. You can even create computed fields that add to _content (such as the datasource content example here).
So assuming you are building a LINQ query for your full text search, and have already filtered on templates, latest version, location, etc., adding your search terms to the query would look something like this:
var terms = SearchTerm.Split();
var currentExpression = PredicateBuilder.True<SiteSearchResultItem>();
foreach (var term in terms)
{
//Content is mapped to _content
currentExpression = PredicateBuilder.And(currentExpression, x => x.Content.Contains(term));
}
query = query.Where(currentExpression);
Typically you would want to AND search terms rather than ORing them.
You are right that field level boosting is lost in this. In the end, Lucene is not a great solution for creating a quality full-text site search. If this is an important requirement, you may want to look at Coveo or even something like a Google Site Search.

Elastic Search Engine Without Saving Data

Does Elastic/Lucene really need to store all indexed data in a document? Couldn't you just pass data through it so that Lucene may index the words into its hash table and have a single field for each document with the URL (or what ever pointer makes sense for you) that returns where each document came from?
A quick example may be indexing Wikipedia.org. If I pass each webpage to Elastic/Lucene to index - why do I need to save each webpages' main text in a field if Lucene indexes it and has a corresponding URL field to reply for searches?
We pay the cloud so much money to store so much redundant data -- Im just wondering why if Lucene is searching from its hash table and not the actual fields we save data into... why save that data if we dont want it?
Is there a way to index full text documents in Elastic without having to save all of the full text data from those documents?
There are a lot of options for the _source field. This is the field that actually stored the original document. You can disable it completely or decide which fields to keep. More information can be found in the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

Neo4j index for full text search

I am working on neo4j database version 2.0.I have following requirements :
Case 1. I want to fetch all records where name contains some string,for example if i am searching for Neo4j then all records having name Neo4j Data,Neo4j Database,Neo4jDatabase etc. should be returned.
Case 2. When i want to fire field less query,if a set of properties is having matching value then those records should be returned or it may also be global level instead of label level.
Case Sensitivity is also a point.
I have read multiple thing about like,index,full text search,legacy index etc.,so what will be the best fit for my case,or i have to use elastic search etc.
I am using spring-data-neo4j in my application,so provide some configuration for SDN
Annotate your name with #Indexed annotation:
#Indexed(indexName = "whateverIndexName", indexType = IndexType.FULLTEXT)
private String name;
Then query for it following way (example for method in SDN repository, you can use similar anywhere else you use cypher):
#Query("START n=node:whateverIndexName({query}) return n"
Set<Topic> findByName(#Param("query") String query);
Neo4j uses lucene as backend for indexing so the query value must be a valid lucene query, e.g. "name:neo4j" or "name:neo4j*".
There is an article that explains the confusion around various Neo4j indexes http://nigelsmall.com/neo4j/index-confusion.
I don't think you need to be using elastic search-- you can use the legacy indexes or the lucene indexes to do full text searches.
Check out Michael Hunger's blog: jexp.de/blog
thix post specifically: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/

Neo4j - Querying with Lucene

I am using Neo4j embedded as database. I have to store thousands of articles daily and and I need to provide a search functionality where I should return the articles whose content match to the keywords entered by the users. I indexed the content of each and every article and queried on the index like below
val articles = article_content_index.query("article_content", search string)
This works fine. But, its taking lot of time when the search string contains common words like "the", "a" and etc which will be present in each and every article.
How do I solve this problem?
Probably a lucene issue.
You can configure your own analyzer which could leave off those frequent (stop-)words:
http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html
http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/Analyzer.html
http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
You might configure article_content_index as fulltext index, see http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html. To switch to using fulltext index, you first have to remove the index and the first usage of IndexManager.forNodes(String, Map) needs to configure the index on creation properly.

Will Lucene ALWAYS return ALL the documents that match my query same way as the SQL select query does?

I'm using Lucene to index the values that I'm storing in an object database. I'm storing a reference (UUID) to the object along with the field names and their corresponding values (Lucene Fields) in the Lucene Document.
My question is will Lucene ALWAYS return ALL the documents that match my query?
Thanks.
it depends on analyzer which you are using and also you can limit the no of result while searching.
for better searching you also can use Apache's open source search platform - Solr.