Does Elastic/Lucene really need to store all indexed data in a document? Couldn't you just pass data through it, so that Lucene indexes the words into its inverted index, and keep a single field per document with the URL (or whatever pointer makes sense for you) that says where each document came from?
A quick example would be indexing Wikipedia.org. If I pass each webpage to Elastic/Lucene to index, why do I need to save each webpage's main text in a field when Lucene has already indexed it and has a corresponding URL field to answer searches with?
We pay cloud providers so much money to store so much redundant data. I'm just wondering: if Lucene searches its inverted index rather than the actual fields we save data into, why save that data if we don't want it?
Is there a way to index full-text documents in Elastic without having to save all of the full text from those documents?
There are a lot of options for the _source field, which is the field that actually stores the original document. You can disable it completely or decide which fields to keep. More information can be found in the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
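For example, here is a minimal sketch using the low-level Java REST client, assuming Elasticsearch 7.x; the index name pages and the fields url and body are made up for illustration:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class DisableSourceDemo {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Create an index with _source disabled; only the stored url field
            // can be returned, while body is searchable but never stored.
            Request req = new Request("PUT", "/pages");
            req.setJsonEntity(
                "{\n" +
                "  \"mappings\": {\n" +
                "    \"_source\": { \"enabled\": false },\n" +
                "    \"properties\": {\n" +
                "      \"url\":  { \"type\": \"keyword\", \"store\": true },\n" +
                "      \"body\": { \"type\": \"text\" }\n" +
                "    }\n" +
                "  }\n" +
                "}");
            client.performRequest(req);
        }
    }
}

With _source disabled you would ask for the url via stored_fields in the search request. Keep in mind that disabling _source also disables features that rely on it, such as the update and reindex APIs and highlighting.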
I have a Lucene index and the document text is 'indexed' but not 'stored'.
I am using Luke v7.6.0 and it's great for 'visualising' the index.
Obviously, because my document text is indexed but not stored, I cannot copy or query the 'stored' value (there isn't one). But can I somehow extract the indexed text values to the clipboard or a text file, so that I can analyse exactly what was indexed from my file?
One option available to you is to inspect the Lucene index files manually.
I suspect that the most important ones here are the term dictionary files (*.tim).
I indexed a document with no stored values: the term test#test.com in the field email (a TextField with the Standard analyzer) and John in the field name (a StringField).
After that, I opened the .tim file with a hex editor and could see the indexed terms directly.
You could clearly see the values test, test and com, which were tokenized by the Standard analyzer, while John stayed the same, since I used a StringField. In my other examples, I was able to see the effect of lowercasing as well.
Just a reminder if you would like to repeat this: by default, for small indices, Lucene puts everything into a compound file, which gets in the way of this specific kind of debugging. You can disable it with setUseCompoundFile(false) on the IndexWriterConfig.
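If you want to reproduce this, here is a minimal sketch under those assumptions (Lucene 7/8-era APIs; the index path debug-index is made up):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class TimInspectionDemo {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setUseCompoundFile(false); // keep *.tim as a separate file instead of a .cfs
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("debug-index")), cfg)) {
            Document doc = new Document();
            // indexed but not stored, so only the inverted index holds these values
            doc.add(new TextField("email", "test#test.com", Field.Store.NO));
            doc.add(new StringField("name", "John", Field.Store.NO));
            writer.addDocument(doc);
        }
        // now open the term dictionary (something like debug-index/_0.tim) in a hex editor
    }
}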
I have a Solr index generated from a catalog of PDF files and corresponding metadata fields pertaining to the PDF files themselves. I would like to give my users an option to exclude from a query any text indexed from within the PDFs, so that the results are based on the metadata fields alone and not biased by the vast amount of text inside the PDF files.
I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.
Is there another way?
Sounds like you are doing a general search against a default field, which means you have a lot of copyField instructions (or just one copyField * -> text) that include the PDF content field.
You can create a second destination field and copyField everything except the PDF content field into it as well. This way, users can search against one combined field or the other.
However, remember that this parses all content according to the analysis chain of the destination field. So, eDisMax with a list of source fields may be a better approach there. And, remember, you can use several request handlers (like 'select') and define different default parameters there. That usually makes the client code a bit easier.
You do not need two separate indexes. You can use the eDisMax parser and specify the qf parameter at query time; qf determines which fields are searched.
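A minimal SolrJ sketch of this qf approach; the core name pdfcatalog and the metadata field names (title, author, subject, keywords) are illustrative, not from the question:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MetadataOnlySearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/pdfcatalog").build()) {
            SolrQuery q = new SolrQuery("annual report");
            q.set("defType", "edismax");
            // metadata-only search: the PDF body field is simply left out of qf
            q.set("qf", "title author subject keywords");
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getResults().getNumFound() + " hits");
        }
    }
}

Exposing this as an option in the UI is then just a matter of switching between two qf values, one that includes the PDF content field and one that does not.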
You can also look at field aliases.
If you have the index fields
pdfmeta
pdftext
then you can create two field aliases:
quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext
One advantage of using a field alias over qf is that if your users have bookmarks like q=quicksearch:value, you can change what quicksearch maps to without affecting the user's bookmark.
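With eDisMax, such aliases can be defined per request through f.&lt;alias&gt;.qf parameters; a minimal SolrJ sketch (core name and query value are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FieldAliasSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/pdfcatalog").build()) {
            SolrQuery q = new SolrQuery("quicksearch:budget");
            q.set("defType", "edismax");
            // f.<alias>.qf defines the alias; quicksearch and fullsearch
            // never have to exist as real fields in the schema
            q.set("f.quicksearch.qf", "pdfmeta");
            q.set("f.fullsearch.qf", "pdfmeta pdftext");
            System.out.println(solr.query(q).getResults().getNumFound() + " hits");
        }
    }
}

In practice you would put the f.quicksearch.qf and f.fullsearch.qf parameters into the request handler defaults in solrconfig.xml, so the alias definitions live server-side; that is what lets you change them later without breaking bookmarks.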
Will someone please explain under what circumstances I should use Field.Store.NO instead of Field.Store.YES? I am extremely new to Lucene and am trying to create a document. To my basic knowledge, I am doing:
doc.add(new StringField(fieldNameA,fieldValueA,Field.Store.YES));
doc.add(new TextField(fieldNameB,fieldValueB,Field.Store.YES));
There are two basic, independent ways a field can be written into Lucene.
Indexed - The field is analyzed and indexed, and can be searched.
Stored - The field's full text is stored and will be returned with search results.
If a field is indexed but not stored, you can search on it, but its value won't be returned with the search results.
One reasonably common pattern is to use Lucene for search but store only an ID field, which can then be used to retrieve the full contents of the document/record from, for instance, a SQL database, a file system, or a web resource.
You might also opt not to store a field when that field is just a search tool that you wouldn't display to the user, such as a soundex/metaphone key or an alternate analysis of a content field.
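A minimal sketch of the indexed-versus-stored distinction and the ID-only pattern, using an in-memory index and made-up field names (Lucene 8.x-era APIs assumed):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StoreNoDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));                  // stored: comes back with hits
            doc.add(new TextField("body", "the quick brown fox", Field.Store.NO)); // searchable only
            w.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc hit = searcher.search(new TermQuery(new Term("body", "fox")), 1).scoreDocs[0];
            Document found = searcher.doc(hit.doc);
            System.out.println("id = " + found.get("id"));     // prints 42; use it to fetch the record elsewhere
            System.out.println("body = " + found.get("body")); // prints null: indexed but not stored
        }
    }
}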
Use Field.Store.YES when you need the field's value back from the Lucene document. Use Field.Store.NO when you only need to search on the field. Here is a link that explains it with a scenario:
https://handyopinion.com/java-lucene-saving-fields-or-not/
Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field, and the text in this field has only a few distinct variations across all 10 million documents. The field is large-ish, 400-800 words or so.
What is the most appropriate way to eliminate this repetition of data in the 'description' field?
Let me elaborate. Here is an example schema that has been simplified:
Doc_id <-- this is unique
Title <-- always unique as well
Description <-- contains mostly dupe data
I search over both the title and description but only return the title itself.
I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.
Thanks for any help!
I will try to provide some strategies to tackle your problem.
You say that you search over title and description; this means both fields should be set to indexed=true in your schema.xml. Since only title is returned, only title needs to be set to stored=true, and description can be set to stored=false. See this posting for more information on stored vs. indexed: Solr index vs stored
Another useful option you could try is field compression. If you need to store a field, you can use gzip compression on certain field types, such as TextField and StrField; see https://wiki.apache.org/solr/SchemaXml for more info.
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I have not tried this feature, but from the sounds of it, you can prevent (nearly) duplicate documents from being indexed, or tag duplicates. Maybe its goal, "Allow for both duplicate collapsing in search results as well as deduplication on adding a document," is what you are looking for?
I'm using ColdFusion 9.0.1 and the integrated SOLR full text search engine.
I have dates stored in my SQL Server database as datetime fields for upcoming events. I took these records and inserted them into a SOLR collection with the custom3 and custom4 fields being the dateStart and dateEnd dates respectively. Users want to query the collection against a date range and sort by closest date to now.
First question: How do we set the datatype for the custom1-4 fields? Or, can we? Based on this post, Optimizing Solr for Sorting, the field should be set to either tdate or date rather than string for best performance. Or does SOLR automatically make the field have the correct datatype based on this post, Sort by date in Solr/Lucene performance problems?
Second question: How would the search criteria be structured to pull records? How about between May 1, 2011 and July 31, 2011, for example?
I don't tell too many people this, but for you, I believe it's time to ditch CFINDEX/CFSEARCH, and start using Solr directly.
CF's implementation is built for indexing a large block of text with some attributes, not a query. If you start using Solr directly, you can create your own schema, and have far more granular control of how your search works. Yes, it's going to take longer to implement, but you will love the results. Filtering by date is just the beginning.
Here's a quick overview of the steps:
Create a new index using the CFAdmin. This is the easy way to create all the files you need.
Modify the schema. The schema is in [cfroot]/solr/multicore/[your index name]/conf/
The top half of the schema is <types>. This defines all the datatypes you could use. The bottom half is the <fields>, and this is where you're going to be making most of your changes. It's pretty straightforward, just like a table. Create a field for each "column" you want to include. "indexed" means that you want to make that field searchable. "stored" means that you want the exact data stored, so that you can use it to display results. Because I'm using CF9's ORM, I don't store much beyond the primary key, and I use loadEntityByPK() on my results page.
After modifying the schema, you need to restart the solr service/daemon.
Use http://cfsolrlib.riaforge.org/ to index your data (the add method is an 'insert or modify' style method) and to perform the search.
To do a search, check out this example. It shows how to sort and filter by date. I didn't test it, so the format of the dates might be wrong, but you'll get the idea. http://pastebin.com/eBBYkvCW
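For reference, here is a rough SolrJ sketch of that date filter and sort, assuming the schema maps custom3 to a tdate field named dateStart; the core URL and field name are illustrative, and this uses a modern SolrJ client rather than anything bundled with CF9:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class EventDateSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/events").build()) {
            SolrQuery q = new SolrQuery("*:*");
            // Solr date syntax is ISO 8601 in UTC with a trailing Z
            q.addFilterQuery("dateStart:[2011-05-01T00:00:00Z TO 2011-07-31T23:59:59Z]");
            q.addSort("dateStart", SolrQuery.ORDER.asc); // closest upcoming date first
            System.out.println(solr.query(q).getResults().getNumFound() + " events");
        }
    }
}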
Sorry this answer is so general; I hope I can get you going down the right path here :)