I've set up a basic implementation of ElasticSearch, storing a couple of fields in the document and I'm able to execute queries.
var searchResult = client.Search<SearchTest>(s => s
        .Size(1000)
        .Fields(f => f.ID)
        .Query(q => q.QueryString(d => d.Query(query)))
    )
    .Documents.Select(item => item.ID)
    .ToList();

var products = this.DbContext.Products
    .Where(item =>
        searchResult.Contains(item.ProductId)
        && ...
    )
    .Select(item => ...);
// subsequent queries here
Right now I simply return the IDs and use them in database queries to retrieve a whole lot of information; the information that is also stored in the documents ends up being fetched from the database again. Now I'm wondering: should I skip retrieving that data from the database and use what's in the document store instead? Or should I use Elasticsearch for nothing but searching?
Some context: searching in a product database, some information is always the same, some information (like price calculation) depends on which customer is searching.
There isn't really a hard and fast answer to this question. I like to pull enough information from the index to populate a list of search results, but retrieve the full contents of a document from other, external sources (e.g. a database). Entirely subjectively, this seems to be the more common use of Lucene, from what I've seen.
Storage strategy, as far as I know, should not have a direct impact on search performance, but keeping the data stored per document to a minimum will improve the performance of retrieving documents from the index (i.e. for that list of results mentioned before).
I'm also sometimes hesitant to make Lucene the system of record. It seems to be much easier to find yourself with a broken/corrupt index than a database. I like having the option available to trash and rebuild it.
I see you already accepted an answer, but I'd like to offer a second approach.
Elasticsearch excels at storing documents (JSON), so retrieving complete object graphs can be a very fast and powerful way to overcome the impedance mismatch and N+1-prone database queries.
To me the best approach would be for searchResult to already be the definitive IEnumerable<Product>, without having to do N database queries afterwards.
Elasticsearch (unlike raw Lucene or even Solr) has a special field called _source that stores the original JSON graph, so the overhead of loading your whole document is very minimal.
This comes at the cost of having to basically write your data twice, once to the database and once to Elasticsearch, on every mutation. Depending on your architecture this may or may not be achievable.
I agree with @femtoRgon that being able to reindex from an external datasource is a good idea, but the Elasticsearch developers are working very hard to get proper backup and restore support into 1.0, which will greatly reduce the need for the second datastore.
BTW, not sure if you are aware, but specifying .Fields() already forces Elasticsearch to load only the specified fields, instead of the whole graph from the special _source field.
Related
We are trying to implement logging/statistics with Aerospike. We have logged-in users and anonymous users making queries to our main database, and we want to store every request that is made.
Our best approach so far is to store records with the UserID as the key and the query keywords as a list, like this:
{
    Key: 'alacret',
    Bins: {
        searches: [
            "something to search 1",
            "something to search 2",
            "something to search 3",
            ...
        ]
    }
}
As the application architect, reviewing this, I see several performance/design pitfalls:
1) Retrieving and storing are two operations: getting the whole list, appending to it, and then putting it back seems inefficient or suboptimal.
2) Doing two operations means I have to wrap both in a transaction to prevent race conditions, which I think would kill Aerospike's performance.
3) The documentation states that lists are data structures for size-bounded data, so if I understand correctly this is not going to scale well, especially for anonymous users, whose list would keep growing.
As an alternative, I'm proposing to move the UserID into a bin and to generate a key that prevents race conditions, keeping the save as a single operation rather than several operations in a transaction.
So, what I'm looking for are opinions and validations.
Greetings
You can append to the list or prepend to it. You can also limit it by trimming: if beyond a certain limit you don't care to store the search items, i.e. you only want to keep, say, the 100 most recent items in your UserID search list, you can do the append and the trim, then read back the updated list, all under one record lock. If you are storing on disk, the record size is limited to 1MB including all overhead; you can store a much, much larger record if the data lives only in RAM (storage-engine memory). Does that suit your application's needs?
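For illustration, here is a rough sketch of that append-and-trim pattern with the Aerospike Java client (the namespace/set names and the 100-item cap are assumptions, not something from your setup):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Operation;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.policy.WritePolicy;

AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
Key key = new Key("test", "searchlog", "alacret");   // hypothetical namespace/set

// Append the new search, trim the list down to the 100 most recent entries,
// and read the updated list back: all three ops run under one record lock.
Record record = client.operate(new WritePolicy(), key,
        ListOperation.append("searches", Value.get("something to search 1")),
        ListOperation.trim("searches", -100, 100),   // keep only the last 100 items
        Operation.get("searches"));

System.out.println(record.getList("searches"));
client.close();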
I have a Neo4j database whose content is generated dynamically from a big dataset.
All “entry point” nodes are indexed in a named index (IndexManager.forNodes(…)), so I can look up a particular “entry point” node.
However, I would now like to enumerate all those specific nodes, but I don't know which keys they were indexed under.
Is there any way to enumerate all keys of a Neo4j Index?
If not, what would be the best way to store those keys, given that a plain list of keys is an eminently non-graph-oriented kind of data?
UPDATE (thanks for asking for details :) ): the list would contain more than 2 million entries. The main use case is to never update it after an initialization step, but other use cases might need it, so it has to be somewhat scalable.
Also, I would really prefer not to hurt my current resilience guarantees, so storing all keys at once, as opposed to adding them incrementally, would be a last-resort solution.
I would either use a different data store to supplement Neo4j (I like Redis) or try @MattiasPersson's suggestion and store the list on a node.
Is it just one list of keys or is it a list per node? You could store such a list on a specific node, say the reference node.
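For instance, a minimal sketch of that idea with the embedded Java API (the property name indexKeys is invented; a list of millions of keys would likely need chunking across several nodes):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// graphDb is an already started, embedded GraphDatabaseService.
Transaction tx = graphDb.beginTx();
try {
    Node ref = graphDb.getReferenceNode();   // the well-known entry node
    ref.setProperty("indexKeys", keys);      // keys is a String[]; array properties are supported
    tx.success();
} finally {
    tx.finish();
}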
Instead of using a different store, which increases complexity, you could try again with:
1) Lucene indices. Normally Lucene is able to handle this easily, especially now that MatchAllDocsQuery is better; one problem, though, is that the Neo4j guys are using a very old Lucene version.
2) a special "reference" field in every node, specifically for this key-traversal case, linking to the next node, where you easily get ALL properties :) (a sketch follows)
If you want to get all nodes that were indexed in a particular index, you can just do:
IndexHits<Node> hits = graphDb.index().forNodes("<INDEX_NAME>").query("*:*");
try {
    while (hits.hasNext()) {
        Node n = hits.next();
        // ...process the node...
    }
} finally {
    hits.close();
}
I am working on a semantic search system that stores a huge amount of data. The data are actually documents and their indexes. The main problems are how to index documents using ontologies and how to store them.
My question is about the second problem. At first I implemented storage in an RDBMS; it works very slowly. I am considering using some NoSQL database for this purpose, but I have some doubts.
Please note that simple text search using Lucene is not what I need here.
Let me simplify the storage structure. Note that only inverted indexes are stored. In the RDBMS we have these tables:
1) Word - words from some dictionary
2) Document - a document with its metadata and content
3) Hit - a word's hits in a document (all hits separated by '|')
To answer a request, the system analyses the words in it and calculates document relevance based on each word's hit info. I have omitted some details about the semantic analysis; they are not important for now.
What do you think about this structure for storing a word?
{
    "word": "some_word",
    ...
    "some other metadata from the dictionary"
    ...
    "hits": {
        "doc1": [ "hit_info1", "hit_info2", ... ],
        "doc2": [ "hit_info1", "hit_info2", ... ]
    }
}
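For what it's worth, the same structure expressed as plain Java types (class and field names are just for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One entry per dictionary word of the inverted index.
class WordEntry {
    String word;
    Map<String, Object> metadata = new HashMap<>();     // other dictionary metadata
    Map<String, List<String>> hits = new HashMap<>();   // document id -> that word's hit infos
}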
Thanks in advance!
First of all, an RDBMS is a good choice for highly structured data. The major performance problem with an RDBMS is transaction processing. You are trying to manage an n:m relation between words and documents; that is not something a plain file system handles well. Use an SQL server and follow the hints below, and it should be fast enough.
You should also consider an ORM (object-relational mapping) framework that supports "generalized batching". For C# and .NET I can recommend "DataObjects.NET". It saves you a lot of work optimizing client/server round trips.
Make your transactions as large as possible. If you have a document with 1000 words, process it in one transaction. Maybe you can even process multiple documents in one transaction.
Form your inserts in two batches (a batch is a bunch of SQL commands sent in one piece to the server):
1) query all the missing words for your document;
2) insert the document, the missing words, and the relations in one round trip.
It is absolutely important to do this in batches; if you execute single statements, you will drown in client/server round trips.
I have similar data to process, and a large batch (100,000 words) is done in about 0.2-0.5 seconds.
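As a rough illustration of the batching idea with plain JDBC (an ORM would handle this for you; the table and column names here are invented):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

// Insert all of a document's hits in one round trip instead of one statement per word.
void insertHits(Connection conn, long docId, Map<Long, String> hitsByWordId) throws SQLException {
    conn.setAutoCommit(false);   // one transaction for the whole document
    PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO hit (word_id, document_id, hits) VALUES (?, ?, ?)");
    try {
        for (Map.Entry<Long, String> e : hitsByWordId.entrySet()) {
            ps.setLong(1, e.getKey());
            ps.setLong(2, docId);
            ps.setString(3, e.getValue());   // e.g. "pos1|pos2|pos3"
            ps.addBatch();                   // queued locally, no round trip yet
        }
        ps.executeBatch();                   // all rows sent to the server in one piece
        conn.commit();
    } finally {
        ps.close();
    }
}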
P.S. Also consider disabling flushing to disk at transaction end on your SQL server.
I would like to use Lucene to index a table in an existing database. I have been thinking the process would be something like this:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
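For what it's worth, here is roughly what I have in mind, using the classic Lucene Field API (column names invented; this uses the 3.x-style Store/Index constants referred to later in this thread):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Build one Lucene Document per table row.
Document doc = new Document();
// Primary key: stored so it can be read back, indexed but not analyzed.
doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
// A small column: stored and analyzed for full-text search.
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// The huge column: analyzed for search; whether to store it is exactly the question below.
doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);   // writer is an open IndexWriter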
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the Field regardless of its size and, if a hit is found for a search, fetch the appropriate Field from the Document
Don't store the Field and, if a hit is found for a search, query the database to get the relevant information out
I realize there may not be a one size fits all answer ...
For sure, your system will be more responsive if you store everything in Lucene. A stored field does not affect query time; it only makes your index bigger, and probably not that much bigger if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put real stress on the GC;
stored fields are likely to increase reader reopen time.
The final answer, of course, is: it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can judge whether each column actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing unnecessary data to the inverted index.
Likewise, you can judge whether a column will ever need to be retrieved by key-value lookup. If not, you needn't use Store.YES to store the data.
What is storing and what is indexing a field when it comes to searching?
Specifically I am talking about MySQL or SOLR.
Is there any thorough article about this? I have searched without luck!
Thanks
Storing information in a database just means writing the information to a file.
Indexing a database involves looking at the data in a table and creating an 'index', which is then used to perform a more efficient lookup in the table when you want to retrieve the stored data.
From Wikipedia:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records. The disk space required to store the index is typically less than that required by the table (since indexes usually contain only the key-fields according to which the table is to be arranged, and excludes all the other details in the table), yielding the possibility to store indexes in memory for a table whose data is too large to store in memory.
Storing is just putting data in the tables.
Storing vs. indexing is a SOLR concept.
In SOLR, a field that is stored but not indexed cannot be searched for or sorted on; it can only be retrieved as part of the results of a query that matches on an indexed field.
In MySQL, by contrast, you can search and sort on unindexed fields too: it will just be slower, but it is still possible (unlike in SOLR).
Storing data is just storing data somewhere so you can retrieve it later. Where indexing comes in is retrieving parts of the data efficiently. Wikipedia explains the idea quite well.
Storing is just that: saving the data to disk (or wherever) so that the database can retrieve it later on demand.
Indexing means creating a separate data structure to optimize locating and retrieving that data, faster than simply reading the entire database (or the entire table) and looking at each and every record until the searching algorithm finds what you asked for. Generally, databases use what are called balanced-tree indices, an extension of the concept of a binary tree; look up "binary tree" on Google/Wikipedia for a more in-depth understanding of how this works.
For example, given the data:
L1. This
L2. Is
L3. My Data
the index would be:
This -> L1
Is -> L2
My -> L3
Data -> L3
The data/index analogy holds for books as well.