How to efficiently store document indexes

I am working on a semantic search system that stores a huge amount of data. The data consists of documents and their indexes. The main problems are how to index documents using ontologies and how to store them.
My question is about the second problem. At first I implemented storage in an RDBMS, but it works very slowly. I am considering a NoSQL database for this purpose, but have some doubts.
Please note that simple text search using Lucene is not what I need here.
Let me simplify the storage structure. Note that only inverted indexes are stored. In the RDBMS we have the tables:
1) Word - words from some dictionary
2) Document - a document with its metadata and content
3) Hit - a word's hits in a document (all hits separated by '|')
To produce a result, the system analyzes the words in the request and calculates document relevance based on each word's hit info. I have omitted some details about the semantic analysis; they are not important for now.
What do you think about this structure for storing a word?
{
    "word": "some_word",
    ...
    "some other metadata from the dictionary"
    ...
    "hits": {
        "doc1": [ "hit_info1", "hit_info2", ... ],
        "doc2": [ "hit_info1", "hit_info2", ... ]
    }
}
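For illustration, here is how I imagine maintaining this structure in a document store (MongoDB via pymongo is chosen purely as an example, since I haven't settled on a specific NoSQL database; the collection and field names are placeholders):
import pymongo

client = pymongo.MongoClient()
words = client.search_db.words

def add_hits(word, doc_id, hit_infos):
    # upsert the word document and append this document's hits
    words.update_one(
        {"word": word},
        {"$push": {"hits." + doc_id: {"$each": hit_infos}}},
        upsert=True,
    )

add_hits("some_word", "doc1", ["hit_info1", "hit_info2"])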
Thanks in advance!

First of all, an RDBMS is a good choice for highly structured data. The major performance problem with an RDBMS is transaction processing. You are trying to manage an n:m relation between words and documents, which is hard to do well in a plain file system. Use an SQL server and follow the hints below; then it should be fast enough.
You should also consider an ORM (object-relational mapping) framework that supports "generalized batching". For C# and .NET I can recommend DataObjects.NET. It saves you a lot of work optimizing client/server round trips.
Make your transactions as large as possible. If you have a document with 1000 words, process it in one transaction. Maybe you can even process multiple documents in one transaction.
Form your inserts in two batches (a batch is a bunch of SQL commands sent to the server in one piece):
Query all words of your document that are missing from the Word table
Insert the document, the missing words, and the relations in one round
It is absolutely important to do this in batches; if you perform single statements, you will drown in client/server round trips.
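For illustration, a minimal sketch of this two-batch pattern in Python with psycopg2 (the word/document/hit table and column names are my assumptions, not from the question; execute_values sends many rows per round trip):
from psycopg2.extras import execute_values

def store_document(conn, doc_id, content, words):
    # words maps word text -> list of hit_info strings
    with conn:  # one transaction per document
        with conn.cursor() as cur:
            # Batch 1: find which words already exist in the dictionary.
            cur.execute("SELECT text, id FROM word WHERE text = ANY(%s)",
                        (list(words),))
            ids = dict(cur.fetchall())
            missing = [w for w in words if w not in ids]
            # Batch 2: insert the document, the missing words,
            # and the hit relations.
            cur.execute("INSERT INTO document (id, content) VALUES (%s, %s)",
                        (doc_id, content))
            if missing:
                ids.update(execute_values(
                    cur,
                    "INSERT INTO word (text) VALUES %s RETURNING text, id",
                    [(w,) for w in missing], fetch=True))
            execute_values(
                cur,
                "INSERT INTO hit (word_id, doc_id, hit_info) VALUES %s",
                [(ids[w], doc_id, "|".join(hits))
                 for w, hits in words.items()])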
I have similar data to process, and for a large batch (100,000 words) this is done in about 0.2-0.5 seconds.
P.S.
Also consider disabling flushing to disk at transaction end on your SQL server (e.g. synchronous_commit = off on PostgreSQL), if you can tolerate losing the most recent transactions on a crash.


Is there a NO KEY or auto-increment strategy on Aerospike that I can use for unique keys?

We are trying to implement logging/statistics with Aerospike. We have logged-in users and anonymous users making queries to our main database, and we want to store every request that is made.
Our best approach so far is to store records with the UserID as the key and the query keywords as a list, like this:
{
    Key: 'alacret'
    Bins: {
        searches: [
            "something to search 1",
            "something to search 2",
            "something to search 3",
            ...
        ]
    }
}
Reviewing this as the application architect, I see several performance/design pitfalls:
1) Retrieving and storing are two operations; getting the whole list, appending, and then putting it back seems inefficient or suboptimal.
2) Doing two operations means I have to wrap both in a transaction to prevent race conditions, which I think would kill Aerospike's performance.
3) The documentation states that lists are data structures for size-bounded data, so if I understand correctly this is not going to scale well, especially for anonymous users, who would grow the list very quickly.
As an alternative, I propose moving the UserID into a bin and generating a key that prevents race conditions, keeping the save a single operation rather than several in a transaction.
So, what I'm looking for are opinions and validation.
Greetings
You can append to the list, or prepend to it. You can also bound it by trimming: if beyond a certain limit you don't care to keep older search items (i.e. you only want to store, say, the 100 most recent items in a user's search list), you can do the append and the trim, then read back the updated list, all under one record lock. If you are storing on disk, the record size is limited to 1MB including all overhead; you can store much larger records if the data lives only in RAM (storage-engine memory). Does that suit your application's needs?
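A sketch of that single-lock pattern with the Aerospike Python client (the namespace, set, and bin names are my assumptions; the list is kept newest-first and capped at 100 entries in one operate() call):
import aerospike
from aerospike_helpers.operations import list_operations, operations

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
key = ("test", "user_searches", "alacret")  # namespace, set, user id

ops = [
    # prepend the new search so the newest item is first
    list_operations.list_insert("searches", 0, "something to search"),
    # keep only the 100 most recent items
    list_operations.list_trim("searches", 0, 100),
    # read the updated list back under the same record lock
    operations.read("searches"),
]
_, _, bins = client.operate(key, ops)
print(bins["searches"])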

Using the document store as a cache

I've set up a basic implementation of Elasticsearch, storing a couple of fields in the document, and I'm able to execute queries.
var searchResult = client.Search<SearchTest>(s => s
        .Size(1000)
        .Fields(f => f.ID)
        .Query(q => q.QueryString(d => d.Query(query))))
    .Documents.Select(item => item.ID)
    .ToList();

var products = this.DbContext.Products
    .Where(item =>
        searchResult.Contains(item.ProductId)
        && ...
    )
    .Select(item => ...);
// subsequent queries here
Right now, I simply return the IDs, which I use in database queries to retrieve a whole lot of information; the information stored in the documents is retrieved as well. Now I'm wondering: should I skip retrieving this from the database and use the data from the document store instead? Or should I use it for nothing but searching?
Some context: this is search in a product database; some information is always the same, while some (like the price calculation) depends on which customer is searching.
There isn't really a hard and fast answer to this question. I like to pull enough information from the index to populate a list of search results, but retrieve the full contents of the document from other, external sources (e.g. a database). Entirely subjectively, this seems to be the more common use of Lucene, from what I've seen.
Storage strategy, as far as I know, should not have a direct impact on search performance, but keeping the data stored per document to a minimum will improve the performance of retrieving documents from the index (i.e., for that list of results mentioned before).
I'm also sometimes hesitant to make Lucene the system of record. It seems to be much easier to end up with a broken/corrupt index than a broken database. I like having the option available to trash and rebuild it.
I see you already accepted an answer, but I'd like to offer a second approach.
Elasticsearch excels at storing documents (JSON), so retrieving complete object graphs can be a very fast and powerful way to overcome the impedance mismatch and N+1-prone database queries.
To me the best approach would be for searchResult to already be the definitive IEnumerable<Product>, without having to do N database queries afterwards.
Elasticsearch (unlike raw Lucene or even Solr) has a special field, _source, that stores the original JSON graph, so the overhead of loading your whole document is minimal.
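For example, a sketch of pulling whole documents straight from _source (shown with the Python client purely for illustration; the index name and query are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
res = es.search(index="products",
                query={"query_string": {"query": "some query"}},
                size=1000)
# each hit carries the full original JSON graph in _source
products = [hit["_source"] for hit in res["hits"]["hits"]]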
This comes at the cost of having to write your data twice, once to the database and once to Elasticsearch, on every mutation. Depending on your architecture this may or may not be achievable.
I agree with @femtoRgon that being able to reindex from an external data source is a good idea, but the Elasticsearch developers are working very hard to get proper backup and restore going for 1.0, which will greatly reduce the need for a second data store.
BTW, not sure if you are aware, but specifying .Fields() already forces Elasticsearch to load only the specified fields rather than the whole graph from the special _source field.

What's the fastest way to copy data from one table to another in Django?

I have two models -
ChatCurrent - (which stores the messages for the current active chats)
ChatArchive - (which archives the messages for the chats that have ended)
The reason I'm doing this is so that the ChatCurrent table always has a minimal number of entries, keeping queries against it fast (I don't know if this actually works; please let me know if I've got this wrong).
So I basically want to copy (cut) data from the ChatCurrent model to ChatArchive. What would be the fastest way to do this? From what I've read online, it seems that I might have to execute a raw SQL query; if you would be kind enough to state the query, I'd be grateful.
Additional details -
Both the models have the same schema.
My opinion is that today there is no reason to denormalize a database in this way to improve performance. Indexes, or partitioning plus indexes, should be enough.
Also, if for semantic reasons you prefer to have two tables (models), like Chat and ChatHistory (or ChatCurrent and ChatArchive) as you say, and to manage them with Django, I think the right way to keep consistency is to create a ToArchive() method on ChatCurrent. This method moves chat entries to the historical chat model. You can perform this operation in the background, e.g. run the swap in a Celery task, so online users don't have to wait on the request. Inside the Celery task, the fastest way to copy the data is raw SQL. Remember that you can encapsulate the SQL in a stored procedure.
Edited to include reply to your comment
You can call ChatCurrent.ToArchive() from ChatCurrent's save() method:
class ChatCurrent(models.Model):
    closed = models.BooleanField()

    def save(self, *args, **kwargs):
        super(ChatCurrent, self).save(*args, **kwargs)
        if self.closed:
            self.ToArchive()

    def ToArchive(self):
        from django.db import connection, transaction
        cursor = connection.cursor()
        cursor.execute("insert into blah blah")
        transaction.commit_unless_managed()
        #self.delete() #if needed (perhaps deleted in the raw sql)
Try something like this:
INSERT INTO "ChatArchive" ("column1", "column2", ...)
SELECT "column1", "column2", ...
FROM "ChatCurrent" WHERE yourCondition;
and then just
DELETE FROM "ChatCurrent" WHERE yourCondition;
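If you'd rather stay inside the ORM on a modern Django, a sketch of the same cut-and-paste (assuming, as the question states, identical schemas; the closed filter field is hypothetical):
from django.db import transaction

def archive_closed_chats():
    with transaction.atomic():
        qs = ChatCurrent.objects.filter(closed=True)
        # copy every field except the primary key, then delete the originals
        ChatArchive.objects.bulk_create(
            ChatArchive(**{k: v for k, v in row.items() if k != "id"})
            for row in qs.values()
        )
        qs.delete()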
The thing you are trying to do is called table partitioning.
Most databases support this feature without the need for manual bookkeeping.
Partitioning will also yield much better results than manually moving parts of the data to a different table. By using partitioning you avoid:
- Data inconsistency, which is easy to introduce because you move records in bulk and then remove many of them from the source table; it's easy to make a mistake and copy only a portion of the data.
- A performance drop: moving the data around, plus the associated transaction overhead, will generally negate any benefit you got from reducing the size of the ChatCurrent table.
For a really quick rundown: table partitioning lets you tell the database that parts of the data are stored and retrieved together, which significantly speeds up queries because the database knows it only has to look at a specific part of the data set, for example chats from the current day, the last hour, the last month, etc. You can additionally store each partition on a different drive; that way you can keep your current chatter on a fast SSD and your history on regular, slower disks.
Please refer to your database manual for the details of how it handles partitioning.
Example for PostgreSQL: http://www.postgresql.org/docs/current/static/ddl-partitioning.html
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
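As a concrete sketch, declarative partitioning in PostgreSQL 10+ syntax (all table, column, and partition names here are made up), issued through psycopg2:
import psycopg2

DDL = """
CREATE TABLE chat_message (
    id      bigserial,
    body    text,
    sent_at timestamptz NOT NULL
) PARTITION BY RANGE (sent_at);

CREATE TABLE chat_message_2024_01 PARTITION OF chat_message
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE chat_message_2024_02 PARTITION OF chat_message
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
"""

with psycopg2.connect("dbname=chat") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        # archiving a month later becomes a metadata-only operation, e.g.:
        # cur.execute("ALTER TABLE chat_message DETACH PARTITION chat_message_2024_01")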
def copyRecord(self, recordId):
    emailDetail = EmailDetail.objects.get(id=recordId)
    copyEmailDetail = CopyEmailDetail()
    for field, value in emailDetail.__dict__.items():
        if field == "_state":  # internal ORM bookkeeping, don't copy
            continue
        setattr(copyEmailDetail, field, value)
    copyEmailDetail.save()
    logger.info("Record copied: %d" % copyEmailDetail.id)
As per the above solutions: don't copy the data over.
If you really want to have two separate tables to query, store your chats in a single table (and, for preference, use all the database techniques mentioned here), and then have Current and Archive tables whose objects simply point to Chat objects.
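A sketch of that layout in Django models (the field names are assumptions):
from django.db import models

class Chat(models.Model):
    message = models.TextField()
    sent_at = models.DateTimeField()

class CurrentChat(models.Model):
    # a thin pointer table; archiving a chat just means
    # deleting its pointer row here and creating one below
    chat = models.OneToOneField(Chat, on_delete=models.CASCADE)

class ArchivedChat(models.Model):
    chat = models.OneToOneField(Chat, on_delete=models.CASCADE)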

Lucene Indexing

I would like to use Lucene to index a table in an existing database. I have been thinking the process is something like:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
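A sketch of that process with PyLucene (this mirrors the Lucene 3.x-era Java API and its Store/Index constants, which the answers below also use; the column names are made up):
import lucene
lucene.initVM()

def make_document(row):
    # row: dict of column name -> value for one table row
    doc = lucene.Document()
    # primary key: stored so we can find the row again, but not analyzed
    doc.add(lucene.Field("id", str(row["id"]),
                         lucene.Field.Store.YES,
                         lucene.Field.Index.NOT_ANALYZED))
    # small columns: stored and analyzed
    for name in ("title", "author"):
        doc.add(lucene.Field(name, row[name],
                             lucene.Field.Store.YES,
                             lucene.Field.Index.ANALYZED))
    # the huge column: analyzed so it is searchable; whether to
    # store it as well is exactly the question below
    doc.add(lucene.Field("body", row["body"],
                         lucene.Field.Store.NO,
                         lucene.Field.Index.ANALYZED))
    return doc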
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size, and if a hit is found for a search, fetch the appropriate Field from the Document
Don't store the Field, and if a hit is found for a search, query the database to get the relevant information out
I realize there may not be a one-size-fits-all answer ...
For sure, your system will be more responsive if you store everything in Lucene. A stored field does not affect query time; it only makes your index bigger, and probably not that much bigger if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put real stress on the GC;
stored fields are likely to impact reader reopen time.
The final answer, of course, is: it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can judge whether it actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing unnecessary data to the inverted index.
Likewise, you can judge whether a column will ever need to be fetched back by key from a search hit. If not, you needn't use Store.YES to store the data.

Can anyone please explain "storing" vs "indexing" in databases?

What is storing and what is indexing a field when it comes to searching?
Specifically I am talking about MySQL or SOLR.
Is there a thorough article about this? I have searched without luck!
Thanks
Storing information in a database just means writing the information to a file.
Indexing a database involves looking at the data in a table and creating an 'index', which is then used to perform a more efficient lookup in the table when you want to retrieve the stored data.
From Wikipedia:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records. The disk space required to store the index is typically less than that required by the table (since indexes usually contain only the key-fields according to which the table is to be arranged, and excludes all the other details in the table), yielding the possibility to store indexes in memory for a table whose data is too large to store in memory.
Storing is just putting data in the tables.
Storing vs. indexing is a SOLR concept.
In SOLR, a field that is only stored cannot be searched on or sorted by. It can only be retrieved as part of the result of a query that searches on an indexed field.
In MySQL, by contrast, you can search and sort on unindexed fields too: it will just be slower, but still possible (unlike in SOLR).
Storing data is just storing data somewhere so you can retrieve it later. Where indexing comes in is retrieving parts of the data efficiently. Wikipedia explains the idea quite well.
Storing is just that: saving the data to disk (or wherever) so that the database can retrieve it later on demand.
Indexing means creating a separate data structure to optimize locating and retrieving that data, faster than simply reading the entire database (or the entire table) and looking at each and every record until the search algorithm finds what you asked for... Generally, databases use what is called a balanced-tree (B-tree) index, which is an extension of the concept of a binary tree. Look up binary trees on Google/Wikipedia for a more in-depth understanding of how this works... Say the data is
Data
L1. This
L2. Is
L3. My Data
And the index is
This -> L1
Is -> L2
My -> L3
Data -> L3
The data/index analogy holds for books as well.
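A minimal Python sketch of that toy inverted index:
# map each token to the set of line ids where it occurs
data = {"L1": "This", "L2": "Is", "L3": "My Data"}

index = {}
for line_id, text in data.items():
    for token in text.split():
        index.setdefault(token, set()).add(line_id)

print(index)  # {'This': {'L1'}, 'Is': {'L2'}, 'My': {'L3'}, 'Data': {'L3'}}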