In an answer to a question I asked yesterday (New Core Data entity identical to existing one: separate entity or other solution?), someone recommended I index an attribute.
After much searching on Google for what an 'index' is in SQLite/Core Data, I'm afraid I'm no closer to knowing exactly what it is or how it speeds up fetching based on an attribute. Keep in mind I know nothing about SQLite or databases in general, other than a vague idea based on reading way, way, way too much about Core Data over the past few months.
Simplistically, indexing is a kind of presorting. If you index a numerical attribute, the store maintains a linked list of the values in numerical order. If you index a text attribute, it maintains a linked list in alphabetical order. Depending on the algorithm, it can maintain other kinds of information about the attributes as well. The index data is stored in the persistent store file itself.
It makes fetches based on the indexed attribute go faster, with the tradeoff of a larger file size and slightly slower inserts.
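You can see the same mechanism at work in plain SQLite, which is what Core Data uses under the hood when you mark an attribute as indexed in the model editor. A minimal sketch (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, length INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [("rain", 4), ("rein", 4), ("ruin", 4)])

# Without an index, the query below scans every row.
# With the index, SQLite walks the presorted structure instead.
conn.execute("CREATE INDEX idx_words_word ON words (word)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM words WHERE word = 'rain'"
).fetchall()
print(plan)  # the plan should mention 'USING INDEX idx_words_word'
```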
All these answers are good, but overly technical.
An index is pretty much identical to the index you'd find in the back of a book. If you wanted to find which pages a certain word occurs on, you'd look it up alphabetically in the index and quickly find all the pages where that word occurs.
If you didn't have an index, the reader would have to go through EVERY single page word by word, which could take quite a while. The index is built pretty much this way ONLY once, not every time the user wants to search.
Wikipedia has a great explanation of a database index:
"A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space."
I would like to understand the performance impact of indexing documents of multiple types into a single index when there is an imbalance in the number of items of each type (one type has millions of documents, while another has just thousands). I have spotted issues on some of my indexes, and establishing whether or not types are indexed separately within a single index would help me rule out causes. Can I assume that types are indexed separately, along the lines of a relational database where each table is effectively separate?
If the answer to the above is no, and types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try and get some more detailed input.
The use case for this example is capturing tweets for Twitter users (call each such user the 'owner' for clarity). I have a multi-tenant environment with one index per Twitter owner. That said, focusing on a single owner:
- I capture the tweets from each timeline (mentions, direct messages, my tweets, and the full 'home' timeline) into a single index, with each timeline type having a different mapping in ElasticSearch.
- Each tweet refers to a parent type, the user who authored the tweet (which may or may not be the owner), via a parent mapping (a sketch of such a mapping follows this list). There is only a single 'user' type for all the timeline types.
- I only ever search and facet on one owner in a single query, so I don't have to concern myself with searching across multiple indexes.
- The home timeline may capture millions of tweets, whereas the owner's own tweets may amount to hundreds or thousands.
- The user documents are routinely updated with information from outside the Twitter timelines, so I would like to avoid (if possible) having to keep multiple copies of the same user object in sync across multiple indexes.
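A parent mapping of the kind just described might be defined like this (a minimal sketch using the Python elasticsearch client; the index name owner_42 and the fields are illustrative, not the real names):

```python
from elasticsearch import Elasticsearch  # elasticsearch-py, 1.x-era API

es = Elasticsearch()

# Each timeline type declares 'user' as its parent type.
es.indices.put_mapping(
    index="owner_42",
    doc_type="home_timeline",
    body={
        "home_timeline": {
            "_parent": {"type": "user"},
            "properties": {
                "text": {"type": "string"},
                "created_at": {"type": "date"},
            },
        }
    },
)
```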
I have noticed much slower responses when querying the indexes with millions of documents, even when excluding the 'home timeline' type (the one with millions of documents indexed), leaving just the types with a few thousand entries. I would rather not split the types into separate indexes (unless I have to), due to the parent-child relationship between a tweet and a user.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
Any input would be appreciated.
EDIT
To clarify the statement that tweets are stored per timeline: there is an ElasticSearch type defined for home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc., which correspond to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.
I have gone back in to check out the has_child queries, and these are a definite red herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand documents (my_tweets_timeline).
Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
No, types are all lumped together into one index as you guessed.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
The total number of documents in the index is obviously a factor. Whether has_child queries are slow in particular is another question - try comparing the performance of has_child queries with trivial term queries for example. The has_child documentation offers one clue under "memory considerations":
With the current implementation, all _id values are loaded to memory (heap) in order to support fast lookups, so make sure there is enough memory for it.
This would imply a large amount of memory is required for any has_child query where there are millions of potential children. Make sure enough memory is available for such operations, or consider a redesign that removes the need for has_child.
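One way to run that comparison is to time a has_child query against a trivial term query on the same index and look at the took value ElasticSearch reports. A rough sketch (the index, type, and field names are assumptions carried over from the question):

```python
import time
from elasticsearch import Elasticsearch  # elasticsearch-py, 1.x-era API

es = Elasticsearch()

has_child_query = {
    "query": {
        "has_child": {
            "type": "user",
            "query": {"term": {"screen_name": "some_author"}},
        }
    }
}
term_query = {"query": {"term": {"text": "elasticsearch"}}}

for name, body in [("has_child", has_child_query), ("term", term_query)]:
    start = time.perf_counter()
    res = es.search(index="owner_42", doc_type="my_tweets_timeline", body=body)
    elapsed = time.perf_counter() - start
    # 'took' is the server-side query time in milliseconds
    print(name, res["took"], "ms server-side,", round(elapsed, 3), "s round trip")
```

If the has_child variant is disproportionately slow, or the node shows heap pressure, the _id cache described above is a likely suspect.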
I have a fairly large database with ~2.6 million documents: two collections hold about 1.2 million documents each, and the rest are small collections (<1000 documents). When I create a new index for a small collection, it takes a long time for indexing to complete (so temp indexes are useless). It seems that RavenDB's indexing process reads every document in the database and checks whether it should be added to the index. I think it would perform better to read only the collections used by the index.
Also, when I use Smuggler to export only one small collection, it reads all documents, so the export can take quite a long time. Meanwhile, a custom app that uses the RavenDB LINQ API and indexes can export the same data in seconds.
Why does RavenDB behave like this? Is there some configuration setting that might change this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
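If I understand it correctly, that metadata is directly queryable through the built-in Raven/DocumentsByEntityName index, which exposes the collection name as a Tag field. A rough sketch against the RavenDB 2.x HTTP API (the server URL, port, and collection name are assumptions, and the path differs for non-default databases):

```python
import requests

# A stored document looks roughly like:
#   { "Name": "...", "@metadata": { "Raven-Entity-Name": "Orders" } }
# The Raven/DocumentsByEntityName index lets you query on that metadata:
resp = requests.get(
    "http://localhost:8080/indexes/Raven/DocumentsByEntityName",
    params={"query": "Tag:Orders"},
)
print(resp.json().get("TotalResults"))  # documents in the 'Orders' collection
```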
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your high-volume documents in one database and everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to live together in the same database.
It seems the answer to my question is coming in RavenDB 3.0. Ayende says:
"In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn’t the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.

We do this by utilizing the Raven/DocumentsByEntityName index. Which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about."
And here is the full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization
For example: providing an effective way to query respondents' answers to a dynamic questionnaire, where each response is stored as a keyword/response pair.
I am aware that there may be some latency in updating the catalogue/text index as new entries are added, but this may not matter if reporting/querying is not a real-time concern (i.e. it is performed at some later date).
So in answer to my own question, the transactional aspect of this doesn't actually matter, does it?
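To make the scenario concrete, here is a minimal sketch of that keyword/response shape with a full-text index over it. It uses SQLite's FTS5 module purely as a stand-in for whatever catalogue/text index the platform provides (the table and keyword names are invented; it requires an SQLite build with FTS5):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A full-text-indexed table of keyword/response pairs.
conn.execute("CREATE VIRTUAL TABLE responses USING fts5(keyword, response)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?)",
    [("favourite_colour", "blue"),
     ("commute_mode", "bicycle, sometimes train")],
)

# Here the index is updated in the same transaction as the insert;
# an external search service would instead lag behind by some interval.
for row in conn.execute(
    "SELECT keyword, response FROM responses WHERE responses MATCH 'train'"
):
    print(row)
```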
I would distinguish between data consistency in the selected storage and the gap between data arriving and appearing in the user's search results. You might use an external or even remote search solution for your application, and the index update can take significant time, depending on the case.
I am considering using a document store for some types of objects (e.g. product data). My criterion for using a document store is that the object has a detail page, so a fast read of the entire object is necessary (for example, a product with all its attributes, images, comments, etc.). My criterion for using SQL is displaying lists (e.g. the N newest or most popular items).
Some objects meet both criteria; products are an example. So is it normal practice to store the info used for rendering lists on index pages in a SQL database, and the other data in a document store?
If denormalization gets you the performance you need, go ahead with it. But you have to ensure that you have a way to deal with updates of denormalized data. Your options in MongoDB are:
- multiple queries to avoid denormalization
- embedded docs
- database references

Make your choice.
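For what the last two options look like in practice, here is a minimal pymongo sketch (the database, collection, and field names are invented):

```python
from pymongo import MongoClient

db = MongoClient().shop  # hypothetical database

# Embedded docs: everything the detail page needs lives in one document.
# One read serves the page, at the cost of duplicated data to keep updated.
db.products.insert_one({
    "_id": 1,
    "name": "Widget",
    "comments": [{"user": "ann", "text": "Great!"}],
})

# References: keep comments separate and resolve them with a second query.
db.comments.insert_one({"product_id": 1, "user": "ann", "text": "Great!"})
product = db.products.find_one({"_id": 1})
comments = list(db.comments.find({"product_id": product["_id"]}))
```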
The main idea is that MongoDB was created for denormalization and embedding. On one of my past projects I did SQL denormalization to get better performance, but I don't like SQL denormalization because it duplicates a lot of data (if you have a one-to-many relation, for example). The second step was rewriting the data access layer for MongoDB. In MongoDB, for some difficult pages where I needed to load multiple documents, I created a denormalized document (with embedded collections and plain data drawn from different documents) to fit the page content. Now all my problem pages work fast, like Facebook ;).
But there are possible problems here, because you have to keep the denormalized document up to date every time. Also, all my denormalized-data updates run asynchronously, so some data can be stale at any given moment, but that's normal practice. Even Stack Overflow uses denormalization: sometimes when I open a question I see an answer, but when I go back to the question list and refresh the page, the question shows no answers.
If I need denormalization, I choose MongoDB.
Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?
The database will give you the best performance with the least amount of work. The built-in index support and query analyzers will give you good performance for free, while a text file might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface which would hide the difference between a database and a flat file. Later on, you can benchmark which one provides the best performance, but I think the database will give you the best bang per hour of development.
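Such an interface might look like the following sketch, where '?' in the pattern maps to a regex '.' for the file backend and to SQL LIKE's '_' wildcard for the database backend (the words5.txt file and Words5-style table come from the question; everything else is invented):

```python
import re
import sqlite3
from abc import ABC, abstractmethod

class WordStore(ABC):
    """Hides whether the words live in a flat file or a database."""

    @abstractmethod
    def matches(self, pattern: str) -> list[str]:
        """pattern uses '?' as a single-letter wildcard, e.g. 'r??n'."""

class FileWordStore(WordStore):
    def __init__(self, path: str):
        with open(path) as f:
            self.words = [line.strip() for line in f if line.strip()]

    def matches(self, pattern: str) -> list[str]:
        rx = re.compile("^" + pattern.replace("?", ".") + "$")
        return [w for w in self.words if rx.match(w)]

class DbWordStore(WordStore):
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)

    def matches(self, pattern: str) -> list[str]:
        like = pattern.replace("?", "_")  # '_' matches one character in LIKE
        rows = self.conn.execute(
            "SELECT word FROM words5 WHERE word LIKE ?", (like,)
        )
        return [row[0] for row in rows]

# Callers only see WordStore, so the backends can be swapped and benchmarked:
# print(FileWordStore("words5.txt").matches("r??n"))
```

Note that for patterns with a fixed first letter (like 'r??n'), the database may be able to narrow the scan using an index on the word column, which is where the "index support for free" comes in.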
For fast retrieval you certainly want some kind of index. If you don't want to write the index code yourself, it's easiest to use a database.
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.
Storing data in a local text file (when you append new records to the end of the file) is generally faster than storing it in a database. So if you are building a high-load application, you can save the data to a text file first and copy it to a database later. In most applications, however, you should use a database instead of a text file, because the database approach has many benefits.
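If you want to check that claim for your own workload, a quick micro-benchmark is easy to write. A sketch (the file names are arbitrary, and results will vary a lot with batching and hardware):

```python
import sqlite3
import time

N = 100_000
records = [f"record-{i}\n" for i in range(N)]

# Appending to a flat file.
start = time.perf_counter()
with open("log.txt", "a") as f:
    for line in records:
        f.write(line)
t_file = time.perf_counter() - start

# Inserting the same records into SQLite in one batched transaction.
conn = sqlite3.connect("log.db")
conn.execute("CREATE TABLE IF NOT EXISTS log (line TEXT)")
start = time.perf_counter()
conn.executemany("INSERT INTO log VALUES (?)", [(r,) for r in records])
conn.commit()
t_db = time.perf_counter() - start

print(f"file append: {t_file:.3f}s, sqlite insert: {t_db:.3f}s")
```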