ElasticSearch types and indexing performance

I would like to understand the performance impact of indexing documents of multiple types into a single index when there is an imbalance in the number of documents of each type (one type has millions of documents, while another has just thousands). I have spotted issues on some of my indexes, and establishing whether or not types are indexed separately within a single index would help me narrow things down. Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
If the answer to the above is no and that types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try and get some more detailed input.
The use case for this example is capturing tweets for Twitter users (call each one the owner, for clarity). I have a multi-tenant environment with one index per Twitter owner. That said, focusing on a single owner:
I capture the tweets from each timeline (mentions, direct messages, my tweets, and the full 'home' timeline) into a single index, with each timeline type having a different mapping in ElasticSearch
Each tweet refers to a parent type, the user who authored the tweet (which may or may not be the owner), with a parent mapping. There is only a single 'user' type for all the timeline types
I search and facet only ever on one owner in a single query, so I don't have to concern myself searching across multiple indexes
The home timeline may capture millions of tweets, whereas the owner's own tweets may number only in the hundreds or thousands
The user documents are routinely updated with information outside of the Twitter timelines, therefore I would like to avoid (if possible) the situation where I have to keep multiple copies of the same user object in sync across multiple indexes
I have noticed a much slower response querying on the indexes with millions of documents, even when excluding the 'home timeline' type with millions of documents indexed, leaving just the types with a few thousand entries. I don't want to have to split the types into separate indexes (unless I have to), due to the parent-child relationship between a tweet and a user.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
Any input would be appreciated.
EDIT
To clarify the statement that tweets are stored per timeline. This means that there is an ElasticSearch type defined for home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc, which correspond to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.
I have gone back to check the has_child queries, and this is a definite red herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand documents (my_tweets_timeline).
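For illustration, the per-timeline child types with a shared user parent described above can be sketched roughly as follows. This is only a sketch, not my exact mappings: it uses the Python client, invented index and field names, and assumes a pre-2.x Elasticsearch where multiple types and _parent mappings are available.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# One index per owner; every timeline type declares 'user' as its parent
es.indices.create(index="owner_index", body={
    "mappings": {
        "user": {
            "properties": {
                "screen_name": {"type": "string"}
            }
        },
        "home_timeline": {
            "_parent": {"type": "user"},
            "properties": {
                "text": {"type": "string"},
                "created_at": {"type": "date"}
            }
        },
        "my_tweets_timeline": {
            "_parent": {"type": "user"},
            "properties": {
                "text": {"type": "string"},
                "created_at": {"type": "date"}
            }
        }
    }
})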

Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
No, types are all lumped together into one index as you guessed.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
The total number of documents in the index is obviously a factor. Whether has_child queries are slow in particular is another question - try comparing the performance of has_child queries with trivial term queries for example. The has_child documentation offers one clue under "memory considerations":
With the current implementation, all _id values are loaded to memory (heap) in order to support fast lookups, so make sure there is enough memory for it.
This would imply a large amount of memory is required for any has_child query where there are millions of potential children. Make sure enough memory is available for such operations, or consider a redesign that removes the need for has_child.
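One rough way to run the comparison suggested above is to time the two queries side by side. A minimal sketch, assuming the Python elasticsearch client and invented index, type, and field names (the has_child query returns user parents whose child tweets match):

from timeit import default_timer as timer
from elasticsearch import Elasticsearch

es = Elasticsearch()

# A trivial term query
term_query = {"query": {"term": {"text": "elasticsearch"}}}

# The same term wrapped in has_child; resolving it loads all child _id
# values for the parent/child join into heap memory
has_child_query = {
    "query": {
        "has_child": {
            "type": "home_timeline",
            "query": {"term": {"text": "elasticsearch"}}
        }
    }
}

for name, body in [("term", term_query), ("has_child", has_child_query)]:
    start = timer()
    result = es.search(index="owner_index", body=body)
    print(name, "took", timer() - start, "seconds,", result["hits"]["total"], "hits")

If the plain term query is also slow on the large index, the problem is unlikely to be has_child itself.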

Related

Storing data values in multiple tables vs accessing the same value via query

I have designed my database tables so that multiple tables store a value that could instead be derived via a query against a single table.
My question is would it be considered better practice to never store duplicate data and always query, or to store small values multiple times to reduce the number of queries required?
For context, I am building a Python app that quizzes Korean language questions using SQLAlchemy and SQLite.
I have User, Quiz and Question classes.
The values in question are num_correct, num_wrong with regard to quiz questions.
Basically I have a question table that stores all questions related to a quiz by quiz_id. Each question has a column "correct" that stores a boolean telling whether or not that question was answered correctly.
In my "quiz" table, I have columns for num_correct / num_wrong regarding questions answered for that quiz.
In my "user" table, I also have columns for num_correct / num_wrong regarding their total answers correct and wrong for all time.
I realize that to get the values in "quiz" I could query the "questions" table, and to get the values in "user" I could do the same.
In this case (and in general) which would be the preferred strategy considering best practices?
I've tried googling quite a bit, but wording the question is a bit tricky.
The issue of duplicated data is a complicated one in relational databases. If your application is doing data modifications, then duplicated data incurs synchronization issues -- the data needs to be updated in multiple places.
That is bad for a variety of reasons:
Updating a single item of information requires multiple changes.
The multiple changes can get out-of-sync, meaning that queries will not see consistent data.
Changes to the database structure (such as adding new tables) can be rather cumbersome.
Databases do provide support for keeping duplicated data in sync, via ACID properties, transactions, and triggers. However, these add overhead. In general, such duplication is added out of necessity (i.e. performance) rather than up-front. Hence, where updates occur frequently, there is a strong preference for normalized data models in which information is stored only once.
On the other hand, some databases are used primarily for querying purposes. These databases are often denormalized -- and quite so. For instance, a customer table might contain summaries along many different dimensions, gathering information from dozens of underlying tables.
This not only simplifies queries but it encodes business logic. One major issue with using data is that different people have slightly different definitions of things -- is a one-year customer someone who started 365 days ago? Someone who started on the same day of the year last year? Someone who has been around for 12 months? Standardized analysis tables provide the answer.
Your case seems to fall more into the first situation. You are doing updates and thinking about storing summaries up front. I would discourage you from doing this. Just write the queries you need to summarize the data. In all likelihood, indexes and partitioning will provide all the performance you need.
If you know up front that you will have millions of users taking hundreds of quizzes with dozens of questions, then you might want to think about performance optimizations up front. But for thousands of users taking a handful of quizzes with a few dozen questions, start with a simple data model and make it more complicated after you have demonstrated that it works.
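To make "just write the queries" concrete, here is a minimal sketch (assuming SQLAlchemy 1.4+; the model and column names are guesses based on your description, not your actual code) that derives the per-quiz totals from the question table instead of storing num_correct / num_wrong on the quiz row:

from sqlalchemy import create_engine, Column, Integer, Boolean, ForeignKey, func
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Quiz(Base):
    __tablename__ = "quiz"
    id = Column(Integer, primary_key=True)

class Question(Base):
    __tablename__ = "question"
    id = Column(Integer, primary_key=True)
    quiz_id = Column(Integer, ForeignKey("quiz.id"))
    correct = Column(Boolean)  # was this question answered correctly?

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

def quiz_totals(session, quiz_id):
    # Count correct and wrong answers on demand instead of maintaining
    # duplicated num_correct / num_wrong columns on the quiz table
    num_correct = (
        session.query(func.count(Question.id))
        .filter(Question.quiz_id == quiz_id, Question.correct.is_(True))
        .scalar()
    )
    num_wrong = (
        session.query(func.count(Question.id))
        .filter(Question.quiz_id == quiz_id, Question.correct.is_(False))
        .scalar()
    )
    return num_correct, num_wrong

with Session(engine) as session:
    print(quiz_totals(session, quiz_id=1))

The same pattern, filtered over all of a user's questions rather than a single quiz, would give the all-time totals you were considering storing on the user table.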
My question is would it be considered better practice to never store duplicate data and always query, or to store small values multiple times to reduce the number of queries required?
I don't see how this reduces the number of queries.
It may affect the complexity of a query, i.e. you'll need to join a few tables together instead of a simple query on one table, but these operations are very fast. I would not worry about speed.
If you duplicate your data it will eventually get out of sync, and then you're in big trouble.
In short, don't duplicate.
Also, this question doesn't really have anything to do with Python.

Is it ever a good idea to store an array as a field value, or store array values as records?

In my application I've got "articles" (similar to posts/tweets/articles) that are tagged with descriptive predefined tags, e.g. "difficult", "easy", "red", "blue", "business", etc.
These available tags are stored in a table, call it "tags" that contains all available tags.
Each article can be tagged with multiple tags, editable through a custom admin interface.
It could be tempting to simply bundle the tags for each entity into a stringified array of the IDs of each tag and store it alongside the article record in my "articles" table:
id | title | author | tags
---+-------+--------+-------------
1 | title | TG | "[1,4,7,12]"
Though I'm sure this is a bad idea for a number of reasons, is there ever a good reason to do the above?
I think you should read about Database normalization and decide for yourself. In short though, there are a number of issues with your proposal, but you may decide you can live with them.
The most obvious are:
What if an additional tag is added to row (1)? You have to parse the field first, check whether the tag is already present, and then rewrite the whole row with the new tag appended.
Worse still, deleting a tag: search the field, check the tag is present, then re-create the whole tags value.
What if a tag needs to change its name - some moderation process, perhaps?
Worse again, what about different people specifying a tag name differently - it'd be hard to rationalise.
What if you want to query data based on tags? Your query becomes far more complex than it would need to be.
Presentation: the client has to parse the tags field in order to use it. What about the separator? Change that and all clients have to change.
In short, all of these operations become harder and more cumbersome. Normalization is designed to overcome such issues. Probably the only reason for doing what you describe, IMO, is if you're capturing the data as a one-off and it's informational only - that is, it makes sense to a user but not to the system per se. Which is another way of saying it's probably best avoided (again, IMO).
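To make the first couple of points concrete, here is a rough Python/sqlite3 sketch (table and column names invented for illustration) of what "add a tag" looks like with a stringified array, next to the single insert a normalized join table needs:

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table articles (id integer primary key, title text, tags text)")
conn.execute("insert into articles values (1, 'title', '[1,4,7,12]')")

# Stringified array: read, parse, check for duplicates, rewrite the whole field
row = conn.execute("select tags from articles where id = 1").fetchone()
tags = json.loads(row[0])
if 9 not in tags:
    tags.append(9)
conn.execute("update articles set tags = ? where id = 1", (json.dumps(tags),))

# Normalized join table: adding (or removing) a tag is a single statement,
# and the database enforces uniqueness for you
conn.execute(
    "create table article_tags ("
    " article_id integer, tag_id integer,"
    " primary key (article_id, tag_id))"
)
conn.execute("insert or ignore into article_tags values (1, 9)")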
It seems to me like you want to have a separate table that stores tags and holds a foreign key which relates the tag records back to their parent record in the articles table (this is referred to as "normalizing" the database structure).
Doing it like you have suggested by cramming the tags into one field may seem to make sense now, but it will prove to be difficult to maintain and difficult/time consuming to pull the values out efficiently as your application grows in size or the amount of data grows a lot larger.
I would say that there are very few reasons to do what you have suggested, given how straightforward it is to create another table and setup a relationship to link keys between the two tables to maintain referential integrity.
I totally agree that it CAN be a good idea. I am a strong advocate of storing tags in the database as a single delimited list of strings.
BUT: the reason I agree is that I like to use the Azure Search API to index these types of data, so the lookup based on tags is not done via SQL. (Using the Azure Search API service is not necessary, but in my experience you will get much better performance and scalability by using a search index that lives outside of the database.)
If your primary query language will be SQL (relational queries), then you are better off creating a child table that has a row for each tag; otherwise you will wear a performance hit when your query has to perform logic on each value to split it for analysis.
Tagging is a concept we use to get around relational data or hierarchical mapping, so to get the best performance do not try to use these relational concepts to query the tags. It is often best implemented in NoSQL data stores because they don't try to use the database to process the search queries.
I encourage you to store the data as a delimited string and use an external indexing service to provide search and insights into your data. This is a good trade-off between CRUD performance and the effort of managing the data and indexes to optimise for searching. Sure, you can optimise the DB and the search queries to make it work in SQL, but it can take effort to get it right.
Once your user base hits large volumes and you need to support multiple concurrent searches without affecting update performance you will find that external indexing is an awesome investment in your time now, to save you time and resources later.

Buffer table in a database, Good or not?

I have a question!
I need to make a university project, and in this project I will have one database table like this:
This table will have a LOT of records!
And to manage this I need to create a validation system.
Which is the best option (and why): to create a buffer table like this:
Or to add a column to my table like this:
Thank you!
Your question does not have enough information to provide a real answer, but here is some guidance on how to think about the situation. Which approach is better depends on the nature of your application, and especially on what "validation" means.
One reasonable interpretation is that "validation" is part of a workflow process, so it happens only once (or 99% of the time only once), and you never want to see unvalidated advertisements when you look at advertisements. If this is the case, then there would typically be additional information about the validation process.
This scenario suggests two reasonable approaches:
Do the validation inside a transaction. This would be reasonable if the validation process were entirely in the database and was measured in seconds.
Have a separate table for advertisements being validated. Perhaps even a separate table per "user" or "entity" responsible for them. Depending on the nature of the validation process, this could be a queue that feeds them to people doing the validation.
Putting them in the "advertisements" table doesn't make sense, because there is likely to be additional information involved with the validation process -- who, what, where, when, how.
If an advertisement can be validated and invalidated multiple times, then the best approach may be to put them in the same table. Once again, there are questions about the nature of the process.
Getting access to the two groups without a full table scan is tricky. If 10% of the rows are invalidated and 90% are validated, then a normal index would require a full table scan for reading either group. To get faster access to the smaller group, here are two options:
clustered index on the validation flag.
separate partitions for validated and invalidated rows.
In both cases, changing the validation flag for a record is relatively expensive, because it involves reading and writing the record on different data pages. Unless dozens of changes are made per second, this is probably not a big deal.
Here, there is no need to have a separate "buffer table". You can just properly index the valid field. So the following index would essentially automatically create a buffer table:
create unique index x on y (id)
include (all columns)
where (valid = 0)
This index creates a copy of the not-yet-valid data. You can do lots of variations, such as:
create unique index x on y (valid, id)
There's really no need for a separate table. Indexes are very easy compared to partitioning, or to manually partitioning into a separate table. Much less work, more general, more flexible, and less potential for human error.
Either approach is valid, and which will perform better will depend more on the type of database you are using rather than the theoretical question of whether it is more correct to use a boolean or partition this into two tables.
I actually prefer the partitioning approach (your buffer table idea), but it will be more complex to code around. This may be a significant point to consider. Most modern databases will handle the boolean criteria very well with an index, but sometimes you can be surprised.
The most important thing from a development perspective right now is to pick one and run with it instead of paralyzing your project while you decide the "right" one.

Elasticsearch querying multiple types and grouped by types?

Suppose I search against two types, [cars] and [buildings], and I want the results to be separated. Is there a way to group results by type?
I understand one simple way would be to query each type separately, but for other use cases one may actually need to query tens or hundreds of types together. Is there a native way or a hacky way (like using sort) to achieve this?
This type of grouping behavior is (currently) not available in elasticsearch. It has been a long standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request a lot more results than you plan on displaying and then bucket those.
Using multi-search (the _msearch API). This allows you to easily pass down some number of queries in a single batch, but will have potential scaling problems if the number of queries gets too large.
This is one feature that Solr has that elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.
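As a sketch of the multi-search batching approach (Python client; the index, type, and field names here are invented), one query per type can be sent in a single round trip, and the responses come back in the same order, so they are effectively already grouped by type:

from elasticsearch import Elasticsearch

es = Elasticsearch()

types = ["cars", "buildings"]

# Multi-search body: a header line followed by a query body, per type
body = []
for t in types:
    body.append({"index": "my_index", "type": t})
    body.append({"query": {"match": {"name": "tower"}}, "size": 10})

responses = es.msearch(body=body)["responses"]
for t, resp in zip(types, responses):
    print(t, [hit["_id"] for hit in resp["hits"]["hits"]])

This works well for a handful of types, but as the answer above notes, it will not scale gracefully to hundreds of queries per request.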
If you want the results separated into groups of documents, you're going to have to restructure your documents, since Elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add a different type, say [price_aggregator], and put the fields [type] and [price] into it. Then you could easily build your query against just one type. This requires some additional work while indexing and more memory to store the index, but it's much more performant when you search.

Too many fields bad for elasticsearch index?

Let's say I have a thousand keys and I want to store the associated values. The intuitive approach seems to be something like:
{
"key1":"someval",
"key2":"someotherval",
...
}
Is this a bad design pattern for an Elasticsearch index, to have thousands of keys? Would each key introduced this way create overhead for every document in the index?
If you know there is an upper limit to the number of keys you'll have, a few thousand fields is not a problem.
The problem is when you have an unbounded set of keys, e.g. when the key is derived from a value, as you'll have a continuously growing mapping and thus also cluster state. It can also lead to quirky searches.
This is a common enough question/issue that I dedicated a section to it in my article on Troubleshooting Elasticsearch searches, for Beginners.
In short, thousands of fields is no problem - not having control of the mapping is.
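To illustrate the unbounded-keys problem, here is a small sketch (Python client from the multi-type era, invented index name and keys) that indexes two documents whose keys are derived from data; dynamic mapping turns every new key into a new field, so the mapping, and with it the cluster state, keeps growing:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Keys derived from values: every new user id becomes a brand-new field
es.index(index="kv_index", doc_type="doc", id=1, body={"user_1234": "someval"})
es.index(index="kv_index", doc_type="doc", id=2, body={"user_5678": "someotherval"})

# The mapping now contains an entry for every key ever seen
print(es.indices.get_mapping(index="kv_index"))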
Elasticsearch is not ideal for storing thousands of key-value pairs in a document. If you want to update them in real time or something similar, try Redis or Riak for that.
If you have thousands of keys in a document/record, they essentially become fields, and the values become the text that gets indexed.
From an information-retrieval perspective with large data, it is advisable to use fewer big fields rather than numerous small fields, for faster search performance.