Too many fields bad for elasticsearch index? - lucene

Let's say I have a thousand keys and want to store the associated values. The intuitive approach seems to be something like
{
"key1":"someval",
"key2":"someotherval",
...
}
Is it a bad design pattern for an Elasticsearch index to have thousands of keys? Would each key introduced this way create overhead for every document in the index?

If you know there is an upper limit to the number of keys you'll have, a few thousand fields is not a problem.
The problem is when you have an unbounded set of keys, e.g. when the key is derived from a value, as you'll have a continuously growing mapping and thus also cluster state. It can also lead to quirky searches.
This is a common enough question/issue that I dedicated a section to it in my article on Troubleshooting Elasticsearch searches, for Beginners.
In short, thousands of fields is no problem - not having control of the mapping is.
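To make the failure mode concrete, here is a minimal sketch of what an unbounded key set does to the mapping, assuming the elasticsearch-py client (8.x style) against a hypothetical local cluster:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Field names derived from data: every new user ID adds a field to the
# mapping, so the mapping (and with it the cluster state) grows without
# bound as documents arrive.
es.index(index="scores", id=1, document={"user_831": 12, "user_912": 7})
es.index(index="scores", id=2, document={"user_1044": 3})

# The mapping now holds one entry per user ever seen:
print(es.indices.get_mapping(index="scores"))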

Elasticsearch is not ideal for storing thousands of key-value pairs in a document. If you need to update them in real time, try Redis or Riak for that.
If you have thousands of keys in a document/record, they essentially become fields, and each value becomes indexed text.
From an information-retrieval perspective on large data, it is advisable to use fewer big fields rather than numerous small fields, for faster search performance.
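One common way to act on this advice when the key set is unbounded is to model the pairs as a nested key/value array, so the mapping stays at two fields no matter how many distinct keys exist. A hedged sketch, again assuming elasticsearch-py 8.x; the index and field names are made up:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Two mapped fields total, regardless of how many keys documents carry.
es.indices.create(index="items", mappings={
    "properties": {
        "kv": {
            "type": "nested",
            "properties": {
                "key":   {"type": "keyword"},
                "value": {"type": "keyword"},
            },
        }
    }
})

# One document, arbitrarily many pairs:
es.index(index="items", id=1, document={
    "kv": [{"key": "key1", "value": "someval"},
           {"key": "key2", "value": "someotherval"}]
})

# Matching a specific pair requires a nested query:
es.search(index="items", query={
    "nested": {
        "path": "kv",
        "query": {"bool": {"must": [
            {"term": {"kv.key": "key1"}},
            {"term": {"kv.value": "someval"}},
        ]}}
    }
})

The trade-off is that every lookup becomes a nested query, which is more verbose and somewhat slower than a plain term query.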

Related

Is it ever a good idea to store an array as a field value, or store array values as records?

In my application I've got "articles" (similar to posts/tweets/articles) that are tagged with descriptive predefined tags, e.g. "difficult", "easy", "red", "blue", "business", etc.
These available tags are stored in a table, call it "tags", that contains all available tags.
Each article can be tagged with multiple tags, editable through a custom admin interface.
It could be tempting to simply bundle the tags for each entity into a stringified array of the IDs of each tag and store it alongside the article record in my "articles" table:
id | title | author | tags
---+-------+--------+-------------
1 | title | TG | "[1,4,7,12]"
Though I'm sure this is a bad idea for a number of reasons, is there ever a good reason to do the above?
I think you should read about Database normalization and decide for yourself. In short though, there are a number of issues with your proposal, but you may decide you can live with them.
The most obvious are:
What if an additional tag is added to row 1? You have to first parse the field, check whether the tag is already present, then update the row to be tags.append(newTag).
Worse still, deleting a tag: search the field, check the tag is present, then re-create the whole value.
What if a tag is to change name - some moderation process, perhaps?
Worse again, what about different people specifying a tag name differently - it'd be hard to rationalise.
What if you want to query data based on tags? Your query becomes far more complex than it would need to be.
Presentation: the client has to parse the field in order to use the tags. What about the separator character? Change that and all clients have to change.
In short, all of these operations become harder and more cumbersome. Normalization is designed to overcome such issues. Probably the only reason for doing what you say, IMO, is if you're capturing the data as a one-off and it's informational only - that is, it makes sense to a user but not to the system per se. Which is another way of saying it's probably best avoided (again, IMO).
It seems to me like you want to have a separate table that stores tags and holds a foreign key which relates the tag records back to their parent record in the articles table (this is referred to as "normalizing" the database structure).
Doing it like you have suggested, by cramming the tags into one field, may seem to make sense now, but it will prove difficult to maintain and time-consuming to pull the values out efficiently as your application or data volume grows.
I would say there are very few reasons to do what you have suggested, given how straightforward it is to create another table and set up a relationship linking keys between the two tables to maintain referential integrity.
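For illustration, a minimal sketch of the normalized layout both answers describe, using Python's built-in sqlite3; the table and column names are made up:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
    CREATE TABLE tags     (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Join table: one row per (article, tag) pair, with referential integrity.
    CREATE TABLE article_tags (
        article_id INTEGER REFERENCES articles(id),
        tag_id     INTEGER REFERENCES tags(id),
        PRIMARY KEY (article_id, tag_id)
    );
""")
db.execute("INSERT INTO articles VALUES (1, 'title', 'TG')")
db.executemany("INSERT INTO tags VALUES (?, ?)",
               [(1, 'difficult'), (4, 'red'), (7, 'blue'), (12, 'business')])
db.executemany("INSERT INTO article_tags VALUES (1, ?)",
               [(1,), (4,), (7,), (12,)])

# Adding/removing a tag is a single INSERT/DELETE, and querying by tag
# is a plain join - no parsing, no separator characters.
rows = db.execute("""
    SELECT a.title FROM articles a
    JOIN article_tags at ON at.article_id = a.id
    JOIN tags t ON t.id = at.tag_id
    WHERE t.name = 'red'
""").fetchall()
print(rows)  # [('title',)]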
I totally agree that it CAN be a good idea. I am a strong advocate of storing tags in the database as a single delimited list of strings.
BUT: the reason I agree is that I like to use the Azure Search API to index these types of data, so the query to do a lookup based on tags is not done via SQL. (Using the Azure Search API service is not necessary, but in my experience you will get much better performance and scalability by using a search index that lives outside of the database.)
If your primary query language will be SQL (relational queries), then you are better off creating a child table that has a row for each tag; otherwise you will incur a performance hit when your query has to run logic on each value to split it for analysis.
Tagging is a concept we use to get around rigid relational or hierarchical mappings, so to get the best performance, do not try to use relational concepts to query the tags. Tagging is often best implemented in NoSQL data stores because they don't try to use the database to process the search queries.
I encourage you to store the data as a delimited string and use an external indexing service to provide search and insights into your data. This is a good trade-off: the database stays fast for CRUD access, while the external index is optimised for searching. Sure, you can optimise the DB and the search queries to make it work in SQL, but it can take effort to get right.
Once your user base hits large volumes and you need to support multiple concurrent searches without affecting update performance, you will find that external indexing is an awesome investment of your time now, saving you time and resources later.
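For contrast, here is a hedged sketch (sqlite3 again, names illustrative) of what tag lookup looks like against the delimited-string column in plain SQL - the per-row string logic is exactly the performance hit described above:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, tags TEXT)")
db.execute("INSERT INTO articles VALUES (1, 'title', '1,4,7,12')")

# The separator-padding trick is needed so that tag 4 does not also match
# 14 or 41; every row must be padded and scanned, and no index on the
# tags column can help.
rows = db.execute("""
    SELECT title FROM articles
    WHERE ',' || tags || ',' LIKE '%,4,%'
""").fetchall()
print(rows)  # [('title',)]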

Key-value design in redis

I've just begun working with Redis, and I have a problem. I have a list of users.
I have a page that displays the list of users, with pagination, sorting, and filtering by name, address... How can I design the Redis keys and values to make this easy?
Redis is not exactly suited to SQL-like usage. What I mean is that with Redis you usually get data out the same way you put it in.
A paginated list of users, if you need no filtering or only limited filtering, is a good use case for the sorted set data type: your user IDs are the values, and the Unix time is the score. If you need another listing sorted by a different field, you'll likely need an additional sorted set, and so forth.
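A minimal sketch of that layout, assuming the redis-py client; the key names are made up:

import time
import redis

r = redis.Redis()

# Register users: one ZADD per user, scored by Unix signup time.
r.zadd("users:by_signup", {"user:1": time.time()})
r.zadd("users:by_signup", {"user:2": time.time()})

# Page 1 (newest first), 10 users per page - pagination is just a range:
page = r.zrevrange("users:by_signup", 0, 9)

# A listing sorted by a different field needs its own sorted set, e.g.
# "users:by_name" with a numeric encoding of the name as the score.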
As far as filtering is concerned, you may do it application-side, fetching ranges from the sorted set and removing the non-matching items if they are sparse. However, you can see how this will not scale if your filter selects 10 elements out of millions.
So the applicability of Redis to your use case depends on the exact details; in general, it sounds like you may want a database more suited to complex queries, even if you are likely to pay a performance price.

Managing the neo4j index's life cycle (CRUD)

I have limited (and disjointed) experience with databases, and nearly none with indexes. Based on web search, reading books, and working with ORMs my understanding can be summed up as follows:
An index in databases is similar to a book index in that it lists "stuff" that's in the book and tells you where to find it. This helps with lookup efficiency (this is most probably not the only benefit)
In (at least some) RDBMSs, primary key fields get automatically indexed, so you never have to manipulate them directly.
I'm tinkering with Neo4j and it seems you have to be deliberate about indexes, so now I need to understand them, but I cannot find clear answers to:
How are indexes managed in neo4j?
I know there's automatic indexing, how does it work?
If you choose to manually manage your own indexes, what can you control about them? Perhaps the index name, etc.?
Would appreciate answers or pointers to answers, thanks.
Neo4j uses Apache Lucene under the covers if you want index-engine-like capabilities for your data. You can index nodes and/or relationships - the index helps you look up a particular instance/set of nodes or relationships.
Manual Indexing:
You can create as many node/relationship indexes as you want, and you can specify a name for each index. The config can also be controlled, i.e. whether you want exact matching (the default) or Lucene's full-text indexing support. Once you have the index, you simply add nodes/relationships to it along with the key/value you want indexed. You do, however, need to take care of "updating" data in the index yourself if you change the node properties.
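As a rough illustration only, here is what manual indexing looked like over the legacy (pre-2.0) REST API, using Python's requests; the endpoint paths follow the 1.x docs and the index/key names are invented, so verify against your version:

import requests

BASE = "http://localhost:7474/db/data"

# Create a named node index (config defaults to exact matching).
requests.post(f"{BASE}/index/node", json={"name": "people"})

# Create a node, then add it to the index under a key/value of our choosing.
node = requests.post(f"{BASE}/node", json={"name": "Alice"}).json()
requests.post(f"{BASE}/index/node/people",
              json={"uri": node["self"], "key": "name", "value": "Alice"})

# Exact lookup; remember the index is NOT updated when the node changes.
hits = requests.get(f"{BASE}/index/node/people/name/Alice").json()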
Auto-Indexing:
Here you get one index for nodes and one index for relationships if you turn them on in the neo4j.properties file. You may specify which properties are to be indexed, and from the point of turning them on, the index is automatically managed for you, i.e. any nodes created after this point are added to the index and updated/removed automatically.
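For reference, a hypothetical neo4j.properties excerpt (legacy, pre-2.0 settings) enabling auto-indexing for a couple of properties:

# Auto-index nodes on the listed property keys.
node_auto_indexing=true
node_keys_indexable=name,email
# Same mechanism for relationships.
relationship_auto_indexing=true
relationship_keys_indexable=since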
More reading:
http://docs.neo4j.org/chunked/stable/indexing.html
The above applies to versions < 2.0
2.0 adds more around the concept of indexing itself; you might want to go through
http://www.neo4j.org/develop/labels
http://blog.neo4j.org/2013/04/nodes-are-people-too.html
Hope that helps.

ElasticSearch types and indexing performance

I would like to understand the performance impact of indexing documents of multiple types into a single index where there is an imbalance in the number of items of each type (one type has millions of documents, while another has just thousands). I have spotted issues on some of my indexes, and establishing whether or not types are indexed separately within a single index would help me. Can I assume that types are indexed separately, along the lines of a relational database where each table is effectively separate?
If the answer to the above is no and that types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try and get some more detailed input.
The use case for this example is capturing tweets for Twitter users (call the user the "owner" for clarity). I have a multi-tenant environment with one index per Twitter owner. That said, focusing on a single owner:
I capture the tweets from each timeline (mentions, direct messages, my tweets, and the full 'home' timeline) into a single index, with each timeline type having a different mapping in ElasticSearch
Each tweet refers to a parent type, the user who authored the tweet (which may or may not be the owner), with a parent mapping. There is only a single 'user' type for all the timeline types
I only ever search and facet on one owner in a single query, so I don't have to concern myself with searching across multiple indexes
The home timeline may capture millions of tweets, where the owner's own tweets may result in hundreds or thousands
The user documents are routinely updated with information outside of the Twitter timelines, therefore I would like to avoid (if possible) the situation where I have to keep multiple copies of the same user object in sync across multiple indexes
I have noticed a much slower response querying on the indexes with millions of documents, even when excluding the 'home timeline' type with millions of documents indexed, leaving just the types with a few thousand entries. I don't want to have to split the types into separate indexes (unless I have to), due to the parent-child relationship between a tweet and a user.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
Any input would be appreciated.
EDIT
To clarify the statement that tweets are stored per timeline. This means that there is an ElasticSearch type defined for home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc, which correspond to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.
I have gone back in to check the has_child queries, and this is a definite red herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand rows (my_tweets_timeline).
Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
No, types are all lumped together into one index as you guessed.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
The total number of documents in the index is obviously a factor. Whether has_child queries are slow in particular is another question - try comparing the performance of has_child queries with trivial term queries for example. The has_child documentation offers one clue under "memory considerations":
With the current implementation, all _id values are loaded to memory (heap) in order to support fast lookups, so make sure there is enough memory for it.
This would imply a large amount of memory is required for any has_child query where there are millions of potential children. Make sure enough memory is available for such operations, or consider a redesign that removes the need for has_child.
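One way to test this, sketched below under the assumption of the elasticsearch-py client and the 0.90/1.x-era query DSL the question uses (index, type, and field names are made up), is to time a trivial term query against the same search wrapped in has_child:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Baseline: trivial term query on the small timeline type.
es.search(index="owner_index", body={
    "query": {"term": {"timeline": "my_tweets_timeline"}}
})

# Same index, but filtered through the parent-child join; if only this
# one is slow (or strains the heap), the _id cache described above is
# the likely suspect.
es.search(index="owner_index", body={
    "query": {
        "has_child": {
            "type": "user",
            "query": {"term": {"screen_name": "owner"}}
        }
    }
})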

Using solr for indexing different types of data

I'm considering using Apache Solr for indexing data in a new project. The data consists of different, independent types, which means there are, for example,
botanicals
animals
cars
computers
to index. Should I use a different index for each type, or does it make more sense to use only one index? How does using many indexes affect performance?
Or is there any other possibility for achieving this?
Thanks.
Both are legitimate approaches, but there are tradeoffs. First, how big is your dataset? If it is large enough that you may want to partition it across multiple servers, it probably makes sense to have different indexes.
Second, how important is performance - indexing it all together will likely result in worse performance, but the degree depends on how much data there is and how complex the queries can get.
Third, do you need to query multiple data types in the same search? If so, indexing everything together is a convenient way to allow this. Technically it could also be achieved with separate indexes, but getting the most relevant results for the query could be a challenge (not that it isn't already).
Fourth, a single index with a single schema and configuration can simplify the life of whoever will be deploying and maintaining the system.
One other thing to consider is IDs - do all of the different objects have a unique identifier across all types? If not, you will probably need to generate one if you want to index them together.
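As an illustration, a hedged sketch of the single-index approach with namespaced IDs, assuming the pysolr client; the core name, field names, and the "type:id" convention are all invented:

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/catalog/")

# Prefix each local ID with its type so the uniqueKey stays unique across
# types, and store the type in its own field for filtering.
solr.add([
    {"id": "animal:42", "type": "animal", "name": "otter"},
    {"id": "car:42",    "type": "car",    "name": "Model T"},
])

# Search everything at once, or narrow to one type with a filter query:
everything = solr.search("*:*")
cars_only = solr.search("*:*", fq="type:car")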