Elasticsearch querying multiple types and grouped by types? - lucene

Suppose I search against two types, [cars] and [buildings], and want the results separated by type. Is there a way to group results by type?
I understand one simple way would be to query each type separately, but in other use cases one may need to query tens or hundreds of types together. Is there a native way, or a hacky way (like using sort), to achieve this?

This type of grouping behavior is (currently) not available in elasticsearch. It has been a long-standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request many more results than you plan to display, then bucket those yourself on the client.
Using multi-search. This allows you to pass down some number of queries in a single batch (one per type), but will have potential scaling problems if the number of queries gets too large. A minimal sketch follows below.
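For illustration, here is a hedged sketch of the multi-search approach using Elasticsearch's _msearch endpoint, one query per type in a single batch. The index, type, and field names are made up:

using System;
using System.Net.Http;
using System.Text;

// A hypothetical multi-search batch: one header line plus one query line
// per type, as newline-delimited JSON. Names here are illustrative only.
var body = new StringBuilder()
    .AppendLine("{\"index\": \"catalog\", \"type\": \"cars\"}")
    .AppendLine("{\"query\": {\"match\": {\"name\": \"tesla\"}}}")
    .AppendLine("{\"index\": \"catalog\", \"type\": \"buildings\"}")
    .AppendLine("{\"query\": {\"match\": {\"name\": \"tesla\"}}}")
    .ToString();

using var http = new HttpClient();
var response = await http.PostAsync(
    "http://localhost:9200/_msearch",
    new StringContent(body, Encoding.UTF8, "application/x-ndjson"));

// The response holds one result set per query, so the hits arrive
// already separated by type.
Console.WriteLine(await response.Content.ReadAsStringAsync());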
Result grouping is one feature that Solr has and elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.

If you want the results separated into groups of documents, you're going to have to restructure your documents, since elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
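As a hedged sketch of what that restructuring might look like with the legacy _parent mapping (the index and type names here are invented):

using System;
using System.Net.Http;
using System.Text;

// Create an index in which each item is a child of an "item_group"
// document representing its type. All names are illustrative.
var mapping = @"{""mappings"": {""item"": {""_parent"": {""type"": ""item_group""}}}}";

using var http = new HttpClient();
await http.PutAsync(
    "http://localhost:9200/catalog",
    new StringContent(mapping, Encoding.UTF8, "application/json"));
// Queries can then match on the parent documents (one per type) and
// retrieve their children, keeping each type's results together.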

I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add a separate type like [price_aggregator] and put the fields [type] and [price] into it. You could then easily build your query against just one type. This requires some additional work while indexing and more memory to store the index, but it is much more performant when you search.
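As a hedged illustration of that idea (the index, type, and field names are made up), each item would get a companion [price_aggregator] document at indexing time:

using System;
using System.Net.Http;
using System.Text;

// Denormalize: alongside each cars/buildings document, index a small
// price_aggregator document carrying the original type and price.
var doc = "{\"type\": \"cars\", \"price\": 15000, \"source_id\": \"cars/123\"}";

using var http = new HttpClient();
await http.PutAsync(
    "http://localhost:9200/catalog/price_aggregator/cars-123",
    new StringContent(doc, Encoding.UTF8, "application/json"));
// Price searches now hit a single type, and the [type] field can be used
// for sorting or faceting to keep the results grouped.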

Related

Apply join like SQL to get data from multiple collections - Firestore

I have a project database set up on Firestore. In some cases I want to pull data from multiple collections and combine it, as with a SQL join. We also need pagination, with filters applied to the data. So it's like applying a WHERE condition and joining multiple collections to get the relevant data. Can anybody help me achieve this?
Firestore read operations get documents from a single collection, or a group of collections of the same name. It has no support for server-side join operations.
The two common workarounds for this are:
Load the additional data from your application code, also referred to as a client-side join (sketched after this list). This affects performance a bit, but not nearly as much as you may expect - so definitely try it and measure performance before ruling it out.
Duplicate the data that you need from the secondary collection. This way you store more data, and your write operations become more complex, but reading the data is fast and simple.
The second solution is also the only way to have conditions on the data from both collections, as there's also no way to query across collections.
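A minimal sketch of the first option, the client-side join, using the Google.Cloud.Firestore .NET client; the collection and field names are invented for illustration:

using System;
using Google.Cloud.Firestore;

// Client-side join: fetch one page of posts, then load each post's
// author from a second collection. All names are illustrative.
FirestoreDb db = FirestoreDb.Create("my-project-id");

QuerySnapshot posts = await db.Collection("posts")
    .WhereEqualTo("published", true)   // the "where condition"
    .OrderBy("createdAt")
    .Limit(20)                         // pagination page size
    .GetSnapshotAsync();

foreach (DocumentSnapshot post in posts.Documents)
{
    string authorId = post.GetValue<string>("authorId");
    DocumentSnapshot author = await db.Collection("users")
        .Document(authorId)
        .GetSnapshotAsync();
    Console.WriteLine(
        $"{post.GetValue<string>("title")} by {author.GetValue<string>("name")}");
}

The per-post author reads run sequentially here for clarity; batching them with Task.WhenAll is the usual way to keep the extra round trips cheap.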
While this may all be very unexpected if you come from a background in relational databases, it is actually very common amongst NoSQL solutions and is one of the reasons they can scale so well to massive data sizes.
To learn more, I highly recommend reading NoSQL data modeling and watching Getting to know Cloud Firestore.

Is sorting the database via a custom function inefficient?

I have a table with Id and Text fields. The Text field holds sentences, averaging 50 words. There are >1,000,000 rows.
This is part of a web app where users need to be able to search through these sentences. Here's the twist, though - instead of a plain SQL text search, I need to run a custom search function written in C# that uses machine learning.
From what I understand, this means I'd have to download the entire database of >1,000,000 rows every time a user makes a search! This seems really inefficient to me.
How would you implement this in the most efficient/fast way possible?
If this is relevant: I'm using EF Core with LINQ's .Where(my_custom_search_function), with a PostgreSQL database.
I think I've found the solution. PostgreSQL full-text search currently provides two ranking functions, ts_rank and ts_rank_cd. In this case, "sorting" in the question and "ranking" here refer to the same thing.
The PostgreSQL docs state:
However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
These functions can be written as any of the four kinds of supported PostgreSQL functions (query language functions, procedural language functions, internal functions, and C-language functions).
Then they answer this exact question:
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
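In practice this suggests a two-step approach for the question above: let PostgreSQL full-text search prefilter and rank a bounded candidate set, and run the custom C# machine-learning scoring only on those rows. A hedged sketch using EF Core's FromSqlRaw; the AppDbContext, the Sentences entity mapping, and the search_vector tsvector column are assumptions, not part of the original schema:

using System.Linq;
using Microsoft.EntityFrameworkCore;

// Prefilter and rank inside PostgreSQL, so only a bounded candidate set
// (500 rows here) ever leaves the database. "search_vector" is an assumed
// tsvector column maintained alongside the text.
using var context = new AppDbContext();
string searchTerm = "machine learning";

var candidates = context.Sentences
    .FromSqlRaw(
        "SELECT id, text FROM sentences " +
        "WHERE search_vector @@ plainto_tsquery('english', {0}) " +
        "ORDER BY ts_rank(search_vector, plainto_tsquery('english', {0})) DESC " +
        "LIMIT 500", searchTerm)
    .AsNoTracking()
    .ToList();

// Only the prefiltered candidates reach the custom scoring function from
// the question, instead of the full >1,000,000-row table.
var results = candidates.Where(my_custom_search_function).ToList();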
Credits to @Used_By_Already for pointing me to PostgreSQL full-text search.

ElasticSearch types and indexing performance

I would like to understand the performance impact of indexing documents of multiple types into a single index when there is an imbalance in the number of items of each type (one type has millions of documents, while another has just thousands). I have spotted issues on some of my indexes, and knowing whether types are indexed separately within a single index would help me rule things out. Can I assume that types are indexed separately, along the lines of a relational database where each table is effectively separate?
If the answer to the above is no and that types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try and get some more detailed input.
The use case for this example is capturing tweets for Twitter users (call the user an owner, for clarity). I have a multi-tenant environment with one index per Twitter owner. That said, focusing on a single owner:
I capture the tweets from each timeline (mentions, direct messages, my tweets, and the full 'home' timeline) into a single index, with each timeline type having a different mapping in ElasticSearch
Each tweet refers to a parent type, the user who authored the tweet (which may or may not be the owner), with a parent mapping. There is only a single 'user' type for all the timeline types
I search and facet only ever on one owner in a single query, so I don't have to concern myself searching across multiple indexes
The home timeline may capture millions of tweets, where the owner's own tweets may result in hundreds or thousands
The user documents are routinely updated with information outside of the Twitter timelines, therefore I would like to avoid (if possible) the situation where I have to keep multiple copies of the same user object in sync across multiple indexes
I have noticed a much slower response querying on the indexes with millions of documents, even when excluding the 'home timeline' type with millions of documents indexed, leaving just the types with a few thousand entries. I don't want to have to split the types into separate indexes (unless I have to), due to the parent-child relationship between a tweet and a user.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
Any input would be appreciated.
EDIT
To clarify the statement that tweets are stored per timeline: there is an ElasticSearch type defined for home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc., corresponding to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.
I have gone back in to check the has_child queries, and this is a definite red herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand rows (my_tweets_timeline).
Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
No, types are all lumped together into one index as you guessed.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
The total number of documents in the index is obviously a factor. Whether has_child queries are slow in particular is another question - try comparing the performance of has_child queries with trivial term queries for example. The has_child documentation offers one clue under "memory considerations":
With the current implementation, all _id values are loaded to memory (heap) in order to support fast lookups, so make sure there is enough memory for it.
This would imply a large amount of memory is required for any has_child query where there are millions of potential children. Make sure enough memory is available for such operations, or consider a redesign that removes the need for has_child.
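A hedged sketch of the comparison suggested above, timing a has_child query against a trivial term query on the same index (the index, type, and field names are invented to match the tweets example):

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

using var http = new HttpClient();

// Time one search body against the owner's index.
async Task TimeQuery(string label, string query)
{
    var sw = Stopwatch.StartNew();
    await http.PostAsync(
        "http://localhost:9200/owner_index/_search",
        new StringContent(query, Encoding.UTF8, "application/json"));
    Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
}

// Trivial term query on a small type.
await TimeQuery("term",
    "{\"query\": {\"term\": {\"timeline\": \"my_tweets_timeline\"}}}");

// has_child query: find users with a matching tweet child. This is the
// query that exercises the parent/child _id lookup table held in heap.
await TimeQuery("has_child",
    "{\"query\": {\"has_child\": {\"type\": \"home_timeline\", " +
    "\"query\": {\"match\": {\"text\": \"elasticsearch\"}}}}}");

If the gap between the two numbers is large and grows with the child count, the has_child machinery (and its heap usage) is the likely culprit rather than raw index size.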

Should I be concerned that ORMs, by default, return all columns?

In my limited experience working with ORMs (so far LLBL Gen Pro and Entity Framework 4), I've noticed that, inherently, queries return data for all columns. I know NHibernate is another popular ORM; I'm not sure whether this applies to it as well, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First off: great question, and about time someone asked this! :-)
Yes, the fact that an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main point for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load only a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent tables doesn't grab all those large blobs of data
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically, you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do, in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away, though: analyse, then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
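The same idea is available in Entity Framework through a plain LINQ projection; a minimal sketch (the entity and property names are made up):

using System.Linq;

// Project only the needed columns into an anonymous type; the ORM
// translates this into SELECT Id, Name rather than SELECT *.
var summaries = context.Products
    .Where(p => p.IsActive)
    .Select(p => new { p.Id, p.Name })
    .ToList();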
For me this has only been an issue if the table has LOTS of columns (more than 30 or so), or if a column holds a lot of data - for example, over 5,000 characters in a field.
The approach I have used is to map another object to the existing table, but with only the fields I need. So for a search that populates a table with 100 rows I would have a MyObjectLite, but when I click to view the details of that row I would call GetById and return a MyObject that has all the columns.
Another approach is to use custom SQL or stored procs, but I think you should only go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a projection, Transformers.AliasToBean, and a DTO. Here's how it looks in the Criteria API, completed into a runnable sketch (the Package entity name is an assumption; the original snippet began at SetProjection):
var dtos = session.CreateCriteria(typeof(Package)) // entity name assumed
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.Property("Id"), "Id")
        .Add(Projections.Property("PackageName"), "Caption")) // alias onto the DTO's Caption
    .SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)))
    .List<PackageNameDTO>();
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.

Using solr for indexing different types of data

I'm considering using Apache Solr for indexing data in a new project. The data is made up of different, independent types, which means there are, for example
botanicals
animals
cars
computers
to index. Should I use a different index for each type, or does it make more sense to use only one index? How does using many indexes affect performance?
Or is there any other possibility to achieve this?
Thanks.
Both are legitimate approaches, but there are tradeoffs. First, how big is your dataset? If it is large enough that you may want to partition it across multiple servers, it probably makes sense to have different indexes.
Second, how important is performance - indexing it all together will likely result in worse performance, but the degree depends on how much data there is and how complex the queries can get.
Third, do you have the need to query for multiple data types in the same search? If so, indexing everything together can be a convenient way to allow this. Technically this could be achieved with separate indexes, but getting the most relevant results for the query could be a challenge (not that it isn't already).
Fourth, a single index with a single schema and configuration can simplify the life of whoever will be deploying and maintaining the system.
One other thing to consider is IDs - do all of the different objects have an identifier that is unique across all types? If not, you will probably need to generate one if you want to index them together; one simple way is sketched below.
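A purely illustrative way to generate such an identifier is to prefix each document's native id with its type:

// Prefix the native id with its type so ids stay unique across
// botanicals, animals, cars, and computers in the shared index.
string MakeSolrId(string docType, string nativeId) => $"{docType}/{nativeId}";

// e.g. MakeSolrId("cars", "123") => "cars/123"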