Lucene index structure concerns

I'm trying to implement search for an online store. The requirements are the following:
If the user only searches a category name, return the category's page
If the user searches both a category and brand, return a search page with the category and brand filter applied
If the user searches for a value that matches a product exactly, return the product's page
If the search matches multiple products across multiple categories and brands, return the search results.
My question is: is it possible to accomplish this using a single Lucene index, or should I use multiple indexes and search in all of them?
As far as I understand, Lucene has no relationships, so I can't represent something like category -> brand -> model.
Thank you!

My question is: is it possible to accomplish this using a single Lucene index, or should I use multiple indexes and search in all of them?
You can definitely accomplish this in a single Lucene.NET index. Be aware that what is typically referred to as a "Lucene index" is really a collection of indexes, given that multiple fields can be indexed.
Another thing to know is that Lucene indexes "documents" and imposes no common structure on those documents. One document may have 2 fields (let's say categoryId, categoryName) and another document may have 4 fields (let's say productId, productName, productCategory, allProductFields). Lucene is totally fine with that. And if categoryName is an indexed field, then you can search by that field and will only get back documents that contain that field and match the query. Ditto if querying allProductFields.
Developers may think of these documents as being two types of documents: a category document and a product document. To Lucene they are all just documents. But it's sometimes useful to add to all documents a field that indicates their "document type" as you see it. So, for example, you could choose to add a docType field to every document; when creating a document from a product you might set that field's value to "product", and when creating a document from a category you might set its value to "category".
Having such a field then makes it possible to query only product documents or only category documents. If there are otherwise no field names shared between the documents, then having such a field is not strictly necessary. But let's say both category documents and product documents had a field named "name": a search on the name field could pull up either type of document, and having a docType field could prove useful for distinguishing the types of results returned, or it could be used as part of the search criteria to search only one type of document.
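For illustration, here is a minimal Lucene.NET sketch of the two document shapes sharing one index (the field names and values are assumptions for the example, not anything Lucene prescribes):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

// one index, two "types" of documents, distinguished only by our own docType field
using var dir = FSDirectory.Open("store-index");
using var writer = new IndexWriter(dir,
    new IndexWriterConfig(LuceneVersion.LUCENE_48, new StandardAnalyzer(LuceneVersion.LUCENE_48)));

writer.AddDocument(new Document
{
    new StringField("docType", "category", Field.Store.YES),  // exact-match, not tokenized
    new StringField("categoryId", "42", Field.Store.YES),
    new TextField("name", "Laptops", Field.Store.YES)          // analyzed, full-text searchable
});
writer.AddDocument(new Document
{
    new StringField("docType", "product", Field.Store.YES),
    new StringField("productId", "1001", Field.Store.YES),
    new StringField("categoryId", "42", Field.Store.YES),      // our own "foreign key" (see below)
    new TextField("name", "Contoso UltraBook 13", Field.Store.YES)
});
writer.Commit();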
Hopefully this will give you some ideas of how a single "Lucene Index" can be used to accomplish the various tasks you desire.
As far as I understand, Lucene has no relationships, so I can't represent something like category -> brand -> model.
Well, it's true that Lucene documents do not inherently have relationships with other documents. But you can certainly choose to put unique keys on your documents along with a docType, and then you can create your own relationships by, for example, putting the categoryId in the product document and later using it to pull back the product's related category document for each product returned in a search. So it's kind of a roll-your-own sort of thing.
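A hedged sketch of that manual join, continuing the field names assumed above (and assuming the matching category document exists):

using Lucene.Net.Index;
using Lucene.Net.Search;

// run the product search, then resolve each hit's category by our own key field
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var productQuery = new TermQuery(new Term("name", "ultrabook"));  // example product search
foreach (var hit in searcher.Search(productQuery, 10).ScoreDocs)
{
    var product = searcher.Doc(hit.Doc);
    // the "relationship" is nothing more than a second lookup on categoryId
    var catQuery = new BooleanQuery
    {
        { new TermQuery(new Term("docType", "category")), Occur.MUST },
        { new TermQuery(new Term("categoryId", product.Get("categoryId"))), Occur.MUST }
    };
    var category = searcher.Doc(searcher.Search(catQuery, 1).ScoreDocs[0].Doc);
}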
There is also a thing called BlockJoinQuery, which is a bit more complicated and has some limitations. You can learn about it a bit from this SO question and answer, and by searching around on the internet.
And finally, Lucene has faceting support. Actually, it has two implementations of faceting support. One of them uses a "side-car" index (i.e. a sister index), and this implementation supports hierarchical facets. Hierarchical facets would be a much more advanced way to represent your category -> brand -> model hierarchy. If that's of interest to you, the conference session Faceted Search with Lucene is something you will probably want to watch.

Related

Is a two table solution a performant and scalable solution to implement tagging in Postgres 9.5?

Background
I work for a real estate technology company. An upcoming project involves building out functionality to allow users to affix one or more tags/labels to an MLS listing (a real estate property). The second requirement is to allow a user to search by one or more tags. We won't be dealing with keeping track of counts or building word clouds or anything like that.
Solutions Researched
I found this SO Q&A and think the solution is pretty straightforward, and I have attempted to adapt some ideas from it below. Also, I understand that JSONB support is much better in 9.5, so it may be a possibility. If you have any insight here, I'd love to hear your thoughts in an answer as well.
Attempted Solution
Table: Tags
Columns: ID, OwnerID, TagName, CreatedDate
Table: TaggedItems
Columns: ID, TagID (references above), PropertyID, CreatedDate, (Possibly some denormalized data to assist with presenting search results; property name, original listor, etc.)
Inserting new tags should be straightforward. Searching tags should also be straightforward, since the user will select one or multiple tags from a searchable dropdown, thus affording me access to the actual TagID, which I can use to query the TaggedItems table. When showing the full profile view for a listing, I can use its PropertyID and the UserID to query my tables for the existence of one or more tags to display in the view.
Edit: It's probably worth noting that we don't keep an entire database of properties; we access them via an API partner, hence the two-table solution and not three.
If you want to normalize to the nth degree, you would actually use 3 tables:
1. Property/Listing
2. Tags
3. Cross-reference between the two
The 3rd table creates a many-to-many relationship between the other 2 tables.
In this case only the 3rd table would carry both the TagID and the PropertyID.
Going with 2 tables is fine too, depending on how heavy your usage is, as a small string won't bloat your database too much.
I would say that it is strongly preferable to separate the tags into their own table when you need to do lookups and more on them. Otherwise you have to keep a delimited list, and then what happens if a user injects a delimiter into their tag value? Also, how do you plan on searching the delimited list? You would constantly have to expand it into a table or use a regex, and the regex might give you false positives, since a pattern matching "some" can also match "something", depending on how you write your code.
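As a sketch, the fully normalized shape might look like this in Postgres (names are assumptions; in your two-table variant the listing table is absent and the cross-reference table carries the PropertyID from your API partner directly):

CREATE TABLE listing (listing_id serial PRIMARY KEY);
CREATE TABLE tag (
    tag_id serial PRIMARY KEY,
    owner_id int NOT NULL,
    tag_name text NOT NULL,
    created_date timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE listing_tag (  -- the cross-reference: many listings to many tags
    listing_id int NOT NULL REFERENCES listing,
    tag_id int NOT NULL REFERENCES tag,
    PRIMARY KEY (listing_id, tag_id)
);
-- search: listings carrying any of the tags picked in the dropdown
SELECT DISTINCT listing_id FROM listing_tag WHERE tag_id IN (3, 7);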

SQL full text index on linked and parent child tables

I would like broad guidelines before hitting the details, so I'll be as brief as I can on two issues (this might be far too little info):
A supplier has more than one address. An address is made up of fields. Address lines 1 and 2 are free text. The rest are foreign keys to master data tables that have an ID and a name: an ID for the country, an ID for the province, an ID for the municipality, an ID for the city, and an ID for the suburb. I would like to employ FTS on address lines 1 and 2 and also on all the master data tables' name columns, so that a user can find suppliers whose addresses match what they enter. This is thus across various master data tables. Note that a province or suburb etc. is not necessarily a single word, e.g. "meadow park".
A supplier provides many goods and services. These goods and services form a 4-level hierarchy (UNSPSC) of parent and child items. The good or service is at the lowest level of the 4-level hierarchy, but hits on the higher levels should be returned as well. Suppliers are linked to the lowest items of the hierarchy. I would like to employ FTS to find suppliers who provide goods and services across the 4-level hierarchy.
The idea is to find matches, return suppliers, show the rank, and show where the hit occurred. If I'm unable to show the hit, the result makes less sense; e.g. a search for the word "car" will hit on care, cars, cardiovascular, cards, etc. A user can type in more than one word, e.g. "car service".
Should the FTS indexes be on only the text fields of the master data tables, with my SELECT thus inner joining on the FTS fields? Should I create views and indexes on those? How do I show the hit words?
Should the FTS indexes be on only the text fields on the master data tables...?
When you have fields across multiple tables that need to be searched in a single query and then ranked, the best practice is to combine those fields into a single field through an ETL process. I recently posted an answer where I explained the benefits of this approach:
Why combine them into 1 table? This approach results in better ranking than if you were to apply full text indexes to each existing table. The former solution produces a single rank, whereas the latter will produce a different rank for each table, and there is no accurate way to resolve multiple ranks (which are based on completely different scales) into 1 rank. ...
How can you combine them into 1 table? You will need some sort of ETL process which either runs on a schedule (which may be easier to implement but will result in lag time where your full text index is out of sync with the master tables) or gets run on demand whenever your master tables are modified (either via triggers or by hooking into an event in your data layer).
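A hedged T-SQL sketch of such a combined table (all object names here are assumptions):

-- one denormalized row per supplier; SearchText is refreshed by the ETL
CREATE FULLTEXT CATALOG SupplierCatalog AS DEFAULT;
CREATE TABLE dbo.SupplierSearch (
    SupplierId int NOT NULL CONSTRAINT PK_SupplierSearch PRIMARY KEY,
    SearchText nvarchar(max) NOT NULL  -- address lines + master data names + UNSPSC titles
);
CREATE FULLTEXT INDEX ON dbo.SupplierSearch (SearchText) KEY INDEX PK_SupplierSearch;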
How do I show the hit words?
Unfortunately, SQL Server Full Text does not have a feature that extracts or highlights the words/phrases that were matched during the search. The answers to this question have tips on how to roll your own solution. There's also a 3rd-party product called ThinkHighlight, which is a CLR assembly that helps with highlighting (I've never used it, so I can't vouch for it).
...search for the word "car" will hit on care, cars, cardiovascular, cards etc...
You didn't explicitly ask about this, but you should be aware that by default "car" will not match "care", etc. What you're looking for is a wildcard search. Your full text query will need to use an asterisk and should look something like this: SELECT * FROM CONTAINSTABLE(MyTable, *, '"car*"'). Be aware that wildcards are only available when using CONTAINS/CONTAINSTABLE (boolean searches), not FREETEXT/FREETEXTTABLE (natural language searches). Based on how you describe your use case, it sounds like you will need to modify your user's search string to add the wildcards. You'll need to do this anyway if you use CONTAINS/CONTAINSTABLE, in order to add the boolean operators and quotes (e.g. the user types car service; you change it to "car*" AND "service*").
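Putting that together against the combined table sketched earlier (names remain assumptions), a ranked wildcard query might look like:

DECLARE @search nvarchar(200) = N'"car*" AND "service*"';  -- built in code from the user's "car service"
SELECT s.SupplierId, ft.[RANK]
FROM dbo.SupplierSearch AS s
JOIN CONTAINSTABLE(dbo.SupplierSearch, SearchText, @search) AS ft
    ON ft.[KEY] = s.SupplierId
ORDER BY ft.[RANK] DESC;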

Indexed View with its own rowversion for Azure Search indexer

I'm trying to design the best way to index my data into Azure Search. Let's say my Azure SQL Database contains two tables:
products
orders
In my Azure Search index I want to have not only products (name, category, description, etc.), but also the count of orders for each product (to use in scoring profiles, to boost popular products in search results).
I think that the best way to do this is to create a view (an indexed view?) which will contain the columns from products plus the count of orders for each product, but I'm not sure if my view (indexed view?) can have its own rowversion column, which would change every time the count changes (orders may be withdrawn - DELETED - and placed - INSERTED).
Maybe there is some easier solution to my problem? Any hints are appreciated.
Regards,
MJ
Yes, I believe the way you are looking to do this is a good approach. Something else I have seen people do is to also include types. For example, you could have a Collection field (which is an array of strings), perhaps called OrderTypes, that you would load with all of the associated order types for that product. That way you can use the Azure Search $facets feature to show the total count of specific order types, and you can use it to drill into the specifics of those orders, for example filtering based on the order type the user selected. Certainly, if there are too many types of orders, that might not be viable.
In any case, yes, I think this would work well. Also, don't forget: if you want to periodically update this count, you could simply pass on just that value (rather than sending all of the product fields) to make it more efficient.
A view cannot have its "own" rowversion column - that column should come from either the products or the orders table. If you make that column indexed, a high water mark change tracking policy will be able to capture new or updated (but not deleted) rows efficiently. If products are deleted, you should look into using a soft-delete approach as described in http://azure.microsoft.com/en-us/documentation/articles/search-howto-connecting-azure-sql-database-to-azure-search-using-indexers-2015-02-28/
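For example, here is a hedged sketch of such a view (table and column names are assumptions). The change-tracking value is derived from the Orders rowversion column, so new and changed orders move it forward, but deletes do not:

CREATE VIEW dbo.ProductsWithOrderCount
AS
SELECT
    p.ProductId,
    p.Name,
    p.Category,
    COUNT(o.OrderId) AS OrderCount,
    -- high water mark for the indexer: the newest related order's rowversion
    ISNULL(MAX(CAST(o.RowVersion AS bigint)), 0) AS HighWaterMark
FROM dbo.Products AS p
LEFT JOIN dbo.Orders AS o ON o.ProductId = p.ProductId
GROUP BY p.ProductId, p.Name, p.Category;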
HTH,
Eugene

Implement search in multiple tables in database with ASP .NET MVC

I have a question about implementing search functionality. I have a table which contains two user IDs and the details of a transaction between them (title, date, description, etc.). I want to allow the user to search transactions by any of these criteria (so typing "Mike salary 2013" would return transactions from 2013 with Mike whose title or description contains the word "salary").
This can be accomplished by joining the required tables, creating a search string, and filtering on every input word against that string. What concerned me was that the Transaction table is designed to ultimately hold millions of rows, so joining multiple tables plus string operations on the database side could be slow.
Another idea of mine was to create a separate column for a search string; that string would be created when the transaction is created and would contain all the necessary information. The problem is when a user decides to change his/her name (users can do that from their "Profile" page): the search strings in all transactions assigned to that user would be outdated.
So here's my question: is it better to search all entries and update the search strings after a user changes their name (it would be costly, but users don't change their names often), or to give up on this whole "search string column" idea and do it with old-fashioned joins? Or maybe there is another option?
Thanks for your help :)
You should use Full Text Search. It actually combines both of your ideas. You can run FTS queries on multiple columns and multiple tables. Behind the scenes, FTS uses an index, which is similar to your "search string column" idea.
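A hedged sketch of what that might look like on SQL Server, assuming a full text index exists on the Title and Description columns of a Transactions table (all names are assumptions):

SELECT t.*, ft.[RANK]
FROM dbo.Transactions AS t
JOIN CONTAINSTABLE(dbo.Transactions, (Title, [Description]), N'"salary*"') AS ft
    ON ft.[KEY] = t.TransactionId
WHERE YEAR(t.TransactionDate) = 2013  -- the non-text parts of the search stay ordinary SQL
ORDER BY ft.[RANK] DESC;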

What's the best way in Postgres to store a bunch of arbitrary boolean values for a row?

I have a database full of recipes, one recipe per row. I need to store a bunch of arbitrary "flags" for each recipe to mark various properties such as Gluten-Free, No Meat, No Red Meat, No Pork, No Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and Low Carb. Users need to be able to search for recipes that have one or more of those flags by checking checkboxes in the UI.
I'm searching for the best way to store these properties in the Recipes table. My ideas so far:
Have a separate column for each property and create an index on each of those columns. I may have upwards of 20 of these properties, so I'm wondering if there are any drawbacks to creating a whole bunch of BOOL columns on a single table.
Use a bitmask for all properties and store the whole thing in one numeric column that contains the appropriate number of bits. Create a separate index on each bit so searches will be fast.
Create an ENUM with a value for each tag, then create a column that has an ARRAY of that ENUM type. I believe an ANY clause on an array column can use an index, but I have never done this.
Create a separate table that has a one-to-many mapping of recipes to tags. Each tag would be a row in this table. The table would contain a link to the recipe and an ENUM value for which tag is "on" for that recipe. When querying, I'd have to do a nested SELECT to filter down to recipes that contain at least one of these tags. I think this is the more "normal" way of doing this, but it does make certain queries more complicated: if I want to query for 100 recipes and also display all their tags, I'd have to use an INNER JOIN and consolidate the rows, or use a nested SELECT and aggregate on the fly.
Write performance is not too big of an issue here since recipes are added by a backend process, and search speed is critical (there might be a few hundred thousand recipes eventually). I doubt I will add new tags all that often, but I want it to be at least possible to do without major headaches.
Thanks!
I would advise you to use a normalized setup. Setting this up from the get-go as a denormalized structure is not what I would advise.
Without knowing all the details of what you have going on, I think the best setup would be to have your recipe table, a new property table, and a new recipe_property table. That allows a recipe to have 0 or many properties, and it normalizes your data, making it fast and easy to maintain and query.
High level structure would be:
CREATE TABLE recipe (recipe_id serial PRIMARY KEY);
CREATE TABLE property (property_id serial PRIMARY KEY, name text NOT NULL);  -- e.g. 'Low Fat'
CREATE TABLE recipe_property (recipe_property_id serial PRIMARY KEY, recipe_id int NOT NULL REFERENCES recipe, property_id int NOT NULL REFERENCES property);
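The checkbox search might then look like this (a sketch; the composite index is an assumption to keep tag lookups fast):

CREATE INDEX recipe_property_lookup ON recipe_property (property_id, recipe_id);
-- recipes that have at least one of the checked properties
SELECT DISTINCT rp.recipe_id
FROM recipe_property AS rp
JOIN property AS p ON p.property_id = rp.property_id
WHERE p.name IN ('Low Fat', 'Gluten-Free');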