I have a requirement to support autocomplete from the UI and search across multiple tables for possible matches. For example, I have 2 tables: Recipe and Ingredient. When a user types in 'Chicken', it should return all recipes that contain 'Chicken' and all ingredients that contain 'Chicken' (e.g. minced chicken, diced chicken, etc).
I think I need to use full text search but I am quite new to this. I can find many examples online that show how to do it for a single table, but not across multiple tables. Do I need to extract contents from multiple tables into a separate table just for search? Or does postgres support full text search across tables?
I'm trying to implement a search for an online store, the requirements are the following:
If the user only searches a category name, return the category's page
If the user searches both a category and brand, return a search page with the category and brand filter applied
If the user searches for a value that matches a product exactly, return the product's page
If we matched multiple products across multiple categories and brands, return the results.
My question is, is it possible to accomplish this using a single Lucene index or should I use multiple indexes and search in all of them?
As far as I understood, Lucene has no relationships so I can't represent something like category -> brand -> model.
Thank you!
My question is, is it possible to accomplish this using a single Lucene index or should I use multiple indexes and search in all of them?
You can definitely accomplish this in a single LuceneNet index. Be aware that what is typically referred to as a "Lucene Index" is really a collection of indexes given that multiple fields can be indexed.
Another thing to know is that Lucene indexes "documents" and it imposes no common structure on those documents. One document may have 2 fields (let's say categoryId, categoryName) and another document may have 4 fields (let's say productId, productName, productCategory, allProductFields). Lucene is totally fine with that. And if categoryName is an indexed field then you can search by that field and will only get back documents that contain that field and match the query. Ditto if querying allProductFields.
Developers may think of these documents as being two types of documents, a category document and a product document. To Lucene they are all just documents. But it's sometimes useful to add to all documents a field that indicates its "document type" as you see it. So for example you could choose to add a docType field to every document, and when creating a document from a product you might set that field's value to "product" and when creating a document from a category you might set its value to "category".
Having such a field then makes it possible to query only product documents or to query only category documents. If there are otherwise no field names shared between the documents then having such a field is not strictly necessary. But let's say both category documents and product documents had a field named name. Then a search on the name field could pull up either type of document, and having a docType field could prove useful to distinguish the types of results returned, or it could be used as part of the search criteria to search only one type of document.
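To make the docType idea concrete, here is a minimal sketch in plain Python. It is not Lucene code: the tiny inverted index, the sample documents and the helper names build_index and search are all assumptions made up for illustration. It only shows how documents with different fields can share one index and how a docType term narrows a query.

```python
from collections import defaultdict

# Hypothetical documents: no shared schema, each dict carries only the
# fields that make sense for its own type.
documents = [
    {"docType": "category", "categoryId": "c1", "name": "Laptops"},
    {"docType": "product", "productId": "p1", "name": "Gaming Laptops Bag", "productCategory": "c1"},
    {"docType": "product", "productId": "p2", "name": "Office Chair", "productCategory": "c2"},
]

def build_index(docs):
    """Map (field, lowercased term) -> set of document positions."""
    index = defaultdict(set)
    for pos, doc in enumerate(docs):
        for field, value in doc.items():
            for term in str(value).lower().split():
                index[(field, term)].add(pos)
    return index

def search(index, docs, field, term, doc_type=None):
    """Return documents matching field:term, optionally restricted by docType."""
    hits = index.get((field, term.lower()), set())
    if doc_type is not None:
        hits = hits & index.get(("docType", doc_type.lower()), set())
    return [docs[pos] for pos in sorted(hits)]

index = build_index(documents)
both_types = search(index, documents, "name", "laptops")
only_products = search(index, documents, "name", "laptops", doc_type="product")
print([d["docType"] for d in both_types])     # ['category', 'product']
print([d["docType"] for d in only_products])  # ['product']
```

In a real index the docType restriction would be one more clause in the query rather than a set intersection done by hand.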
Hopefully this will give you some ideas of how a single "Lucene Index" can be used to accomplish the various tasks you desire.
As far as I understood, Lucene has no relationships so I can't represent something like category -> brand -> model.
Well, it's true that Lucene documents do not inherently have relationships with other documents. But you can certainly choose to put unique keys and a docType on your documents. Then you can create your own relationships by, for example, putting the categoryId on the product document and later using it to pull back the related category document for each product returned in a search. So it's kind of a roll-your-own sort of thing.
There is also a thing called BlockJoinQuery which is a bit more complicated and has some limitations. You can learn about it a bit from this SO Question and Answers and google around on the internet about it.
And finally, Lucene has faceting support. Actually it has two implementations of faceting support. One of them uses a "side-car" index (i.e. a sister index), and this implementation supports hierarchical facets. Hierarchical facets would be a much more advanced way to represent your category -> brand -> model hierarchy. If that's of interest to you, the conference session Faceted Search with Lucene is something you are probably going to want to watch.
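The roll-your-own relationship could look roughly like the following sketch. The dictionaries stand in for documents returned by two separate searches; the field values and the helper attach_category are hypothetical.

```python
# Hypothetical category documents, keyed by the unique categoryId stored on them.
categories_by_id = {
    "c1": {"docType": "category", "categoryId": "c1", "categoryName": "Laptops"},
}
# Hypothetical product hits, each carrying the categoryId of its category.
product_hits = [
    {"docType": "product", "productId": "p1", "productName": "UltraBook 13", "categoryId": "c1"},
    {"docType": "product", "productId": "p9", "productName": "Mystery Item", "categoryId": "c404"},
]

def attach_category(hits, categories):
    """Pair each product hit with its category document, or None if it is missing."""
    return [(hit, categories.get(hit["categoryId"])) for hit in hits]

resolved = attach_category(product_hits, categories_by_id)
for product, category in resolved:
    name = category["categoryName"] if category else "(unknown category)"
    print(product["productName"], "->", name)
```

Against a real index the dictionary lookup would be a second query on categoryId restricted to docType "category".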
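As a rough illustration of what hierarchical facet counts give you, and not of Lucene's facet API itself, the sketch below counts hits at every level of a category/brand/model path. The sample paths and the facet_counts helper are assumptions for the example.

```python
from collections import Counter

# Hypothetical matched products, each tagged with a category/brand/model path.
matched = [
    ("Laptops", "Acme", "A100"),
    ("Laptops", "Acme", "A200"),
    ("Laptops", "Zenith", "Z1"),
    ("Phones", "Acme", "P5"),
]

def facet_counts(paths):
    """Count hits for every prefix of every path (category, category/brand, ...)."""
    counts = Counter()
    for path in paths:
        for depth in range(1, len(path) + 1):
            counts["/".join(path[:depth])] += 1
    return counts

counts = facet_counts(matched)
print(counts["Laptops"])        # 3
print(counts["Laptops/Acme"])   # 2
```

Counts like these are what let a results page offer drill-down from category to brand to model.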
Background
I work for a real estate technology company. An upcoming project involves building out functionality to allow users to affix tags/labels (plural) to a MLS listing (real estate property). The second requirement is to allow a user to search by one or more tags. We won't be dealing with keeping track of counts or building word clouds or anything like that.
Solutions Researched
I found this SO Q&A and think the solution is pretty straightforward and have attempted to adapt some ideas from it below. Also, I understand that JSONB support is much better in 9.5 and it may be a possibility. If you have any insight here I'd love to hear your thoughts as well in an answer.
Attempted Solution
Table: Tags
Columns: ID, OwnerID, TagName, CreatedDate
Table: TaggedItems
Columns: ID, TagID (references above), PropertyID, CreatedDate, (Possibly some denormalized data to assist with presenting search results; property name, original listor, etc.)
Inserting new tags should be straightforward. Searching tags should also be straightforward since the user will select one or multiple tags from a searchable dropdown, thus affording me access to the actual TagID which I can use to query the TaggedItems table. When showing the full profile view for a listing, I can use its PropertyID and the UserID to query my tables for the existence of one or more Tags to display in the view.
Edit: It's probably worth noting that we don't keep an entire database of properties, we access them via an API partner; hence the two table solution and not 3.
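A minimal sketch of the two-table layout described above, using SQLite through Python so it runs anywhere. The column types, the sample rows and the helper properties_with_all_tags are assumptions, and the query assumes that searching by several tags means a listing must carry all of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tags (
    ID INTEGER PRIMARY KEY,
    OwnerID INTEGER NOT NULL,
    TagName TEXT NOT NULL,
    CreatedDate TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE TaggedItems (
    ID INTEGER PRIMARY KEY,
    TagID INTEGER NOT NULL REFERENCES Tags (ID),
    PropertyID TEXT NOT NULL,  -- listing id from the API partner (type assumed)
    CreatedDate TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
conn.executemany("INSERT INTO Tags (ID, OwnerID, TagName) VALUES (?, ?, ?)",
                 [(1, 7, "waterfront"), (2, 7, "fixer-upper"), (3, 7, "garage")])
conn.executemany("INSERT INTO TaggedItems (TagID, PropertyID) VALUES (?, ?)",
                 [(1, "MLS-100"), (2, "MLS-100"), (1, "MLS-200"), (3, "MLS-300")])

def properties_with_all_tags(conn, tag_ids):
    """Return PropertyIDs that carry every tag in tag_ids (ids assumed distinct)."""
    placeholders = ",".join("?" * len(tag_ids))
    sql = f"""
        SELECT PropertyID FROM TaggedItems
        WHERE TagID IN ({placeholders})
        GROUP BY PropertyID
        HAVING COUNT(DISTINCT TagID) = ?
        ORDER BY PropertyID
    """
    return [row[0] for row in conn.execute(sql, [*tag_ids, len(tag_ids)])]

print(properties_with_all_tags(conn, [1]))     # ['MLS-100', 'MLS-200']
print(properties_with_all_tags(conn, [1, 2]))  # ['MLS-100']
```

Dropping the HAVING clause turns this into an "any of these tags" search instead.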
If you want to Nth normalize you would actually use 3 tables.
1 Property/Listing
2 Tags
3 CrossReferenceBetween the Two
The 3rd table creates a many to many relationship between the other 2 tables.
In this case only the 3rd table would carry both the tag ID and the property ID.
Going with 2 tables is fine too, depending on how heavy your usage is, as a small string won't bloat your database too much.
I would say that it is strongly preferable to move the tags into a separate table when you need to do lookups and more on them. Otherwise you have to keep a delimited list, and then what happens if a user injects a delimiter into their tag value? Also, how do you plan on searching the delimited list? You will constantly have to expand it into a table or use regex, and the regex might give you false positives, as "some" will match both "some" and "something" depending on how you write your code.
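The two problems with a delimited list are easy to reproduce. This small sketch uses made-up tag values and plain Python strings in place of a database column.

```python
import re

delimited = "something,garage"          # hypothetical tags stored in one column
normalized = {"something", "garage"}    # the same tags stored one per row

# A naive substring or regex search reports a tag that was never applied.
print("some" in delimited)                  # True (false positive)
print(bool(re.search("some", delimited)))   # True (false positive)

# An exact comparison against individual tag values does not.
print("some" in normalized)                 # False
print("something" in normalized)            # True

# A user-supplied tag containing the delimiter silently becomes two tags.
stored = ",".join(["pool,spa", "garage"])
print(stored.split(","))                    # ['pool', 'spa', 'garage']
```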
I would like broad guidelines before hitting the details, and thus as brief as I can on two issues (this might be far too little info):
Supplier has more than one address. Address is made up of fields. Address lines 1 and 2 are free text. The rest are keys to master data tables that have an FK and a name. ID to country. ID to province. ID to municipality. ID to city. ID to suburb. I would like to employ FTS on address lines 1 and 2 and also on all the master data table name columns so that users can find suppliers whose address matches what they capture. This is thus across various master data tables. Note that a province or suburb etc. is not always a single word, e.g. meadow park.
Supplier provides many goods and services. These goods and services are a 4 level hierarchy (UNSPSC) of parent and child items. The goods or service is at the lowest level of the 4 level hierarchy, but hits on the higher levels should be returned as well. Supplier linked to lowest items of hierarchy. Would like to employ FTS to find supplier who provides goods and services across the 4 level hierarchy.
The idea is thus to find matches, return suppliers, show rank, show where it hit. If I'm unable to show the hit, the result makes less sense, e.g. search for the word "car" will hit on care, cars, cardiovascular, cards etc. User can type in more than one word, e.g. "car service".
Should the FTS indexes be on only the text fields on the master data tables, and my select thus inner join on FTS fields? Should I create views and indexes on those? How do I show the hit words?
Should the FTS indexes be on only the text fields on the master data tables...?
When you have fields across multiple tables that need to be searched in a single query and then ranked, the best practice is to combine those fields into a single field through an ETL process. I recently posted an answer where I explained the benefits of this approach:
Why combine them into 1 table? This approach results in better ranking than if you were to apply full text indexes to each existing table. The former solution produces a single rank whereas the latter will produce a different rank for each table and there is no accurate way to resolve multiple ranks (which are based on completely different scales) into 1 rank. ...
How can you combine them into 1 table? You will need some sort of ETL process which either runs on a schedule (which may be easier to implement but will result in lag time where your full text index is out of sync with the master tables) or gets run on demand whenever your master tables are modified (either via triggers or by hooking into an event in your data layer).
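Here is a minimal sketch of such an ETL step. It uses SQLite through Python rather than SQL Server so that it runs as-is, and every table and column name (Province, Suburb, SupplierAddress, SupplierSearch, SearchText) is an assumption invented for the example. The point is only the shape: join the master data names onto the address lines and write one combined text row per supplier, which is the table the full text index would then be built on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Province (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Suburb (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE SupplierAddress (
    SupplierId INTEGER, Line1 TEXT, Line2 TEXT,
    ProvinceId INTEGER REFERENCES Province (Id),
    SuburbId INTEGER REFERENCES Suburb (Id)
);
-- The single table a full text index would be built on.
CREATE TABLE SupplierSearch (SupplierId INTEGER PRIMARY KEY, SearchText TEXT);

INSERT INTO Province VALUES (1, 'Gauteng');
INSERT INTO Suburb VALUES (1, 'Meadow Park');
INSERT INTO SupplierAddress VALUES (10, '12 Main Road', 'Unit 4', 1, 1);
INSERT INTO SupplierAddress VALUES (10, 'PO Box 99', '', 1, 1);
""")

def rebuild_search_table(conn):
    """Flatten address lines and master data names into one row per supplier."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM SupplierSearch")
        conn.execute("""
            INSERT INTO SupplierSearch (SupplierId, SearchText)
            SELECT a.SupplierId,
                   GROUP_CONCAT(a.Line1 || ' ' || a.Line2 || ' ' || p.Name || ' ' || s.Name, ' ')
            FROM SupplierAddress a
            JOIN Province p ON p.Id = a.ProvinceId
            JOIN Suburb s ON s.Id = a.SuburbId
            GROUP BY a.SupplierId
        """)

rebuild_search_table(conn)
search_rows = conn.execute("SELECT SupplierId, SearchText FROM SupplierSearch").fetchall()
print(search_rows)
```

A full rebuild like this suits the scheduled variant; the on-demand variant would update only the rows for the supplier whose data changed.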
How do I show the hit words?
Unfortunately SQL Server Full Text does not have a feature that extracts or highlights the words/phrases that were matched during the search. The answers to this question have tips on how to roll your own solution. There's also a 3rd party product called ThinkHighlight which is a CLR assembly that helps with highlighting (I've never used it so I can't vouch for it).
...search for the word "car" will hit on care, cars, cardiovascular, cards etc...
You didn't explicitly ask about this but you should be aware that by default "car" will not match "care", etc. What you're looking to do is a wildcard search. Your full text query will need to use an asterisk and should look something like this: SELECT * FROM CONTAINSTABLE(MyTable, *, '"car*"') Be aware that wildcards are only available when using CONTAINS/CONTAINSTABLE (boolean searches), not FREETEXT/FREETEXTTABLE (natural language searches). Based on how you describe your use case, it sounds like you will need to modify your user's search string to add the wildcards. You'll need to do this anyway if you use CONTAINS/CONTAINSTABLE in order to add the boolean operators and quotes (ex: User types car service. You change it to "car*" AND "service*".)
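The rewriting of the user's search string could be sketched as follows. The function name to_contains_query is made up for this example, and the choice to keep only word characters is an assumption meant to stop quotes or operators typed by the user from changing the structure of the search condition.

```python
import re

def to_contains_query(user_input):
    """Turn free text such as 'car service' into '"car*" AND "service*"'."""
    # Keep only runs of letters, digits and underscores; everything else
    # (quotes, asterisks, parentheses) is treated as a separator.
    words = re.findall(r"\w+", user_input)
    return " AND ".join(f'"{word}*"' for word in words)

print(to_contains_query("car service"))   # "car*" AND "service*"
print(to_contains_query('  "car"  '))     # "car*"
```

An empty result means the user typed nothing searchable, so the caller should skip the query in that case. The resulting string is best passed to CONTAINSTABLE as a query parameter rather than concatenated into the SQL text.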
I have a database full of recipes, one recipe per row. I need to store a bunch of arbitrary "flags" for each recipe to mark various properties such as Gluten-Free, No Meat, No Red Meat, No Pork, No Animals, Quick, Easy, Low Fat, Low Sugar, Low Calorie, Low Sodium and Low Carb. Users need to be able to search for recipes that contain one or more of those flags by checking checkboxes in the UI.
I'm searching for the best way to store these properties in the Recipes table. My ideas so far:
Have a separate column for each property and create an index on each of those columns. I may have upwards of about 20 of these properties, so I'm wondering if there are any drawbacks to creating a whole bunch of BOOL columns on a single table.
Use a bitmask for all properties and store the whole thing in one numeric column that contains the appropriate number of bits. Create a separate index on each bit so searches will be fast.
Create an ENUM with a value for each tag, then create a column that has an ARRAY of that ENUM type. I believe an ANY clause on an array column can use an INDEX, but have never done this.
Create a separate table that has a one-to-many mapping of recipes to tags. Each tag would be a row in this table. The table would contain a link to the recipe, and an ENUM value for which tag is "on" for that recipe. When querying, I'd have to do a nested SELECT to filter out recipes that didn't contain at least one of these tags. I think this is the more "normal" way of doing this, but it does make certain queries more complicated - If I want to query for 100 recipes and also display all their tags, I'd have to use an INNER JOIN and consolidate the rows, or use a nested SELECT and aggregate on the fly.
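For reference, the bitmask idea from the list above behaves like this sketch in plain Python. The RecipeFlag members and sample recipes are made up, and the sketch says nothing about how well a database can index individual bits.

```python
from enum import IntFlag, auto

class RecipeFlag(IntFlag):
    """One bit per property; a recipe stores the OR of its flags in one integer."""
    GLUTEN_FREE = auto()
    NO_MEAT = auto()
    QUICK = auto()
    LOW_FAT = auto()

recipes = {
    "lentil soup": RecipeFlag.GLUTEN_FREE | RecipeFlag.NO_MEAT | RecipeFlag.LOW_FAT,
    "beef stir fry": RecipeFlag.QUICK,
    "fruit salad": RecipeFlag.GLUTEN_FREE | RecipeFlag.NO_MEAT | RecipeFlag.QUICK,
}

def with_all_flags(recipes, wanted):
    """Names of recipes whose mask has every bit of wanted set."""
    return sorted(name for name, flags in recipes.items() if (flags & wanted) == wanted)

print(with_all_flags(recipes, RecipeFlag.GLUTEN_FREE | RecipeFlag.NO_MEAT))
# ['fruit salad', 'lentil soup']
print(int(recipes["beef stir fry"]))  # 4
```

Adding a new flag means adding a new bit, which is cheap, but every query has to know the bit layout.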
Write performance is not too big of an issue here since recipes are added by a backend process, and search speed is critical (there might be a few hundred thousand recipes eventually). I doubt I will add new tags all that often, but I want it to be at least possible to do without major headaches.
Thanks!
I would advise you to use a normalized setup. Setting this up from the get go as a de-normalized structure is not what I would advise.
Without knowing all the details of what you have going on, I think the best setup would be to have your recipe table, a new property table and a new recipe_property table. That allows a recipe to have 0 or many properties and normalizes your data, making it fast and easy to maintain and query.
High level structure would be:
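CREATE TABLE recipe (recipe_id INT PRIMARY KEY);
CREATE TABLE property (property_id INT PRIMARY KEY);
CREATE TABLE recipe_property (recipe_property_id INT PRIMARY KEY, recipe_id INT REFERENCES recipe (recipe_id), property_id INT REFERENCES property (property_id));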
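To show how the normalized structure handles the two queries raised in the question (filtering by checked properties and listing every tag of each recipe), here is a runnable sketch using SQLite through Python. The title and name columns, the sample rows and both helper functions are assumptions added for the example, and the filter assumes a recipe must have all checked properties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recipe (recipe_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE property (property_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE recipe_property (
    recipe_property_id INTEGER PRIMARY KEY,
    recipe_id INTEGER NOT NULL REFERENCES recipe (recipe_id),
    property_id INTEGER NOT NULL REFERENCES property (property_id),
    UNIQUE (recipe_id, property_id)
);
INSERT INTO recipe VALUES (1, 'Lentil Soup'), (2, 'Beef Stir Fry'), (3, 'Fruit Salad');
INSERT INTO property VALUES (1, 'Gluten-Free'), (2, 'No Meat'), (3, 'Quick');
INSERT INTO recipe_property (recipe_id, property_id)
VALUES (1, 1), (1, 2), (2, 3), (3, 1), (3, 2), (3, 3);
""")

def recipes_with_all(conn, property_names):
    """Titles of recipes that have every named property (names assumed distinct)."""
    placeholders = ",".join("?" * len(property_names))
    sql = f"""
        SELECT r.title
        FROM recipe r
        JOIN recipe_property rp ON rp.recipe_id = r.recipe_id
        JOIN property p ON p.property_id = rp.property_id
        WHERE p.name IN ({placeholders})
        GROUP BY r.recipe_id, r.title
        HAVING COUNT(*) = ?
        ORDER BY r.title
    """
    return [row[0] for row in conn.execute(sql, [*property_names, len(property_names)])]

def tags_by_recipe(conn):
    """Map each recipe title to the sorted list of its property names."""
    sql = """
        SELECT r.title, GROUP_CONCAT(p.name, '|')
        FROM recipe r
        LEFT JOIN recipe_property rp ON rp.recipe_id = r.recipe_id
        LEFT JOIN property p ON p.property_id = rp.property_id
        GROUP BY r.recipe_id, r.title
    """
    return {title: sorted(tags.split("|")) if tags else [] for title, tags in conn.execute(sql)}

print(recipes_with_all(conn, ["Gluten-Free", "No Meat"]))  # ['Fruit Salad', 'Lentil Soup']
print(tags_by_recipe(conn)["Beef Stir Fry"])               # ['Quick']
```

The UNIQUE constraint on (recipe_id, property_id) is what makes COUNT(*) a safe test for "has all of them".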
I have one question which can be best described by the following scenario.
Suppose I have three tables: BaseCategory, Category and Products. If I am thinking in terms of an RDBMS, then the relationships among these tables are
1- One BaseCategory has Many categories
2- One Category has Many Products.
Now I am thinking of converting it to HBase. Can anybody help me map these relations into HBase?
You'd probably have each row represent a supercategory/category pair (encoded with a separator, e.g. MySuperCategory:MyCategory), and a column family named "products" with a column for each product in that category.
This would allow you to very quickly retrieve all of the items in a given supercategory/category pair, and with some de-duplication all of the items in a supercategory.
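The row key design can be tried out with a small in-memory stand-in. This is plain Python, not an HBase client: the rows dictionary, the sample data and the helpers get_row and scan_prefix are assumptions that only mimic a point lookup and a prefix scan over keys kept in sorted order.

```python
from bisect import bisect_left

# Hypothetical stand-in for an HBase table: row key -> {column: value},
# with one column per product inside a "products" column family.
rows = {
    "Electronics:Laptops": {"products:p1": "UltraBook 13", "products:p2": "WorkStation 15"},
    "Electronics:Phones": {"products:p3": "Pocket 5", "products:p1": "UltraBook 13"},
    "Home:Chairs": {"products:p4": "Office Chair"},
}
sorted_keys = sorted(rows)  # HBase keeps rows sorted by key

def get_row(base_category, category):
    """Point lookup: all products of one supercategory/category pair."""
    return rows.get(f"{base_category}:{category}", {})

def scan_prefix(base_category):
    """Products in every row whose key starts with 'base_category:', de-duplicated."""
    prefix = f"{base_category}:"
    products = {}
    start = bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        products.update(rows[key])  # same column name collapses duplicates
    return products

print(len(get_row("Electronics", "Laptops")))  # 2
print(sorted(scan_prefix("Electronics")))      # ['products:p1', 'products:p2', 'products:p3']
```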