Lucene Facets - How to handle StoreId

We are storing data for multiple stores in the same index.
We want to create facets for several fields, like category (which is hierarchical), price, color, and size, but we want to calculate these facets per store id.
We will never have a use case where we want to count across stores.
How do we handle this use case? Shall we add storeid as part of all the values we give to facets, or shall we declare all facets as hierarchical and have storeid as the first level?

There may be multiple ways to handle this, but based on my experience I'd suggest the following: when you create your drill-down query for your facets (to specify the level in the category hierarchy that you are interested in) and you pass that query a baseQuery, the base query should include your criterion that storeid equals a specific store.
In a sense, the storeid needing to be a specific store is just another query criterion (one that you happen to be adding behind the scenes) for indicating which products the customer is interested in. This is not much different than if you were also specifying that only products with a specific color are of interest.
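The idea above can be sketched in plain Python (dicts standing in for indexed documents; `count_facets` and the field names are hypothetical, not Lucene API):

```python
# Minimal sketch of per-store facet counting: the store filter is just
# another criterion in the base query, not part of the facet values.
from collections import Counter

docs = [
    {"store_id": 1, "color": "red",  "category": "shoes/boots"},
    {"store_id": 1, "color": "blue", "category": "shoes/sneakers"},
    {"store_id": 2, "color": "red",  "category": "shoes/boots"},
]

def count_facets(docs, field, base_filter):
    """Count facet values for `field` over docs matching the base filter."""
    return Counter(d[field] for d in docs if base_filter(d))

# The drill-down's base query includes store_id == 1 behind the scenes:
store_filter = lambda d: d["store_id"] == 1
print(count_facets(docs, "color", store_filter))  # red: 1, blue: 1
```

Because the store restriction lives in the base query, the facet values themselves stay clean (no storeid prefix), and the same facet definitions work for every store.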

Related

Comparing 2nd largest item per group to the largest item in the group using SQL

I have a relational database for a burger-building application that a restaurant uses. Two of the tables contained in the DB are Category and Item. These are used to display the categories; the customer can then select a category (e.g. Buns), view all of the children contained in that category, and choose which ones to add to their order. The two tables are linked using a field called CategoryID.
The Item table contains, among many others, the following fields: ItemID, ItemName, TimesOrdered, CategoryID.
One of the required functions is to view the item that has been ordered the most (most popular) per category. This can be retrieved from the TimesOrdered field. However, if two items have been ordered the same amount of times, then there is technically not any item in that category that has been ordered the most.
Therefore, the largest TimesOrdered field will have to be compared to the second largest TimesOrdered field to determine if any items have been ordered the most for that category.
Is there any way to achieve this using SQL? For example, showing the ItemID for each category (using Grouping on CategoryID) that has been ordered the most as long as the item that has been ordered the second most has been ordered less times than the item that has been ordered the most.
I know that this can obviously be done by simply viewing the first two items and comparing the second record's TimesOrdered field with the first record's, but as a challenge and a way to improve my SQL, is there any way to get the desired results using a single SQL statement?
Thanks in advance for any responses :)
Would it be possible to share some sample data? For example, what types of records are in your Item table?
How specifically is your Item table related to your Category table? Do you have multiple items per category?
I'd also want to know how the TimesOrdered field gets updated. Is this something that is updated manually by a user whenever that item is ordered, or handled by code?
Regarding the output: It sounds like you want to display, by category, the item with the most orders. Is this correct? If so, would it be displayed via a query the user runs? It sounds like you want to display something different for categories with multiple items having the "max ItemCount" for that category. If a given category has multiple items with max ItemCount, what should display for that category? Could you provide some sample output of what you're expecting to see?
I'm thinking the best way to handle this would be to use multiple sub-queries, which can get rather hairy in Access. It might be best to break this into separate queries in Access, which you can progressively select from:
Create a query Q1 that shows the max TimesOrdered for each category.
Create a query Q2 that uses Q1 to figure out how many items for each category have the max TimesOrdered value.
Depending on how you want to display the final results, you could create a new query, Q3, that either shows NULL for the item in that category (if there's a tie), or the appropriate item. Basically, you'd display the item from each category where the TimesOrdered matches the max TimesOrdered for that category (having to possibly do special handling for categories with ties).
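The Q1/Q2/Q3 logic above can be sketched with SQLite via Python (Access SQL differs slightly; table and column names follow the question, the sample data is made up):

```python
# Sketch of the stepwise tie-detection idea: join each item to the max
# TimesOrdered for its category, then keep only categories where exactly
# one item holds that max.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Item (ItemID INTEGER, ItemName TEXT,
                   TimesOrdered INTEGER, CategoryID INTEGER);
INSERT INTO Item VALUES
  (1, 'Sesame Bun', 5, 1),
  (2, 'Plain Bun',  5, 1),   -- tie in category 1: no clear winner
  (3, 'Cheddar',    7, 2),
  (4, 'Swiss',      3, 2);   -- category 2 has a unique winner
""")

rows = con.execute("""
SELECT i.CategoryID, i.ItemID, i.ItemName
FROM Item i
JOIN (SELECT CategoryID, MAX(TimesOrdered) AS MaxOrdered
      FROM Item GROUP BY CategoryID) q1
  ON i.CategoryID = q1.CategoryID AND i.TimesOrdered = q1.MaxOrdered
GROUP BY i.CategoryID
HAVING COUNT(*) = 1
""").fetchall()
print(rows)  # only category 2's winner survives the tie filter
```

Category 1 is dropped because two items share its max, which is exactly the "no most-ordered item" case described in the question.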
Another thing you might want to think about: What about having a separate Orders table that stores details of each order, rather than having a TimesOrdered field? Of course, that would complicate your queries further, but give you more data to report on.

How does one structure queries to amalgamate master detail records over a given period

Consider the following scenario (if it helps think Northwind Orders / OrderDetails).
I have two tables LandingHeaders and LandingDetails, that record details about commercial fishing trips. Typically, over the course of a week, a fishing vessel can make several trips to sea, and so will end up with several LandingHeader/LandingDetail records.
At the end of each week the company that purchases the results of these fishing trips needs to work out the value of each landing made by each vessel and then pay the owner of that vessel whatever money is due. To add to the fun, some vessels are owned by the same person, so the company purchasing the fish would prefer that the value of all the landings from all of the vessels owned by a given individual be amalgamated into a single payment.
Until now the information required to perform this task was spread across more than a simple master-detail table structure, and as such it has required several stored procedures (along with the judicious use of dictionaries in the main application doing the work) to achieve the desired end result. External circumstances beyond my control have forced some major database changes, and I have taken the opportunity to restructure the LandingHeaders table so that it contains all the necessary information that might be needed.
From the LandingHeaders table I need to record the following fields:
LandingHeaderId of sql type int
VesselOwnerId of sql type int
LandingDate (Just used as part of query in reality) of sql type datetime
From the LandingDetails table I need to record the following fields:
ProductId of sql type int
Quantity of sql type decimal (10,2)
UnitPrice of sql type money
I have been thinking about creating a query that takes as parameters VesselOwnerId, StartDate and EndDate.
As output I need to know which LandingHeaderIds are associated with the owner, and the total Quantity for each distinct ProductId (along with the UnitPrice, which will be the same for each ProductId over the selected period) spread over the various LandingDetails associated with the LandingHeaders over the given period.
I have been thinking along the lines of output rows that might look a little like this;
Can this sort of thing be done from a standard master-detail type table relationship, or will I still need to resort to multiple stored procedures?
A longer term goal is to have a query that could be used to produce xml that could be adapted for use with a web api.
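For what it's worth, a single grouped join over the two tables can produce the described output. A sketch with SQLite via Python (column names follow the post; the sample data and date range are made up):

```python
# Total quantity per product for one owner over a date range; the
# contributing landing ids are gathered with GROUP_CONCAT.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE LandingHeaders (LandingHeaderId INTEGER PRIMARY KEY,
                             VesselOwnerId INTEGER, LandingDate TEXT);
CREATE TABLE LandingDetails (LandingHeaderId INTEGER, ProductId INTEGER,
                             Quantity REAL, UnitPrice REAL);
INSERT INTO LandingHeaders VALUES (1, 10, '2020-01-06'), (2, 10, '2020-01-08');
INSERT INTO LandingDetails VALUES (1, 100, 50.0, 2.5), (2, 100, 30.0, 2.5),
                                  (2, 200, 10.0, 4.0);
""")

rows = con.execute("""
SELECT d.ProductId,
       SUM(d.Quantity)                          AS TotalQuantity,
       MAX(d.UnitPrice)                         AS UnitPrice,
       GROUP_CONCAT(DISTINCT h.LandingHeaderId) AS LandingIds
FROM LandingHeaders h
JOIN LandingDetails d ON d.LandingHeaderId = h.LandingHeaderId
WHERE h.VesselOwnerId = ? AND h.LandingDate BETWEEN ? AND ?
GROUP BY d.ProductId
ORDER BY d.ProductId
""", (10, '2020-01-01', '2020-01-12')).fetchall()
print(rows)
```

The same shape (parameters for owner and date range, one row per product) should translate to a single stored procedure or view in most RDBMSs.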

Modeling N-to-N with DynamoDB

I'm working in a project that uses DynamoDB for most persistent data. I'm now trying to model a data structure that more resembles what one would model in a traditional SQL database, but I'd like to explore the possibilities for a good NoSQL design also for this kind of data.
As an example, consider a simple N-to-N relation such as items grouped into categories. In SQL, this might be modeled with a connection table such as
items
-----
item_id (PK)
name
categories
----------
category_id (PK)
name
item_categories
---------------
item_id (PK)
category_id (PK)
To list all items in a category, one could perform a join such as
SELECT items.name from items
JOIN item_categories ON items.item_id = item_categories.item_id
WHERE item_categories.category_id = ?
And to list all categories to which an item belongs, the corresponding query could be made:
SELECT categories.name from categories
JOIN item_categories ON categories.category_id = item_categories.category_id
WHERE item_categories.item_id = ?
Is there any hope of modeling a relation like this with a NoSQL database in general, and DynamoDB in particular, in a fairly efficient way (not requiring a lot of (N, even?) separate operations) for simple use cases like the ones above, when there is no equivalent of JOINs?
Or should I just go for RDS instead?
Things I have considered:
Inline categories as an array within item. This makes it easy to find the categories of an item, but does not solve getting all items within a category. And I would need to duplicate the needed attributes such as category name etc within each item. Category updates would be awkward.
Duplicate each item for each category and use category_id as the range key, and add a GSI with the reverse (category_id as hash, item_id as range). De-normalizing is common for NoSQL, but I still have doubts. Possibly split items into items and item_details and only duplicate the most common attributes that are needed in listings etc.
Go for a connection table mapping items to categories and vice versa. Use [item_id, category_id] as the key and [category_id, item_id] as a GSI, to support both queries. Duplicate the most common attributes (name etc.) here. To get all full items for a category I would still need to perform one query followed by N get operations though, which consumes a lot of CUs. Updates of item or category names would require multiple update operations, but that's not too difficult.
The dilemma I have is that the format of the data itself suits a document database perfectly, while the relations I need fit an SQL database. If possible I'd like to stay with DynamoDB, but obviously not at any cost...
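The connection-table option (the third one above) is essentially DynamoDB's adjacency-list pattern. A plain-Python sketch of the key shape (list comprehensions stand in for DynamoDB queries; the `ITEM#`/`CATEGORY#` prefixes and attribute names are illustrative, not required by DynamoDB):

```python
# One table keyed by (pk, sk), with a GSI that reverses the two keys so
# both directions of the N-to-N relation are a single Query.
table = [
    {"pk": "ITEM#1", "sk": "CATEGORY#A", "category_name": "Tools"},
    {"pk": "ITEM#1", "sk": "CATEGORY#B", "category_name": "Garden"},
    {"pk": "ITEM#2", "sk": "CATEGORY#A", "category_name": "Tools"},
]

def query_main(pk):
    """Query on the main key: all categories for one item."""
    return [r for r in table if r["pk"] == pk]

def query_gsi(sk):
    """Query on the reversed GSI: all items in one category."""
    return [r for r in table if r["sk"] == sk]

print([r["sk"] for r in query_main("ITEM#1")])    # both categories of item 1
print([r["pk"] for r in query_gsi("CATEGORY#A")]) # both items in category A
```

Duplicating commonly-listed attributes (like `category_name`) onto the edge records is what lets each direction be served by one query instead of a query plus N gets.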
You are already looking in the right direction!
In order to make an informed decision you will need to also consider the cardinality of your data:
Will you be expecting to have just a few (fewer than ten?) categories? Or quite a lot (i.e. hundreds, thousands, tens of thousands, etc.)?
How about items per category: do you expect to have many categories with a few items in each, or lots of items in a few categories?
Then, you need to consider the cardinality of the total data set and the frequency of various types of queries. Will you most often need to retrieve only the items in a single category? Or will you mostly be querying to retrieve items individually, and you just need statistics for the number of items per category, etc.?
Finally, consider the expected growth of your dataset over time. DynamoDB will generally outperform an RDBMS at scale as long as your queries partition well.
Also consider the acceptable latency for each type of query you expect to perform, especially at scale. For instance, if you expect to have hundreds of categories with hundreds of thousands of items each, what does it mean to retrieve all items in a category? Surely you wouldn't be displaying them all to the user at once.
I encourage you to also consider another type of data store to accompany DynamoDB if you need statistics for your data, such as ElasticSearch or a Redis cluster.
In the end, if aggregate queries or joins are essential to your use case, or if the dataset at scale can generally be processed comfortably on a single RDBMS instance, don't try to fit a square peg in a round hole. A managed RDBMS solution like Aurora might be a better fit.

SQL full text index on linked and parent child tables

I would like broad guidelines before hitting the details, so I'll be as brief as I can on two issues (this might be far too little info):
Supplier has more than one address. An address is made up of fields. Address lines 1 and 2 are free text. The rest are keys to master data tables that have an FK and a name: an ID to country, an ID to province, an ID to municipality, an ID to city, an ID to suburb. I would like to employ FTS on address lines 1 and 2 and also on all the master data table name columns, so that a user can find suppliers whose address matches what they capture. This is thus across various master data tables. Note that a province or suburb etc. is not necessarily a single word, e.g. Meadow Park.
Supplier provides many goods and services. These goods and services form a 4-level hierarchy (UNSPSC) of parent and child items. The goods or service is at the lowest level of the 4-level hierarchy, but hits on the higher levels should be returned as well. The supplier is linked to the lowest-level items of the hierarchy. I would like to employ FTS to find suppliers who provide goods and services across the 4-level hierarchy.
The idea is this: find matches, return suppliers, show rank, and show where the hit occurred. If I'm unable to show the hit, the result makes less sense; e.g. a search for the word "car" will hit on care, cars, cardiovascular, cards, etc. The user can type in more than one word, e.g. "car service".
Should the FTS indexes be on only the text fields on the master data tables, and my select thus inner join on FTS fields? Should I create views and indexes on those? How do I show the hit words?
Should the FTS indexes be on only the text fields on the master data tables...?
When you have fields across multiple tables that need to be searched in a single query and then ranked, the best practice is to combine those fields into a single field through an ETL process. I recently posted an answer where I explained the benefits of this approach:
Why combine them into 1 table? This approach results in better ranking than if you were to apply full text indexes to each existing table. The former solution produces a single rank, whereas the latter will produce a different rank for each table, and there is no accurate way to resolve multiple ranks (which are based on completely different scales) into 1 rank. ...
How can you combine them into 1 table? You will need some sort of ETL process which either runs on a schedule (which may be easier to implement but will result in lag time where your full text index is out of sync with the master tables) or gets run on demand whenever your master tables are modified (either via triggers or by hooking into an event in your data layer).
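The flattening step of that ETL can be sketched with SQLite via Python (table and column names are illustrative for the supplier-address case, not from the original schema; only one master table is shown):

```python
# Sketch of the "combine into one search table" ETL step: every
# searchable field is flattened into a single text column, which would
# then carry the single full text index.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Supplier (SupplierId INTEGER, Line1 TEXT, Line2 TEXT,
                       SuburbId INTEGER);
CREATE TABLE Suburb (SuburbId INTEGER, Name TEXT);
INSERT INTO Supplier VALUES (1, '12 Main Rd', 'Unit 4', 7);
INSERT INTO Suburb VALUES (7, 'Meadow Park');

CREATE TABLE SupplierSearch AS
SELECT s.SupplierId,
       s.Line1 || ' ' || s.Line2 || ' ' || b.Name AS SearchText
FROM Supplier s JOIN Suburb b ON b.SuburbId = s.SuburbId;
""")

print(con.execute("SELECT SearchText FROM SupplierSearch").fetchone()[0])
# '12 Main Rd Unit 4 Meadow Park'
```

In SQL Server the same `SupplierSearch` table would get the one full text index, and the CREATE TABLE AS step would be replaced by the scheduled or trigger-driven refresh described above.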
How do I show the hit words?
Unfortunately SQL Server Full Text does not have a feature that extracts or highlights the words/phrases that were matched during the search. The answers to this question have tips on how to roll your own solution. There's also a 3rd party product called ThinkHighlight which is a CLR assembly that helps with highlighting (I've never used it so I can't vouch for it).
...search for the word "car" will hit on care, cars, cardiovascular, cards etc...
You didn't explicitly ask about this but you should be aware that by default "car" will not match "care", etc. What you're looking to do is a wildcard search. Your full text query will need to use an asterisk and should look something like this: SELECT * FROM CONTAINSTABLE(MyTable, *, '"car*"') Be aware that wildcards are only available when using CONTAINS/CONTAINSTABLE (boolean searches), not FREETEXT/FREETEXTTABLE (natural language searches). Based on how you describe your use case, it sounds like you will need to modify your user's search string to add the wildcards. You'll need to do this anyway if you use CONTAINS/CONTAINSTABLE in order to add the boolean operators and quotes (ex: User types car service. You change it to "car*" AND "service*".)
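The search-string rewrite described above is a small string transformation. A sketch in Python (`to_contains_query` is a hypothetical helper name; it ignores escaping of quotes and operators in user input, which a real implementation would need):

```python
# Rewrite a raw user search string into a CONTAINS-style boolean
# wildcard query: 'car service' -> '"car*" AND "service*"'.
def to_contains_query(user_input: str) -> str:
    terms = user_input.split()
    return " AND ".join(f'"{t}*"' for t in terms)

print(to_contains_query("car service"))  # "car*" AND "service*"
```

The resulting string is what you would pass as the search condition to CONTAINS/CONTAINSTABLE, ideally via a parameter rather than string concatenation.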

Indexed View with its own rowversion for Azure Search indexer

I'm trying to design the best way to index my data into Azure Search. Let's say my Azure SQL Database contains two tables:
products
orders
In my Azure Search index I want to have not only products (name, category, description etc.), but also count of orders for this product (to use this in the scoring profiles, to boost popular products in search results).
I think that the best way to do this is to create a view (indexed view?) which will contain columns from products and count of orders for each product, but I'm not sure if my view (indexed view?) can have its own rowversion column, which will change every time the count changes (orders may be withdrawn - DELETED - and placed - INSERTED).
Maybe there is some easier solution to my problem? Any hints are appreciated.
Regards,
MJ
Yes, I believe the way you are looking to do this is a good approach. Something else I have seen people do is also include types. For example, you could have a Collection field (which is an array of strings), perhaps called OrderTypes, that you would load with all of the associated order types for that product. That way you can use the Azure Search $facets feature to show the total count of specific order types. Also, you can use this to drill into the specifics of those orders. For example, you could then filter based on the order type the user selected. Certainly, if there are too many types of orders, that might not be viable.
In any case, yes, I think this would work well. Also, don't forget: if you want to periodically update this count, you could simply pass on just that value (rather than sending all of the product fields) to make it more efficient.
A view cannot have its "own" rowversion column - that column should come from either the products or the orders table. If you make that column indexed, a high-water-mark change tracking policy will be able to capture new or updated (but not deleted) rows efficiently. If products are deleted, you should look into using a soft-delete approach, as described in http://azure.microsoft.com/en-us/documentation/articles/search-howto-connecting-azure-sql-database-to-azure-search-using-indexers-2015-02-28/
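The shape of the products-plus-order-count view can be sketched with SQLite via Python (SQL Server's indexed views and rowversion have no direct SQLite equivalent; table and column names are illustrative):

```python
# A view joining products to a count of their orders, the shape Azure
# Search would index for scoring-profile boosts.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product_id INTEGER);
INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);

CREATE VIEW product_order_counts AS
SELECT p.product_id, p.name, COUNT(o.order_id) AS order_count
FROM products p LEFT JOIN orders o ON o.product_id = p.product_id
GROUP BY p.product_id, p.name;
""")

print(con.execute(
    "SELECT name, order_count FROM product_order_counts ORDER BY product_id"
).fetchall())  # [('Widget', 2), ('Gadget', 1)]
```

In the SQL Server version, the change-tracking column the indexer watches would come from one of the base tables, as noted above.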
HTH,
Eugene