How to create a faceted search with SQL Server

I have an application that will query SQL Server to return data filtered by selections made in the application, like any common faceted search. I did see some out-of-the-box solutions, but they are expensive and I would prefer to build something custom; I just don't know where to start.
The database structure is like this:
The data from the PRODUCT table would be searched by tags from the TAG table. Values which would be found in the TAG table would be something like this:
ID NAME
----------------------
1 Blue
2 Green
3 Small
4 Large
5 Red
They would be related to products through the ProductTag table.
I would need to return two groups of data from this setup:
The Products that are only related to the Tags selected, whether single or multiple
The remaining tags that are still available to select for the products which have already been refined by the single or multiple selected tags.
I would like this to all be within SQL Server if possible, as 2 separate stored procedures.
Most websites have this feature built into it these days, ie: http://www.gnc.com/family/index.jsp?categoryId=2108294&cp=3593186.3593187 (They've called it 'Narrow By')
I have been searching for a while for how to do this, and my best guess is that a stored procedure of this nature would need 1 param that accepts CSV values, like this:
[dbo].[GetFacetedProducts] @Tags_Selected = '1,3,5'
[dbo].[GetFacetedTags] @Tags_Selected = '1,3,5'
So with this architecture, does anyone know what types of queries need to be written for these stored procedures, or is the architecture flawed in any way? Has anyone created a faceted search like this before? If so, what types of queries would be needed? I guess I'm just having trouble wrapping my head around it, and there isn't much out there that shows someone how to make something like this.

An RDBMS is the wrong tool for faceted searching. Faceted searching is a multidimensional search, which is difficult to express in the set-based SQL language. Using a data cube or the like might give you some of the desired functionality, but would be quite a bit of work to build.
When we were faced with similar requirements we ultimately decided to utilize the Apache Solr search engine, which supports faceting as well as many other search-oriented functions and features.

It is possible to do faceted search in SQL Server. However, don't try to use your live product data tables. Instead, create a de-normalised "fact" table which holds every product (rows) and every tag (columns), so that each intersection is a product-tag value. You can re-populate this periodically from your main product table.
It is then straightforward and relatively efficient to get the facet counts for the matching records for each tag the user checks.
The approach I have described will be perfectly good for small cases, e.g. 1,000 product rows and 50-100 tags (attributes). Also there is an interesting opportunity with the forthcoming SQL Server 2014, which can place tables in memory - that should allow much larger fact tables.
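As a rough illustration of the fact-table idea, here is a minimal sketch using SQLite from Python (the table name and tag columns are invented for the example; in SQL Server you would build and query the equivalent table in T-SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# One row per product, one column per tag; 1 = tag applies.
c.execute("""CREATE TABLE ProductFacts (
    ProductID INTEGER PRIMARY KEY,
    Blue INTEGER, Green INTEGER, Small INTEGER, Large INTEGER, Red INTEGER)""")
c.executemany("INSERT INTO ProductFacts VALUES (?,?,?,?,?,?)", [
    (1, 1, 0, 1, 0, 0),   # blue, small
    (2, 1, 0, 0, 1, 0),   # blue, large
    (3, 0, 1, 1, 0, 0),   # green, small
])

# The user has checked "Blue": facet counts for the remaining tags
# among the matching products are simple SUMs over the filtered rows.
c.execute("SELECT SUM(Small), SUM(Large) FROM ProductFacts WHERE Blue = 1")
small, large = c.fetchone()
print(small, large)  # 1 1
```

The counting stays cheap because every facet is answered from one flat table scan, at the cost of re-populating that table when the product data changes.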
I have also used Solr, and as STW points out this is the "correct" tool for facet searches. It is orders of magnitude faster than a SQL Server solution.
However there are some major disadvantages to using Solr. The main issue is that you have to set up not only another platform (Solr) but also all the paraphernalia that goes with it - Java and some kind of Java servlet container (of which there are several). And whilst Solr runs on Windows quite nicely, you will still soon find yourself immersed in a world of command lines and editing of configuration files and environment variables that will remind you of all that was great about the 1980s ... or possibly not. When that is all working you then need to export your product data to it using various methods - there is a SQL Server connector which works fairly well, but many prefer to post data to it as XML. And then you have to create a webservice-type process in your application to send it the user's query and parse the resulting list of matches and counts back into your application (again, XML is probably the best method).
So if your dataset is relatively small, I would stick with SQL Server. You can still get a sub-second response, and SQL 2014 will hopefully allow much bigger datasets. If your dataset is big then Solr will give remarkably fast results (it really is very fast) but be prepared to make a major investment in learning and supporting a whole new platform.

There are other places where you can get examples of turning a CSV parameter into a table variable. Assuming you have done that part, your query boils down to the following:
GetFacetedProducts:
Find Product records where all tags passed in are assigned to each product.
If you wrote it by hand you could end up with:
SELECT P.*
FROM Product P
INNER JOIN ProductTag PT1 ON PT1.ProductID = P.ID AND PT1.TagID = 1
INNER JOIN ProductTag PT2 ON PT2.ProductID = P.ID AND PT2.TagID = 3
INNER JOIN ProductTag PT3 ON PT3.ProductID = P.ID AND PT3.TagID = 5
While this does select only the products that have those tags, it is not going to work with a dynamic list. In the past some people have built up the SQL as a string and executed it dynamically; don't do that.
Instead, let's assume that the same tag can't be applied to a product twice, so we can change our question to:
Find me products where the number of tags matching (dynamic list) is equal to the number of tags in (dynamic list)
DECLARE @selectedTags TABLE (ID int)
DECLARE @tagCount int

INSERT INTO @selectedTags VALUES (1)
INSERT INTO @selectedTags VALUES (3)
INSERT INTO @selectedTags VALUES (5)

SELECT @tagCount = COUNT(*) FROM @selectedTags

SELECT P.ID
FROM Product P
JOIN ProductTag PT ON PT.ProductID = P.ID
JOIN @selectedTags T ON T.ID = PT.TagID
GROUP BY P.ID
HAVING COUNT(PT.TagID) = @tagCount
This returns just the IDs of the products that match all your tags; you could then join this back to the Product table if you want more than just an ID, otherwise you're done.
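To make the shape of this concrete, here is a small self-contained sketch of the same query running against SQLite from Python, with the join back to the Product table for the name (the sample data is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Product (ID INTEGER PRIMARY KEY, Name TEXT)")
c.execute("CREATE TABLE ProductTag (ProductID INTEGER, TagID INTEGER)")
c.executemany("INSERT INTO Product VALUES (?,?)",
              [(1, "Small Blue Red Shirt"), (2, "Small Blue Shirt")])
c.executemany("INSERT INTO ProductTag VALUES (?,?)",
              [(1, 1), (1, 3), (1, 5), (2, 1), (2, 3)])

# Stand-in for the parsed @Tags_Selected CSV parameter.
selected = [1, 3, 5]
c.execute("CREATE TEMP TABLE SelectedTags (ID INTEGER)")
c.executemany("INSERT INTO SelectedTags VALUES (?)", [(t,) for t in selected])

# Products having ALL selected tags: count the matching tag rows per
# product and require the count to equal the number of selected tags.
rows = c.execute("""
    SELECT P.ID, P.Name
    FROM Product P
    JOIN ProductTag PT ON PT.ProductID = P.ID
    JOIN SelectedTags T ON T.ID = PT.TagID
    GROUP BY P.ID, P.Name
    HAVING COUNT(PT.TagID) = ?
""", (len(selected),)).fetchall()
print(rows)  # [(1, 'Small Blue Red Shirt')]
```

Product 2 is filtered out because it matches only 2 of the 3 selected tags.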
As for your second query, once you have the product IDs that match, you want a list of all tags for those product IDs that aren't in your list:
SELECT DISTINCT PT2.TagID
FROM ProductTag PT2
WHERE PT2.ProductID IN (
    SELECT P.ID
    FROM Product P
    JOIN ProductTag PT ON PT.ProductID = P.ID
    JOIN @selectedTags T ON T.ID = PT.TagID
    GROUP BY P.ID
    HAVING COUNT(PT.TagID) = @tagCount
)
AND PT2.TagID NOT IN (SELECT ID FROM @selectedTags)
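Here is a hedged sketch of that second query against SQLite from Python, slightly simplified to group directly on ProductTag since only the product IDs are needed in the subquery (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE ProductTag (ProductID INTEGER, TagID INTEGER)")
# Product 1 is Blue/Small/Red; product 2 is Blue/Small/Red and also Green.
c.executemany("INSERT INTO ProductTag VALUES (?,?)",
              [(1, 1), (1, 3), (1, 5), (2, 1), (2, 3), (2, 5), (2, 2)])

selected = [1, 3, 5]
c.execute("CREATE TEMP TABLE SelectedTags (ID INTEGER)")
c.executemany("INSERT INTO SelectedTags VALUES (?)", [(t,) for t in selected])

# Tags still available to refine by: tags on the matching products
# that are not already in the selected list.
remaining = c.execute("""
    SELECT DISTINCT PT2.TagID
    FROM ProductTag PT2
    WHERE PT2.ProductID IN (
        SELECT PT.ProductID
        FROM ProductTag PT
        JOIN SelectedTags T ON T.ID = PT.TagID
        GROUP BY PT.ProductID
        HAVING COUNT(PT.TagID) = ?
    )
    AND PT2.TagID NOT IN (SELECT ID FROM SelectedTags)
""", (len(selected),)).fetchall()
print(remaining)  # [(2,)]
```

Only tag 2 ("Green") comes back, which is exactly the "Narrow By" list the question asks for.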

Related

What am I / Was I missing on this SQL request?

I apologize for any vagueness; this is about a question I had on an entry-level SQL position test last week. I couldn't figure out how to do it at the time, and can't seem to figure it out now.
Basically I was provided with 3 tables. One was Recipes (had recipe name, ID, instructions and notes), one was RecipeIngredients (had recipe ID, ingredient ID, ingredients, and quantity of ingredients), and the third was Ingredients (ingredient ID and ingredient). Something like that.
I had a few queries with JOIN statements that showed how to make certain recipes and so on. But I couldn't figure out quite how to manage the final question. The final question was something like:
"Provide 2 sets of queries at once. 1st query - Return Ingredients, quantities, and notes for a specific ID. 2nd Query - Return the instructions for the same Recipe ID. Write the queries so that the user can easily alter the recipe ID in one place only for both queries in order to query for different recipe IDs."
I know we can't alias a WHERE clause, but that is the only thing I can remotely think of for doing 2 queries at once with only specifying the WHERE once. I tried to see if I could do it with a subquery but had no luck. I considered UNION... but there were different columns and different values in each query so that's a no go.
Is there something I'm missing? Or did I just completely fail when trying to set this up as a subquery? I apologize for vagueness, it's been a few days and I've been too busy to remember to post this on here. I've found a lot of help anonymously browsing this site in the past so I figured it was worth posting this as I've not seen anything similar so far.
Using SQL Server, you could have declared a variable, like @recipeID, then used it in both queries. That would allow you to change the value in one place, and have it used in 2 queries.
DECLARE @recipeID INT = 123

SELECT *
FROM recipes
WHERE recipeID = @recipeID

SELECT *
FROM recipe_ingredients
WHERE recipeID = @recipeID
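The same change-in-one-place idea can be sketched outside T-SQL as well; here a Python variable plays the role of @recipeID against a throwaway SQLite database (table names follow the answer, data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE recipes (recipeID INTEGER, name TEXT)")
c.execute("CREATE TABLE recipe_ingredients (recipeID INTEGER, ingredient TEXT)")
c.execute("INSERT INTO recipes VALUES (123, 'Pancakes')")
c.executemany("INSERT INTO recipe_ingredients VALUES (?,?)",
              [(123, 'Flour'), (123, 'Eggs')])

# The single place to change, analogous to DECLARE @recipeID INT = 123.
recipe_id = 123

name_rows = c.execute("SELECT name FROM recipes WHERE recipeID = ?",
                      (recipe_id,)).fetchall()
ing_rows = c.execute("SELECT ingredient FROM recipe_ingredients WHERE recipeID = ?",
                     (recipe_id,)).fetchall()
print(name_rows, ing_rows)
```

Changing `recipe_id` once changes what both queries return, which is all the test question was after.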

SQL to HQL query in Grails for a very large association

In my app there's the following relation: Page hasMany Paragraphs, and I need to create a query that returns all pages where the number of paragraphs is less than a limit. The problem is that the pages are created in another app at approximately 2 per second, and the paragraphs table contains more than 2 million rows. All standard Grails approaches, like dynamic finders and criteria queries, just hang as they generate very suboptimal SQL. In the database console the following query does the job:
select * from (
select a.id, count(b.page_id) count from page a
left join paragraph b ON a.id = b.page_id
group by 1) sub
WHERE sub.count <= 10 LIMIT 1000
And I couldn't translate this query into HQL. I know there's groovy.sql available, but its rows method returns a List of GroovyResult, not a list of domain classes. Is there a better approach to the issue?
If a query gets too complicated I tend to do something like this:
def results = new Sql(dataSource).rows(SQL)*.id*.asType(Integer).collect(DomainClass.&get)
I know it doesn't look too great and you'd probably get no kudos for it, but it gets the job done.
However, if you'd like to use something more expressive, you could give JOOQ a try (http://www.jooq.org/).

Learning ExecuteSQL in FMP12, a few questions

I have joined a new job where I am required to use FileMaker (and gradually transition systems to other databases). I have been a DB Admin of a MS SQL Server database for ~2 years, and I am very well versed in PL/SQL and T-SQL. I am trying to carry my SQL knowledge over to FMP using the ExecuteSQL functionality, and I'm kinda running into a lot of small pains :)
I have 2 tables: Movies and Genres. The relevant columns are:
Movies(MovieId, MovieName, GenreId, Rating)
Genres(GenreId, GenreName)
I'm trying to find the movie with the highest rating in each genre. The SQL query for this would be:
SELECT M.MovieName
FROM Movies M INNER JOIN Genres G ON M.GenreId=G.GenreId
WHERE M.Rating=
(
SELECT MAX(Rating) FROM Movies WHERE GenreId = M.GenreId
)
I translated this as best as I could to an ExecuteSQL query:
ExecuteSQL ("
SELECT M::MovieName FROM Movies M INNER JOIN Genres G ON M::GenreId=G::GenreId
WHERE M::Rating =
(SELECT MAX(M2::Rating) FROM Movies M2 WHERE M2::GenreId = M::GenreId)
"; "" ; "")
I set the field type to Text and also ensured values are not stored. But all I see are '?' marks.
What am I doing incorrectly here? I'm sorry if it's something really stupid, but I'm new to FMP and any suggestions would be appreciated.
Thank you!
--
Ram
UPDATE: Solution and the thought process it took to get there:
Thanks to everyone that helped me solve the problem. You guys made me realize that the traditional SQL thought process does not exactly carry over to FMP. When I probed around, I realized that to best use SQL knowledge in FMP, I should consider each column independently and not think of the entire result set when I write a query. This means that for my current functionality, the JOIN is no longer necessary: the JOIN was there to bring in the GenreName, which is a different column that FMP maps automatically. I just needed to remove the JOIN, and it works perfectly.
TL;DR: The thought process context should be the current column, not the entire expected result set.
Once again, thank you @MissJack, @Chuck (how did you even get that username?), @pft221 and @michael.hor257k
I've found that FileMaker is very particular in its formatting of queries using the ExecuteSQL function. In many cases, standard SQL syntax will work fine, but in some cases you have to make some slight (but important) tweaks.
I can see two things here that might be causing the problem...
ExecuteSQL ("
SELECT M::MovieName FROM Movies M INNER JOIN Genres G ON
M::GenreId=G::GenreId
WHERE M::Rating =
(SELECT MAX(M2::Rating) FROM Movies M2 WHERE M2::GenreId = M::GenreId)
"; "" ; "")
You can't use the standard FMP table::field format inside the query.
Within the quotes inside the ExecuteSQL function, you should follow the SQL format of table.column. So M::MovieName should be M.MovieName.
I don't see an AS anywhere in your code.
In order to create an alias, you must state it explicitly. For example, in your FROM, it should be Movies AS M.
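Putting those two fixes together, the query would look something like this (an untested sketch; table and column names as in the question):

```sql
ExecuteSQL ("
    SELECT M.MovieName
    FROM Movies AS M
    INNER JOIN Genres AS G ON M.GenreId = G.GenreId
    WHERE M.Rating =
        (SELECT MAX(M2.Rating) FROM Movies AS M2 WHERE M2.GenreId = M.GenreId)
"; "" ; "")
```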
I think if you fix those two things, it should probably work. However, I've had some trouble with JOINs myself, as my primary experience is with FMP, and I'm only just now becoming more familiar with SQL syntax.
Because it's incredibly hard to debug SQL in FMP, the best advice I can give you here is to start small. Begin with a very basic query, and once you're sure that's working, gradually add more complicated elements one at a time until you encounter the dreaded ?.
There's a number of great posts on FileMaker Hacks all about ExecuteSQL:
Since you're already familiar with SQL, I'd start with this one: The Missing FM 12 ExecuteSQL Reference. There's a link to a PDF of the entire article if you scroll down to the bottom of the post.
I was going to recommend a few more specific articles (like the series on Robust Coding, or Dynamic Parameters), but since I'm new here and I can't include more than 2 links, just go to FileMaker Hacks and search for "ExecuteSQL". You'll find a number of useful posts.
NB If you're using FMP Advanced, the Data Viewer is a great tool for testing SQL. But beware: complex queries on large databases can sometimes send it into fits and freeze the program.
The first thing to keep in mind when working with FileMaker and ExecuteSQL() is the difference between tables and table occurrences. This is a concept that's somewhat unique to FileMaker. Succinctly, tables store the data, but table occurrences define the context of that data. Table occurrences are what you're seeing in FileMaker's relationship graph, and the ExecuteSQL() function needs to reference the table occurrences in its query.
I agree with MissJack regarding the need to start small in building the SQL statement and use the Data Viewer in FileMaker Pro Advanced, but there's one more recommendation I can offer, which is to use SeedCode's SQL Explorer. It does require the adding of table occurrences and fields to duplicate the naming in your existing solution, but this is pretty easy to do and the file they offer includes a wizard for building the SQL query.

SQL, to loop or not to loop?

the problem story goes like:
consider a program to manage bank accounts with balance limits for each customer
{table Customers, table Limits} where for each Customer.id there's one Limit record
then the client asked to store a history of the limits' changes. That's not a problem, since I already have a date column on Limit, but the view-query for the active/latest limits needs to be changed
before: Customer-Limit was 1 to 1 so a simple select did the job
now: it would show all the Limits records, which means multiple records for each Customer, and I need the latest Limits only, so I thought of something like this pseudo code
foreach( id in Customers)
{
select top 1 *
from Limits
where Limits.customer_id = id
order by Limits.date desc
}
but while looking through SO for similar issues, I came across stuff like
"95% of the time when you need a looping structure in tSQL you are probably doing it wrong"-JohnFx
and
"SQL is primarily a set-orientated language - it's generally a bad idea to use a loop in it."-Mark Bannister
Can anyone confirm/explain why it is wrong to loop? And in the problem explained above, what am I getting wrong that makes me need a loop?
thanks in advance
update : my solution
In light of TomTom's answer and the link he suggested, and before Dean kindly answered with code, I came up with this
SELECT *
FROM Customers c
LEFT JOIN Limits a ON a.customer_id = c.id
AND a.date =
(
SELECT MAX(date)
FROM Limits z
WHERE z.customer_id = a.customer_id
)
thought I'd share :>
thanks for your response,
happy coding
Will this do?
;with l as (
select *, row_number() over(partition by customer_id order by date desc) as rn
from limits
)
select *
from customers c
left join l on c.id = l.customer_id and l.rn = 1
I am assuming that earlier (i.e. before implementing the history functionality) you must have been updating the Limits table. Now, for the history functionality, you have started inserting new records. Doesn't this trigger a lot of changes in your database and code?
Instead of inserting new records, how about keeping the original functionality as is and creating a new table say Limits_History which will store all the old values from Limits table before updating it? Then all you need to do is fetch records from this table if you want to show history. This will not cause any changes in your existing SPs and code hence will be less error prone.
To insert record in the Limits_History table, you can simply create an AFTER TRIGGER and use the deleted magic table. Hence you need not worry about calling an SP or something to maintain history. The trigger will do this for you. Good examples of trigger are here
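As a sketch of the trigger idea (shown in SQLite via Python, where the OLD row values play the role of SQL Server's deleted table; the Limits_History columns are assumed to mirror Limits):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Limits (customer_id INTEGER, amount INTEGER, date TEXT)")
c.execute("CREATE TABLE Limits_History (customer_id INTEGER, amount INTEGER, date TEXT)")

# On every update of Limits, copy the pre-update row into the history
# table. In T-SQL this would be an AFTER UPDATE trigger reading from
# the `deleted` pseudo-table instead of OLD.
c.execute("""
    CREATE TRIGGER trg_Limits_History AFTER UPDATE ON Limits
    BEGIN
        INSERT INTO Limits_History
        VALUES (OLD.customer_id, OLD.amount, OLD.date);
    END
""")

c.execute("INSERT INTO Limits VALUES (1, 1000, '2013-01-01')")
c.execute("UPDATE Limits SET amount = 2000, date = '2013-06-01' WHERE customer_id = 1")

history = c.execute("SELECT * FROM Limits_History").fetchall()
print(history)  # [(1, 1000, '2013-01-01')]
```

The application keeps updating Limits exactly as before; the history accumulates as a side effect of the trigger.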
Hope this helps
It is wrong. You can do the same by querying Customers and Limits with a subquery limiting to the most recent record on Limits.
This is similar in concept to the query presented in Most recent record in a left join
You may have to do so in 2 joins - get the most recent date, then get the limit for that date. While this may look complex - it is a beginner issue; talk complex when you have SQL statements reaching 2 printed pages and more ;)
Now, for an operational system the table design is broken - Limits should contain the most recent limit, and a LimitHistory table the historical (or: all) entries, allowing fast retrieval of the CURRENT limit (which will be the one to apply to all transactions) without the overhead of the history. The table design you have assumes all limits are identical - that may be the truth (is the truth) for a reporting data warehouse, but is wrong for a transactional system, as the history is not transacted.
Confirmation for why loop is wrong is exactly in the quoted parts in your question - SQL is a set-orientated language.
This means when you work on sets there's no reason to loop through the single rows, because you already have the 'result' (set) of data you want to work on.
Then the work you are doing should be done on the set of rows, because otherwise your selection is wrong.
That being said, there are of course situations where looping is done in SQL: it will generally be done via cursors if over data, or via a while loop if calculating things (generally; there are always exceptions).
However, as also mentioned in the quotes, often when you feel like using a loop you either shouldn't (it's poor performance) or you're doing logic in the wrong part of your application.
Basically - it is similar to how object-oriented languages work on objects and references to those objects. A set-based language works on - well, sets of data.
SQL is basically made to function in that manner - query relational data into result sets - so when working with the language, you should let it do what it can do and work on that. Just as if it was Java or any other language.

How to get specific records and all of their related records in SQL?

I have more than 3 tables. But for simplicity, let's take 3: Products, ProductBrands and ProductAttributes. Every product has zero or more brands and zero or more attributes. Now I have,
SELECT P.Name, P.ID, P.Desc FROM Products P
But I want to select all the product attributes and brands in the same SQL. I am thinking of this,
SELECT P.Name, P.ID, P.Desc, GetProductAttributesInJSONOrXML(P.ID), GetProductBrandsInJSONOrXML(P.ID) FROM Products P
How do I create the GetProductAttributesInJSONOrXML and GetProductBrandsInJSONOrXML functions, so that in my app I can easily deserialize the XML or JSON? Please let me know if there is a better way.
You can select data in SQL Server as XML by use of the FOR XML clause. Such a query will give you back a single row with a single column containing the generated XML. Here's an example.
You could use something like this:
SELECT Product.Name, Product.ID, Product.Desc, Attribs.Attribute, Brands.Brand
FROM Products Product
LEFT JOIN ProductBrands Brands ON Product.ID = Brands.ProductID
LEFT JOIN ProductAttributes Attribs ON Product.ID = Attribs.ProductID
FOR XML AUTO, ELEMENTS
To get XML schema something like this, with one Product group for each row:
<Product>
<Name></Name>
<ID></ID>
<Desc></Desc>
<Attribs>
<Attribute></Attribute>
</Attribs>
<Brands>
<Brand></Brand>
</Brands>
</Product>
...
There are a lot of different options with the clause to get the schema formatted exactly the way you want, though it might take a bit of work for more complicated designs.
There's no way to generate JSON on SQL Server, short of using code to explicitly generate it with text functions. This would be complicated and probably not perform very well, since SQL Server is not optimized for text processing. Generating JSON is best done at the application level.
If you need to emit both JSON and XML, I suggest generating both at the application level. This allows needing only one SQL query to get your data, and keeping the data formatting code in a single place.
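For example, a minimal sketch of assembling the JSON at the application level, using Python and SQLite purely for illustration (table and column names follow the question; the data is invented):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Products (ID INTEGER, Name TEXT)")
c.execute("CREATE TABLE ProductBrands (ProductID INTEGER, Brand TEXT)")
c.execute("CREATE TABLE ProductAttributes (ProductID INTEGER, Attribute TEXT)")
c.execute("INSERT INTO Products VALUES (1, 'Widget')")
c.executemany("INSERT INTO ProductBrands VALUES (?,?)", [(1, 'Acme')])
c.executemany("INSERT INTO ProductAttributes VALUES (?,?)",
              [(1, 'Blue'), (1, 'Small')])

# One query per table, then assemble the nested structure in code
# rather than asking the database to produce JSON.
products = {}
for pid, name in c.execute("SELECT ID, Name FROM Products").fetchall():
    products[pid] = {"ID": pid, "Name": name, "Brands": [], "Attributes": []}
for pid, brand in c.execute("SELECT ProductID, Brand FROM ProductBrands").fetchall():
    products[pid]["Brands"].append(brand)
for pid, attr in c.execute("SELECT ProductID, Attribute FROM ProductAttributes").fetchall():
    products[pid]["Attributes"].append(attr)

payload = json.dumps(list(products.values()))
print(payload)
```

The same in-memory structure can be serialized to XML instead, which keeps both output formats coming from one set of queries.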
If you want to do it in SQL server, here are the options:
Generate JSON String in Queries
Store JSON string in a column as string for faster rendering
You can also do it in Java by using the JSON library mentioned in the previous answer.