Find a series of data using non-exact measurements (fuzzy logic) - sql

This is a more complex follow-up question to: Efficient way to look up sequential values
Each Product can have many Segment rows (thousands). Each segment has a position column that starts at 1 for each product (1, 2, 3, 4, 5, etc.) and a value column that can contain arbitrary values such as 323.113, 5423.231, 873.42, 422.64, 763.1, etc. The data is read-only.
It may help to think of the product as a song and the segments as a set of musical notes in the song.
Given a subset of contiguous segments, like a snippet of a song, I would like to identify potential matches for products. However, due to potential errors in measurements, the segments in the subset may not match the segments in the database exactly.
How can I identify product candidates by finding the segments of products which most closely match the subset of segments I have measured? Also, is a database the best medium for this type of data?
Here are just some thoughts on how I was planning to approach this problem. Please don't take these as exact requirements; I am open to any kind of algorithm that makes this work as well as possible. I was thinking there need to be multiple threshold variables for determining closeness. One possibility might be to implement a proximity threshold and a match threshold.
For example, given these values:
Product A contains these segments: 11,21,13,13,15.
Measurement 1 has captured: 20,14,14,15.
Measurement 2 has captured: 11,21,78,13.
Measurement 3 has captured: 15,13,21,13,11.
If a proximity threshold allowed the measured segment to be 1 above or below the actual segment, then Measurement 1 may match Product A because, although many segments do not match exactly, they are within the proximity threshold relative to the actual values.
If a match threshold allowed for measurements with matches of 3 or more, Measurement 2 may return Product A because, although one of the segments (78) far exceeds the proximity threshold, it still matches 3 segments in the correct order and so is within the match threshold.
Measurement 3 would not match Product A because, although all measured segments exist in the actual segments, they are not within the proximity or match thresholds.
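To make the two thresholds concrete, here is a rough T-SQL sketch of how they might be combined, assuming a Segment table (ProductId, Position, Value) and the measured snippet loaded into a #Measurement table (Position, Value); the threshold values are placeholders, and the query aligns the snippet at position 1 only, without the sliding comparison discussed in the answers below:
-- Rough sketch only: count how many measured positions fall within the proximity
-- threshold and keep products that reach the match threshold.
DECLARE @ProximityThreshold FLOAT = 1;  -- how far a measured value may drift from the actual value
DECLARE @MatchThreshold INT = 3;        -- how many positions must be "close" to call it a match

SELECT s.[ProductId],
       SUM(CASE WHEN ABS(m.[Value] - s.[Value]) <= @ProximityThreshold THEN 1 ELSE 0 END) AS [MatchCount]
FROM #Measurement m
INNER JOIN [dbo].[Segment] s
    ON s.[Position] = m.[Position]
GROUP BY s.[ProductId]
HAVING SUM(CASE WHEN ABS(m.[Value] - s.[Value]) <= @ProximityThreshold THEN 1 ELSE 0 END) >= @MatchThreshold
ORDER BY [MatchCount] DESC;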
Update: One of the answers asked me to define what I mean by most closely match. I'm not exactly sure how to answer that, but I'll try to explain by continuing with the song analogy. Let's say the segments represent maximum frequencies of a recorded song. If I record that same song again it will be similar, but due to background noise and other limitations of recording equipment, some of the frequencies will match, some will be close, and a few will be way off. In this scenario, how would you define when one recording "matches" another? That's the same kind of matching logic I'm looking for to use in this problem.

From the information you posted, this can be solved with Edmonds' Blossom V perfect matching algorithm. You can either minimize or maximize the objective function, and it will always find the best matching. Alternatively, you could use a brute-force solution with two loops. The Wikipedia article on Edmonds' matching algorithm: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm

You need to come up with a definition for "most closely match". I don't know how anyone here can help you with that since no one here is going to know the business requirements or intricacies of the data. Your two methods both sound reasonable, but I have no idea if they actually are or not.
As for whether or not a database is the correct medium for this kind of data, I'd say that a database is probably the perfect medium for storing the data, but it is very likely not the correct medium for processing it. Whether it is possible or not will depend on your final definition of what constitutes "most closely match".
As a quick note, SSIS has some fuzzy match capabilities built into it for processing data. I've only played around with it though and that was a couple of years ago, so I don't know if it would work for what you're doing or not.

If you take your song example literally, one approach is to boil your input down to a bit-vector fingerprint, and then look up that fingerprint in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or trying e.g. all bit-vectors that are only one or two bit-errors away from your fingerprint.
If you have access to the ACM digital library, you can read a description of this sort of approach in "The Shazam Music Recognition Service" at http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.
The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.
To answer your second question, depending on exactly what a position corresponds to, you might consider boiling down the numbers to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.
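As a rough illustration of the exact-match fingerprint lookup (not the Shazam algorithm itself): the Fingerprints table, its columns and the probe value below are assumptions, not part of the question's schema.
-- Hypothetical store: one or more 64-bit fingerprints per product.
-- CREATE TABLE [dbo].[Fingerprints] (ProductId INT, Fingerprint BIGINT);
DECLARE @Probe BIGINT = 123456789;  -- fingerprint extracted from the measured snippet (placeholder)

-- Exact match:
SELECT [ProductId]
FROM [dbo].[Fingerprints]
WHERE [Fingerprint] = @Probe;

-- Tolerate single-bit errors by also probing every 1-bit-flipped variant
-- (bits 0..62; the sign bit is skipped for simplicity):
;WITH [Bits] AS
(
    SELECT TOP (63) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS [BitPos]
    FROM sys.all_objects
)
SELECT DISTINCT f.[ProductId]
FROM [Bits] b
INNER JOIN [dbo].[Fingerprints] f
    ON f.[Fingerprint] = @Probe ^ POWER(CAST(2 AS BIGINT), b.[BitPos]);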

You could take the approach of matching the measurements against each product's segments position by position, calculating the difference at each position. Then slide the measurements along by one position and calculate the differences again. Finally, find which slide position scored the lowest total difference. Do this for every product and you know which product the measurements most closely match.
Test tables and data:
CREATE TABLE [dbo].[Segment]
(
    [ProductId] INT,
    [Position] INT,
    [Value] INT
)
INSERT [dbo].[Segment]
VALUES (1, 1, 300),
(1, 2, 5000),
(1, 3, 900),
(1, 4, 400),
(1, 5, 800),
(2, 1, 400),
(2, 2, 6000),
(2, 3, 1000),
(2, 4, 500),
(2, 5, 900),
(3, 1, 400),
(3, 2, 5400),
(3, 3, 900),
(3, 4, 400),
(3, 5, 900)
CREATE TABLE #Measurement
(
    [Position] INT,
    [Value] INT
)
INSERT #Measurement
VALUES (1, 5400),
(2, 900),
(3, 400)
As you can see, the measurements match (a subset of) the third product exactly.
Some helpers:
CREATE TABLE #ProductSegmentCount
(
    [ProductId] INT,
    [SegmentCount] INT
)
INSERT #ProductSegmentCount
SELECT [ProductId], MAX([Position])
FROM [dbo].[Segment]
GROUP BY [ProductId]
DECLARE @MeasurementSegmentCount INT = (SELECT MAX([Position]) FROM #Measurement)
A recursive common table expression to show the products ordered by closest match:
;WITH [cteRecursive] AS
(
    SELECT s.[ProductId],
           0 AS [RecursionId],
           m.[Position] AS [MeasurementPosition],
           s.[Position] AS [SegmentPosition],
           ABS(m.[Value] - s.[Value]) AS [Difference]
    FROM #Measurement m
    INNER JOIN [dbo].[Segment] s
        ON m.[Position] = s.[Position]
    UNION ALL
    SELECT s.[ProductId],
           [RecursionId] + 1 AS [RecursionId],
           m.[Position],
           s.[Position],
           ABS(m.[Value] - s.[Value]) AS [Difference]
    FROM [cteRecursive] r
    INNER JOIN #Measurement m
        ON m.[Position] = r.[MeasurementPosition]
    INNER JOIN [dbo].[Segment] s
        ON r.[ProductId] = s.[ProductId]
        AND m.[Position] + (r.[RecursionId]) = s.[Position]
    INNER JOIN #ProductSegmentCount psc
        ON s.[ProductId] = psc.[ProductId]
    WHERE [RecursionId] <= ABS(@MeasurementSegmentCount - psc.[SegmentCount])
)-- select * from [cteRecursive] where [ProductId] = 3 order by RecursionId, SegmentPosition
, [cteDifferences] AS
(
    SELECT [ProductId], [RecursionId], SUM([Difference]) AS [Difference]
    FROM [cteRecursive]
    GROUP BY [ProductId], [RecursionId]
)-- select * from [cteDifferences]
SELECT [ProductId], MIN([Difference]) AS [Difference]
FROM [cteDifferences]
GROUP BY [ProductId]
ORDER BY MIN([Difference])
OPTION (MAXRECURSION 0)

Related

Parsing Multiple Snowflake Objects with consistent keys to rows

First post, hope I don't do anything too crazy
I want to go from JSON/object to long in terms of formatting.
I have a table set up as follows (note: there will be a large but finite number of 50+ activity columns, 2 is a minimal working example). I'm not concerned about the formatting of the date column - different problem.
customer_id (varchar), activity_count (object, int), activity_duration (object, numeric)
sample starting point
In this case I'd like to explode this into the following:
customer_id (varchar), time_period, activity_count (int), activity_duration (numeric)
sample end point - long
minimum data set
WITH smpl AS (
    SELECT
        '12a' AS id,
        OBJECT_CONSTRUCT(
            'd1910', 0,
            'd1911', 26,
            'd1912', 6,
            'd2001', 73) AS activity_count,
        OBJECT_CONSTRUCT(
            'd1910', 0,
            'd1911', 260.1,
            'd1912', 30,
            'd2001', 712.3) AS activity_duration
    UNION ALL
    SELECT
        '13b' AS id,
        OBJECT_CONSTRUCT(
            'd1910', 1,
            'd1911', 2,
            'd1912', 3,
            'd2001', 4) AS activity_count,
        OBJECT_CONSTRUCT(
            'd1910', 1,
            'd1911', 2.2,
            'd1912', 3.3,
            'd2001', 4.3) AS activity_duration
)
SELECT * FROM smpl
Extra credit for also taking this from JSON/object to wide (in Google BigQuery it would be SELECT id, activity_count.* FROM tbl).
Thanks in advance.
I've tried tons of random FLATTEN() based joins. In this instance I probably just need one working example.
This needs to scale to a moderate but finite number of objects (e.g. 50)
I'll also see if I can combine this with Lateral flatten two columns without repetition in snowflake.
Using FLATTEN:
WITH (...)
SELECT s1.ID, s1.KEY, s1.value AS activity_count, s2.value AS activity_duration
FROM (select ID, Key, VALUE from smpl,table(flatten(input=>activity_count))) AS s1
JOIN (select ID, Key, VALUE from smpl,table(flatten(input=>activity_duration))) AS s2
ON S1.ID = S2.ID AND S1.KEY = S2.KEY;
Output:
@Lukasz Szozda gets close, but the answer doesn't scale well with multiple variables (it's essentially a bunch of Cartesian products, and I'd need a lot of ON conditions). I have a known constraint (each field is in a strict format), so it's easy to recycle the key.
After WAY WAY WAY too much messing with this (off-and-on searching for weeks) it finally clicked, and it's pretty easy.
SELECT
id, key, activity_count[key], activity_duration[key], activity_duration2[key]
FROM smpl, LATERAL flatten(input => activity_count);
You can also use things OTHER than key, such as index.
It's inspired by THIS link but I just didn't quite follow it.
https://stackoverflow.com/a/36804637/20994650
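For the question's "extra credit" wide format, one possible sketch against the same smpl data, pulling each key out of the OBJECT columns explicitly (only two of the keys are shown here; a real query would list all of them or generate the SQL):
-- Wide format: extract each key from the OBJECT columns and cast it.
SELECT
    id,
    activity_count['d1910']::int      AS count_d1910,
    activity_duration['d1910']::float AS duration_d1910,
    activity_count['d1911']::int      AS count_d1911,
    activity_duration['d1911']::float AS duration_d1911
FROM smpl;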

Create a "products you may be interested in" algorithm in SQL?

I have a problem which I'm not sure how to approach.
I have a simple database where I store products, users, and purchases of products by users.
Each product has a name, a category and a price.
My goal is the following:
I want to display a list of 5 items that are suggested as "You might be interested in" to the user. The main problem is that I don't just want to search LIKE %..% on the name; I also want to take into account the types of products the user usually buys, the price range he usually buys at, and give priority to products that are bought more often.
Is such an algorithm realistic? I can think of some metrics, like grouping all categories into semantically "similar" buckets and calculating distance from that, but I'm not sure how I should rank them when there are multiple criteria.
Maybe I should give each criterion an importance factor and have the result be the distance multiplied by the factor?
What you could do is create two additional fields for each product in your database. In the first field, called Type for example, you could put "RC", and in the second field, called Similar, you could put "RC, Radio, Electronics, Remote, Model". Then, in your SQL query later on, you can select products whose Type matches up with another product's Similar list. This provides a system that doesn't just rely on the product name, as names can be deceiving. It would still use the LIKE operator, but it would be far more accurate because you pre-define which other products are similar to this one.
Depending on the size of your database already, I believe this to be the simplest option.
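A minimal sketch of that idea; the products/purchases table names, the Type/Similar columns and the user id are all assumptions about the poster's schema:
-- Suggest products whose Similar list mentions the Type of something the user already bought.
SELECT DISTINCT p.name
FROM purchases pu
INNER JOIN products owned ON owned.id = pu.product_id
INNER JOIN products p     ON p.Similar LIKE CONCAT('%', owned.Type, '%')
WHERE pu.user_id = 42      -- the user we are recommending for
  AND p.id <> owned.id
LIMIT 5;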
I was using this on MySQL for a weighted search:
SELECT *,
IF(
`libelle` LIKE :startSearch, 30,
IF(`libelle` LIKE :fullSearch, 20, 0)
)
+ IF(
`description` LIKE :startSearch, 10,
IF(`description` LIKE :fullSearch, 5, 0)
)
+ IF(
`keyword` LIKE :fullSearch, 1, 0
)
AS `weight`
FROM `t`
WHERE (
-- at least 1 match
`libelle` LIKE :fullSearch
OR `description` LIKE :fullSearch
OR `keyword` LIKE :fullSearch
)
ORDER BY
`weight` DESC
/*
'fullSearch'=>'%'.str_replace(' ', '_', trim($search)).'%',
'startSearch'=>str_replace(' ', '_', trim($search)).'%',
*/

Expand-collapse report with data set based on GROUPING SETS

I've used the Expand/Collapse feature in SSRS reports before, but in all those cases it was Reporting Services that was doing the grouping and totalling. This time around I utilize GROUPING SETS in my dataset query to let SQL Server handle aggregating the data. I want to create a report that has Expand/Collapse features for the groups, but can't seem to get it to work.
Repro
First up, here's a way to get a small repro simulating my actual situation. Use the following query for a dataset:
-- Simulating with already denormalized data for sake of simplicity
DECLARE @Order TABLE (Category VARCHAR(20), Product VARCHAR(20), PersonId INT);
INSERT INTO @Order
(Category, Product, PersonId)
VALUES ('Fruit', 'Banana', 1)
,('Fruit', 'Banana', 1)
,('Cakes', 'Chocolate', 1)
,('Fruit', 'Apple', 2)
,('Cakes', 'Chocolate', 2)
,('Cakes', 'Berry Jam', 3)
,('Cakes', 'Chocolate', 3)
,('Cakes', 'Chocolate', 3)
,('Fruit', 'Banana', 4)
,('Cakes', 'Berry Jam', 5)
SELECT Category,
Product,
COUNT(DISTINCT PersonId) AS NrOfBuyers
FROM @Order AS o
GROUP BY GROUPING SETS ((), (Category), (Category, Product))
This will provide this output (I've manually ordered the output to illustrate my intentions):
Category  Product    NrOfBuyers
--------  ---------  ----------
Fruit     Apple               1
Fruit     Banana              2
Fruit     NULL                3
Cakes     Berry Jam           2
Cakes     Chocolate           3
Cakes     NULL                4
NULL      NULL                5
To foreshadow what I'm aiming for, here's what I want to get in Excel.
Expanded version of intended result:
Collapsed version of intended result:
What I've tried so far:
While writing this question and creating the repro I did realize that my first approach of just dumping my dataset in a tablix was wrong.
So what I tried to fix this was recreating the tablix with proper Row Groups like so:
In addition to that I need a column on the left hand side outside the main group to hold the toggle "+" for the grand total row.
However, this gives incorrect numbers for the collapsed version:
These should be different: Fruit and Cakes should show subtotals of 3 and 4, respectively.
This seems like a problem with ordering the rows, so I've checked the sorting for the Tablix, and that should order rows as they appear in the "intended result" screenshots. It doesn't, and after a bit I understood why: the groups do sorting as well. So I've added sorting for the groups too, e.g. this is the one for the Product Row Group:
This seems to improve things (it does the sorting bit I needed anyways) but it doesn't fix having the wrong numbers in collapsed state.
What do I need to do to finish this last stretch and complete the report?
The approach can work, but one last step is needed to get the correct numbers for the collapsed state. Note that with the example from the question, this design:
Shows the following expression for this cell:
=Fields!NrOfBuyers.Value
But this sneakily seems to come down to this:
=First(Fields!NrOfBuyers.Value)
When it is evaluated in the context of a collapsed row.
So, one way to "fix" this and get the correct sub totals is to change that expression to:
=Last(Fields!NrOfBuyers.Value)
Which will give the desired output in collapsed state:
Or semi-collapsed:
And finally, expanded:

What's the most efficient way to store sets in a database?

I want to store sets in a such a way that I can query for sets that are a superset of, subset of, or intersect with another set.
For example, if my database has the sets { 1, 2, 3 }, { 2, 3, 5 }, { 5, 10, 12} and I query it for:
Sets which are supersets of { 2, 3 } it should give me { 1, 2, 3 }, { 2, 3, 5 }
Sets which are subsets of { 1, 2, 3, 4 } it should give me { 1, 2, 3 }
Sets which intersect with { 1, 10, 20 } it should give me { 1, 2, 3 }, { 5, 10, 12}
Since some sets are unknown in advance (your comment suggests they come from the client as search criteria), you cannot "precook" the set relationships into the database. Even if you could, that would introduce redundancy and therefore an opportunity for inconsistencies.
Instead, I'd do something like this:
CREATE TABLE "SET" (
ELEMENT INT, -- Or whatever the element type is.
SET_ID INT,
PRIMARY KEY (ELEMENT, SET_ID)
)
Additional suggestions:
Note how the ELEMENT field is at the primary key's leading edge. This should serve the queries below better than PRIMARY KEY (SET_ID, ELEMENT) would. You can still add the latter if desired, but if you don't, then you should also...
Cluster the table (if your DBMS supports it), which means that the whole table is just a single B-Tree (and no table heap). That way, you maximize the performance of queries below, and minimize storage requirements (and cache effectiveness).
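To try the queries below against the sets from the question ({1, 2, 3}, {2, 3, 5}, {5, 10, 12}), the table could be loaded like this, provided your DBMS supports multi-row VALUES (the numeric SET_IDs are arbitrary):
INSERT INTO "SET" (ELEMENT, SET_ID)
VALUES (1, 1), (2, 1), (3, 1),
       (2, 2), (3, 2), (5, 2),
       (5, 3), (10, 3), (12, 3);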
You can then find IDs of sets that are equal to or supersets of (for example) set {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID
HAVING COUNT(*) = 2;
And sets that intersect {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID;
And sets that are equal to or are subsets of {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE SET_ID NOT IN (
SELECT SET_ID
FROM "SET" S2
WHERE S2.ELEMENT NOT IN (2, 3)
)
GROUP BY SET_ID;
"Efficient" can mean a lot of things, but the normalized way would be to have an Items table with all the possible elements and a Sets table with all the sets, and an ItemsSets lookup table. If you have sets A and B in your Sets table, queries like (doing this for clarity rather than optimization... also "Set" is a bad name for a table or field, given it is a keyword)
SELECT itemname FROM Items i
WHERE i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'A')
AND i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'B')
That, for instance, is the intersection of A and B (you can almost certainly speed this up as a JOIN; again, "efficient" can mean a lot of things, and you'll want an architecture that allows a query like that). Similar queries can be made to find out the difference, complement, test for equality, etc.
Now, I know you asked about efficiency, and this is a horribly slow way to query, but this is the only reliably scalable architecture for the tables to do this, and the query was just an easy one to show how the tables are built. You can do all sorts of crazy things to, say, cache intersections, or store multiple items that are in a set in one field and process that, or what have you. But don't. Cached info will eventually get stale; static limits on the number of items in the field size will be surpassed; ad-hoc members of new tuples will be misinterpreted.
Again, "efficient" can mean a lot of different things, but ultimately an information architecture you as a programmer can understand and reason about is going to be the most efficient.

How to capture data values to the power of ^ (...) in SQL Server?

Perhaps someone with more experience in SQL Server can be of assistance. I am in the middle of putting together the LookUp tables for a new project. For 2 different tests that a user can perform (Bacteria/Fungi) the results are currently recorded on paper as the following:
BACTERIA:
Bacteria cfu / ml
<100
10^2
10^3
10^4
10^5
10^6
10^7
FUNGI:
Fungi (yeast & mold) cfu /ml
<100
10^2
10^3
10^4
10^5
What would be the best way to capture these values in SQL Server 2008 R2? In particular, Data Type and Size?
Something like this would probably be good enough:
CREATE TABLE AmountLookup (
    UnitsLimitExp int NULL,
    Name nvarchar(10) NULL
)
INSERT INTO AmountLookup
SELECT 2, '<100'
UNION ALL SELECT 3, '10^3'
UNION ALL SELECT 4, '10^4'
UNION ALL SELECT 5, '10^5'
UNION ALL SELECT 6, '10^6'
UNION ALL SELECT 7, '10^7'
This way you store the exponent, not the amount; the actual amount is just a GUI representation. Another issue is your lookup name, which is ugly here (10^3). However, you can store HTML code and render it as raw HTML in your user interface, e.g. 10^4 becomes 10<sup>4</sup>.
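For example, the stored exponent can be turned back into a number or a display string at query time (a sketch against the AmountLookup table above; note that the '<100' row stores exponent 2, so its computed value is really an upper bound):
SELECT Name,
       UnitsLimitExp,
       POWER(CAST(10 AS FLOAT), UnitsLimitExp) AS ApproxCfuPerMl,
       '10^' + CAST(UnitsLimitExp AS VARCHAR(2)) AS DisplayText
FROM AmountLookup;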
If these are values and not ranges, INT will work. If you start to deal with values greater than 2 billion, BIGINT should be used. If you need decimal digits, you can use the DECIMAL (NUMERIC) type.
Your other option, since these are discrete values, is to use a lookup table for the values, whose surrogate key you can reference from the tables that hold the data. That way you can represent a concept such as "<100".
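A minimal sketch of that lookup-table approach; all names here are made up:
CREATE TABLE ResultLevel (
    ResultLevelId INT IDENTITY(1,1) PRIMARY KEY,
    DisplayText   NVARCHAR(10) NOT NULL       -- '<100', '10^2', '10^3', ...
);

CREATE TABLE TestResult (
    TestResultId  INT IDENTITY(1,1) PRIMARY KEY,
    TestType      NVARCHAR(20) NOT NULL,      -- 'Bacteria' or 'Fungi'
    ResultLevelId INT NOT NULL REFERENCES ResultLevel (ResultLevelId)
);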
I'd propose a varchar(50):
<100 "Clean enough"
10^2 "Food over due date"
10^3 "Mr Bean's armpits"
10^4 "Three day old carcass"
10^5 "Biochemical experiment"
10^6 "Maggot invasion"
10^7 "Bacteria overflow"