Pair Objective in SQL (Access) - sql

I have a question about returning a pair "objective" such that some demand has been met. The example can be depicted as an (n × 3) matrix, shown below.
The goal is to find the pair of products (grouped by ID) that has the least cost. An ID with only a single row is ignored in the analysis. If a pair appears more than once, its costs are summed across all occurrences.
ID Product Cost
1 a 2
1 b 3
2 c 4
3 d 5
3 b 6
4 a 6
4 b 5
4 d 4
5 c 3
6 a 2
That means (a,b) is a pair to consider in ID = 1, (d,b) is a pair in ID = 3, and (a,b) appears again in ID = 4, as does (b,d). ID = 4 also contains another pair, (a,d), which appears only once in the whole table. The order within a pair does not matter. I therefore sum the costs of the two (a,b) pairs and of the two (b,d) pairs, and compare them to see whether (a,b), (b,d) or (a,d) is cheapest. The cost of a single pair is the sum of the costs of its two products.
The goal is to return the pair of products that has the least cost. For the example, the results would be:
(a,b) = 5(ID = 1) + 11 (ID = 4) = 16
(b,d) = 11 (ID = 3) + 9 (ID = 4) = 20
(a,d) = 10 (ID = 4)
Solution: (a,d) is the optimal pair. Note that there are cheaper solutions if I consider single products instead of pairs, but that is not the objective. I am looking for the pair within a column that has the least cost.
I hope that my question is clear, and that someone can help me with my query. Many thanks in advance!
Best,
David

If I understand correctly, you want a self-join and aggregation. The self-join generates all pairs of products for each id. The aggregation calculates the sum of costs for them:
select top 1 t1.product, t2.product, sum(t1.cost + t2.cost)
from t as t1 inner join
t as t2
on t1.id = t2.id
where t1.product < t2.product
group by t1.product, t2.product
order by sum(t1.cost + t2.cost);
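For anyone who wants to try the query without Access, here is a minimal sketch that runs it against the sample data in an in-memory SQLite database from Python. The table name t comes from the query above; SQLite itself and the alias total_cost are assumptions on my part. Because SQLite has no TOP, the sketch orders all pairs and reads the first row instead.

```python
import sqlite3

# Sample rows from the question: (id, product, cost).
rows = [
    (1, "a", 2), (1, "b", 3), (2, "c", 4), (3, "d", 5), (3, "b", 6),
    (4, "a", 6), (4, "b", 5), (4, "d", 4), (5, "c", 3), (6, "a", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, product TEXT, cost INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Self-join on id builds every pair within an ID; the "<" keeps each
# unordered pair once. Summing per pair adds up repeated occurrences.
pair_costs = conn.execute("""
    SELECT t1.product, t2.product, SUM(t1.cost + t2.cost) AS total_cost
    FROM t AS t1
    INNER JOIN t AS t2 ON t1.id = t2.id
    WHERE t1.product < t2.product
    GROUP BY t1.product, t2.product
    ORDER BY total_cost
""").fetchall()

cheapest = pair_costs[0]  # stands in for "select top 1"
print(pair_costs)
print(cheapest)
```

On the sample data the three totals come out as 10 for (a,d), 16 for (a,b) and 20 for (b,d), which matches the hand calculation in the question.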

Related

Find 'Most Similar' Items in Table by Foreign Key

I have a child table with a number of charact/value pairs for a given 'material' (MaterialID). Any material can have a number of charact values and may have several of the same name (see id's 2,3).
The table has a large number of records (8+ million). What I'm trying to do is find the materials that are the most similar to a supplied material. That is, when I supply a MaterialID, I would like an ordered list of the most similar other materials (those with the most matching charact/value pairs).
I've done some research, but I may be missing some key terms or just not conceptualizing the problem correctly.
Any hints as to how to go about this would be very much appreciated.
ID MaterialID Charact Value
1 1 ROT_DIR CCW
2 1 SPECIAL_FEATURE CATALOG_CP
3 1 SPECIAL_FEATURE CHROME
4 1 SCHEDULE 80
5 2 BEARING_TYPE SB
6 2 SCHEDULE 80
7 3 ROT_DIR CCW
8 3 SPECIAL_FEATURE CATALOG_HSB
9 3 BEARING_TYPE SP
10 4 NDE_STYLE W_FAN
11 4 BEARING_TYPE SB
12 4 ROT_DIR CW*
You can do this with a self join:
select t.materialid, count(*) as nummatches
from t join
t tmat
on t.Charact = tmat.Charact and t.value = tmat.value
where tmat.materialid = @MaterialId
group by t.materialid
order by nummatches desc;
Notes:
You might want to exclude the specified material itself, by adding and t.MaterialId <> tmat.MaterialId to the where clause.
If you want all materials, including those with no matches, then make the join a left join and move the where condition to the on clause. In that case count tmat.MaterialId rather than *, so that materials without a match get 0.
If you want only one material with the most matches, use select top 1.
If you want all materials with the most matches when there are ties, use select top (1) with ties.
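To make the self-join concrete, here is a small sketch that runs it on the sample rows from the question, using SQLite from Python. The helper name most_similar is hypothetical, the ? placeholder stands in for the material parameter, and the extra sort on materialid is only there to make ties come out in a fixed order.

```python
import sqlite3

# Sample rows from the question: (id, materialid, charact, value).
rows = [
    (1, 1, "ROT_DIR", "CCW"), (2, 1, "SPECIAL_FEATURE", "CATALOG_CP"),
    (3, 1, "SPECIAL_FEATURE", "CHROME"), (4, 1, "SCHEDULE", "80"),
    (5, 2, "BEARING_TYPE", "SB"), (6, 2, "SCHEDULE", "80"),
    (7, 3, "ROT_DIR", "CCW"), (8, 3, "SPECIAL_FEATURE", "CATALOG_HSB"),
    (9, 3, "BEARING_TYPE", "SP"), (10, 4, "NDE_STYLE", "W_FAN"),
    (11, 4, "BEARING_TYPE", "SB"), (12, 4, "ROT_DIR", "CW*"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, materialid INTEGER, charact TEXT, value TEXT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)


def most_similar(conn, material_id):
    """Hypothetical helper: other materials ranked by shared charact/value pairs."""
    return conn.execute("""
        SELECT t.materialid, COUNT(*) AS nummatches
        FROM t
        JOIN t AS tmat
          ON t.charact = tmat.charact AND t.value = tmat.value
        WHERE tmat.materialid = ?
          AND t.materialid <> tmat.materialid   -- leave out the material itself
        GROUP BY t.materialid
        ORDER BY nummatches DESC, t.materialid
    """, (material_id,)).fetchall()


print(most_similar(conn, 1))
print(most_similar(conn, 2))
```

With 8+ million rows, an index covering (charact, value, materialid) would probably matter more than the exact shape of the query, but that is something to measure on the real table.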

Find the pairs that appear the most times in StoreID and each STOREID has only this pair SQL

I want to find the pairs of MOVIEID that appear more times in the STOREID.
Additionally, each STOREID should have only this pair as MOVIEIDs. My table has 2 columns: STOREID and MOVIEID.
For example:
STOREID | MOVIEID
--------|---------
1 | a
1 | b
1 | c
2 | a
2 | b
3 | a
3 | b
5 | a
5 | b
In this case the answer would be pair: (a,b) 3 times.
Thanks in advance!
As far as I understand, you want to consider only stores that sell exactly two movies. This makes it a lot simpler. First, group by store and keep only the stores with two movies. Generating the pairs would be tricky if a store had more than two movies; you would need windowing functions. With exactly two, however, you can get both movies with aggregate functions: one with MIN and the other with MAX. These functions also ensure that the same pair always has the same order, so the pair (a,b) will always be (a,b) and never (b,a).
SELECT COUNT(*), MOVIE_1, MOVIE_2
FROM (
SELECT MIN(MOVIEID) MOVIE_1
,MAX(MOVIEID) MOVIE_2
,STOREID
FROM STORE_MOVIES -- your table
GROUP BY STOREID
HAVING COUNT(*) = 2
) MOVIE_PAIRS
GROUP BY MOVIE_1, MOVIE_2
ORDER BY COUNT(*) DESC
FETCH FIRST ROW ONLY;
For HAVING COUNT(*) = 2 I assume MOVIEID together with STOREID is unique.
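Here is a quick sketch of the MIN/MAX approach on the sample data, run through SQLite from Python. The table name STORE_MOVIES is the placeholder from the query above, and LIMIT 1 is used because SQLite does not support FETCH FIRST ROW ONLY; both are assumptions for the sake of a runnable example.

```python
import sqlite3

# Sample rows from the question: (storeid, movieid).
rows = [
    (1, "a"), (1, "b"), (1, "c"),
    (2, "a"), (2, "b"),
    (3, "a"), (3, "b"),
    (5, "a"), (5, "b"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_movies (storeid INTEGER, movieid TEXT)")
conn.executemany("INSERT INTO store_movies VALUES (?, ?)", rows)

# Inner query: one row per store that has exactly two movies, with the
# pair normalised as (min, max). Outer query: count stores per pair.
top_pair = conn.execute("""
    SELECT movie_1, movie_2, COUNT(*) AS times_paired
    FROM (
        SELECT MIN(movieid) AS movie_1,
               MAX(movieid) AS movie_2,
               storeid
        FROM store_movies
        GROUP BY storeid
        HAVING COUNT(*) = 2
    ) AS movie_pairs
    GROUP BY movie_1, movie_2
    ORDER BY times_paired DESC
    LIMIT 1
""").fetchall()

print(top_pair)
```

Store 1 is dropped by the HAVING clause because it carries three movies, so the pair (a,b) is counted for stores 2, 3 and 5 only.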
Although the request does not really make sense to me, the design is not ours to question. I did this with a three-part self-join on your movies table. The first copy (m1) joins to the second (m2) on the same store, with the (m2) movie greater than the (m1) movie. This prevents counting (a,b) and (b,a) separately. Then I join (m2) to (m3) on the same store, looking for any movie other than those two. This is intentionally a LEFT JOIN, as not all stores have more than two movies; when there is no third movie, the values from (m3) are NULL (non-existent). So I keep only rows where m3.storeId IS NULL. Note that requiring only m3.movieId > m2.movieId would not be enough here: in a store with a, b and c, the pairs (a,c) and (b,c) have no greater third movie and would slip through. The JOIN between (m1) and (m2) requires the first and second to exist. Finally, the HAVING shows only those pairs that appear at multiple stores.
select
m1.movieID as Movie1,
m2.movieID as Movie2,
count(*) as TimesPaired
from
Movies m1
JOIN movies m2
on m1.storeId = m2.storeId
AND m1.movieId < m2.movieId
LEFT JOIN movies m3
on m2.storeId = m3.storeId
AND m3.movieId <> m1.movieId
AND m3.movieId <> m2.movieId
where
m3.storeId IS NULL
group by
m1.movieID,
m2.movieID
having
count(*) > 1
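A variation on the same anti-join idea, written with NOT EXISTS so that a store contributes a pair only when it has no third movie at all. This is a sketch run in SQLite from Python; the table name movies follows the query above, and the extra store 6 with movies (c,d) is made up to show the HAVING filter at work.

```python
import sqlite3

# Rows from the question plus a hypothetical store 6 whose pair occurs once.
rows = [
    (1, "a"), (1, "b"), (1, "c"),
    (2, "a"), (2, "b"),
    (3, "a"), (3, "b"),
    (5, "a"), (5, "b"),
    (6, "c"), (6, "d"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (storeid INTEGER, movieid TEXT)")
conn.executemany("INSERT INTO movies VALUES (?, ?)", rows)

exclusive_pairs = conn.execute("""
    SELECT m1.movieid, m2.movieid, COUNT(*) AS times_paired
    FROM movies AS m1
    JOIN movies AS m2
      ON m1.storeid = m2.storeid AND m1.movieid < m2.movieid
    WHERE NOT EXISTS (          -- no other movie in the same store
        SELECT 1
        FROM movies AS m3
        WHERE m3.storeid = m1.storeid
          AND m3.movieid NOT IN (m1.movieid, m2.movieid)
    )
    GROUP BY m1.movieid, m2.movieid
    HAVING COUNT(*) > 1         -- pairs seen at more than one store
    ORDER BY times_paired DESC
""").fetchall()

print(exclusive_pairs)
```

Store 1 yields nothing because every pair in it has a third movie alongside, and (c,d) is filtered out because it occurs at a single store.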

SQL: Most efficient way to select sequences of rows from a table

I have a tagged textual corpus stored in an SQL table like the following:
id tag1 tag2 token sentence_id
0 a e five 1
1 b f score 1
2 c g years 1
3 d h ago 1
My task is to search the table for sequences of tokens that meet certain criteria, sometimes with gaps between each token.
For example:
I want to be able to search for a sequence similar to the following:
the token has the value a in the tag1 column, and
the second token is one to two rows away from the first, and has the value g in tag2 or b in tag1, and
the third token should be at least three rows away, and has ago in the token column.
In SQL, this would be something like the following:
SELECT * FROM my_table t1
JOIN my_table t2 ON t1.sentence_id = t2.sentence_id
JOIN my_table t3 ON t3.sentence_id = t1.sentence_id
WHERE t1.tag1 = 'a' AND (t2.id = t1.id + 1 OR t2.id = t1.id + 2)
AND (t2.tag2 = 'g' OR t2.tag1 = 'b')
AND t3.id >= t1.id + 3 AND t3.token = 'ago'
So far I have only been able to achieve this by joining the table by itself each time I specify a new token in the sequence (e.g. JOIN my_table t4), but with millions of rows this gets quite slow. Is there a more efficient way to do this?
You could try this staged approach:
Apply each condition (other than the various distance conditions) as a subquery.
Calculate the distances between the tokens which meet the conditions.
Apply all the distance conditions separately.
This might improve things, if you have indexes on the tag1, tag2 and token columns:
SELECT DISTINCT sentence_id FROM
(
-- 2. Here we calculate the distances
SELECT cond1.sentence_id,
(cond2.id - cond1.id) as cond2_distance,
(cond3.id - cond1.id) as cond3_distance
FROM
-- 1. These are all the non-distance conditions
(
SELECT * FROM my_table WHERE tag1 = 'a'
) cond1
INNER JOIN
(
SELECT * FROM my_table WHERE
(tag1 = 'b' OR tag2 = 'g')
) cond2
ON cond1.sentence_id = cond2.sentence_id
INNER JOIN
(
SELECT * FROM my_table WHERE token = 'ago'
) cond3
ON cond1.sentence_id = cond3.sentence_id
) conditions
-- 3. Now apply the distance conditions
WHERE cond2_distance BETWEEN 1 AND 2 -- "one to two rows away", so distance 0 is excluded
AND cond3_distance >= 3
ORDER BY sentence_id;
If you apply this query to this SQL fiddle you get:
| sentence_id |
|-------------|
| 1 |
| 4 |
Which is what you want. Now whether it's any faster or not, only you (with your million-row database) can really tell, but from the perspective of having to actually write these queries, you'll find they're much easier to read, understand and maintain.
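If you want to experiment with the staged query locally, here is a minimal sketch using SQLite from Python. Sentence 1 is the sample from the question; sentence 2 is made up so that the non-distance conditions all match but 'ago' sits only two rows after the first token. The distance for the second token is taken as 1 to 2, as stated in the question.

```python
import sqlite3

# (id, tag1, tag2, token, sentence_id); sentence 2 is hypothetical.
rows = [
    (0, "a", "e", "five", 1), (1, "b", "f", "score", 1),
    (2, "c", "g", "years", 1), (3, "d", "h", "ago", 1),
    (4, "a", "e", "not", 2), (5, "b", "f", "long", 2),
    (6, "c", "x", "ago", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table "
    "(id INTEGER, tag1 TEXT, tag2 TEXT, token TEXT, sentence_id INTEGER)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", rows)

matches = conn.execute("""
    SELECT DISTINCT sentence_id FROM (
        -- 2. distances between the candidate tokens
        SELECT cond1.sentence_id AS sentence_id,
               cond2.id - cond1.id AS cond2_distance,
               cond3.id - cond1.id AS cond3_distance
        -- 1. the non-distance conditions, one subquery each
        FROM (SELECT * FROM my_table WHERE tag1 = 'a') AS cond1
        INNER JOIN (SELECT * FROM my_table
                    WHERE tag1 = 'b' OR tag2 = 'g') AS cond2
            ON cond1.sentence_id = cond2.sentence_id
        INNER JOIN (SELECT * FROM my_table WHERE token = 'ago') AS cond3
            ON cond1.sentence_id = cond3.sentence_id
    ) AS conditions
    -- 3. the distance conditions
    WHERE cond2_distance BETWEEN 1 AND 2
      AND cond3_distance >= 3
    ORDER BY sentence_id
""").fetchall()

sentence_ids = [row[0] for row in matches]
print(sentence_ids)
```

Only sentence 1 survives: in the made-up sentence 2 the third token is two rows away, so it fails the cond3_distance >= 3 check.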
You need to edit your question and give more details on how these sequences of tokens work (for instance, what does "each time I specify a new token in the sequence" mean in practice?).
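In PostgreSQL you can solve this class of queries with a window function. Following your exact specification above:
SELECT *
FROM (
-- Compute the window column in a subquery: WHERE is applied before
-- window functions, so next_token cannot be filtered at the same level,
-- and filtering on tag1 here would hide the following rows from lead().
SELECT *,
CASE
WHEN lead(tag2, 2) OVER w = 'g' THEN lead(token, 2) OVER w
WHEN lead(tag1) OVER w = 'b' THEN lead(token) OVER w
ELSE NULL::text
END AS next_token
FROM my_table
WINDOW w AS (PARTITION BY sentence_id ORDER BY id)
) AS s
WHERE tag1 = 'a'
AND next_token IS NOT NULL;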
The lead() function looks ahead a number of rows (default is 1, when not specified) from the current row in the window frame, in this case all rows with the same sentence_id as specified in the partition of the window definition. So, lead(tag2, 2) looks at the value of tag2 two rows ahead to compare against your condition, and lead(token, 2) returns the token from two rows ahead, within the same sentence_id, as column next_token in the current row. If the first CASE condition fails, the second is evaluated; if that fails too, NULL is returned. Note that the order of the conditions in the CASE clause is significant: different ordering gives different results.
Obviously, if you keep on adding conditions for subsequent tokens the query becomes very complex and you may have to put individual search conditions in separate stored procedures and then call these depending on your requirements.
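As a rough illustration of how lead() behaves, the sketch below runs the same idea in SQLite from Python (SQLite has supported window functions since version 3.25, so this assumes a reasonably recent build). The window column is computed in a subquery so that it can be filtered afterwards. Sentence 1 is the sample from the question; sentence 2 is invented so that the second CASE branch fires.

```python
import sqlite3

# (id, tag1, tag2, token, sentence_id); sentence 2 is hypothetical.
rows = [
    (0, "a", "e", "five", 1), (1, "b", "f", "score", 1),
    (2, "c", "g", "years", 1), (3, "d", "h", "ago", 1),
    (4, "a", "e", "once", 2), (5, "b", "f", "upon", 2),
    (6, "c", "x", "a", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table "
    "(id INTEGER, tag1 TEXT, tag2 TEXT, token TEXT, sentence_id INTEGER)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", rows)

found = conn.execute("""
    SELECT id, token, next_token
    FROM (
        SELECT id, tag1, token,
               CASE
                   -- first choice: tag2 = 'g' two rows ahead
                   WHEN lead(tag2, 2) OVER w = 'g' THEN lead(token, 2) OVER w
                   -- fallback: tag1 = 'b' on the next row
                   WHEN lead(tag1) OVER w = 'b' THEN lead(token) OVER w
               END AS next_token
        FROM my_table
        WINDOW w AS (PARTITION BY sentence_id ORDER BY id)
    ) AS s
    WHERE tag1 = 'a' AND next_token IS NOT NULL
    ORDER BY id
""").fetchall()

print(found)
```

In sentence 1 both branches would match, and the first one wins, so next_token is 'years' rather than 'score'. In the invented sentence 2 only the second branch matches and next_token is 'upon'.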

Calculating relative frequencies in SQL

I am working on a tag recommendation system that takes metadata strings (e.g. text descriptions) of an object, and splits it into 1-, 2- and 3-grams.
The data for this system is kept in 3 tables:
The "object" table (e.g. what is being described),
The "token" table, filled with all 1-, 2- and 3-grams found (examples below), and
The "mapping" table, which maintains associations between (1) and (2), as well as a frequency count for these occurrences.
I am therefore able to construct a table via a LEFT JOIN, that looks somewhat like this:
SELECT mapping.object_id, mapping.token_id, mapping.freq, token.token_size, token.token
FROM mapping LEFT JOIN
token
ON (mapping.token_id = token.id)
WHERE mapping.object_id = 1;
object_id token_id freq token_size token
+-----------+----------+------+------------+--------------
1 1 1 2 'a big'
1 2 1 1 'a'
1 3 1 1 'big'
1 4 2 3 'a big slice'
1 5 1 1 'slice'
1 6 3 2 'big slice'
Now I'd like to be able to get the relative probability of each term within the context of a single object ID, so that I can sort them by probability and see which terms are most probable (e.g. ORDER BY rel_prob DESC LIMIT 25).
For each row, I'm envisioning the addition of a column which gives the result of freq/sum of all freqs for that given token_size. In the case of 'a big', for instance, that would be 1/(1+3) = 0.25. For 'a', that's 1/3 = 0.333, etc.
I can't, for the life of me, figure out how to do this. Any help is greatly appreciated!
If I understood your problem, here's the query you need
select
m.object_id, m.token_id, m.freq,
t.token_size, t.token,
cast(m.freq as decimal(29, 10)) / sum(m.freq) over (partition by t.token_size, m.object_id) as rel_prob
from mapping as m
left outer join token as t on m.token_id = t.id
where m.object_id = 1;
sql fiddle example
hope that helps
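To check the numbers from the question, here is a small sketch that runs the windowed sum in SQLite from Python (window functions need SQLite 3.25 or later). The table layouts are guessed from the sample output in the question, the column alias rel_prob comes from the question's ORDER BY, and CAST(... AS REAL) stands in for the decimal cast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token (id INTEGER, token_size INTEGER, token TEXT)")
conn.execute("CREATE TABLE mapping (object_id INTEGER, token_id INTEGER, freq INTEGER)")

# Values taken from the sample output in the question.
conn.executemany("INSERT INTO token VALUES (?, ?, ?)", [
    (1, 2, "a big"), (2, 1, "a"), (3, 1, "big"),
    (4, 3, "a big slice"), (5, 1, "slice"), (6, 2, "big slice"),
])
conn.executemany("INSERT INTO mapping VALUES (?, ?, ?)", [
    (1, 1, 1), (1, 2, 1), (1, 3, 1), (1, 4, 2), (1, 5, 1), (1, 6, 3),
])

# Each freq is divided by the total freq of its token_size group
# within the same object.
result = conn.execute("""
    SELECT t.token,
           CAST(m.freq AS REAL)
             / SUM(m.freq) OVER (PARTITION BY t.token_size, m.object_id)
             AS rel_prob
    FROM mapping AS m
    LEFT JOIN token AS t ON m.token_id = t.id
    WHERE m.object_id = 1
    ORDER BY rel_prob DESC, t.token
""").fetchall()

rel_prob = dict(result)
for token, prob in result:
    print(f"{token!r}: {prob:.3f}")
```

This reproduces the figures from the question: 0.25 for 'a big' and roughly 0.333 for 'a'.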

Finding contiguous regions in a sorted MS Access query

I am a long time fan of Stack Overflow but I've come across a problem that I haven't found addressed yet and need some expert help.
I have a query that is sorted chronologically with a date-time compound key (unique, never deleted) and several pieces of data. What I want to know is if there is a way to find the start (or end) of a region where a value changes? I.E.
DateTime someVal1 someVal2 someVal3 target
1 3 4 A
1 2 4 A
1 3 4 A
1 2 4 B
1 2 5 B
1 2 5 A
and I want my query to return rows 1, 4 and 6, that is, to find the change in column 5 (target) from A to B and then from B back to A. I have tried the find duplicates method and using Min and Max in the Totals row; however, that gives me the first and last rows overall instead of the local ones. Has anyone had a similar problem?
I didn't see any purpose for the someVal1, someVal2, and someVal3 fields, so I left them out. I used an autonumber as the primary key instead of your date/time field; but this approach should also work with your date/time primary key. This is the data in my version of your table.
pkey_field target
1 A
2 A
3 A
4 B
5 B
6 A
I used a correlated subquery to find the previous pkey_field value for each row.
SELECT
m.pkey_field,
m.target,
(SELECT Max(pkey_field)
FROM YourTable
WHERE pkey_field < m.pkey_field)
AS prev_pkey_field
FROM YourTable AS m;
Then put that in a subquery which I joined to another copy of the base table.
SELECT
sub.pkey_field,
sub.target,
sub.prev_pkey_field,
prev.target AS prev_target
FROM
(SELECT
m.pkey_field,
m.target,
(SELECT Max(pkey_field)
FROM YourTable
WHERE pkey_field < m.pkey_field)
AS prev_pkey_field
FROM YourTable AS m) AS sub
LEFT JOIN YourTable AS prev
ON sub.prev_pkey_field = prev.pkey_field
WHERE
sub.prev_pkey_field Is Null
OR prev.target <> sub.target;
This is the output from that final query.
pkey_field target prev_pkey_field prev_target
1 A
4 B 3 A
6 A 5 B
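The same two-step query can be tried outside Access. Below is a minimal sketch in SQLite via Python, using the six-row table shown above; the table name YourTable is the placeholder from the answer, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (pkey_field INTEGER PRIMARY KEY, target TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "B"), (6, "A")],
)

# The correlated subquery finds the previous key for each row; the join
# fetches that row's target; the WHERE keeps the first row and every
# row whose target differs from its predecessor.
region_starts = conn.execute("""
    SELECT sub.pkey_field, sub.target, sub.prev_pkey_field,
           prev.target AS prev_target
    FROM (
        SELECT m.pkey_field,
               m.target,
               (SELECT MAX(pkey_field)
                FROM YourTable
                WHERE pkey_field < m.pkey_field) AS prev_pkey_field
        FROM YourTable AS m
    ) AS sub
    LEFT JOIN YourTable AS prev
        ON sub.prev_pkey_field = prev.pkey_field
    WHERE sub.prev_pkey_field IS NULL
       OR prev.target <> sub.target
    ORDER BY sub.pkey_field
""").fetchall()

for row in region_starts:
    print(row)
```

The rows returned are 1, 4 and 6, the same region starts as in the output table above.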
Here is a first attempt:
SELECT t1.Row, t1.target
FROM t1
WHERE t1.target <> NZ((SELECT TOP 1 t2.target FROM t1 AS t2
WHERE t2.DateTimeId < t1.DateTimeId ORDER BY t2.DateTimeId DESC), "X");