How to map combinations of things to a relational database?

I have a table whose records represent certain objects. For the sake of simplicity I am going to assume that the table only has one column, and that is the unique ObjectId. Now I need a way to store combinations of objects from that table. The combinations have to be unique, but can be of arbitrary length. For example, if I have the ObjectIds
1,2,3,4
I want to store the following combinations:
{1,2}, {1,3,4}, {2,4}, {1,2,3,4}
Order does not matter. My current implementation uses a table Combinations that maps ObjectIds to CombinationIds, so every combination receives a unique Id:
ObjectId | CombinationId
------------------------
1 | 1
2 | 1
1 | 2
3 | 2
4 | 2
This is the mapping for the first two combinations of the example above. The problem is that the query for finding the CombinationId of a specific combination seems to be very complex. The two main usage scenarios for this table will be to iterate over all combinations and to retrieve a specific combination. The table will be created once and never updated. I am using SQLite through JDBC. Is there a simpler way or a best practice for implementing such a mapping?

The problem is that the query for finding the CombinationId of a specific combination seems to be very complex.
Shouldn't be too bad. If you want all combinations containing the selected items (with additional items allowed), it's just something like:
SELECT combinationID
FROM Combination
WHERE objectId IN (1, 3, 4)
GROUP BY combinationID
HAVING COUNT(*) = 3 -- The number of items in the combination
If you need only the specific combination (no extra items allowed), it can be more like:
SELECT combinationID FROM (
-- ... query from above goes here, this gives us all with those 3
) AS candidates
-- This bit gives us a row for each item in the candidates, including
-- the items we know about but also any 'extras'
INNER JOIN combination ON (candidates.combinationID = combination.combinationID)
GROUP BY candidates.combinationID
HAVING COUNT(*) = 3 -- Because we joined back on ALL, ones with extras will have > 3
You can also use a NOT EXISTS here (or in the original query); this version just seemed easier to explain.
Finally, you could also be fancy and use a single query:
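SELECT candidates.combinationID
FROM Combination AS candidates
INNER JOIN Combination AS allItems ON
(candidates.combinationID = allItems.combinationID)
WHERE candidates.objectId IN (1, 3, 4)
GROUP BY candidates.combinationID
HAVING COUNT(*) = 9 -- The number of items in the combination, squared
AND COUNT(DISTINCT candidates.objectId) = 3 -- All 3 items matched; rules out e.g. 1 match in a 9-item combination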
So in other words, if we're looking for {1, 2}, and there's a combination with {1, 2, 3}, we'll have a {candidates, allItems} JOIN result of:
{1, 1}, {1, 2}, {1, 3}, {2, 1}, {2, 2}, {2, 3}
The extra 3 results in COUNT(*) being 6 rows after GROUPing, not 4, so we know that's not the combination we're after.
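For the exact-match case, a single pass over the table may also be enough. Here is a minimal sketch in Python using the standard sqlite3 module (the question uses SQLite through JDBC, and the SQL itself carries over). The helper name find_combination_id and the composite primary key are assumptions, not part of the original schema; the query relies on each (objectId, combinationId) pair appearing only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Combination ("
    " objectId INTEGER, combinationId INTEGER,"
    " PRIMARY KEY (objectId, combinationId))"  # assumed: no duplicate pairs
)
# The four combinations from the question: {1,2}, {1,3,4}, {2,4}, {1,2,3,4}
conn.executemany(
    "INSERT INTO Combination VALUES (?, ?)",
    [(1, 1), (2, 1),
     (1, 2), (3, 2), (4, 2),
     (2, 3), (4, 3),
     (1, 4), (2, 4), (3, 4), (4, 4)],
)

def find_combination_id(conn, object_ids):
    """Return the id of the combination holding exactly object_ids, or None."""
    ids = sorted(set(object_ids))
    placeholders = ",".join("?" * len(ids))
    sql = f"""
        SELECT combinationId
        FROM Combination
        GROUP BY combinationId
        HAVING COUNT(*) = ?                          -- no extra members
           AND SUM(objectId IN ({placeholders})) = ? -- every wanted member present
    """
    row = conn.execute(sql, [len(ids), *ids, len(ids)]).fetchone()
    return row[0] if row else None

print(find_combination_id(conn, [1, 3, 4]))
```

This scans the whole table on every lookup, which is probably acceptable for a table that is built once and never updated; the join-based queries above can narrow the candidates through an index on objectId first.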

This may be heresy, but for your usage scenarios it might work better to use a denormalized structure where you store the combinations themselves as some kind of composite (text) value:
CombinationId | Combination
---------------------------
1 | |1|2|
2 | |1|3|4|
If you make the rule that you always sort the ObjectIds when generating the composite value, it's easy to retrieve the Combination for a given set of Objects.
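The "always sort" rule can be sketched as a small helper that builds the composite value. The function name combination_key is hypothetical, and the format simply mirrors the |1|3|4| layout shown above:

```python
def combination_key(object_ids):
    """Build the canonical text key for a set of ObjectIds.

    Sorting (and de-duplicating) first means the same set always yields
    the same string, regardless of the order the ids were supplied in.
    """
    ids = sorted(set(object_ids))
    return "|" + "|".join(str(i) for i in ids) + "|"

print(combination_key([4, 1, 3]))
```

With a unique index on the Combination column, retrieving a specific combination becomes a plain equality lookup on this key.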

Another option would be to use relation-valued attributes, which in SQL DBMSs are called multisets or nested tables.
Relation-valued attributes may make sense if there is no identifier for the set of objects other than the set itself. However, I don't think any SQL DBMS permits keys to be declared on columns of that type so that could be a problem if you don't have some alternative key you can use.
http://download.oracle.com/docs/cd/B10500_01/appdev.920/a96594/adobjbas.htm#458790

Related

Unlimited join to the same table until exists

I don't even know how to properly name this question. I know how to join the same table, but this situation is slightly different. I'll try to make everything as simple as possible. Here are 2 tables:
ingredients (contains ingredient IDs and names)
ingredient_id|ingredient_name
1 |Water
2 |Salt
3 |Fancy Sauce
4 |Spices
5 |Pepper
6 |Chili
ingredients_to_ingredients (contains optional sub-ingredients for each ingredient)
ingredient_id|mapped_ingredient_id
3 |1
3 |2
3 |4
4 |5
4 |6
I need to get all the sub-ingredients (if any) for specified ingredient. So if I want to get sub-ingredients of #3 (Fancy Sauce), I use:
SELECT * FROM ingredients AS main
LEFT JOIN ingredients_to_ingredients AS sub
ON main.ingredient_id=sub.ingredient_id
LEFT JOIN ingredients
ON sub.mapped_ingredient_id=ingredients.ingredient_id
WHERE main.ingredient_id=3
And get the list:
Water
Salt
Spices
Easy. But as you see, Spices contains other sub-ingredients (Pepper and Chili), and I need to list them as well.
Some will say I can add 2nd sub-query, but there's a catch: there's no fixed number. In other words, one ingredient might not have sub-ingredients at all, while another might have lots of sub-sub-sub-ingredients (which would require dozens of sub-queries).
How do I write a query which keeps selecting mapped sub-ingredients as long as any ingredient has sub-ingredients? Right now I just use lots of repeated sub-queries (dozens of them, just to be sure every possible sub-sub-sub-ingredient is included), but the query looks ugly and I believe it's not the best way to do it.
Any suggestions how to modify the query above?
P.S. I'm sure someone will say that mapping sub-sub-sub-ingredients to ingredients in the same table is a bad design. Well, I can't tell the restaurant's chef "don't use sub-ingredients in your ingredients because it's bad design from a database point of view". I hope you get the idea why the design is as it is and can't be changed.
WITH RECURSIVE
cte AS ( SELECT *
FROM ingredients_to_ingredients
WHERE ingredient_id = 3
UNION ALL
SELECT ingredients_to_ingredients.*
FROM ingredients_to_ingredients
JOIN cte ON cte.mapped_ingredient_id = ingredients_to_ingredients.ingredient_id )
SELECT ingredients.ingredient_name
FROM cte
JOIN ingredients ON cte.mapped_ingredient_id = ingredients.ingredient_id;
https://dbfiddle.uk/?rdbms=mariadb_10.4&fiddle=ccfbd2838775fc0f69775ef86ab29093
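To try the recursive CTE locally, here is a minimal sketch using Python's standard sqlite3 module (SQLite also supports WITH RECURSIVE). The helper name sub_ingredients is hypothetical. One deliberate difference from the query above: it uses UNION instead of UNION ALL, so that the recursion still terminates if the mapping ever contains a cycle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ingredients (
        ingredient_id INTEGER PRIMARY KEY,
        ingredient_name TEXT
    );
    CREATE TABLE ingredients_to_ingredients (
        ingredient_id INTEGER,
        mapped_ingredient_id INTEGER
    );
    INSERT INTO ingredients VALUES
        (1, 'Water'), (2, 'Salt'), (3, 'Fancy Sauce'),
        (4, 'Spices'), (5, 'Pepper'), (6, 'Chili');
    INSERT INTO ingredients_to_ingredients VALUES
        (3, 1), (3, 2), (3, 4), (4, 5), (4, 6);
""")

def sub_ingredients(conn, ingredient_id):
    """Return the names of all direct and indirect sub-ingredients, sorted."""
    sql = """
        WITH RECURSIVE cte AS (
            SELECT ingredient_id, mapped_ingredient_id
            FROM ingredients_to_ingredients
            WHERE ingredient_id = ?
            UNION  -- de-duplicates rows, so cyclic data cannot loop forever
            SELECT i2i.ingredient_id, i2i.mapped_ingredient_id
            FROM ingredients_to_ingredients AS i2i
            JOIN cte ON cte.mapped_ingredient_id = i2i.ingredient_id
        )
        SELECT DISTINCT ingredients.ingredient_name
        FROM cte
        JOIN ingredients ON cte.mapped_ingredient_id = ingredients.ingredient_id
    """
    return sorted(name for (name,) in conn.execute(sql, (ingredient_id,)))

print(sub_ingredients(conn, 3))
```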

can you put a tablename in a column? and should you?

I would like to create a highly scalable system for storing "candidates". The problem is that each candidate has different "features", and sometimes these have different data types. One idea I'd like to try would involve something like this:
candidates:
| id | cType |
1 'fabric'
2 'belt'
candidateFeatures:
| candidateId | featureTable | featureId
1 'city' 1
1 'colour' 1
1 'colour' 2
2 'city' 2
2 'size' 1
city:
|id | lat | lng | name |
1 x x 'London'
2 x x 'Paris'
colour:
|id | name |
1 'Red'
2 'Green'
size:
|id | value |
1 10
2 12
Here you can see that there is one fabric candidate in London with Red and Green features and a belt candidate in Paris with size 10.
We do this because we get feedback in a universal way, and I'm trying to write a scalable machine learning solution that will allow new types of candidates to be added seamlessly, as well as new candidate feature types, as they are discovered and added to the db. A candidate is assumed to be able to have more than one of each feature type.
Ultimately I need to be able to extract the data (probably through a materialised view) so that if I want all 'fabric' candidates I would end up with something like:
'id' | colourIds | cityIds |
1 [1, 2] [1]
4 [3] [4, 5]
but then if one day I find a fabric that doesn't have a colour but instead has a pattern I can easily get a new table for patterns and just add the features to my "candidateFeatures" table:
'id' | colourIds | cityIds | patternIds
1 [1, 2] [1] null
4 [3] [4, 5] null
14 null [6] [1]
This format is suitable for the front end, and the format of "candidateFeatures" is very useful for the backend. We can use it to scale easily without modifying existing tables, and for scalable data analysis, specifically when looking for correlations between user responses to candidates and the presence of categorical features or the values of continuous features.
To me this seems like a really clever idea that hasn't got proper support in sql… which makes me think it's probably a really dumb idea in disguise. I think it's possible to do this using EXEC, but that does have some risks. Does anyone know of a smarter way to achieve the same result? or actually how to achieve this?
Since execution time isn't such a big concern I can always run it through a third party program e.g. in python and put the results into new tables. But ideally I'd use a bunch of materialised views and have them update periodically because that feels like it would scale better with more data.
This is too long for a comment.
It is neither a good idea nor an awful idea. It is simply not how SQL works. The problem is that queries have a well-defined set of tables and column references. This is quite important for optimizing the query -- a step that generally happens before the query is run.
Queries are not merely strings that permit dynamic substitution when they are processing data.
There are ways to address the data modeling:
Have separate tables for the features and association tables to match them back to the original data.
Use an entity-attribute-value model, which basically stores key-value pairs.
Use a flexible storage mechanism, such as JSON or arrays.
In addition, Postgres supports something called inheritance, which might be useful for representing this type of data.
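As an illustration of the first option (separate association tables), here is a rough sketch in Python with sqlite3. The table names candidate_colour and candidate_city, the FEATURE_TABLES mapping and the function features_by_candidate are all made up for the example. The point is that the set of tables is fixed in application code rather than read from a column, so each query still refers to a known table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidates (id INTEGER PRIMARY KEY, cType TEXT);
    CREATE TABLE candidate_colour (candidateId INTEGER, colourId INTEGER);
    CREATE TABLE candidate_city (candidateId INTEGER, cityId INTEGER);
    INSERT INTO candidates VALUES (1, 'fabric'), (2, 'belt');
    INSERT INTO candidate_colour VALUES (1, 1), (1, 2);
    INSERT INTO candidate_city VALUES (1, 1), (2, 2);
""")

# Whitelist of association tables: output label -> (table, feature id column).
# Supporting a new feature type means adding a table and one entry here.
FEATURE_TABLES = {
    "colourIds": ("candidate_colour", "colourId"),
    "cityIds": ("candidate_city", "cityId"),
}

def features_by_candidate(conn, c_type):
    """Return {candidate id: {label: [feature ids] or None}} for one cType."""
    ids = [cid for (cid,) in conn.execute(
        "SELECT id FROM candidates WHERE cType = ? ORDER BY id", (c_type,))]
    result = {cid: {label: None for label in FEATURE_TABLES} for cid in ids}
    for label, (table, column) in FEATURE_TABLES.items():
        # Names come from the fixed whitelist above, never from table data.
        sql = (f"SELECT a.candidateId, a.{column} FROM {table} AS a "
               f"JOIN candidates AS c ON c.id = a.candidateId "
               f"WHERE c.cType = ? ORDER BY a.{column}")
        for cid, feature_id in conn.execute(sql, (c_type,)):
            if result[cid][label] is None:
                result[cid][label] = []
            result[cid][label].append(feature_id)
    return result

print(features_by_candidate(conn, "fabric"))
```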

SQL Server Multiple Likes

I have an unusual question that seems simple but has me stumped in a SQL Server stored procedure.
I have 2 tables as described below.
tblMaster
ID, CommitDate, SubUser, OrigFileName
Sample data
ID CommitDate SubUser OrigFileName
----------------------------------------
1 2014-10-07 Test1 Test1.pdf
2 2014-10-08 Test2 Test2.pdf
3 2014-10-09 Test3 Test3.pdf
The above table is basically the header table that tracks the committed files. In addition to this, we have a details table with the following structure.
tblIndex
ID, FileID (Linking column to the header row described above), Word
Sample data:
1. 1, 1, Oil
2. 2, 1, oil
3. 3, 2, oil
4. 4, 2, tank
5. 5, 3, tank
The above rows represent the words that we want to search on and if a certain criteria matches return the corresponding filename/header row ID. What I would love to figure out to do is if I do a search for
One word (i.e. "oil"), then the system should respond with all the files that meet the criteria (easiest case and figured out)
If more than one word is searched for (i.e. "oil" and "tank"), then we should only see the second file since it is the only one that has both oil and tank as its key words.
Tried using LIKE "%oil%" AND LIKE "%tank%", and that returned no rows, since one value can't be both oil and tank.
Tried doing a LIKE "%oil%" OR LIKE "%tank%" but I get files 1, 2, and 3 since the OR is inclusive of all the other rows.
One last thing, I recognize I could just do a search for the first term and then save the results into a temp table and then search for the second term in that second table and I will get what I am looking for. The problem with that is that I don't exactly know how many items will be searched for. I don't want to have to create a structure where I am constantly having to store data into another temp table if someone does a search for 6 "keywords".
Any help/ideas will be much appreciated.
Try this; it differs slightly from the previous answer:
SELECT distinct FileID,COUNT(distinct t.word) FROM tblIndex t
WHERE t.Word LIKE '%oil%' OR t.Word LIKE '%tank%'
GROUP BY FileID
HAVING COUNT(distinct t.word) > 1
One simple option would be something like this:
SELECT FileID
FROM tblIndex t
WHERE t.Word LIKE '%oil%' OR t.Word LIKE '%tank%'
GROUP BY FileID
HAVING COUNT(*) > 1
This assumes you do not have duplicates in tblIndex.
I'm also unsure whether you really need LIKE at all. According to your sample data you don't; a plain equality comparison would be far more efficient and would avoid accidental matches.
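For an arbitrary number of keywords, one possible sketch is to count the distinct search terms matched per file and compare that to the number of terms. The example below uses Python with SQLite so it can be run as-is; the question is about SQL Server, where the same GROUP BY / HAVING shape applies and LOWER() may be unnecessary under a case-insensitive collation. The function name files_with_all_words is hypothetical, and it uses equality rather than LIKE, as suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblIndex (ID INTEGER PRIMARY KEY, FileID INTEGER, Word TEXT);
    INSERT INTO tblIndex VALUES
        (1, 1, 'Oil'), (2, 1, 'oil'), (3, 2, 'oil'), (4, 2, 'tank'), (5, 3, 'tank');
""")

def files_with_all_words(conn, words):
    """Return the FileIDs whose index rows cover every search word."""
    terms = sorted({w.lower() for w in words})
    placeholders = ",".join("?" * len(terms))
    sql = f"""
        SELECT FileID
        FROM tblIndex
        WHERE LOWER(Word) IN ({placeholders})
        GROUP BY FileID
        HAVING COUNT(DISTINCT LOWER(Word)) = ?  -- one hit per distinct term
        ORDER BY FileID
    """
    return [file_id for (file_id,) in conn.execute(sql, [*terms, len(terms)])]

print(files_with_all_words(conn, ["oil", "tank"]))
```

Because it counts distinct words, the duplicate "Oil" / "oil" rows for file 1 in the sample data do not make that file look like a two-keyword match.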

Database scheme for searching my age groups

I've struggled with this for a while now trying to figure out how to do this most efficiently.
The problem is as follows. I have items in a database to be marketed for specific age groups, such as ages 10 to 20 or ages 16+, and I need to be able to make a query like "find items that are suitable for a 17-year-old".
Here are my two best ideas (but I don't like either, as I think they're both inefficient).
Have a csv column with values like 10-20 and 16+ , retrieve the entire list, and parse through it (Bad idea, I know, I'm fresh out of ideas here though)
Have a csv column with values like 10,11,12,13...20 for ranges, so I can look for it using WHERE ages LIKE "%17%", and for cases like 16+ I'd have to retrieve those special cases using something like WHERE ages LIKE "%+%" and parse through those.
I'm of course leaning towards the second option, but in the very best scenario, I'm running two queries one for regular items, and one for things like 16+
Is there a better way? If not, do you think you could make either of my models more efficient? Thanks.
You can do it like this:
1. Add lower_age and upper_age columns to your table, both integers that allow NULLs.
2. If lower_age is NULL then there is no lower bound.
3. If upper_age is NULL then there is no upper bound.
4. Combine COALESCE and BETWEEN for your queries.
To clarify (4), you want to say things like this:
select *
from your_table
where $n between coalesce(lower_age, $n) and coalesce(upper_age, $n)
where $n is the age you're looking for. BETWEEN uses inclusive bounds so coalesce(lower_age, $n) ignores $n if lower_age is not NULL and gives you $n >= $n (i.e. an automatic true on that bound) if lower_age is NULL; similarly for the upper_age.
If something is suitable for only 11 year olds, then your [lower_age,upper_age] closed interval would be [11, 11], 16+ would be [16, NULL], six and lower would be [NULL, 6], everyone would be [NULL, NULL], and no one would be [23, 11] or anything else with lower_age > upper_age (or, more likely, invalid data that a CHECK constraint would throw a hissy fit over).
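Here is a small runnable sketch of the COALESCE / BETWEEN idea, using Python's sqlite3 module. The items table, its sample rows, the CHECK constraint wording and the function items_for_age are illustrative assumptions, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        name TEXT,
        lower_age INTEGER,  -- NULL means no lower bound
        upper_age INTEGER,  -- NULL means no upper bound
        CHECK (lower_age IS NULL OR upper_age IS NULL OR lower_age <= upper_age)
    );
    INSERT INTO items VALUES
        ('ten to twenty', 10, 20),
        ('sixteen plus', 16, NULL),
        ('six and under', NULL, 6),
        ('everyone', NULL, NULL),
        ('only eleven', 11, 11);
""")

def items_for_age(conn, age):
    """Return the names of items whose closed age interval contains age."""
    sql = """
        SELECT name
        FROM items
        WHERE :age BETWEEN COALESCE(lower_age, :age) AND COALESCE(upper_age, :age)
        ORDER BY name
    """
    return [name for (name,) in conn.execute(sql, {"age": age})]

print(items_for_age(conn, 17))
```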
You can do this in a number of ways. If you store the age of the user (or whatever) in the row, you can query on it with > 16, < 30, BETWEEN 10 AND 20, and so on. The other option is to store the age groups as a bitmask: keep a reference table of the different ranges, and if a row can belong to several ranges, add their values together.
1 = 10
2 = 16+
4 = 10-20
8 = 20-30
16 = 20+
32 = 30+
.
.
.
.
Then, in the table that stores the person's info, make the column an int or bigint (take your preference); the number tells you which groups each row belongs to. For example:
Table of Users
ID Name BitWise
1 test 2
2 something 6 (2+4)
3 blah 24 (8+16)
However, I think the bitwise approach may be a bit overkill; you might be best off just storing the age as a number and running queries against that. More than likely this will be the most efficient.
You have a range of options (no pun intended). For age recommendations, the easiest way is to store a min_age and max_age and query like this:
select * from item where :age between min_age and max_age
where you have to decide whether you allow nulls for these columns (then you need to use coalesce() or nvl() or whatever function your database provides for dealing with comparisons with nulls), or set boundary values for these columns where you can be sure :age will always fall in between.
Alternatively, you can use a m:n table
create table item_ages (item_id int not null, age int not null, constraint item_ages_pk primary key (item_id, age));
and fill it with explicit values:
item_id | age
-------------
1 | 16
1 | 17
1 | 18
and so on. This is more cumbersome than using a range, but also more flexible, and since your database can index the table and probably store that index in memory, queries should be fast. You only have to touch this table when a new item is entered or the age range for a particular item changes.
Note that CBRRacer's answer has similar properties: both share the idea that you prepare a datastructure that can easily be indexed, and answer the filter question from that index. This is a popular method for storing marketing data in ecommerce applications. The extreme end of that range would be to use a dedicated package for storing inverted indexes for that purpose. But for a simple age recommendation that's of course overkill.
Something like this:
SELECT *
FROM tablename
WHERE 17 BETWEEN start_age AND end_age

Pulling items out of a DB with weighted chance

Let's say I had a table full of records that I wanted to pull random records from. However, I want certain rows in that table to appear more often than others (and which ones vary by user). What's the best way to go about this, using SQL?
The only way I can think of is to create a temporary table, fill it with the rows I want to be more common, and then pad it with other randomly selected rows from the table. Is there a better way?
One way I can think of is to add another column to the table that holds a rolling sum of your weights, then generate a random number that is at least 0 and less than the total of all your weights, and pull the row with the lowest rolling sum value greater than the random number.
For example, if you had four rows with the following weights:
+---+--------+------------+
|row| weight | rollingsum |
+---+--------+------------+
| a | 3 | 3 |
| b | 3 | 6 |
| c | 4 | 10 |
| d | 1 | 11 |
+---+--------+------------+
Then choose a random number n with 0 <= n < 11, and return row a if 0<=n<3, b if 3<=n<6, and so on.
Here are some links on generating rolling sums:
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql.html
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html
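The selection step of the rolling-sum approach can be sketched outside the database as follows. This is only an illustration in Python: the function names pick_weighted and pick_random are made up, the rows are the four from the table above, and integer weights are assumed.

```python
import bisect
import itertools
import random

def pick_weighted(rows, n):
    """Return the id of the first row whose rolling sum is greater than n.

    rows is a list of (row_id, weight) pairs; n must satisfy
    0 <= n < total weight.
    """
    rolling = list(itertools.accumulate(weight for _, weight in rows))
    if not 0 <= n < rolling[-1]:
        raise ValueError("n must be at least 0 and less than the total weight")
    # bisect_right finds the first position whose rolling sum exceeds n
    return rows[bisect.bisect_right(rolling, n)][0]

def pick_random(rows):
    """Draw one row id at random, proportionally to its (integer) weight."""
    total = sum(weight for _, weight in rows)
    return pick_weighted(rows, random.randrange(total))

rows = [("a", 3), ("b", 3), ("c", 4), ("d", 1)]
print(pick_random(rows))
```

In SQL, the same lookup amounts to selecting the row with the smallest rollingsum greater than n.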
I don't know that it can be done very easily with SQL alone. With T-SQL or similar, you could write a loop to duplicate rows, or you can use the SQL to generate the instructions for doing the row duplication instead.
I don't know your probability model, but you could use an approach like this to achieve the latter. Given these table definitions:
RowSource
---------
RowID
UserRowProbability
------------------
UserId
RowId
FrequencyMultiplier
You could write a query like this (SQL Server specific):
SELECT TOP 100 rs.RowId, urp.FrequencyMultiplier
FROM RowSource rs
LEFT JOIN UserRowProbability urp ON rs.RowId = urp.RowId
ORDER BY ISNULL(urp.FrequencyMultiplier, 1) DESC, NEWID()
This would take care of selecting a random set of rows as well as how many should be repeated. Then, in your application logic, you could do the row duplication and shuffle the results.
Start with 3 tables: users, data and user-data. User-data records which rows should be preferred for each user.
Then create one view based on the data rows that are preferred by the user.
Create a second view that has the non-preferred data.
Create a third view which is a union of the first two. The union should select more rows from the preferred data.
Then finally select random rows from the third view.