What's the most efficient way to store sets in a database? - sql

I want to store sets in such a way that I can query for sets that are a superset of, a subset of, or intersect with another set.
For example, if my database has the sets { 1, 2, 3 }, { 2, 3, 5 }, { 5, 10, 12} and I query it for:
Sets which are supersets of { 2, 3 } it should give me { 1, 2, 3 }, { 2, 3, 5 }
Sets which are subsets of { 1, 2, 3, 4 } it should give me { 1, 2, 3 }
Sets which intersect with { 1, 10, 20 } it should give me { 1, 2, 3 }, { 5, 10, 12}

Since some sets are unknown in advance (your comment suggests they come from the client as search criteria), you cannot "precook" the set relationships into the database. Even if you could, that would represent a redundancy and therefore an opportunity for inconsistencies.
Instead, I'd do something like this:
CREATE TABLE "SET" (
ELEMENT INT, -- Or whatever the element type is.
SET_ID INT,
PRIMARY KEY (ELEMENT, SET_ID)
)
Additional suggestions:
Note how the ELEMENT field is at the primary key's leading edge. This should serve the queries below better than PRIMARY KEY (SET_ID, ELEMENT). You can still add the latter as a secondary index if desired, but if you don't, then you should also...
Cluster the table (if your DBMS supports it), which means the whole table is just a single B-tree with no separate table heap. That way you maximize the performance of the queries below and minimize storage requirements (which also improves cache effectiveness). A sketch of both suggestions follows below.
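As a rough sketch (SQL Server syntax assumed here for the clustering; adapt to your DBMS):
-- Same table as above, but with the primary key declared as the clustered index:
CREATE TABLE "SET" (
ELEMENT INT, -- Or whatever the element type is.
SET_ID INT,
CONSTRAINT PK_SET PRIMARY KEY CLUSTERED (ELEMENT, SET_ID)
);
-- Optional secondary index for reading a whole set back by its ID:
CREATE INDEX IX_SET_BY_ID ON "SET" (SET_ID, ELEMENT);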
You can then find the IDs of sets that are equal to or supersets of (for example) the set {2, 3} like this (HAVING COUNT(*) = 2 works because the primary key guarantees each element appears at most once per set):
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID
HAVING COUNT(*) = 2;
And sets that intersect {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID;
And sets that are equal to or are subsets of {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE SET_ID NOT IN (
SELECT SET_ID
FROM "SET" S2
WHERE S2.ELEMENT NOT IN (2, 3)
)
GROUP BY SET_ID;
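A sketch of an equivalent formulation of that subset query, using NOT EXISTS instead of NOT IN (some optimizers handle the anti-join form better; untested):
SELECT DISTINCT SET_ID
FROM "SET" S1
WHERE NOT EXISTS (
SELECT 1
FROM "SET" S2
WHERE S2.SET_ID = S1.SET_ID
AND S2.ELEMENT NOT IN (2, 3)
);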

"Efficient" can mean a lot of things, but the normalized way would be to have an Items table with all the possible elements and a Sets table with all the sets, and an ItemsSets lookup table. If you have sets A and B in your Sets table, queries like (doing this for clarity rather than optimization... also "Set" is a bad name for a table or field, given it is a keyword)
SELECT itemname FROM Items i
WHERE i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'A')
AND i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'B')
That, for instance, is the intersection of A and B (you can almost certainly speed this up as a JOIN, as in the sketch below; again, "efficient" can mean a lot of things, and you'll want an architecture that allows a query like that). Similar queries can be written to find the difference or the complement, to test for equality, and so on.
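For instance, a sketch of that same intersection written as joins (untested, reusing the hypothetical table and column names above):
SELECT i.itemname
FROM Items i
JOIN ItemsSets isa ON isa.itemname = i.itemname AND isa.setname = 'A'
JOIN ItemsSets isb ON isb.itemname = i.itemname AND isb.setname = 'B';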
Now, I know you asked about efficiency, and this is a horribly slow way to query, but it is the only reliably scalable table architecture for the job, and the query above is just an easy way to show how the tables are built. You can do all sorts of crazy things to, say, cache intersections, or store multiple items of a set in one field and parse it, or what have you. But don't: cached info will eventually go stale, static limits imposed by the field size will eventually be exceeded, and ad-hoc members of new tuples will be misinterpreted.
Again, "efficient" can mean a lot of different things, but ultimately an information architecture you as a programmer can understand and reason about is going to be the most efficient.

Related

Calculate aggregates based on fields in BigQuery's repeated records

{
  "outer_1": "1",
  "outer_2": 2,
  "inner": [
    {
      "inner_1": 0,
      "inner_2": null
    },
    {
      "inner_1": 3,
      "inner_2": true
    }
  ]
}
Above you can see roughly what my data looks like. In terms of BigQuery, inner is a repeated record. I would like to calculate aggregates (like the sum of non-null inner_1 fields) for each row.
One idea is to use unnest(inner), calculate the aggregates, and then group by all the other columns. That doesn't seem to be the optimal approach, however: I already have about a million rows, unnesting would create vastly more of them, and immediately aggregating everything back together doesn't sound right.
I suspect there is a more efficient way to iterate over the nested records and calculate aggregates for them.
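Roughly, that unnest-and-regroup idea would look something like the sketch below (inner_col stands in for the repeated field, since inner clashes with a reserved word; the other column names come from the sample above):
-- Note: this collapses rows that share the same outer values and drops rows
-- whose inner_col is empty, which is part of why it doesn't feel right.
select outer_1, outer_2,
sum(i.inner_1) as inner_1_sum,
countif(i.inner_2 is not null) as inner_2_non_null_count
from `project.dataset.table` t, unnest(t.inner_col) as i
group by outer_1, outer_2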
In the end, I would like to have something like:
{
  "outer_1": "1",
  "outer_2": 2,
  "inner_1_sum": 3,
  "inner_2_non_null_count": 1
}
Consider the approach below
select * except(inner_col),
( select as struct
sum(inner_1) as inner_1_sum,
countif(not inner_2 is null) as inner_2_non_null_count
from unnest(inner_col)
).*
from `project.dataset.table`
If applied to the sample data in your question, the output matches the desired result shown above.

SQL Schema - car model with modifiers as unique

I need to build a DB for the following scenario:
I will have an input stream of auctions, and I want to build a price histogram for the items in those auctions (i.e. what they usually go for, etc.).
The input stream looks something like:
[
  {'item_id': 1, ..., 'price': 123, ...},
  {'item_id': 1, ..., 'price': 124, ..., 'modifiers': [1, 2, 3]},
  {'item_id': 1, ..., 'price': 125, ..., 'modifiers': [100, 150, 500, ...]},
  {'item_id': 2, ..., 'price': 200, ...},
  ...
]
As you might have noticed, an item doesn't consist only of an id, but also of modifiers. Think of it as a car that can be modified with extra features (e.g. AC, electric windows, etc.).
What would be the most efficient way to store this information? Basically, what I want is a unique id for each combination that can occur. It's not necessary to store every combination up front; if an auction appears for a combination that doesn't exist yet, it should be created at that point.
I thought of something like:
base_item:
    id
modifier:
    id
item:
    id (autonumber)
    base_item_id
item_modifications:
    item_id (FK item.id)
    modification_id (FK modifier.id)
item_price_history:
    item_id (FK item.id)
    price
    time
This setup might work. The problem is, imagine I have hundreds of millions of such auctions every day (i.e. the auction data is refreshed every 20 minutes and consists of about 2 million auctions on average).
I want to be able to quickly do something like INSERT INTO item_price_history VALUES (some_item_id, some_price, now()), but in order to do that I need to find some_item_id. I know base_item_id and the modifiers (from the auction itself), but making that lookup hundreds of millions of times seems quite costly.
I.e., pseudocode:
for a in auctions:
    base_item_id = a['item_id']
    modifiers = a['modifiers']
    price = a['price']
    actual_item_id = some_query(base_item_id, modifiers)  # expensive. Can it be avoided?
    insert_into_histogram(actual_item_id, price)  # expensive but necessary, I think
Is there some obvious mistake I'm making in this design?
The schema you describe is the textbook solution.
But wow, that would be a beast to work with. As I understand it, every time you added a price record, you would have to find the item record with that exact set of parameters: no more, no less. And if no such item record existed, you would then have to create the item record. Only then could you add the price record.
While I think one should be very careful about denormalizing, I'd be sorely tempted to denormalize in this case. Namely, it seems to me that in practice the key to an item record is the combination of the base item id plus the modifiers. I'd be tempted to create a "modifier string" formed by stringing together codes or IDs for all the modifiers. Of course, to be workable they'd have to be strung together in a defined sequence, i.e. you can't have both "1,2" and "2,1". But then you could easily find the desired item record: just have a function that builds the concatenated modifier string, and select item where base_item_id=#base and modifiers=#modifiers. If not found, create the record and all the associated modifier records.
I'd be strongly inclined to keep this modifier string redundant with the individual modifier records, because data that is strung together like this is very difficult to process. I mean, if you have a textbook schema like you describe and someone wants to know prices for cars with air conditioning, it's very easy to select * from price where price.item in (select id from item join modifier on modifier.item_id=item.id where modifier.name='AC'). But try to do that on the concatenated string, say the ID for AC is "17": select blah blah where modifier_string like '%17%' doesn't work, because it will also find 117 and 171 and so on; like '%,17,%' doesn't work, because it won't match when 17 is the first or the last modifier; etc. That's why I routinely tell people NOT to string data together like this in general: create separate records. But if the most common use case is that you want the record with a specific combination of modifiers, creating a redundant modifier string is a plausible denormalization. (And the first time I typed that I accidentally typed 'demoralization', which may have been a Freudian slip.)
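A minimal sketch of that lookup, assuming an extra modifier_string column on item holding the sorted, comma-joined modifier IDs (e.g. '1,2,3'); the parameter names are placeholders:
-- Find the item for this base item + modifier combination.
SELECT id
FROM item
WHERE base_item_id = @base_item_id
AND modifier_string = @modifier_string;
-- If no row comes back, create the item (plus its item_modifications rows)
-- and reuse the new id for the item_price_history insert.
INSERT INTO item (base_item_id, modifier_string)
VALUES (@base_item_id, @modifier_string);
With a unique index on (base_item_id, modifier_string), that lookup stays cheap even at the volumes described in the question.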

Search efficiently for records matching given set of properties/attributes and their values (exact match, less than, greater than)

It is a fairly simple problem to describe; however, I could not come up with any reasonable solution, so the solution may or may not be easy to cook up. Here is the problem:
Let there be many records describing some objects. For example:
{
  id : 1,
  kind : cat,
  weight : 25 lb,
  color : red,
  age : 10,
  fluffiness : 98,
  attitude : grumpy
}
{
  id : 2,
  kind : robot,
  chassis : aluminum,
  year : 2015,
  hardware : intel curie,
  battery : 5000,
  bat-life : 168,
  weight : 0.5 lb
}
{
  id : 3,
  kind : lightsaber,
  color : red,
  type : single blade,
  power : 1000,
  weight : 25 lb,
  creator : Darth Vader
}
Attributes are not pre-specified so an object could be described using any attribute-value pairs.
If there are 1 000 000 records/objects there could easily be 100 000 different attributes.
My goal is to efficiently search through the data structure(s) that will contain all the records and, if possible, to come up (quickly) with an answer as to which records match the given conditions.
For example a search query could be: Find all cats that weigh more than 20 and are older than 9 and are more fluffy than 98 and are red and whose attitude is "grumpy".
We can assume that there could be an infinite number of records and an infinite number of attributes, but any search query contains no more than 20 numerical (lt, gt) clauses.
One possible implementation using SQL/MySQL I could think of was using fulltext indexes.
For example, I could store the non-numeric attributes as "kind_cat color_red attitude_grumpy", search through them to narrow the result set, and then scan a table containing the numeric attributes for matches. It seems, however (I am not sure at this point), that gt/lt searches might be costly in general with this strategy (I would have to do at least N joins for N numerical clauses).
I also thought of MongoDB, but although MongoDB naturally allows me to store key-value pairs, searching by some fields (not all) would mean creating indexes that contain all keys in all possible orders/permutations (which is impossible).
Can this be done efficiently (maybe in logarithmic time?) using MySQL or any other DBMS? If not, is there a data structure (maybe some multi-dimensional tree?) and an algorithm that allow this kind of search to be executed efficiently at a large scale (considering both time and space complexity)?
If the problem as defined cannot be solved exactly, are there any heuristic approaches that solve it without sacrificing too much?
If I get it right, you're thinking of something like:
create table t
( id int not null
, kind varchar(...) not null
, key varchar(...) not null
, val varchar(...) not null
, primary key (id, kind, key) );
There are several problems with this approach; you can google for EAV to find out more. One example is that you will have to cast val to the appropriate type when doing comparisons ( '2' > '10' ).
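For instance (MySQL syntax, just to illustrate the string-versus-number pitfall):
SELECT '2' > '10';                                  -- 1: compared as strings
SELECT CAST('2' AS SIGNED) > CAST('10' AS SIGNED);  -- 0: compared as numbers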
That said, an index like:
create unique index ix1 on t (kind, key, val, id)
will reduce the pain you will be suffering slightly, but the design won't scale well, and with 1E6 rows and 1E5 attributes the performance will be far from good. Your example query would look something like:
select a.id
from ( select id
       from ( select id, val
              from t
              where kind = 'cat'
                and key = 'weight'
            ) as w
       where cast(val as int) > 20
     ) as a
join ( select id
       from ( select id, val
              from t
              where kind = 'cat'
                and key = 'age'
            ) as g
       where cast(val as int) > 9
     ) as b
  on a.id = b.id
join ( select id
       from ( select id, val
              from t
              where kind = 'cat'
                and key = 'fluffy'
            ) as f
       where cast(val as int) > 98
     ) as c
  on a.id = c.id
join ...

What is LINQ operator to perform division operation on Tables?

To select the elements belonging to a particular group in a table, when the elements and their group type are contained in one table and all group types are listed in another table, we perform division on the tables.
I am trying to write a LINQ query that performs the same operation. Please tell me how I can do it.
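(For reference, division on tables can be sketched in SQL roughly as follows, with hypothetical tables Elements(element, group_type) and GroupTypes(group_type): keep the elements that occur with every listed group type.)
SELECT e.element
FROM Elements e
JOIN GroupTypes g ON g.group_type = e.group_type
GROUP BY e.element
HAVING COUNT(DISTINCT e.group_type) = (SELECT COUNT(*) FROM GroupTypes);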
Apparently, from the definition in that blog post, you'd want Intersect and Except.
Table1.Except(Table1.Intersect(Table2));
or rather in your case I'd guess
Table1.Where(d => !Table2.Any(t => t.Type == d.Type));
not so hard.
I don't think performance can be made much better, actually. Maybe with a groupby.
Table1.GroupBy(t => t.Type).Where(g => !Table2.Any(t => t.Type == g.Key)).SelectMany(g => g);
This should be better for performance: it searches the second table only once for each distinct type, not once for every row in Table1.
It's a bit difficult to determine exactly what you're asking, but it sounds like you want to determine the elements that are common to two tables or streams. If so, I think you want Intersect.
Take a look here
It works something like this:
int[] array1 = { 1, 2, 3 };
int[] array2 = { 2, 3, 4 };
var intersect = array1.Intersect(array2);
Returns 2 and 3.
The opposite of this would be Except().

Find a series of data using non-exact measurements (fuzzy logic)

This is a more complex follow-up question to: Efficient way to look up sequential values
Each Product can have many Segment rows (thousands). Each segment has a position column that starts at 1 for each product (1, 2, 3, 4, 5, ...) and a value column that can contain arbitrary values such as 323.113, 5423.231, 873.42, 422.64, 763.1, etc. The data is read-only.
It may help to think of the product as a song and the segments as a set of musical notes in the song.
Given a subset of contiguous segments, like a snippet of a song, I would like to identify potential matches for products. However, due to potential errors in measurements, the segments in the subset may not match the segments in the database exactly.
How can I identify product candidates by finding the segments of products which most closely match the subset of segments I have measured? Also, is a database the best medium for this type of data?
-
Here are just some thoughts on how I was planning to approach this problem. Please don't take these as exact requirements; I am open to any kind of algorithm that makes this work as well as possible. I was thinking there would need to be multiple threshold variables for determining closeness. One possibility might be to implement a proximity threshold and a match threshold.
For example, given these values:
Product A contains these segments: 11,21,13,13,15.
Measurement 1 has captured: 20,14,14,15.
Measurement 2 has captured: 11,21,78,13.
Measurement 3 has captured: 15,13,21,13,11.
If a proximity threshold allowed the measured segment to be 1 above or below the actual segment, then Measurement 1 may match Product A because, although many segments do not match exactly, they are within the proximity threshold relative to the actual values.
If a match threshold allowed for measurements with matches of 3 or more, Measurement 2 may return Product A because, although one of the segments (78) far exceeds the proximity threshold, it still matches 3 segments in the correct order and so is within the match threshold.
Measurement 3 would not match Product A because, although all measured segments exist in the actual segments, they are not within the proximity or match thresholds.
Update: One of the answers asked me to define what I mean by most closely match. I'm not exactly sure how to answer that, but I'll try to explain by continuing with the song analogy. Let's say the segments represent maximum frequencies of a recorded song. If I record that same song again it will be similar, but due to background noise and other limitations of recording equipment, some of the frequencies will match, some will be close, and a few will be way off. In this scenario, how would you define when one recording "matches" another? That's the same kind of matching logic I'm looking for to use in this problem.
From the information you posted, this can be solved with Edmonds's blossom perfect-matching algorithm (e.g. the Blossom V implementation). You can either minimize or maximize the objective and it will always find the best match. Alternatively, you could use a brute-force solution with two loops. The Wikipedia article about Edmonds's matching algorithm: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm
You need to come up with a definition for "most closely match". I don't know how anyone here can help you with that, since no one here knows the business requirements or the intricacies of the data. Your two methods both sound reasonable, but I have no idea whether they actually are or not.
As for whether or not a database is the correct medium for this kind of data, I'd say that a database is probably the perfect medium for storing the data, but it is very likely not the correct medium for processing it. Whether processing it there is feasible will depend on your final definition of what constitutes "most closely match".
As a quick note, SSIS has some fuzzy-match capabilities built in for processing data. I've only played around with them, though, and that was a couple of years ago, so I don't know whether it would work for what you're doing or not.
If you take your song example literally, one approach is to boil down your input to a bit-vector fingerprint, and then look up that fingerprint in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or trying e.g. all bit-vectors that are only one or two bit errors away from your fingerprint.
If you have access to the ACM digital library, you can read a description of this sort of approach in "The Shazam Music Recognition service" at http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.
The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.
To answer your second question, depending on exactly what a position corresponds to, you might consider boiling down the numbers to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.
You could take the approach of matching the measurements against each product position by position and calculating the difference at each position. Then slide the measurements along by one position and calculate the differences again. Then find which slide position scored the lowest total difference. Do this for every product, and you then know which product the measurements match most closely.
Test tables and data:
CREATE TABLE [dbo].[Segment]
(
[ProductId] INT,
[Position] INT,
[Value] INT
)
INSERT [dbo].[Segment]
VALUES (1, 1, 300),
(1, 2, 5000),
(1, 3, 900),
(1, 4, 400),
(1, 5, 800),
(2, 1, 400),
(2, 2, 6000),
(2, 3, 1000),
(2, 4, 500),
(2, 5, 900),
(3, 1, 400),
(3, 2, 5400),
(3, 3, 900),
(3, 4, 400),
(3, 5, 900)
CREATE TABLE #Measurement
(
[Position] INT,
[Value] INT
)
INSERT #Measurement
VALUES (1, 5400),
(2, 900),
(3, 400)
As you can see, the measurements match (a subset of) the third product exactly.
Some helpers:
CREATE TABLE #ProductSegmentCount
(
[ProductId] INT,
[SegmentCount] INT
)
INSERT #ProductSegmentCount
SELECT [ProductId], MAX([Position])
FROM [dbo].[Segment]
GROUP BY [ProductId]
DECLARE #MeasurementSegmentCount INT = (SELECT MAX([Position]) FROM #Measurement)
A recursive common table expression to show the products ordered by closest match:
;WITH [cteRecursive] AS
(
SELECT s.[ProductId],
0 AS [RecursionId],
m.[Position] AS [MeasurementPosition],
s.[Position] AS [SegmentPosition],
ABS(m.[Value] - s.[Value]) AS [Difference]
FROM #Measurement m
INNER JOIN [dbo].[Segment] s
ON m.[Position] = s.[Position]
UNION ALL
SELECT s.[ProductId],
[RecursionId] + 1 AS [RecursionId],
m.[Position],
s.[Position],
ABS(m.[Value] - s.[Value]) AS [Difference]
FROM [cteRecursive] r
INNER JOIN #Measurement m
ON m.[Position] = r.[MeasurementPosition]
INNER JOIN [dbo].[Segment] s
ON r.[ProductId] = s.[ProductId]
AND m.[Position] + (r.[RecursionId]) = s.[Position]
INNER JOIN #ProductSegmentCount psc
ON s.[ProductId] = psc.[ProductId]
WHERE [RecursionId] <= ABS(#MeasurementSegmentCount - psc.[SegmentCount])
)-- select * from [cteRecursive] where [ProductId] = 3 order by RecursionId, SegmentPosition
, [cteDifferences] AS
(
SELECT [ProductId], [RecursionId], SUM([Difference]) AS [Difference]
FROM [cteRecursive]
GROUP BY [ProductId], [RecursionId]
)-- select * from [cteDifferences]
SELECT [ProductId], MIN([Difference]) AS [Difference]
FROM [cteDifferences]
GROUP BY [ProductId]
ORDER BY MIN([Difference])
OPTION (MAXRECURSION 0)