Fuzzy grouping in Postgres


I have a table with contents that look similar to this:
id | title
------------
1 | 5. foo
2 | 5.foo
3 | 5. foo*
4 | bar
5 | bar*
6 | baz
7 | BAZ
…and so on. I would like to group by the titles and ignore the extra bits. I know Postgres can do this:
SELECT * FROM (
SELECT regexp_replace(title, '[*.]+$', '') AS title
FROM table
) AS a
GROUP BY title
However, that's quite simple and would get very unwieldy if I tried to anticipate all the possible variations. So, the question is, is there a more general way to do fuzzy grouping than using regexp? Is it even possible, at least without breaking one's back doing it?
Edit: To clarify, there is no preference for any of the variations, and this is what the table should look like after grouping:
title
------
5. foo
bar
baz
I.e., the variations would be items that are different just by a few characters or capitalization, and it doesn't matter which ones are left as long as they're grouped.

For any grouping you should have transitive equality, that is a ~= b, b ~= c => a ~= c.
Formulate it strictly using words and we'll try to formulate it using SQL.
For instance, which group should foo*bar go to?
Update:
This query strips all non-alphanumeric characters, ignores case, and returns one title from each group (without an ORDER BY, which title is returned for a group is unspecified):
SELECT DISTINCT ON (REGEXP_REPLACE(UPPER(title), '[^[:alnum:]]', '', 'g')) title
FROM (
VALUES
(1, '5. foo'),
(2, '5.foo'),
(3, '5. foo*'),
(4, 'bar'),
(5, 'bar*'),
(6, 'baz'),
(7, 'BAZ')
) rows (id, title)
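The normalize-then-group idea behind this query can also be sketched outside the database. The following is a minimal illustration, not part of the original answer: the helper names normalize and group_titles are hypothetical, and the pattern only keeps ASCII letters and digits, whereas [:alnum:] in Postgres depends on the locale.

```python
import re

def normalize(title: str) -> str:
    """Upper-case the title and drop every character that is not A-Z or 0-9."""
    return re.sub(r"[^A-Z0-9]", "", title.upper())

def group_titles(titles):
    """Map each normalized key to the list of original titles that share it."""
    groups = {}
    for title in titles:
        groups.setdefault(normalize(title), []).append(title)
    return groups

titles = ["5. foo", "5.foo", "5. foo*", "bar", "bar*", "baz", "BAZ"]
groups = group_titles(titles)

# Keep the first title seen in each group, in input order.
representatives = [members[0] for members in groups.values()]
print(representatives)  # ['5. foo', 'bar', 'baz']
```

Titles that differ by more than punctuation or case (for example a typo inside the word) still land in different groups, which is why the grouping rule has to be spelled out first.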

At some time, you are going to have to define what makes a set of values belong together in a group. If that's too hard, maybe you should prohibit and inhibit the entry of fuzzy data, or if you must permit it, add a column that contains a sanitized version of the title for use by the grouping operations.

Related

Take the difference between two lists in PostgreSQL

I have two columns in my table with multiple values, and I want to get the values that are in one of the columns and not in the other column.
It's best described by an example:
including_ids | excluding_ids
123, 456 | 456, 789
I want to create a new column of all the including_ids that are not in the excluding_ids, so in the above example:
including_ids | excluding_ids | remaining_ids
123, 456 | 456, 789 | 123
If easier, I could also represent the values as lists or arrays or something like that.

You can use arrays for that:
CREATE TABLE mytable (including_ids integer[], excluding_ids integer[]);
INSERT INTO mytable VALUES ('{123,456}', '{456,789}');
INSERT INTO mytable VALUES ('{1,2,3}', '{3,4,5}');
Then you can get the result you want like this:
SELECT (SELECT array_agg(i)
FROM unnest(m.including_ids) AS arr(i)
WHERE NOT ARRAY[i] <@ m.excluding_ids)
FROM mytable AS m;
array_agg
-----------
{123}
{1,2}
(2 rows)
But, as jarlh commented, using arrays or other composite data types is often a bad idea if you want to manipulate the values inside the database a lot. A more normalized data model is often a better idea: queries will become simpler, and the performance will be better.
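As a rough sketch of what the normalized alternative could look like, the example below stores one id per row and uses EXCEPT to compute the difference. It runs on SQLite through Python's sqlite3 module purely so that it is self-contained; the table names including and excluding and the row_id column are assumptions made for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (parent row, id) instead of one array per row.
    CREATE TABLE including (row_id INTEGER, id INTEGER);
    CREATE TABLE excluding (row_id INTEGER, id INTEGER);
    INSERT INTO including VALUES (1, 123), (1, 456), (2, 1), (2, 2), (2, 3);
    INSERT INTO excluding VALUES (1, 456), (1, 789), (2, 3), (2, 4), (2, 5);
""")

# The set difference becomes a plain EXCEPT between the two tables.
remaining = conn.execute("""
    SELECT row_id, id FROM including
    EXCEPT
    SELECT row_id, id FROM excluding
    ORDER BY row_id, id
""").fetchall()
print(remaining)  # [(1, 123), (2, 1), (2, 2)]
```

The same EXCEPT query is standard SQL and should carry over to PostgreSQL unchanged.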

You could also use the intarray extension:
create extension intarray;
SELECT including_ids - excluding_ids as remaining_ids
FROM mytable;
remaining_ids
───────────────
{123}
{1,2}
(2 rows)

SQL: Check if at least one element of an array is in subquery

I have a table like this:
id | name | artists
-------------------
1 | XYZ | {Some Dude, Whatever}
2 | ABC | {Blah Blah Blah, Whatever}
3 | EFG | {Running, Out, Of, Made, Up, Names}
I have a subquery that returns one column called name with a bunch of artist's names. I need a way to check if at least one of the elements of artists (for each of the rows) is contained in the results of that subquery. That is, if the subquery returns this:
name
----
Some Dude
Whatever
Blah Blah Blah
then, I want to select only the rows with id 1 and 2 in my example, because none of the artists in id 3 are returned by the subquery.
I do know I can do single_element = ANY(subquery) but that only tests a single element. I've tried doing:
SELECT * FROM table WHERE ANY(artists) = ANY(subquery)
but that fails immediately with "ERROR: syntax error at or near 'any'".
Thanks in advance!

You can use the && operator to test whether two arrays overlap, that is, whether they share at least one element. It's documented in the PostgreSQL documentation section on array functions and operators.
WITH artists (name) AS (
VALUES
('Blah Blah Blah'::text),
('Whatever'),
('Some Dude')
),
my_table (id, name, artists) AS (
VALUES
(1,'XYZ',ARRAY['Some Dude'::TEXT, 'Whatever'::TEXT]),
(2,'ABC',ARRAY['Blah Blah Blah', 'Whatever']),
(3,'EFG',ARRAY['Running', 'Out', 'Of', 'Made', 'Up', 'Names'])
)
SELECT *
FROM my_table
WHERE artists && (SELECT ARRAY_AGG(name) FROM artists)
Also, the example above shows how to convert a subquery into an array (with ARRAY_AGG) so that you can use the overlap operator.
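For readers who think in application code, the overlap test is the same as asking whether two sets are not disjoint. This is only an illustrative sketch in Python; the function name overlaps is hypothetical.

```python
def overlaps(artists, names):
    """Return True if at least one artist also appears in names."""
    return not set(artists).isdisjoint(names)

rows = [
    (1, "XYZ", ["Some Dude", "Whatever"]),
    (2, "ABC", ["Blah Blah Blah", "Whatever"]),
    (3, "EFG", ["Running", "Out", "Of", "Made", "Up", "Names"]),
]
subquery_names = {"Some Dude", "Whatever", "Blah Blah Blah"}

matching_ids = [row_id for row_id, _name, artists in rows
                if overlaps(artists, subquery_names)]
print(matching_ids)  # [1, 2]
```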

Store multidimensional array in database: relational or multidimensional?

I have read numerous posts along the lines of multidimensional to single dimension, multidimensional database, and so on, but none of the answers helped. I did find a lot of documentation on Google but that only provided background information and didn't answer the question at hand.
I have a lot of strings that are related to one another. They are needed in a PHP script. The structure is hierarchical. Here is an example.
A:
AA:
AAA
AAC
AB
AE:
AEA
AEE:
AEEB
B:
BA:
BAA
BD:
BDC:
BDCB
BDCE
BDD:
BDDA
BE:
BED:
BEDA
C:
CC:
CCB:
CCBC
CCBE
CCC:
CCCA
CCCE
CE
Each indent supposes a new level in the multidimensional array.
The goal is to retrieve an element with PHP by name and all its descendants. If for instance I query for A, I want to receive an array of strings containing array('A', 'AA', 'AAA', 'AAC', 'AB', 'AE', 'AEA', 'AEE', 'AEEB'). The 'issue' is that queries can also be made to lower-level elements. If I query AEE, I want to get array('AEE', 'AEEB').
As I understand the concept of relational databases, this means that I cannot use a relational database because there is no common 'key' between elements. The solution that I thought is possible, is assigning PARENT elements to each cell. So, in a table:
CELL | PARENT
A NULL
AA A
AAA AA
AAC AA
AB A
AE A
AEA AE
AEE AE
AEEB AEE
By doing so, I think you should be able to query the given string, and all items that share this parent, and then recursively go down this path until no more items are found. However, this seems rather slow to me because the whole search space would need to be looked through on each level - which is exactly what you don't want in a multidimensional array.
So I am a bit at a loss. Note that there are actually around 100,000 strings structured in this way, so speed is important. Luckily the database is static and would not change. How can I store such a data structure in a database without having to deal with long loops and search times? And which kind of database software and data type is best suited for this? It has come to my attention that PostgreSQL is already present on our servers so I'd rather stick with that.
As I said I am new to databases but I am very eager to learn. Therefore, I am looking for an extensive answer that goes into detail and provides advantages and disadvantages of a certain approach. Performance is key. An expected answer would contain the best database type and language for this use case, and also script in that language to build such a structure.

The goal is to retrieve an element with PHP by name and all its descendants.
If that is all you need, you can use a LIKE search
SELECT *
FROM Table1
WHERE CELL LIKE 'AEE%';
With an index beginning with CELL this is a range check, which is fast.
If your data doesn't look like that, you can create a path column which looks like a directory path and contains all nodes "on the way/path" from root to the element.
| id | CELL | parent_id | path |
|====|======|===========|==========|
| 1 | A | NULL | 1/ |
| 2 | AA | 1 | 1/2/ |
| 3 | AAA | 2 | 1/2/3/ |
| 4 | AAC | 2 | 1/2/4/ |
| 5 | AB | 1 | 1/5/ |
| 6 | AE | 1 | 1/6/ |
| 7 | AEA | 6 | 1/6/7/ |
| 8 | AEE | 6 | 1/6/8/ |
| 9 | AEEB | 8 | 1/6/8/9/ |
To retrieve all descendants of 'AE' (including itself) your query would be
SELECT *
FROM tree t
WHERE path LIKE '1/6/%';
or (MySQL specific concatenation)
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
AND t.path LIKE CONCAT(r.path, '%');
Result:
| id | CELL | parent_id | path |
|====|======|===========|==========|
| 6 | AE | 1 | 1/6/ |
| 7 | AEA | 6 | 1/6/7/ |
| 8 | AEE | 6 | 1/6/8/ |
| 9 | AEEB | 8 | 1/6/8/9/ |
Performance
I have created 100K rows of fake data on MariaDB with the sequence plugin using the following script:
drop table if exists tree;
CREATE TABLE tree (
`id` int primary key,
`CELL` varchar(50),
`parent_id` int,
`path` varchar(255),
unique index (`CELL`),
unique index (`path`)
);
DROP TRIGGER IF EXISTS `tree_after_insert`;
DELIMITER //
CREATE TRIGGER `tree_after_insert` BEFORE INSERT ON `tree` FOR EACH ROW BEGIN
if new.id = 1 then
set new.path := '1/';
else
set new.path := concat((
select path from tree where id = new.parent_id
), new.id, '/');
end if;
END//
DELIMITER ;
insert into tree
select seq as id
, conv(seq, 10, 36) as CELL
, case
when seq = 1 then null
else floor(rand(1) * (seq-1)) + 1
end as parent_id
, null as path
from seq_1_to_100000
;
DROP TRIGGER IF EXISTS `tree_after_insert`;
-- runtime ~ 4 sec.
Tests
Count all elements under the root:
SELECT count(*)
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '1'
AND t.path LIKE CONCAT(r.path, '%');
-- result: 100000
-- runtime: ~ 30 ms
Get subtree elements under a specific node:
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '3B0'
AND t.path LIKE CONCAT(r.path, '%');
-- runtime: ~ 30 ms
Result:
| id | CELL | parent_id | path |
|=======|======|===========|=====================================|
| 4284 | 3B0 | 614 | 1/4/11/14/614/4284/ |
| 6560 | 528 | 4284 | 1/4/11/14/614/4284/6560/ |
| 8054 | 67Q | 6560 | 1/4/11/14/614/4284/6560/8054/ |
| 14358 | B2U | 6560 | 1/4/11/14/614/4284/6560/14358/ |
| 51911 | 141Z | 4284 | 1/4/11/14/614/4284/51911/ |
| 55695 | 16Z3 | 4284 | 1/4/11/14/614/4284/55695/ |
| 80172 | 1PV0 | 8054 | 1/4/11/14/614/4284/6560/8054/80172/ |
| 87101 | 1V7H | 51911 | 1/4/11/14/614/4284/51911/87101/ |
PostgreSQL
This also works for PostgreSQL. Only the string concatenation syntax has to be changed:
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
AND t.path LIKE r.path || '%';
Demo: sqlfiddle - rextester
How does the search work
If you look at the test example, you'll see that all paths in the result begin with '1/4/11/14/614/4284/'. That is the path of the subtree root with CELL='3B0'. If the path column is indexed, the engine will find them all efficiently, because the index is sorted by path. It's like you would want to find all the words that begin with 'pol' in a dictionary with 100K words. You wouldn't need to read the entire dictionary.
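The dictionary analogy can be made concrete with a small sketch. It assumes the index behaves like a sorted list of paths: a binary search finds the first entry that is not smaller than the prefix, and the scan stops at the first entry that no longer starts with it. The function name prefix_range is hypothetical, and a real B-tree index is more involved than a Python list.

```python
from bisect import bisect_left

def prefix_range(sorted_paths, prefix):
    """Return all entries of a sorted list that start with prefix."""
    # Binary search: first position whose entry is >= prefix.
    start = bisect_left(sorted_paths, prefix)
    end = start
    # Matching entries are contiguous, so stop at the first non-match.
    while end < len(sorted_paths) and sorted_paths[end].startswith(prefix):
        end += 1
    return sorted_paths[start:end]

paths = sorted([
    "1/", "1/2/", "1/2/3/", "1/2/4/", "1/5/",
    "1/6/", "1/6/7/", "1/6/8/", "1/6/8/9/",
])
print(prefix_range(paths, "1/6/"))
# ['1/6/', '1/6/7/', '1/6/8/', '1/6/8/9/']
```

The work is proportional to the logarithm of the list size plus the number of matches, not to the size of the whole list.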

Performance
As others have already mentioned, performance shouldn't be an issue as long as you use a suitable indexed primary key and ensure that relations use foreign keys. In general, an RDBMS is highly optimised to efficiently perform joins on indexed columns and referential integrity can also provide the advantage of preventing orphans. 100,000 may sound a lot of rows but this isn't going to stretch an RDBMS as long as the table structure and queries are well designed.
Choice of RDBMS
One factor in answering this question lies in choosing a database with the ability to perform a recursive query via a Common Table Expression (CTE), which can be very useful to keep the queries compact or essential if there are queries that do not limit the number of descendants being traversed.
Since you've indicated that you are free to choose the RDBMS but it must run under Linux, I'm going to throw PostgreSQL out there as a suggestion since it has this feature and is freely available. (This choice is of course very subjective and there are advantages and disadvantages of each but a few other contenders I'd be tempted to rule out are MySQL since it doesn't currently support CTEs, MariaDB since it doesn't currently support *recursive* CTEs, SQL Server since it doesn't currently support Linux. Other possibilities such as Oracle may be dependent on budget / existing resources.)
SQL
Here's an example of the SQL you'd write to perform your first example of finding all the descendants of 'A':
WITH RECURSIVE rcte AS (
SELECT id, letters
FROM cell
WHERE letters = 'A'
UNION ALL
SELECT c.id, c.letters
FROM cell c
INNER JOIN rcte r
ON c.parent_cell_id = r.id
)
SELECT letters
FROM rcte
ORDER BY letters;
Explanation
The above SQL sets up a "Common Table Expression", i.e. a SELECT to run whenever its alias (in this case rcte) is referenced. The recursion happens because this is referenced within itself. The first part of the UNION picks the cell at the top of the hierarchy. Its descendants are all found by carrying on joining on children in the second part of the UNION until no further records are found.
Demo
The above query can be seen in action on the sample data here: http://rextester.com/HVY63888
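To try the recursive query without a PostgreSQL server, here is a minimal sketch that runs it on SQLite via Python's sqlite3 module (SQLite also supports WITH RECURSIVE). The cell table layout is inferred from the column names used in the query and is an assumption, as is the helper name descendants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cell (
        id INTEGER PRIMARY KEY,
        letters TEXT NOT NULL UNIQUE,
        parent_cell_id INTEGER REFERENCES cell(id)
    );
    -- Index the foreign key so each recursion step is an index lookup.
    CREATE INDEX cell_parent_idx ON cell(parent_cell_id);
    INSERT INTO cell VALUES
        (1, 'A', NULL), (2, 'AA', 1), (3, 'AAA', 2), (4, 'AAC', 2),
        (5, 'AB', 1), (6, 'AE', 1), (7, 'AEA', 6), (8, 'AEE', 6),
        (9, 'AEEB', 8);
""")

def descendants(start):
    """Return the start cell and all of its descendants, sorted by name."""
    rows = conn.execute("""
        WITH RECURSIVE rcte AS (
            SELECT id, letters FROM cell WHERE letters = ?
            UNION ALL
            SELECT c.id, c.letters
            FROM cell c
            INNER JOIN rcte r ON c.parent_cell_id = r.id
        )
        SELECT letters FROM rcte ORDER BY letters
    """, (start,)).fetchall()
    return [letters for (letters,) in rows]

print(descendants("AEE"))  # ['AEE', 'AEEB']
```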

You absolutely can do that (if I've read your question correctly).
Depending on your RDBMS you might have to choose a different way.
Your basic structure of having a parent is correct.
In SQL Server, use a recursive common table expression (CTE) to anchor the start and work down:
https://technet.microsoft.com/en-us/library/ms186243(v=sql.105).aspx
Edit: For Linux use the same in PostgreSQL https://www.postgresql.org/docs/current/static/queries-with.html
Oracle has a different approach, though I think you might be able to use the CTE as well.
https://oracle-base.com/articles/misc/hierarchical-queries
For 100k rows I don't imagine performance will be an issue, though I'd still index PK & FK because that's the right thing to do. If you're really concerned about speed then reading it into memory and building a hash table of linked lists might work.
Pros & cons - it pretty much comes down to readability and suitability for your RDBMS.
It's an already solved problem (again, assuming I've not missed anything) so you'll be fine.

I have two words for you... "RANGE KEYS"
You may find this technique to be incredibly powerful and flexible. You'll be able to navigate your hierarchies with ease, and support variable depth aggregation without the need for recursion.
In the demonstration below, we'll build the hierarchy via a recursive CTE. For larger hierarchies (150K+), I'm willing to share a much faster build if needed.
Since your hierarchies are slow moving (like mine), I tend to store them in a normalized structure and rebuild as necessary.
How about some actual code?
Declare @YourTable table (ID varchar(25),Pt varchar(25))
Insert into @YourTable values
('A' ,NULL),
('AA' ,'A'),
('AAA' ,'AA'),
('AAC' ,'AA'),
('AB' ,'A'),
('AE' ,'A'),
('AEA' ,'AE'),
('AEE' ,'AE'),
('AEEB','AEE')
Declare @Top varchar(25) = null --<< Sets top of Hier Try 'AEE'
Declare @Nest varchar(25) ='|-----' --<< Optional: Added for readability
IF OBJECT_ID('TestHier') IS NOT NULL
Begin
Drop Table TestHier
End
;with cteHB as (
Select Seq = cast(1000+Row_Number() over (Order by ID) as varchar(500))
,ID
,Pt
,Lvl=1
,Title = ID
From @YourTable
Where IsNull(@Top,'TOP') = case when @Top is null then isnull(Pt,'TOP') else ID end
Union All
Select cast(concat(cteHB.Seq,'.',1000+Row_Number() over (Order by cteCD.ID)) as varchar(500))
,cteCD.ID
,cteCD.Pt
,cteHB.Lvl+1
,cteCD.ID
From @YourTable cteCD
Join cteHB on cteCD.Pt = cteHB.ID)
,cteR1 as (Select Seq,ID,R1=Row_Number() over (Order By Seq) From cteHB)
,cteR2 as (Select A.Seq,A.ID,R2=Max(B.R1) From cteR1 A Join cteR1 B on (B.Seq like A.Seq+'%') Group By A.Seq,A.ID )
Select B.R1
,C.R2
,A.ID
,A.Pt
,A.Lvl
,Title = Replicate(@Nest,A.Lvl-1) + A.Title
Into dbo.TestHier
From cteHB A
Join cteR1 B on A.ID=B.ID
Join cteR2 C on A.ID=C.ID
Order By B.R1
Show the entire hierarchy (I added the Title and Nesting for readability)
Select * from TestHier Order By R1
Just to state the obvious, the Range Keys are R1 and R2. You may also notice that R1 maintains the presentation sequence. Leaf nodes are where R1=R2 and Parents or rollups define the span of ownership.
To Show All Descendants
Declare @GetChildrenOf varchar(25) = 'AE'
Select A.*
From TestHier A
Join TestHier B on B.ID=@GetChildrenOf and A.R1 Between B.R1 and B.R2
Order By R1
To Show Path
Declare @GetParentsOf varchar(25) = 'AEEB'
Select A.*
From TestHier A
Join TestHier B on B.ID=@GetParentsOf and B.R1 Between A.R1 and A.R2
Order By R1
Clearly these are rather simple illustrations. Over time, I have created a series of helper functions, both Scalar and Table Value Functions. I should also state that you should NEVER hard-code range keys in your work, because they change whenever the hierarchy is rebuilt.
In Summary
If you have a point (or even a series of points), you'll have its range and therefore you'll immediately know where it resides and what rolls into it.
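To make the range-key idea tangible outside T-SQL, here is a small sketch that assigns R1 and R2 with a depth-first walk and then answers the two questions above with plain range comparisons. It is an illustration under assumptions, not the author's build procedure: the function names are hypothetical, siblings are visited in alphabetical order, and the recursive walk is only suitable for shallow trees.

```python
def build_range_keys(parent_of):
    """Assign (r1, r2) to every node: r1 is the node's position in a
    depth-first walk, r2 is the largest r1 found inside its subtree."""
    children, roots = {}, []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children.setdefault(parent, []).append(node)

    keys = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        r1 = counter
        for child in sorted(children.get(node, [])):
            visit(child)
        # After the subtree is done, counter holds the last r1 handed out.
        keys[node] = (r1, counter)

    for root in sorted(roots):
        visit(root)
    return keys

def descendants_of(keys, node):
    """Nodes whose r1 falls inside the node's own range (includes the node)."""
    lo, hi = keys[node]
    hits = [n for n, (r1, _r2) in keys.items() if lo <= r1 <= hi]
    return sorted(hits, key=lambda n: keys[n][0])

def ancestors_of(keys, node):
    """Nodes whose range contains the node's r1 (includes the node)."""
    r1 = keys[node][0]
    hits = [n for n, (a, b) in keys.items() if a <= r1 <= b]
    return sorted(hits, key=lambda n: keys[n][0])

parent_of = {"A": None, "AA": "A", "AAA": "AA", "AAC": "AA", "AB": "A",
             "AE": "A", "AEA": "AE", "AEE": "AE", "AEEB": "AEE"}
keys = build_range_keys(parent_of)
print(descendants_of(keys, "AE"))   # ['AE', 'AEA', 'AEE', 'AEEB']
print(ancestors_of(keys, "AEEB"))   # ['A', 'AE', 'AEE', 'AEEB']
```

Leaf nodes end up with r1 equal to r2, matching the description above.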

This approach does not depend on the existence of a path or parent column. It is relational, not recursive.
Since the table is static, create a materialized view containing just the leaves to make searching faster:
create materialized view leave as
select cell
from (
select cell,
lag(cell,1,cell) over (order by cell desc) not like cell || '%' as leave
from t
) s
where leave;
table leave;
cell
------
CCCE
CCCA
CCBE
CCBC
BEDA
BDDA
BDCE
BDCB
BAA
AEEB
AEA
AB
AAC
AAA
A materialized view is computed once, at creation, not at each query like a plain view. Create an index to speed it up:
create index cell_index on leave(cell);
If the source table is eventually altered, just refresh the view:
refresh materialized view leave;
The search function receives text and returns a text array:
create or replace function get_descendants(c text)
returns text[] as $$
select array_agg(distinct l order by l)
from (
select left(cell, generate_series(length(c), length(cell))) as l
from leave
where cell like c || '%'
) s;
$$ language sql stable strict; -- stable, not immutable: the function reads from the leave view
Pass the desired match to the function:
select get_descendants('A');
get_descendants
-----------------------------------
{A,AA,AAA,AAC,AB,AE,AEA,AEE,AEEB}
select get_descendants('AEE');
get_descendants
-----------------
{AEE,AEEB}
Test data:
create table t (cell text);
insert into t (cell) values
('A'),
('AA'),
('AAA'),
('AAC'),
('AB'),
('AE'),
('AEA'),
('AEE'),
('AEEB'),
('B'),
('BA'),
('BAA'),
('BD'),
('BDC'),
('BDCB'),
('BDCE'),
('BDD'),
('BDDA'),
('BE'),
('BED'),
('BEDA'),
('C'),
('CC'),
('CCB'),
('CCBC'),
('CCBE'),
('CCC'),
('CCCA'),
('CCCE'),
('CE');

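For your scenario, I would suggest the Nested Sets approach in PostgreSQL. Each node stores a left and a right boundary, much like the opening and closing tags of an XML element, so a subtree can be selected with a plain range condition in a relational database.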
Performance
If you index the lft and rgt columns, you don't need recursive queries to get the data. Even though the data set seems large, retrieval will be very fast.
Sample
/*
1 A:
2 AA:
3 AAA
4 AAC
5 AB
6 AE:
7 AEA
8 AEE:
9 AEEB
10 B:
*/
CREATE TABLE tree(id int, CELL varchar(4), lft int, rgt int);
INSERT INTO tree
("id", CELL, "lft", "rgt")
VALUES
(1, 'A', 1, 9),
(2, 'AA', 2, 4),
(3, 'AAA', 3, 3),
(4, 'AAC', 4, 4),
(5, 'AB', 5, 5),
(6, 'AE', 6, 9),
(7, 'AEA', 7, 7),
(8, 'AEE', 8, 9),
(9, 'AEEB', 9, 9)
;
SELECT hc.*
FROM tree hp
JOIN tree hc
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hp.id = 2

SQL: Select distinct based on regular expression

Basically, I'm dealing with a horribly set up table that I'd love to rebuild, but am not sure I can at this point.
So, the table is of addresses, and it has a ton of similar entries for the same address. But there are sometimes slight variations in the address (i.e., a room # is tacked on IN THE SAME COLUMN, ugh).
Like this:
id | place_name | place_street
1 | Place Name One | 1001 Mercury Blvd
2 | Place Name Two | 2388 Jupiter Street
3 | Place Name One | 1001 Mercury Blvd, Suite A
4 | Place Name, One | 1001 Mercury Boulevard
5 | Place Nam Two | 2388 Jupiter Street, Rm 101
What I would like to do is in SQL (this is mssql), if possible, is do a query that is like:
SELECT DISTINCT place_name, place_street where [the first 4 letters of the place_name are the same] && [the first 4 characters of the place_street are the same].
to, I guess at this point, get:
Plac | 1001
Plac | 2388
Basically, then I can figure out what are the main addresses I have to break out into another table to normalize this, because the rest are just slight derivations.
I hope that makes sense.
I've done some research and I see people using regular expressions in SQL, but a lot of them seem to be using C scripts or something. Do I have to write regex functions and save them into the SQL Server before executing any regular expressions?
Any direction on whether I can just write them in SQL or if I have another step to go through would be great.
Or on how to approach this problem.
Thanks in advance!

Use the SQL function LEFT:
SELECT DISTINCT LEFT(place_name, 4)

I don't think you need regular expressions to get the results you describe. You just want to trim the columns and group by the results, which will effectively give you distinct values.
SELECT left(place_name, 4), left(place_street, 4), count(*)
FROM AddressTable
GROUP BY left(place_name, 4), left(place_street, 4)
The count(*) column isn't necessary, but it gives you some idea of which values might have the most (possibly) duplicate address rows in common.
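The same prefix grouping can be checked quickly against the sample rows from the question. This is just an illustrative sketch in Python using collections.Counter; slicing the first four characters stands in for LEFT(column, 4).

```python
from collections import Counter

addresses = [
    (1, "Place Name One", "1001 Mercury Blvd"),
    (2, "Place Name Two", "2388 Jupiter Street"),
    (3, "Place Name One", "1001 Mercury Blvd, Suite A"),
    (4, "Place Name, One", "1001 Mercury Boulevard"),
    (5, "Place Nam Two", "2388 Jupiter Street, Rm 101"),
]

# Group by the first four characters of name and street, counting rows.
counts = Counter((name[:4], street[:4]) for _id, name, street in addresses)
for (name_prefix, street_prefix), n in sorted(counts.items()):
    print(name_prefix, "|", street_prefix, "|", n)
# Plac | 1001 | 3
# Plac | 2388 | 2
```

A four-character prefix is a blunt key: every place_name in this sample starts with "Plac", so only the street prefix actually separates the two groups here.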

I would recommend you look into Fuzzy Search Operations in SQL Server. You can match the results much better than what you are trying to do. Just google sql server fuzzy search.

Assuming at least SQL Server 2005 for the CTE:
;with cteCommonAddresses as (
select left(place_name, 4) as LeftName, left(place_street,4) as LeftStreet
from Address
group by left(place_name, 4), left(place_street,4)
having count(*) > 1
)
select a.id, a.place_name, a.place_street
from cteCommonAddresses c
inner join Address a
on c.LeftName = left(a.place_name,4)
and c.LeftStreet = left(a.place_street,4)
order by a.place_name, a.place_street, a.id

How to map combinations of things to a relational database?

I have a table whose records represent certain objects. For the sake of simplicity I am going to assume that the table only has one column, and that is the unique ObjectId. Now I need a way to store combinations of objects from that table. The combinations have to be unique, but can be of arbitrary length. For example, if I have the ObjectIds
1,2,3,4
I want to store the following combinations:
{1,2}, {1,3,4}, {2,4}, {1,2,3,4}
The ordering is not necessary. My current implementation is to have a table Combinations that maps ObjectIds to CombinationIds. So every combination receives a unique Id:
ObjectId | CombinationId
------------------------
1 | 1
2 | 1
1 | 2
3 | 2
4 | 2
This is the mapping for the first two combinations of the example above. The problem is that the query for finding the CombinationId of a specific Combination seems to be very complex. The two main usage scenarios for this table will be to iterate over all combinations, and to retrieve a specific combination. The table will be created once and never be updated. I am using SQLite through JDBC. Is there any simpler way or a best practice to implement such a mapping?

The problem is, that the query for finding the CombinationId of a specific Combination seems to be very complex.
Shouldn't be too bad. If you want all combinations containing the selected items (with additional items allowed), it's just something like:
SELECT combinationID
FROM Combination
WHERE objectId IN (1, 3, 4)
GROUP BY combinationID
HAVING COUNT(*) = 3 -- The number of items in the combination
If you need only the specific combination (no extra items allowed), it can be more like:
SELECT combinationID FROM (
-- ... query from above goes here, this gives us all with those 3
) AS candidates
-- This bit gives us a row for each item in the candidates, including
-- the items we know about but also any 'extras'
INNER JOIN combination ON (candidates.combinationID = combination.combinationID)
GROUP BY candidates.combinationID
HAVING COUNT(*) = 3 -- Because we joined back on ALL, ones with extras will have > 3
You can also use a NOT EXISTS here (or in the original query), this seemed easier to explain.
Finally you could also be fancy and have a single, simple query
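SELECT candidates.combinationID
FROM Combination AS candidates
INNER JOIN Combination AS allItems ON
(candidates.combinationID = allItems.combinationID)
WHERE candidates.objectId IN (1, 3, 4)
GROUP BY candidates.combinationID
HAVING COUNT(*) = 9 -- The number of items in the combination, squared
AND COUNT(DISTINCT candidates.objectId) = 3 -- All 3 items matched; otherwise 1 match in a 9-item combination would also give 9 rows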
So in other words, if we're looking for {1, 2}, and there's a combination with {1, 2, 3}, we'll have a {candidates, allItems} JOIN result of:
{1, 1}, {1, 2}, {1, 3}, {2, 1}, {2, 2}, {2, 3}
The extra 3 results in COUNT(*) being 6 rows after GROUPing, not 4, so we know that's not the combination we're after.
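Since the question mentions SQLite, here is a small sketch of an exact-match lookup that can be run through Python's sqlite3 module. It uses a conditional-aggregation variant rather than the answer's queries verbatim: a combination matches when its total size and its number of matching items both equal the size of the requested set. The helper name find_combination and combination 5 (which contains ObjectIds outside the question's example) are hypothetical, and the sketch assumes each (combinationId, objectId) pair is stored only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Combination (
        objectId INTEGER,
        combinationId INTEGER,
        PRIMARY KEY (combinationId, objectId)
    );
    INSERT INTO Combination VALUES
        (1, 1), (2, 1),
        (1, 2), (3, 2), (4, 2),
        (2, 3), (4, 3),
        (1, 4), (2, 4), (3, 4), (4, 4),
        (1, 5), (5, 5), (6, 5), (7, 5);  -- hypothetical extra combination
""")

def find_combination(object_ids):
    """Return the ids of combinations that contain exactly object_ids."""
    ids = sorted(set(object_ids))
    placeholders = ", ".join("?" for _ in ids)
    query = f"""
        SELECT combinationId
        FROM Combination
        GROUP BY combinationId
        HAVING COUNT(*) = ?
           AND SUM(CASE WHEN objectId IN ({placeholders}) THEN 1 ELSE 0 END) = ?
        ORDER BY combinationId
    """
    # Parameters follow the order of the ? markers in the query text.
    params = [len(ids), *ids, len(ids)]
    return [row[0] for row in conn.execute(query, params)]

print(find_combination([1, 2]))     # [1]
print(find_combination([4, 3, 1]))  # [2]
```

Combination 5 shares only one item with {1, 2} but has four items in total, which is the kind of case where checking a single row count alone would not be enough.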

This may be heresy, but for your usage scenarios it might work better to use a denormalized structure where you store the combinations themselves as some kind of composite (text) value:
CombinationId | Combination
---------------------------
1 | |1|2|
2 | |1|3|4|
If you make the rule that you always sort the ObjectIds when generating the composite value, it's easy to retrieve the Combination for a given set of Objects.
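A minimal sketch of that rule: build the composite value from the sorted, de-duplicated ObjectIds so that the same set always yields the same text. The function name combination_key is hypothetical; the |1|3|4| layout follows the table above.

```python
def combination_key(object_ids):
    """Build a canonical text key such as '|1|3|4|' from a set of ids."""
    ids = sorted(set(object_ids))
    return "|" + "|".join(str(i) for i in ids) + "|"

print(combination_key([3, 1, 4]))  # |1|3|4|
print(combination_key([2, 1]))     # |1|2|
```

With a unique index on the Combination column, looking up a specific combination becomes a single equality comparison on this key.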

Another option would be to use relation-valued attributes, which in SQL DBMSs are called multisets or nested tables.
Relation-valued attributes may make sense if there is no identifier for the set of objects other than the set itself. However, I don't think any SQL DBMS permits keys to be declared on columns of that type so that could be a problem if you don't have some alternative key you can use.
http://download.oracle.com/docs/cd/B10500_01/appdev.920/a96594/adobjbas.htm#458790
