SQL Server - matching attributes query - sql

SQL Server Gurus ...
Currently using MS SQL Server 2016
I know Joe Celko and all SQL purists are squirming at the thought of using bitmasks, but I have a use case in which I need to query for all widgets that contain a set of given attributes.
Each widget may contain several hundred attributes.
The attributes of a widget are either present or not (1 = present, 0 = not present).
One way I thought to do this is via bitmasks – the attributes to be found (a bitmask) could be ANDed with the attributes of each widget to find matches in a single operation. For example, the widgets table might be:
widgets table:
widget_uid Uniqueidentifier
attributes BigInt
SELECT widget_uid
FROM widgets
WHERE ( attributes & bitmask ) = bitmask;
The problem is that a BigInt limits the number of attributes to 64, while a widget can have several hundred. I could group the attributes in chunks of 64 bits, i.e.:
widgets table:
widget_uid Uniqueidentifier
attributes0 BigInt -- Attributes 0-63
attributes1 BigInt -- Attributes 64-127
attributes2 BigInt -- Attributes 128-191
SELECT widget_uid
FROM widgets
WHERE ( attributes0 & bitmask0 ) = bitmask0
AND ( attributes1 & bitmask1 ) = bitmask1
AND ( attributes2 & bitmask2 ) = bitmask2
... but was wondering if anyone has come up with a solution for bit operations using bitmasks with greater than 64 bits – or if other (more efficient?) solutions would exist?
In the use case, the widgets table does contain other columns, but I am only concerned with the attributes matching portion of the query at the moment.
Any and all ideas are welcome - would be interested in knowing how others tackle this particular problem.
Thanks in advance.
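For the chunked layout described in the question, the mask values for each BigInt column have to be computed somewhere. The following is a minimal sketch in Python, not part of the original post; the helper names `to_chunks` and `matches` are hypothetical. It assumes attribute numbers start at 0 and that each chunk is stored in a signed 64-bit column, so a chunk with bit 63 set is reinterpreted as a negative number.

```python
# Sketch: split a set of attribute indices into signed 64-bit chunks,
# one per BIGINT column (attributes0, attributes1, ...).

CHUNK_BITS = 64


def to_chunks(attribute_indices, n_chunks):
    """Return n_chunks signed 64-bit integers encoding the given bit indices."""
    chunks = [0] * n_chunks
    for idx in attribute_indices:
        chunk, bit = divmod(idx, CHUNK_BITS)
        chunks[chunk] |= 1 << bit  # raises IndexError if idx does not fit
    # BIGINT is signed: values with bit 63 set become negative numbers.
    return [c - (1 << 64) if c >= (1 << 63) else c for c in chunks]


def matches(widget_chunks, mask_chunks):
    """True when every bit set in the mask is also set in the widget."""
    return all((w & m) == m for w, m in zip(widget_chunks, mask_chunks))


widget = to_chunks({0, 63, 64, 130}, 3)
print(widget)                                      # [-9223372036854775807, 1, 4]
print(matches(widget, to_chunks({63, 130}, 3)))    # True
print(matches(widget, to_chunks({1, 130}, 3)))     # False
```

The per-chunk test is the same `(attributes & bitmask) = bitmask` comparison as in the WHERE clause above, applied once per column.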

We had a similar use case, on a significantly large data set. This was for an e-commerce site with products and attributes. Our case was a bit more complex than here, where we had any possible number of attributes and then values assigned to those attributes. e.g. Color - Red/Green/Blue, Size - S/M/L etc.
We found that association tables with good indexing were the key in our case. While this may not be an option for you, we found it to be the optimal solution for a dynamic data set.
I can code you up an example if you feel it will be helpful.
Edited to add example:
DROP TABLE IF EXISTS #Widgets
DROP TABLE IF EXISTS #Attributes
DROP TABLE IF EXISTS #WidgetAttributes
CREATE TABLE #Widgets (widget_UID UNIQUEIDENTIFIER PRIMARY KEY CLUSTERED, Name NVARCHAR(255))
CREATE TABLE #Attributes (Attribute_UID UNIQUEIDENTIFIER PRIMARY KEY CLUSTERED, Name NVARCHAR(255))
CREATE TABLE #WidgetAttributes (widget_UID UNIQUEIDENTIFIER,Attribute_UID UNIQUEIDENTIFIER)
CREATE NONCLUSTERED INDEX ix_WidgetAttribute ON #WidgetAttributes (Attribute_UID) INCLUDE (widget_UID)
INSERT INTO #Widgets (widget_UID, Name) values
( '{c63bea73-2331-4698-82c9-f71845ab8601}', N'Widget 1' ),
( '{a0865b8f-606b-4273-9207-39a8a26016c4}', N'Widget 2' ),
( '{211fe27e-ab98-4b61-83a3-3d006d66db5a}', N'Widget 3' )
INSERT INTO #Attributes (Attribute_UID, Name)
VALUES
( '{99354dc0-d0b2-4919-a887-edf115eeb1bd}', N'Height' ),
( '{136bbe4c-497d-472f-a905-670e4a7805d0}', N'Width' ),
( '{f006f950-30d1-453e-8e09-4f7d140fa3cb}', N'Depth' ),
( '{0d190639-677f-4b75-8d36-1bdac00de132}', N'Colour' )
-- Set links
-- Widget 1 All attributes
-- Widget 2 Height Width
-- Widget 3 Colour
INSERT INTO #WidgetAttributes (widget_UID, Attribute_UID)
SELECT '{c63bea73-2331-4698-82c9-f71845ab8601}',Attribute_UID FROM #Attributes
UNION ALL
SELECT '{a0865b8f-606b-4273-9207-39a8a26016c4}',Attribute_UID FROM #Attributes WHERE Name IN (N'Height', N'Width')
UNION ALL
SELECT '{211fe27e-ab98-4b61-83a3-3d006d66db5a}',Attribute_UID FROM #Attributes WHERE Name = 'Colour'
-- @SearchAttributes holds the list of attributes you are trying to find
DECLARE @SearchAttributes TABLE (Attribute_UID UNIQUEIDENTIFIER)
INSERT INTO @SearchAttributes
SELECT Attribute_UID FROM #Attributes WHERE Name<> 'Colour'
;WITH cte AS (
SELECT WA.widget_UID, COUNT(1) AttributesPresent FROM #WidgetAttributes WA
JOIN @SearchAttributes SA ON SA.Attribute_UID = WA.Attribute_UID
GROUP BY WA.widget_UID
)
SELECT cte.AttributesPresent
, W.widget_UID
, W.Name
FROM cte
JOIN #Widgets W ON W.widget_UID = cte.widget_UID
ORDER BY cte.AttributesPresent DESC
Gives an output of:
AttributesPresent widget_UID Name
----------------- ------------------------------------ ----------
3 C63BEA73-2331-4698-82C9-F71845AB8601 Widget 1
2 A0865B8F-606B-4273-9207-39A8A26016C4 Widget 2
We used an approach of counting how many attributes were present for each so we not only had the option of "exact match" but also "closest fit".
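The counting approach can be narrowed to an exact match by comparing each widget's count against the number of searched attributes. The sketch below uses Python's sqlite3 module only so that it runs anywhere; the simplified table names (`widget_attributes`, `search_attributes`) and the text keys are assumptions for illustration, not the schema from the answer. It relies on each (widget, attribute) pair being stored once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE widget_attributes (widget TEXT, attribute TEXT,
                                PRIMARY KEY (widget, attribute));
INSERT INTO widget_attributes VALUES
  ('Widget 1','Height'),('Widget 1','Width'),('Widget 1','Depth'),('Widget 1','Colour'),
  ('Widget 2','Height'),('Widget 2','Width'),
  ('Widget 3','Colour');
CREATE TABLE search_attributes (attribute TEXT PRIMARY KEY);
INSERT INTO search_attributes VALUES ('Height'),('Width'),('Depth');
""")

# Closest fit: how many of the searched attributes each widget has.
closest = conn.execute("""
SELECT wa.widget, COUNT(*) AS present
FROM widget_attributes AS wa
JOIN search_attributes AS sa ON sa.attribute = wa.attribute
GROUP BY wa.widget
ORDER BY present DESC, wa.widget
""").fetchall()

# Exact match: keep only widgets that have every searched attribute.
exact = conn.execute("""
SELECT wa.widget
FROM widget_attributes AS wa
JOIN search_attributes AS sa ON sa.attribute = wa.attribute
GROUP BY wa.widget
HAVING COUNT(*) = (SELECT COUNT(*) FROM search_attributes)
ORDER BY wa.widget
""").fetchall()

print(closest)  # [('Widget 1', 3), ('Widget 2', 2)]
print(exact)    # [('Widget 1',)]
```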

Using a bitmask in a database is the wrong approach. Even if you somehow manage to make it work, you will not be able to use indexes to speed up execution.
Use the standard solution, because this is a standard situation: an M:N relationship between Widgets and Attributes (both should be tables, of course). Add a third table that assigns Attributes to Widgets; you can call it WidgetAttributes.
It will have 3 columns: Id, WidgetId, AttributeId
Then you can, for example, get the list of Widgets that have a given Attribute:
select w.*
from Widgets w
inner join WidgetAttributes wa on wa.WidgetId = w.Id
inner join Attributes a on a.Id = wa.AttributeId
where a.AttributeName='xxx'

Related

Select records that match several tags

I implemented a standard tagging system on SQLite with two tables.
Table annotation:
CREATE TABLE IF NOT EXISTS annotation (
id INTEGER PRIMARY KEY,
comment TEXT
)
Table label:
CREATE TABLE IF NOT EXISTS label (
id INTEGER PRIMARY KEY,
annot_id INTEGER NOT NULL REFERENCES annotation(id),
tag TEXT NOT NULL
)
I can easily find the annotations that match tags 'tag1' OR 'tag2' :
SELECT * FROM annotation
JOIN label ON label.annot_id = annotation.id
WHERE label.tag IN ('tag1', 'tag2') GROUP BY annotation.id
But how do I select the annotations that match tags 'tag1' AND 'tag2'?
How do I select the annotations that match tags 'tag1' AND 'tag2' but NOT 'tag3'?
Should I use INTERSECT? Is it efficient or is there a better way to express these?
I would definitely go with INTERSECT for question 1 and EXCEPT for question 2. After many years of experience with SQL I find it best to go with whatever the platform offers in cases where it directly addresses what you want to do.
The only exception would be if you had a really good reason not to. INTERSECT and EXCEPT are part of standard SQL, but not every engine has supported them (MySQL lacked both for a long time), so check your target platform if portability matters.
If you want to go old school and use ONLY straight up SQL it is possible using subqueries, one for tag A, one for tag B, and one for tag C. Using an outer join with an "is null" condition is a common idiom to perform the exclusion.
Here is an sqlite example:
create table annotation (id integer, comment varchar);
create table label (id integer, annot_id integer, tag varchar);
insert into annotation values (1,'annot 1'),(2,'annot 2');
insert into label values (1,1,'tag1'),(2,1,'tag2'),(3,1,'tag2');
insert into label values (1,2,'tag1'),(2,2,'tag2'),(3,2,'tag3');
select distinct x.id,x.comment from annotation x
join label a on a.annot_id=x.id and a.tag='tag1'
join label b on b.annot_id=x.id and b.tag='tag2'
left join label c on c.annot_id=x.id and c.tag='tag3'
where
c.id is null;
This is set up so that both annotation 1 and annotation 2 have tag1 and tag2, but annotation 2 also has tag3 and so should be excluded. The output is only annotation 1:
id  comment
1   annot 1
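The answer recommends INTERSECT and EXCEPT but only shows the join-based form. A minimal sketch of the set-operator form follows, run through Python's sqlite3 module; the reduced `label` table (no `id` column) is an assumption made to keep the example short. SQLite evaluates compound operators left to right, so the second query reads as (tag1 INTERSECT tag2) EXCEPT tag3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE label (annot_id INTEGER NOT NULL, tag TEXT NOT NULL);
INSERT INTO label VALUES (1,'tag1'),(1,'tag2'),
                         (2,'tag1'),(2,'tag2'),(2,'tag3');
""")

# Annotations carrying both tag1 AND tag2.
both = conn.execute("""
SELECT annot_id FROM label WHERE tag = 'tag1'
INTERSECT
SELECT annot_id FROM label WHERE tag = 'tag2'
ORDER BY annot_id
""").fetchall()

# Annotations carrying tag1 AND tag2 but NOT tag3.
both_not_third = conn.execute("""
SELECT annot_id FROM label WHERE tag = 'tag1'
INTERSECT
SELECT annot_id FROM label WHERE tag = 'tag2'
EXCEPT
SELECT annot_id FROM label WHERE tag = 'tag3'
ORDER BY annot_id
""").fetchall()

print(both)            # [(1,), (2,)]
print(both_not_third)  # [(1,)]
```

To return full rows, the result can be joined back to `annotation` on `id`.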

DB: How to set up a many to many table(s) to handle multiple selectable conditions

I am working on a search filter for a website that will help users find a venue (for get-togethers and ceremonies) that meets their needs. Filters would include such things as style, amenities, event type, etc. Multiple options in a category can apply to a venue, so a user can select multiple options from the style, amenities and event type categories when searching.
My issue is in how I should approach the table design in the database. Currently I have a Venue table with a unique id and basic information, and a number of tables representing each category (style, amenities, etc) where they contain an id and name field.
I know that I need an intermediary table to hold foreign keys, so each option applicable to a category is associated to the venue.
Option 1: Create for each category table a many to many intermediary table with foreign keys to that category and the venue.
Option 2: Create one large intermediary table with foreign keys for every category, as well as the Venue
i.e.
fk_venue
fk_style
fk_amenities
...
I am trying to decide which is more efficient and less of a problem to code for. Option 1 would require a query to each table, which may become complicated to work with, whereas option 2 seems easier to query but might have a much larger number of records to handle a venue with many amenities AND event types, for example.
This doesn't seem like a new problem but I have had trouble finding resources that detail how best to approach this. We are currently using MSSQL for the DB and are building the site using .net core.
Go with option one. Create a join table to record the many-to-many relationships of each available feature of a venue. Option 2 is very wasteful in terms of storage: consider a venue with only one amenity when 50 amenity types are available. Also, as I understand what you are suggesting for option 2, you would have to update your database design each time you add an amenity, event_type, or style. That would be very difficult to support.
In the case of Option 1, some of the tables would be:
Table Name: venue_amenities
Columns: venue_id, amenity_id
Table Name: venue_event_types
Columns: venue_id, event_type_id
Table Name: venue_styles
Columns: venue_id, style_id
When you query everything with a filter, you could query it like:
select distinct
v.venue_id
from venues v
inner join venue_amenities va on v.venue_id = va.venue_id
inner join venue_event_types vet on v.venue_id = vet.venue_id
inner join venue_styles vs on v.venue_id = vs.venue_id
where va.amenity_id in ([selected amenities])
and vet.event_type_id in ([selected event types])
and vs.style_id in ([selected styles])
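The bracketed placeholders in the query above have to be filled in by the application. One way to do that without concatenating user input into the SQL text is to generate one parameter marker per selected value. The sketch below uses Python's sqlite3 module; `placeholders` and `find_venues` are hypothetical helper names, the sample data is made up, and every list is assumed to be non-empty.

```python
import sqlite3


def placeholders(values):
    """Return '?, ?, ?' with one parameter marker per value."""
    return ", ".join("?" for _ in values)


def find_venues(conn, amenities, event_types, styles):
    """Venues matching at least one selected option in every category."""
    sql = f"""
        SELECT DISTINCT v.venue_id
        FROM venues AS v
        JOIN venue_amenities   AS va  ON va.venue_id  = v.venue_id
        JOIN venue_event_types AS vet ON vet.venue_id = v.venue_id
        JOIN venue_styles      AS vs  ON vs.venue_id  = v.venue_id
        WHERE va.amenity_id     IN ({placeholders(amenities)})
          AND vet.event_type_id IN ({placeholders(event_types)})
          AND vs.style_id       IN ({placeholders(styles)})
        ORDER BY v.venue_id
    """
    params = [*amenities, *event_types, *styles]
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE venues (venue_id INTEGER PRIMARY KEY);
CREATE TABLE venue_amenities   (venue_id INTEGER, amenity_id INTEGER);
CREATE TABLE venue_event_types (venue_id INTEGER, event_type_id INTEGER);
CREATE TABLE venue_styles      (venue_id INTEGER, style_id INTEGER);
INSERT INTO venues VALUES (1),(2);
INSERT INTO venue_amenities   VALUES (1,10),(1,11),(2,11);
INSERT INTO venue_event_types VALUES (1,20),(2,21);
INSERT INTO venue_styles      VALUES (1,30),(2,30);
""")

print(find_venues(conn, [10, 11], [20, 21], [30]))  # [1, 2]
print(find_venues(conn, [10], [20, 21], [30]))      # [1]
```

Note that IN matches any of the selected options within a category; requiring all of them would need a count-based HAVING clause instead.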
Option 3: You could start out with a meta data design. This would allow you to have multiple records per item or entity.
Often these things evolve with the development of tasks, or the evolution of the process and learning the data or the customer understanding some of the finer details that are drawn out as time goes on.
I've seen similar designs used for hashtags or white lists; searching for those terms might get you closer to what you are looking for. Here is a working example to get you started.
declare @venue as table(
VenueID int identity(1,1) not null primary key clustered
, Name_ nvarchar(255) not null
, Address_ nvarchar(255) null
);
declare @venueType as table (
VenueTypeID int identity(1,1) not null primary key clustered
, VenueType nvarchar(255) not null
);
declare @venueStuff as table (
VenueStuffID int identity(1,1) not null primary key clustered
, VenueID int not null -- constraint back to venueid
, VenueTypeID int not null -- constraint to dim or lookup table for ... attribute types
, AttributeValue nvarchar(255) not null
);
insert into @venue (Name_)
select 'Bob''s Funhouse'
insert into @venueStuff (VenueID, VenueTypeID, AttributeValue)
select 1, 1, 'Scarrrrry' union all
select 1, 2, 'Food Available' union all
select 1, 3, 'Game tables provided' union all
select 1, 4, 'Creepy';
-- identity values 1..4 are meant to line up with the VenueTypeIDs used above
insert into @venueType (VenueType)
select 'Haunted House Theme' union all
select 'Concessions' union all
select 'Gaming' union all
select 'post apocalyptic';
select a.Name_
, b.AttributeValue
, c.VenueType
from @venue a
join @venueStuff b
on a.VenueID = b.VenueID
join @venueType c
on c.VenueTypeID = b.VenueTypeID

SQL query with pivot tables?

I'm trying to wrap my brain around how to use pivot tables for a query I need. I have 3 database tables. Showing relevant columns:
TableA: Columns = pName
TableB: Columns = GroupName, GroupID
TableC: Columns = pName, GroupID
TableA contains a list of names (John, Joe, Jack, Jane)
TableB contains a list of groups with an ID#. (Soccer|1, Hockey|2, Basketball|3)
TableC contains a list of the names and the group they belong to (John|1, John|3, Joe|2, Jack|1, Jack|2, Jack|3, Jane|3)
I need to create a matrix like grid view using a SQL query that would return a list of all the names from TableA (Y-axis) and a list of all the possible groups (X-axis). The cell values would be either true or false if they belong to the group.
Any help would be appreciated. I couldn't quite find an existing answer that helped.
You might try it like this
Here I set up a MCVE, please try to create this in your next question yourself...
DECLARE @Name TABLE (pName VARCHAR(100));
INSERT INTO @Name VALUES('John'),('Joe'),('Jack'),('Jane');
DECLARE @Group TABLE(gName VARCHAR(100),gID INT);
INSERT INTO @Group VALUES ('Soccer',1),('Hockey',2),('Basketball',3);
DECLARE @map TABLE(pName VARCHAR(100),gID INT);
INSERT INTO @map VALUES
('John',1),('John',3)
,('Joe',2)
,('Jack',1),('Jack',2),('Jack',3)
,('Jane',3);
This query will collect the values and perform the PIVOT
SELECT p.*
FROM
(
SELECT n.pName
,g.gName
,'x' AS IsInGroup
FROM @map AS m
INNER JOIN @Name AS n ON m.pName=n.pName
INNER JOIN @Group AS g ON m.gID=g.gID
) AS x
PIVOT
(
MAX(IsInGroup) FOR gName IN(Soccer,Hockey,Basketball)
) as p
This is the result.
pName Soccer Hockey Basketball
Jack x x x
Jane NULL NULL x
Joe NULL x NULL
John x NULL x
Some hints:
You might use 1 and 0 instead of x, as SQL Server does not have a real boolean type
You should add a pID to your names. Never join tables on real data (unless it is something unique and unchangeable [which means never, actually!!!])
UPDATE dynamic SQL (thx to @djlauk)
If you want a query which deals with any number of groups, you have to do this dynamically. But please be aware that you lose the chance to use this in ad-hoc SQL, like in a VIEW or an inline TVF, which is quite a big drawback...
CREATE TABLE #Name(pName VARCHAR(100));
INSERT INTO #Name VALUES('John'),('Joe'),('Jack'),('Jane');
CREATE TABLE #Group(gName VARCHAR(100),gID INT);
INSERT INTO #Group VALUES ('Soccer',1),('Hockey',2),('Basketball',3);
CREATE TABLE #map(pName VARCHAR(100),gID INT);
INSERT INTO #map VALUES
('John',1),('John',3)
,('Joe',2)
,('Jack',1),('Jack',2),('Jack',3)
,('Jane',3);
DECLARE @ListOfGroups VARCHAR(MAX)=
(
STUFF
(
(
SELECT DISTINCT ',' + QUOTENAME(gName)
FROM #Group
FOR XML PATH('')
),1,1,''
)
);
DECLARE @sql VARCHAR(MAX)=
(
'SELECT p.*
FROM
(
SELECT n.pName
,g.gName
,''x'' AS IsInGroup
FROM #map AS m
INNER JOIN #Name AS n ON m.pName=n.pName
INNER JOIN #Group AS g ON m.gID=g.gID
) AS x
PIVOT
(
MAX(IsInGroup) FOR gName IN(' + @ListOfGroups + ')
) as p');
EXEC(@sql);
GO
DROP TABLE #map;
DROP TABLE #Group;
DROP TABLE #Name;
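On engines without a PIVOT operator, the same matrix can be produced with conditional aggregation, generating one `MAX(CASE ...)` column per group. The sketch below is an alternative technique, not the answer's method. It uses Python's sqlite3 module, and the table names `names`, `group_list` and `map` and the helper `quote_ident` are assumptions for illustration. It returns 1/0 instead of 'x'/NULL and uses a LEFT JOIN so that names without any group still appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names      (pName TEXT);
CREATE TABLE group_list (gName TEXT, gID INTEGER);
CREATE TABLE map        (pName TEXT, gID INTEGER);
INSERT INTO names      VALUES ('John'),('Joe'),('Jack'),('Jane');
INSERT INTO group_list VALUES ('Soccer',1),('Hockey',2),('Basketball',3);
INSERT INTO map VALUES ('John',1),('John',3),('Joe',2),
                       ('Jack',1),('Jack',2),('Jack',3),('Jane',3);
""")


def quote_ident(name):
    """Quote an identifier for SQLite by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


# Build one output column per group, in gID order.
groups = conn.execute("SELECT gID, gName FROM group_list ORDER BY gID").fetchall()
columns = ",\n       ".join(
    f"MAX(CASE WHEN m.gID = {int(gid)} THEN 1 ELSE 0 END) AS {quote_ident(gname)}"
    for gid, gname in groups
)
sql = f"""
SELECT n.pName,
       {columns}
FROM names AS n
LEFT JOIN map AS m ON m.pName = n.pName
GROUP BY n.pName
ORDER BY n.pName
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
# ('Jack', 1, 1, 1)
# ('Jane', 0, 0, 1)
# ('Joe', 0, 1, 0)
# ('John', 1, 0, 1)
```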
I suspect it may be laborious to keep the pivot up to date if categories are added. Or maybe I just prefer Excel (if you ignore one major advantage). The following approach could be helpful too, assuming you do have Office 365.
I added the three tables using 3 CREATE TABLE statements and 3 INSERT statements based on the code I saw above. (The solutions make use of temporary tables to insert specific values, but I believe you already have the data in your three tables, called TableA, TableB, TableC).
CREATE TABLE TestName (pName VARCHAR(100));
INSERT INTO TestName VALUES('John'),('Joe'),('Jack'),('Jane');
CREATE TABLE TestGroup (gName VARCHAR(100),gID INT);
INSERT INTO TestGroup VALUES ('Soccer',1),('Hockey',2),('Basketball',3);
CREATE TABLE Testmap (pName VARCHAR(100),gID INT);
INSERT INTO Testmap VALUES
('John',1),('John',3)
,('Joe',2)
,('Jack',1),('Jack',2),('Jack',3)
,('Jane',3);
Then, in MS Excel, I added (there may be a shorter sequence but I'm still exploring) the three tables as queries from database > sql server database. After adding them, I added all three to the Data Model (I can elaborate if you ask).
I then inserted PivotTable from the ribbon, chose External data source, but opened the Tables tab (instead of Connections tab), to find my data model (mine was top of the list) and I clicked Open. At some point Excel prompted me to create relationships between tables and it did a good job of auto generating them for me.
After minor tweaks my PivotTable came out like this (I could also ask Excel to show the data as a PivotChart).
[Image: PivotTable showing groups as columns and names as rows.]
The advantage is that you don't have to revisit the PIVOT code in SQL if the list (of groups) changes. As I think someone else mentioned, consider using ids for pName as well, or another way to ensure that you are not stuck the next day if you have two persons named John or Jack.
In Excel you can choose when to refresh the data (or the pivot) and, after refresh, any additional categories will be added and counted.

What is the equivalent PostgreSQL syntax to Oracle's CONNECT BY ... START WITH?

In Oracle, if I have a table defined as …
CREATE TABLE taxonomy
(
key NUMBER(11) NOT NULL CONSTRAINT taxPkey PRIMARY KEY,
value VARCHAR2(255),
taxHier NUMBER(11)
);
ALTER TABLE
taxonomy
ADD CONSTRAINT
taxTaxFkey
FOREIGN KEY
(taxHier)
REFERENCES
taxonomy(key);
With these values …
key value taxHier
0 zero null
1 one 0
2 two 0
3 three 0
4 four 1
5 five 2
6 six 2
This query syntax …
SELECT
value
FROM
taxonomy
CONNECT BY
PRIOR key = taxHier
START WITH
key = 0;
Will yield …
zero
one
four
two
five
six
three
How is this done in PostgreSQL?
Use a RECURSIVE CTE in Postgres:
WITH RECURSIVE cte AS (
SELECT key, value, 1 AS level
FROM taxonomy
WHERE key = 0
UNION ALL
SELECT t.key, t.value, c.level + 1
FROM cte c
JOIN taxonomy t ON t.taxHier = c.key
)
SELECT value
FROM cte
ORDER BY level;
Details and links to documentation in my previous answer:
Does PostgreSQL have a pseudo-column like "LEVEL" in Oracle?
Or you can install the additional module tablefunc which provides the function connectby() doing almost the same. See Stradas' answer for details.
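Ordering by level returns the rows breadth-first, whereas the Oracle output in the question is depth-first (each parent followed by its children). One common way to get that order from a recursive CTE is to carry a sortable path along. The sketch below runs the idea on SQLite through Python's sqlite3 module, purely so it is self-contained; the zero-padded text path is an assumption of this example, and in PostgreSQL an array column is often used for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxonomy (key INTEGER PRIMARY KEY, value TEXT, taxHier INTEGER);
INSERT INTO taxonomy VALUES
  (0,'zero',NULL),(1,'one',0),(2,'two',0),(3,'three',0),
  (4,'four',1),(5,'five',2),(6,'six',2);
""")

rows = conn.execute("""
WITH RECURSIVE cte(key, value, path) AS (
    -- anchor: the START WITH row
    SELECT key, value, printf('%010d', key)
    FROM taxonomy
    WHERE key = 0
    UNION ALL
    -- recursive step: children of rows already found
    SELECT t.key, t.value, c.path || '/' || printf('%010d', t.key)
    FROM cte AS c
    JOIN taxonomy AS t ON t.taxHier = c.key
)
SELECT value FROM cte ORDER BY path
""").fetchall()

print([value for (value,) in rows])
# ['zero', 'one', 'four', 'two', 'five', 'six', 'three']
```

Siblings are ordered by key here because the key is what gets appended to the path.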
Postgres does have an equivalent to CONNECT BY. You will need to enable the module; it's turned off by default.
It is called tablefunc. It supports some cool crosstab functionality as well as the familiar "connect by" and "Start With". I have found it works much more eloquently and logically than the recursive CTE. If you can't get this turned on by your DBA, you should go for the way Erwin is doing it.
It is robust enough to do the "bill of materials" type query as well.
Tablefunc can be turned on by running this command:
CREATE EXTENSION tablefunc;
Here is the list of connection fields freshly lifted from the official documentation.
Parameter: Description
relname: Name of the source relation (table)
keyid_fld: Name of the key field
parent_keyid_fld: Name of the parent-key field
orderby_fld: Name of the field to order siblings by (optional)
start_with: Key value of the row to start at
max_depth: Maximum depth to descend to, or zero for unlimited depth
branch_delim: String to separate keys with in branch output (optional)
You really should take a look at the docs page. It is well written and it will give you the options you are used to. (On the doc page scroll down, it's near the bottom.)
PostgreSQL "Connect by" extension
Below is the description of what putting that structure together should be like. There is a ton of potential so I won't do it justice, but here is a snip of it to give you an idea.
connectby(text relname, text keyid_fld, text parent_keyid_fld
[, text orderby_fld ], text start_with, int max_depth
[, text branch_delim ])
A real query will look like this. connectby_tree is the name of the table. The line starting with "AS" is how you name the columns. It does look a little upside down.
SELECT * FROM connectby('connectby_tree', 'keyid', 'parent_keyid', 'pos', 'row2', 0, '~')
AS t(keyid text, parent_keyid text, level int, branch text, pos int);
As indicated by Stradas, here is the query for the taxonomy table:
SELECT value
FROM connectby('taxonomy', 'key', 'taxHier', '0', 0, '~')
AS c(keyid numeric, parent_keyid numeric, level int, branch text)
inner join taxonomy t on t.key = c.keyid;
For example, suppose we have a PostgreSQL table named product_types with the columns (id, parent_id, name, sort_order).
The first SELECT returns the root row.
id = 76 is the top-level parent record in this example.
with recursive product_tree as (
select
pt0.id,
pt0.parent_id,
pt0.name,
pt0.sort_order,
0 AS level
from product_types pt0
where pt0.id = 76
UNION ALL
select
pt1.id,
pt1.parent_id,
pt1.name,
pt1.sort_order, (product_tree.level + 1) as level
from product_types pt1
inner join product_tree on (pt1.parent_id = product_tree.id )
)
select
*
from product_tree
order by level, sort_order

Join a table to itself

This is the template for one of my database tables.
Id int PK
Title nvarchar(10) unique
ParentId int
This is my question: is there a problem if I create a relation between the "Id" and "ParentId" columns?
(I mean a relation from the table to itself.)
I need some advice about problems that may occur during insert, update or delete operations at the development stage. Thanks.
You can perfectly well join the table with itself.
You should be aware, however, that your design allows you to have multiple levels of hierarchy. Since you are using SQL Server (assuming 2005 or higher), you can have a recursive CTE get your tree structure.
Proof of concept preparation:
declare @YourTable table (id int, parentid int, title varchar(20))
insert into @YourTable values
(1,null, 'root'),
(2,1, 'something'),
(3,1, 'in the way'),
(4,1, 'she moves'),
(5,3, ''),
(6,null, 'I don''t know'),
(7,6, 'Stick around');
Query 1 - Node Levels:
with cte as (
select Id, ParentId, Title, 1 level
from @YourTable where ParentId is null
union all
select yt.Id, yt.ParentId, yt.Title, cte.level + 1
from @YourTable yt inner join cte on cte.Id = yt.ParentId
)
select cte.*
from cte
order by level, id, Title
No, there will not be any problem; you can self join your table. Which kinds of problems in insert, update and delete operations do you mean? You can check conditions such as whether the ParentId exists before adding a new record, or whether any child exists before deleting a parent.
You can do a self join like this (YourTable stands in for your table name):
select t1.Title, t2.Title as 'ParentName'
from YourTable t1
left join YourTable t2
on t1.ParentId = t2.Id
You've got plenty of good answers here. One other thing to consider is referential integrity. You can have a foreign key on a table that points to another column in the same table. Observe:
CREATE TABLE tempdb.dbo.t
(
Id INT NOT NULL ,
CONSTRAINT PK_t PRIMARY KEY CLUSTERED ( Id ) ,
ParentId INT NULL ,
CONSTRAINT FK_ParentId FOREIGN KEY ( ParentId ) REFERENCES tempdb.dbo.t ( Id )
)
By doing this, you ensure that you're not going to get garbage in the ParentId column.
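To see what such a self-referencing foreign key rejects, here is a small sketch run on SQLite through Python's sqlite3 module (chosen only so it runs without a server; the answer above targets SQL Server). The helper `rejected` is a hypothetical name. SQLite leaves foreign key enforcement off unless the pragma is set, so that line matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute("""
CREATE TABLE t (
    Id       INTEGER PRIMARY KEY,
    ParentId INTEGER NULL REFERENCES t (Id)
)
""")
conn.execute("INSERT INTO t VALUES (1, NULL)")  # top-level row, no parent
conn.execute("INSERT INTO t VALUES (2, 1)")     # child of row 1


def rejected(sql):
    """Return True when the statement violates a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


print(rejected("INSERT INTO t VALUES (3, 99)"))  # True: parent 99 does not exist
print(rejected("DELETE FROM t WHERE Id = 1"))    # True: row 1 still has a child
print(rejected("INSERT INTO t VALUES (3, 2)"))   # False: parent 2 exists
```

With the constraint in place, the existence checks before insert and delete are enforced by the database rather than by application code.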
It's called a self join, and it can be used on a table as in the following example
select e1.emp_name 'manager',e2.emp_name 'employee'
from employees e1 join employees e2
on e1.emp_id=e2.emp_manager_id
I have seen this done without errors before on a table for a menu hierarchy. You shouldn't have any issues, provided your insert / update / delete queries are well written.
For instance, when you insert, check that the parent id exists; when you delete, either delete all children too (if that is appropriate) or do not allow deletion of items that have children.
It is fine to do this (it's a not uncommon pattern). You must ensure that you are adding a child record to a parent record that actually exists etc., but there's nothing different here from any other constraint.
You may want to look at recursive common table expressions as a way of querying an entire 'tree' of records:
http://msdn.microsoft.com/en-us/library/ms186243.aspx
This is not a problem, as this is a relationship that's common in real life. If you do not have a parent (which happens at the top level), you need to keep this field "null", only then do update and delete propagation work properly.