Currently I'm having the following table structure.
Master table Documents:
ID
Filename
1
document1.pdf
2
document2.pdf
3
document3.pdf
Detail table Keywords:
ID
DocumentID
Keyword
1
1
KeywordA
2
1
KeywordB
3
1
KeywordC
4
2
KeywordB
5
3
KeywordA
6
3
KeywordD
Code to create this:
CREATE TABLE Documents (
ID int IDENTITY(1,1) PRIMARY KEY,
Filename nvarchar(255) NOT NULL
);
CREATE TABLE Keywords (
ID int IDENTITY(1,1) PRIMARY KEY,
DocumentID int NOT NULL,
Keyword nvarchar(255) NOT NULL
);
INSERT INTO Documents(Filename) VALUES
('document1.pdf'), ('document2.pdf'), ('document3.pdf');
INSERT INTO Keywords(DocumentID, Keyword) VALUES
(1, 'KeywordA'),
(1, 'KeywordB'),
(1, 'KeywordC'),
(2, 'KeywordB'),
(3, 'KeywordA'),
(3, 'KeywordD');
SQL Fiddle for this.
Finding with one keyword
I'm looking for a way to get all documents matching a certain keyword.
This could be e.g. written with the following T-SQL query:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE Keywords.Keyword = 'KeywordA'
)
This works successfully.
Finding with multiple keywords
What I'm currently stuck with is when I want to find all documents that match multiple keyword, combined with logical AND.
E.g. find a document that has three detail records with keyword A, B and C.
I think the following might work, but I don't know whether this performant or elegant at all:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE
Keywords.Keyword = 'KeywordA' OR
Keywords.Keyword = 'KeywordB'
GROUP BY Keywords.DocumentID HAVING COUNT(*) = 2
)
SQL Fiddle for that.
My question
How to write a (performant) SQL query to find all documents that have multiple keywords associated.
If it is easier, a solution with a constant number of keywords (e.g. 3) would be sufficient.
I hope the following query can help you
SELECT D.ID
FROM Documents D
JOIN Keywords K ON K.DocumentID = D.ID
WHERE K.Keyword IN ('KeywordA', 'KeywordB', 'KeywordC')
GROUP BY D.ID
HAVING COUNT(DISTINCT K.Keyword) = 3
Demo
The technique you are trying to do is called Relational Division With Remainder, in other words: find all groups which contain a particular set of rows.
Your current query is one of the standard ways of doing this, there are others.
If you had the keywords in a table variable or TVP, ...
DECLARE #keywords AS TABLE (Keyword varchar(50));
INSERT #keywords VALUES
('KeywordA'), ('KeywordB'), ('KeywordC');
... you could make it much neater with the following:
SELECT d.*
FROM Documents d
WHERE d.ID IN
(
SELECT k.DocumentID
FROM Keywords k
JOIN #keywords kt ON kt.Keyword = k.Keyword
GROUP BY k.DocumentID
HAVING COUNT(*) = (SELECT COUNT(*) FROM #keywords)
);
Another option:
SELECT d.*
FROM Documents d
WHERE EXISTS (SELECT 1
FROM #keywords kt
LEFT JOIN Keywords k ON kt.Keyword = k.Keyword
AND k.DocumentID = d.ID
HAVING COUNT(*) = COUNT(k.keywords) -- there are no missing matches
);
And another, slightly confusing one:
SELECT d.*
FROM Documents d
WHERE NOT EXISTS (SELECT 1
FROM #keywords kt
WHERE NOT EXISTS (SELECT 1
FROM Keywords k
WHERE k.Keyword = kt.Keyword
AND K.DocumentID = d.ID
)
);
-- For each document, there are no keywords for which there is no match
A simple case: we have a main table (id, name, dictionary_id) and a dictionary table (id, name). The dictionary_id is the FK related with the dictionary table. NULL values are possible for dictionary_id.
I need to show:
id, name, dictionary_name
if dictionary_id is null (in this case dictionary_name is empty) or
dictionary_id comes from a list (e.g. from subquery).
Thanks,
Jacek
You are looking for an outer join:
select mt.id, mt.name, dt.name as dictionary_name
from main_table mt
left join dictionary_table dt on mt.dictionary_id = dt.id;
I have three tables:
group:
id - primary key
name - varchar
profile:
id - primary key
name - varchar
surname - varchar
[...etc...]
profile_group:
profile_id - integer, foreign key to table profile
group_id - integer, foreign key to table group
Profiles may be in many groups. I have group named "Users" with id=1 and I want to assign all users to this group but only if there was no such entry for the table profiles.
How to do it?
If I understood you correctly, you want to add entries like (profile_id, 1) into profile_group table for all profiles, that were not in this table before. If so, try this:
INSERT INTO profile_group(profile_id, group_id)
SELECT id, 1 FROM profile p
LEFT JOIN profile_group pg on (p.id=pg.profile_id)
WHERE pg.group_id IS NULL;
What you want to do is use a left join to the profile group table and then exclude any matching records (this is done in the where clause of the below SQL statement).
This is faster than using not in (select xxx) since the query profiler seems to handle it better (in my experience)
insert into profile_group (profile_id, group_id)
select p.id, 1
from profiles p
left join profile_group pg on p.id = pg.profile_id
and pg.group_id = 1
where pg.profile_id is null
For a class project, a few others and I have decided to make a (very ugly) limited clone of StackOverflow. For this purpose, we're working on one query:
Home Page: List all the questions, their scores (calculated from votes), and the user corresponding to their first revision, and the number of answers, sorted in date-descending order according to the last action on the question (where an action is an answer, an edit of an answer, or an edit of the question).
Now, we've gotten the entire thing figured out, except for how to represent tags on questions. We're currently using a M-N mapping of tags to questions like this:
CREATE TABLE QuestionRevisions (
id INT IDENTITY NOT NULL,
question INT NOT NULL,
postDate DATETIME NOT NULL,
contents NTEXT NOT NULL,
creatingUser INT NOT NULL,
title NVARCHAR(200) NOT NULL,
PRIMARY KEY (id),
CONSTRAINT questionrev_fk_users FOREIGN KEY (creatingUser) REFERENCES
Users (id) ON DELETE CASCADE,
CONSTRAINT questionref_fk_questions FOREIGN KEY (question) REFERENCES
Questions (id) ON DELETE CASCADE
);
CREATE TABLE Tags (
id INT IDENTITY NOT NULL,
name NVARCHAR(45) NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE QuestionTags (
tag INT NOT NULL,
question INT NOT NULL,
PRIMARY KEY (tag, question),
CONSTRAINT qtags_fk_tags FOREIGN KEY (tag) REFERENCES Tags(id) ON
DELETE CASCADE,
CONSTRAINT qtags_fk_q FOREIGN KEY (question) REFERENCES Questions(id) ON
DELETE CASCADE
);
Now, for this query, if we just join to QuestionTags, then we'll get the questions and titles over and over and over again. If we don't, then we have an N query scenario, which is just as bad. Ideally, we'd have something where the result row would be:
+-------------+------------------+
| Other Stuff | Tags |
+-------------+------------------+
| Blah Blah | TagA, TagB, TagC |
+-------------+------------------+
Basically -- for each row in the JOIN, do a string join on the resulting tags.
Is there a built in function or similar which can accomplish this in T-SQL?
Here's one possible solution using recursive CTE:
The methods used are explained here
TSQL to set up the test data (I'm using table variables):
DECLARE #QuestionRevisions TABLE (
id INT IDENTITY NOT NULL,
question INT NOT NULL,
postDate DATETIME NOT NULL,
contents NTEXT NOT NULL,
creatingUser INT NOT NULL,
title NVARCHAR(200) NOT NULL)
DECLARE #Tags TABLE (
id INT IDENTITY NOT NULL,
name NVARCHAR(45) NOT NULL
)
DECLARE #QuestionTags TABLE (
tag INT NOT NULL,
question INT NOT NULL
)
INSERT INTO #QuestionRevisions
(question,postDate,contents,creatingUser,title)
VALUES
(1,GETDATE(),'Contents 1',1,'TITLE 1')
INSERT INTO #QuestionRevisions
(question,postDate,contents,creatingUser,title)
VALUES
(2,GETDATE(),'Contents 2',2,'TITLE 2')
INSERT INTO #Tags (name) VALUES ('Tag 1')
INSERT INTO #Tags (name) VALUES ('Tag 2')
INSERT INTO #Tags (name) VALUES ('Tag 3')
INSERT INTO #Tags (name) VALUES ('Tag 4')
INSERT INTO #Tags (name) VALUES ('Tag 5')
INSERT INTO #Tags (name) VALUES ('Tag 6')
INSERT INTO #QuestionTags (tag,question) VALUES (1,1)
INSERT INTO #QuestionTags (tag,question) VALUES (3,1)
INSERT INTO #QuestionTags (tag,question) VALUES (5,1)
INSERT INTO #QuestionTags (tag,question) VALUES (4,2)
INSERT INTO #QuestionTags (tag,question) VALUES (2,2)
Here's the action part:
;WITH CTE ( id, taglist, tagid, [length] )
AS ( SELECT question, CAST( '' AS VARCHAR(8000) ), 0, 0
FROM #QuestionRevisions qr
GROUP BY question
UNION ALL
SELECT qr.id
, CAST(taglist + CASE WHEN [length] = 0 THEN '' ELSE ', ' END + t.name AS VARCHAR(8000) )
, t.id
, [length] + 1
FROM CTE c
INNER JOIN #QuestionRevisions qr ON c.id = qr.question
INNER JOIN #QuestionTags qt ON qr.question=qt.question
INNER JOIN #Tags t ON t.id=qt.tag
WHERE t.id > c.tagid )
SELECT id, taglist
FROM ( SELECT id, taglist, RANK() OVER ( PARTITION BY id ORDER BY length DESC )
FROM CTE ) D ( id, taglist, rank )
WHERE rank = 1;
This was the solution I ended up settling on. I checkmarked Mack's answer because it works with arbitrary numbers of tags, and because it matches what I asked for in my question. I ended up though going with this, however, simply because I understand what this is doing, while I have no idea how Mack's works :)
WITH tagScans (qRevId, tagName, tagRank)
AS (
SELECT DISTINCT
QuestionTags.question AS qRevId,
Tags.name AS tagName,
ROW_NUMBER() OVER (PARTITION BY QuestionTags.question ORDER BY Tags.name) AS tagRank
FROM QuestionTags
INNER JOIN Tags ON Tags.id = QuestionTags.tag
)
SELECT
Questions.id AS id,
Questions.currentScore AS currentScore,
answerCounts.number AS answerCount,
latestRevUser.id AS latestRevUserId,
latestRevUser.caseId AS lastRevUserCaseId,
latestRevUser.currentScore AS lastRevUserScore,
CreatingUsers.userId AS creationUserId,
CreatingUsers.caseId AS creationUserCaseId,
CreatingUsers.userScore AS creationUserScore,
t1.tagName AS tagOne,
t2.tagName AS tagTwo,
t3.tagName AS tagThree,
t4.tagName AS tagFour,
t5.tagName AS tagFive
FROM Questions
INNER JOIN QuestionRevisions ON QuestionRevisions.question = Questions.id
INNER JOIN
(
SELECT
Questions.id AS questionId,
MAX(QuestionRevisions.id) AS maxRevisionId
FROM Questions
INNER JOIN QuestionRevisions ON QuestionRevisions.question = Questions.id
GROUP BY Questions.id
) AS LatestQuestionRevisions ON QuestionRevisions.id = LatestQuestionRevisions.maxRevisionId
INNER JOIN Users AS latestRevUser ON latestRevUser.id = QuestionRevisions.creatingUser
INNER JOIN
(
SELECT
QuestionRevisions.question AS questionId,
Users.id AS userId,
Users.caseId AS caseId,
Users.currentScore AS userScore
FROM Users
INNER JOIN QuestionRevisions ON QuestionRevisions.creatingUser = Users.id
INNER JOIN
(
SELECT
MIN(QuestionRevisions.id) AS minQuestionRevisionId
FROM Questions
INNER JOIN QuestionRevisions ON QuestionRevisions.question = Questions.id
GROUP BY Questions.id
) AS QuestionGroups ON QuestionGroups.minQuestionRevisionId = QuestionRevisions.id
) AS CreatingUsers ON CreatingUsers.questionId = Questions.id
INNER JOIN
(
SELECT
COUNT(*) AS number,
Questions.id AS questionId
FROM Questions
INNER JOIN Answers ON Answers.question = Questions.id
GROUP BY Questions.id
) AS answerCounts ON answerCounts.questionId = Questions.id
LEFT JOIN tagScans AS t1 ON t1.qRevId = QuestionRevisions.id AND t1.tagRank = 1
LEFT JOIN tagScans AS t2 ON t2.qRevId = QuestionRevisions.id AND t2.tagRank = 2
LEFT JOIN tagScans AS t3 ON t3.qRevId = QuestionRevisions.id AND t3.tagRank = 3
LEFT JOIN tagScans AS t4 ON t4.qRevId = QuestionRevisions.id AND t4.tagRank = 4
LEFT JOIN tagScans AS t5 ON t5.qRevId = QuestionRevisions.id AND t5.tagRank = 5
ORDER BY QuestionRevisions.postDate DESC
This is a common question that comes up quite often phrased in a number of different ways (concatenate rows as string, merge rows as string, condense rows as string, combine rows as string, etc.). There are two generally accepted ways to handle combining an arbitrary number of rows into a single string in SQL Server.
The first, and usually the easiest, is to abuse XML Path combined with the STUFF function like so:
select rsQuestions.QuestionID,
stuff((select ', '+ rsTags.TagName
from #Tags rsTags
inner join #QuestionTags rsMap on rsMap.TagID = rsTags.TagID
where rsMap.QuestionID = rsQuestions.QuestionID
for xml path(''), type).value('.', 'nvarchar(max)'), 1, 1, '')
from #QuestionRevisions rsQuestions
Here is a working example (borrowing some slightly modified setup from Mack). For your purposes you could store the results of that query in a common table expression, or in a subquery (I'll leave that as an exercise).
The second method is to use a recursive common table expression. Here is an annotated example of how that would work:
--NumberedTags establishes a ranked list of tags for each question.
--The key here is using row_number() or rank() partitioned by the particular question
;with NumberedTags (QuestionID, TagString, TagNum) as
(
select QuestionID,
cast(TagName as nvarchar(max)) as TagString,
row_number() over (partition by QuestionID order by rsTags.TagID) as TagNum
from #QuestionTags rsMap
inner join #Tags rsTags on rsTags.TagID = rsMap.TagID
),
--TagsAsString is the recursive query
TagsAsString (QuestionID, TagString, TagNum) as
(
--The first query in the common table expression establishes the anchor for the
--recursive query, in this case selecting the first tag for each question
select QuestionID,
TagString,
TagNum
from NumberedTags
where TagNum = 1
union all
--The second query in the union performs the recursion by joining the
--anchor to the next tag, and so on...
select NumberedTags.QuestionID,
TagsAsString.TagString + ', ' + NumberedTags.TagString,
NumberedTags.TagNum
from NumberedTags
inner join TagsAsString on TagsAsString.QuestionID = NumberedTags.QuestionID
and NumberedTags.TagNum = TagsAsString.TagNum + 1
)
--The result of the recursive query is a list of tag strings building up to the final
--string, of which we only want the last, so here we select the longest one which
--gives us the final result
select QuestionID, max(TagString)
from TagsAsString
group by QuestionID
And here is a working version. Again, you could use the results in a common table expression or subquery to join against your other tables to get your ultimate result. Hopefully the annotations help you understand a little more how the recursive common table expression works (though the link in Macks answer also goes into some detail about the method).
There is, of course, another way to do it, which doesn't handle an arbitrary number of rows, which is to join against your table aliased multiple times, which is what you did in your answer.
I have a problem and I dont know what is better solution.
Okay, I have 2 tables: posts(id, title), posts_tags(post_id, tag_id).
I have next task: must select posts with tags ids for example 4, 10 and 11.
Not exactly, post could have any other tags at the same time.
So, how I could do it more optimized? Creating temporary table in each query? Or may be some kind of stored procedure?
In the future, user could ask script to select posts with any count of tags (it could be 1 tag only or 10 at the same time) and I must be sure that method that I will choose would be the best method for my problem.
Sorry for my english, thx for attention.
This solution assumes that (post_id, tag_id) in post_tags is enforced to be UNIQUE:
SELECT id, title FROM posts
INNER JOIN post_tag ON post_tag.post_id = posts.id
WHERE tag_id IN (4, 6, 10)
GROUP BY id, title
HAVING COUNT(*) = 3
Although it's not a solution for all possible tag combinations, it's easy to create as dynamic SQL. To change for other sets of tags, change the IN () list to have all the tags, and the COUNT(*) = to check for the number of tags specified. The advantage of this solution over cascading a bunch of JOINs together is that you don't have to add JOINs, or even extra WHERE terms, when you change the request.
select id, title
from posts p, tags t
where p.id = t.post_id
and tag_id in ( 4,10,11 ) ;
?
Does this work?
select *
from posts
where post.post_id in
(select post_id
from post_tags
where tag_id = 4
and post_id in (select post_id
from post_tags
where tag_id = 10
and post_id in (select post_id
from post_tags
where tag_id = 11)))
You can do a time-storage trade-off by storing a one-way hash of the post's tag names sorted alphabetically.
When a post is tagged, execute select t.name from tags t inner join post_tags pt where pt.post_id = [ID_of_tagged_post] order by t.name. Concatenate all of the tag names, create a hash using the MD5 algorithm and insert the value into a column alongside your post (or into another table joined by a foreign key, if you prefer).
When you want to search for a specific combination of tags, simply execute (remembering to sort the tag names) select from posts p where p.taghash = MD5([concatenated_tag_string]).
This selects all posts that have any of the tags (4, 10, 11):
select distinct id, title from posts
where exists (
select * from posts_tags
where
post_id = id and
tag_id in (4, 10, 11))
Or you can use this:
select distinct id, title from posts
join posts_tags on post_id = id
where tag_id in (4, 10, 11)
(Both will be optimized the same way).
This selects all posts that have all of the tags (4, 10, 11):
select distinct id, title from posts
where not exists (
select * from posts_tags t1
where
t1.tag_id in (4, 10, 11) and
not exists (
select * from posts_tags as t2
where
t1.tag_id = t2.tag_id and
id = t2.post_id))
The list of tags in the in clause is what dynamically changes (in all cases).
But, this last query is not really fast, so you could use something like this instead:
create temporary table target_tags (tag_id int);
insert into target_tags values(4),(10),(11);
select id, title from posts
join posts_tags on post_id = id
join target_tags on target_tags.tag_id = posts_tags.tag_id
group by id, title
having count(*) = (select count(*) from target_tags);
drop table target_tags;
The part that changes dynamically is now in the second statement (the insert).