SQL Taxonomy Help

I have a database that relates content by taxonomy and I am trying to query that content by taxonomy. It looks like this:
Table 1
content_id, content_name
Table 2
content_id, content_taxonomy
What I am trying in my query is to find content with two or more types of related taxonomy. My query looks like this:
SELECT content_id FROM table_1 JOIN table_2 ON table_1.content_id=table_2.content_id WHERE content_taxonomy='ABC' AND content_taxonomy='123'
Except it returns nothing. I later tried a group by with:
SELECT content_id FROM table_1 JOIN table_2 ON table_1.content_id=table_2.content_id WHERE content_taxonomy='ABC' AND content_taxonomy='123' GROUP BY content_id, content_taxonomy
But that didn't work either. Any suggestions please?

SELECT *
FROM content c
WHERE (
SELECT COUNT(*)
FROM taxonomy t
WHERE t.content_id = c.content_id
AND t.content_taxonomy IN ('ABC', '123')
) = 2
Create a UNIQUE INDEX or a PRIMARY KEY on taxonomy (content_id, content_taxonomy) for this to work fast.
SELECT c.*
FROM (
SELECT content_id
FROM taxonomy
WHERE content_taxonomy IN ('ABC', '123')
GROUP BY
content_id
HAVING COUNT(*) = 2
) t
JOIN content c
ON c.content_id = t.content_id
In this case, create a UNIQUE INDEX or a PRIMARY KEY on taxonomy (content_taxonomy, content_id) (note the order of the fields).
Either solution can be more or less efficient than the other, depending on how many taxonomies each content item has and how likely a match is.
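To see why the original WHERE ... AND ... attempt returns nothing and how the two queries above behave, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration, and SQLite stands in for whatever database the question actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content  (content_id INTEGER PRIMARY KEY, content_name TEXT);
    CREATE TABLE taxonomy (content_id INTEGER, content_taxonomy TEXT,
                           PRIMARY KEY (content_id, content_taxonomy));
    -- Hypothetical sample data: only content 1 has both 'ABC' and '123'.
    INSERT INTO content  VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO taxonomy VALUES (1, 'ABC'), (1, '123'), (2, 'ABC'), (3, '123'), (3, 'XYZ');
""")

# The attempt from the question: a single taxonomy row can never hold
# two different values at once, so the AND condition is never true.
naive = conn.execute("""
    SELECT c.content_id
    FROM content c
    JOIN taxonomy t ON t.content_id = c.content_id
    WHERE t.content_taxonomy = 'ABC' AND t.content_taxonomy = '123'
""").fetchall()

# First approach: correlated COUNT(*) evaluated per content row.
correlated = conn.execute("""
    SELECT c.content_id
    FROM content c
    WHERE (SELECT COUNT(*)
           FROM taxonomy t
           WHERE t.content_id = c.content_id
             AND t.content_taxonomy IN ('ABC', '123')) = 2
    ORDER BY c.content_id
""").fetchall()

# Second approach: GROUP BY / HAVING on taxonomy, then join back to content.
grouped = conn.execute("""
    SELECT c.content_id
    FROM (SELECT content_id
          FROM taxonomy
          WHERE content_taxonomy IN ('ABC', '123')
          GROUP BY content_id
          HAVING COUNT(*) = 2) t
    JOIN content c ON c.content_id = t.content_id
    ORDER BY c.content_id
""").fetchall()

print(naive, correlated, grouped)
```

Both counting queries rely on (content_id, content_taxonomy) being unique, which is why the sketch declares it as the primary key.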

Related

Query for master records that have matching detail records

Currently I'm having the following table structure.
Master table Documents:
ID
Filename
1
document1.pdf
2
document2.pdf
3
document3.pdf
Detail table Keywords:
ID
DocumentID
Keyword
1
1
KeywordA
2
1
KeywordB
3
1
KeywordC
4
2
KeywordB
5
3
KeywordA
6
3
KeywordD
Code to create this:
CREATE TABLE Documents (
ID int IDENTITY(1,1) PRIMARY KEY,
Filename nvarchar(255) NOT NULL
);
CREATE TABLE Keywords (
ID int IDENTITY(1,1) PRIMARY KEY,
DocumentID int NOT NULL,
Keyword nvarchar(255) NOT NULL
);
INSERT INTO Documents(Filename) VALUES
('document1.pdf'), ('document2.pdf'), ('document3.pdf');
INSERT INTO Keywords(DocumentID, Keyword) VALUES
(1, 'KeywordA'),
(1, 'KeywordB'),
(1, 'KeywordC'),
(2, 'KeywordB'),
(3, 'KeywordA'),
(3, 'KeywordD');
Finding with one keyword
I'm looking for a way to get all documents matching a certain keyword.
This could be e.g. written with the following T-SQL query:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE Keywords.Keyword = 'KeywordA'
)
This works successfully.
Finding with multiple keywords
What I'm currently stuck on is finding all documents that match multiple keywords, combined with logical AND.
E.g. find a document that has three detail records with keywords A, B and C.
I think the following might work, but I don't know whether it is performant or elegant at all:
SELECT Documents.*
FROM Documents
WHERE Documents.ID IN
(
SELECT Keywords.DocumentID
FROM Keywords
WHERE
Keywords.Keyword = 'KeywordA' OR
Keywords.Keyword = 'KeywordB'
GROUP BY Keywords.DocumentID HAVING COUNT(*) = 2
)
My question
How to write a (performant) SQL query to find all documents that have multiple keywords associated.
If it is easier, a solution with a constant number of keywords (e.g. 3) would be sufficient.
I hope the following query can help you
SELECT D.ID
FROM Documents D
JOIN Keywords K ON K.DocumentID = D.ID
WHERE K.Keyword IN ('KeywordA', 'KeywordB', 'KeywordC')
GROUP BY D.ID
HAVING COUNT(DISTINCT K.Keyword) = 3
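The difference between COUNT(DISTINCT K.Keyword) and a plain COUNT(*) only shows up when the same keyword can be stored twice for one document, which the Keywords table in the question does not prevent. A small sketch of that case follows, using Python's sqlite3 module in place of SQL Server; the duplicate row for document 2 is a hypothetical addition to the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Keywords (ID INTEGER PRIMARY KEY,
                           DocumentID INTEGER NOT NULL,
                           Keyword TEXT NOT NULL);
    INSERT INTO Keywords (DocumentID, Keyword) VALUES
        (1, 'KeywordA'), (1, 'KeywordB'), (1, 'KeywordC'),
        (2, 'KeywordB'), (2, 'KeywordB'),   -- hypothetical duplicate row
        (3, 'KeywordA'), (3, 'KeywordD');
""")

def matching_documents(count_expr):
    """Return document IDs whose keyword count, computed with count_expr, equals 2."""
    sql = f"""
        SELECT DocumentID
        FROM Keywords
        WHERE Keyword IN ('KeywordA', 'KeywordB')
        GROUP BY DocumentID
        HAVING {count_expr} = 2
        ORDER BY DocumentID
    """
    return [row[0] for row in conn.execute(sql)]

# COUNT(*) counts the duplicate KeywordB row twice, so document 2 slips through.
with_star = matching_documents("COUNT(*)")
# COUNT(DISTINCT Keyword) counts each keyword once per document.
with_distinct = matching_documents("COUNT(DISTINCT Keyword)")

print(with_star, with_distinct)
```

If a unique constraint on (DocumentID, Keyword) is in place, both forms give the same answer.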
The technique you are trying to do is called Relational Division With Remainder, in other words: find all groups which contain a particular set of rows.
Your current query is one of the standard ways of doing this, but there are others.
If you had the keywords in a table variable or TVP, ...
DECLARE @keywords AS TABLE (Keyword varchar(50));
INSERT @keywords VALUES
('KeywordA'), ('KeywordB'), ('KeywordC');
... you could make it much neater with the following:
SELECT d.*
FROM Documents d
WHERE d.ID IN
(
SELECT k.DocumentID
FROM Keywords k
JOIN @keywords kt ON kt.Keyword = k.Keyword
GROUP BY k.DocumentID
HAVING COUNT(*) = (SELECT COUNT(*) FROM @keywords)
);
Another option:
SELECT d.*
FROM Documents d
WHERE EXISTS (SELECT 1
FROM @keywords kt
LEFT JOIN Keywords k ON kt.Keyword = k.Keyword
AND k.DocumentID = d.ID
HAVING COUNT(*) = COUNT(k.Keyword) -- there are no missing matches
);
And another, slightly confusing one:
SELECT d.*
FROM Documents d
WHERE NOT EXISTS (SELECT 1
FROM @keywords kt
WHERE NOT EXISTS (SELECT 1
FROM Keywords k
WHERE k.Keyword = kt.Keyword
AND k.DocumentID = d.ID
)
);
-- For each document, there are no keywords for which there is no match
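The double NOT EXISTS form is easier to follow when you can run it. Below is a rough sketch against the sample data from the question, using Python's sqlite3 module. SQLite has no table variables, so a temporary table named wanted (a name chosen here for illustration) plays the role of the keyword table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (ID INTEGER PRIMARY KEY, Filename TEXT NOT NULL);
    CREATE TABLE Keywords  (ID INTEGER PRIMARY KEY,
                            DocumentID INTEGER NOT NULL,
                            Keyword TEXT NOT NULL);
    INSERT INTO Documents (Filename) VALUES
        ('document1.pdf'), ('document2.pdf'), ('document3.pdf');
    INSERT INTO Keywords (DocumentID, Keyword) VALUES
        (1, 'KeywordA'), (1, 'KeywordB'), (1, 'KeywordC'),
        (2, 'KeywordB'), (3, 'KeywordA'), (3, 'KeywordD');
    -- Stand-in for the table variable / TVP holding the requested keywords.
    CREATE TEMP TABLE wanted (Keyword TEXT PRIMARY KEY);
""")

def documents_with_all(keywords):
    """Return filenames of documents that carry every keyword in `keywords`."""
    conn.execute("DELETE FROM wanted")
    conn.executemany("INSERT INTO wanted VALUES (?)", [(k,) for k in set(keywords)])
    rows = conn.execute("""
        SELECT d.Filename
        FROM Documents d
        WHERE NOT EXISTS (SELECT 1
                          FROM wanted w
                          WHERE NOT EXISTS (SELECT 1
                                            FROM Keywords k
                                            WHERE k.Keyword = w.Keyword
                                              AND k.DocumentID = d.ID))
        ORDER BY d.ID
    """)
    return [row[0] for row in rows]

print(documents_with_all(["KeywordA", "KeywordB", "KeywordC"]))
print(documents_with_all(["KeywordA"]))
```

One design point worth knowing: with an empty keyword list this form returns every document, because "no wanted keyword is missing" is trivially true. The counting variants behave differently in that edge case, so pick whichever matches what an empty search should mean.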

How to use INTERSECT together with COUNT in SQLite?

I have a table called customer_transactions and a table called blacklist.
The customer_transactions table has a column called atm_name.
Both tables share a unique key called id.
How can I intersect the two tables in such a way that the query shows me:
1. customers that appear in both tables, and
2. a corresponding column that displays the number of times they used a certain ATM alongside the ATM's name (for instance: id_1 -- bank of america -- 2; id_1 -- citibank -- 3; id_2 -- bank of america -- 1; id_2 -- citibank -- 4, etcetera)?
I have something like this
SELECT id,
atm_name,
count(atm_name) as atm_count
FROM customer_transactions
GROUP BY id, atm_name
How can I INTERSECT this table with the blacklist table and maintain what I currently have as output?
Thanks in advance.
You seem to want a join. Assuming that column id relates the two tables, and that it is a unique key in blacklist, you can do:
select ct.id, ct.atm_name, count(*) as atm_count
from customer_transactions ct
inner join blacklist b on b.id = ct.id
group by ct.id, ct.atm_name
You can also express this logic with exists and a correlated subquery:
select ct.id, ct.atm_name, count(*) as atm_count
from customer_transactions ct
where exists (select 1 from blacklist b where b.id = ct.id)
group by ct.id, ct.atm_name
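The assumption that id is unique in blacklist matters for the join version. Here is a small sketch with Python's sqlite3 module showing what happens when it does not hold; the rows, including the duplicated blacklist entry, are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_transactions (id TEXT, atm_name TEXT);
    CREATE TABLE blacklist (id TEXT);   -- deliberately NOT unique in this sketch
    INSERT INTO customer_transactions VALUES
        ('id_1', 'bank of america'), ('id_1', 'bank of america'),
        ('id_1', 'citibank'),
        ('id_2', 'citibank'),
        ('id_3', 'citibank');           -- id_3 is not blacklisted
    INSERT INTO blacklist VALUES ('id_1'), ('id_1'), ('id_2');
""")

# Join version: each transaction is repeated once per matching blacklist row,
# so the duplicate 'id_1' entry doubles that customer's counts.
join_rows = conn.execute("""
    SELECT ct.id, ct.atm_name, COUNT(*) AS atm_count
    FROM customer_transactions ct
    INNER JOIN blacklist b ON b.id = ct.id
    GROUP BY ct.id, ct.atm_name
    ORDER BY ct.id, ct.atm_name
""").fetchall()

# EXISTS version: only asks whether a blacklist row exists, so counts stay correct.
exists_rows = conn.execute("""
    SELECT ct.id, ct.atm_name, COUNT(*) AS atm_count
    FROM customer_transactions ct
    WHERE EXISTS (SELECT 1 FROM blacklist b WHERE b.id = ct.id)
    GROUP BY ct.id, ct.atm_name
    ORDER BY ct.id, ct.atm_name
""").fetchall()

print(join_rows)
print(exists_rows)
```

When blacklist.id really is unique, both queries return the same rows, and the choice is mostly a matter of taste.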

SELECT from subquery without having to specify all columns in GROUP BY

Idea is to query an article table where an article has a given tag, and then to STRING_AGG all (even unrelated) tags that belong to that article row.
Example tables and query:
CREATE TABLE article (id SERIAL, body TEXT);
CREATE TABLE article_tag (article INT, tag INT);
CREATE TABLE tag (id SERIAL, title TEXT);
SELECT DISTINCT ON (id)
q.id, q.body, STRING_AGG(q.tag_title, '|') tags
FROM (
SELECT a.*, tag.title tag_title
FROM article a
LEFT JOIN article_tag x ON a.id = x.article
LEFT JOIN tag ON tag.id = x.tag
WHERE tag.title = 'someTag'
) q
GROUP BY q.id
Running the above, Postgres requires that q.body be included in GROUP BY:
ERROR: column "q.body" must appear in the GROUP BY clause or be used in an aggregate function
As I understand it, it's because subquery q doesn't include any PRIMARY key.
I naively thought that the DISTINCT ON would supplement that, but it doesn't seem so.
Is there a way to mark a column in a subquery as PRIMARY so that we don't have to list all columns in GROUP BY clause?
If we do have to list all columns in GROUP BY clause, does that incur significant perf cost?
EDIT: to elaborate, since PostgreSQL 9.1 you don't have to list columns that are functionally dependent on the primary key when using GROUP BY, e.g. the following query works fine:
SELECT a.id, a.body, STRING_AGG(tag.title, '|') tags
FROM article a
LEFT JOIN article_tag x ON a.id = x.article
LEFT JOIN tag ON tag.id = x.tag
GROUP BY a.id
I was wondering if I can leverage the same behavior, but with a subquery (by somehow indicating that q.id is a PRIMARY key).
It sadly doesn't work when you wrap your primary key in subquery and I don't know of any way to "mark it" as you suggested.
You can try this workaround using window function and distinct:
CREATE TABLE test1 (id serial primary key, name text, value text);
CREATE TABLE test2 (id serial primary key, test1_id int, value text);
INSERT INTO test1(name, value)
values('name1', 'test01'), ('name2', 'test02'), ('name3', 'test03');
INSERT INTO test2(test1_id, value)
values(1, 'test1'), (1, 'test2'), (3, 'test3');
SELECT DISTINCT ON (id) id, name, string_agg(value2, '|') over (partition by id)
FROM (SELECT test1.*, test2.value AS value2
FROM test1
LEFT JOIN test2 ON test2.test1_id = test1.id) AS sub;
id  name   string_agg
1   name1  test1|test2
2   name2  null
3   name3  test3
The problem is in the outer SELECT: you should either aggregate the columns or group by them.
Postgres wants you to specify what to do with q.body: group by it or compute an aggregate. It looks a little awkward, but it should work.
SELECT DISTINCT ON (id)
q.id, q.body, STRING_AGG(q.tag_title, '|') tags
FROM (
SELECT a.*, tag.title tag_title
FROM article a
LEFT JOIN article_tag x ON a.id = x.article
LEFT JOIN tag ON tag.id = x.tag
WHERE tag.title = 'someTag'
) q
GROUP BY q.id, q.body
-- ^^^^^^
Another way is to make a query to get id and aggregated tags then join body to it. If you wish I can make an example.
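A possible shape for that "aggregate first, then join the body" idea is sketched below with Python's sqlite3 module. Treat it as an illustration under assumptions, not as the answerer's own example: SQLite's group_concat stands in for PostgreSQL's STRING_AGG, the sample rows are invented, and the filter on 'someTag' is moved into an EXISTS so that all of an article's tags are aggregated, not only the matching one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article     (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE tag         (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE article_tag (article INTEGER, tag INTEGER);
    -- Hypothetical sample data: articles 1 and 3 carry 'someTag'.
    INSERT INTO article VALUES (1, 'first body'), (2, 'second body'), (3, 'third body');
    INSERT INTO tag VALUES (1, 'someTag'), (2, 'otherTag'), (3, 'thirdTag');
    INSERT INTO article_tag VALUES (1, 1), (1, 2), (2, 2), (3, 1), (3, 3);
""")

rows = conn.execute("""
    SELECT a.id, a.body, agg.tags
    FROM article a
    -- Aggregate tags per article id only; body is joined in afterwards,
    -- so it never has to appear in a GROUP BY.
    JOIN (SELECT x.article, group_concat(t.title, '|') AS tags
          FROM article_tag x
          JOIN tag t ON t.id = x.tag
          GROUP BY x.article) agg ON agg.article = a.id
    -- Keep only articles that have the requested tag.
    WHERE EXISTS (SELECT 1
                  FROM article_tag x2
                  JOIN tag t2 ON t2.id = x2.tag
                  WHERE x2.article = a.id AND t2.title = 'someTag')
    ORDER BY a.id
""").fetchall()

# group_concat does not promise an order, so sort the tags before comparing.
result = {article_id: (body, sorted(tags.split("|"))) for article_id, body, tags in rows}
print(result)
```

In PostgreSQL the same structure should carry over with STRING_AGG(t.title, '|') in the subquery.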

How can I create a new table based on merging 2 tables without joining certain values?

I asked a question earlier, but I wasn't really able to explain myself clearly.
I made a graphic to hopefully help explain what I'm trying to do.
I have two separate tables inside the same database. One table called 'Consumers' with about 200 fields including one called 'METER_NUMBERS*'. And then one other table called 'Customer_Info' with about 30 fields including one called 'Meter'. These two meter fields are what the join or whatever method would be based on. The problem is that not all the meter numbers in the two tables match and some are NULL values and some are a value of 0 in both tables.
I want to join the information for the records that have matching meter numbers between the two tables, but also keep the NULL and 0 values as their own records. There are NULL and 0 values in both tables but I don't want them to join together.
There are also a few duplicate field names, like Location shown in the graphic. If it's easier to fix these duplicate field names manually I can do that, but it'd be cool to be able to do it programmatically.
The key is that I need the result in a NEW table!
This process will be a one time thing, not something I would do often.
Hopefully, I explained this clearly and if anyone can help me out that'd be awesome!
If any more information is needed, please let me know.
Thanks.
INSERT INTO new_table
SELECT * FROM
(SELECT a.*, b.* FROM Consumers a
INNER JOIN CustomerInfo b ON a.METER_NUMBER = b.METER and a.Location = b.Location
WHERE a.METER_NUMBER IS NOT NULL AND a.METER_NUMBER <> 0
UNION ALL
SELECT a.*, NULL as Meter, NULL as CustomerInfo_Location, NULL as Field2, NULL as Field3
FROM Consumers a
WHERE a.METER_NUMBER IS NULL OR a.METER_NUMBER = 0
UNION ALL
SELECT NULL as METER_NUMBER, NULL as Location, NULL as Field4, NULL as Field5, b.*
FROM CustomerInfo b
WHERE b.METER IS NULL OR b.METER = 0) c
I know that to create a new table from another table you can use the following snippet:
CREATE TABLE New_table
AS (SELECT customers.Meter_number, customers_info.Meter_number, ...
FROM customers, customers_info
WHERE customers.Meter_number = customers_info.Meter_number
OR customers.Meter_number IS NULL OR customers_info.Meter_number = 0);
I didn't test it out, but you should be able to do something with that.
I guess full outer join is what you need.
Create table #consumers (
meter_number int,
location varchar(50),
field4 varchar(50),
field5 varchar(50)
)
Create table #Customer_info (
meter int,
location varchar(50),
field1 varchar(50),
field2 varchar(50)
)
Insert into #consumers(meter_number ,location , field4 , field5 )
values (1234,'Dallas','a','1')
,(null, 'Denver','b','2')
,(5678,'Houston','c','3')
,(null,'Omaha','d','4')
,(0,'Portland','e','5')
,(2222,'Sacramento','f','6')
Insert into #Customer_info(meter , location )
values (1234,'Dallas')
,(null, 'Kansas')
,(5678,'Houston')
,(Null,'Denver')
,(0,'Boston')
,(4444,'NY')
Select c.*
,i.*
From #consumers c
full outer join #Customer_info i on c.meter_number=i.meter
and c.location=i.location
select * into New_Table
From (
select METER_NUMBER, Consumers.Location AS Location, Field4, Field5,
Meter, Customer_Info.Location As Customer_Info_Location, Field2, Field3
From Consumers
full outer Join Customer_Info
on Consumers.METER_NUMBER = Customer_Info.Meter
And Consumers.Location = Customer_Info.Location
) AS t
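One detail to watch with the full outer join idea: NULL never equals NULL in a join condition, so NULL meters stay separate on their own, but 0 = 0 does match unless something else in the ON clause prevents it. The sketch below uses Python's sqlite3 module and the sample rows from the answer above. It joins on the meter only, excludes 0 explicitly, and emulates the full outer join with LEFT JOIN plus UNION ALL because older SQLite versions lack FULL OUTER JOIN. All of that is an assumption made for illustration, not the method any answer here prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE consumers     (meter_number INTEGER, location TEXT, field4 TEXT, field5 TEXT);
    CREATE TABLE customer_info (meter INTEGER, location TEXT, field1 TEXT, field2 TEXT);
    INSERT INTO consumers VALUES
        (1234, 'Dallas', 'a', '1'), (NULL, 'Denver', 'b', '2'), (5678, 'Houston', 'c', '3'),
        (NULL, 'Omaha', 'd', '4'), (0, 'Portland', 'e', '5'), (2222, 'Sacramento', 'f', '6');
    INSERT INTO customer_info (meter, location) VALUES
        (1234, 'Dallas'), (NULL, 'Kansas'), (5678, 'Houston'),
        (NULL, 'Denver'), (0, 'Boston'), (4444, 'NY');

    -- Every consumer row, paired with its customer_info row when the meters
    -- match and are not 0; then the customer_info rows that found no partner.
    CREATE TABLE New_Table AS
    SELECT c.meter_number AS meter_number, c.location AS location,
           c.field4 AS field4, c.field5 AS field5,
           i.meter AS meter, i.location AS customer_info_location,
           i.field1 AS field1, i.field2 AS field2
    FROM consumers c
    LEFT JOIN customer_info i
           ON i.meter = c.meter_number AND c.meter_number <> 0
    UNION ALL
    SELECT NULL, NULL, NULL, NULL, i.meter, i.location, i.field1, i.field2
    FROM customer_info i
    WHERE NOT EXISTS (SELECT 1
                      FROM consumers c
                      WHERE c.meter_number = i.meter AND c.meter_number <> 0);
""")

total_rows = conn.execute("SELECT COUNT(*) FROM New_Table").fetchone()[0]

# Rows where both sides contributed a meter, i.e. the real matches.
matched = conn.execute("""
    SELECT meter_number, location, customer_info_location
    FROM New_Table
    WHERE meter_number IS NOT NULL AND meter IS NOT NULL
    ORDER BY meter_number
""").fetchall()

# The 0-valued meters from each side should remain separate records.
zero_rows = conn.execute("""
    SELECT meter_number, location, meter, customer_info_location
    FROM New_Table
    WHERE meter_number = 0 OR meter = 0
""").fetchall()

print(total_rows, matched, zero_rows)
```

The duplicate Location column is handled by aliasing one side to customer_info_location, which is the same trick the SELECT ... INTO query above uses.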

many-to-many query

I have a problem and I don't know what the best solution is.
Okay, I have 2 tables: posts(id, title), posts_tags(post_id, tag_id).
My task is to select posts that have specific tag IDs, for example 4, 10 and 11.
The match need not be exact: a post could have any other tags at the same time.
How can I do this most efficiently? By creating a temporary table for each query? Or maybe with some kind of stored procedure?
In the future, users could ask the script to select posts with any number of tags (it could be only 1 tag or 10 at once), and I must be sure that the method I choose is the best one for my problem.
Sorry for my English, and thanks for your attention.
This solution assumes that (post_id, tag_id) in posts_tags is enforced to be UNIQUE:
SELECT id, title FROM posts
INNER JOIN posts_tags ON posts_tags.post_id = posts.id
WHERE tag_id IN (4, 6, 10)
GROUP BY id, title
HAVING COUNT(*) = 3
Although it's not a solution for all possible tag combinations, it's easy to create as dynamic SQL. To change for other sets of tags, change the IN () list to have all the tags, and the COUNT(*) = to check for the number of tags specified. The advantage of this solution over cascading a bunch of JOINs together is that you don't have to add JOINs, or even extra WHERE terms, when you change the request.
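As a rough illustration of the "easy to create as dynamic SQL" point, here is one way the query could be generated for any number of tags, sketched in Python with the sqlite3 module. The helper names build_all_tags_query and posts_with_all_tags and the sample rows are invented for this example; the values are passed as bound parameters so only the number of placeholders changes between requests.

```python
import sqlite3

def build_all_tags_query(tag_ids):
    """Build the IN (...) / HAVING COUNT(*) query for an arbitrary set of tag ids."""
    unique_ids = sorted(set(tag_ids))
    if not unique_ids:
        raise ValueError("at least one tag id is required")
    placeholders = ", ".join("?" for _ in unique_ids)
    sql = f"""
        SELECT p.id, p.title
        FROM posts p
        JOIN posts_tags pt ON pt.post_id = p.id
        WHERE pt.tag_id IN ({placeholders})
        GROUP BY p.id, p.title
        HAVING COUNT(*) = ?
        ORDER BY p.id
    """
    # The last parameter is the number of distinct tags requested.
    return sql, unique_ids + [len(unique_ids)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE posts_tags (post_id INTEGER, tag_id INTEGER,
                             PRIMARY KEY (post_id, tag_id));
    -- Hypothetical sample data.
    INSERT INTO posts VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO posts_tags VALUES (1, 4), (1, 10), (1, 11), (1, 99),
                                  (2, 4), (2, 10), (3, 11);
""")

def posts_with_all_tags(tag_ids):
    sql, params = build_all_tags_query(tag_ids)
    return conn.execute(sql, params).fetchall()

print(posts_with_all_tags([4, 10, 11]))
print(posts_with_all_tags([4, 10]))
```

Deduplicating the requested ids before counting keeps the HAVING comparison honest if a caller passes the same tag twice.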
select id, title
from posts p, posts_tags t
where p.id = t.post_id
and tag_id in ( 4,10,11 ) ;
?
Does this work?
select *
from posts
where posts.id in
(select post_id
from posts_tags
where tag_id = 4
and post_id in (select post_id
from posts_tags
where tag_id = 10
and post_id in (select post_id
from posts_tags
where tag_id = 11)))
You can do a time-storage trade-off by storing a one-way hash of the post's tag names sorted alphabetically.
When a post is tagged, execute select t.name from tags t inner join posts_tags pt on pt.tag_id = t.id where pt.post_id = [ID_of_tagged_post] order by t.name. Concatenate all of the tag names, create a hash using the MD5 algorithm and insert the value into a column alongside your post (or into another table joined by a foreign key, if you prefer).
When you want to search for a specific combination of tags, simply execute (remembering to sort the tag names) select from posts p where p.taghash = MD5([concatenated_tag_string]).
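A minimal sketch of the hashing step, using Python's hashlib, is shown below. The function name tag_hash and the comma separator are assumptions added here; the separator is there so that names like 'ab' + 'c' and 'a' + 'bc' do not collapse into the same string.

```python
import hashlib

def tag_hash(tag_names):
    """Hash a set of tag names in a canonical (sorted, de-duplicated) order."""
    canonical = ",".join(sorted(set(tag_names)))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

stored = tag_hash(["sql", "mysql", "tags"])    # computed when the post is tagged
lookup = tag_hash(["tags", "sql", "mysql"])    # computed at search time
print(stored == lookup)
```

Keep in mind that a hash like this only finds posts whose tag set is exactly the requested one. A post that carries additional tags produces a different hash, so this approach does not cover the "post could have any other tags" requirement from the question.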
This selects all posts that have any of the tags (4, 10, 11):
select distinct id, title from posts
where exists (
select * from posts_tags
where
post_id = id and
tag_id in (4, 10, 11))
Or you can use this:
select distinct id, title from posts
join posts_tags on post_id = id
where tag_id in (4, 10, 11)
(Both will be optimized the same way).
This selects all posts that have all of the tags (4, 10, 11):
select distinct id, title from posts
where not exists (
select * from posts_tags t1
where
t1.tag_id in (4, 10, 11) and
not exists (
select * from posts_tags as t2
where
t1.tag_id = t2.tag_id and
id = t2.post_id))
The list of tags in the in clause is what dynamically changes (in all cases).
But, this last query is not really fast, so you could use something like this instead:
create temporary table target_tags (tag_id int);
insert into target_tags values(4),(10),(11);
select id, title from posts
join posts_tags on post_id = id
join target_tags on target_tags.tag_id = posts_tags.tag_id
group by id, title
having count(*) = (select count(*) from target_tags);
drop table target_tags;
The part that changes dynamically is now in the second statement (the insert).