SQL pattern to get "and" list of multiple-row matches?

I'm not a database programmer, but I have a simple database-backed app where I have items with tags. Each item may have multiple tags, so I'm using a typical junction table (like this), where each row represents the fact that the item with the appropriate ID has the tag with the appropriate ID.
This works very logically when I want to do something like select all items with a given tag.
But, what is the typical pattern for doing AND searches? That is, what if I want to find all items which have all of a certain set of tags? This is such a common operation that I'd think some of the intro tutorials would cover it, but I guess I'm not looking in the right places.
The approach I tried was to use INTERSECT, first directly and then with subqueries and IN. This works, but builds up long-seeming queries quickly as I add search terms. And, crucially, this approach appears to be about an order of magnitude slower than the approach of shoving all the tags as text into one "tags" column and using SQLite's full-text search. (And, as I would expect/hope, the FTS search gets faster as I add more terms, which doesn't seem to be the case with the INTERSECTS approach.)
What's the proper design pattern here, and what's the right way to make it snappy? I'm using SQLite in this case, but I'm most interested in a general answer, since this must be a common thing to do.

The following is the standard ANSI SQL solution. It lists the ids once, so you do not have to keep the list of ids and the count of ids in sync by hand.
with tag_ids (tid) as (
values (1), (2)
)
select id
from tags
where id in (select tid from tag_ids)
group by id
having count(*) = (select count(*) from tag_ids);
The values clause ("row constructor") is supported by PostgreSQL and DB2. For databases that don't support it, you can replace it with a simple "select", e.g. in Oracle this would be:
with tag_ids (tid) as (
select 1 as tid from dual
union all
select 2 from dual
)
select id
from tags
where id in (select tid from tag_ids)
group by id
having count(*) = (select count(*) from tag_ids);
For SQL Server you would simply leave out the "from dual", as it does not require a FROM clause for a SELECT.
This assumes that a given tag can be assigned to an item only once. If that isn't the case, you would need to use count(distinct id) in the having clause.
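As a rough illustration of the pattern above on SQLite (the database the question mentions), here is a minimal sketch using Python's built-in sqlite3 module. The junction table item_tags and its columns item_id and tag_id are assumptions made for the example, not names from the question, and COUNT(DISTINCT ...) is used so that a tag assigned twice does not skew the count.

```python
import sqlite3

# Hypothetical junction table: one row per (item, tag) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_tags (item_id INTEGER, tag_id INTEGER);
    INSERT INTO item_tags VALUES
        (10, 1), (10, 2), (10, 3),   -- item 10 has tags 1, 2, 3
        (20, 1),                     -- item 20 has only tag 1
        (30, 2), (30, 1);            -- item 30 has tags 1 and 2
""")


def items_with_all_tags(conn, tag_ids):
    """Return the ids of items that carry every tag in tag_ids."""
    wanted = sorted(set(tag_ids))
    if not wanted:
        return []
    # One "(?)" row per wanted tag, so the list is written only once.
    placeholders = ", ".join("(?)" for _ in wanted)
    sql = f"""
        WITH tag_ids (tid) AS (VALUES {placeholders})
        SELECT item_id
        FROM item_tags
        WHERE tag_id IN (SELECT tid FROM tag_ids)
        GROUP BY item_id
        HAVING COUNT(DISTINCT tag_id) = (SELECT COUNT(*) FROM tag_ids)
        ORDER BY item_id
    """
    return [row[0] for row in conn.execute(sql, wanted)]


print(items_with_all_tags(conn, [1, 2]))
```

An index on (tag_id, item_id) would probably help the lookup, but that is untested here.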

I would be inclined to use a group by:
select id
from tags
where id in (<tag1>, <tag2>)
group by id
having count(*) = 2
This would guarantee that both appear.
For an unlimited size list, you could store the ids in a string, such as '|tag1|tag2|tag3|' (note delimiters on ends). Then you can do:
select id
from tags
where @taglist like '%|'+tag+'|%'
group by id
having count(*) = len(@taglist) - len(replace(@taglist, '|', '')) - 1
This uses SQL Server syntax, but the idea is general. The WHERE clause checks that the tag is in the list. The HAVING clause checks that the number of matches equals the number of tags in the list. It does this with a trick: counting the number of separators and subtracting 1.
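The separator-counting trick can be tried out on SQLite as well. The following is only a sketch: it swaps the SQL Server functions for SQLite's LENGTH, REPLACE and the || concatenation operator, and the table item_tags with a text column tag is a made-up example schema.

```python
import sqlite3

# Hypothetical schema: tags stored as text, one row per (item, tag).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_tags (item_id INTEGER, tag TEXT);
    INSERT INTO item_tags VALUES
        (1, 'red'), (1, 'big'), (1, 'old'),
        (2, 'red'), (2, 'old'),
        (3, 'big');
""")


def items_matching_taglist(conn, taglist):
    """taglist looks like '|red|old|' (delimiters on both ends)."""
    sql = """
        SELECT item_id
        FROM item_tags
        -- keep only rows whose tag appears in the delimited list
        WHERE :taglist LIKE '%|' || tag || '|%'
        GROUP BY item_id
        -- number of separators minus 1 = number of tags in the list
        HAVING COUNT(*) =
            LENGTH(:taglist) - LENGTH(REPLACE(:taglist, '|', '')) - 1
        ORDER BY item_id
    """
    return [row[0] for row in conn.execute(sql, {"taglist": taglist})]


print(items_matching_taglist(conn, "|red|old|"))
```

Note that this relies on each tag being stored at most once per item and on tags not containing the delimiter or LIKE wildcards.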

Related

In SQL how can I remove duplicates on a column without including it in the SELECT command/ the extract

I'm trying to remove duplicates on a column in SQL, without including that column in the extract (since it contains personally identifiable data). I thought I might be able to do this with nested queries (as below), however this isn't working. I also thought it might be possible to remove duplicates in the WHERE statement, but couldn't find anything from googling. Any ideas? Thanks in advance.
SELECT [ETHNIC], [RELIGION]
FROM
(SELECT DISTINCT [ID], [ETHNIC], [RELIGION]
FROM MainData)
Using distinct like that will apply distinct to the row, so if there are two rows with the same ID but different ETHNIC and RELIGION the distinct won't remove them. To do that you could use group by in your query, but then you need to use an aggregation (e.g. max):
SELECT [ETHNIC], [RELIGION]
FROM
(SELECT [ID], MAX([ETHNIC]) AS ETHNIC, MAX([RELIGION]) AS RELIGION
FROM MainData
GROUP BY [ID])
If that's not what you're looking for, some SQL dialects require that you name your inner select, so you could try adding AS X to the end of your query.
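To make the difference concrete, here is a small sketch run against SQLite through Python's sqlite3 module. The sample rows are invented; the point is that grouping by ID collapses repeated rows for one person while ID itself never appears in the output, and the inner select is given the alias X as suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MainData (ID INTEGER, ETHNIC TEXT, RELIGION TEXT);
    INSERT INTO MainData VALUES
        (1, 'A', 'X'),
        (1, 'A', 'X'),   -- duplicate row for person 1
        (2, 'B', 'Y'),
        (3, 'A', 'X');   -- a different person with the same attributes
""")

rows = conn.execute("""
    SELECT ETHNIC, RELIGION
    FROM (
        SELECT ID, MAX(ETHNIC) AS ETHNIC, MAX(RELIGION) AS RELIGION
        FROM MainData
        GROUP BY ID
    ) AS X
    ORDER BY ETHNIC, RELIGION
""").fetchall()

# One output row per distinct ID, without exposing the ID column.
print(rows)
```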

HAVING clause in SQL

I'm trying to understand why some DBMSs allow the query below while most don't. Assume table X has attributes name, id, data
SELECT id, count(*) as count
FROM TABLE X
GROUP BY id
HAVING count > X.data
In most databases, it's illegal to use non-grouping or non-aggregate field in HAVING clause conditional statement. Some systems seem to allow the same. Would you be able to explain why they would have allowed the HAVING condition to use an attribute which may not have a unique value throughout the group?
I have consulted the documentation of DB2, PostgreSQL, and MySQL.
The first issue with this query:
SELECT name, count(*) as count
FROM TABLE X
GROUP BY id
HAVING count > X.data;
is that you have name in the SELECT but id in the GROUP BY. Because you are grouping by id, I assume that there can be multiple rows in X, and so multiple values of name, for each id. Hence, the query is invalid.
There are some cases where this is allowed by the standard -- and even in some databases. However, that requires that id be unique in the table (the technical jargon is that the columns in the SELECT are functionally dependent on the columns in the GROUP BY).
The next issue is the use of count in the HAVING clause. This is fine conceptually. However, not all databases may support it.
Finally, you have x.data in the HAVING clause. If that is functionally dependent on a subset of the GROUP BY keys, then the usage conforms with the standard. However, that is unlikely in this case.
The standard is quite explicit that x.data is out-of-scope after the aggregation. So, this should result in a syntax error -- and it does in almost all databases.
There are a dwindling number of databases that support this construct -- happily MySQL no longer supports it by default. Such databases take an arbitrary and indeterminate value of data from a row in each group and use that for the comparison.

Using multiple nested fields in BigQuery

I have some records that have information about stores. These records have several different nested fields. One of the nested fields is tags and one is employees. I am trying to get a count of the number of stores that have a tag and an employee with a certain name. So I did this:
SELECT count(*)
FROM [stores.stores_844_1]
where tags.tag_name='foo'
and employees.first_name='bar'
Then I get the error:
Error: Cannot query the cross product of repeated fields tags.tag_name and employees.first_name.
I can make it work by changing the query to:
SELECT count(*)
FROM (flatten([stores.stores_844_1], tags))
where tags.tag_name='foo'
and employees.first_name='bar'
The problem with this is that I am dynamically creating the where clause and so my from clause will have to change depending on what I have in the where. While I could generate some logic in code to figure out what the from clause should be, I was wondering if there is a way to do something like:
SELECT count(*)
FROM [stores.stores_844_1]
where tags.tag_name='foo' WITHIN RECORD
and employees.first_name='bar' WITHIN RECORD
That would not have to flatten the main table?
I have tried using an ugly work around like this:
SELECT count(*)
FROM
(SELECT GROUP_CONCAT(CONCAT('>', tags.tag_name,'<')) WITHIN RECORD as f1, GROUP_CONCAT(CONCAT('>',employees.first_name,'<')) WITHIN RECORD as f2
FROM [stores.stores_844_1]
)
where f1 CONTAINS '>foo<'
and f2 CONTAINS '>bar<'
This ugly workaround works how I want it to, but it just seems really hacky and ugly and there must be a better way, right?
You can use WITHIN RECORD to come up with another field that indicates whether the values are present. I'm not sure if this meets your requirements, since you still have to change the FROM clause, but it seems cleaner than what you are currently doing. In other words, try this:
SELECT count(*) FROM (
SELECT SUM(IF(tags.tag_name='foo', 1, 0)) WITHIN RECORD as has_foo,
SUM(IF(employees.first_name='bar', 1, 0)) WITHIN RECORD as has_bar,
FROM [stores.stores_844_1])
WHERE has_foo > 0 AND has_bar > 0
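To show what the per-record flags compute, here is a plain-Python model of the same logic. It does not call BigQuery at all; the stores list and the count_matching helper are invented for illustration, with each dict standing in for one record and its repeated fields.

```python
# Each dict models one store record with two repeated (nested) fields.
stores = [
    {"tags": [{"tag_name": "foo"}, {"tag_name": "baz"}],
     "employees": [{"first_name": "bar"}, {"first_name": "amy"}]},
    {"tags": [{"tag_name": "foo"}],
     "employees": [{"first_name": "amy"}]},
    {"tags": [],
     "employees": [{"first_name": "bar"}]},
]


def count_matching(stores, tag_name, first_name):
    """Count records that have the tag and the employee name."""
    total = 0
    for store in stores:
        # Equivalent of SUM(IF(...)) WITHIN RECORD: one number per record.
        has_tag = sum(1 for t in store["tags"] if t["tag_name"] == tag_name)
        has_emp = sum(
            1 for e in store["employees"] if e["first_name"] == first_name
        )
        # Equivalent of the outer WHERE has_foo > 0 AND has_bar > 0.
        if has_tag > 0 and has_emp > 0:
            total += 1
    return total


print(count_matching(stores, "foo", "bar"))
```

Because each repeated field is reduced to a single number per record before the filter runs, no cross product of the two repeated fields is ever needed.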

SQL Having on columns not in SELECT

I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MACs that are considered "commonly used" for this user. For example, I want to filter out the MACs whose count is below 10% of the count of the most used MAC address for that user. Furthermore, I want 1 row per user. This could easily be achieved with a GROUP BY, HAVING & GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However I really don't want the count-column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
As HAVING explicitly refers to the column names in the select list, what you want is not possible.
However, you can use your select as a subselect to a select that returns only the rows you want to have.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL this is not possible, although it works in other DBMS like Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
This will declare the view as returning only the columns userid and macs although the underlying SELECT statement returns more columns than those two.
Although I am not sure whether the non-DBMS MySQL supports this or not...
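For comparison, here is a sketch of a different technique from the one proposed above: each row is compared against its user's maximum through a join to a derived table, and only then grouped, so the view never exposes the count column. It is written for SQLite (run through Python's sqlite3 module) rather than MySQL, the view name common_macs is made up, and the second user's row is invented test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mactable (userid INTEGER, mac_address TEXT, count INTEGER);
    INSERT INTO mactable VALUES
        (57193, '001122334455', 42),
        (57193, '000C6ED211E6', 15),
        (57193, 'FFFFFFFFFFFF', 2),
        (60001, 'AABBCCDDEEFF', 7);

    -- Filter rows against the per-user maximum first, then aggregate.
    CREATE VIEW common_macs AS
    SELECT m.userid, GROUP_CONCAT(m.mac_address, ',') AS macs
    FROM mactable AS m
    JOIN (SELECT userid, MAX(count) AS max_count
          FROM mactable
          GROUP BY userid) AS top
      ON top.userid = m.userid
    WHERE m.count * 10 >= top.max_count
    GROUP BY m.userid;
""")

rows = conn.execute(
    "SELECT userid, macs FROM common_macs ORDER BY userid"
).fetchall()
for userid, macs in rows:
    print(userid, macs)
```

This does use an inner query, which the question hoped to avoid, so whether it performs acceptably on a large table would need to be measured.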

Can I use non-aggregate columns with group by?

You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query.
I would, however, like to access one of the non-aggregates associated with the max. In plain English, I want a table with the oldest id of each kind.
CREATE TABLE stuff (
id int,
kind int,
age int
);
This query gives me the information I'm after:
SELECT kind, MAX(age)
FROM stuff
GROUP BY kind;
But it's not in the most useful form. I really want the id associated with each row so I can use it in later queries.
I'm looking for something like this:
SELECT id, kind, MAX(age)
FROM stuff
GROUP BY kind;
That outputs the same as this:
SELECT stuff.*
FROM
stuff,
( SELECT kind, MAX(age) AS age
FROM stuff
GROUP BY kind) maxes
WHERE
stuff.kind = maxes.kind AND
stuff.age = maxes.age
It really seems like there should be a way to get this information without needing to join. I just need the SQL engine to remember the other columns when it's calculating the max.
You can't get the Id of the row that MAX found, because there might not be only one id with the maximum age.
Quoting the question: "You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query."
You can, and have to, define what you are grouping by for the aggregate function to return the correct result.
MySQL (and SQLite) decided in their infinite wisdom that they would go against spec, and allow queries to accept GROUP BY clauses missing columns quoted in the SELECT - it effectively makes these queries not portable.
Quoting the question: "It really seems like there should be a way to get this information without needing to join."
Without access to the analytic/ranking/windowing functions that MySQL doesn't support, the self join to a derived table/inline view is the most portable means of getting the result you desire.
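A minimal sketch of that self join, run on SQLite through Python's sqlite3 module, is shown here. The sample rows are invented, and kind 2 deliberately contains a tie to show that the join returns every id that shares the maximum age.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stuff (id INT, kind INT, age INT);
    INSERT INTO stuff VALUES
        (1, 1, 5), (2, 1, 9),   -- kind 1: id 2 is the oldest
        (3, 2, 4), (4, 2, 4),   -- kind 2: ids 3 and 4 tie
        (5, 3, 1);              -- kind 3: single row
""")

oldest = conn.execute("""
    SELECT s.id, s.kind, s.age
    FROM stuff AS s
    JOIN (SELECT kind, MAX(age) AS max_age
          FROM stuff
          GROUP BY kind) AS maxes
      ON s.kind = maxes.kind AND s.age = maxes.max_age
    ORDER BY s.kind, s.id
""").fetchall()

print(oldest)
```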
I think it's tempting indeed to ask the system to solve the problem in one pass rather than having to do the job twice (find the max, and then find the corresponding id). You can do it using CONCAT (as suggested in the article Naktibalda referred to), though I'm not sure that would be more efficient:
SELECT MAX( CONCAT( LPAD(age, 10, '0'), '-', id) )
FROM STUFF1
GROUP BY kind;
Should work, you have to split the answer to get the age and the id.
(That's really ugly though)
In recent databases you can use max() over (partition by ...) to solve this problem:
select id, kind, age as max_age from (
select id, kind, age, max(age) over (partition by kind) as mage
from stuff) t
where age = mage
This can then be single pass
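The window-function approach can be tried on SQLite too, assuming a build of version 3.25 or newer (the first release with window functions). The sketch below uses Python's sqlite3 module with invented sample rows; the derived-table alias t is an addition, since some databases require one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stuff (id INT, kind INT, age INT);
    INSERT INTO stuff VALUES
        (1, 1, 5), (2, 1, 9),
        (3, 2, 4), (4, 2, 4),
        (5, 3, 1);
""")

oldest = conn.execute("""
    SELECT id, kind, age AS max_age
    FROM (SELECT id, kind, age,
                 -- per-kind maximum attached to every row
                 MAX(age) OVER (PARTITION BY kind) AS mage
          FROM stuff) AS t
    WHERE age = mage
    ORDER BY kind, id
""").fetchall()

print(oldest)
```

As with the join, ties are all returned; a ROW_NUMBER() ranking would be one way to keep a single row per kind.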
PostgreSQL's DISTINCT ON will be useful here.
SELECT DISTINCT ON (kind) kind, id, age
FROM stuff
ORDER BY kind, age DESC;
This groups by kind and returns the first row in the ordered format. As we have ordered by age in descending order, we will get the row with max age for kind.
P.S. columns in DISTINCT ON should appear first in order by
You have to have a join because the aggregate function max retrieves many rows and chooses the max.
So you need a join to choose the one that the aggregate function has found.
To put it a different way, how would you expect the query to behave if you replaced max with sum?
An inner join might be more efficient than your sub query though.