PostgreSQL: How to select on non-aggregating column?

Seems like a simple question but I'm having trouble accomplishing it. What I want to do is return all names that have duplicate ids. The view looks as such:
id | name | other_col
---+--------+----------
1 | James | x
2 | John | x
2 | David | x
3 | Emily | x
4 | Cameron| x
4 | Thomas | x
And so in this case, I'd just want the result:
name
-------
John
David
Cameron
Thomas
The following query works, but it seems like overkill to have two separate selects:
select name
from view where id = ANY(select id from view
WHERE other_col='x'
group by id
having count(id) > 1)
and other_col='x';
I believe it should be possible to do something along the lines of:
select name from view WHERE other_col='x' group by id, name having count(id) > 1;
But this returns nothing at all! What is the 'proper' query?
Do I just have to do it like my first working suggestion, or is there a better way?

You state you want to avoid two "queries", which isn't really possible. There are plenty of solutions available, but I would use a CTE like so:
WITH cte AS
(
SELECT
id,
name,
other_col,
COUNT(name) OVER(PARTITION BY id) AS id_count -- rows sharing this id
FROM
view -- the view from the question
WHERE
other_col = 'x'
)
SELECT name FROM cte WHERE id_count > 1;
You can reuse the CTE, so you don't have to duplicate logic and I personally find it easier to read and understand what it is doing.

SELECT name FROM view
WHERE id IN (SELECT id FROM view GROUP BY id HAVING COUNT(*) > 1)
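A minimal sketch of why the single-SELECT attempt from the question returns nothing while an IN subquery grouped by id alone works. It runs on SQLite through Python's sqlite3 module rather than on PostgreSQL, and the table name people is a hypothetical stand-in for the view:

```python
import sqlite3

# In-memory stand-in for the view; the table name "people" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, other_col TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "James", "x"), (2, "John", "x"), (2, "David", "x"),
     (3, "Emily", "x"), (4, "Cameron", "x"), (4, "Thomas", "x")],
)

# Grouping by (id, name): every group holds exactly one row,
# so HAVING COUNT(id) > 1 can never be true.
per_pair = conn.execute(
    "SELECT id, name, COUNT(id) FROM people "
    "WHERE other_col = 'x' GROUP BY id, name ORDER BY id, name"
).fetchall()

# Grouping by id alone: the duplicated ids show up with a count of 2.
per_id = conn.execute(
    "SELECT id, COUNT(id) FROM people "
    "WHERE other_col = 'x' GROUP BY id HAVING COUNT(id) > 1 ORDER BY id"
).fetchall()

# The subquery returns a single column (id), which is what IN expects.
names = [row[0] for row in conn.execute(
    "SELECT name FROM people WHERE other_col = 'x' AND id IN ("
    "  SELECT id FROM people WHERE other_col = 'x' "
    "  GROUP BY id HAVING COUNT(*) > 1) "
    "ORDER BY name"
)]

print(per_pair)
print(per_id)
print(names)
```

Because name has to appear in the GROUP BY to be selectable, the count is taken per (id, name) pair, which is why the counting has to happen in a separate step from the name lookup.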

Use the EXISTS operator:
SELECT name FROM view t1
WHERE EXISTS(
SELECT null FROM view t2
WHERE t1.id = t2.id
AND t1.name <> t2.name
)

Use a join:
select distinct v1.name
from view v1
join view v2 on v1.id = v2.id
and v1.name != v2.name
The use of distinct is there in case there are more than 2 rows sharing the same id. If that's not possible, you can omit distinct.
A note: Naming a column id when it's not unique will likely cause confusion, because it's the industry standard for the unique identifier column. If there isn't a unique column at all, it will cause coding difficulties.

Do not use a CTE. Before Postgres 12 that's typically more expensive, because Postgres has to materialize the intermediary result.
An EXISTS semi-join is typically fastest for this. Just make sure to repeat predicates (or match the values):
SELECT name
FROM view v
WHERE other_col = 'x'
AND EXISTS (
SELECT 1 FROM view
WHERE other_col = 'x' -- or: other_col = v.other_col
AND id = v.id -- same id ...
AND name <> v.name -- ... but a different row (exclude join to self)
);
That's a single query, even if you see the keyword SELECT twice here. An EXISTS expression does not produce a derived table, it will be resolved to simple index look-ups.
Speaking of which: a multicolumn index on (other_col, id) should help. Depending on data distribution and access patterns, appending the payload column name to enable index-only scans might help: (other_col, id, name). Or even a partial index, if other_col = 'x' is a constant predicate:
CREATE INDEX ON view (id) WHERE other_col = 'x';
See also the related question "PostgreSQL does not use a partial index".
The upcoming Postgres 9.6 would even allow an index-only scan on the partial index:
CREATE INDEX ON view (id, name) WHERE other_col = 'x';
You will love this improvement (quoting the /devel manual):
Allow using an index-only scan with a partial index when the index's
predicate involves column(s) not stored in the index (Tomas Vondra,
Kyotaro Horiguchi)
An index-only scan is now allowed if the query mentions such columns
only in WHERE clauses that match the index predicate
Verify performance with EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Run a couple of times to rule out caching effects.
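The EXISTS approach together with a partial index can be tried out with a small sketch like the following. It uses SQLite through Python's sqlite3 module instead of PostgreSQL, so planner behaviour will differ. The table name people and the index name people_x_idx are hypothetical, and the sketch assumes that a duplicate is another row with the same id but a different name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, other_col TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "James", "x"), (2, "John", "x"), (2, "David", "x"),
     (3, "Emily", "x"), (3, "Erin", "y"),   # same id, but other_col <> 'x'
     (4, "Cameron", "x"), (4, "Thomas", "x")],
)

# Partial index restricted to the constant predicate.
# SQLite requires an explicit index name; PostgreSQL can generate one.
conn.execute(
    "CREATE INDEX people_x_idx ON people (id, name) WHERE other_col = 'x'"
)

# Semi-join: keep a row if another qualifying row shares its id.
dupes = [row[0] for row in conn.execute(
    """
    SELECT v.name
    FROM people AS v
    WHERE v.other_col = 'x'
      AND EXISTS (
          SELECT 1
          FROM people AS p
          WHERE p.other_col = 'x'   -- repeat the predicate
            AND p.id = v.id         -- same id
            AND p.name <> v.name    -- but not the row itself
      )
    ORDER BY v.name
    """
)]

print(dupes)
```

Emily is not returned here because the only other row with id 3 has a different other_col, which is the reason for repeating the predicate inside the subquery.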

Related

Select where join match list of value

I'm not sure how to ask this question. I have the following schema:
message_id | message_content
-----------+----------------
1 | Hello World
2 | EHLO
message_id | concerned_user
-----------+---------------
1 | laura
1 | vick
1 | john
2 | laura
2 | vick
How do I select the message_id values that concern laura and vick (and only laura and vick)? The expected result is [2].
I'm sure it is basic SQL, but I can't find it.
As asked in one of the answers: I use PostgreSQL.
Thanks!
Build a string of the concerned users and see if that matches what you are looking for. In PostgreSQL the string concatenation group function is STRING_AGG:
select message_id
from mytable
group by message_id
having string_agg(concerned_user, ',' order by concerned_user) = 'laura,vick';
If there can be duplicates in the table (two or more rows for the same message_id and concerned_user), you must add DISTINCT: string_agg(distinct concerned_user ...).
If a message concerns exactly those 2 users, then that message has exactly 2 distinct users,
and the count for each of those users is not zero.
SELECT message_id
FROM your_table
GROUP BY message_id
HAVING COUNT(DISTINCT concerned_user) = 2
AND COUNT(CASE WHEN concerned_user = 'laura' THEN 1 END) > 0
AND COUNT(CASE WHEN concerned_user = 'vick' THEN 1 END) > 0;
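A hedged sketch of the counting approach, generalised to any list of users. It runs on SQLite through Python's sqlite3 module. The table name message_users and the helper messages_for_exactly are hypothetical, and the HAVING clause is a variation on the one above rather than the same query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message_users (message_id INTEGER, concerned_user TEXT)")
conn.executemany(
    "INSERT INTO message_users VALUES (?, ?)",
    [(1, "laura"), (1, "vick"), (1, "john"), (2, "laura"), (2, "vick")],
)

def messages_for_exactly(conn, users):
    """Return message ids whose set of concerned users equals `users`."""
    wanted = sorted(set(users))
    marks = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT message_id
        FROM message_users
        GROUP BY message_id
        -- as many distinct users as requested ...
        HAVING COUNT(DISTINCT concerned_user) = ?
        -- ... and every requested user is among them
           AND COUNT(DISTINCT CASE WHEN concerned_user IN ({marks})
                                   THEN concerned_user END) = ?
        ORDER BY message_id
    """
    params = [len(wanted), *wanted, len(wanted)]
    return [row[0] for row in conn.execute(sql, params)]

print(messages_for_exactly(conn, ["laura", "vick"]))
print(messages_for_exactly(conn, ["laura", "vick", "john"]))
```

Only the placeholders are interpolated into the SQL text; the user names themselves are passed as bound parameters.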
In very basic SQL it can be done with a statement doing something using WHERE EXISTS / NOT EXISTS
like (in pseudo code, real SQL statement follows):
select messages-from-laura
where exists(message-from-vick-with-same-id)
and not exists(message-from-someone-else-with-same-id)
same-id referring to the id of laura's message.
This is standard SQL, it works in postgresql, mssql (tested with both) and probably with others.
select *
from messages laura_messages
where concerned_user = 'laura'
and (exists (select 1 from messages vick_messages
where vick_messages.message_id = laura_messages.message_id
and vick_messages.concerned_user = 'vick'))
and not exists (select 1 from messages other_messages
where other_messages.message_id = laura_messages.message_id
and other_messages.concerned_user not in ('laura','vick'))
Note the use of aliases for the main table and subqueries tables
(in SQL, you can add an alias name after the table name in the FROM TableName clause)
and the reference of the first message id in EXISTS subqueries.
Note also that you don't need to return something in subqueries, just that some data exists, so doing SELECT 1 is fine.
And you could write the equivalent query with joins but the optimizer is very good at rewriting queries so it would probably be the same, this one is simpler imho.
Or you could also use GROUP or something more sophisticated, the nice thing with sql is that you often have several possibilities to write the same query :)

Struggling with a complicated query on row-based Field/Value table

Bear with me for a little bit of setup here please.
I have a table MAIN that has a Field/Value representation that looks like this:
I have another table called STORE_FLAG:
I am trying to write a parameterized query for which I will be given one FIELD_ID and one or more IDs from the STORE_FLAG table.
What I need to do is select from the MAIN table ROW_IDs where:
for the given FIELD_ID, the VALUE = 'YES' AND
for the given STORE_FLAG_IDS, ANY of those FIELD_IDs correspond to a VALUE = 'x' in the MAIN table.
Not that this would be a good idea, but I cannot pivot the whole table into a column-based table to then do a traditional where clause.
Example:
Given a Field_Id = 1 and a list of StoreIds = (30,50). I would want to return row_ids 1 and 2. This is because row_id 1 and 2 have a field_id 1 with value 'YES' AND at least one of the field_ids 3 and 5 have a value 'x'. But row_id 3 has a value of null for both field_id 3 and 5 and row_id 4 has a field_id 1 with value = 'NO'.
I was thinking something like this:
SELECT DISTINCT ROW_ID FROM MAIN
WHERE (FIELD_ID = :providedFieldId OR FIELD_ID IN (SELECT FIELD_ID FROM STORE_FLAG WHERE ID IN :providedStoreIdList))
AND (FIELD_VALUE = 'YES' OR FIELD_VALUE = 'x')
which (I think) works, but feels naïve to me..? I feel like there is some sort of super duper grouping way to do this, but I can't wrap my head around it. Any suggestions would be really appreciated.
Here is a way to do this:
select distinct m.row_id
from main m
where m.field_id=:providedFieldId
and m.field_value='YES'
and exists (select 1
from STORE_FLAG sf
join main m2
on sf.field_id=m2.field_id
where sf.id in ('30','50') /* you need to bind the values from :providedStoreIdList using a table function*/
and m2.field_value='x'
and m2.row_id=m.row_id
)
Link on how to bind an IN list:
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:110612348061
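The EXISTS query can be checked against the worked example from the question with a sketch like this one, run on SQLite through Python's sqlite3 module rather than Oracle. The table contents are an assumption reconstructed from the example (store ids 30 and 50 mapped to field ids 3 and 5), and the helper names are hypothetical. The IN list is bound by generating one placeholder per value instead of using a table function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main (row_id INTEGER, field_id INTEGER, field_value TEXT)")
conn.execute("CREATE TABLE store_flag (id INTEGER, field_id INTEGER)")
# Assumed data, reconstructed from the worked example in the question.
conn.executemany("INSERT INTO main VALUES (?, ?, ?)", [
    (1, 1, "YES"), (1, 3, "x"),  (1, 5, None),
    (2, 1, "YES"), (2, 3, None), (2, 5, "x"),
    (3, 1, "YES"), (3, 3, None), (3, 5, None),
    (4, 1, "NO"),  (4, 3, "x"),  (4, 5, "x"),
])
conn.executemany("INSERT INTO store_flag VALUES (?, ?)", [(30, 3), (50, 5)])

def rows_with_flag(conn, field_id, store_ids):
    """Rows where field_id is 'YES' AND at least one store field is 'x'."""
    marks = ", ".join("?" for _ in store_ids)
    sql = f"""
        SELECT DISTINCT m.row_id
        FROM main m
        WHERE m.field_id = ?
          AND m.field_value = 'YES'
          AND EXISTS (SELECT 1
                      FROM store_flag sf
                      JOIN main m2 ON sf.field_id = m2.field_id
                      WHERE sf.id IN ({marks})
                        AND m2.field_value = 'x'
                        AND m2.row_id = m.row_id)
        ORDER BY m.row_id
    """
    return [r[0] for r in conn.execute(sql, [field_id, *store_ids])]

def naive_rows(conn, field_id, store_ids):
    """The query from the question: either condition alone is enough."""
    marks = ", ".join("?" for _ in store_ids)
    sql = f"""
        SELECT DISTINCT row_id
        FROM main
        WHERE (field_id = ? OR field_id IN
                  (SELECT field_id FROM store_flag WHERE id IN ({marks})))
          AND (field_value = 'YES' OR field_value = 'x')
        ORDER BY row_id
    """
    return [r[0] for r in conn.execute(sql, [field_id, *store_ids])]

print(rows_with_flag(conn, 1, [30, 50]))
print(naive_rows(conn, 1, [30, 50]))
```

On this data the naive query also returns rows 3 and 4, since each of them satisfies one of the two conditions on its own.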
Your query will not work the way you describe, because its last line [AND (FIELD_VALUE = 'YES' OR FIELD_VALUE = 'x')] conflicts with your requirement. With that query you get a ROW_ID as soon as either condition is true (FIELD_VALUE = 'YES' or FIELD_VALUE = 'x'), whereas both are required. See the query below:
SELECT SUB_QUERY.ROW_ID FROM
(
select DISTINCT MAIN.ROW_ID,MAIN.FIELD_VALUE from STORE_FLAG
RIGHT OUTER JOIN MAIN ON STORE_FLAG.FIELD_ID=MAIN.FIELD_ID
WHERE ((STORE_FLAG.ID IN ('202','203') AND MAIN.FIELD_VALUE='x')
OR (MAIN.FIELD_ID ='1' AND MAIN.FIELD_VALUE='YES'))
) SUB_QUERY
GROUP BY SUB_QUERY.ROW_ID
HAVING (LISTAGG(SUB_QUERY.FIELD_VALUE, ',') WITHIN GROUP (ORDER BY SUB_QUERY.ROW_ID) IN ('YES,x','x,YES'))
I suggest running the subquery on its own first to understand what it returns.

CROSS APPLY query very slow when additional column added

I have a CROSS APPLY query which executes very quickly (1 second). However, if I add certain additional columns to the top SELECT, the query will run very slow (many minutes). I'm not seeing what is causing this.
SELECT
cs.show_title, im.primaryTitle
FROM
captive_state cs
CROSS APPLY
(SELECT TOP 1
imdb.tconst, imdb.titleType, imdb.primaryTitle,
imdb.genres, imdb.genre1, imdb.genre2, imdb.genre3
FROM
imdb_data imdb
WHERE
(imdb.primaryTitle LIKE cs.show_title+'%')
AND (imdb.titleType like 'tv%' OR imdb.titleType = 'movie')
ORDER BY
imdb.titleType, imdb.tconst DESC) AS im
WHERE
cs.genre1 IS NULL
I've tried adding/removing various columns and only when adding the 'genre' fields - e.g. genre2 (varchar(50)) - does the slowness occur. For example,
SELECT cs.show_title, im.primaryTitle, im.genre2
I would expect the query to basically have the same performance whether adding one additional column or not.
Here are the query plans without the extra column, and with.
The first table (cs) has a primary key index and an index on genre1. The second table (imdb) has a primary key index and an index on primaryTitle.
I'm not sure if those would cause any problems though.
Thanks for any suggestions.
In your second screenshot, you're performing an Index Scan on the primary key for imdb_data. This is essentially scanning the table as if there is no index.
You have two options. Either change your query to use the indexed columns of imdb_data or create a new index to cover this query.
Maybe switch to an alternative for the topped CROSS APPLY:
SELECT TOP 1 WITH TIES
cs.show_title,
imdb.tconst, imdb.titleType, imdb.primaryTitle,
imdb.genres, imdb.genre1, imdb.genre2, imdb.genre3
FROM captive_state cs
JOIN imdb_data imdb
ON imdb.primaryTitle LIKE cs.show_title+'%'
AND (imdb.titleType = 'movie' OR imdb.titleType LIKE 'tv%')
WHERE cs.genre1 IS NULL
ORDER BY ROW_NUMBER() OVER (PARTITION BY cs.show_title ORDER BY imdb.titleType, imdb.tconst DESC)
You could add the genre columns as included columns to the existing index on imdb_data (its name is not readable from the screenshot, so idx_name below is a placeholder for it):
CREATE INDEX idx_name ON imdb_data (primaryTitle) -- keep the same key columns as the existing index
INCLUDE (genre1, genre2, genre3) WITH (DROP_EXISTING = ON)
Try to use "join" with "row_number()" instead of "apply":
select
dat.primaryTitle
,dat.show_title
from (
select
imdb.primaryTitle
,cs.show_title
,row_number() over (partition by cs.show_title order by imdb.titleType, imdb.tconst DESC) as rn
from imdb_data imdb
inner join captive_state cs on imdb.primaryTitle LIKE cs.show_title+'%'
where (imdb.titleType like 'tv%' OR imdb.titleType = 'movie')
and cs.genre1 IS NULL
) dat
where dat.rn = 1
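A small sketch of the join plus row_number() pattern, returning the extra genre2 column as well. It runs on SQLite through Python's sqlite3 module, so it says nothing about SQL Server's plan choice. It needs SQLite 3.25 or newer for window functions and uses || instead of + for string concatenation. All rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE captive_state (show_title TEXT, genre1 TEXT)")
conn.execute(
    "CREATE TABLE imdb_data (tconst TEXT PRIMARY KEY, titleType TEXT, "
    "primaryTitle TEXT, genre2 TEXT)"
)
# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO captive_state VALUES (?, ?)",
                 [("Dark", None), ("Lost", "Drama")])
conn.executemany("INSERT INTO imdb_data VALUES (?, ?, ?, ?)", [
    ("tt001", "movie",    "Dark City",   "Sci-Fi"),
    ("tt002", "movie",    "Dark Waters", "Drama"),
    ("tt003", "tvSeries", "Dark",        "Thriller"),
    ("tt004", "short",    "Dark Short",  "Horror"),
    ("tt005", "tvSeries", "Lost",        "Mystery"),
])

# Rank the candidate titles per show, then keep only the best one.
best = conn.execute(
    """
    SELECT dat.show_title, dat.primaryTitle, dat.genre2
    FROM (
        SELECT cs.show_title, imdb.primaryTitle, imdb.genre2,
               ROW_NUMBER() OVER (
                   PARTITION BY cs.show_title
                   ORDER BY imdb.titleType, imdb.tconst DESC
               ) AS rn
        FROM imdb_data imdb
        JOIN captive_state cs
          ON imdb.primaryTitle LIKE cs.show_title || '%'
        WHERE (imdb.titleType LIKE 'tv%' OR imdb.titleType = 'movie')
          AND cs.genre1 IS NULL
    ) dat
    WHERE dat.rn = 1
    ORDER BY dat.show_title
    """
).fetchall()

print(best)
```

The ordering inside the window mirrors the ORDER BY of the original TOP 1 subquery, so the same row per show_title is selected.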

Is there something I can change to make my view run faster?

I just created a view but it is really slow, since my actual table has something around 800k rows.
Is there something I can change in the actual sql code to make it run faster?
Here is how it looks now:
Select B.*
FROM
(Select A.*, (select count(B.KEY_ID)/77
FROM book_new B
where B.KEY_ID = A.KEY_ID) as COUNT_KEY
FROM
(select *
from book_new
where region = 'US'
and (actual_release_date is null or
actual_release_date >= To_Date( '01/07/16','dd/mm/yy'))
) A
) B
WHERE B.COUNT_KEY = 1
OR (B.COUNT_KEY > 1 AND B.NEW_OLD <> 'Old')
The most obvious things to do are add indexes:
Add an index on book_new(key_id)
Add an index on book_new(region, actual_release_date)
These are probably sufficient. It is possible that rewriting the query would help, but this is a good beginning. If you want to rewrite the query, it would help if you described the logic you are trying to implement.
There are many ways to solve this issue, depending on your needs:
You can create an indexed view.
You can create an index on the base tables that are used in this view.
You can list only the required columns in the SELECT statement instead of using SELECT * FROM.
If the table contains many columns but you require only a few of them, you can create a NON CLUSTERED INDEX with the INCLUDE COLUMNS option, which will reduce the LOGICAL READS.
For starters, replace the scalar subquery for COUNT_KEY with a windowed COUNT(*).
SELECT * FROM
(
select book_new.*, COUNT(*) OVER ( PARTITION BY book_new.key_id)/77 COUNT_KEY
from book_new
where region = 'US'
and (actual_release_date is null or
actual_release_date >= To_Date( '01/07/16','dd/mm/yy'))
)
WHERE count_key = 1 OR ( count_key > 1 AND new_old <> 'Old' )
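This way, you only go through the BOOK_NEW table one time. Note that the windowed count only sees the rows that pass the WHERE clause, whereas the original scalar subquery counted every row in BOOK_NEW with the same KEY_ID, so the results can differ if filtered-out rows share a KEY_ID.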
BTW, I agree with other comments that this query makes little sense.

Ensuring two columns only contain valid results from same subquery

I have the following table:
id symbol_01 symbol_02
1 abc xyz
2 kjh okd
3 que qid
I need a query that ensures symbol_01 and symbol_02 are both contained in a list of valid symbols. In other words, I would need something like this:
select *
from mytable
where symbol_01 in (
select valid_symbols
from somewhere)
and symbol_02 in (
select valid_symbols
from somewhere)
The above example would work correctly, but the subquery used to determine the list of valid symbols is identical both times and is quite large. It would be very inefficient to run it twice like in the example.
Is there a way to do this without duplicating two identical sub queries?
Another approach:
select *
from mytable t1
where 2 = (select count(distinct symbol)
from valid_symbols vs
where vs.symbol in (t1.symbol_01, t1.symbol_02));
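This assumes that the valid symbols are stored in a table valid_symbols that has a column named symbol. The query would also benefit from an index on valid_symbols.symbol. Note that COUNT(DISTINCT ...) returns 1 when symbol_01 and symbol_02 hold the same valid symbol, so such rows are not returned.
You could try using a CTE like this: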
WITH ValidSymbols AS (
SELECT DISTINCT valid_symbol
FROM somewhere
)
SELECT mt.*
FROM MyTable mt
INNER JOIN ValidSymbols v1
ON mt.symbol_01 = v1.valid_symbol
INNER JOIN ValidSymbols v2
ON mt.symbol_02 = v2.valid_symbol
From a performance perspective, your query is the right way to do this. I would write it as:
select *
from mytable t
where exists (select 1
from valid_symbols vs
where t.symbol_01 = vs.valid_symbol
) and
exists (select 1
from valid_symbols vs
where t.symbol_02 = vs.valid_symbol
) ;
The important component is that you need an index on valid_symbols(valid_symbol). With this index, the lookup should be pretty fast. Appropriate indexes can even work if valid_symbols is a view, although the effect depends on the complexity of the view.
You seem to have a situation where you have two foreign key relationships. If you explicitly declare these relationships, then the database will enforce that the columns in your table match the valid symbols.
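A minimal sketch of the foreign key idea, assuming the valid symbols live in a real table (it does not apply if they only exist as the result of a large subquery). It uses SQLite through Python's sqlite3 module, where enforcement has to be switched on per connection. The table and column names follow the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE valid_symbols (valid_symbol TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE mytable (
        id        INTEGER PRIMARY KEY,
        symbol_01 TEXT NOT NULL REFERENCES valid_symbols (valid_symbol),
        symbol_02 TEXT NOT NULL REFERENCES valid_symbols (valid_symbol)
    )
    """
)
conn.executemany("INSERT INTO valid_symbols VALUES (?)",
                 [("abc",), ("xyz",), ("kjh",)])

# Both symbols are valid: accepted.
conn.execute("INSERT INTO mytable VALUES (1, 'abc', 'xyz')")

# 'okd' is not a valid symbol: the database rejects the row.
try:
    conn.execute("INSERT INTO mytable VALUES (2, 'kjh', 'okd')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT id FROM mytable ORDER BY id").fetchall()
print(rejected, rows)
```

With the constraints in place, invalid rows never reach the table, so the validating query is only needed for data loaded before the constraints existed.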