How to use SELECT DISTINCT with RANDOM() function in PostgreSQL? - sql

I am trying to run a SQL query to get four random items. Because the table product_filter has more than one tuple per product, I have to use DISTINCT in the SELECT, which gives me this error:
for SELECT DISTINCT, ORDER BY expressions must appear in select list
But if I put RANDOM() in my SELECT list, it defeats the DISTINCT.
Does anyone know how to use DISTINCT with the RANDOM() function? Below is my problematic query.
SELECT DISTINCT
p.id,
p.title
FROM
product_filter pf
JOIN product p ON pf.cod_product = p.cod
JOIN filters f ON pf.cod_filter = f.cod
WHERE
p.visible = TRUE
LIMIT 4
ORDER BY RANDOM();

You can either use a subquery (note the alias t, which older PostgreSQL versions require on a subquery in FROM):
SELECT * FROM (
SELECT DISTINCT p.cod, p.title ... JOIN... WHERE
) t ORDER BY RANDOM() LIMIT 4;
or group by those same fields:
SELECT p.cod, p.title, MIN(RANDOM()) AS o FROM ... JOIN ...
WHERE ... GROUP BY p.cod, p.title ORDER BY o LIMIT 4;
Which of the two runs faster depends on table structure and indexing. With proper indexing on cod and title, the subquery version will probably run faster: cod is the only key needed for the JOIN, so if you index by title, cod and visible (used in the WHERE), it is likely that the query can be answered from the index and the physical table will not be accessed at all.
I am not sure whether this would also happen with the second query.

You can simplify your query to avoid the problem a priori:
SELECT p.cod, p.title
FROM product p
WHERE p.visible
AND EXISTS (
SELECT 1
FROM product_filter pf
JOIN filters f ON f.cod = pf.cod_filter
WHERE pf.cod_product = p.cod
)
ORDER BY random()
LIMIT 4;
Major points:
You have only columns from table product in the result, other tables are only checked for existence of a matching row. For a case like this the EXISTS semi-join is likely the fastest and simplest solution. Using it does not multiply rows from the base table product, so you don't need to remove them again with DISTINCT.
LIMIT has to come last, after ORDER BY.
I simplified WHERE p.visible = 't' to p.visible, because this should be a boolean column.
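To see the EXISTS query above in action, here is a minimal sketch using Python's built-in sqlite3 module in place of PostgreSQL (an assumption made only so the example is self-contained). The sample rows and the helper name pick_random_products are hypothetical; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (cod INTEGER PRIMARY KEY, title TEXT, visible INTEGER);
    CREATE TABLE filters (cod INTEGER PRIMARY KEY);
    CREATE TABLE product_filter (cod_product INTEGER, cod_filter INTEGER);
    -- product 5 is not visible; product 6 has no filter at all
    INSERT INTO product VALUES
        (1, 'a', 1), (2, 'b', 1), (3, 'c', 1), (4, 'd', 1), (5, 'e', 0), (6, 'f', 1);
    INSERT INTO filters VALUES (10), (20);
    -- products 1..5 each match two filters, so a plain JOIN would duplicate them
    INSERT INTO product_filter VALUES
        (1, 10), (1, 20), (2, 10), (2, 20), (3, 10), (3, 20),
        (4, 10), (4, 20), (5, 10), (5, 20);
""")


def pick_random_products(conn, n):
    """Return up to n random visible products that have at least one filter."""
    return conn.execute("""
        SELECT p.cod, p.title
        FROM product p
        WHERE p.visible
          AND EXISTS (
              SELECT 1
              FROM product_filter pf
              JOIN filters f ON f.cod = pf.cod_filter
              WHERE pf.cod_product = p.cod
          )
        ORDER BY random()
        LIMIT ?
    """, (n,)).fetchall()


rows = pick_random_products(conn, 4)
print(rows)  # four distinct products, in random order
```

Only products 1 to 4 qualify in this sample data, so the call returns each of them exactly once even though every one of them has two matching rows in product_filter.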

Use a subquery. Don't forget the table alias, t. LIMIT comes after ORDER BY.
SELECT *
FROM (SELECT DISTINCT a, b, c
FROM datatable WHERE a = 'hello'
) t
ORDER BY random()
LIMIT 10;

I think you need a subquery:
select *
from (select DISTINCT p.cod, p.title
from product_filter pf join
product p
on pf.cod_product = p.cod
where p.visible = 't'
) t
order by RANDOM()
LIMIT 4
Calculate the distinct values first, and then do the limit.
Do note, this does have performance implications, because this query does a distinct on everything before selecting what you want. Whether this matters depends on the size of your table and how you are using the query.

SELECT DISTINCT U.* FROM
(
SELECT p.cod, p.title FROM product_filter pf
JOIN product p on pf.cod_product = p.cod
JOIN filters f on pf.cod_filter = f.cod
WHERE p.visible = 't'
ORDER BY RANDOM()
) AS U
LIMIT 4
This does the RANDOM first then the LIMIT afterwards.

Related

PostgreSQL GROUP BY column must appear in the GROUP BY

SELECT
COUNT(follow."FK_accountId"),
score.*
FROM
(
SELECT items.*, AVG(reviews.score) as "averageScore" FROM "ITEM_VARIATION" as items
INNER JOIN "ITEM_REVIEW" as reviews ON reviews."FK_itemId"=items.id
GROUP BY items.id
) as score
INNER JOIN "ITEM_FOLLOWER" as follow ON score.id=follow."FK_itemId"
GROUP BY score.id
The inner block works by itself, and I believe I followed the same format.
However, it outputs this error:
ERROR: column "score.name" must appear in the GROUP BY clause or be used in an aggregate function
LINE 18: score.*
^
Is listing all the columns of score the only solution?
There are over 10 columns to list, so I'd like to avoid that if it's not the only option.
Columns that are not inside an aggregate function must be listed in the GROUP BY clause:
SELECT
COUNT(follow."FK_accountId"),
score.id,
score.name
FROM
(
SELECT items.id as id, items.name as name, AVG(reviews.score) as "averageScore" FROM "ITEM_VARIATION" as items
INNER JOIN "ITEM_REVIEW" as reviews ON reviews."FK_itemId"=items.id
GROUP BY items.id, items.name
) as score
INNER JOIN "ITEM_FOLLOWER" as follow ON score.id=follow."FK_itemId"
GROUP BY score.id, score.name
I would suggest you use correlated subqueries or a lateral join:
SELECT i.*,
(SELECT AVG(r.score)
FROM "ITEM_REVIEW" r
WHERE r."FK_itemId" = i.id
) as "averageScore",
(SELECT COUNT(*)
FROM "ITEM_FOLLOWER" f
WHERE f."FK_itemId" = i.id
)
FROM "ITEM_VARIATION" i;
With the right indexes, this is probably faster as well.
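A small sketch of the correlated-subquery form, run against SQLite through Python's sqlite3 module rather than PostgreSQL. The simplified table names (item, review, follower), the column aliases and the sample rows are all hypothetical stand-ins for the tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE review (item_id INTEGER, score REAL);
    CREATE TABLE follower (item_id INTEGER, account_id INTEGER);
    INSERT INTO item VALUES (1, 'lamp'), (2, 'desk');
    INSERT INTO review VALUES (1, 4.0), (1, 2.0), (2, 5.0);
    -- item 2 has no followers
    INSERT INTO follower VALUES (1, 100), (1, 101), (1, 102);
""")

# Each aggregate is computed per item in its own scalar subquery,
# so the outer query needs no GROUP BY and can select any item column.
per_item = conn.execute("""
    SELECT i.id, i.name,
           (SELECT AVG(r.score) FROM review r WHERE r.item_id = i.id) AS average_score,
           (SELECT COUNT(*) FROM follower f WHERE f.item_id = i.id) AS follower_count
    FROM item i
    ORDER BY i.id
""").fetchall()

print(per_item)  # [(1, 'lamp', 3.0, 3), (2, 'desk', 5.0, 0)]
```

One behavioural difference from the INNER JOIN version: an item without followers (item 2 here) still appears, with a count of 0, instead of being dropped.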

SQL is returning distinct values without the distinct word

My sql query is returning distinct values.
This is the query:
select *
from Products
where [Product_ID] in (select Product_Id f
from MyCart
where User_Id = '5570928b-7a1b-4652-9c6b-592e76a70a07')
The second select query is returning (7,7,3) and the first select is returning information only for one 7 and 3.
I suppose it is because the 7's are duplicates, but I need the result to contain information about all the products in the second select, no matter if they are duplicate or not.
In that case, use JOIN:
select p.*
from Products p join
MyCart c
on p.Product_Id = c.Product_Id
where c.User_Id = '5570928b-7a1b-4652-9c6b-592e76a70a07';
Usually, duplicates are undesirable, which is why EXISTS and IN are used.
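The difference between IN and JOIN can be reproduced with a few rows. This is a sketch using Python's sqlite3 module; the sample data and the short user id 'u1' are made up, and the cart holds the same (7, 7, 3) pattern described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (Product_Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE MyCart (User_Id TEXT, Product_Id INTEGER);
    INSERT INTO Products VALUES (3, 'pen'), (7, 'book');
    -- product 7 is in the cart twice
    INSERT INTO MyCart VALUES ('u1', 7), ('u1', 7), ('u1', 3);
""")

# IN only asks "is there a match?", so each product appears at most once.
with_in = conn.execute("""
    SELECT p.Product_Id
    FROM Products p
    WHERE p.Product_Id IN (SELECT Product_Id FROM MyCart WHERE User_Id = 'u1')
    ORDER BY p.Product_Id
""").fetchall()

# JOIN produces one output row per matching cart row, duplicates included.
with_join = conn.execute("""
    SELECT p.Product_Id
    FROM Products p
    JOIN MyCart c ON p.Product_Id = c.Product_Id
    WHERE c.User_Id = 'u1'
    ORDER BY p.Product_Id
""").fetchall()

print(with_in)    # [(3,), (7,)]
print(with_join)  # [(3,), (7,), (7,)]
```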

Distinct on multi-columns in sql

I have this query in sql
select cartlines.id,cartlines.pageId,cartlines.quantity,cartlines.price
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
I want the rows to be distinct by pageId, so that the result never contains the same pageId more than once.
Any ideas?
Thanks
Baaroz
Going by what you're expecting in the output and your comment that says "...if there rows in output that contain same pageid only one will be shown...," it sounds like you're trying to get the top record for each page ID. This can be achieved with ROW_NUMBER() and PARTITION BY:
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.pageID) rowNumber,
c.id,
c.pageId,
c.quantity,
c.price
FROM orders o
INNER JOIN cartlines c ON c.orderId = o.id
WHERE userId = 5
) a
WHERE a.rowNumber = 1
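Here is a runnable sketch of the ROW_NUMBER() approach, using Python's sqlite3 module (window functions need SQLite 3.25 or newer). The sample rows are invented. As an assumption made only for this example, rows within each pageId are ordered by c.id, so that the row that gets kept is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, userId INTEGER);
    CREATE TABLE cartlines (id INTEGER PRIMARY KEY, orderId INTEGER,
                            pageId INTEGER, quantity INTEGER, price REAL);
    INSERT INTO orders VALUES (1, 5), (2, 5), (3, 9);
    INSERT INTO cartlines VALUES
        (10, 1, 100, 1, 9.5),
        (11, 1, 100, 2, 9.5),   -- same pageId as line 10
        (12, 2, 200, 1, 3.0),
        (13, 3, 100, 4, 9.5);   -- belongs to another user
""")

# Number the rows inside each pageId and keep only the first one.
first_per_page = conn.execute("""
    SELECT a.id, a.pageId, a.quantity, a.price
    FROM (
        SELECT ROW_NUMBER() OVER (PARTITION BY c.pageId ORDER BY c.id) AS rowNumber,
               c.id, c.pageId, c.quantity, c.price
        FROM orders o
        INNER JOIN cartlines c ON c.orderId = o.id
        WHERE o.userId = 5
    ) a
    WHERE a.rowNumber = 1
    ORDER BY a.pageId
""").fetchall()

print(first_per_page)  # [(10, 100, 1, 9.5), (12, 200, 1, 3.0)]
```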
You can also use ROW_NUMBER() OVER(PARTITION BY ... along with TOP 1 WITH TIES, but it runs a little slower (despite being WAY cleaner):
SELECT TOP 1 WITH TIES c.id, c.pageId, c.quantity, c.price
FROM orders o
INNER JOIN cartlines c ON c.orderId = o.id
WHERE userId = 5
ORDER BY ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.pageID)
If you wish to remove rows with all columns duplicated this is solved by simply adding a distinct in your query.
select distinct cartlines.id,cartlines.pageId,cartlines.quantity,cartlines.price
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
If, however, this makes no difference, it means the other columns have different values, so the combinations of column values form distinct (unique) rows.
As Michael Berkowski stated in comments:
DISTINCT - does operate over all columns in the SELECT list, so we
need to understand your special case better.
In the case that simply adding distinct does not cover you, you need to also remove the columns that are different from row to row, or use aggregate functions to get aggregate values per cartlines.
Example - total quantity per distinct pageId:
select cartlines.pageId, sum(cartlines.quantity)
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
group by cartlines.pageId
If this is still not what you wish, you need to give us data and specify better what it is you want.

Complex Query duplicating Result (same id, different columns values)

I have this query, working great:
SELECT * FROM
(
select
p.id,
comparestrings('marco', pc.value) as similarity
from
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
where ( u.id = 1 ) AND p.id_unit = u.id
) as subQuery
where
similarity is not null
AND
similarity > 0.35
order by
similarity desc;
Let me explain the situation.
TABLES:
person ID as column.
field a table that represents a column, like name, varchar (something like that)
person_field represents the value of that person and that field.. Like this:
unit not relevant for this question
Eg.:
Person id 1
Field id 1 (name, e.g.)
value "Marco Noronha"
So the function "comparestrings" returns a double from 0 to 1, where 1 is exact ('Marco' == 'Marco').
So, I need all persons that have a similarity above 0.35, and I also need the similarity itself.
No problem, the query works fine, as it was supposed to. But now I have a new requirement: the table "person_field" will contain an alteration date, to keep track of the changes to those rows.
Eg.:
Person ID 1
Field ID 1
Value "Marco Noronha"
Date - 01/25/2013
Person ID 1
Field ID 1
Value "Marco Tulio Jacovine Noronha"
Date - 02/01/2013
So what I need to do, is consider ONLY the LATEST row!!
If I execute the same query the result would be (eg):
1, 0.8
1, 0.751121
2, 0.51212
3, 0.42454
//other results here, other 'person's
And let's suppose that the value I want to bring back is 1, 0.751121 (which is the latest value by DATE)
I think I should do something like order by date desc limit 1...
But if I do something like that, the query will return only ONE person =/
Like:
1, 0.751121
When I really want:
1, 0.751121
2, 0.51212
3, 0.42454
You can use DISTINCT ON(p.id) on the sub-query:
SELECT * FROM
(
select
DISTINCT ON(p.id)
p.id,
comparestrings('marco', pc.value) as similarity
from
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
where ( u.id = 1 ) AND p.id_unit = u.id
ORDER BY p.id, pc.alt_date DESC
) as subQuery
where
similarity is not null
AND
similarity > 0.35
order by
similarity desc;
Notice that, to make it work I needed to add ORDER BY p.id, pc.alt_date DESC:
p.id: required by DISTINCT ON (if you use ORDER BY, the first fields must be exactly the same as DISTINCT ON);
pc.alt_date DESC: the alteration date you mentioned (we order desc, so we get the latest one for each p.id)
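By the way, it seems you don't need a sub-query at all (just make sure comparestrings is marked as stable or immutable, and it'll be fast enough). Note that this version applies the similarity filter before DISTINCT ON picks a row, so a person whose latest value falls below the threshold can still show up with an older value: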
SELECT
DISTINCT ON(p.id)
p.id,
comparestrings('marco', pc.value) as similarity
FROM
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
WHERE ( u.id = 1 ) AND p.id_unit = u.id
AND COALESCE(comparestrings('marco', pc.value), 0.0) > 0.35
ORDER BY p.id, pc.alt_date DESC, similarity DESC;
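Whether the similarity filter runs before or after the latest row is picked changes the result. The sketch below shows this with Python's sqlite3 module. SQLite has no DISTINCT ON, so ROW_NUMBER() stands in for it (an assumption of this example, needing SQLite 3.25 or newer). The comparestrings function here is a toy stand-in, not the real one from the question, and the table is reduced to the columns that matter.

```python
import sqlite3


def comparestrings(needle, value):
    """Toy similarity: share of value covered by needle when it is a prefix."""
    if value.lower().startswith(needle.lower()):
        return len(needle) / len(value)
    return 0.0


conn = sqlite3.connect(":memory:")
conn.create_function("comparestrings", 2, comparestrings)
conn.executescript("""
    CREATE TABLE person_field (id_person INTEGER, value TEXT, alt_date TEXT);
    INSERT INTO person_field VALUES
        (1, 'Marco Noronha',                '2013-01-25'),  -- 5/13, above 0.35
        (1, 'Marco Tulio Jacovine Noronha', '2013-02-01'),  -- 5/28, below 0.35
        (2, 'Marco',                        '2013-01-10');  -- exact match, 1.0
""")

# Pick the latest row per person first, then apply the threshold.
latest_then_filter = conn.execute("""
    SELECT id_person, similarity
    FROM (
        SELECT id_person,
               comparestrings('marco', value) AS similarity,
               ROW_NUMBER() OVER (PARTITION BY id_person ORDER BY alt_date DESC) AS rn
        FROM person_field
    )
    WHERE rn = 1 AND similarity > 0.35
    ORDER BY id_person
""").fetchall()

# Apply the threshold first, then pick the latest surviving row per person.
filter_then_latest = conn.execute("""
    SELECT id_person, similarity
    FROM (
        SELECT id_person,
               comparestrings('marco', value) AS similarity,
               ROW_NUMBER() OVER (PARTITION BY id_person ORDER BY alt_date DESC) AS rn
        FROM person_field
        WHERE comparestrings('marco', value) > 0.35
    )
    WHERE rn = 1
    ORDER BY id_person
""").fetchall()

print(latest_then_filter)  # person 1 is absent: the latest value is too dissimilar
print(filter_then_latest)  # person 1 is present, matched through the older value
```

If only the latest value should ever count, the first form (which corresponds to the sub-query version) is the one to use.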
Change the reference to person to a subquery as in the following example (the subquery is the one called p):
. . .
from unit u cross join
(select p.*
from (select p.*,
row_number() over (partition by person_id order by alterationdate desc) as seqnum
from person p
) p
where seqnum = 1
) p
. . .
This uses the row_number() function to identify the last row. I've used an additional subquery to limit the result just to the most recent. You could also include this in an on clause or a where clause.
I also changed the , to an explicit cross join.

SQL query performance question (multiple sub-queries)

I have this query:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND (
r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
)
Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.
Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.
Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this can't be done without them.
Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and if no pages exist I still want the site returned.
Jack
Edit: I'm using MySQL
If "active" is the first in alphabetical order you might be able to reduce the subqueries to:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND
r.id = (SELECT r2.id
FROM page_revision as r2
WHERE r2.pageId = r.pageId
ORDER BY r2.status, r2.id DESC
LIMIT 1)
Otherwise you can replace ORDER BY line with
ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
These all come from my assumptions on SQL Server, your mileage with MySQL may vary.
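The ORDER BY CASE variant can be tried out as follows. This is a sketch using Python's sqlite3 module rather than MySQL or SQL Server, with invented sample rows. For comparison it also runs the OR-based query from the question, which in this data returns two rows for a page whose latest revision is not the latest active one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER PRIMARY KEY);
    CREATE TABLE page_revision (id INTEGER PRIMARY KEY, pageId INTEGER,
                                status TEXT, title TEXT);
    INSERT INTO page VALUES (1), (2), (3);   -- page 3 has no revisions
    INSERT INTO page_revision VALUES
        (1, 1, 'active', 'p1 v1'),
        (2, 1, 'draft',  'p1 v2'),
        (3, 2, 'draft',  'p2 v1'),
        (4, 2, 'draft',  'p2 v2'),
        (5, 1, 'active', 'p1 v3'),
        (6, 1, 'draft',  'p1 v4');
""")

# One subquery: active revisions sort first, then the highest id wins.
preferred = conn.execute("""
    SELECT p.id, r.id, r.status, r.title
    FROM page AS p
    INNER JOIN page_revision AS r ON r.pageId = p.id
       AND r.id = (SELECT r2.id
                   FROM page_revision AS r2
                   WHERE r2.pageId = p.id
                   ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
                   LIMIT 1)
    ORDER BY p.id
""").fetchall()

# The question's two-subquery form, joined with OR.
with_or = conn.execute("""
    SELECT p.id, r.id
    FROM page AS p
    INNER JOIN page_revision AS r ON r.pageId = p.id AND (
        r.id = (SELECT MAX(r2.id) FROM page_revision AS r2
                WHERE r2.pageId = r.pageId AND r2.status = 'active')
        OR r.id = (SELECT MAX(r2.id) FROM page_revision AS r2
                   WHERE r2.pageId = r.pageId)
    )
    ORDER BY p.id, r.id
""").fetchall()

print(preferred)  # [(1, 5, 'active', 'p1 v3'), (2, 4, 'draft', 'p2 v2')]
print(with_or)    # [(1, 5), (1, 6), (2, 4)]
```

Page 1 gets exactly one row from the CASE version (revision 5, the latest active one), while the OR version matches both revision 5 and revision 6.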
Maybe a little re-factoring is in order?
If you added a latest_revision_id column onto pages your problem would disappear, hopefully with only a couple of lines added to your page editor.
I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.
Your problem is a particular case of what is described in this question.
The best you can get using standard ANSI SQL seems to be:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.
In MS SQL 2005+ and Oracle:
SELECT o.id, o.status, o.title
FROM (
SELECT p.id, r.status, r.title,
ROW_NUMBER() OVER (PARTITION BY p.id ORDER BY CASE WHEN r.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
FROM page p
INNER JOIN page_revision r ON r.pageId = p.id
) o
WHERE o.rn = 1
In MySQL that can become a problem, as subqueries cannot use the INDEX RANGE SCAN as the expression from the outer query is not considered constant.
You'll need to create two indexes and a function that returns the last page revision to use those indexes:
CREATE INDEX ix_revision_page_status_id ON page_revision (pageId, status, id);
CREATE INDEX ix_revision_page_id ON page_revision (pageId, id);
CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS int(11)
BEGIN
DECLARE rev_id INT;
SELECT o.id
INTO rev_id
FROM (
-- latest active revision, if any
(SELECT r.id, 0 AS prio
FROM page_revision r
FORCE INDEX (ix_revision_page_status_id)
WHERE r.pageId = input_id
AND r.status = 'active'
ORDER BY r.id DESC
LIMIT 1)
UNION ALL
-- latest revision of any status, used as a fallback
(SELECT r.id, 1 AS prio
FROM page_revision r
FORCE INDEX (ix_revision_page_id)
WHERE r.pageId = input_id
ORDER BY r.id DESC
LIMIT 1)
) o
ORDER BY o.prio
LIMIT 1;
RETURN rev_id;
END;
SELECT po.id, r.status, r.title
FROM (
SELECT p.*, fn_get_last_revision(p.id) AS rev_id
FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;
This will use the indexes efficiently to get the last revision of each page.
P.S. If you use numeric codes for statuses, with 0 for active, you can get rid of the second index and simplify the query.