How to select only first rows that satisfies conditions? - sql

I'm doing a join between two tables and adding a condition want to obtain only the first row that satisfie the join condition and the "extern" condition too.
This query for example:
select * from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%';
First of all, i want to know how to group the result of this inner join using the profile (v.profile or p.id). After that how to show only the first appearence for each group.
Thanks in advance.

You can use an analytic query for this:
select *
from (
select p.*, v.*,
row_number() over (partition by p.id order by v.userid) as rn
from prmprofile p
join user v on v.profile = p.id
where p.language = 0
and v.userid like '%TEST%'
)
where rn = 1;
The inner query gets all the data (but using * isn't ideal), and adds an extra column that assigns a row number sequence across each p.id value. They have to be ordered by something, and you haven't said what makes a particular row 'first', so I've guessed as user ID - you can of course change that to something more appropriate, that will give consistent results when the query is rerun. (You can look at rank and dense_rank for alternative methods to pick the 'first' row).
The outer query just restricts the result set to those where the extra column has the value 1, which will give you one row for every p.id.
Another approach would be to use a subquery to identify the 'first' row and join to that, but it isn't clear what the criteria would be and if it would be selective enough.

Please try:
select * from(
select *,
row_number() over (partition by v.userid order by v.userid) RNum
from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%'
)x
where RNum=1;

You can use LIMIT keyword.
select * from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%'
limit 1

select * from PRMPROFILE p, user v
where p.id = v.profile and p.language = 0
and v.userid like '%TEST%'
fetch first 1 row only

It will display only the top result
select top 1 * from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%';

Related

Complex Query duplicating Result (same id, different columns values)

I have this query, working great:
SELECT * FROM
(
select
p.id,
comparestrings('marco', pc.value) as similarity
from
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
where ( u.id = 1 ) AND p.id_unit = u.id
) as subQuery
where
similarity is not null
AND
similarity > 0.35
order by
similarity desc;
Let me explain the situation.
TABLES:
person ID as column.
field a table that represents a column, like name, varchar (something like that)
person_field represents the value of that person and that field.. Like this:
unit not relevant for this question
Eg.:
Person id 1
Field id 1 {name, eg)
value "Marco Noronha"
So the function "comparestrings" returns a double from 0 to 1, where 1 is exact ('Marco' == 'Marco').
So, I need all persons that have similarity above 0.35 and i also need its similarity.
No problem, the query works fine and as it was suppost to. But now I have a new requirement that, the table "person_field" will contain an alteration date, to keep track of the changes of those rows.
Eg.:
Person ID 1
Field ID 1
Value "Marco Noronha"
Date - 01/25/2013
Person ID 1
Field ID 1
Value "Marco Tulio Jacovine Noronha"
Date - 02/01/2013
So what I need to do, is consider ONLY the LATEST row!!
If I execute the same query the result would be (eg):
1, 0.8
1, 0.751121
2, 0.51212
3, 0.42454
//other results here, other 'person's
And lets supose that the value I want to bring is 1, 0.751121 (witch is the lattest value by DATE)
I think I should do something like order by date desc limit 1...
But if I do something like that, the query will return only ONE person =/
Like:
1, 0.751121
When I really want:
1, 0.751121
2, 0.51212
3, 0.42454
You can use DISTINCT ON(p.id) on the sub-query:
SELECT * FROM
(
select
DISTINCT ON(p.id)
p.id,
comparestrings('marco', pc.value) as similarity
from
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
where ( u.id = 1 ) AND p.id_unit = u.id
ORDER BY p.id, pc.alt_date DESC
) as subQuery
where
similarity is not null
AND
similarity > 0.35
order by
similarity desc;
Notice that, to make it work I needed to add ORDER BY p.id, pc.alt_date DESC:
p.id: required by DISTINCT ON (if you use ORDER BY, the first fields must be exactly the same as DISTINCT ON);
pc.alt_date DESC: the alter date you mentioned (we order desc, so we get the oldest ones by each p.id)
By the way, seems that you don't need a sub-query at all (just make sure comparestrings is marked as stable or immutable, and it'll be fast enough):
SELECT
DISTINCT ON(p.id)
p.id,
comparestrings('marco', pc.value) as similarity
FROM
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
WHERE ( u.id = 1 ) AND p.id_unit = u.id
AND COALESCE(comparestrings('marco', pc.value), 0.0) > 0.35
ORDER BY p.id, pc.alt_date DESC, similarity DESC;
Change the reference to person to a subquery as in the following example (the subquery is the one called p):
. . .
from unit u cross join
(select p.*
from (select p.*,
row_number() over (partition by person_id order by alterationdate desc) as seqnum
from person p
) p
where seqnum = 1
) p
. . .
This uses the row_number() function to identify the last row. I've used an additional subquery to limit the result just to the most recent. You could also include this in an on clause or a where clause.
I also changed the , to an explicit cross join.

How to use SELECT DISTINCT with RANDOM() function in PostgreSQL?

I am trying to run a SQL query to get four random items. As the table product_filter has more than one touple in product i have to use DISTINCT in SELECT, so i get this error:
for SELECT DISTINCT, ORDER BY expressions must appear in select list
But if i put RANDOM() in my SELECT it will avoid the DISTINCT result.
Someone know how to use DISTINCT with the RANDOM() function? Below is my problematic query.
SELECT DISTINCT
p.id,
p.title
FROM
product_filter pf
JOIN product p ON pf.cod_product = p.cod
JOIN filters f ON pf.cod_filter = f.cod
WHERE
p.visible = TRUE
LIMIT 4
ORDER BY RANDOM();
You either do a subquery
SELECT * FROM (
SELECT DISTINCT p.cod, p.title ... JOIN... WHERE
) ORDER BY RANDOM() LIMIT 4;
or you try GROUPing for those same fields:
SELECT p.cod, p.title, MIN(RANDOM()) AS o FROM ... JOIN ...
WHERE ... GROUP BY p.cod, p.title ORDER BY o LIMIT 4;
Which of the two expressions will evaluate faster depends on table structure and indexing; with proper indexing on cod and title, the subquery version will run faster (cod and title will be taken from index cardinality information, and cod is the only key needed for the JOIN, so if you index by title, cod and visible (used in the WHERE), it is likely that the physical table will not even be accessed at all.
I am not so sure whether this would happen with the second expression too.
You can simplify your query to avoid the problem a priori:
SELECT p.cod, p.title
FROM product p
WHERE p.visible
AND EXISTS (
SELECT 1
FROM product_filter pf
JOIN filters f ON f.cod = pf.cod_filter
WHERE pf.cod_product = p.cod
)
ORDER BY random()
LIMIT 4;
Major points:
You have only columns from table product in the result, other tables are only checked for existence of a matching row. For a case like this the EXISTS semi-join is likely the fastest and simplest solution. Using it does not multiply rows from the base table product, so you don't need to remove them again with DISTINCT.
LIMIT has to come last, after ORDER BY.
I simplified WHERE p.visible = 't' to p.visible, because this should be a boolean column.
Use a subquery. Don't forget the table alias, t. LIMIT comes after ORDER BY.
SELECT *
FROM (SELECT DISTINCT a, b, c
FROM datatable WHERE a = 'hello'
) t
ORDER BY random()
LIMIT 10;
I think you need a subquery:
select *
from (select DISTINCT p.cod, p.title
from product_filter pf join
product p
on pf.cod_product = p.cod
where p.visible = 't'
) t
LIMIT 4
order by RANDOM()
Calculate the distinct values first, and then do the limit.
Do note, this does have performance implications, because this query does a distinct on everything before selecting what you want. Whether this matters depends on the size of your table and how you are using the query.
SELECT DISTINCT U.* FROM
(
SELECT p.cod, p.title FROM product__filter pf
JOIN product p on pf.cod_product = p.cod
JOIN filters f on pf.cod_filter = f.cod
WHERE p.visible = 't'
ORDER BY RANDOM()
) AS U
LIMIT 4
This does the RANDOM first then the LIMIT afterwards.

i want to modify this SQL statement to return only distinct rows of a column

select
picks.`fbid`,
picks.`time`,
categories.`name` as cname,
options.`name` as oname,
users.`name`
from
picks
left join categories
on (categories.`id` = picks.`cid`)
left join options
on (options.`id` = picks.oid)
left join users
on (users.fbid = picks.`fbid`)
order by
time desc
that query returns a result that like:
my question is.... I would like to modify the query to select only DISTINCT fbid's. (perhaps the first row only sorted by time)
can someone help with this?
select
p2.fbid,
p2.time,
c.`name` as cname,
o.`name` as oname,
u.`name`
from
( select p1.fbid,
min( p1.time ) FirstTimePerID
from picks p1
group by p1.fbid ) as FirstPerID
JOIN Picks p2
on FirstPerID.fbid = p2.fbid
AND FirstPerID.FirstTimePerID = p2.time
LEFT JOIN Categories c
on p2.cid = c.id
LEFT JOIN Options o
on p2.oid = o.id
LEFT JOIN Users u
on p2.fbid = u.fbid
order by
time desc
I don't know why you originally had LEFT JOINs, as it appears that all picks must be associated with a valid category, option and user... I would then remove the left, and change them to INNER joins instead.
The first inner query grabs for each fbid, the FIRST entry time which will result in a single entity for the FBID. From that, it re-joins to the picks table for the same ID and timeslot... then continues for the rest of the category, options, users join criteria of that single entry.
2 options, you could write a group by clause.
Or you could write a nested query joined back to itself to get pertinent info.
Nested aliased table:
SELECT
n.fBids
FROM
MyTable t
INNER JOIN
(SELECT DISTINCT fBids
FROM MyTable) n
ON n.ID = t.ID
Or group by option
SELECT fBId from MyTable
GROUP BY fBID
select picks.`fbid`, picks.`time`, categories.`name` as cname,
options.`name` as oname, users.`name` from picks left join categories
on (categories.`id` = picks.`cid`) left join options on (options.`id` = picks.oid)
left join users on (users.fbid = picks.`fbid`)
order by time desc GROUP BY picks.`fbid`
select
picks.fbid,
MIN(picks.time) as first_time,
MAX(picks.time) as last_time
from
picks
group by
picks.fbid
order by
MIN(picks.time) desc
However, if you want only distinct fbid's you cannot display cname and other columns at the same time.

SQL Query Help Part 2 - Add filter to joined tables and get max value from filter

I asked this question on SO. However, I wish to extend it further. I would like to find the max value of the 'Reading' column only where the 'state' is of value 'XX' for example.
So if I join the two tables, how do I get the row with max(Reading) value from the result set. Eg.
SELECT s.*, g1.*
FROM Schools AS s
JOIN Grades AS g1 ON g1.id_schools = s.id
WHERE s.state = 'SA' // how do I get row with max(Reading) column from this result set
The table details are:
Table1 = Schools
Columns: id(PK), state(nvchar(100)), schoolname
Table2 = Grades
Columns: id(PK), id_schools(FK), Year, Reading, Writing...
I'd think about using a common table expression:
WITH SchoolsInState (id, state, schoolname)
AS (
SELECT id, state, schoolname
FROM Schools
WHERE state = 'XX'
)
SELECT *
FROM SchoolsInState AS s
JOIN Grades AS g
ON s.id = g.id_schools
WHERE g.Reading = max(g.Reading)
The nice thing about this is that it creates this SchoolsInState pseudo-table which wraps all the logic about filtering by state, leaving you free to write the rest of your query without having to think about it.
I'm guessing [Reading] is some form of numeric value.
SELECT TOP (1)
s.[Id],
s.[State],
s.[SchoolName],
MAX(g.[Reading]) Reading
FROM
[Schools] s
JOIN [Grades] g on g.[id_schools] = s.[Id]
WHERE s.[State] = 'SA'
Group By
s.[Id],
s.[State],
s.[SchoolName]
Order By
MAX(g.[Reading]) DESC
UPDATE:
Looking at Tom's i don't think that would work but here is a modified version that does.
WITH [HighestGrade] (Reading)
AS (
SELECT
MAX([Reading]) Reading
FROM
[Grades]
)
SELECT
s.*,
g.*
FROM
[HighestGrade] hg
JOIN [Grades] AS g ON g.[Reading] = hg.[Reading]
JOIN [Schools] AS s ON s.[id] = g.[id_schools]
WHERE s.state = 'SA'
This CTE method should give you what you want. I also had it break down by year (grade_year in my code to avoid the reserved word). You should be able to remove that easily enough if you want to. This method also accounts for ties (you'll get both rows back if there is a tie):
;WITH MaxReadingByStateYear AS (
SELECT
S.id,
S.school_name,
S.state,
G.grade_year,
RANK() OVER(PARTITION BY S.state, G.grade_year ORDER BY Reading DESC) AS ranking
FROM
dbo.Grades G
INNER JOIN Schools S ON
S.id = G.id_schools
)
SELECT
id,
state,
school_name,
grade_year
FROM
MaxReadingByStateYear
WHERE
state = 'AL' AND
ranking = 1
One way would be this:
SELECT...
FROM...
WHERE...
AND g1.Reading = (select max(G2.Reading)
from Grades G2
inner join Schools s2
on s2.id = g2.id_schools
and s2.state = s.state)
There are certainly more.

SQL query performance question (multiple sub-queries)

I have this query:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND (
r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
)
Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.
Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.
Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this cant be done without them.
Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and If no pages exist I still want the site returned.
Jack
Edit: I'm using MySQL
If "active" is the first in alphabetical order you migt be able to reduce subqueries to:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND
r.id = (SELECT r2.id
FROM page_revision as r2
WHERE r2.pageId = r.pageId
ORDER BY r2.status, r2.id DESC
LIMIT 1)
Otherwise you can replace ORDER BY line with
ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
These all come from my assumptions on SQL Server, your mileage with MySQL may vary.
Maybe a little re-factoring is in order?
If you added a latest_revision_id column onto pages your problem would disappear, hopefully with only a couple of lines added to your page editor.
I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.
Your problem is a particular case of what is described in this question.
The best you can get using standard ANSI SQL seems to be:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.
In MS SQL 2005+ and Oracle:
SELECT p.id, r.status, r.title
FROM (
SELECT p.*, r,*,
ROW_NUMBER() OVER (PARTITION BY p.pageId ORDER BY CASE WHEN p.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
FROM page AS p, page_revision r
WHERE r.id = p.pageId
) o
WHERE rn = 1
In MySQL that can become a problem, as subqueries cannot use the INDEX RANGE SCAN as the expression from the outer query is not considered constant.
You'll need to create two indexes and a function that returns the last page revision to use those indexes:
CREATE INDEX ix_revision_page_status_id ON page_revision (page_id, id, status);
CREATE INDEX ix_revision_page_id (page_id, id);
CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS int(11)
BEGIN
DECLARE id INT;
SELECT r_id
INTO id
FROM (
SELECT r.id
FROM page_revisions
FORCE INDEX (ix_revision_page_status_id)
WHERE page_id = input_id
AND status = 'active'
ORDER BY id DESC
LIMIT 1
UNION ALL
SELECT r.id
FROM page_revisions
FORCE INDEX (ix_revision_page_id)
WHERE page_id = input_id
ORDER BY id DESC
LIMIT 1
) o
LIMIT 1;
RETURN id;
END;
SELECT po.id, r.status, r.title
FROM (
SELECT p.*, fn_get_last_revision(p.page_id) AS rev_id
FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;
This will efficiently use index to get the last revision of the page.
P. S. If you will use codes for statuses and use 0 for active, you can get rid of the second index and simplify the query.