Complex Query duplicating Result (same id, different columns values) - sql

I have this query, working great:
SELECT * FROM
(
select
p.id,
comparestrings('marco', pc.value) as similarity
from
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
where ( u.id = 1 ) AND p.id_unit = u.id
) as subQuery
where
similarity is not null
AND
similarity > 0.35
order by
similarity desc;
Let me explain the situation.
TABLES:
person: has ID as a column.
field: a table that represents a column, like name, varchar (something like that).
person_field: represents the value of that field for that person.
unit: not relevant for this question.
E.g.:
Person id 1
Field id 1 (name, for example)
Value "Marco Noronha"
So the function "comparestrings" returns a double from 0 to 1, where 1 is exact ('Marco' == 'Marco').
So, I need all persons that have a similarity above 0.35, and I also need that similarity.
No problem, the query works fine and as it was supposed to. But now I have a new requirement: the table "person_field" will contain an alteration date, to keep track of the changes to those rows.
Eg.:
Person ID 1
Field ID 1
Value "Marco Noronha"
Date - 01/25/2013
Person ID 1
Field ID 1
Value "Marco Tulio Jacovine Noronha"
Date - 02/01/2013
So what I need to do, is consider ONLY the LATEST row!!
If I execute the same query the result would be (eg):
1, 0.8
1, 0.751121
2, 0.51212
3, 0.42454
//other results here, other 'person's
And let's suppose that the value I want is 1, 0.751121 (which is the latest value by DATE).
I think I should do something like order by date desc limit 1...
But if I do something like that, the query will return only ONE person =/
Like:
1, 0.751121
When I really want:
1, 0.751121
2, 0.51212
3, 0.42454

You can use DISTINCT ON(p.id) on the sub-query:
SELECT * FROM
(
select
DISTINCT ON(p.id)
p.id,
comparestrings('marco', pc.value) as similarity
from
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
where ( u.id = 1 ) AND p.id_unit = u.id
ORDER BY p.id, pc.alt_date DESC
) as subQuery
where
similarity is not null
AND
similarity > 0.35
order by
similarity desc;
Notice that, to make it work I needed to add ORDER BY p.id, pc.alt_date DESC:
p.id: required by DISTINCT ON (if you use ORDER BY, the first fields must be exactly the same as DISTINCT ON);
pc.alt_date DESC: the alteration date you mentioned (we order desc, so we get the newest row for each p.id)
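By the way, it seems you may not need a sub-query at all (just make sure comparestrings is marked as stable or immutable, and it'll be fast enough). Note that here the similarity filter runs before DISTINCT ON picks a row, so you get the latest row that passes the threshold rather than a threshold test on the latest row only, and the output is ordered by p.id instead of by similarity: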
SELECT
DISTINCT ON(p.id)
p.id,
comparestrings('marco', pc.value) as similarity
FROM
unit u, person p
inner join person_field pc ON (p.id = pc.id_person)
inner join field c ON (pc.id_field = c.id AND c.flag_name = true)
WHERE ( u.id = 1 ) AND p.id_unit = u.id
AND COALESCE(comparestrings('marco', pc.value), 0.0) > 0.35
ORDER BY p.id, pc.alt_date DESC, similarity DESC;

Change the reference to person_field to a subquery, as in the following example (the subquery is the one called pc; the alteration date lives on person_field, and alterationdate is an assumed column name):
. . .
from unit u cross join
person p inner join
(select pf.*
from (select pf.*,
row_number() over (partition by id_person, id_field order by alterationdate desc) as seqnum
from person_field pf
) pf
where seqnum = 1
) pc ON (p.id = pc.id_person)
. . .
This uses the row_number() function to identify the latest row for each person and field. I've used an additional subquery to limit the result to just the most recent row. You could also put the seqnum = 1 condition in an on clause or a where clause.
I also changed the , to an explicit cross join.
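To see the row_number() approach end to end, here is a minimal sketch run against an in-memory SQLite database from Python (SQLite 3.25+ is needed for window functions). It assumes the alteration date is stored on person_field, as the question describes. The column name alt_date, the sample rows, and the difflib-based comparestrings are hypothetical stand-ins, not the asker's real schema or function.

```python
import sqlite3
from difflib import SequenceMatcher


def comparestrings(a, b):
    # Hypothetical stand-in for the question's comparestrings():
    # returns a similarity between 0 and 1, or NULL for NULL input.
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


conn = sqlite3.connect(":memory:")
conn.create_function("comparestrings", 2, comparestrings)
conn.executescript("""
CREATE TABLE person_field (
    id_person INTEGER, id_field INTEGER, value TEXT, alt_date TEXT);
INSERT INTO person_field VALUES
    (1, 1, 'Marco Noronha',    '2013-01-25'),
    (1, 1, 'Marco T. Noronha', '2013-02-01'),  -- latest row for person 1
    (2, 1, 'Marcos Silva',     '2013-01-10'),
    (3, 1, 'Zed',              '2013-01-10');
""")

# Number the rows per (person, field) from newest to oldest, keep only the
# newest one, and only then apply the similarity threshold.
rows = conn.execute("""
SELECT id_person, value, similarity
FROM (
    SELECT id_person, value,
           comparestrings('marco', value) AS similarity,
           ROW_NUMBER() OVER (PARTITION BY id_person, id_field
                              ORDER BY alt_date DESC) AS seqnum
    FROM person_field
) latest
WHERE seqnum = 1 AND similarity > 0.35
ORDER BY similarity DESC
""").fetchall()

for row in rows:
    print(row)
```

Because the threshold is applied after seqnum = 1, an older value of the same person can never sneak into the result, which is the behaviour the question asks for.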

Related

How to select oldest record from sql

So I have a table of appraisals saved for many different vehicles. I want to select all appraisals of a specific appraisal type, but there can be more than one entry of that type for a given record id, and I only want to select the oldest one.
So I have this query (results below)
select
p.seller_opportunity_id
, p.created_at
, p.created_by
, p.type
, p.pricing_output_quote3_rounded_list_price_usd
from
frunk.pricing_events as p
inner join database.opportunity o
inner join database.vehicle_c v on
o.vehicle_id_c = v.id on
p.seller_opportunity_id = o.id
where
o.auto_reject_c is false
and o.stage_name not in (
'Lost'
, 'Sold'
, 'Handover'
)
-- and p.type = 'appraisal-escalated'
and o.id = 'id'
order by
p.created_at desc
Which results in this: [screenshot of the query results]
I want to create a nested query where I can get the pricing_output_quote3_rounded_list_price_usd for one seller_opportunity_id, both for the type appraisal-escalated and for manual-quote, using the values of the first (oldest) records (there can be several, as shown in the screenshot).
Please note that the o.id where clause is for example's sake; in the actual query I'd be querying the whole table with all the ids, so adding
where p.created_at = (select min(p.created_at) from frunk) would not work.
Use a ranking function like dense_rank.
select *
from (select p.*
,dense_rank() over(partition by seller_opportunity_id order by created_at) as rnk
from appraisals p
where type = 'appraisal-escalated'
) t
where rnk = 1
Read more about the function in the documentation
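The question asks for the oldest record of both types, so one way to extend the idea is to partition by type as well. The following is a rough sketch on an in-memory SQLite database from Python. The table layout, the sample rows, and the shortened column name price (standing in for pricing_output_quote3_rounded_list_price_usd) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pricing_events (
    seller_opportunity_id TEXT, created_at TEXT, type TEXT, price REAL);
INSERT INTO pricing_events VALUES
    ('A', '2021-01-01 10:00', 'appraisal-escalated', 100),
    ('A', '2021-01-02 10:00', 'appraisal-escalated', 110),
    ('A', '2021-01-03 10:00', 'manual-quote',        120),
    ('A', '2021-01-04 10:00', 'manual-quote',        130),
    ('B', '2021-01-05 10:00', 'manual-quote',        200);
""")

# Rank rows inside each (opportunity, type) group by age; rank 1 is the oldest.
# DENSE_RANK keeps ties, so two rows sharing the same earliest created_at
# would both be returned.
oldest = conn.execute("""
SELECT seller_opportunity_id, type, price
FROM (
    SELECT p.*,
           DENSE_RANK() OVER (PARTITION BY seller_opportunity_id, type
                              ORDER BY created_at) AS rnk
    FROM pricing_events p
    WHERE type IN ('appraisal-escalated', 'manual-quote')
) t
WHERE rnk = 1
ORDER BY seller_opportunity_id, type
""").fetchall()

for row in oldest:
    print(row)
```

If exactly one row per group is required even when timestamps tie, ROW_NUMBER() with a tie-breaking column in the ORDER BY is the usual substitute.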

Select last record out of grouped records

I have this code and I'd like help changing it into a grouped query that picks rows starting from the bottom.
SELECT *
FROM dbo.users_pics INNER JOIN profile
ON users_pics.email = profile.email
Left Join photo_comment
On users_pics.u_pic_id = photo_comment.pic_id
WHERE users_pics.wardrobe = MMColParam
ORDER BY u_pic_id asc
What I mean is that I have groups of records, and I want to select only one record per group, the last one. For example, if I have 10 records with the name "John", I want to select the last "John" out of the 10, and likewise for the rest.
I'm going to presume that each row in users_pics belongs to a single user, each user has a single profile, and your photo_comment table can contain multiple comments per picture.
Depending on your RDBMS, you can do this a number of ways. Row_Number can often be a quick way of doing this if you're using a database which supports window functions such as SQL Server or Oracle.
A generic solution to this is to join the table back to itself using the MAX aggregate. This is dependent on having a field to determine which record is the max. Generally speaking, that would be an identity/auto number field or a time stamp field.
Here is the basic concept using photo_comment_id as your determining column:
SELECT *
FROM dbo.users_pics INNER JOIN profile
ON users_pics.email = profile.email
LEFT Join (
SELECT pic_id, MAX(photo_comment_id) max_photo_comment_id
FROM photo_comment
GROUP BY pic_id
) max_photo_comment On users_pics.u_pic_id = max_photo_comment.pic_id
LEFT Join photo_comment On
max_photo_comment.pic_id = photo_comment.pic_id AND
max_photo_comment.max_photo_comment_id = photo_comment.photo_comment_id
WHERE users_pics.wardrobe = MMColParam
ORDER BY u_pic_id asc
If your database supports ROW_NUMBER, then you can do this as well (still using the photo_comment_id field):
SELECT *
FROM (
SELECT *,
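-- partition by the picture itself, so pictures without comments (NULL pic_id) are not collapsed into one group
ROW_NUMBER() OVER (PARTITION BY users_pics.u_pic_id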
ORDER BY photo_comment.photo_comment_id DESC) rn
FROM dbo.users_pics INNER JOIN profile
ON users_pics.email = profile.email
LEFT JOIN photo_comment
ON users_pics.u_pic_id = photo_comment.pic_id
WHERE users_pics.wardrobe = MMColParam
) t
WHERE rn = 1
ORDER BY u_pic_id asc
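As a small illustration of the ROW_NUMBER variant, the sketch below runs on an in-memory SQLite database from Python. The table contents are made up, and the profile join and the wardrobe filter are left out to keep it short. It partitions by the picture id from users_pics, so a picture without comments still comes back as its own row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_pics (u_pic_id INTEGER, email TEXT);
CREATE TABLE photo_comment (photo_comment_id INTEGER, pic_id INTEGER, body TEXT);
INSERT INTO users_pics VALUES (1, 'a@example.com'), (2, 'b@example.com'),
                              (3, 'c@example.com');
INSERT INTO photo_comment VALUES (10, 1, 'first'), (11, 1, 'second'),
                                 (12, 2, 'only');
""")

# For each picture, number its comments from newest to oldest and keep the
# newest. Picture 3 has no comments; the LEFT JOIN still yields one row for it.
rows = conn.execute("""
SELECT u_pic_id, body
FROM (
    SELECT up.u_pic_id, pc.body,
           ROW_NUMBER() OVER (PARTITION BY up.u_pic_id
                              ORDER BY pc.photo_comment_id DESC) AS rn
    FROM users_pics up
    LEFT JOIN photo_comment pc ON up.u_pic_id = pc.pic_id
) t
WHERE rn = 1
ORDER BY u_pic_id
""").fetchall()

for row in rows:
    print(row)
```

Selecting explicit columns in the inner query, rather than *, also avoids duplicate column names in the derived table, which some databases reject.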

SQL Server 2012 - Improve query performance

I'm looking for a way to improve the following query.
It collects members of organizations that have a membership of any organization in 2013.
I've been able to determine that the sub-query in this query is the real performance killer, but I can't find a way to remove the subquery and keep the resulting table correct.
The query simply collects all "PersonID" and "MemberId" for people that have a membership in this calendar year. BUT, it is possible to have two memberships in one calendar year. If that should happen, then we only want to select the last membership you have in that calendar year: that's what the subquery is for.
A "WorkingYear" is not the same as a calendar year. A working year can be an entire year, but it can also run from September 2013 to September 2014, for example. That's why I specify that the working year has to start or end in 2013.
This is the query:
SELECT DISTINCT PersonID,
m.id AS MemberId
FROM Members AS m
INNER JOIN WorkingYears AS w
ON m.WorkingYearID = w.ID
AND ( YEAR(w.StartDate) = 2013
OR YEAR(w.EndDate) = 2013 )
WHERE m.Id = (SELECT TOP 1 m2.id
FROM DBA_Member m2
WHERE personid = m.PersonID
AND ( ( droppedOut = 'false' )
OR ( droppedOut = 'true'
AND ( yeardropout = 2013 ) ) )
ORDER BY m.StartDate DESC)
This query should collect about 50.000 rows for me, so obviously it also executes the sub query at least 50.000 times and I'm looking for a way to avoid this. Does anyone have any ideas that could point me in the right direction?
All fields that are used in JOINs should be indexed correctly. There is also a separate index on 'droppedOut' (bit) and on 'yeardropout' (int). I also created an index on both fields together, to no avail.
In the execution plan, I see that an "eager spool" is occurring, that takes up 60% of the query time. It has an outputlist of Member.ID, Member.DroppedOut, Member.YearDropout, which are indeed all the fields that I'm using in my subquery. Also, it gets 50.500 rebinds.
Does anyone have any advice?
You only need to do the sub-query once if you use a CTE
WITH subQall AS
(
select id, personID,
ROW_NUMBER() OVER (PARTITION BY personID ORDER BY StartDate DESC) as rnum
from DBA_Member
WHERE (droppedOut='false') OR (droppedOut='true' AND (yeardropout = 2013))
), subQ AS
(
select id, personID
from subQall
where rnum = 1
)
SELECT DISTINCT m.PersonID, m.id as MemberId
FROM Members AS m
INNER JOIN WorkingYears AS w ON m.WorkingYearID = w.ID
AND (YEAR(w.StartDate) = 2013 OR YEAR(w.EndDate) = 2013)
JOIN subQ ON m.ID = subQ.ID and m.personID = subQ.personID
Can you try CROSS APPLY instead of the sub query? A derived table in a plain JOIN cannot reference m, but an APPLY can.
like this
SELECT DISTINCT PersonID, m.id as MemberId
FROM Members AS m
INNER JOIN WorkingYears AS w ON m.WorkingYearID = w.ID
AND (year(w.StartDate) = 2013 OR year(w.EndDate) = 2013)
CROSS APPLY (select top 1 m2.id ID from DBA_Member m2 where m2.personid = m.PersonID
and ((droppedOut='false') OR (droppedOut='true' AND (yeardropout = 2013)))
order by m2.StartDate desc) Member
WHERE m.Id = Member.ID

How to select only first rows that satisfies conditions?

I'm doing a join between two tables with an added condition, and I want to obtain only the first row that satisfies both the join condition and the "external" condition.
This query for example:
select * from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%';
First of all, I want to know how to group the result of this inner join by profile (v.profile or p.id), and then how to show only the first appearance for each group.
Thanks in advance.
You can use an analytic query for this:
select *
from (
select p.*, v.*,
row_number() over (partition by p.id order by v.userid) as rn
from prmprofile p
join user v on v.profile = p.id
where p.language = 0
and v.userid like '%TEST%'
)
where rn = 1;
The inner query gets all the data (but using * isn't ideal), and adds an extra column that assigns a row number sequence across each p.id value. They have to be ordered by something, and you haven't said what makes a particular row 'first', so I've guessed as user ID - you can of course change that to something more appropriate, that will give consistent results when the query is rerun. (You can look at rank and dense_rank for alternative methods to pick the 'first' row).
The outer query just restricts the result set to those where the extra column has the value 1, which will give you one row for every p.id.
Another approach would be to use a subquery to identify the 'first' row and join to that, but it isn't clear what the criteria would be and if it would be selective enough.
Please try:
select * from(
select *,
row_number() over (partition by p.id order by v.userid) RNum
from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%'
)x
where RNum=1;
You can use the LIMIT keyword (note that this returns a single row overall, not one row per profile).
select * from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%'
limit 1
select * from PRMPROFILE p, user v
where p.id = v.profile and p.language = 0
and v.userid like '%TEST%'
fetch first 1 row only
It will display only the top result
select top 1 * from PRMPROFILE p, user v
where
p.id = v.profile
and p.language = 0
and v.userid like '%TEST%';

How to use SELECT DISTINCT with RANDOM() function in PostgreSQL?

I am trying to run a SQL query to get four random items. As the table product_filter has more than one tuple per product, I have to use DISTINCT in the SELECT, so I get this error:
for SELECT DISTINCT, ORDER BY expressions must appear in select list
But if I put RANDOM() in my SELECT, it defeats the DISTINCT.
Does anyone know how to use DISTINCT with the RANDOM() function? Below is my problematic query.
SELECT DISTINCT
p.id,
p.title
FROM
product_filter pf
JOIN product p ON pf.cod_product = p.cod
JOIN filters f ON pf.cod_filter = f.cod
WHERE
p.visible = TRUE
LIMIT 4
ORDER BY RANDOM();
You either do a subquery
SELECT * FROM (
SELECT DISTINCT p.cod, p.title ... JOIN... WHERE
) AS t ORDER BY RANDOM() LIMIT 4;
or you try GROUPing for those same fields:
SELECT p.cod, p.title, MIN(RANDOM()) AS o FROM ... JOIN ...
WHERE ... GROUP BY p.cod, p.title ORDER BY o LIMIT 4;
Which of the two expressions will evaluate faster depends on table structure and indexing. With proper indexing on cod and title, the subquery version will likely run faster: cod and title can be read from the index, and cod is the only key needed for the JOIN, so if you index by title, cod and visible (used in the WHERE), it is likely that the physical table will not even be accessed at all.
I am not so sure whether this would happen with the second expression too.
You can simplify your query to avoid the problem a priori:
SELECT p.cod, p.title
FROM product p
WHERE p.visible
AND EXISTS (
SELECT 1
FROM product_filter pf
JOIN filters f ON f.cod = pf.cod_filter
WHERE pf.cod_product = p.cod
)
ORDER BY random()
LIMIT 4;
Major points:
You have only columns from table product in the result, other tables are only checked for existence of a matching row. For a case like this the EXISTS semi-join is likely the fastest and simplest solution. Using it does not multiply rows from the base table product, so you don't need to remove them again with DISTINCT.
LIMIT has to come last, after ORDER BY.
I simplified WHERE p.visible = TRUE to p.visible, because this should be a boolean column.
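The points above can be tried out with a small sketch on an in-memory SQLite database from Python. The table contents are invented for illustration, and visible is stored as 0/1 because SQLite has no separate boolean type. Every matching product is linked to two filters, so a plain join would return each product twice, while the EXISTS form returns it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (cod INTEGER PRIMARY KEY, title TEXT, visible INTEGER);
CREATE TABLE filters (cod INTEGER PRIMARY KEY);
CREATE TABLE product_filter (cod_product INTEGER, cod_filter INTEGER);
""")
# Products 1..8; product 6 is hidden.
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(i, f"item {i}", 0 if i == 6 else 1) for i in range(1, 9)])
conn.executemany("INSERT INTO filters VALUES (?)", [(1,), (2,)])
# Products 1..7 each match both filters; product 8 matches none.
conn.executemany("INSERT INTO product_filter VALUES (?, ?)",
                 [(i, f) for i in range(1, 8) for f in (1, 2)])

# EXISTS only checks that a matching filter row is present, so product rows
# are never multiplied and no DISTINCT is needed. LIMIT comes after ORDER BY.
picked = conn.execute("""
SELECT p.cod, p.title
FROM product p
WHERE p.visible
  AND EXISTS (
      SELECT 1
      FROM product_filter pf
      JOIN filters f ON f.cod = pf.cod_filter
      WHERE pf.cod_product = p.cod
  )
ORDER BY random()
LIMIT 4
""").fetchall()

print(picked)
```

Each run prints four different products drawn from the visible ones that have at least one filter; the exact rows vary because of random().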
Use a subquery. Don't forget the table alias, t. LIMIT comes after ORDER BY.
SELECT *
FROM (SELECT DISTINCT a, b, c
FROM datatable WHERE a = 'hello'
) t
ORDER BY random()
LIMIT 10;
I think you need a subquery:
select *
from (select DISTINCT p.cod, p.title
from product_filter pf join
product p
on pf.cod_product = p.cod
where p.visible = 't'
) t
order by RANDOM()
LIMIT 4
Calculate the distinct values first, and then do the limit.
Do note, this does have performance implications, because this query does a distinct on everything before selecting what you want. Whether this matters depends on the size of your table and how you are using the query.
SELECT DISTINCT U.* FROM
(
SELECT p.cod, p.title FROM product_filter pf
JOIN product p on pf.cod_product = p.cod
JOIN filters f on pf.cod_filter = f.cod
WHERE p.visible = 't'
ORDER BY RANDOM()
) AS U
LIMIT 4
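This does the RANDOM first, then the LIMIT afterwards. Be aware that the outer DISTINCT may sort or hash the rows, so the random order from the subquery is not guaranteed to survive.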