Select latest from joined table excluding duplicates

I have two joined tables: the parent holds the unit's name, and the child holds recorded temperatures, which can be inserted either by an automatic process (AUTO) or by a user (USER). For a given unit, the records from a simple join would look like this:
UNIT TEMP TIMESTAMP DATA_SOURCE
ABC -20 10:26 AUTO
ABC -19 11:27 USER
ABC -19 11:27 AUTO
The goal is to select the latest temp reading. I can use subquery to do so:
SELECT A.UNIT, B.TEMP, B.TIMESTAMP,B.DATA_SOURCE
FROM units_table A left outer join readings_table B on A.Gkey=B.unit_gkey
WHERE B.TIMESTAMP=
(SELECT MAX(TIMESTAMP) FROM readings_table B1
WHERE A.Gkey=B1.unit_gkey)
That would be simple, but in the example above there are two identical timestamps, so I get TWO readings. In that case I'd like to ignore the AUTO source. Is there an elegant way to do it?
Edit: to be clear I want only ONE ROW result:
ABC -19 11:27 USER

You can do this with row_number() instead:
SELECT ut.UNIT, rt.TEMP, rt.TIMESTAMP, rt.DATA_SOURCE
FROM units_table ut left outer join
(SELECT rt.*,
row_number() over (partition by rt.unit_Gkey
order by timestamp desc,
(case when rt.data_source = 'AUTO' then 1 else 0 end)
) as seqnum
FROM readings_table rt
) rt
on rt.unit_Gkey = ut.gkey
WHERE rt.seqnum = 1;
Note: if you wanted the duplicates, you would use rank() or dense_rank() instead of row_number() (and remove the second clause in the order by).
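For anyone who wants to try the row_number() approach without a full database, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for the real engine, and the sample rows (including the extra unit XYZ with no readings) are made up for illustration. One deliberate difference from the query above: the seqnum = 1 test sits in the ON clause rather than in WHERE, so units without any readings still come back from the left join.

```python
import sqlite3

# Hypothetical sample data mirroring the question; SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE units_table (gkey INTEGER PRIMARY KEY, unit TEXT);
    CREATE TABLE readings_table (unit_gkey INTEGER, temp INTEGER,
                                 timestamp TEXT, data_source TEXT);
    INSERT INTO units_table VALUES (1, 'ABC'), (2, 'XYZ');  -- XYZ has no readings
    INSERT INTO readings_table VALUES
        (1, -20, '10:26', 'AUTO'),
        (1, -19, '11:27', 'USER'),
        (1, -19, '11:27', 'AUTO');
""")

latest = conn.execute("""
    SELECT ut.unit, rt.temp, rt.timestamp, rt.data_source
    FROM units_table ut
    LEFT JOIN (
        SELECT r.*,
               row_number() OVER (
                   PARTITION BY r.unit_gkey
                   ORDER BY r.timestamp DESC,
                            -- on equal timestamps, non-AUTO rows sort first
                            CASE WHEN r.data_source = 'AUTO' THEN 1 ELSE 0 END
               ) AS seqnum
        FROM readings_table r
    ) rt ON rt.unit_gkey = ut.gkey AND rt.seqnum = 1
    ORDER BY ut.unit
""").fetchall()

print(latest)
```

With the filter in WHERE instead, the XYZ row would be dropped, because its seqnum is NULL after the left join.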

http://www.w3schools.com/sql/sql_distinct.asp
just use the distinct key word look at the example! :)

Related

Recursive subtraction from two separate tables to fill in historical data

I have two datasets hosted in Snowflake with social media follower counts by day. The main table we will be using going forward (follower_counts) shows follower counts by day:
This table is live as of 4/4/2020 and will be updated daily. Unfortunately, I am unable to get historical data in this format. Instead, I have a table with historical data (follower_gains) that shows net follower gains by day for several accounts:
Ideally - I want to take the follower_count value from the minimum date in the current table (follower_counts) and subtract the sum of gains (organic + paid gains) for each day, until the minimum date of the follower_gains table, to fill in the follower_count historically. In addition, there are several accounts with data in these tables, so it would need to be grouped by account. It should look like this:
I've only gotten as far as unioning these two tables together, but don't even know where to start with looping through these rows:
WITH a AS (
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
total_followers_count,
null AS organic_follower_gain,
null AS paid_follower_gain,
account_name,
last_update
FROM follower_counts
UNION ALL
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
null AS total_followers_count,
organic_follower_gain,
paid_follower_gain,
account_name,
last_update
FROM follower_gains)
SELECT
a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.total_followers_count,
a.organic_follower_gain,
a.paid_follower_gain,
a.account_name,
a.last_update
FROM a
ORDER BY date desc LIMIT 100

UPDATE: Changed union to union all and added not exists to remove duplicates. Made changes per the comments.
NOTE: Please make sure you don't post images of the tables. It's difficult to recreate your scenario to write a correct query. Test this solution and update so that I can make modifications if necessary.
You don't loop through rows in SQL because it's not a procedural language. The operation you define in the query is performed for all the rows in a table.
with cte as (SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
(a.follower_count - (b.organic_gain+b.paid_gain)) AS follower_count,
a.account_name,
a.last_update,
b.organic_gain,
b.paid_gain
FROM follower_counts a
JOIN follower_gains b ON a.account_id = b.account_id
AND b.date < (select min(date) from
follower_counts c where a.account_id = c.account_id)
)
SELECT b.account_id,
b.date,
b.organizational_entity,
b.organizational_entity_type,
b.vanity_name,
b.localized_name,
b.localized_website,
b.organization_type,
b.follower_count,
b.account_name,
b.last_update,
b.organic_gain,
b.paid_gain
FROM cte b
UNION ALL
SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.follower_count,
a.account_name,
a.last_update,
NULL as organic_gain,
NULL as paid_gain
FROM follower_counts a where not exists (select 1 from
follower_gains c where a.account_id = c.account_id AND a.date = c.date)
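The "subtract the gains day by day" part of the question can also be expressed as a running sum with a window function. Below is a rough sketch using Python's built-in sqlite3 module as a stand-in for Snowflake. The table and column names follow the question, but the sample numbers are invented, and it assumes that the count for a historical day equals the first live count minus all gains recorded from that day up to the day before the live table starts. If a gain should instead be attributed to the following day, the window frame would need to shift by one row.

```python
import sqlite3

# Invented sample data; SQLite stands in for Snowflake here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follower_counts (account_id INTEGER, date TEXT,
                                  total_followers_count INTEGER);
    CREATE TABLE follower_gains (account_id INTEGER, date TEXT,
                                 organic_follower_gain INTEGER,
                                 paid_follower_gain INTEGER);
    INSERT INTO follower_counts VALUES
        (1, '2020-04-04', 1000), (1, '2020-04-05', 1012),
        (2, '2020-04-04', 500);
    INSERT INTO follower_gains VALUES
        (1, '2020-04-01', 5, 1), (1, '2020-04-02', 10, 0),
        (1, '2020-04-03', 3, 2), (2, '2020-04-03', 2, 0);
""")

backfill = conn.execute("""
    WITH first_live AS (          -- earliest live date per account
        SELECT account_id, MIN(date) AS first_date
        FROM follower_counts
        GROUP BY account_id
    ),
    anchor AS (                   -- follower count on that earliest date
        SELECT c.account_id, c.date AS first_date,
               c.total_followers_count AS first_count
        FROM follower_counts c
        JOIN first_live f
          ON f.account_id = c.account_id AND f.first_date = c.date
    )
    SELECT g.account_id, g.date,
           a.first_count
             - SUM(g.organic_follower_gain + g.paid_follower_gain)
                 OVER (PARTITION BY g.account_id ORDER BY g.date DESC)
             AS total_followers_count
    FROM follower_gains g
    JOIN anchor a ON a.account_id = g.account_id
    WHERE g.date < a.first_date
    ORDER BY g.account_id, g.date
""").fetchall()

for row in backfill:
    print(row)
```

Ordering the window by date descending makes the running sum accumulate backwards in time, which is what replaces the loop. The result could then be combined with follower_counts through UNION ALL.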

You could do something like this. Instead of using the variable, you can wrap the whole expression in another pair of parentheses and write ) AS FollowerGrowth at the end.
DECLARE @FollowerGrowth INT =
( SELECT total_followers_count
FROM follower_gains
WHERE AccountID = xx )
-
( SELECT TOP 1 follower_count
FROM follower_counts
WHERE AccountID = xx
ORDER BY date ASC )

How to check if two cells have the same value?

I am using PSQL to query a database. I'm using two tables (d_items and chartevents) which are linked using itemid.
The following code
select
subject_id, hadm_id, icustay_id
, di.itemid, di.label
, charttime, storetime
, value, valuenum, valueuom
, error, resultstatus
from chartevents ce
inner join d_items di
on ce.itemid = di.itemid
where subject_id BETWEEN 1 AND 10
and di.itemid in
(
8368, 51
)
order by subject_id, charttime, itemid
Outputs:
(Link: https://i.imgur.com/trGnwe5.png)
I only want to keep the measurements that include both systolic and diastolic BP. So actually, each (unique) charttime has to have both. How do I achieve this?

You can use window functions or exists. So, here is one way:
with t as (
select subject_id, hadm_id, icustay_id,
di.itemid, di.label,
charttime, storetime,
value, valuenum, valueuom,
error, resultstatus
from chartevents ce inner join
d_items di
on ce.itemid = di.itemid
where subject_id between 1 and 10 and
di.itemid in (8368, 51)
)
select t.*
from (select t.*,
sum( (itemid = 51):: int) over (partition by subject_id, charttime) as cnt_51,
sum( (itemid = 8368):: int) over (partition by subject_id, charttime) as cnt_8368
from t
) t
where cnt_51 > 0 and cnt_8368 > 0
order by subject_id, charttime, itemid;
I am using the itemid to identify the two measurements. You might need to use like on the label.
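The exists variant mentioned at the top of this answer is not spelled out, so here is a minimal sketch of what it might look like, run against SQLite through Python's sqlite3 module instead of Postgres. The chartevents table is cut down to four columns and the rows are invented: subject 1 has both measurements at 10:00 but only one at 11:00, and subject 2 has only one.

```python
import sqlite3

# Invented, simplified chartevents rows; SQLite stands in for Postgres.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chartevents (subject_id INTEGER, charttime TEXT,
                              itemid INTEGER, valuenum INTEGER);
    INSERT INTO chartevents VALUES
        (1, '2101-01-01 10:00', 51, 120),
        (1, '2101-01-01 10:00', 8368, 80),
        (1, '2101-01-01 11:00', 51, 118),
        (2, '2101-01-01 10:00', 8368, 75);
""")

complete = conn.execute("""
    SELECT ce.subject_id, ce.charttime, ce.itemid, ce.valuenum
    FROM chartevents ce
    WHERE ce.itemid IN (51, 8368)
      AND EXISTS (
          -- the *other* measurement taken for the same subject at the same time
          SELECT 1
          FROM chartevents other
          WHERE other.subject_id = ce.subject_id
            AND other.charttime = ce.charttime
            AND other.itemid IN (51, 8368)
            AND other.itemid <> ce.itemid
      )
    ORDER BY ce.subject_id, ce.charttime, ce.itemid
""").fetchall()

print(complete)
```

Only the two 10:00 rows for subject 1 survive, since they are the only pair recorded at the same charttime.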

I am a developer who uses Oracle, but, I think I can provide some concepts to assist you.
Looking at your table, I think table d_items is merely a label to identify the
data with systolic measurements vs. diastolic measurements. So, we can ignore table
d_items.
I think your goal is to display the systolic BP and the diastolic BP on the same record.
What you want to do is to join table chartevents against itself. I assume that subject_id
and charttime will define a unique set of records. Looking at the output columns, it looks
like value and valuenum represent the same data.
Your table join will look something like this:
Select ... systol.value, diastol.value......
from chartevents systol
join chartevents diastol
on (systol.subject_id = diastol.subject_id
and systol.charttime = diastol.charttime)
where ...
I will leave the rest of the work to you to complete the query.
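One way the rest of that self-join might be filled in is sketched below, using Python's sqlite3 module rather than Oracle or Postgres. It assumes, purely for illustration, that itemid 51 is the systolic reading and 8368 the diastolic one (the question does not say which is which), and the rows are invented.

```python
import sqlite3

# Invented, simplified rows; the 51 = systolic / 8368 = diastolic mapping is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chartevents (subject_id INTEGER, charttime TEXT,
                              itemid INTEGER, valuenum INTEGER);
    INSERT INTO chartevents VALUES
        (1, '2101-01-01 10:00', 51, 120),
        (1, '2101-01-01 10:00', 8368, 80),
        (1, '2101-01-01 11:00', 51, 118),
        (2, '2101-01-01 10:00', 8368, 75);
""")

pairs = conn.execute("""
    SELECT systol.subject_id, systol.charttime,
           systol.valuenum AS systolic,
           diastol.valuenum AS diastolic
    FROM chartevents systol
    JOIN chartevents diastol
      ON systol.subject_id = diastol.subject_id
     AND systol.charttime = diastol.charttime
    WHERE systol.itemid = 51       -- left copy: systolic rows only
      AND diastol.itemid = 8368    -- right copy: diastolic rows only
    ORDER BY systol.subject_id, systol.charttime
""").fetchall()

print(pairs)
```

Because it is an inner join, chart times that have only one of the two measurements drop out automatically.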

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remitts for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:

You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).

Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
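When two remittances share the same timestamp, the row_number() above picks one of them arbitrarily. If a repeatable choice matters, a second sort key can be added. The sketch below shows the idea with Python's sqlite3 module standing in for SQL Server. The tables are reduced to the columns the question lists, the UUIDs and rows are invented, and using remittance_uuid as the tie-breaker is just one possible choice.

```python
import sqlite3

# Invented rows; SQLite stands in for SQL Server. Claim 1 has two remits with the same timestamp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE remittance (remittance_uuid TEXT, active INTEGER,
                             billed_amount REAL);
    CREATE TABLE claims_group2 (remittance_uuid TEXT, claim_id INTEGER,
                                billed_amount REAL, create_datetime TEXT);
    INSERT INTO remittance VALUES
        ('aaa', 0, 10.0), ('bbb', 0, 10.0), ('ccc', 0, 10.0), ('ddd', 0, 25.0);
    INSERT INTO claims_group2 VALUES
        ('aaa', 1, 10.0, '2018-08-15 13:07:50.933'),
        ('bbb', 1, 10.0, '2018-08-15 13:07:50.933'),
        ('ccc', 1, 10.0, '2018-08-01 09:00:00.000'),
        ('ddd', 2, 25.0, '2017-12-31 10:00:00.000');
""")

one_per_claim = conn.execute("""
    SELECT remittance_uuid, claim_id, create_datetime
    FROM (
        SELECT r.remittance_uuid, cg2.claim_id, cg2.create_datetime,
               row_number() OVER (
                   PARTITION BY cg2.claim_id
                   ORDER BY cg2.create_datetime DESC,
                            r.remittance_uuid      -- tie-breaker (an assumption)
               ) AS rn
        FROM remittance r
        JOIN claims_group2 cg2
          ON r.remittance_uuid = cg2.remittance_uuid
        WHERE r.active = 0
          AND r.billed_amount > 0
    ) t
    WHERE rn = 1
    ORDER BY claim_id
""").fetchall()

print(one_per_claim)
```

Any column that is unique within a claim would serve as the tie-breaker; the point is only that the ordering becomes total, so the same row wins on every run.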

A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Grouping records on consecutive dates

If I have following table in Postgres:
order_dtls
Order_id Order_date Customer_name
-------------------------------------
1 11/09/17 Xyz
2 15/09/17 Lmn
3 12/09/17 Xyz
4 18/09/17 Abc
5 15/09/17 Xyz
6 25/09/17 Lmn
7 19/09/17 Abc
I want to retrieve such customer who has placed orders on 2 consecutive days.
In above case Xyz and Abc customers should be returned by query as result.

There are many ways to do this. Using an EXISTS semi-join followed by DISTINCT or GROUP BY should be among the fastest.
Postgres syntax:
SELECT DISTINCT customer_name
FROM order_dtls o
WHERE EXISTS (
SELECT 1 FROM order_dtls
WHERE customer_name = o.customer_name
AND order_date = o.order_date + 1 -- simple syntax for data type "date" in Postgres!
);
If the table is big, be sure to have an index on (customer_name, order_date) to make it fast - index items in this order.
To clarify, since Oto happened to post almost the same solution a bit faster:
DISTINCT is an SQL construct, a syntax element, not a function. Do not use parentheses like DISTINCT (customer_name). That would be short for DISTINCT ROW(customer_name) - a row constructor unrelated to DISTINCT - and just noise for the simple case with a single expression, because Postgres removes the pointless row wrapper for a single element automatically. But if you wrap more than one expression like that, you get an actual row type - an anonymous record actually, since no row type is given. Most certainly not what you want.
What is a row constructor used for?
Also, don't confuse DISTINCT with DISTINCT ON (expr, ...). See:
Select first row in each GROUP BY group?
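To check the EXISTS approach against the sample data from the question, here is a small sketch using Python's built-in sqlite3 module. SQLite has no date type, so the dates are stored as ISO strings and the Postgres expression order_date + 1 is replaced by SQLite's date(order_date, '+1 day'); that substitution is specific to this sketch.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings for SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_dtls (order_id INTEGER PRIMARY KEY,
                             order_date TEXT, customer_name TEXT);
    INSERT INTO order_dtls VALUES
        (1, '2017-09-11', 'Xyz'), (2, '2017-09-15', 'Lmn'),
        (3, '2017-09-12', 'Xyz'), (4, '2017-09-18', 'Abc'),
        (5, '2017-09-15', 'Xyz'), (6, '2017-09-25', 'Lmn'),
        (7, '2017-09-19', 'Abc');
    -- index in the column order recommended above
    CREATE INDEX order_dtls_cust_date ON order_dtls (customer_name, order_date);
""")

consecutive = conn.execute("""
    SELECT DISTINCT o.customer_name
    FROM order_dtls o
    WHERE EXISTS (
        SELECT 1
        FROM order_dtls n
        WHERE n.customer_name = o.customer_name
          AND n.order_date = date(o.order_date, '+1 day')  -- next calendar day
    )
    ORDER BY o.customer_name
""").fetchall()

print(consecutive)
```

Xyz qualifies through the 11th and 12th, Abc through the 18th and 19th, and Lmn has no pair of adjacent days.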

Try something like...
SELECT `order_dtls`.*
FROM `order_dtls`
INNER JOIN `order_dtls` AS mirror
ON `order_dtls`.`Order_id` <> `mirror`.`Order_id`
AND `order_dtls`.`Customer_name` = `mirror`.`Customer_name`
AND DATEDIFF(`order_dtls`.`Order_date`, `mirror`.`Order_date`) = 1

The way I would do it is to join the table with itself on the next date and on Customer_name.
This way you can ensure that the same customer placed orders on 2 consecutive days.
For MySQL:
SELECT DISTINCT t1.Customer_name
FROM order_dtls t1
INNER JOIN order_dtls t2 on
t1.Order_date = DATE_ADD(t2.Order_date, INTERVAL 1 DAY) and
t1.Customer_name = t2.Customer_name
You should also select the result with the DISTINCT keyword so that the same customer is not displayed more than once.

For postgresql:
select distinct(Customer_name) from your_table
where exists
(select 1 from your_table t1
where
Customer_name = your_table.Customer_name and Order_date = your_table.Order_date+1 )
Same for MySQL, just instead of your_table.Order_date+1 use: DATE_ADD(your_table.Order_date , INTERVAL 1 DAY)

This should work:
SELECT A.customer_name
FROM order_dtls A
INNER JOIN (SELECT customer_name, order_date FROM order_dtls) as B
ON(A.customer_name = B.customer_name and Datediff(B.Order_date, A.Order_date) =1)
group by A.customer_name

How can I select adjacent rows to an arbitrary row (in sql or postgresql)?

I want to select some rows based on certain criteria, and then take one entry from that set and the 5 rows before it and after it.
Now, I can do this numerically if there is a primary key on the table, (e.g. primary keys that are numerically 5 less than the target row's key and 5 more than the target row's key).
So select the row with the primary key of 7 and the nearby rows:
select primary_key from table where primary_key >= (7-5) order by primary_key limit 11;
2
3
4
5
6
-=7=-
8
9
10
11
12
But if I select only certain rows to begin with, I lose that numeric method of using primary keys (and that was assuming the keys didn't have any gaps in their order anyway), and need another way to get the closest rows before and after a certain targeted row.
The primary key output of such a select might look more random and thus less susceptible to being located mathematically (since some results would be filtered out, e.g. with a where active=1):
select primary_key from table where primary_key > (34-5) and active=1
order by primary_key limit 11;
30
-=34=-
80
83
100
113
125
126
127
128
129
Note how, due to the gaps in the primary keys caused by the example where condition (for example, because there are many inactive items), I'm no longer getting the closest 5 above and 5 below; instead I'm getting the closest 1 below and the closest 9 above.

There's a lot of ways to do it if you run two queries with a programming language, but here's one way to do it in one SQL query:
(SELECT * FROM table WHERE id >= 34 AND active = 1 ORDER BY id ASC LIMIT 6)
UNION
(SELECT * FROM table WHERE id < 34 AND active = 1 ORDER BY id DESC LIMIT 5)
ORDER BY id ASC
This would return the 5 rows above, the target row, and 5 rows below.
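Here is a quick way to try the two-query UNION idea, sketched with Python's built-in sqlite3 module. The table name items and its rows are hypothetical. SQLite does not accept parentheses around the members of a compound SELECT, so each limited query is wrapped in a subquery instead; in Postgres or MySQL the parenthesized form above works as written.

```python
import sqlite3

# Hypothetical table with gaps: only some ids are active.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, active INTEGER)")
active_ids = [3, 8, 12, 20, 25, 30, 34, 80, 83, 100, 113, 125, 126]
inactive_ids = [31, 32, 33, 35, 40, 90]
conn.executemany("INSERT INTO items VALUES (?, 1)", [(i,) for i in active_ids])
conn.executemany("INSERT INTO items VALUES (?, 0)", [(i,) for i in inactive_ids])

neighbours = conn.execute("""
    SELECT id FROM (
        -- the target row plus the 5 closest active rows after it
        SELECT id FROM items
        WHERE id >= :target AND active = 1
        ORDER BY id ASC LIMIT 6
    )
    UNION ALL
    SELECT id FROM (
        -- the 5 closest active rows before it
        SELECT id FROM items
        WHERE id < :target AND active = 1
        ORDER BY id DESC LIMIT 5
    )
    ORDER BY id ASC
""", {"target": 34}).fetchall()

neighbour_ids = [row[0] for row in neighbours]
print(neighbour_ids)
```

UNION ALL is used because the two halves cannot overlap (one is >= the target, the other is < it), so there is nothing to deduplicate.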

Here's another way to do it with the analytic functions lead and lag. It would be nice if we could use analytic functions in the WHERE clause, but we can't, so you need to use subqueries or CTEs instead. Here's an example that will work with the pagila sample database.
WITH base AS (
SELECT lag(customer_id, 5) OVER (ORDER BY customer_id) lag,
lead(customer_id, 5) OVER (ORDER BY customer_id) lead,
c.*
FROM customer c
WHERE c.active = 1
AND c.last_name LIKE 'B%'
)
SELECT base.* FROM base
JOIN (
-- Select the center row, coalesce so it still works if there aren't
-- 5 rows in front or behind
SELECT COALESCE(lag, 0) AS lag, COALESCE(lead, 99999) AS lead
FROM base WHERE customer_id = 280
) sub ON base.customer_id BETWEEN sub.lag AND sub.lead
The problem with sgriffinusa's solution is that you don't know which row_number your center row will end up being. He assumed it will be row 30.

For similar query I use analytic functions without CTE. Something like:
select ...,
LEAD(gm.id) OVER (ORDER BY Cit DESC) as leadId,
LEAD(gm.id, 2) OVER (ORDER BY Cit DESC) as leadId2,
LAG(gm.id) OVER (ORDER BY Cit DESC) as lagId,
LAG(gm.id, 2) OVER (ORDER BY Cit DESC) as lagId2
...
where id = 25912
or leadId = 25912 or leadId2 = 25912
or lagId = 25912 or lagId2 = 25912
This query runs faster for me than the CTE with a join (the answer from Scott Bailey), but it is of course less elegant.

You could do this utilizing row_number() (available as of 8.4). This may not be the correct syntax (not familiar with postgresql), but hopefully the idea will be illustrated:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primary_key) AS r, *
FROM table
WHERE active=1) t
WHERE 25 < r and r < 35
This will generate a first column having sequential numbers. You can use this to identify the single row and the rows above and below it.
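The row number of the target row does not have to be guessed: the numbered set can be joined to itself to look it up. Below is a rough sketch of that variation using Python's sqlite3 module. The table name items is hypothetical, and the active ids are the ones from the question's second example, so the target 34 has only one neighbour below it.

```python
import sqlite3

# Hypothetical table; the active ids match the question's second example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, active INTEGER)")
active_ids = [30, 34, 80, 83, 100, 113, 125, 126, 127, 128, 129]
inactive_ids = [29, 31, 32, 33, 35]
conn.executemany("INSERT INTO items VALUES (?, 1)", [(i,) for i in active_ids])
conn.executemany("INSERT INTO items VALUES (?, 0)", [(i,) for i in inactive_ids])

window = conn.execute("""
    WITH numbered AS (
        -- number only the rows that pass the filter, so r has no gaps
        SELECT id, row_number() OVER (ORDER BY id) AS r
        FROM items
        WHERE active = 1
    )
    SELECT n.id
    FROM numbered n
    JOIN numbered t ON t.id = :target        -- t is the target row
    WHERE n.r BETWEEN t.r - 5 AND t.r + 5    -- up to 5 rows either side
    ORDER BY n.r
""", {"target": 34}).fetchall()

window_ids = [row[0] for row in window]
print(window_ids)
```

Near the edge of the set the window simply comes back shorter on that side, here one row below the target and five above.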

If you wanted to do it in a 'relationally pure' way, you could write a query that sorted and numbered the rows. Like:
select (
select count(*) from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
Then use that as a common table expression. Write a select which filters it down to the rows you're interested in, then join it back onto itself using a criterion that the index of the right-hand copy of the table is no more than k larger or smaller than the index of the row on the left. Project over just the rows on the right. Like:
with numbered_emps as (
select (
select count(*)
from employees b
where b.name < a.name
) as idx, name
from employees a
order by name
)
select b.*
from numbered_emps a, numbered_emps b
where a.name like '% Smith' -- this is your main selection criterion
and ((b.idx - a.idx) between -5 and 5) -- this is your adjacency fuzzy-join criterion
What could be simpler!
I'd imagine the row-number based solutions will be faster, though.
