in-house records over outside records - sql

I am stuck with this problem.
I have duplicates in my query coming from that some of the records are in-house and outside simultaneously. I prefer in-house over outside but some of the outside are preferable when there is no entrance from in-house.
Select id, date, location from prod
id date location
----------
2 01/01/2012 in-house
2 05/01/2012 outside <- in this situation i want to keep just in-house
id date location
----------
4 01/01/2012 in-house
5 03/01/2012 outside <- in this situation i want to keep both since there is no db entry for id=5 therefor i have just info from outside
Could someone help?

One way to do this is to do a full outer join from your table to itself and then use coalesce.
Select
COALESCE(Inside.Id, outside.id) Id,
COALESCE(Inside.date, outside.date) Date,
COALESCE(Inside.location, outside.location) Location
From
prod Inside
FULL OUTER JOIN prod Outside
ON Inside.id = Outside.iD
and Inside.location <> Outside.Location
Where
(Inside.Location = 'in-house'
or
Inside.Location is null)
and
(outside.Location = 'outside'
or
outside.Location is null)
DEMO
Notes:
If your fields can be nullable you may want to use a Case statement instead of coalesce and use the ID field to determine which table to use. Using the date as an example
CASE WHEN Inside.Id is not null THEN Inside.date ELSE outside.date END date
As Dems noted this also assumes that {id, location} is unique.
UPDATE
Since you're using SQL Server and {ID, Location} isn't unique and you want the max date value and always choosing in-house over outside then you can use ROW_NUMBER/WHERE RowNumber = 1 effectively here, by ordering First on location and then by date.
WITH cte
AS (SELECT Row_number() OVER ( partition BY ID
ORDER BY CASE LOCATION WHEN 'in-house' THEN 0
WHEN 'outside' THEN 1 END,
DATE DESC) rn,
ID,
Date,
Location
FROM prod)
SELECT ID,
Date,
Location
FROM cte
WHERE rn = 1
Demo
Note We didn't have to use a case statement but I wanted the mapping to be explicit.

Related

Counting how many times one specific value changed to another specific value in order of date range and grouped by ID

I have a table like below where I need to query a count of how many times each ID went from specifically 'Waste Sale' in one value to 'On Stop' in the very next value based on ascending date and if there are no instances of this, the count will be 0
ID
Stage name
Stage Changed Date
1
Waste Sale
06-05-2022
1
On Stop
08-06-2022
1
Cancelled
09-02-2022
2
Waste Sale
06-05-2022
2
On Stop
07-05-2022
2
Waste Sale
08-06-2022
2
On Stop
10-07-2022
3
Cancelled
10-07-2022
3
On Stop
11-07-2022
The result I would be looking for based on the above table would be something like this:
ID
Count of 'Waste Sales to On Stops'
1
1
2
2
3
0
ID 1 having a count of 1 because there was one instance of 'Waste Sale' changing to 'On Stop' in the very next value based on date range
ID 3 having a count of 0 because even though the stage name changed to 'On Stop' the previous value based on date range wasn't 'Waste Sale'.
I have a hunch I would have to use something like LEAD() and GROUP BY/ ORDER BY but since I'm so new to SQL would really appreciate some help on the specific syntax and coding. Any version of SQL is okay.
We can use window function lead to take a peek at the next value of the query result.
select distinct id,
(
select count(*)
from
(
select *,
lead(stage_name)
over(
partition by id
order by stage_changed_date)
as stage_next
from sales s2
) s3
where s3.id = s1.id
and s3.stage_name = 'waste sale'
and s3.stage_next = 'on stop'
) as count_of_waste_sales_to_on_stop
from sales s1
order by id;
Query above uses lead(stage_name) over(partition by id order by stage_changed_date) to get the next stage_name in the query result while segregating it by id and order it based on stage_changed_date. Check the query on DB Fiddle.
Note:
I have no experience in zoho, so i'm unsure if the query will 100% works or not. They said it supported ansi-sql, however there might some differences with MySQL due to reasons.
The column names are not the exact same with op question due to testing only done using DB Fiddle.
There might better query out there waiting to be written.

T-SQL Duplicate, selecting 'master" record based on Modified Date

I have a database with two ID fields, one assigned as a GUID by the system, and a ExternalID, which is used to denote duplicates after data cleansing, the table also contains a ModifiedDate
I am trying to merge these records, with the most recently modified record absorbing the older accounts. I have tried the following query types.
SELECT
a1.GUID
,a1.ModifiedDate
,a2.GUID
,a2,ModifiedDate
FROM Accounts a1
INNER JOIN Accounts a2 on a1.ExternalID = a2.ExternalID
This unfortunately causes the duplicate accounts to appear twice, once for the Master record, again for the subordinate record, which returns the Master record as a duplicate.
WITH Dup as (
SELECT 1 as track
,ExternalID DomEx
,ExternalID
,GUID DomGUID
,ModDate
from crm.Accounts
WHERE ExternalID is not null
UNION ALL
SELECT track +1
,OI.DomEx
,OG.ExternalID
,OG.GUID
,Og.ModDate
from crm.Accounts OG
INNER JOIN Dup OI on OI.ExternalID = OG.ExternalID
)
,
cte_dp as(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ExternalID Order by track, ModDate desc) rn
FROM Dup
)
SELECT * FROM cte_dp
This unfortunately reaches the recursion limit of 100, and runs indefinitely if the limit is escaped.
Is it possible to correct the logic here in order to present the results required, or is there a more elegant solution.
+--------------+---------------------+--------------------+--+
| MasterGUID | SharedExternalID | SubordinateGUID | |
+--------------+---------------------+--------------------+--+
| (MasterGUID) | (SharedExternalID) | (SubordinateGUID) | |
| (MasterGUID) | (Shared ExternalID) | (SubordinateGUID) | |
+--------------+---------------------+--------------------+--+
Is the result I would ideally like to achieve, with MasterGUID being the GUID with the most recent modified date from between the two duplicates.
MERGE Accounts a1
USING Accounts a2
ON a1.ExternalID = a2.ExternalID
WHEN MATCHED THEN
UPDATE
SET a1.ModifiedDate = a2.ModifiedDate,
a1.guid = a2.guid;
SELECT * FROM Accounts a1;
a1.ExternalID = a2.ExternalID
is symmetric, so if you switch the order, the relationship will have the same logical result. So, if you find such a pair (for instance, self), then it will appear twice in the result. We need to break the symmetry with an additional condition:
a1.ExternalID = a2.ExternalID and a1.GUID < a2.GUID
This will prevent joining with self. If that is needed, you can use union, but for now I will assume that is not needed. If there is another match of ExternalID, the match will yield true if the left side has a strictly smaller GUID than the right side, therefore the inverse will not be true and the duplicates will go away.
If you post sample data this will be easier but is this what you mean?
SELECT *
FROM (
select *
, ROW_NUMBER() OVER (PARTITION BY ExternalID ORDER BY ModifiedDate DESC) rnk
from accounts
) i
WHERE i.rnk = 1

Determine records which held particular "state" on a given date

I have a state machine architecture, where a record will have many state transitions, the one with the greatest sort_key column being the current state. My problem is to determine which records held a particular state (or states) for a given date.
Example data:
items table
id
1
item_transitions table
id item_id created_at to_state sort_key
1 1 05/10 "state_a" 1
2 1 05/12 "state_b" 2
3 1 05/15 "state_a" 3
4 1 05/16 "state_b" 4
Problem:
Determine all records from items table which held state "state_a" on date 05/15. This should obviously return the item in the example data, but if you query with date "05/16", it should not.
I presume I'll be using a LEFT OUTER JOIN to join the items_transitions table to itself and narrow down the possibilities until I have something to query on that will give me the items that I need. Perhaps I am overlooking something much simpler.
Your question rephrased means "give me all items which have been changed to state_a on 05/15 or before and have not changed to another state afterwards. Please note that for the example it added 2001 as year to get a valid date. If your "created_at" column is not a datetime i strongly suggest to change it.
So first you can retrieve the last sort_key for all items before the threshold date:
SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transistions it
WHERE created_at<='05/15/2001'
GROUP BY item_id
Next step is to join this result back to the item_transitions table to see to which state the item was switched at this specific sort_key:
SELECT *
FROM item_transistions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transistions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
Finally you only want those who switched to 'state_a' so just add a condition:
SELECT DISTINCT it.item_id
FROM item_transistions it
JOIN (SELECT item_id,max(sort_key) last_change_sort_key
FROM item_transistions it
WHERE created_at<='05/15/2001'
GROUP BY item_id) tmp ON it.item_id=tmp.item_id AND it.sort_key=tmp.last_change_sort_key
WHERE it.to_state='state_a'
You did not mention which DBMS you use but i think this query should work with the most common ones.

SQL Query to remove cyclic redundancy

I have a table that looks like this:
Column A | Column B | Counter
---------------------------------------------
A | B | 53
B | C | 23
A | D | 11
C | B | 22
I need to remove the last row because it's cyclic to the second row. Can't seem to figure out how to do it.
EDIT
There is an indexed date field. This is for Sankey diagram. The data in the sample table is actually the result of a query. The underlying table has:
date | source node | target node | path count
The query to build the table is:
SELECT source_node, target_node, COUNT(1)
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
GROUP BY source_node, target_node
In the sample, the last row C to B is going backwards and I need to ignore it or the Sankey won't display. I need to only show forward path.
Removing all edges from your graph where the tuple (source_node, target_node) is not ordered alphabetically and the symmetric row exists should give you what you want:
DELETE
FROM sankey_table t1
WHERE source_node > target_node
AND EXISTS (
SELECT NULL from sankey_table t2
WHERE t2.source_node = t1.target_node
AND t2.target_node = t1.source_node)
If you don't want to DELETE them, just use this WHERE clause in your query for generating the input for the diagram.
If you can adjust how your table is populated, you can change the query you're using to only retrieve the values for the first direction (for that date) in the first place, with a little bit an analytic manipulation:
SELECT source_node, target_node, counter FROM (
SELECT source_node,
target_node,
COUNT(*) OVER (PARTITION BY source_node, target_node) AS counter,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
)
WHERE rnk = 1;
The inner query gets the same data you collect now but adds a ranking column, which will be 1 for the first row for any source/target pair in any order for a given day. The outer query then just ignores everything else.
This might be a candidate for a materialised view if you're truncating and repopulating it daily.
If you can't change your intermediate table but can still see the underlying table you could join back to it using the same kind of idea; assuming the table you're querying from is called sankey_agg_table:
SELECT sat.source_node, sat.target_node, sat.counter
FROM sankey_agg_table sat
JOIN (SELECT source_node, target_node,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table) st
ON st.source_node = sat.source_node
AND st.target_node = sat.target_node
AND st.rnk = 1;
SQL Fiddle demos.
DELETE FROM yourTable
where [Column A]='C'
given that these are all your rows
EDIT
I would recommend that you clean up your source data if you can, i.e. delete the rows that you call backwards, if those rows are incorrect as you state in your comments.

How to get a value from previous result row of a SELECT statement?

If we have a table called FollowUp and has rows [ ID(int) , Value(Money) ]
and we have some rows in it, for example
ID --Value
1------70
2------100
3------150
8------200
20-----250
45-----280
and we want to make one SQL Query that get each row ID,Value and the previous Row Value in which data appear as follow
ID --- Value ---Prev_Value
1 ----- 70 ---------- 0
2 ----- 100 -------- 70
3 ----- 150 -------- 100
8 ----- 200 -------- 150
20 ---- 250 -------- 200
45 ---- 280 -------- 250
i make the following query but i think it's so bad in performance in huge amount of data
SELECT FollowUp.ID, FollowUp.Value,
(
SELECT F1.Value
FROM FollowUp as F1 where
F1.ID =
(
SELECT Max(F2.ID)
FROM FollowUp as F2 where F2.ID < FollowUp.ID
)
) AS Prev_Value
FROM FollowUp
So can anyone help me to get the best solution for such a problem ?
This sql should perform better then the one you have above, although these type of queries tend to be a little performance intensive... so anything you can put in them to limit the size of the dataset you are looking at will help tremendously. For example if you are looking at a specific date range, put that in.
SELECT followup.value,
( SELECT TOP 1 f1.VALUE
FROM followup as f1
WHERE f1.id<followup.id
ORDER BY f1.id DESC
) AS Prev_Value
FROM followup
HTH
You can use the OVER statement to generate nicely increasing row numbers.
select
rownr = row_number() over (order by id)
, value
from your_table
With the numbers, you can easily look up the previous row:
with numbered_rows
as (
select
rownr = row_number() over (order by id)
, value
from your_table
)
select
cur.value
, IsNull(prev.value,0)
from numbered_rows cur
left join numbered_rows prev on cur.rownr = prev.rownr + 1
Hope this is useful.
This is not an answer to your actual question.
Instead, I feel that you are approaching the problem from a wrong direction:
In properly normalized relational databases the tuples ("rows") of each table should contain references to other db items instead of the actual values. Maintaining these relations between tuples belongs to the data insertion part of the codebase.
That is, if containing the value of a tuple with closest, smaller id number really belongs into your data model.
If the requirement to know the previous value comes from the view part of the application - that is, a single view into the data that needs to format it in certain way - you should pull the contents out, sorted by id, and handle the requirement in view specific code.
In your case, I would assume that knowing the previous tuples' value really would belong in the view code instead of the database.
EDIT: You did mention that you store them separately and just want to make a query for it. Even still, application code would probably be the more logical place to do this combining.
What about pulling the lines into your application and computing the previous value there?
Create a stored procedure and use a cursor to iterate and produce rows.
You could use the function 'LAG'.
SELECT ID,
Value,
LAG(value) OVER(ORDER BY ID) AS Prev_Value
FROM FOLLOWUP;