How to deal with duplicate values in dataset

I built a QuickSight dashboard that displays ticket details; the ticket data is fetched by running a cradle job. Each ticket is flagged as a true or false positive based on the tag added to it. However, when I delete a tag and add another one, the query reads both tag values. I only want to read the latest value of the tag and classify the ticket based on that value. How can I make my query read only the latest tag value? Here is the query snippet I am using:
WITH sim
AS (
SELECT *
FROM DEFAULT.o_remedy_sim_tickets_poc
WHERE run_date = (
SELECT max(run_date)
FROM DEFAULT.o_remedy_sim_tickets_poc
)
)
,sim_audit
AS (
SELECT case_id
,array_join(array_agg(to_string), ',') tag
FROM "default"."o_remedy_sim_audittrail_poc"
WHERE run_date = (
SELECT max(run_date)
FROM "default"."o_remedy_sim_audittrail_poc"
)
AND upper(description) in ('TAG', 'TAGS')
GROUP BY case_id
)
SELECT sim.*
,Extract(DAY FROM (coalesce(resolved_date, now()) - assigned_date)) age
,Extract(DAY FROM (resolved_date - assigned_date)) AS time_to_resolve
,(abs(round((Extract(DAY FROM (coalesce(resolved_date, now()) - assigned_date)) / 7), 0)) + 1) * 7 AS age_distrubtion
,CASE
WHEN Upper(tag) LIKE '%FALSE%POSITIVE%'
THEN 1
ELSE 0
END IS_FALSE_POSITIVE
,CASE
WHEN Upper(tag) LIKE '%TRUE%POSITIVE%'
THEN 1
ELSE 0
END IS_TRUE_POSITIVE
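
One possible way to keep only the latest tag per ticket is to rank the audit rows per case_id by a change timestamp and keep rank 1, instead of aggregating every tag with array_agg. The sketch below runs the idea on SQLite through Python so it can be tried locally; the table name audit and the column create_date are assumptions (the original audit trail must expose some ordering column, but the question does not name it), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for o_remedy_sim_audittrail_poc; create_date is an assumed column.
CREATE TABLE audit (case_id TEXT, description TEXT, to_string TEXT, create_date TEXT);
INSERT INTO audit VALUES
  ('T1', 'Tags', 'false-positive', '2023-01-01 10:00:00'),
  ('T1', 'Tags', 'true-positive',  '2023-01-02 09:00:00'),
  ('T2', 'Tag',  'false-positive', '2023-01-03 12:00:00');
""")

LATEST_TAG_SQL = """
WITH ranked AS (
    SELECT case_id,
           to_string AS tag,
           -- newest tag change per ticket gets rn = 1
           ROW_NUMBER() OVER (PARTITION BY case_id ORDER BY create_date DESC) AS rn
    FROM audit
    WHERE upper(description) IN ('TAG', 'TAGS')
)
SELECT case_id,
       tag,
       CASE WHEN upper(tag) LIKE '%FALSE%POSITIVE%' THEN 1 ELSE 0 END AS is_false_positive,
       CASE WHEN upper(tag) LIKE '%TRUE%POSITIVE%'  THEN 1 ELSE 0 END AS is_true_positive
FROM ranked
WHERE rn = 1
ORDER BY case_id
"""

latest_rows = conn.execute(LATEST_TAG_SQL).fetchall()
for row in latest_rows:
    print(row)
```

With the sample rows, ticket T1 is classified only by its newer tag. The ranked CTE would take the place of sim_audit in the original query; on Presto-style engines, max_by(to_string, <timestamp column>) grouped by case_id is another option worth checking.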

Related

Oracle SQL - Timestamp splits query result into 2 rows, Need all in one with

I need a time-based query (Random or Current) with all results in one row. My current query is as follows:
WITH started AS
(
SELECT f.*, CURRENT_DATE + ROWNUM / 24
FROM
(
SELECT
d.route_name,
d.op_name,
d.route_step_name,
nvl(MAX(DECODE(d.complete_reason, NULL, d.op_STARTS)), 0) started_units,
round(nvl(MAX(DECODE(d.complete_reason, 'PASS', d.op_complete)), 0) / d.op_starts * 100, 2) yield
FROM
(
SELECT route_name,
op_name,
route_step_name,
complete_reason,
complete_quantity,
sum(start_quantity) OVER(PARTITION BY route_name, op_name, COMPLETE_REASON) op_starts,
sum(complete_quantity) OVER(PARTITION BY route_name, op_name, COMPLETE_REASON ) op_complete
FROM FTPC_LT_PRDACT.tracked_object_history
WHERE route_name = 'HEADER FINAL ASSEMBLY'
AND OP_NAME NOT LIKE '%DISPOSITION%'
and (tobj_type = 'Lot')
AND xfr_insert_pid IN
(
SELECT xfr_start_id
FROM FTPC_LT_PRDACT.xfr_interval_id
WHERE last_modified_time <= SYSDATE
AND OP_NAME NOT LIKE '%DISPOSITION%'
and complete_reason = 'PASS' OR complete_reason IS NULL
)
) d
GROUP BY d.route_name, d.op_name, d.route_step_name, complete_reason, d.op_starts
ORDER BY d.route_step_name
) f
),
queued AS
(
SELECT
ts.route_name,
ts.queue_name,
o.op_name,
sum (th.complete_quantity) queued_units
FROM
FTPC_LT_PRDACT.tracked_object_HISTORY th,
FTPC_LT_PRDACT.tracked_object_status ts,
FTPC_LT_PRDACT.route_arc a,
FTPC_LT_PRDACT.route_step r,
FTPC_LT_PRDACT.operation o,
FTPC_LT_PRDACT.lot l
WHERE r.op_key = o.op_key
and l.lot_key = th.tobj_key
AND a.to_node_key = r.route_step_key
AND a.from_node_key = ts.queue_key
and th.tobj_history_key = ts.tobj_history_key
AND a.main_path = 1
AND (ts.tobj_type = 'Lot')
AND O.OP_NAME NOT LIKE '%DISPOSITION%'
and th.route_name = 'HEADER FINAL ASSEMBLY'
GROUP BY ts.route_name, ts.queue_name, o.op_name
)
SELECT
started.route_name,
started.op_name,
started.route_step_name,
max(started.yield) started_yield,
max(started.started_units) started_units,
case when queued.queue_name is NULL then 'N/A' else queued.queue_name end QUEUE_NAME,
case when queued.queued_units is NULL then 0 else queued.queued_units end QUEUED_UNITS
FROM started
left JOIN queued ON started.op_name = queued.op_name
group by started.route_name, started.op_name, started.route_step_name, queued.queue_name, QUEUED_UNITS
order by started.route_step_name asc
;
Current query result (as expected, but missing a timestamp):
I need a timestamp on each individual row so that a different application can display the results. Any help would be greatly appreciated! When I try to add a timestamp, the query result changes:
Query result once the timestamp is added:
Edit: I need to display the query in a visualization tool. That tool is time based and will skew the table results unless there is a datetime associated with each row. The datetime value can be random, but it cannot be the same for every row.
The query is displayed on a live dashboard; every time the application is refreshed, the query is expected to run again.
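
A guess at the cause, since the result screenshots are not available: if the per-row timestamp is produced before the final GROUP BY, every inner row carries a different value and the rows no longer collapse. A minimal sketch of the alternative, generating the timestamp after the aggregation, is shown below on SQLite through Python. The table contents, the fixed base time and the column name row_ts are hypothetical; in Oracle the same shape would use ROW_NUMBER() with date arithmetic such as CURRENT_DATE + n / 24.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for the "started" CTE: two partial rows per operation.
CREATE TABLE started (op_name TEXT, route_step_name TEXT, yield REAL, started_units INTEGER);
INSERT INTO started VALUES
  ('SOLDER', 'STEP 10',  0.0, 120),
  ('SOLDER', 'STEP 10', 97.5,   0),
  ('TEST',   'STEP 20',  0.0,  80),
  ('TEST',   'STEP 20', 99.0,   0);
""")

TIMESTAMP_AFTER_GROUPING_SQL = """
WITH collapsed AS (
    -- aggregate first, with no timestamp column taking part in the grouping
    SELECT op_name, route_step_name,
           MAX(yield)         AS started_yield,
           MAX(started_units) AS started_units
    FROM started
    GROUP BY op_name, route_step_name
)
SELECT c.*,
       -- one distinct timestamp per final row: base time plus n hours
       datetime(:base, '+' || ROW_NUMBER() OVER (ORDER BY route_step_name) || ' hours') AS row_ts
FROM collapsed c
ORDER BY route_step_name
"""

stamped_rows = conn.execute(
    TIMESTAMP_AFTER_GROUPING_SQL, {"base": "2024-01-01 00:00:00"}
).fetchall()
for row in stamped_rows:
    print(row)
```

Each operation stays on a single row and receives its own timestamp, because the timestamp never appears in a GROUP BY list.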

User Life Cycle SQL Query Logic in Snowflake

I am building a query to track the life cycle of a user through the platform via events. The table EVENTS has 3 columns: USER_ID, DATE_TIME and EVENT_NAME. Below is a snapshot of the table:
My query should return the result below: the first timestamp of the registered event, followed by the timestamp of the next log_in event after it, followed by the timestamp of the next landing_page event after that log_in.
Below is my query:
WITH FIRST_STEP AS
(SELECT
USER_ID,
MIN(CASE WHEN EVENT_NAME = 'registered' THEN DATE_TIME ELSE NULL END) AS REGISTERED_TIMESTAMP
FROM EVENTS
GROUP BY 1
),
SECOND_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'log_in'
ORDER BY DATE_TIME
),
THIRD_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'landing_page'
ORDER BY DATE_TIME
)
SELECT
a.USER_ID,
a.REGISTERED_TIMESTAMP,
(SELECT
CASE WHEN b.DATE_TIME >= a.REGISTRATIONS_TIMESTAMP THEN b.DATE_TIME END AS LOG_IN_TIMESTAMP
FROM SECOND_STEP
LIMIT 1
),
(SELECT
CASE WHEN c.DATE_TIME >= LOG_IN_TIMESTAMP THEN c.DATE_TIME END AS LANDING_PAGE_TIMESTAMP
FROM THIRD_STEP
LIMIT 1
)
FROM FIRST_STEP AS a
LEFT JOIN SECOND_STEP AS b ON a.USER_ID = b.USER_ID
LEFT JOIN THIRD_STEP AS c ON b.USER_ID = c.USER_ID;
Unfortunately, I get the error "SQL compilation error: Unsupported subquery type cannot be evaluated" when I try to run the query.

This is a perfect use case for MATCH_RECOGNIZE.
The pattern you are looking for is register anything* login anything* landing and the measures are the min(iff(event_name='x', date_time, null)) for each.
Check:
https://towardsdatascience.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1
https://docs.snowflake.com/en/user-guide/match-recognize-introduction.html
Set the output to one row per match.
Untested sample query:
-- table and event names follow the question: EVENTS, 'registered', 'log_in', 'landing_page'
select *
from events
match_recognize(
partition by user_id
order by date_time
measures min(iff(event_name='registered', date_time, null)) as t1
, min(iff(event_name='log_in', date_time, null)) as t2
, min(iff(event_name='landing_page', date_time, null)) as t3
one row per match
pattern(register anything* login anything* landing)
define
register as event_name = 'registered'
, login as event_name = 'log_in'
, landing as event_name = 'landing_page'
);
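
To sanity-check what the funnel should return, the intended logic (first registration, then the first log_in at or after it, then the first landing_page at or after that log_in) can be written out with plain correlated subqueries on a small sample. This is not MATCH_RECOGNIZE and it is not Snowflake: the sketch below runs on SQLite through Python, the sample rows are invented, and whether Snowflake would accept these particular subqueries is not something it verifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, date_time TEXT, event_name TEXT);
INSERT INTO events VALUES
  (1, '2022-01-01 08:00:00', 'log_in'),        -- before registration, must be ignored
  (1, '2022-01-01 09:00:00', 'registered'),
  (1, '2022-01-01 09:05:00', 'log_in'),
  (1, '2022-01-01 09:06:00', 'landing_page'),
  (1, '2022-01-02 10:00:00', 'log_in'),
  (2, '2022-01-03 11:00:00', 'registered'),
  (2, '2022-01-03 11:30:00', 'log_in');        -- user 2 never reaches landing_page
""")

FUNNEL_SQL = """
WITH reg AS (
    SELECT user_id, MIN(date_time) AS registered_ts
    FROM events
    WHERE event_name = 'registered'
    GROUP BY user_id
),
login AS (
    SELECT r.user_id, r.registered_ts,
           (SELECT MIN(e.date_time) FROM events e
             WHERE e.user_id = r.user_id
               AND e.event_name = 'log_in'
               AND e.date_time >= r.registered_ts) AS log_in_ts
    FROM reg r
)
SELECT l.user_id, l.registered_ts, l.log_in_ts,
       (SELECT MIN(e.date_time) FROM events e
         WHERE e.user_id = l.user_id
           AND e.event_name = 'landing_page'
           AND e.date_time >= l.log_in_ts) AS landing_page_ts
FROM login l
ORDER BY l.user_id
"""

funnel_rows = conn.execute(FUNNEL_SQL).fetchall()
for row in funnel_rows:
    print(row)
```

A step the user never reached comes back as NULL (None in Python), which makes the sample handy for comparing against the one-row-per-match output.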

SQL Creating a column in a view with the results from cross-referencing two tables

I'm new to SQL and am trying to create a view that combines data from a table of readings with data from a table of failures. The view should be ordered by target name, then metric name, then timestamp, with an additional column that returns 1 if there was a failure on that day and 0 otherwise. The query I've written currently fails with a "missing right parenthesis" error, but when I remove the parenthesis it reports the table names as invalid. I'm unsure whether my use of CASE is causing it, although it has worked on some practice samples. Any help checking this, and suggestions on how to improve it, would be much appreciated.
SELECT * FROM
(
with new_failure_table as (
SELECT target_name, END_TIMESTAMP,START_TIMESTAMP,
((END_TIMESTAMP - (START_TIMESTAMP))*24*60)
FROM failure_table
WHERE (END_TIMESTAMP - (START_TIMESTAMP))*24*60 >5
AND (END_TIMESTAMP - START_TIMESTAMP) < 1
and availability_status = 'Target Down'
)
-- Simplifies failure table to include actual failures according to two parameters
SELECT
t1.target_name,
t1.metric_name,
t1.rollup_timestamp,
t1.average,
t1.minimum,
t1.maximum,
t1.standard_deviation,
t2.END_TIMESTAMP,
t2.START_TIMESTAMP,
(CASE
when t1.target_name = t2.target_name
and t1.rollup_timestamp = trunc(END_TIMESTAMP+1)
and t1.rollup_timestamp = trunc(START_TIMESTAMP+1)
THEN '1' ELSE '0' END) AS failure_status
--Used to create column that reads 1 when there was a failure between the two readings and 0 otherwise
FROM
data_readings AS t1, new_failure_table AS t2
WHERE t1.target_name = t2.target_name
)
GROUP BY t1.target_name, metric_name
ORDER BY rollup_timestamp desc;

You do not need to wrap the CASE expression in parentheses; it can simply be:
CASE
when t1.target_name = t2.target_name
and t1.rollup_timestamp = trunc(END_TIMESTAMP+1)
and t1.rollup_timestamp = trunc(START_TIMESTAMP+1)
THEN '1' ELSE '0' END AS failure_status
Also, do not put AS before table aliases in the FROM clause (some databases, Oracle among them, reject it). Instead of:
...
FROM data_readings AS t1, new_failure_table AS t2
...
use
...
FROM data_readings t1, new_failure_table t2
...
UPD: The whole query should look like this
SELECT * FROM
(
with new_failure_table as (
SELECT target_name, END_TIMESTAMP,START_TIMESTAMP,
((END_TIMESTAMP - (START_TIMESTAMP))*24*60)
FROM failure_table
WHERE (END_TIMESTAMP - (START_TIMESTAMP))*24*60 >5
AND (END_TIMESTAMP - START_TIMESTAMP) < 1
and availability_status = 'Target Down'
)
-- Simplifies failure table to include actual failures according to two parameters
SELECT
t1.target_name as target_name,
t1.metric_name as metric_name,
t1.rollup_timestamp as rollup_timestamp,
t1.average,
t1.minimum,
t1.maximum,
t1.standard_deviation,
t2.END_TIMESTAMP,
t2.START_TIMESTAMP,
CASE
when t1.target_name = t2.target_name
and t1.rollup_timestamp = trunc(END_TIMESTAMP+1)
and t1.rollup_timestamp = trunc(START_TIMESTAMP+1)
THEN '1' ELSE '0' END AS failure_status
--Used to create column that reads 1 when there was a failure between the two readings and 0 otherwise
FROM
data_readings t1, new_failure_table t2
WHERE t1.target_name = t2.target_name
)
-- SELECT * cannot be grouped by only two columns; the stated goal is an ordering, so sort instead
ORDER BY target_name, metric_name, rollup_timestamp desc;
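
A side note on the join itself: joining the readings to the filtered failures with a comma join drops every reading whose target has no qualifying failure, and repeats a reading once per failure. A minimal sketch of one alternative, computing the flag with EXISTS so each reading appears exactly once, is given below. It runs on SQLite through Python rather than on the original database, so julianday() and date() stand in for the date arithmetic and trunc(); the sample rows and the reduced column list are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_readings (target_name TEXT, metric_name TEXT, rollup_timestamp TEXT, average REAL);
CREATE TABLE failure_table (target_name TEXT, start_timestamp TEXT, end_timestamp TEXT,
                            availability_status TEXT);
INSERT INTO data_readings VALUES
  ('db01', 'cpu', '2023-05-02', 41.0),
  ('db01', 'cpu', '2023-05-03', 39.5),
  ('db02', 'cpu', '2023-05-02', 12.0);
INSERT INTO failure_table VALUES
  ('db01', '2023-05-01 10:00:00', '2023-05-01 10:30:00', 'Target Down'),  -- 30 min: qualifies
  ('db02', '2023-05-01 10:00:00', '2023-05-01 10:02:00', 'Target Down');  -- 2 min: filtered out
""")

FAILURE_FLAG_SQL = """
WITH new_failure_table AS (
    -- failures longer than 5 minutes and shorter than one day
    SELECT target_name, start_timestamp, end_timestamp
    FROM failure_table
    WHERE (julianday(end_timestamp) - julianday(start_timestamp)) * 24 * 60 > 5
      AND (julianday(end_timestamp) - julianday(start_timestamp)) < 1
      AND availability_status = 'Target Down'
)
SELECT t1.target_name, t1.metric_name, t1.rollup_timestamp, t1.average,
       CASE WHEN EXISTS (
                SELECT 1
                FROM new_failure_table t2
                WHERE t2.target_name = t1.target_name
                  -- same day test as the question: reading day = failure day + 1
                  AND date(t1.rollup_timestamp) = date(t2.end_timestamp, '+1 day')
                  AND date(t1.rollup_timestamp) = date(t2.start_timestamp, '+1 day')
            ) THEN 1 ELSE 0 END AS failure_status
FROM data_readings t1
ORDER BY t1.target_name, t1.metric_name, t1.rollup_timestamp
"""

flagged_rows = conn.execute(FAILURE_FLAG_SQL).fetchall()
for row in flagged_rows:
    print(row)
```

In this sample the db02 reading is kept with a flag of 0 even though db02 has no qualifying failure, which the comma join would not do.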

LAG within CASE giving false negative offset

TL;DR: scroll down to TASK 2.
I am dealing with the following data set:
email,createdby,createdon
a#b.c,jsmith,2016-10-10
a#b.c,nsmythe,2016-09-09
a#b.c,vstark,2016-11-11
b#x.y,ajohnson,2015-02-03
b#x.y,elear,2015-01-01
...
and so on. Each email is guaranteed to have at least one duplicate in the data set.
Now, there are two tasks to resolve; I resolved one of them but am struggling with the other one. I will now present both tasks for completeness.
TASK 1 (resolved):
For each row, for each email, return an additional column with the name of the user that created the first record with this email.
Expected result for the above sample data set:
email,createdby,createdon,original_createdby
a#b.c,jsmith,2016-10-10,nsmythe
a#b.c,nsmythe,2016-09-09,nsmythe
a#b.c,vstark,2016-11-11,nsmythe
b#x.y,ajohnson,2015-02-03,elear
b#x.y,elear,2015-01-01,elear
Code to get the above:
;WITH q0 -- this is just a security measure in case there are unique emails in the data set
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, LAG(q1.createdby, q1.rn - 1) OVER ( ORDER BY q1.email, q1.createdon ) original_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Brief explanation: I partition data set by email, then I number rows in each partition ordered by creation date, finally I return [createdby] value from (rn-1)th record. Works exactly as expected.
Now, similar to the above, there is TASK 2:
TASK 2:
For each row, for each email, return name of the user that created the first duplicate. I.e. name of a user where rn=2.
Expected result:
email,createdby,createdon,first_dupl_createdby
a#b.c,jsmith,2016-10-10,jsmith
a#b.c,nsmythe,2016-09-09,jsmith
a#b.c,vstark,2016-11-11,jsmith
b#x.y,ajohnson,2015-02-03,ajohnson
b#x.y,elear,2015-01-01,ajohnson
I want to keep things performant so trying to employ LEAD-LAG functions:
WITH q0
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, CASE q1.rn
WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )
ELSE LAG(q1.createdby, q1.rn - 2) OVER ( ORDER BY q1.email, q1.createdon )
END AS first_dupl_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition return [createdby] from (rn-2) records ago (i.e. for rn = 2 we're staying on the same record, for rn = 3 we're going 1 record back, for rn = 4 - 2 records back and so on).
An issue comes up on the
ELSE LAG(q1.createdby, q1.rn - 2)
operation. Apparently, against any logic, despite the existence of the preceding line (WHEN 1 THEN...), the ELSE block is also evaluated for rn = 1, resulting in a negative offset value passed to the LAG function:
Msg 8730, Level 16, State 2, Line 37
Offset parameter for Lag and Lead functions cannot be a negative value.
When I comment out that ELSE line, the whole thing works fine but obviously I am not getting any results in the first_dupl_createdby column for rn > 1.
QUESTION:
Is there any way of rewriting the above CASE expression (in TASK 2) so that it always returns the value from the record where rn = 2 within each partition, but (and this is the important bit) without a self-JOIN? I know I could prepare the rows where rn = 2 in a separate sub-query, but that would mean extra scans of the whole table and an unnecessary self-JOIN.

I think you can simply use the max window function as you are trying to get the value from rownumber = 2 for each partition.
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by
FROM q1
ORDER BY q1.email, q1.rn
You can use a similar query to get the results for rownumber=1 for the 1st scenario as well.
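
For reference, here is a small runnable sketch of that idea covering both tasks at once, using the sample data from the question. It runs on SQLite through Python rather than SQL Server (an assumption made only so it can be executed anywhere); the windowed MAX(CASE ...) expression itself is the one from this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (email TEXT, createdby TEXT, createdon TEXT);
INSERT INTO t VALUES
  ('a#b.c', 'jsmith',   '2016-10-10'),
  ('a#b.c', 'nsmythe',  '2016-09-09'),
  ('a#b.c', 'vstark',   '2016-11-11'),
  ('b#x.y', 'ajohnson', '2015-02-03'),
  ('b#x.y', 'elear',    '2015-01-01');
""")

WINDOWED_MAX_SQL = """
WITH q1 AS (
    SELECT email, createdon, createdby,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY createdon) AS rn
    FROM t
)
SELECT email, createdby, createdon,
       -- the CASE is non-NULL on exactly one row per partition, so MAX just spreads it
       MAX(CASE WHEN rn = 1 THEN createdby END) OVER (PARTITION BY email) AS original_createdby,
       MAX(CASE WHEN rn = 2 THEN createdby END) OVER (PARTITION BY email) AS first_dupl_createdby
FROM q1
ORDER BY email, createdby
"""

window_rows = conn.execute(WINDOWED_MAX_SQL).fetchall()
for row in window_rows:
    print(row)
```

No offset is passed to LAG or LEAD here, so the negative-offset error cannot occur, and the table is read once.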

You can get the information for each email using row_number() and conditional aggregation:
select email,
max(case when seqnum = 1 then createdby end) as createdby_first,
max(case when seqnum = 2 then createdby end) as createdby_second
from (select t.*,
row_number() over (partition by email order by createdon) as seqnum
from t
) t
group by email;
You can join this information back to the original data to get the information you want. I don't see how lag() naturally would be used to solve this problem.
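
The join back could look roughly like the sketch below, again on SQLite through Python with the question's sample data (the alias names ranked and s are arbitrary). Note that it does involve the self-join the question hoped to avoid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (email TEXT, createdby TEXT, createdon TEXT);
INSERT INTO t VALUES
  ('a#b.c', 'jsmith',   '2016-10-10'),
  ('a#b.c', 'nsmythe',  '2016-09-09'),
  ('a#b.c', 'vstark',   '2016-11-11'),
  ('b#x.y', 'ajohnson', '2015-02-03'),
  ('b#x.y', 'elear',    '2015-01-01');
""")

JOIN_BACK_SQL = """
SELECT t.email, t.createdby, t.createdon, s.createdby_first, s.createdby_second
FROM t
JOIN (
    -- one summary row per email: creators of the 1st and 2nd record
    SELECT email,
           MAX(CASE WHEN seqnum = 1 THEN createdby END) AS createdby_first,
           MAX(CASE WHEN seqnum = 2 THEN createdby END) AS createdby_second
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY email ORDER BY createdon) AS seqnum
          FROM t) ranked
    GROUP BY email
) s ON s.email = t.email
ORDER BY t.email, t.createdon
"""

joined_rows = conn.execute(JOIN_BACK_SQL).fetchall()
for row in joined_rows:
    print(row)
```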

/shrug
; WITH duplicate_email_addresses AS (
SELECT email
FROM t
GROUP
BY email
HAVING Count(*) > 1
)
, records_with_duplicate_email_addresses AS (
SELECT email
, createdon
, createdby
, Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer
FROM t
WHERE EXISTS (
SELECT *
FROM duplicate_email_addresses
WHERE email = t.email
)
)
, second_duplicate_record AS ( -- Why do you need any more than this?
SELECT email
, createdon
, createdby
FROM records_with_duplicate_email_addresses
WHERE sequencer = 2
)
SELECT records_with_duplicate_email_addresses.email
, records_with_duplicate_email_addresses.createdon
, records_with_duplicate_email_addresses.createdby
, second_duplicate_record.createdby AS first_duplicate_createdby
FROM records_with_duplicate_email_addresses
INNER
JOIN second_duplicate_record
ON second_duplicate_record.email = records_with_duplicate_email_addresses.email
;

troubles with next and previous query

I have a list and the returned table looks like this. I took the preview of only one car but there are many more.
What I need to do now is check that the current KM value is larger than the previous one and smaller than the next one. The outcome should go into a field called Trustworthy, filled with 1 if the check passes and 0 if it does not (true/false).
The result that I have so far is this:
validKMstand and validkmstand2 are how I calculate it. It did not work in one list so that is why I separated it.
In both of my tries my code does not work.
Here is the code that I have so far.
FullList as (
SELECT
*
FROM
eMK_Mileage as Mileage
)
, ValidChecked1 as (
SELECT
UL1.*,
CASE WHEN EXISTS(
SELECT TOP(1)UL2.*
FROM FullList AS UL2
WHERE
UL2.FK_CarID = UL1.FK_CarID AND
UL1.KM_Date > UL2.KM_Date AND
UL1.KM > UL2.KM
ORDER BY UL2.KM_Date DESC
)
THEN 1
ELSE 0
END AS validkmstand
FROM FullList as UL1
)
, ValidChecked2 as (
SELECT
List1.*,
(CASE WHEN List1.KM > ulprev.KM
THEN 1
ELSE 0
END
) AS validkmstand2
FROM ValidChecked1 as List1 outer apply
(SELECT TOP(1)UL3.*
FROM ValidChecked1 AS UL3
WHERE
UL3.FK_CarID = List1.FK_CarID AND
UL3.KM_Date <= List1.KM_Date AND
List1.KM > UL3.KM
ORDER BY UL3.KM_Date DESC) ulprev
)
SELECT * FROM ValidChecked2 order by FK_CarID, KM_Date

Maybe something like this is what you are looking for?
;with data as
(
select *, rn = row_number() over (partition by fk_carid order by km_date)
from eMK_Mileage
)
select
d.FK_CarID, d.KM, d.KM_Date,
valid =
case
when (d.KM > d_prev.KM /* or d_prev.KM is null */)
and (d.KM < d_next.KM /* or d_next.KM is null */)
then 1 else 0
end
from data d
left join data d_prev on d.FK_CarID = d_prev.FK_CarID and d_prev.rn = d.rn - 1
left join data d_next on d.FK_CarID = d_next.FK_CarID and d_next.rn = d.rn + 1
order by d.FK_CarID, d.KM_Date
On SQL Server 2012 and later you could use the lag() and lead() analytic functions to access the previous/next rows, but on earlier versions you can accomplish the same thing by numbering rows within partitions of the set, as above. There are other ways too, such as correlated subqueries.
I left a couple of conditions commented out that deal with the first and last rows for every car. Maybe those rows should be considered valid if they fulfill only one part of the comparison (since the previous/next row is null)?
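
For completeness, a minimal sketch of the lag()/lead() variant mentioned above. It runs on SQLite through Python so it can be tried without a SQL Server instance; the sample mileage rows are invented, and the first and last reading of each car are treated as valid when their one available neighbour agrees, which is only one of the two options discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eMK_Mileage (FK_CarID INTEGER, KM INTEGER, KM_Date TEXT);
INSERT INTO eMK_Mileage VALUES
  (1, 1000, '2020-01-01'),
  (1, 5000, '2020-02-01'),
  (1, 3000, '2020-03-01'),   -- lower than the previous reading
  (1, 7000, '2020-04-01');
""")

TRUSTWORTHY_SQL = """
SELECT FK_CarID, KM, KM_Date,
       CASE
           WHEN (prev_km IS NULL OR KM > prev_km)
            AND (next_km IS NULL OR KM < next_km)
           THEN 1 ELSE 0
       END AS Trustworthy
FROM (
    SELECT FK_CarID, KM, KM_Date,
           LAG(KM)  OVER (PARTITION BY FK_CarID ORDER BY KM_Date) AS prev_km,
           LEAD(KM) OVER (PARTITION BY FK_CarID ORDER BY KM_Date) AS next_km
    FROM eMK_Mileage
) neighbours
ORDER BY FK_CarID, KM_Date
"""

mileage_rows = conn.execute(TRUSTWORTHY_SQL).fetchall()
for row in mileage_rows:
    print(row)
```

A neighbour comparison cannot tell which of two conflicting readings is the wrong one, so in this sample both the 5000 and the 3000 rows are flagged 0.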
