Teradata SQL Get Rid of Duplicates with Specific Order

I just started using Teradata SQL this week, so sorry if I don't phrase things correctly. I originally wrote an R script that removes duplicates from my table, but now I need to port that logic to SQL. Here is some sample data:
I want to get rid of any rows with D in the DELETE column, partition by ID, and order by STATUS, DATE, and AMOUNT (with actual dates and amounts sorting before ?s). I want STATUS to go in this order: P, H, F, U, T. I want the first row per ID that has STATUS, DATE, and AMOUNT filled in (with STATUS in that order). Here is the example output data:
I'm really stuck on the ordering issue, and the code I've written returns no rows at all (but also no errors).
SAMPLE CODE:
CREATE VOLATILE TABLE new_tble
AS
(SELECT *
FROM table
QUALIFY row_number() OVER (partition BY ID ORDER BY ID, DATE, AMOUNT)=1
WHERE DELETE <> 'D'
)
with data;

This is a direct translation of your description into Teradata SQL, assuming ? means NULL:
select *
from tab
where "delete" is null
and "date" is not null
and amount is not null
qualify
row_number()
over (partition by id
order by case status
when 'P' then 1
when 'H' then 2
when 'F' then 3
when 'U' then 4
when 'T' then 5
end
,"date"
,amount) = 1
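To try the idea without a Teradata system, here is a minimal sketch that runs the same logic in SQLite through Python's sqlite3 module. Everything in it is an assumption for illustration: the sample rows, the table name tab, and the column names dt and delete_flag (renamed to avoid quoting reserved words). SQLite has no QUALIFY clause, so the row_number() filter moves into an outer query; window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample rows: (id, status, dt, amount, delete_flag).
# None stands in for the "?" (NULL) values mentioned in the question.
ROWS = [
    (1, "H", "2020-01-05", 10.0, None),
    (1, "P", "2020-01-09", 20.0, None),
    (1, "P", "2020-01-02", 30.0, "D"),   # flagged for deletion
    (2, "T", "2020-02-01", 5.0, None),
    (2, "F", None, 7.0, None),           # missing date, so excluded
    (3, "U", "2020-03-01", None, None),  # missing amount, so excluded
]

# Same ordering rule as the Teradata query: a CASE expression maps each
# status to its rank, then dt and amount break ties.
QUERY = """
SELECT id, status, dt, amount
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY CASE status
                            WHEN 'P' THEN 1 WHEN 'H' THEN 2 WHEN 'F' THEN 3
                            WHEN 'U' THEN 4 WHEN 'T' THEN 5
                        END,
                        dt, amount
           ) AS rn
    FROM tab AS t
    WHERE delete_flag IS NULL
      AND dt IS NOT NULL
      AND amount IS NOT NULL
)
WHERE rn = 1
ORDER BY id
"""


def first_row_per_id(rows):
    """Load rows into an in-memory table and keep the first row per id."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE tab (id INTEGER, status TEXT, dt TEXT, "
        "amount REAL, delete_flag TEXT)"
    )
    con.executemany("INSERT INTO tab VALUES (?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in first_row_per_id(ROWS):
        print(row)
```

With these made-up rows, id 1 keeps its P row even though the H row has an earlier date, because status is the first sort key; id 3 drops out entirely since its only row has no amount.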


How create a unique ID based on conditions in SQL?

I would like to generate a new ID, in any format (in the example below: 11, 12, 13, ...), based on the following condition:
Every time the days column value is greater than 1 (and not null), the current row and all following rows get the same ID, until another value meets the condition.
This applies within the same email.
Below you can see the expected ID (in the format of XX)
I thought about using two conditions, in the following order:
1. Every time the days column value is greater than 1, all following rows get the same ID until a new value meets the condition.
2. AND when the lag (previous) value is equal to 0/1/null.
Assuming you have an EmailDate column over which you're ordering (a DATETIME field, really), try something like this:
WITH
TableNameWithEmailDateIDs AS (
SELECT
*,
ROW_NUMBER() OVER (
ORDER BY
Email DESC,
EmailDate
) AS EmailDateID
FROM
TableName
),
IDs AS (
SELECT
*,
LEAD(EmailDateID, 1) OVER (
ORDER BY
Email,
EmailDate
) AS LeadEmailDateID
FROM
(
SELECT
*,
-- REMOVE +10 if you don't want 11 to be starting ID
ROW_NUMBER() OVER (
ORDER BY
Email DESC,
EmailDate
)+10 AS ID
FROM
TableNameWithEmailDateIDs
WHERE
Days > 1
OR Days IS NULL
) X
)
SELECT
COALESCE(TableName.EmailDate, IDs.EmailDate) AS EmailDate,
IDs.Email,
COALESCE(TableName.Days, IDs.Days) AS Days,
IDs.ID
FROM
IDs
LEFT JOIN TableNameWithEmailDateIDs TableName
ON IDs.Email = TableName.Email
AND TableName.EmailDateID BETWEEN
IDs.EmailDateID
AND IDs.LeadEmailDateID-1
ORDER BY
ID DESC,
TableName.EmailDate DESC
;
First, create a CTE that generates IDs for each distinct Email/Date combo (helpful for LEFT JOIN condition later). Then, create a CTE that generates IDs for rows that meet your condition (i.e. the important rows). Finally, LEFT JOIN your main table onto that CTE to fill in the "gaps", so to speak.
I suggest running each of the components of this query independently to fully understand what's going on.
Hope it helps!
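For comparison, here is a sketch of a different technique from the query above: a running sum over a "starts a new group" flag, which avoids the self-join. It runs in SQLite via Python's sqlite3 module (3.25+ for window functions). The table name emails, the column names, and the sample rows are all hypothetical, and it assumes, like the WHERE clause above, that a row starts a new group when Days is NULL or greater than 1.

```python
import sqlite3

# Hypothetical sample rows: (email, email_date, days).
EMAIL_ROWS = [
    ("a@x.com", "2020-01-01", None),
    ("a@x.com", "2020-01-02", 1),
    ("a@x.com", "2020-01-05", 3),
    ("a@x.com", "2020-01-06", 1),
    ("b@x.com", "2020-01-01", None),
    ("b@x.com", "2020-01-10", 9),
]

# Each row that meets the condition contributes 1 to a running total, so all
# rows up to the next qualifying row share the same total. The +10 only
# makes the IDs start at 11, as in the question's example.
GROUP_ID_QUERY = """
SELECT email, email_date, days,
       10 + SUM(CASE WHEN days IS NULL OR days > 1 THEN 1 ELSE 0 END)
            OVER (ORDER BY email, email_date
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id
FROM emails
ORDER BY email, email_date
"""


def assign_group_ids(rows):
    """Return (email, email_date, days, id) tuples with running group IDs."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE emails (email TEXT, email_date TEXT, days INTEGER)")
    con.executemany("INSERT INTO emails VALUES (?, ?, ?)", rows)
    result = con.execute(GROUP_ID_QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in assign_group_ids(EMAIL_ROWS):
        print(row)
```

Because the first row of each email has a NULL Days value in this sample, a new ID always begins when the email changes, so no PARTITION BY is needed to keep IDs from spanning two emails. If your data does not guarantee that, the flag would also need to fire on a change of email.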

How to find difference in date between each unique ID across multiple rows when not ordered? (PostgreSQL)

I have a table with id, order sequence and date, and I am trying to add two columns, one with a difference in date function, and another with a status function that is reliant on the value of the difference in date.
Table looks like this:
The issue is with finding the difference between the dates within each unique id: for the first order sequence it should be null, and for any subsequent order sequence, say 3, it should be the 3rd date minus the 2nd date. Now this all works with the code I have:
case
when ord_seq = 1 then null
else ord_date - lag(ord_date) over (order by id)
end as date_diff,
However, this only works when the table is already ordered. If I jumble up the order that I input the table in, the values come out a little different. I figured it might be because "lag" function only takes the previous row's value, so if the previous row does not belong to the same id, and is not in chronological order, the dates won't subtract well.
My code looks like this at the moment:
select
id,
ord_seq,
ord_date,
case
when ord_seq = 1 then null
else ord_date - lag(ord_date) over (order by id)
end as date_diff,
case
when ord_seq = 1 then 'New'
when ord_date - lag(ord_date) over (order by id, ord_seq) between 1 and 200 then 'Retain'
when ord_date - lag(ord_date) over (order by id, ord_seq) > 200 then 'Reactivated'
end as status
from t1
order by id, ord_seq, ord_date
My db<>fiddle
Am I using the correct function here? How do I find the difference in date between one unique ID, regardless of the order of the table?
Any help would be much appreciated.
In case you want to see end table result (error is on id 'ddd', ord seq '2' and '3'):
Ordered Input:
Not Ordered Input:
When using this:
You are missing the partition by in your window definition. Here it is, working regardless of any table order:
select *,
ord_date - lag(ord_date) over (partition by id order by ord_seq) as date_diff
from t1;
Please note, however, that database tables have no natural order you can rely upon and cannot be considered ordered, no matter in what sequence the records were inserted. You must explicitly specify an order by clause if you need a specific order.
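A quick way to convince yourself is to load the same rows in two different orders and compare the output. The sketch below does that in SQLite via Python's sqlite3 module rather than PostgreSQL, so the date subtraction goes through julianday() instead of the minus operator on dates; the sample rows are made up and window functions need SQLite 3.25+.

```python
import sqlite3

# Hypothetical rows, deliberately jumbled: (id, ord_seq, ord_date).
ORDER_ROWS = [
    ("ddd", 3, "2020-03-01"),
    ("aaa", 1, "2020-01-01"),
    ("ddd", 1, "2020-01-10"),
    ("aaa", 2, "2020-01-15"),
    ("ddd", 2, "2020-01-20"),
]

# PARTITION BY id restricts LAG to rows of the same id; ORDER BY ord_seq
# inside the window decides which row counts as "previous".
DATE_DIFF_QUERY = """
SELECT id, ord_seq,
       CAST(julianday(ord_date)
            - julianday(LAG(ord_date) OVER (PARTITION BY id ORDER BY ord_seq))
            AS INTEGER) AS date_diff
FROM t1
ORDER BY id, ord_seq
"""


def date_diffs(rows):
    """Insert rows in the given order and return (id, ord_seq, date_diff)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (id TEXT, ord_seq INTEGER, ord_date TEXT)")
    con.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)
    result = con.execute(DATE_DIFF_QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    # Same result whether the rows are inserted jumbled or reversed.
    print(date_diffs(ORDER_ROWS))
    print(date_diffs(list(reversed(ORDER_ROWS))))
```

The first row of each id comes back with a NULL (None) difference on its own, because LAG has no previous row inside the partition, so the "when ord_seq = 1 then null" branch is not needed for date_diff.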

How to Find Last Change row in SQL - Big Query

Can someone please provide a query that I can use in Google Big Query to identify the total count of users for whom the value changed specifically from 'C' to 'P'? In the below table userid=123 satisfies this even though later userid = 123 changes back from 'P' to 'C'.
userid  timestamp           Value
123     9-15-2020 02:35:45  C
456     9-15-2020 01:45:09  P
789     9-15-2020 06:22:10  P
123     9-15-2020 03:43:00  P
456     9-15-2020 03:45:10  C
123     9-15-2020 07:40:34  C
You can try using lag():
select userid from
(
select userid, timestamp, value, lag(value) over(partition by userid order by timestamp) as prev_value
from tablename
)A where value='P' and prev_value='C'
Quoting the question: "Can someone please provide a query that I can use in Google Big Query to identify the total count of users for whom the value changed specifically from 'C' to 'P'"
Note that this is not consistent with the title of the question.
lag() is the key idea. But it is unclear whether you want the count of users or the count of changes. This calculates both:
select count(*) as num_changes,
count(distinct userid) as num_users_with_change
from (select t.*,
lag(value) over(partition by userid order by timestamp) as prev_value
from tablename t
) t
where value = 'P' and prev_value = 'C';
The second column counts a user only once, regardless of the number of times they have changed (which is my interpretation of your question).
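Here is a small runnable check of that query against the rows from the question, using SQLite through Python's sqlite3 module as a stand-in for BigQuery (an assumption: the two engines agree on this use of lag(), and SQLite 3.25+ is available). The table name events and the column names ts and val are hypothetical, and the timestamps are rewritten in ISO format so that they sort correctly as text.

```python
import sqlite3

# The question's sample rows: (userid, ts, val).
EVENTS = [
    (123, "2020-09-15 02:35:45", "C"),
    (456, "2020-09-15 01:45:09", "P"),
    (789, "2020-09-15 06:22:10", "P"),
    (123, "2020-09-15 03:43:00", "P"),
    (456, "2020-09-15 03:45:10", "C"),
    (123, "2020-09-15 07:40:34", "C"),
]

# A change from 'C' to 'P' is a row whose value is 'P' while the previous
# row for the same user (ordered by time) was 'C'.
CHANGE_QUERY = """
SELECT COUNT(*) AS num_changes,
       COUNT(DISTINCT userid) AS num_users_with_change
FROM (
    SELECT userid, val,
           LAG(val) OVER (PARTITION BY userid ORDER BY ts) AS prev_val
    FROM events
)
WHERE val = 'P' AND prev_val = 'C'
"""


def count_c_to_p(events):
    """Return (num_changes, num_users_with_change) for the given events."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (userid INTEGER, ts TEXT, val TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    result = con.execute(CHANGE_QUERY).fetchone()
    con.close()
    return result


if __name__ == "__main__":
    print(count_c_to_p(EVENTS))
```

On the question's data only userid 123 goes from C to P, and it does so once, so both counts are 1. The two counts diverge as soon as one user makes the same change more than once.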
Quoting the question: "identify the total count of users for whom the value changed specifically from 'C' to 'P'?"
Below is for BigQuery Standard SQL
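#standardSQL
SELECT COUNT(*) AS qualified_users
FROM (
  -- one row per user whose ordered values contain 'C' directly followed by 'P'
  SELECT userid
  FROM `project.dataset.table`
  GROUP BY userid
  HAVING STRPOS(STRING_AGG(value, '' ORDER BY timestamp), 'CP') > 0
)
Note: I assume your timestamp column is of TIMESTAMP data type; otherwise you will need to use PARSE_TIMESTAMP in the ORDER BY portion. The grouping is done in a subquery because counting directly in a query that has GROUP BY userid would return one row per qualifying user instead of a single total.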

Compare two tables of data in HIVE

I have to find out whether the data in both tables is the same for a given view_date. If it is the same, my SQL should return zero; otherwise it should return a non-zero value.
Table1/Table2 columns:
Source
view_date
count
start_date
end_date
I tried in the below way:
SELECT *
FROM (
SELECT count(*)
FROM table1
) a
JOIN (
SELECT count(*)
FROM TABLE 2
) b
WHERE view_date = '05/08/2016'
AND a.x != b.y;
But I am not getting the expected result. Could someone please help me?
Here is one method that counts the rows that appear in only one of the two tables:
select count(*)
from (select source, count, start_date, end_date,
             min(which) as minwhich, max(which) as maxwhich
      from ((select source, count, start_date, end_date, 1 as which
             from table1
             where view_date = '2016-06-08'
            ) union all
            (select source, count, start_date, end_date, 2 as which
             from table2
             where view_date = '2016-06-08'
            )
           ) t12
      group by source, count, start_date, end_date
      having minwhich = maxwhich
     ) t;
Note: If rows are duplicated across all values in a table, this does not check that the same number of duplicates are in each table.
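As a rough check of the union all / group by idea, the sketch below runs it in SQLite through Python's sqlite3 module instead of Hive. The sample rows are made up, the count column is renamed cnt, and the query is reshaped slightly for SQLite (no parentheses around the union branches, and the aggregates repeated in HAVING instead of using their aliases), so treat it as an illustration of the logic rather than Hive syntax.

```python
import sqlite3

# Hypothetical rows: (source, view_date, cnt, start_date, end_date).
TABLE1 = [
    ("web", "2016-06-08", 10, "2016-06-01", "2016-06-07"),
    ("app", "2016-06-08", 4, "2016-06-01", "2016-06-07"),
]
TABLE2_SAME = list(TABLE1)
TABLE2_DIFF = [
    ("web", "2016-06-08", 10, "2016-06-01", "2016-06-07"),
    ("app", "2016-06-08", 5, "2016-06-01", "2016-06-07"),  # cnt differs
]

# A group that contains rows from only one table has MIN(which) equal to
# MAX(which); the outer COUNT(*) counts those unmatched groups.
MISMATCH_QUERY = """
SELECT COUNT(*)
FROM (
    SELECT source, cnt, start_date, end_date
    FROM (
        SELECT source, cnt, start_date, end_date, 1 AS which
        FROM table1 WHERE view_date = :d
        UNION ALL
        SELECT source, cnt, start_date, end_date, 2 AS which
        FROM table2 WHERE view_date = :d
    )
    GROUP BY source, cnt, start_date, end_date
    HAVING MIN(which) = MAX(which)
)
"""


def mismatch_count(rows1, rows2, view_date):
    """Return the number of row groups present in only one of the tables."""
    con = sqlite3.connect(":memory:")
    for name, rows in (("table1", rows1), ("table2", rows2)):
        con.execute(
            f"CREATE TABLE {name} (source TEXT, view_date TEXT, cnt INTEGER, "
            "start_date TEXT, end_date TEXT)"
        )
        con.executemany(f"INSERT INTO {name} VALUES (?, ?, ?, ?, ?)", rows)
    (n,) = con.execute(MISMATCH_QUERY, {"d": view_date}).fetchone()
    con.close()
    return n


if __name__ == "__main__":
    print(mismatch_count(TABLE1, TABLE2_SAME, "2016-06-08"))
    print(mismatch_count(TABLE1, TABLE2_DIFF, "2016-06-08"))
```

Identical tables give zero, which is the "same" signal the question asks for. A single differing row counts twice, once for each table's version of it, so the result is a yes/no indicator more than an exact count of differences.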
To do a full comparison of 2 tables, you not only need to make sure that the number of rows match, but you must check that all the data in all the columns for all the rows match!
This can be a complicated problem (when I worked at Hortonworks, for 1 project we developed 3 different programs to try to solve this). Lately I had the opportunity to develop a program that solves this in an elegant and efficient way: https://github.com/bolcom/hive_compared_bq
The program shows you the differences in a webpage (which is something you could skip if you don't need it) and also gives you a return value 0/1 which is what you currently want.

Ranking over several columns

In the process of query optimization I arrived at the following SQL query:
select r.*
from
(
select id, DATA, update_dt, inspection_dt, check_dt,
RANK() OVER
(PARTITION by ID
ORDER BY update_dt DESC, DATA) rank
FROM TABLE
where update_dt < inspection_dt or update_dt < check_dt
) r
where r.rank = 1
Query returns the DATA that corresponds to the latest check_dt.
However, what I want to get is:
1. DATA corresponding to latest check_dt
2. DATA corresponding to latest inspection_dt.
One trivial solution is to write two separate queries, each with a single where condition: one for inspection_dt and one for check_dt. However, that defeats the original intent, which was to shorten the running time.
By observing the source data I noticed a way to implement it: the check date is always later than the inspection date. Knowing that, I could just extract the record with rank = 1, which gives me the DATA corresponding to the latest CHECK_DT, and the record with the largest rank, which would correspond to INSPECTION.
However, I'm afraid the data will not always be consistent, so I am looking for a more general solution.
How about this?
select r.*
from (select id, DATA, update_dt, inspection_dt, check_dt,
RANK() OVER (PARTITION by ID
ORDER BY update_dt DESC, DATA
) as rank_upd,
RANK() OVER (PARTITION by ID
ORDER BY inspection_dt DESC, DATA
) as rank_insp
FROM TABLE
) r
where r.rank_upd = 1 or r.rank_insp = 1;
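To see what the two-rank query returns, here is a sketch that runs it in SQLite through Python's sqlite3 module (3.25+ for window functions). The table name t and the sample rows are hypothetical, and check_dt is left out to keep the example short. The point is that one pass over the table computes both rankings, and the outer filter keeps any row that wins at least one of them.

```python
import sqlite3

# Hypothetical rows: (id, data, update_dt, inspection_dt).
RANK_ROWS = [
    (1, "A", "2020-01-01", "2020-03-01"),
    (1, "B", "2020-02-01", "2020-01-15"),
    (1, "C", "2020-01-10", "2020-01-20"),
    (2, "D", "2020-05-01", "2020-05-02"),
]

# Two independent RANK() windows over the same partition, each with its own
# ordering column; rows ranked first by either one are returned.
RANK_QUERY = """
SELECT r.id, r.data, r.rank_upd, r.rank_insp
FROM (
    SELECT id, data,
           RANK() OVER (PARTITION BY id
                        ORDER BY update_dt DESC, data) AS rank_upd,
           RANK() OVER (PARTITION BY id
                        ORDER BY inspection_dt DESC, data) AS rank_insp
    FROM t
) AS r
WHERE r.rank_upd = 1 OR r.rank_insp = 1
ORDER BY r.id, r.data
"""


def top_rows(rows):
    """Return rows that rank first by update_dt or by inspection_dt."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE t (id INTEGER, data TEXT, update_dt TEXT, inspection_dt TEXT)"
    )
    con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    result = con.execute(RANK_QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in top_rows(RANK_ROWS):
        print(row)
```

For id 1, row B wins on update_dt and row A wins on inspection_dt, so both come back, and the rank columns tell you which ranking each row satisfied. For id 2 the single row wins both.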