A query joins two instances of the same table to compare fields and returns mirrored results. How do I eliminate the mirrored duplicates? - sql

This is a simpler version of the query I have.
with Alias1 as
(select distinct ID, file_tag, status, creation_date from tables where creation_date >= sysdate and creation_date <= sysdate + 1),
Alias2 as
(select distinct ID, file_tag, status, creation_date from tables where creation_date >= sysdate and creation_date <= sysdate + 1)
select distinct Alias1.ID ID_1,
Alias2.ID ID_2,
Alias1.file_tag,
Alias1.creation_date in_dt1,
Alias2.creation_date in_dt2
from Alias1, Alias2
where Alias1.file_tag = Alias2.file_tag
and Alias1.ID != Alias2.ID
order by Alias1.creation_date desc
This is an example of the results. Both of these are the same, though their values are flipped.
ID_1  ID_2  File_Tag  in_dt1          in_dt2
70    66    Apples    6/25/2012 3:06  6/25/2012 2:53:47 PM
66    70    Apples    6/25/2012 2:53  6/25/2012 3:06:18 PM
The goal of the query is to find more than one ID with a matching file tag and do stuff to the one submitted earlier in the day (the query runs daily and only needs duplicates from that given day). I am still relatively new to SQL/Oracle and wonder if there's a better way to approach this problem.

SELECT *
FROM (SELECT id, file_tag, creation_date in_dt
, row_number() OVER (PARTITION BY file_tag
ORDER BY creation_date) rn
, count(*) OVER (PARTITION BY file_tag) ct
FROM tables
WHERE creation_date >= TRUNC(SYSDATE)) tbls
WHERE rn = 1
AND ct > 1;
This should get you the first (earliest) row within each file_tag having at least 2 records today.
The inner select calculates the relative row numbers of each set of identical file_tag records by creation date. The outer select retrieves the first one in each partition.
This assumes from your goal statement that you want to do something with the earliest single row for each file_tag. The inner query only returns rows with a creation_date of sometime on the current day.
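To see the row_number()/count(*) approach run end to end, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The table name uploads and the 'Pears' row are made up for illustration, the "today" filter is left out so the result is reproducible, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the two 'Apples' rows mirror the question's sample data.
conn.executescript("""
    CREATE TABLE uploads (id INTEGER, file_tag TEXT, creation_date TEXT);
    INSERT INTO uploads VALUES
        (66, 'Apples', '2012-06-25 14:53:47'),
        (70, 'Apples', '2012-06-25 15:06:18'),
        (71, 'Pears',  '2012-06-25 15:10:00');
""")

# rn numbers the rows of each file_tag by creation time; ct is the group size.
earliest = conn.execute("""
    SELECT id, file_tag, in_dt
    FROM (SELECT id, file_tag, creation_date AS in_dt,
                 row_number() OVER (PARTITION BY file_tag
                                    ORDER BY creation_date) AS rn,
                 count(*) OVER (PARTITION BY file_tag) AS ct
          FROM uploads) AS tbls
    WHERE rn = 1 AND ct > 1
""").fetchall()

print(earliest)
```

Only the earlier 'Apples' row comes back; 'Pears' is dropped because its group has a single row.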

Here is an easy way, just by changing your comparison operation:
select distinct Alias1.ID ID_1, Alias2.ID ID_2, Alias1.file_tag,
Alias1.creation_date in_dt1, Alias2.creation_date in_dt2
from Alias1 join
Alias2
on Alias1.file_tag = Alias2.file_tag and
Alias1.ID < Alias2.ID
order by Alias1.creation_date desc
Replacing the not-equals with less-than orders the two IDs so the smaller one is always first. This will eliminate the duplicates. Note: I also fixed the join syntax.
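The effect of the operator change can be checked with a small sketch in Python's sqlite3 module (not Oracle). The table name uploads and the 'Pears' row are assumptions for illustration; the two 'Apples' rows are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the question's two sample rows plus one unrelated row.
conn.executescript("""
    CREATE TABLE uploads (id INTEGER, file_tag TEXT, creation_date TEXT);
    INSERT INTO uploads VALUES
        (66, 'Apples', '2012-06-25 14:53:47'),
        (70, 'Apples', '2012-06-25 15:06:18'),
        (71, 'Pears',  '2012-06-25 15:10:00');
""")

# Same self-join twice; only the comparison operator on the IDs differs.
pair_sql = """
    SELECT a.id, b.id, a.file_tag
    FROM uploads a
    JOIN uploads b
      ON a.file_tag = b.file_tag AND a.id {op} b.id
    ORDER BY a.id
"""
mirrored = conn.execute(pair_sql.format(op="!=")).fetchall()
unique_pairs = conn.execute(pair_sql.format(op="<")).fetchall()

print(mirrored)      # each pair appears twice, once per direction
print(unique_pairs)  # each pair appears once, smaller ID first
```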

Related

Eliminate records

I am writing T-SQL to eliminate some data in a stored procedure.
The scenario is that there are four data points: ID, Recordnumber, OrderDate, RejectDate.
An ID can have multiple order dates and reject dates, which may be the same or different.
I need to eliminate all the records except the one with a reject date of 1/01/1900 (this is not an actual rejection; it is the value substituted for a null).
However, if there is no record with a reject date of 1/01/1900, then I should eliminate all records except the one with the maximum reject date.
The record number is a row number that I generated using row_number() over (partition by ...). Please shed some light on this. The image shows the records for one particular ID, and I need to apply this rule to all the records in the table. The expected results are highlighted in yellow for different IDs.
Is this what you want?
select t.*
from t
where t.reject_date = '1900-01-01' or
t.reject_date = (select max(t2.reject_date)
from t t2
where t2.id = t.id
);
For each id, this keeps the rows where the reject_date is 1900-01-01 or the reject date is the maximum reject date for that id.
EDIT:
This might be more appropriate:
select t.*
from t
where t.reject_date = (select top 1 t2.reject_date
from t t2
where t2.id = t.id
order by (case when t2.reject_date = '1900-01-01' then 1 else 2 end),
t2.reject_date desc
);
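A small sketch of this "1900-01-01 first, otherwise the latest date" ordering, run in Python's sqlite3 module rather than SQL Server. SQLite has no TOP, so LIMIT 1 stands in for it here, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: id 1 has a 1900-01-01 placeholder, id 2 does not.
conn.executescript("""
    CREATE TABLE t (id INTEGER, order_date TEXT, reject_date TEXT);
    INSERT INTO t VALUES
        (1, '2017-01-05', '1900-01-01'),
        (1, '2017-01-06', '2017-02-01'),
        (2, '2017-03-01', '2017-03-10'),
        (2, '2017-03-02', '2017-04-15');
""")

# For each id, the subquery picks 1900-01-01 if present, else the latest date.
kept = conn.execute("""
    SELECT t.id, t.reject_date
    FROM t
    WHERE t.reject_date = (SELECT t2.reject_date
                           FROM t t2
                           WHERE t2.id = t.id
                           ORDER BY CASE WHEN t2.reject_date = '1900-01-01'
                                         THEN 1 ELSE 2 END,
                                    t2.reject_date DESC
                           LIMIT 1)
    ORDER BY t.id
""").fetchall()

print(kept)
```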
Seems you don't need row_number() for this
select id
, OrderDate
, RejectDate
, max(case when RejectDate = '1900-01-01' then '9999-12-31' else RejectDate end) as rSum
from tableA
group by id, OrderDate, RejectDate

Optimize SQL Script: getting range value from another table

I believe my script works, but it is not efficient: it takes so long to run that, when I run it at work, the whole session is aborted before it finishes.
I have basically 2 tables
Table A - contains every transaction a person makes
Person's_ID Transaction TransactionDate
---------------------------------------
123 A 01/01/2017
345 B 04/06/2015
678 C 13/07/2015
123 F 28/10/2016
Table B - contains person's ID and GraduationDate
What I want to do is check if a person is active.
Active = if there is at least 1 transaction done by the person 1 month before his GraduationDate
The run time is too long because there are millions of persons, each person makes multiple transactions, and each transaction is recorded as its own row in Table A.
SELECT
PERSON_ID
FROM
(SELECT PERSON_ID, TRANSACTIONDATE FROM TABLE_A) A
LEFT JOIN
(SELECT PERSON_ID, GRAD_DATE FROM TABLE_B) B
ON A.PERSON_ID = B.PERSON_ID
AND TRANSACTIONDATE <= GRAD_DATE
WHERE TRANSACTIONDATE BETWEEN GRAD_DATE - INTERVAL '30' DAY AND GRAD_DATE;
*Table A and B are products of joined tables hence they are subqueried.
If you just want active customers, I would try exists:
SELECT PERSON_ID
FROM TABLE_A A
WHERE EXISTS (SELECT 1
FROM TABLE_B B
WHERE A.PERSON_ID = B.PERSON_ID AND
A.TRANSACTIONDATE BETWEEN B.GRAD_DATE - INTERVAL '30' DAY AND GRAD_DATE
);
The performance, though, is likely to be similar to your query. If the tables were really tables, I would suggest indexes. In reality, you will probably need to understand the views (so you can create better indexes) or perhaps use temporary tables.
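For anyone who wants to try the EXISTS form on toy data, here is a sketch using Python's sqlite3 module. The transactions are the sample rows from the question converted to ISO dates; the graduation dates are invented, SQLite's date() modifier replaces the INTERVAL arithmetic, and DISTINCT is added so a person with several qualifying transactions is listed once. It says nothing about performance on the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# table_b contents are hypothetical; table_a follows the question's sample.
conn.executescript("""
    CREATE TABLE table_a (person_id INTEGER, txn TEXT, transaction_date TEXT);
    CREATE TABLE table_b (person_id INTEGER, grad_date TEXT);
    INSERT INTO table_a VALUES
        (123, 'A', '2017-01-01'),
        (345, 'B', '2015-06-04'),
        (678, 'C', '2015-07-13'),
        (123, 'F', '2016-10-28');
    INSERT INTO table_b VALUES
        (123, '2017-01-15'),
        (345, '2015-12-01'),
        (678, '2015-07-20');
""")

# A person is active if some transaction falls in the 30 days up to graduation.
active = conn.execute("""
    SELECT DISTINCT a.person_id
    FROM table_a a
    WHERE EXISTS (SELECT 1
                  FROM table_b b
                  WHERE a.person_id = b.person_id
                    AND a.transaction_date BETWEEN date(b.grad_date, '-30 days')
                                               AND b.grad_date)
    ORDER BY a.person_id
""").fetchall()

print(active)
```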
A non-equi-join might be quite inefficient (no matter whether it's coded as a join or an EXISTS), but the logic can be rewritten to:
SELECT
PERSON_ID
FROM
( -- combine both Selects
SELECT 0 AS flag, -- indicating source table
PERSON_ID, TRANSACTIONDATE AS dt
FROM TABLE_A
UNION ALL
SELECT 1 AS flag,
PERSON_ID, GRAD_DATE
FROM TABLE_B
) A
QUALIFY
flag = 1 -- only return a row from table B
AND Min(dt) -- if the previous row (from table A) is within 30 days
Over (PARTITION BY PERSON_ID
ORDER BY dt, flag
ROWS BETWEEN 1 Preceding AND 1 Preceding) >= dt - 30
This assumes that there's only one row from table B per person, otherwise the MIN has to be changed to:
AND MAX(CASE WHEN flag = 0 THEN dt END) -- if the latest earlier row from table A is within 30 days
Over (PARTITION BY PERSON_ID
ORDER BY dt, flag
ROWS UNBOUNDED Preceding) >= dt - 30
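A runnable sketch of the one-row look-back version, using Python's sqlite3 module. QUALIFY is not available in SQLite, so the window result is computed in a derived table and filtered in an outer WHERE; date() replaces the "dt - 30" arithmetic. The graduation dates are invented and each person has exactly one of them. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# table_b contents are hypothetical; table_a follows the question's sample.
conn.executescript("""
    CREATE TABLE table_a (person_id INTEGER, transaction_date TEXT);
    CREATE TABLE table_b (person_id INTEGER, grad_date TEXT);
    INSERT INTO table_a VALUES
        (123, '2017-01-01'), (345, '2015-06-04'),
        (678, '2015-07-13'), (123, '2016-10-28');
    INSERT INTO table_b VALUES
        (123, '2017-01-15'), (345, '2015-12-01'), (678, '2015-07-20');
""")

# Stack both tables, sort by date within each person, and look one row back
# from each graduation row (flag = 1) to the row immediately before it.
active = conn.execute("""
    SELECT person_id
    FROM (SELECT person_id, flag, dt,
                 MIN(dt) OVER (PARTITION BY person_id
                               ORDER BY dt, flag
                               ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_dt
          FROM (SELECT 0 AS flag, person_id, transaction_date AS dt FROM table_a
                UNION ALL
                SELECT 1 AS flag, person_id, grad_date FROM table_b) AS u) AS w
    WHERE flag = 1
      AND prev_dt >= date(dt, '-30 days')
    ORDER BY person_id
""").fetchall()

print(active)
```

Each table is read once and there is no join, which is the point of the rewrite.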

Compare two tables of data in HIVE

I have to find out if the data in both tables is the same for a given view_date. If it is the same, my SQL should return zero; otherwise, non-zero.
Table1/Table2 columns:
Source
view_date
count
start_date
end_date
I tried in the below way:
SELECT *
FROM (
SELECT count(*)
FROM table1
) a
JOIN (
SELECT count(*)
FROM TABLE 2
) b
WHERE view_date = '05/08/2016'
AND a.x != b.y;
But I am not getting the expected result. Could someone please help me?
Here is one method that counts the number of rows that appear in only one of the two tables:
select count(*)
from (select source, count, start_date, end_date,
min(which) as minwhich, max(which) as maxwhich
from ((select source, count, start_date, end_date, 1 as which
from table1
where view_date = '2016-06-08'
) union all
(select source, count, start_date, end_date, 2 as which
from table2
where view_date = '2016-06-08'
)
) t12
group by source, count, start_date, end_date
having minwhich = maxwhich
) t;
Note: If rows are duplicated across all values in a table, this does not check that the same number of duplicates are in each table.
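The min(which)/max(which) trick can be tried outside Hive with Python's sqlite3 module. This is only a sketch: the column count is renamed to cnt, start_date and end_date are omitted for brevity, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows: the 'app' row differs between the two tables.
conn.executescript("""
    CREATE TABLE table1 (source TEXT, view_date TEXT, cnt INTEGER);
    CREATE TABLE table2 (source TEXT, view_date TEXT, cnt INTEGER);
    INSERT INTO table1 VALUES ('web', '2016-06-08', 10), ('app', '2016-06-08', 5);
    INSERT INTO table2 VALUES ('web', '2016-06-08', 10), ('app', '2016-06-08', 7);
""")

# A group seen in both tables has min(which) = 1 and max(which) = 2;
# a group seen in only one table has min(which) = max(which).
mismatch_sql = """
    SELECT count(*)
    FROM (SELECT source, cnt
          FROM (SELECT source, cnt, 1 AS which FROM table1 WHERE view_date = ?
                UNION ALL
                SELECT source, cnt, 2 AS which FROM table2 WHERE view_date = ?) AS t12
          GROUP BY source, cnt
          HAVING min(which) = max(which)) AS t
"""
params = ("2016-06-08", "2016-06-08")
mismatches = conn.execute(mismatch_sql, params).fetchone()[0]

# Make the tables agree and run the same check again.
conn.execute("UPDATE table2 SET cnt = 5 WHERE source = 'app'")
after_fix = conn.execute(mismatch_sql, params).fetchone()[0]

print(mismatches, after_fix)
```

The first result is non-zero because each version of the 'app' row exists in only one table; after the update the check returns zero.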
To do a full comparison of 2 tables, you not only need to make sure that the number of rows match, but you must check that all the data in all the columns for all the rows match!
This can be a complicated problem (when I worked at Hortonworks, for 1 project we developed 3 different programs to try to solve this). Lately I had the opportunity to develop a program that solves this in an elegant and efficient way: https://github.com/bolcom/hive_compared_bq
The program shows you the differences in a webpage (which is something you could skip if you don't need it) and also gives you a return value 0/1 which is what you currently want.

SQL Server : UNION ALL but remove duplicate IDs by choosing first date of occurrence

I am unioning two queries but I'm getting an ID that occurs in each query. I do not know how to keep only the first time the id occurs. Everything else about the row is different. In general, it will be hard to know which of the two queries I will have to keep a duplicate on, therefore, I need a general solution.
I was thinking about creating a temp table and choosing the min date (once the date has been converted to an int).
Any ideas on the proper syntax?
You can do this using the row_number() function. This will assign a sequential number, starting with 1, to each row with the same id (based on the partition by clause). The ordering of the sequence is determined by the order by clause. So, the following assigns 1 to the earliest date for each id:
select t.*
from (select t.*,
row_number() over (partition by id order by date asc) as seqnum
from ((select *
from <subquery1>
) union all
(select *
from <subquery2>
)
) t
) t
where seqnum = 1;
The final where clause simply filters for the first occurrence.
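Here is a minimal sketch of the same pattern in Python's sqlite3 module instead of SQL Server. The tables q1 and q2 stand in for the two subqueries, and their columns and rows are invented; window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# q1 and q2 are hypothetical stand-ins for <subquery1> and <subquery2>.
conn.executescript("""
    CREATE TABLE q1 (id INTEGER, dt TEXT, note TEXT);
    CREATE TABLE q2 (id INTEGER, dt TEXT, note TEXT);
    INSERT INTO q1 VALUES (1, '2014-01-05', 'q1-a'), (2, '2014-02-01', 'q1-b');
    INSERT INTO q2 VALUES (1, '2014-01-02', 'q2-a'), (3, '2014-03-01', 'q2-c');
""")

# Number the combined rows per id by date and keep only the earliest one.
first_rows = conn.execute("""
    SELECT id, dt, note
    FROM (SELECT u.*,
                 row_number() OVER (PARTITION BY id ORDER BY dt ASC) AS seqnum
          FROM (SELECT * FROM q1
                UNION ALL
                SELECT * FROM q2) AS u) AS ranked
    WHERE seqnum = 1
    ORDER BY id
""").fetchall()

print(first_rows)
```

For id 1, which occurs in both inputs, the row from q2 wins because its date is earlier; it does not matter which query the earliest row came from.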
If you use the keyword UNION, then it will remove duplicates from the two data sets you are working with. UNION ALL preserves duplicates.
You can view the specifics here:
http://www.w3schools.com/sql/sql_union.asp
If you want only one of the two records and they are not identical, you will have to filter them yourself. You may need to do something like the following. This may be possible with a single (select union select) block, but this should get you started.
select *
from (
select id
, date
, otherstuf
from table_1
union all
select id
, date
, otherstuf
from table_2
) x1
, (
select id
, date
, otherstuf
from table_1
union all
select id
, date
, otherstuf
from table_2
) x2
where x1.id = x2.id
and x1.date < x2.date
Although, on reflection, if you go down a path like this, why bother with the UNION at all?

sql query to get earliest date

If I have a table with columns id, name, score, date
and I wanted to run a sql query to get the record where id = 2 with the earliest date in the data set.
Can you do this within the query or do you need to loop after the fact?
I want to get all of the fields of that record..
If you just want the date:
SELECT MIN(date) as EarliestDate
FROM YourTable
WHERE id = 2
If you want all of the information:
SELECT TOP 1 id, name, score, date
FROM YourTable
WHERE id = 2
ORDER BY Date
Prevent loops when you can. Loops often lead to cursors, and cursors are almost never necessary and very often really inefficient.
SELECT TOP 1 ID, Name, Score, [Date]
FROM myTable
WHERE ID = 2
Order BY [Date]
While using TOP or a sub-query both work, I would break the problem into steps:
Find target record
SELECT MIN( date ) AS date, id
FROM myTable
WHERE id = 2
GROUP BY id
Join to get other fields
SELECT mt.id, mt.name, mt.score, mt.date
FROM myTable mt
INNER JOIN
(
SELECT MIN( date ) AS date, id
FROM myTable
WHERE id = 2
GROUP BY id
) x ON x.date = mt.date AND x.id = mt.id
While this solution, using derived tables, is longer, it is:
Easier to test
Self documenting
Extendable
It is easier to test as parts of the query can be run standalone.
It is self-documenting as the query directly reflects the requirement,
i.e. the derived table lists the row where id = 2 with the earliest date.
It is extendable as if another condition is required, this can be easily added to the derived table.
Try
select * from dataset
where id = 2
order by date limit 1
Been a while since I did sql, so this might need some tweaking.
Using "limit" and "top" will not work on every SQL database (for example, neither works in Oracle).
You can try a more complex query in pure sql:
select mt1.id, mt1."name", mt1.score, mt1."date" from mytable mt1
where mt1.id=2
and mt1."date"= (select min(mt2."date") from mytable mt2 where mt2.id=2)
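The min() subquery form uses no vendor-specific keyword, so it is easy to try anywhere. A small sketch with Python's sqlite3 module follows; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; id 2 has two records with different dates.
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, name TEXT, score INTEGER, "date" TEXT);
    INSERT INTO mytable VALUES
        (2, 'first',  10, '2010-01-03'),
        (2, 'second', 20, '2010-01-01'),
        (3, 'other',  30, '2009-12-31');
""")

# The subquery finds the earliest date for id = 2; the outer query returns
# every column of the row(s) carrying that date.
rows = conn.execute("""
    SELECT mt1.id, mt1.name, mt1.score, mt1."date"
    FROM mytable mt1
    WHERE mt1.id = 2
      AND mt1."date" = (SELECT min(mt2."date") FROM mytable mt2 WHERE mt2.id = 2)
""").fetchall()

print(rows)
```

If two rows for id = 2 share the earliest date, this form returns both, unlike TOP 1 or LIMIT 1.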