SQL Server: UNION ALL but remove duplicate IDs by choosing first date of occurrence

I am unioning two queries, but I'm getting an ID that occurs in each query. I do not know how to keep only the first time the ID occurs. Everything else about the row is different. In general, it will be hard to know which of the two queries holds the row I need to keep, so I need a general solution.
I was thinking about creating a temp table and choosing the min date (once the date has been converted to an int).
Any ideas on the proper syntax?

You can do this using the row_number() function. This will assign a sequential number, starting with 1, to each row with the same id (based on the partition by clause). The ordering of the sequence is determined by the order by clause. So, the following assigns 1 to the earliest date for each id:
select t.*
from (select t.*,
             row_number() over (partition by id order by date asc) as seqnum
      from ((select *
             from <subquery1>
            ) union all
            (select *
             from <subquery2>
            )
           ) t
     ) t
where seqnum = 1;
The final where clause simply filters for the first occurrence.
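To try this pattern without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite 3.25 or later for window function support. The tables q1 and q2, their columns, and the sample rows are hypothetical stand-ins for <subquery1> and <subquery2>.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the two subqueries.
    CREATE TABLE q1 (id INTEGER, date TEXT, note TEXT);
    CREATE TABLE q2 (id INTEGER, date TEXT, note TEXT);
    INSERT INTO q1 VALUES (1, '2014-01-05', 'a'), (2, '2014-02-01', 'b');
    INSERT INTO q2 VALUES (1, '2014-01-02', 'c'), (3, '2014-03-01', 'd');
""")

# Number the rows of each id by date, then keep only the earliest one.
first_rows = conn.execute("""
    SELECT id, date, note
    FROM (SELECT u.*,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY date ASC) AS seqnum
          FROM (SELECT * FROM q1
                UNION ALL
                SELECT * FROM q2) AS u) AS t
    WHERE seqnum = 1
    ORDER BY id
""").fetchall()

for row in first_rows:
    print(row)
```

Here id 1 occurs in both tables, and only the row with the earlier date survives. If two rows for the same id share the same date, which one gets seqnum 1 is arbitrary unless a tie-breaking column is added to the order by.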

If you use the keyword UNION, it removes duplicate rows from the combined result, but only rows that are identical in every column. UNION ALL preserves duplicates.
You can view the specifics here:
http://www.w3schools.com/sql/sql_union.asp

If you want only one of the two records and they are not identical, you will have to filter them yourself. You may need to do something like the following. This may be possible with a single (select union select) block, but this should get you started.
select *
from (
select id
, date
, otherstuf
from table_1
union all
select id
, date
, otherstuf
from table_2
) x1
, (
select id
, date
, otherstuf
from table_1
union all
select id
, date
, otherstuf
from table_2
) x2
where x1.id = x2.id
and x1.date < x2.date
Although, on rethinking this: if you go down a path like this, why bother with the UNION at all?

Related

How create a unique ID based on conditions in SQL?

I would like to get a new ID, no matter the format (in the example below 11,12,13...)
Based on the following condition:
Every time the days column value is greater than 1 and not null, the current row and all following rows get the same ID, until a new value meets the condition.
This applies within the same email.
Below you can see the expected one (in the format XX).
I thought about using two conditions, in the following order:
1. Every time the days column value is greater than 1, all following rows get the same ID until a new value meets the condition.
2. AND when the lag (previous value) is equal to 0/1/null.
Assuming you have an EmailDate column over which you're ordering (a DATETIME field, really), try something like this:
WITH
TableNameWithEmailDateIDs AS (
SELECT
*,
ROW_NUMBER() OVER (
ORDER BY
Email DESC,
EmailDate
) AS EmailDateID
FROM
TableName
),
IDs AS (
SELECT
*,
LEAD(EmailDateID, 1) OVER (
ORDER BY
Email,
EmailDate
) AS LeadEmailDateID
FROM
(
SELECT
*,
-- REMOVE +10 if you don't want 11 to be starting ID
ROW_NUMBER() OVER (
ORDER BY
Email DESC,
EmailDate
)+10 AS ID
FROM
TableNameWithEmailDateIDs
WHERE
Days > 1
OR Days IS NULL
) X
)
SELECT
COALESCE(TableName.EmailDate, IDs.EmailDate) AS EmailDate,
IDs.Email,
COALESCE(TableName.Days, IDs.Days) AS Days,
IDs.ID
FROM
IDs
LEFT JOIN TableNameWithEmailDateIDs TableName
ON IDs.Email = TableName.Email
AND TableName.EmailDateID BETWEEN
IDs.EmailDateID
AND IDs.LeadEmailDateID-1
ORDER BY
ID DESC,
TableName.EmailDate DESC
;
First, create a CTE that generates IDs for each distinct Email/Date combo (helpful for LEFT JOIN condition later). Then, create a CTE that generates IDs for rows that meet your condition (i.e. the important rows). Finally, LEFT JOIN your main table onto that CTE to fill in the "gaps", so to speak.
I suggest running each of the components of this query independently to fully understand what's going on.
Hope it helps!
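A different technique from the query above, sometimes called a running-sum or "gaps and islands" approach, can express the same idea more compactly: flag each row where days is greater than 1, then take a running total of the flags per email. The sketch below is only one reading of the requirement. It uses Python's built-in sqlite3 module (SQLite 3.25 or later assumed), and the table emails, its columns, and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE emails (email TEXT, email_date TEXT, days INTEGER);
    INSERT INTO emails VALUES
        ('a@x.com', '2020-01-01', NULL),
        ('a@x.com', '2020-01-02', 1),
        ('a@x.com', '2020-01-05', 3),
        ('a@x.com', '2020-01-06', 0),
        ('a@x.com', '2020-01-10', 4),
        ('b@x.com', '2020-01-01', NULL),
        ('b@x.com', '2020-01-03', 2);
""")

rows = conn.execute("""
    SELECT email, email_date, days,
           -- Turn each (email, grp) pair into one ID; +10 makes IDs start at 11.
           DENSE_RANK() OVER (ORDER BY email, grp) + 10 AS id
    FROM (SELECT e.*,
                 -- Running count of rows with days > 1 within each email.
                 -- NULL days fail the comparison and count as 0.
                 SUM(CASE WHEN days > 1 THEN 1 ELSE 0 END)
                     OVER (PARTITION BY email ORDER BY email_date) AS grp
          FROM emails AS e) AS g
    ORDER BY email, email_date
""").fetchall()

ids = [r[3] for r in rows]
for row in rows:
    print(row)
```

Rows that come before the first qualifying row of an email share a group of their own (grp = 0) in this sketch. Whether that matches the expected output depends on details the question does not spell out.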

Compare two tables of data in HIVE

I have to find out whether the data in both tables is the same for a given view_date. If it is the same, my SQL should return zero, otherwise non-zero.
Table1/Table2 columns:
Source
view_date
count
start_date
end_date
I tried in the below way:
SELECT *
FROM (
SELECT count(*)
FROM table1
) a
JOIN (
SELECT count(*)
FROM TABLE 2
) b
WHERE view_date = '05/08/2016'
AND a.x != b.y;
But I am not getting the expected result. Could someone please help me?
Here is one method that counts the number of rows that are unique in each table:
select count(*)
from (select source, count, start_date, end_date,
             min(which) as minwhich, max(which) as maxwhich
      from ((select source, count, start_date, end_date, 1 as which
             from table1
             where view_date = '2016-06-08'
            ) union all
            (select source, count, start_date, end_date, 2 as which
             from table2
             where view_date = '2016-06-08'
            )
           ) t12
      group by source, count, start_date, end_date
      having minwhich = maxwhich
     ) t;
Note: If rows are duplicated across all values in a table, this does not check that the same number of duplicates are in each table.
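The same counting idea can be tried locally. The following is a minimal sketch using Python's built-in sqlite3 module rather than Hive, so the syntax is adapted: the compound select is not parenthesized and the HAVING clause repeats the aggregates. The helper name mismatch_count, the column name cnt (used in place of count) and the sample rows are all hypothetical.

```python
import sqlite3

def mismatch_count(conn, view_date):
    """Count rows for view_date that appear in only one of the two tables."""
    sql = """
        SELECT COUNT(*)
        FROM (SELECT source, cnt, start_date, end_date
              FROM (SELECT source, cnt, start_date, end_date, 1 AS which
                    FROM table1 WHERE view_date = :d
                    UNION ALL
                    SELECT source, cnt, start_date, end_date, 2 AS which
                    FROM table2 WHERE view_date = :d) AS t12
              GROUP BY source, cnt, start_date, end_date
              -- A row seen in only one table has MIN(which) = MAX(which).
              HAVING MIN(which) = MAX(which)) AS t
    """
    return conn.execute(sql, {"d": view_date}).fetchone()[0]

conn = sqlite3.connect(":memory:")
for name in ("table1", "table2"):
    conn.execute(
        f"CREATE TABLE {name} (source TEXT, view_date TEXT, cnt INTEGER, "
        "start_date TEXT, end_date TEXT)"
    )

# Hypothetical data: the 'B' row differs between the tables on 2016-06-08.
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", [
    ("A", "2016-06-08", 10, "2016-06-01", "2016-06-07"),
    ("B", "2016-06-08", 5, "2016-06-01", "2016-06-07"),
    ("A", "2016-06-09", 3, "2016-06-02", "2016-06-08"),
])
conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?, ?)", [
    ("A", "2016-06-08", 10, "2016-06-01", "2016-06-07"),
    ("B", "2016-06-08", 6, "2016-06-01", "2016-06-07"),
    ("A", "2016-06-09", 3, "2016-06-02", "2016-06-08"),
])

print(mismatch_count(conn, "2016-06-08"))
print(mismatch_count(conn, "2016-06-09"))
```

A result of zero means every distinct row for that date appears in both tables. As noted above, it does not verify that duplicated rows occur the same number of times on each side.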
To do a full comparison of 2 tables, you not only need to make sure that the number of rows match, but you must check that all the data in all the columns for all the rows match!
This can be a complicated problem (when I worked at Hortonworks, for 1 project we developed 3 different programs to try to solve this). Lately I had the opportunity to develop a program that solves this in an elegant and efficient way: https://github.com/bolcom/hive_compared_bq
The program shows you the differences in a webpage (which is something you could skip if you don't need it) and also gives you a return value 0/1 which is what you currently want.

How to do a query that is agnostic of the sort field?

I have multiple tables that each have the same date_time field. After doing a UNION of all the tables, I want to sort the rows by the most recent one, but the query tells me that I have to add a table name, like videos.date_time, rather than ORDER BY date_time. How can I structure the query so that it is agnostic of which date_time field is used?
Unless you are using a proprietary feature such as SQL Server's TOP directive, the Order By in a Union query is always at the bottom and always applies to the entire query. E.g.
Select Col1, date_time
From Table1
Union All
Select Col1, date_time
From Table2
Order By date_time
If your query does include elements such as TOP or LIMIT, which require an Order By, and you therefore want to differentiate the Order By clauses, you can encapsulate your query in a derived table:
Select Col, date_time
From (
Select Col1 As Col, date_time
From Table1
Union All
Select Col1, date_time
From Table2
) As Z
Order By Z.date_time
In SQL Server you can also order by a column number, e.g. "ORDER BY 2" in which case whatever the second column is in your union set would be the sort target.
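To see the trailing Order By apply to the whole union, here is a minimal sketch using Python's built-in sqlite3 module. The tables videos and photos, the title column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables that share a date_time column.
    CREATE TABLE videos (title TEXT, date_time TEXT);
    CREATE TABLE photos (title TEXT, date_time TEXT);
    INSERT INTO videos VALUES ('v1', '2020-01-03 10:00:00'),
                              ('v2', '2020-01-01 09:00:00');
    INSERT INTO photos VALUES ('p1', '2020-01-02 12:00:00');
""")

# The ORDER BY names the output column of the union, not a table column,
# so no table prefix is needed.
newest_first = conn.execute("""
    SELECT title, date_time FROM videos
    UNION ALL
    SELECT title, date_time FROM photos
    ORDER BY date_time DESC
""").fetchall()

titles = [title for title, _ in newest_first]
print(titles)
```

The sort key is resolved against the column names of the first select, which is why using the same name or alias in every branch keeps the query independent of any one table.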
As I understand it, you have X tables (where X > 1), every table has its own date_time column, and you want to get the most recently updated rows. If that's true, then one possible way to do it is:
SELECT id, date_added FROM table1
UNION ALL
SELECT id, date_added FROM table2
ORDER BY date_added DESC;
Another way I have in mind is to fetch the results, put them in an array, and do the "magic" (the sorting) there.

Oracle query needs to return the highest date from result

I have a really big query which is causing me some trouble, because one join can return several rows. I only want the latest row (identified by a date field) in this result set, but I can't seem to put together the correct query to make it work.
The query I need MAX date from is:
SELECT custid,reason,date FROM OPT opt WHERE opt.custid = 167043;
The custid is really found through a join, but for simplicity I've added it to the where clause here. This query produces the following result:
custid reason date
167043 "Test 1" 19.10.2005 12:33:18
167043 "Test 2" 28.11.2005 16:23:35
167043 "Test 3" 14.06.2010 15:43:16
How can I retrieve only one record from this result set, namely the one with the highest date? Ultimately I'm putting this into a big query which does a lot of joins, so hopefully I can carry this example over into my bigger query.
You can do this:
SELECT * FROM
( SELECT custid,reason,date FROM OPT opt WHERE opt.custid = 167043
ORDER BY date DESC
)
WHERE ROWNUM = 1;
You can solve it by using analytic functions. Try something like this:
select custid
      ,reason
      ,date
from (select custid
            ,reason
            ,date
            ,row_number() over(partition by custid order by date desc) as rn
      from opt)
where rn = 1;
This is how it works: The result set is divided into groups of custid (partition by). In each group, the rows are sorted by the date column in descending order (order by). Each row within the group is assigned a sequence number (row_number) from 1 to N.
This way the row with the highest value for date is assigned 1, the second latest 2, the third latest 3, and so on.
Finally, I just pick the rows with rn = 1, which basically filters out the other rows.
Or another way using the LAST function in its aggregate form.
with my_source_data as (
select 167043 as custid, 'Test 1' as reason, date '2010-10-01' as the_date from dual union all
select 167043 as custid, 'Test 2' as reason, date '2010-10-02' as the_date from dual union all
select 167043 as custid, 'Test 3' as reason, date '2010-10-03' as the_date from dual union all
select 167044 as custid, 'Test 1' as reason, date '2010-10-01' as the_date from dual
)
select
custid,
max(reason) keep (dense_rank last order by the_date) as reason,
max(the_date)
from my_source_data
group by custid
I find this quite useful as it rolls the process of finding the last row and the value all into one. MAX (or another aggregate function such as MIN) is used in case the combination of the grouping and the order by is not deterministic, i.e. several rows tie for last.
This function will basically take the contents of the column based on the grouping, order it by the ordering given then take the last value.
Rather than using row_number(), I think it's better to select what you actually want to select (e.g. the last date):
SELECT custid
     , reason
     , date
FROM (
    SELECT custid
         , reason
         , date
         , max(opt.date) over (partition by opt.custid) last_date
    FROM OPT opt
    WHERE opt.custid = 167043
)
WHERE date = last_date
Both solutions, with ROW_NUMBER and with KEEP, are good. I would tend to prefer ROW_NUMBER when retrieving a large number of columns, and reserve KEEP for one or two columns; otherwise you will have to deal with duplicates and the statement will get pretty unreadable.
For a small number of columns, however, KEEP should perform better.

sql query to get earliest date

If I have a table with columns id, name, score, date
and I wanted to run a sql query to get the record where id = 2 with the earliest date in the data set.
Can you do this within the query or do you need to loop after the fact?
I want to get all of the fields of that record..
If you just want the date:
SELECT MIN(date) as EarliestDate
FROM YourTable
WHERE id = 2
If you want all of the information:
SELECT TOP 1 id, name, score, date
FROM YourTable
WHERE id = 2
ORDER BY Date
Prevent loops when you can. Loops often lead to cursors, and cursors are almost never necessary and very often really inefficient.
SELECT TOP 1 ID, Name, Score, [Date]
FROM myTable
WHERE ID = 2
Order BY [Date]
While using TOP or a sub-query both work, I would break the problem into steps:
Find target record
SELECT MIN( date ) AS date, id
FROM myTable
WHERE id = 2
GROUP BY id
Join to get other fields
SELECT mt.id, mt.name, mt.score, mt.date
FROM myTable mt
INNER JOIN
(
SELECT MIN( date ) AS date, id
FROM myTable
WHERE id = 2
GROUP BY id
) x ON x.date = mt.date AND x.id = mt.id
While this solution, using derived tables, is longer, it is:
Easier to test
Self documenting
Extendable
It is easier to test as parts of the query can be run standalone.
It is self documenting as the query directly reflects the requirement, i.e. the derived table lists the row where id = 2 with the earliest date.
It is extendable as if another condition is required, this can be easily added to the derived table.
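The two steps can be run end to end with a minimal sketch using Python's built-in sqlite3 module. The sample rows in myTable are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data.
    CREATE TABLE myTable (id INTEGER, name TEXT, score INTEGER, date TEXT);
    INSERT INTO myTable VALUES
        (1, 'bob', 5, '2020-01-01'),
        (2, 'ann', 7, '2020-03-01'),
        (2, 'ann', 9, '2020-01-15'),
        (2, 'ann', 4, '2020-02-01');
""")

# The derived table x finds the earliest date for id = 2;
# the join then pulls in the remaining columns of that row.
earliest = conn.execute("""
    SELECT mt.id, mt.name, mt.score, mt.date
    FROM myTable AS mt
    INNER JOIN (SELECT MIN(date) AS date, id
                FROM myTable
                WHERE id = 2
                GROUP BY id) AS x
            ON x.date = mt.date AND x.id = mt.id
""").fetchall()

print(earliest)
```

If several rows for id = 2 share the same earliest date, this join returns all of them, whereas TOP 1 would return just one.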
Try
select * from dataset
where id = 2
order by date limit 1
Been a while since I did sql, so this might need some tweaking.
Using "limit" and "top" will not work with all SQL servers (for example with Oracle).
You can try a more complex query in pure sql:
select mt1.id, mt1."name", mt1.score, mt1."date" from mytable mt1
where mt1.id=2
and mt1."date"= (select min(mt2."date") from mytable mt2 where mt2.id=2)