Compare two tables of data in HIVE

I have to find out whether the data in both tables is the same for a given view_date. If it is the same, my SQL should return zero; otherwise it should return a non-zero value.
Table1/Table2 columns:
Source
view_date
count
start_date
end_date
I tried in the below way:
SELECT *
FROM (
SELECT count(*)
FROM table1
) a
JOIN (
SELECT count(*)
FROM TABLE 2
) b
WHERE view_date = '05/08/2016'
AND a.x != b.y;
But I am not getting the expected result. Could someone please help me?

Here is one method that counts the number of rows that are unique in each table:
select count(*)
from (select source, count, start_date, end_date,
min(which) as minwhich, max(which) as maxwhich
from ((select source, count, start_date, end_date, 1 as which
from table1
where view_date = '2016-06-08'
) union all
(select source, count, start_date, end_date, 2 as which
from table2
where view_date = '2016-06-08'
)
) t12
group by source, count, start_date, end_date
having minwhich = maxwhich
) t;
Note: If rows are duplicated across all values in a table, this does not check that the same number of duplicates are in each table.
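One possible way to also cover the duplicate case mentioned in the note is to count, per distinct row, how many copies each table holds and report the rows where the two counts differ. The sketch below is only an illustration: it runs the SQL through Python's built-in sqlite3 module as a stand-in for Hive, the sample rows are made up, and the helper name mismatches is hypothetical. The table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("table1", "table2"):
    conn.execute(
        f'CREATE TABLE {name} (source TEXT, view_date TEXT, "count" INTEGER, '
        "start_date TEXT, end_date TEXT)"
    )

# Made-up sample data: table1 holds the same row twice, table2 only once.
row = ("web", "2016-06-08", 10, "2016-06-01", "2016-06-07")
conn.executemany("INSERT INTO table1 VALUES (?,?,?,?,?)", [row, row])
conn.executemany("INSERT INTO table2 VALUES (?,?,?,?,?)", [row])

# Per distinct row, count the copies coming from each table and keep
# only the rows where the two counts differ.
DIFF_SQL = """
SELECT count(*) FROM (
  SELECT source, "count", start_date, end_date
  FROM (
    SELECT source, "count", start_date, end_date, 1 AS which
    FROM table1 WHERE view_date = :d
    UNION ALL
    SELECT source, "count", start_date, end_date, 2 AS which
    FROM table2 WHERE view_date = :d
  ) t12
  GROUP BY source, "count", start_date, end_date
  HAVING sum(CASE WHEN which = 1 THEN 1 ELSE 0 END)
      <> sum(CASE WHEN which = 2 THEN 1 ELSE 0 END)
) t
"""

def mismatches(view_date):
    """Return the number of distinct rows whose copy counts differ."""
    return conn.execute(DIFF_SQL, {"d": view_date}).fetchone()[0]

before = mismatches("2016-06-08")   # duplicate count differs
conn.execute("INSERT INTO table2 VALUES (?,?,?,?,?)", row)
after = mismatches("2016-06-08")    # both tables now hold two copies
print(before, after)
```

A result of zero then means that every row appears the same number of times in both tables for that view_date.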

To do a full comparison of two tables, you not only need to make sure that the number of rows matches, you must also check that all the data in all the columns matches for every row!
This can be a complicated problem (when I worked at Hortonworks, we developed three different programs for one project to try to solve it). Lately I had the opportunity to develop a program that solves this in an elegant and efficient way: https://github.com/bolcom/hive_compared_bq
The program shows you the differences in a webpage (which is something you could skip if you don't need it) and also gives you a return value 0/1 which is what you currently want.

Related

SQL get latest availability per member

I have a situation where I store in a table each member's availability.
It's a simple table with 4 columns.
CREATE TABLE availablities (
availablity_id serial PRIMARY KEY,
member_id serial,
availablity_status_id serial,
start_date timestamp
);
Each member can have multiple records in the table, and to get the current status
I take, for each member, the record with the most recent start_date that is not later than now().
I first tried a naive max() and GROUP BY query:
select
status_code, max(start_date) start_date,availablities.member_id
from
availablities
join
availablity_status on availablity_status.availablity_status_id = availablities.availablity_status_id
where
start_date <= now()
group by
status_code,availablities.member_id;
But this returns multiple records per member, because I get the most recent record per member and per status.
I finally came up with a query that gives me the expected result.
select status_code,start_date,a2.member_id from availablities a2
join availablity_status on availablity_status.availablity_status_id = a2.availablity_status_id
where a2.availablity_id in(
select
max(availablity_id)
from availablities a
where
a.member_id = a2.member_id and
start_date in(
select
max(start_date) start_date
from availablities
where
start_date <= now()
and a.member_id = availablities.member_id
)
);
But this query takes 60 times longer to execute and doesn't feel right.
I'm pretty sure there must be a better solution, but I can't find it.
What is the correct way to get the expected result?
I've created a DB-fiddle to make it easier to see. Query 1 is incorrect, and Query 2 is much slower once the table holds more data.
https://www.db-fiddle.com/f/iWgvuj8kcms9F5CKuoKsny/2

It looks like you need to use a simple row_number window function here:
with a as (
select *, Row_Number() over(partition by member_id order by start_date desc, availablity_id desc) rn
from availablities
where start_date<=now()
)
select s.status_code, a.start_date, a.member_id
from a join availablity_status s on s.availablity_status_id=a.availablity_status_id
where rn=1
Note that your data is not selective enough: for member_id 3, is it available or not? Which record is the most recent when two records have identical dates?
I added a tie-breaker that also sorts by availablity_id to get your expected results.
(The column really is spelled availablity_id, not availability_id; the typo is consistent throughout your schema.)
See your updated Fiddle
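To see the tie-breaker at work, here is a minimal sketch that runs the same row_number() pattern through Python's sqlite3 module instead of PostgreSQL. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The sample rows are invented, and a fixed timestamp parameter takes the place of now() so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE availablities (
  availablity_id INTEGER PRIMARY KEY,
  member_id INTEGER,
  availablity_status_id INTEGER,
  start_date TEXT
);
INSERT INTO availablities VALUES
  (1, 1, 1, '2020-01-01 00:00:00'),
  (2, 1, 2, '2020-02-01 00:00:00'),
  (3, 2, 1, '2020-01-15 00:00:00'),
  (4, 2, 3, '2020-01-15 00:00:00'),  -- same timestamp as id 3: a tie
  (5, 1, 1, '2999-01-01 00:00:00');  -- starts in the future: ignored
""")

# Number each member's rows from newest to oldest; on equal start_date
# the higher availablity_id wins, so exactly one row gets rn = 1.
LATEST_SQL = """
WITH a AS (
  SELECT *,
         row_number() OVER (PARTITION BY member_id
                            ORDER BY start_date DESC, availablity_id DESC) AS rn
  FROM availablities
  WHERE start_date <= :now
)
SELECT member_id, availablity_status_id, start_date
FROM a
WHERE rn = 1
ORDER BY member_id
"""

latest = conn.execute(LATEST_SQL, {"now": "2021-01-01 00:00:00"}).fetchall()
for record in latest:
    print(record)
```

Without the second sort key, either of the two tied rows for member 2 could be numbered 1, so the result would not be deterministic.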

Joining data from two sources using bigquery

Can anyone please check whether the code below is correct? In cte_1, I'm taking all dimensions and metrics from t1 except value1, value2, value3. In cte_2, I'm generating a unique row number for t2. In cte_3, I'm taking all distinct dimensions and metrics using a join on two keys, Date and Ad. In cte_4, I'm keeping only the rows with row number 1. I'm getting sum(value1), sum(value2), sum(value3) correct, but sum(value4) is incorrect.
WITH cte_1 AS
(SELECT *except(value1, value2, value3) FROM t1 where Date >"2020-02-16" and Publisher ="fb")
-- Find unique row number from t2--
,cte_2 as(
SELECT ROW_NUMBER() OVER(ORDER BY Date) distinct_row_number, * FROM t2)
,cte_3 as
(SELECT cte_2.*,cte_1.*except(Date) FROM cte_2 join cte_1
on cte_2.Date = cte_1.Date
and cte_2.Ad= cte_1.Ad)
,cte_4 AS (
(SELECT *
FROM
(
SELECT *,
row_number() OVER (PARTITION BY distinct_row_number ORDER BY Date) as rn
FROM cte_3 ) T
where rn = 1 ))
select sum(value1),sum(value2),sum(value3),sum(value4) from cte_4
Please see the sample table below:

Whilst your data does not seem compliant with the query you shared, since it is lacking the field named Ad and other fields have different names, such as Date and ReportDate, I was able to identify some issues and propose improvements.
First, your temp table cte_1 only applies a filter in its WHERE clause. You could instead apply that filter in a subquery in the FROM clause of your last step, such as:
SELECT * FROM (SELECT field1,field2,field3 FROM t1 WHERE Date > DATE(2020,02,16) )
Second, in cte_2, you need to select all the columns you will need from the table t2. Otherwise, your table will contain only the row number and it won't be possible to join it with other tables, since it carries no other information. Thus, if you need the row number, select it together with the other columns, which must include your primary key if you intend to perform any join later. The syntax would be as follows:
SELECT field1, field2, ROW_NUMBER() OVER(ORDER BY Date) FROM t2
Third, in cte_3, I assume you want to perform an INNER JOIN. Thus, you need to make sure that the join keys are present in both tables, in your case Date and Ad, which I could not find within your data. Furthermore, you cannot have duplicated column names when joining two tables and selecting all the columns. For example, in your case Brand, value 1, value 2 and value 3 exist in both tables, which will cause an error. Thus, you need to specify where these fields should come from, either by selecting them one by one or by using an EXCEPT clause.
Finally, cte_4 and your final select could be combined into one step. Basically, you are selecting only one row of data per partition, ordered by Date, and then summing the fields value 1, value 2 and value 3 individually. Moreover, you are not selecting any identifier alongside the sums, which means that your result will contain only the final totals. In general, when performing an aggregation such as SUM(), the primary key(s) are selected as well. Lastly, this could have been done in one step as follows, using only the data from t2:
SELECT ReportDate, Brand, sum(value1) as sum_1, sum(value2) as sum_2, sum(value3) as sum_3, sum(value4) as sum_4 FROM (SELECT t2.*, ROW_NUMBER() OVER(PARTITION BY Date ORDER BY Date) as rn FROM t2)
WHERE rn=1
GROUP BY ReportDate, Brand
UPDATE:
With your explanation in the comment section, I was able to create a more specific query. The fields ReportDate, Brand, Portfolio, Campaign and value1, value2, value3 come from t2, whilst value4 comes from t1. The sum is computed over the rows whose row number equals 1. For this reason, the tables t1 and t2 are joined before ROW_NUMBER() is applied. Finally, in the last SELECT statement, rn is not selected and the data is aggregated by ReportDate, Brand, Portfolio and Campaign.
WITH cte_1 AS (
SELECT t2.ReportDate, t2.Brand, t2.Portfolio, t2.Campaign,
t2.value1, t2.value2, t2.value3, t1.value4
FROM t2 LEFT JOIN t1 on t2.ReportDate = t1.ReportDate and t1.placement=t2.Ad
),
cte_2 AS(
SELECT *, ROW_NUMBER() OVER(PARTITION BY Date ORDER BY ReportDate) as rn FROM cte_1
)
SELECT ReportDate, Brand, Portfolio, Campaign, SUM(value1) as sum1, SUM(value2) as sum2, SUM(value3) as sum3,
SUM(value4) as sum4
FROM cte_2
WHERE rn=1
GROUP BY 1,2,3,4

Teradata SQL Get Rid of Duplicates with Specific Order

I just started teradata SQL this week, so sorry if I don't phrase things correctly. I originally created a script in R that gets rid of duplicates within my table, but now I need to transfer this code into SQL. Here is some sample data:
I want to get rid of any D's in the DELETE column, partition by ID, order by STATUS, DATE, and AMOUNT (with actual dates and amounts before ?s). I want STATUS to go in this order: P, H, F, U, T. I want the first row that has STATUS, DATE, and AMOUNT filled out (with STATUS in order). Here is the example output data:
I'm really stuck on the order issue and the code I've written isn't producing any data at all (but no errors).
SAMPLE CODE:
CREATE VOLATILE TABLE new_tble
AS
(SELECT *
FROM table
QUALIFY row_number() OVER (partition BY ID ORDER BY ID, DATE, AMOUNT)=1
WHERE DELETE <> 'D'
)
with data;

This is a direct translation of your description into Teradata SQL, assuming ? means NULL:
select *
from tab
where "delete" is null
and "date" is not null
and amount is not null
qualify
row_number()
over (partition by id
order by case status
when 'P' then 1
when 'H' then 2
when 'F' then 3
when 'U' then 4
when 'T' then 5
end
,"date"
,amount) = 1
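The CASE expression in the ORDER BY is what turns the letters P, H, F, U, T into a sortable rank. A rough illustration of that idea follows, using Python's sqlite3 module rather than Teradata. SQLite has no QUALIFY clause, so the row_number() filter moves into an outer query, and window functions need SQLite 3.25 or newer. The sample rows are invented, and NULL stands in for the ? values as the answer assumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab (id INTEGER, status TEXT, "date" TEXT, amount REAL, "delete" TEXT);
INSERT INTO tab VALUES
  (1, 'H', '2020-01-02', 50.0, NULL),
  (1, 'P', '2020-01-05', 20.0, NULL),
  (1, 'P', '2020-01-01', 99.0, 'D'),   -- flagged for deletion
  (2, 'T', '2020-03-01', 10.0, NULL),
  (2, 'F', NULL,         10.0, NULL),  -- date missing
  (2, 'U', '2020-03-02', 30.0, NULL);
""")

# Rank statuses with CASE, then keep the first row per id.
# The outer WHERE rn = 1 plays the role of Teradata's QUALIFY.
FIRST_SQL = """
SELECT id, status, "date", amount
FROM (
  SELECT t.*,
         row_number() OVER (
           PARTITION BY id
           ORDER BY CASE status
                      WHEN 'P' THEN 1
                      WHEN 'H' THEN 2
                      WHEN 'F' THEN 3
                      WHEN 'U' THEN 4
                      WHEN 'T' THEN 5
                    END,
                    "date", amount) AS rn
  FROM tab t
  WHERE "delete" IS NULL
    AND "date" IS NOT NULL
    AND amount IS NOT NULL
) ranked
WHERE rn = 1
ORDER BY id
"""

first_rows = conn.execute(FIRST_SQL).fetchall()
for record in first_rows:
    print(record)
```

For id 1 the P row wins over the H row even though its date is later, because status is the leading sort key; for id 2 the F row is skipped for its missing date, so U comes first.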

SQL Server : UNION ALL but remove duplicate IDs by choosing first date of occurrence

I am unioning two queries but I'm getting an ID that occurs in each query. I do not know how to keep only the first time the id occurs. Everything else about the row is different. In general, it will be hard to know which of the two queries I will have to keep a duplicate on, therefore, I need a general solution.
I was thinking about creating a temp table and choosing the min date (once the date has been converted to an int).
Any ideas on the proper syntax?

You can do this using the row_number() function. This will assign a sequential number, starting with 1, to each row with the same id (based on the partition by clause). The ordering of the sequence is determined by the order by clause. So, the following assigns 1 to the earliest date for each id:
select t.*
from (select t.*,
row_number() over (partition by id order by date asc) as seqnum
from ((select *
from <subquery1>
) union all
(select *
from <subquery2>
)
) t
) t
where seqnum = 1;
The final where clause simply filters for the first occurrence.

If you use the keyword UNION, then it will remove duplicates from the two data sets you are working with. UNION ALL preserves duplicates.
You can view the specifics here:
http://www.w3schools.com/sql/sql_union.asp

If you want to keep only one of the two records and they are not identical, you will have to filter them yourself. You may need to do something like the following. This may be possible with a single (select union select) block, but this should get you started.
select *
from (
select id
, date
, otherstuf
from table_1
union all
select id
, date
, otherstuf
from table_2
) x1
, (
select id
, date
, otherstuf
from table_1
union all
select id
, date
, otherstuf
from table_2
) x2
where x1.id = x2.id
and x1.date < x2.date
Then again, if you go down a path like this, why bother to UNION it at all?

A query calls two instances of the same tables joined to compare fields, gives mirrored results. How do I eliminate mirrored duplicates?

This is a simpler version of the query I have.
with Alias1 as
(select distinct ID, file_tag, status, creation_date from tables where creation_date >= sysdate and creation_date <= sysdate + 1),
Alias2 as
(select distinct ID, file_tag, status, creation_date from same tables where creation_date >= sysdate and creation_date <= sysdate + 1)
select distinct Alias1.ID ID_1,
Alias2.ID ID_2,
Alias1.file_tag,
Alias1.creation_date in_dt1,
Alias2.creation_date in_dt2
from Alias1, Alias2
where Alias1.file_tag = Alias2.file_tag
and Alias1.ID != Alias2.ID
order by Alias1.creation_date desc
This is an example of the results. Both of these are the same, though their values are flipped.
ID_1 ID_2 File_Tag in_dt1 in_dt2
70 66 Apples 6/25/2012 3:06 6/25/2012 2:53:47 PM
66 70 Apples 6/25/2012 2:53 6/25/2012 3:06:18 PM
The goal of the query is to find more than one ID with a matching file tag and do stuff to the one submitted earlier in the day (the query runs daily and only needs duplicates from that given day). I am still relatively new to SQL/Oracle and wonder if there's a better way to approach this problem.

SELECT *
FROM (SELECT id, file_tag, creation_date in_dt
, row_number() OVER (PARTITION BY file_tag
ORDER BY creation_date) rn
, count(*) OVER (PARTITION BY file_tag) ct
FROM tables
WHERE creation_date >= TRUNC(SYSDATE)) tbls
WHERE rn = 1
AND ct > 1;
This should get you the first (earliest) row within each file_tag having at least 2 records today.
The inner select calculates the relative row numbers of each set of identical file_tag records by creation date. The outer select retrieves the first one in each partition.
This assumes from your goal statement that you want to do something with the earliest single row for each file_tag. The inner query only returns rows with a creation_date of sometime on the current day.

Here is an easy way, just by changing your comparison operation:
select distinct Alias1.ID ID_1, Alias2.ID ID_2, Alias1.file_tag,
Alias1.creation_date in_dt1, Alias2.creation_date in_dt2
from Alias1 join
Alias2
on Alias1.file_tag = Alias2.file_tag and
Alias1.ID < Alias2.ID
order by Alias1.creation_date desc
Replacing the not-equals with less-than orders the two IDs so that the smaller one is always first. This eliminates the mirrored duplicates. Note: I also fixed the join syntax.
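The effect of the operator change can be checked on a tiny example. This is a sketch only: it uses Python's sqlite3 module rather than Oracle, a hypothetical table named files in place of the original tables, and invented rows built around the IDs 66 and 70 from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER, file_tag TEXT);
INSERT INTO files VALUES (66, 'Apples'), (70, 'Apples'), (71, 'Pears');
""")

# Self-join on file_tag; {op} is the comparison between the two IDs.
PAIR_SQL = """
SELECT a.id AS id_1, b.id AS id_2, a.file_tag
FROM files a
JOIN files b
  ON a.file_tag = b.file_tag
 AND a.id {op} b.id
ORDER BY a.id
"""

mirrored = conn.execute(PAIR_SQL.format(op="!=")).fetchall()
unique_pairs = conn.execute(PAIR_SQL.format(op="<")).fetchall()
print(mirrored)
print(unique_pairs)
```

With != every pair shows up twice, once in each direction; with < only the direction with the smaller ID on the left survives.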
