Two Table Comparison in HIVE

I have two different sets of tables. I want to compare the total counts in both and display whether the counts match: 'Pass' if they match, otherwise 'Fail'.
SELECT (SELECT COUNT (*)
FROM Table1 t1
INNER JOIN Table2 t2
ON TRIM (t1.mgac_ac_id) = TRIM (t2.account))
AS cnt1,
(SELECT COUNT (*) FROM t3) AS cnt2 where cnt1=cnt2;
The code shown above is incorrect. Could anyone help with the code? Do I need to create any variables in Hive?

This is simple to do. Compute each count in its own subquery and compare them, like below:
select
case when tmp1.value = tmp2.value then 'Pass' else 'Fail' end as result
from
(select count(1) as value from table1) tmp1
join
(select count(1) as value from table2) tmp2 on 1=1
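A minimal sketch of the same Pass/Fail comparison, run against an in-memory SQLite database instead of Hive (an assumption made only so the example is self-contained). The helper name `compare_counts`, the single `id` column and the sample rows are hypothetical.

```python
import sqlite3

def compare_counts(rows1, rows2):
    """Return 'Pass' when both tables hold the same number of rows, else 'Fail'."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in for Hive (assumption)
    conn.execute("CREATE TABLE table1 (id INTEGER)")
    conn.execute("CREATE TABLE table2 (id INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?)", [(r,) for r in rows1])
    conn.executemany("INSERT INTO table2 VALUES (?)", [(r,) for r in rows2])
    # Each subquery yields exactly one row, so the join yields one row too.
    query = """
        SELECT CASE WHEN tmp1.cnt = tmp2.cnt THEN 'Pass' ELSE 'Fail' END AS result
        FROM (SELECT COUNT(1) AS cnt FROM table1) tmp1
        JOIN (SELECT COUNT(1) AS cnt FROM table2) tmp2 ON 1 = 1
    """
    result = conn.execute(query).fetchone()[0]
    conn.close()
    return result

print(compare_counts([1, 2, 3], [7, 8, 9]))
print(compare_counts([1, 2], [1]))
```

Only the row counts are compared here; the row contents play no part in the result.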

Related

How to loop select statement based on the previous select statement?

I need all the data to be on one line per record.
I need to create a report, and for that I need to summarize everything.
So I wrote this select statement first.
Query1
Select t1.scn,t1.vsl_name, t1.act_arr_dt_tm, t1.act_dept_dt_tm, t1.del_remarks
from vesvoy t1
Based on the select statement above, I need to take every t1.scn and run the SQL below for each one.
Query2
Select t1.scn,t2.void_flg,
MAX(case when t2.inv_num like 'VI%' then t2.inv_num end) as Vessel,
MAX(case when t2.inv_num like 'VI%' then t2.inv_amt end) as Vessel_amt
from pbosinvoiceitem t1
inner join pbosinvoice t2 ON t2.id = t1.master_id
inner join pbosinvtype t4 ON t4.code = t2.inv_type
group by t1.scn,t2.void_flg
so that I can get the result the way it appears in the report. I have tried creating a temp table, but the data I get back is all duplicated.
I also tried combining both queries, but the result still contains duplicates.
I'm sitting here thinking: "What does the first query have to do with the second?" Both are on the same table, vesvoy, so what could the question be. The second is processing the same rows as the first.
I suspect the issue is that the joins are losing rows. So, I suspect that the answer is to use left join, rather than inner join. Along the way, get rid of the select distinct. This is generally a bad idea. In combination with a group by on the same non-aggregated columns, it just shows a lack of awareness of SQL.
So, does this address your concern?
SELECT t1.scn,
MAX(case when t3.inv_num like 'VI%' then t3.inv_num end) as Vessel,
MAX(case when t3.inv_num like 'VI%' then t3.inv_amt end) as Vessel_amt
FROM vesvoy t1 LEFT JOIN
pbosinvoiceitem t2
ON t2.scn = t1.scn LEFT JOIN
pbosinvoice t3
ON t3.id = t2.master_id
GROUP BY t1.scn;
This will return NULL for the non-matching rows.

SQL to sum a total of a column where 2 columns match in a different table

So I have 2 tables and would like to SUM the total of a column in one table where 2 other columns match in another table.
In table1 I have acc_ref and bill_no.
acc_ref is different but bill_no could be 1-10 (so 2 or more acc_ref could have the same bill_no)
In table2 I have acc_ref, bill_no and tran_amnt.
Each acc_ref has multiple rows and I want to SUM the tran_amnt but only if acc_ref and bill_no both match in table1.
I tried this but I get an error
'The columns in the SELECT clause must be contained in the GROUP BY
clause'
select a.acc_ref, a.bill_no
from table1 a
where exists (select acc_ref, bill_no, SUM (tran_amount)
from table2 b
where a.acc_ref = b.acc_ref
and a.bill_no = b.bill_no
group by acc_ref)
Apologies if this is very basic and obvious but I'm struggling!!
From your description it seems that table1 does not contain any useful information, because both columns you mention also exist in table2. So if nothing else from table1 is needed, you could just remove table1 from the query. Still, for your problem you should do a simple join:
SELECT a.acc_ref, a.bill_no, SUM(b.tran_amount)
FROM table1 a
JOIN table2 b ON b.acc_ref = a.acc_ref AND b.bill_no=a.bill_no
GROUP BY a.acc_ref, a.bill_no
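A minimal sketch of this join-and-sum, run on an in-memory SQLite database (an assumption; the question does not name its engine). The helper name `sum_matching` and the sample rows are hypothetical, and the column is called `tran_amount` as in the query above.

```python
import sqlite3

def sum_matching(table1_rows, table2_rows):
    """Sum tran_amount per (acc_ref, bill_no) pair that exists in both tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (acc_ref TEXT, bill_no INTEGER)")
    conn.execute("CREATE TABLE table2 (acc_ref TEXT, bill_no INTEGER, tran_amount INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", table1_rows)
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)", table2_rows)
    # Every selected non-aggregated column also appears in GROUP BY,
    # which is what the error message in the question was asking for.
    rows = conn.execute("""
        SELECT a.acc_ref, a.bill_no, SUM(b.tran_amount)
        FROM table1 a
        JOIN table2 b ON b.acc_ref = a.acc_ref AND b.bill_no = a.bill_no
        GROUP BY a.acc_ref, a.bill_no
        ORDER BY a.acc_ref, a.bill_no
    """).fetchall()
    conn.close()
    return rows

totals = sum_matching(
    [("A", 1), ("B", 2)],
    [("A", 1, 10), ("A", 1, 5), ("A", 2, 99), ("B", 2, 7)],
)
print(totals)
```

The row `("A", 2, 99)` is left out of the totals because that acc_ref and bill_no pair has no match in table1.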
I believe you should use case:
Sample:
SELECT
TABLE1.Id,
CASE WHEN EXISTS (SELECT Id FROM TABLE2 WHERE TABLE2.ID = TABLE1.ID)
THEN 'TRUE'
ELSE 'FALSE'
END AS NewField
FROM TABLE1

Setting a column value in the SELECT Statement based on a value existing in another table

I have 2 tables. One table lists all the records of items we track. The other table contains flags of attributes of the records in the first table.
For example, Table 1 has columns
Tab1ID, Name, Address, Phone
Table 2 has these columns
Tab2ID, Tab1ID, FlagName
There is a 1 to Many relationship between Table1 and Table2 linked by Tab1ID.
I'd like to create a query that has all the records from Table1 in it. However, if one of the records in Table2 has a Flagname=Retired (with a matching Tab1ID) then I want a "Y" to show up in the select column list otherwise an "N".
I think it might look something like this:
Select Name, Address, Phone, (select something in table2)
from Table1
where Tab1ID > 1;
It's the subquery in the column that has me stumped.
Pat
You can use exists:
Select t1.*,
(case when exists (select 1
from table2 t2
where t2.tab1id = t1.tab1id and t2.flagname = 'Retired'
)
then 'Y' else 'N'
end) as retired_flag
from Table1 t1;
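A minimal sketch of the EXISTS approach, run on an in-memory SQLite database (an assumption; the question does not name its engine). The helper name `retired_flags` and the sample rows are hypothetical, and only the `Name` column is selected to keep the output short.

```python
import sqlite3

def retired_flags(table1_rows, table2_rows):
    """Return (name, 'Y' or 'N') for every Table1 row, 'Y' if any 'Retired' flag exists."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (tab1id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE table2 (tab2id INTEGER, tab1id INTEGER, flagname TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", table1_rows)
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)", table2_rows)
    # EXISTS is true or false once per outer row, so a record with
    # several flags still produces exactly one output row.
    rows = conn.execute("""
        SELECT t1.name,
               CASE WHEN EXISTS (SELECT 1
                                 FROM table2 t2
                                 WHERE t2.tab1id = t1.tab1id
                                   AND t2.flagname = 'Retired')
                    THEN 'Y' ELSE 'N'
               END AS retired_flag
        FROM table1 t1
        ORDER BY t1.tab1id
    """).fetchall()
    conn.close()
    return rows

flags = retired_flags(
    [(1, "Ann"), (2, "Bob"), (3, "Cy")],
    [(10, 1, "Retired"), (11, 1, "VIP"), (12, 2, "VIP")],
)
print(flags)
```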
I would do a normal join returning multiple records, but convert them to bits with case statements. Then use that as the subquery and pull the max value for each bit column.
select
name
,address
,phone
,max(retired_flag)
from (
select
table1.name
,table1.address
,table1.phone
,case when table2.flagname = 'retired' then 1 else 0 end as [retired_flag]
from table1
left join table2
on table1.tab1id = table2.tab1id
where table1.tab1id > 1
) tbl
group by
name
,address
,phone

Comparing two tables for equality in HIVE

I have two tables, table1 and table2. Each with the same columns:
key, c1, c2, c3
I want to check to see if these tables are equal to each other (they have the same rows). So far I have these two queries (<> = not equal in HIVE):
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key
where t2.key is null or t1.c1<>t2.c1 or t1.c2<>t2.c2 or t1.c3<>t2.c3
And
select count(*) from table1 t1
left outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t2.key is null
So my idea is that, if a zero count is returned, the tables are the same. However, I'm getting a zero count for the first query, and a non-zero count for the second query. How exactly do they differ? If there is a better way to check this certainly let me know.
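The first one excludes rows where t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, or t2.c3 is null, because a <> comparison with NULL evaluates to NULL rather than true. That means a matched row whose columns differ only through a NULL is never counted.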
The second one will find rows that exist in t1 but not in t2.
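A minimal sketch of why the two queries from the question can disagree, run on an in-memory SQLite database as a stand-in for Hive (an assumption; the NULL comparison rules used here are standard SQL). The column `k` stands in for `key`, and the sample rows are hypothetical: key 2 has a NULL c1 in table1 and 'x' in table2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (k INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    CREATE TABLE table2 (k INTEGER, c1 TEXT, c2 TEXT, c3 TEXT);
    INSERT INTO table1 VALUES (1, 'a', 'b', 'c'), (2, NULL, 'b', 'c');
    INSERT INTO table2 VALUES (1, 'a', 'b', 'c'), (2, 'x', 'b', 'c');
""")

# Query 1: rows are joined on the key alone, then filtered. For key 2,
# NULL <> 'x' evaluates to NULL, so the WHERE clause rejects the row.
first_count = conn.execute("""
    SELECT COUNT(*) FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.k = t2.k
    WHERE t2.k IS NULL OR t1.c1 <> t2.c1 OR t1.c2 <> t2.c2 OR t1.c3 <> t2.c3
""").fetchone()[0]

# Query 2: every column is part of the join condition. For key 2,
# NULL = 'x' is not true, so no match is found and t2.k comes back NULL.
second_count = conn.execute("""
    SELECT COUNT(*) FROM table1 t1
    LEFT OUTER JOIN table2 t2
      ON t1.k = t2.k AND t1.c1 = t2.c1 AND t1.c2 = t2.c2 AND t1.c3 = t2.c3
    WHERE t2.k IS NULL
""").fetchone()[0]

print(first_count, second_count)
conn.close()
```

With this data the first query reports no difference while the second reports one, which matches the behaviour described in the question.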
To also find rows that exist in t2 but not in t1 you can do a full outer join. The following SQL assumes that all columns are NOT NULL:
select count(*) from table1 t1
full outer join table2 t2
on t1.key=t2.key and t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3
where t1.key is null /* this condition matches rows that only exist in t2 */
or t2.key is null /* this condition matches rows that only exist in t1 */
If you want to check for duplicates and the tables have exactly the same structure and the tables do not have duplicates within them, then you can do:
select t.key, t.c1, t.c2, t.c3, count(*) as cnt
from ((select t1.*, 1 as which from table1 t1) union all
(select t2.*, 2 as which from table2 t2)
) t
group by t.key, t.c1, t.c2, t.c3
having cnt <> 2;
There are various ways that you can relax the conditions in the first paragraph, if necessary.
Note that this version also works when the columns have NULL values. These might be causing the problem with your data.
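A minimal sketch of the UNION ALL plus GROUP BY check, run on an in-memory SQLite database as a stand-in for Hive (an assumption). The helper name `diff_rows`, the column `k` (standing in for `key`) and the sample rows are hypothetical. The `which` marker column is left out because it is not needed for the count.

```python
import sqlite3

def diff_rows(rows1, rows2):
    """Return the rows (with their counts) that do not appear exactly twice overall."""
    conn = sqlite3.connect(":memory:")
    for name in ("table1", "table2"):
        conn.execute(f"CREATE TABLE {name} (k INTEGER, c1 TEXT, c2 TEXT, c3 TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?)", rows2)
    # GROUP BY puts NULLs in the same group, so rows that are identical
    # including their NULL columns still pair up and reach a count of 2.
    rows = conn.execute("""
        SELECT t.k, t.c1, t.c2, t.c3, COUNT(*) AS cnt
        FROM (SELECT k, c1, c2, c3 FROM table1
              UNION ALL
              SELECT k, c1, c2, c3 FROM table2) t
        GROUP BY t.k, t.c1, t.c2, t.c3
        HAVING COUNT(*) <> 2
    """).fetchall()
    conn.close()
    return rows

same = diff_rows([(1, "a", None, "c")], [(1, "a", None, "c")])
changed = diff_rows(
    [(1, "a", None, "c"), (2, "x", "y", "z")],
    [(1, "a", None, "c"), (2, "x", "y", "CHANGED")],
)
print(same, sorted(changed))
```

An empty result means the tables hold the same rows, under the stated condition that neither table contains duplicates.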
Well, the best way is to calculate the hash sum of each table and compare the sums.
No matter how many columns there are or what data types they have, as long as the two tables have the same schema, you can use the following queries to do the comparison:
select sum(hash(*)) from t1;
select sum(hash(*)) from t2;
And you just need to compare the return values.
I used the EXCEPT statement and it worked.
select * from Original_table
EXCEPT
select * from Revised_table
This will show all the rows of the Original table that are not in the Revised table.
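A minimal sketch of the EXCEPT comparison, run on an in-memory SQLite database as a stand-in (an assumption; SQLite also supports EXCEPT). The helper name `only_in_original`, the two-column schema and the sample rows are hypothetical.

```python
import sqlite3

def only_in_original(original_rows, revised_rows):
    """Return rows present in Original_table but absent from Revised_table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Original_table (id INTEGER, val TEXT)")
    conn.execute("CREATE TABLE Revised_table (id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO Original_table VALUES (?, ?)", original_rows)
    conn.executemany("INSERT INTO Revised_table VALUES (?, ?)", revised_rows)
    rows = conn.execute("""
        SELECT * FROM Original_table
        EXCEPT
        SELECT * FROM Revised_table
    """).fetchall()
    conn.close()
    return rows

missing = only_in_original(
    [(1, "a"), (2, "b"), (3, "c")],
    [(1, "a"), (3, "x")],
)
print(sorted(missing))
```

EXCEPT only looks in one direction, so a full equality check would run it a second time with the two tables swapped.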
If your table is partitioned you will have to provide a partition predicate.
Fyi, partition values don't need to be provided if you use Presto and querying via SQL lab.
I would recommend not using any JOINs to compare tables:
it is quite an expensive operation when tables are big (which is often the case in Hive)
it can cause problems when some rows/"join keys" are repeated
(and it can also be impractical when the data is in different clusters/datacenters/clouds).
Instead, I think using a checksum approach and comparing the checksums of both tables is best.
I have developed a Python script that lets you do such a comparison easily and see the differences in a web browser:
https://github.com/bolcom/hive_compared_bq
I hope that can help you!
1: First get the count for both tables, C1 and C2. C1 and C2 should be equal. Each can be obtained with the following query, run once per table:
select count(*) from table1
If C1 and C2 are not equal, then the tables are not identical.
2: Find the distinct count for both tables, DC1 and DC2. DC1 and DC2 should be equal. The number of distinct records can be found using the following query:
select count(*) from (select distinct * from table1) t
If DC1 and DC2 are not equal, the tables are not identical.
3: Now get the number of records obtained by performing a union on the 2 tables. Let it be U. Use the following query to get the number of records in a union of the 2 tables:
SELECT count (*)
FROM
(SELECT *
FROM table1
UNION
SELECT *
FROM table2) u
You can say that the data in the 2 tables is identical if the distinct count for each table is equal to the number of records obtained by performing a union of the 2 tables, i.e. DC1 = U and DC2 = U.
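The three steps above can be sketched as a single helper, run here on an in-memory SQLite database as a stand-in for Hive (an assumption). The name `tables_identical`, the two-column schema and the sample rows are hypothetical.

```python
import sqlite3

def tables_identical(rows1, rows2):
    """Apply the three checks: C1 = C2, then DC1 = U and DC2 = U."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER, val TEXT)")
    conn.execute("CREATE TABLE table2 (id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?)", rows2)

    def scalar(sql):
        # Run a query that returns a single number.
        return conn.execute(sql).fetchone()[0]

    c1 = scalar("SELECT COUNT(*) FROM table1")
    c2 = scalar("SELECT COUNT(*) FROM table2")
    dc1 = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table1) t")
    dc2 = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM table2) t")
    # UNION (without ALL) removes duplicates across both tables.
    u = scalar("""
        SELECT COUNT(*)
        FROM (SELECT * FROM table1 UNION SELECT * FROM table2) u
    """)
    conn.close()
    return c1 == c2 and dc1 == u and dc2 == u

print(tables_identical([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))
print(tables_identical([(1, "a"), (2, "b")], [(1, "a"), (2, "x")]))
```

These checks compare totals and distinct rows but not how often each row repeats: two tables that contain the same distinct rows with different repeat counts, yet the same total, still pass.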
Another variant:
select c1-c2 "different row counts"
, c1-c3 "mismatched rows"
from
( select count(*) c1 from table1)
,( select count(*) c2 from table2 )
,(select count(*) c3 from table1 t1, table2 t2
where t1.key= t2.key
and T1.c1=T2.c1 )
Try with WITH Clause:
With cnt as(
select count(*) cn1 from table1
)
select 'X' from dual,cnt where cnt.cn1 = (select count(*) from table2);
One easy solution is to do an inner join. Suppose we have two Hive tables, table1 and table2. Both tables have the same columns, col1, col2 and col3, and the number of rows should also be the same. Then the command would be as follows:
select count(*) from table1
inner join table2
on table1.col1 = table2.col1
and table1.col2 = table2.col2
and table1.col3 = table2.col3 ;
If the output value is the same as the number of rows in table1 and table2, then all the columns have the same values. If the output count is lower, then some of the data differs.
Use a MINUS operator:
SELECT count(*) FROM
(SELECT t1.c1, t1.c2, t1.c3 from table1 t1
MINUS
SELECT t2.c1, t2.c2, t2.c3 from table2 t2)

two SQL COUNT() queries?

I want to count both the total # of records in a table, and the total # of records that match certain conditions. I can do these with two separate queries:
SELECT COUNT(*) AS TotalCount FROM MyTable;
SELECT COUNT(*) AS QualifiedCount FROM MyTable
{possible JOIN(s) as well e.g. JOIN MyOtherTable mot ON MyTable.id=mot.id}
WHERE {conditions};
Is there a way to combine these into one query so that I get two fields in one row?
SELECT {something} AS TotalCount,
{something else} AS QualifiedCount
FROM MyTable {possible JOIN(s)} WHERE {some conditions}
If not, I can issue two queries and wrap them in a transaction so they are consistent, but I was hoping to do it with one.
edit: I'm most concerned about atomicity; if there are two sub-SELECT statements needed that's OK as long as if there's an INSERT coming from somewhere it doesn't make the two responses inconsistent.
edit 2: The CASE answers are helpful but in my specific instance, the conditions may include a JOIN with another table (forgot to mention that in my original post, sorry) so I'm guessing that approach won't work.
One way is to join the table against itself:
select
count(*) as TotalCount,
count(s.id) as QualifiedCount
from
MyTable a
left join
MyTable s on s.id = a.id and {some conditions}
Another way is to use subqueries:
select
(select count(*) from Mytable) as TotalCount,
(select count(*) from Mytable where {some conditions}) as QualifiedCount
Or you can put the conditions in a case:
select
count(*) as TotalCount,
sum(case when {some conditions} then 1 else 0 end) as QualifiedCount
from
MyTable
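A minimal sketch of the CASE variant, run on an in-memory SQLite database (an assumption; the question does not name its engine). The helper name `total_and_qualified`, the `amount` column and the condition `amount > 10` are hypothetical stand-ins for `{some conditions}`.

```python
import sqlite3

def total_and_qualified(amounts):
    """Return (TotalCount, QualifiedCount) from a single pass over MyTable."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO MyTable (amount) VALUES (?)", [(a,) for a in amounts])
    # One statement reads the table once, so both numbers describe the
    # same set of rows.
    row = conn.execute("""
        SELECT COUNT(*) AS TotalCount,
               SUM(CASE WHEN amount > 10 THEN 1 ELSE 0 END) AS QualifiedCount
        FROM MyTable
    """).fetchone()
    conn.close()
    return row

counts = total_and_qualified([5, 15, 20])
print(counts)
```

On an empty table SUM returns NULL rather than 0, so wrapping it in COALESCE(..., 0) may be worth considering.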
Related:
SQL Combining several SELECT results
In SQL Server or MySQL, you can do that with a CASE expression:
select
count(*) as TotalCount,
sum(case when {conditions} then 1 else 0 end) as QualifiedCount
from MyTable
Edit: This also works if you use a JOIN in the condition:
select
count(*) as TotalCount,
sum(case when {conditions} then 1 else 0 end) as QualifiedCount
from MyTable t
left join MyChair c on c.TableId = t.Id
group by t.id, t.[othercolums]
The GROUP BY is there to ensure you only find one row from the main table.
If you are just counting rows, you could use nested queries:
select
(SELECT COUNT(*) AS TotalCount FROM MyTable) as a,
(SELECT COUNT(*) AS QualifiedCount FROM MyTable WHERE {conditions}) as b
In Oracle SQL Developer I had to add a * FROM to my select, or else I got a syntax error:
select * FROM
(select COUNT(*) as foo FROM TABLE1),
(select COUNT(*) as boo FROM TABLE2);
MySQL doesn't count NULLs, so this should work too:
SELECT count(*) AS TotalCount,
count( if( field = value, field, null)) AS QualifiedCount
FROM MyTable {possible JOIN(s)} WHERE {some conditions}
That works well if the QualifiedCount field comes from a LEFT JOIN, and you only care whether it exists. To get the number of users, and the number of users that have filled in their address:
SELECT count(Users.id) AS NumUsers, count(Address.id) AS NumAddresses
FROM Users
LEFT JOIN Address ON Users.address_id = Address.id;