Compare two hive tables with same columns but different value in data - hive

A database has two tables table1 and table2.the structure is same for both the tables,
table1
name score1 score2 scorefinal
a 0 1 0
table2
name score1 score2 scorefinal
a 0 0 0
i want to compare2 tables and check if value of table2.score=(table1.score+2)
I want the Hive query to check that
enter code here
select count(*)
from test.table2 a,
test.table1 b
where a.score1='2' and a.score2='0'
and b.score1='2' and b.score2='0'
and a.scorefinal=(b.scorefinal+2);
i want to use in a better way for comparison using joins if possible

select * from
(select *,
case when a.score1 = b.score1 or a.score2 = b.score2 or a.scorefinal=(b.scorefinal+2)
then 1
else 0
end as value -- giving an OR condition here which means if any of one is true then 1, if you need the condition to be true for all the cases you need to keep AND
from
test.table2 a,
test.table1 b
) q
where value = 1

Related

Compare two columns in SQL

I'm new to SQL and have very basic queries in GCP.
Let's consider this table below:
Name
B
C
Arun
1234-5678
1234
Tara
6789 - 7654
6789
Arun
4567
4324
Here, I want to compare column B and C and if they match then give 1 else 0 in column same and else different (which we have to create).
So here the catch:
if column B has 1234-5678 and column C has 1234, then the column should match considering only the number before the "-" in the value.
The output should be :
Name
B
C
same
different
Arun
1234-5678
1234
1
0
Tara
6789 - 7654
6789
1
0
Arun
4567
4324
0
1
Also, I want to count the values of 1 for each values in Name for same and different columns.
So far I've tried this:
SELECT
name,
b,
c ,
if(b = c, 1, 0) as same,
if (b!=c,1,0) as different,
count(same),
count(different)
From Table
using "MySQL" (will work almost same with SQL server as well) here's the possible solution.
Step 1) Setup table
CREATE TABLE Users (
Name varchar(50),
B varchar(50),
C varchar(50)
);
INSERT INTO Users
VALUES
('Arun', '1234-5678', '1234'),
('Tara', '6789-7654', '6789'),
('Arun', '4567', '4324');
Step 2) same & different columns
SELECT
Name, B, C,
CASE WHEN SUBSTRING_INDEX(B, "-", 1) = C THEN 1 ELSE 0 END as same,
CASE WHEN SUBSTRING_INDEX(B, "-", 1) <> C THEN 1 ELSE 0 END as different
FROM
Users
Step 3) Join both results to get total_same & total_different for each user
SELECT
Name,
SUM(CASE WHEN SUBSTRING_INDEX(B, "-", 1) = C THEN 1 ELSE 0 END) as total_same,
SUM(CASE WHEN SUBSTRING_INDEX(B, "-", 1) <> C THEN 1 ELSE 0 END) as total_different
FROM
Users
GROUP BY Name
Reference: SQL Fiddle
For the first step, you will need to SUBSTR the column b.
We start at position 1 and we want 4 characters (only works if there's only 4 characters before the '-').
With table2 as (
select name, b,c, same, different from (select name, b, c, case when (SUBSTR(b,1,4) = c)
then '1' else '0' end as same, case when(SUBSTR(b,1,4)!= c) then '1' else '0' end as different
from Table1
group by name, b,c))
The WITH clause can be used when you have complex query, and if you want to create a temporary table in order to use it after.
The Table2 give you this :
After the WITH clause, you will have the second step, the count of same / different per name :
Select table1.name,count(table2.same+table2.different) as total from table1
join table2 on (table2.name = table1.name and table2.b = table1.b)
group by table1.name;
The output give you the total per name (the name are group by, so in your example you will only have 2 rows, one for Arun with a total of 2 (same + different) and the other one with a total of 1)
So here's the entire code :
with table2 as (
select name, b,c, same, different from (select name, b, c, case when (SUBSTR(b,1,4) = c) then '1' else '0' end as same, case when(SUBSTR(b,1,4)!= c) then '1' else '0' end as different
From Table1
group by name, b,c))
select table1.name, table1.b, table1.c, count(table2.same+table2.different) as total from table1
join table2 on (table2.name = table1.name and table2.b = table1.b)
group by table1.name;

How do I check if a certain value exists?

I have a historization table called CUR_VALID. This table looks something like this:
ID CUR_VALID
1 N
1 N
1 Y
2 N
2 Y
3 Y
For every ID there needs to be one Y. If there is no Y or multiple Y there is something wrong. The statment for checking if there are multiple Y I already got. Now I only need to check for every ID if there is one Y existing. Im just not sure how to do that. This is what I have so far. So how do I check if the Value 'Y' exists?
SELECT Count(1) [Number of N]
,MAX(CUR_VALID = 'N')
,[BILL_ID]
,[BILL_MONTH]
,[BILL_SRC_ID]
FROM db.dbo.table
GROUP BY [BILL_ID]
,[BILL_MONTH]
,[BILL_SRC_ID]
Having MAX(CUR_VALID = 'N') > 1
Why are you fiddling with 'N' when you are interested in 'Y'?
Use conditional aggregation to get the count of the value your are interested in.
SELECT
COUNT(*) AS number_of_all,
COUNT(CASE WHEN cur_valid = 'Y' THEN 1 END) AS number_of_y,
COUNT(CASE WHEN cur_valid = 'N' THEN 1 END) AS number_of_n,
bill_id,
bill_month,
bill_src_id,
FROM db.dbo.table
GROUP BY bill_id, bill_month, bill_src_id;
Add a HAVING clause in order to get only valid
HAVING COUNT(CASE WHEN cur_valid = 'Y' THEN 1 END) = 1
or invalid
HAVING COUNT(CASE WHEN cur_valid = 'Y' THEN 1 END) <> 1
bills.
The following query will give you the list of id for which your integrity condition is not met: For every ID there needs to be one Y. If there is no Y or multiple Y there is something wrong.
select T1.id from table T1 where (select count(*) from table T2 where T2.id=T1.id and T2.CUR_VALID='Y')!=1
This query returns both not having at least one 'Y' value and more than one 'Y' value ID's.
First, sum up the Y values and relate to each id, then select not 1 ones from that table.
select * from (
select ID, SUM(case when CUR_VALID = 'Y' then 1 else 0 end) as CNT
from table
group by ID
) b where b.CNT <> 1
DBFiddle
As I understand, you want to get all the id for which your integrity check passes. And integrity check for you means, there is only one row with CUR_VALID value equal to Y in the CUR_VALID table.
This can be achieved by a group by clause:
select id from CUR_VALID
where CUR_VALID.CUR_VALID = 'Y'
group by id
having count(CUR_VALID.CUR_VALID) = 1;

How do I determine if a group of data exists in a table, given the data that should appear in the group's rows?

I am writing data to a table and allocating a "group-id" for each batch of data that is written. To illustrate, consider the following table.
GroupId Value
------- -----
1 a
1 b
1 c
2 a
2 b
3 a
3 b
3 c
3 d
In this example, there are three groups of data, each with similar but varying values.
How do I query this table to find a group that contains a given set of values? For instance, if I query for (a,b,c) the result should be group 1. Similarly, a query for (b,a) should result in group 2, and a query for (a, b, c, e) should result in the empty set.
I can write a stored procedure that performs the following steps:
select distinct GroupId from Groups -- and store locally
for each distinct GroupId: perform a set-difference (except) between the input and table values (for the group), and vice versa
return the GroupId if both set-difference operations produced empty sets
This seems a bit excessive, and I hoping to leverage some other commands in SQL to simplify. Is there a simpler way to perform a set-comparison in this context, or to select the group ID that contains the exact input values for the query?
This is a set-within-sets query. I like to solve it using group by and having:
select groupid
from GroupValues gv
group by groupid
having sum(case when value = 'a' then 1 else 0 end) > 0 and
sum(case when value = 'b' then 1 else 0 end) > 0 and
sum(case when value = 'c' then 1 else 0 end) > 0 and
sum(case when value not in ('a', 'b', 'c') then 1 else - end) = 0;
The first three conditions in the having clause check that each elements exists. The last condition checks that there are no other values. This method is quite flexible, for various exclusions and inclusion conditions on the values you are looking for.
EDIT:
If you want to pass in a list, you can use:
with thelist as (
select 'a' as value union all
select 'b' union all
select 'c'
)
select groupid
from GroupValues gv left outer join
thelist
on gv.value = thelist.value
group by groupid
having count(distinct gv.value) = (select count(*) from thelist) and
count(distinct (case when gv.value = thelist.value then gv.value end)) = count(distinct gv.value);
Here the having clause counts the number of matching values and makes sure that this is the same size as the list.
EDIT:
query compile failed because missing the table alias. updated with right table alias.
This is kind of ugly, but it works. On larger datasets I'm not sure what performance would look like, but the nested instances of #GroupValues key off GroupID in the main table so I think as long as you have a good index on GroupID it probably wouldn't be too horrible.
If Object_ID('tempdb..#GroupValues') Is Not Null Drop Table #GroupValues
Create Table #GroupValues (GroupID Int, Val Varchar(10));
Insert #GroupValues (GroupID, Val)
Values (1,'a'),(1,'b'),(1,'c'),(2,'a'),(2,'b'),(3,'a'),(3,'b'),(3,'c'),(3,'d');
If Object_ID('tempdb..#FindValues') Is Not Null Drop Table #FindValues
Create Table #FindValues (Val Varchar(10));
Insert #FindValues (Val)
Values ('a'),('b'),('c');
Select Distinct gv.GroupID
From (Select Distinct GroupID
From #GroupValues) gv
Where Not Exists (Select 1
From #FindValues fv2
Where Not Exists (Select 1
From #GroupValues gv2
Where gv.GroupID = gv2.GroupID
And fv2.Val = gv2.Val))
And Not Exists (Select 1
From #GroupValues gv3
Where gv3.GroupID = gv.GroupID
And Not Exists (Select 1
From #FindValues fv3
Where gv3.Val = fv3.Val))

SQL (TSQL) - Select values in a column where another column is not null?

I will keep this simple- I would like to know if there is a good way to select all the values in a column when it never has a null in another column. For example.
A B
----- -----
1 7
2 7
NULL 7
4 9
1 9
2 9
From the above set I would just want 9 from B and not 7 because 7 has a NULL in A. Obviously I could wrap this as a subquery and USE the IN clause etc. but this is already part of a pretty unique set and am looking to keep this efficient.
I should note that for my purposes this would only be a one-way comparison... I would only be returning values in B and examining A.
I imagine there is an easy way to do this that I am missing, but being in the thick of things I don't see it right now.
You can do something like this:
select *
from t
where t.b not in (select b from t where a is null);
If you want only distinct b values, then you can do:
select b
from t
group by b
having sum(case when a is null then 1 else 0 end) = 0;
And, finally, you could use window functions:
select a, b
from (select t.*,
sum(case when a is null then 1 else 0 end) over (partition by b) as NullCnt
from t
) t
where NullCnt = 0;
The query below will only output one column in the final result. The records are grouped by column B and test if the record is null or not. When the record is null, the value for the group will increment each time by 1. The HAVING clause filters only the group which has a value of 0.
SELECT B
FROM TableName
GROUP BY B
HAVING SUM(CASE WHEN A IS NULL THEN 1 ELSE 0 END) = 0
If you want to get all the rows from the records, you can use join.
SELECT a.*
FROM TableName a
INNER JOIN
(
SELECT B
FROM TableName
GROUP BY B
HAVING SUM(CASE WHEN A IS NULL THEN 1 ELSE 0 END) = 0
) b ON a.b = b.b

How do I modify this query without increasing the number of rows returned?

I've got a sub-select in a query that looks something like this:
left outer join
(select distinct ID from OTHER_TABLE) as MYJOIN
on BASE_OBJECT.ID = MYJOIN.ID
It's pretty straightforward. Checks to see if a certain relation exists between the main object being queried for and the object represented by OTHER_TABLE by whether or not MYJOIN.ID is null on the row in question.
But now the requirements have changed a little. There's another row in OTHER_TABLE that can have a value of 1 or 0, and the query needs to know whether a relation exists between the primary for a 1-value, and also if it exists for a 0 value. The obvious solutions is to put:
left outer join
(select distinct ID, TYPE_VALUE from OTHER_TABLE) as MYJOIN
on BASE_OBJECT.ID = MYJOIN.ID
But that would be wrong because if 0-type and 1-type objects both exist for the same ID, it will increase the number of rows returned by the query, which isn't acceptable. So what I need is some sort of subselect that will return 1 row for each distinct ID, with a "1-type exists" column and a "0-type exists" column. And I have no idea how to code that in SQL.
For example, for the following table,
ID | TYPE_VALUE
_________________
1 | 1
3 | 0
3 | 1
4 | 0
I'd like to see a result set like this:
ID | HAS_TYPE_0 | HAS_TYPE_1
______________________________
1 | 0 | 1
3 | 1 | 1
4 | 1 | 0
Anyone know how I could set up a query to do this? Hopefully with a minimum of ugly hacks?
In the general case, you would use EXISTS:
SELECT DISTINCT ID,
CASE WHEN EXISTS (
SELECT * FROM Table1 y
WHERE y.TYPE_VALUE = 0 AND ID = x.ID)
THEN 1
ELSE 0 END AS HAS_TYPE_0,
CASE WHEN EXISTS (
SELECT * FROM Table1 y
WHERE y.TYPE_VALUE = 1 AND ID = x.ID)
THEN 1
ELSE 0 END AS HAS_TYPE_1
FROM Table1 x;
If you have a very large number of elements in the table, this won't perform so great - those nested subselects are often a kiss of death when it comes to performance.
For your specific case, you could also use GROUP BY and MAX() and MIN() to speed things up:
SELECT
ID,
CASE WHEN MIN(TYPE_VALUE) = 0 THEN '1' ELSE 0 END AS HAS_TYPE_0,
CASE WHEN MAX(TYPE_VALUE) = 1 THEN '1' ELSE 0 END AS HAS_TYPE_1
FROM Table1
GROUP BY ID;
Instead of select distinct ID, TYPE_VALUE from OTHER_TABLE
use
select ID,
MAX(CASE WHEN TYPE_VALUE =0 THEN 1 END) as has_type_0,
MAX(CASE WHEN TYPE_VALUE =1 THEN 1 END) as has_type_1
from OTHER_TABLE
GROUP BY ID;
You can do the same using PIVOT opearator...