selecting incremental data from multiple tables in Hive - sql

I have five tables (A, B, C, D, E) in a Hive database and I have to union the data from these tables based on logic over the column "id".
The condition is:
Select * from A
UNION
select * from B (only ids not in A)
UNION
select * from C (only ids not in A and B)
UNION
select * from D (only ids not in A, B and C)
UNION
select * from E (only ids not in A, B, C and D)
I have to insert this data into a final table.
One way is to create the target table (target), append the data for each UNION stage to it, and then use this table for the join in the next UNION stage.
This would be part of my .hql file:
insert into target
(select * from A
UNION
select B.* from
A
RIGHT OUTER JOIN B
on A.id=B.id
where ISNULL(A.id));
INSERT INTO target
select C.* from
target
RIGHT outer JOIN C
ON target.id=C.id
where ISNULL(target.id);
INSERT INTO target
select D.* from
target
RIGHT OUTER JOIN D
ON target.id=D.id
where ISNULL(target.id);
INSERT INTO target
select E.* from
target
RIGHT OUTER JOIN E
ON target.id=E.id
where ISNULL(target.id);
Is there a better way to make this happen? I assume we have to do the multiple joins/lookups anyway. I am looking for the best approach to achieve this in
1) Hive with Tez
2) Spark-sql
Many thanks in advance

If id is unique within each table, then row_number can be used instead of rank.
select *
from (select *
,rank () over
(
partition by id
order by src
) as rnk
from (
select 1 as src,* from a
union all select 2 as src,* from b
union all select 3 as src,* from c
union all select 4 as src,* from d
union all select 5 as src,* from e
) t
) t
where rnk = 1
;
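To see the ranking approach on concrete rows, here is a minimal sketch that runs the same idea through Python's built-in sqlite3 module instead of Hive or Spark. The three small tables and the `val` column are made up for illustration, and it assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, val TEXT);
    CREATE TABLE b (id INTEGER, val TEXT);
    CREATE TABLE c (id INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO b VALUES (2, 'b2'), (3, 'b3');
    INSERT INTO c VALUES (1, 'c1'), (3, 'c3'), (4, 'c4');
""")

# Tag every row with the priority of its source table, then keep only the
# highest-priority (lowest src) row for each id.
rows = conn.execute("""
    SELECT id, val, src
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY src) AS rn
        FROM (
            SELECT 1 AS src, id, val FROM a
            UNION ALL SELECT 2, id, val FROM b
            UNION ALL SELECT 3, id, val FROM c
        ) AS t
    ) AS ranked
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(rows)
```

Each id comes from the first table that contains it: ids 1 and 2 from a, id 3 from b, id 4 from c.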

I think I would try to do this as:
with ids as (
select id, min(which) as which
from (select id, 1 as which from a union all
select id, 2 as which from b union all
select id, 3 as which from c union all
select id, 4 as which from d union all
select id, 5 as which from e
) x
group by id
)
select a.*
from a join ids on a.id = ids.id and ids.which = 1
union all
select b.*
from b join ids on b.id = ids.id and ids.which = 2
union all
select c.*
from c join ids on c.id = ids.id and ids.which = 3
union all
select d.*
from d join ids on d.id = ids.id and ids.which = 4
union all
select e.*
from e join ids on e.id = ids.id and ids.which = 5;
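The same idea can be tried on a small scale with Python's sqlite3 module. This is only a sketch with three hypothetical tables and an invented `val` column, not Hive or Spark. Note the GROUP BY id in the CTE, which MIN needs in order to return one row per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, val TEXT);
    CREATE TABLE b (id INTEGER, val TEXT);
    CREATE TABLE c (id INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO b VALUES (2, 'b2'), (3, 'b3');
    INSERT INTO c VALUES (1, 'c1'), (3, 'c3'), (4, 'c4');
""")

# ids maps each id to the first table (lowest "which") that contains it;
# each branch then returns only the rows that its table "owns".
rows = conn.execute("""
    WITH ids AS (
        SELECT id, MIN(which) AS which
        FROM (
            SELECT id, 1 AS which FROM a UNION ALL
            SELECT id, 2 AS which FROM b UNION ALL
            SELECT id, 3 AS which FROM c
        ) AS x
        GROUP BY id
    )
    SELECT a.id, a.val FROM a JOIN ids ON a.id = ids.id AND ids.which = 1
    UNION ALL
    SELECT b.id, b.val FROM b JOIN ids ON b.id = ids.id AND ids.which = 2
    UNION ALL
    SELECT c.id, c.val FROM c JOIN ids ON c.id = ids.id AND ids.which = 3
    ORDER BY 1
""").fetchall()

print(rows)
```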

Related

select count(*) result in a LEFT OUTER JOIN is same after switching tables , I can't understand why

As I am learning left outer join, I came to this conclusion:
A left outer join B = every row in A, with the matching values from B alongside; rows of A that have no match in B get NULL on the B side.
So if A has 15 values and B has 29 values (5 in common), the result of the following query will be 15. Or if A has 15 values and B has 10 values (5 in common), the result will still be 15.
select count(*) from
A left outer join B
on A.name=B.name;
My problem:
I have a dvdrental database with a Customer table and a Payment table. They have 599 and 14,596 rows respectively.
When I run this query (I expected 14,596 and got 14,596):
select count(*) from
payment left outer join customer
on payment.customer_id=customer.customer_id;
But when I switched the tables (I expected 599 but got 14,596):
select count(*) from
customer left outer join payment
on payment.customer_id=customer.customer_id;
Why? I can't understand this. Please help.
It's just like an inner join, since there are no non-matches:
Note: Feel free to change the val column name to customer_id in the following. The result will be the same.
WITH cte1 (id, val) AS (SELECT 1, 100 UNION SELECT 2, 100)
, cte2 (id, val) AS (SELECT 10, 100)
SELECT COUNT(*)
FROM cte1 LEFT JOIN cte2 ON cte1.val = cte2.val
;
and
WITH cte1 (id, val) AS (SELECT 1, 100 UNION SELECT 2, 100)
, cte2 (id, val) AS (SELECT 10, 100)
SELECT COUNT(*)
FROM cte2 LEFT JOIN cte1 ON cte1.val = cte2.val
;
will produce the same count (2).
I think the real issue is which table you choose as the anchor (left) table, because the join keeps every row of that table and attaches the matching rows of the other table to it.
In the example below, customer 6 has no payment. When payment is the left table, the count is 5, because the result is based on the payment rows. When customer is the left table, the count is 6, because customer 6 is kept with NULLs on the payment side. So the result depends on which table you choose as the base.
-- customer has 6 rows, payment has 5
WITH
payment(customer_id,paid) AS (SELECT 1,100 UNION SELECT 2,200 UNION SELECT 3,300 UNION SELECT 4,400 UNION SELECT 5,500 )
,customer(customer_id) AS (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6)
SELECT COUNT(*) FROM payment py LEFT OUTER JOIN customer ct ON py.customer_id=ct.customer_id;
WITH
payment(customer_id,paid) AS (SELECT 1,100 UNION SELECT 2,200 UNION SELECT 3,300 UNION SELECT 4,400 UNION SELECT 5,500 )
,customer(customer_id) AS (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6)
SELECT COUNT(*) FROM customer ct LEFT OUTER JOIN payment py ON py.customer_id=ct.customer_id;
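The dvdrental numbers follow from the fact that a left join emits one row per match, not one row per left-table row. The sketch below uses Python's sqlite3 module with a made-up three-customer data set (not the real dvdrental tables) to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE payment (payment_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (1), (2), (3);
    -- customer 1 has three payments, customer 2 has one, customer 3 has none
    INSERT INTO payment VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# One row per payment: every payment has exactly one customer.
payment_first = scalar("""
    SELECT COUNT(*) FROM payment p
    LEFT JOIN customer c ON p.customer_id = c.customer_id
""")

# One row per (customer, payment) match, plus one NULL-padded row for
# each customer without payments: 3 + 1 + 1 = 5, not 3.
customer_first = scalar("""
    SELECT COUNT(*) FROM customer c
    LEFT JOIN payment p ON p.customer_id = c.customer_id
""")

# Counting distinct customers gives the number the asker expected.
distinct_customers = scalar("""
    SELECT COUNT(DISTINCT c.customer_id) FROM customer c
    LEFT JOIN payment p ON p.customer_id = c.customer_id
""")

print(payment_first, customer_first, distinct_customers)
```

If every customer has at least one payment, as the matching counts in the question suggest, both join orders return one row per payment, which is why both counts are 14,596.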

LISTAGG in SQL is returning a row with null values

I have 2 tables, A and B. B has a foreign key relationship with A, i.e. (b.detail_id = a.id).
I want to apply a LISTAGG query on one of the columns in B.
SELECT LISTAGG(DISTINCT b.delivery_cadence, ',') WITHIN GROUP (ORDER BY b.delivery_cadence)
delivery_cadence, a.id FROM A a, B b WHERE b.detail_id = a.id AND a.id = 1236565;
The above query returns a row with all values null, but I want no rows. How can I achieve this?
If that's not possible, is there an alternative solution?
a.id = 1236565 does not exist in table A.
Just add HAVING COUNT(b.delivery_cadence) > 0, e.g.:
SELECT LISTAGG(DISTINCT b.delivery_cadence, ',') WITHIN GROUP (ORDER BY b.delivery_cadence)
delivery_cadence, a.id FROM A a, B b WHERE b.detail_id = a.id AND a.id = 1236565
HAVING COUNT(b.delivery_cadence) > 0
Say you have tables like these:
create table a(id) as
(
select 1 from dual union all
select 2 from dual
);
create table b(detail_id, delivery_cadence) as
(
select 1, 'x' from dual union all
select 1, 'A' from dual union all
select 2, 'y' from dual
);
If I understand correctly, you need (in ANSI join syntax):
SELECT LISTAGG(DISTINCT b.delivery_cadence, ',') WITHIN GROUP (ORDER BY b.delivery_cadence) delivery_cadence,
a.id
FROM A
inner join B ON b.detail_id = a.id
where a.id = 1236565
group by a.id;
no rows selected
For a value of ID that exists in the tables, you get:
SELECT LISTAGG(DISTINCT b.delivery_cadence, ',') WITHIN GROUP (ORDER BY b.delivery_cadence) delivery_cadence,
a.id
FROM A
inner join B ON b.detail_id = a.id
where a.id = 1
group by a.id;
DELIVERY_CADENCE         ID
---------------- ----------
A,x                       1
1 row selected.
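The underlying rule is that an aggregate query without GROUP BY always returns exactly one row, even over empty input, while a grouped query returns one row per group and therefore none when there are no groups. The sketch below shows this with Python's sqlite3 module, using SQLite's GROUP_CONCAT as a stand-in for LISTAGG. That substitution is an assumption made for the sake of a runnable example, and GROUP_CONCAT does not guarantee the order of the concatenated values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (detail_id INTEGER, delivery_cadence TEXT);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1, 'x'), (1, 'A'), (2, 'y');
""")

# No GROUP BY: one row comes back even though nothing matches the filter.
ungrouped = conn.execute("""
    SELECT GROUP_CONCAT(DISTINCT b.delivery_cadence)
    FROM a JOIN b ON b.detail_id = a.id
    WHERE a.id = ?
""", (1236565,)).fetchall()

grouped_sql = """
    SELECT GROUP_CONCAT(DISTINCT b.delivery_cadence), a.id
    FROM a JOIN b ON b.detail_id = a.id
    WHERE a.id = ?
    GROUP BY a.id
"""
# With GROUP BY: no matching rows means no groups, hence no result rows.
grouped_missing = conn.execute(grouped_sql, (1236565,)).fetchall()
grouped_existing = conn.execute(grouped_sql, (1,)).fetchall()

print(ungrouped)         # a single row holding NULL
print(grouped_missing)   # empty result
print(grouped_existing)  # one row for id 1
```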

Easy way to get a single resultset from three identical tables?

The DB I'm working with has three tables with identical column layouts, OPEX, NOPEX and CAPEX. I would like to query all three for items with a matching AssetId and get a single result set so that I can process them all at the same time in my .Net code.
The twist is that I do need to know which table they came from.
I know I can do this with a series of CASE in the SELECT clause, perhaps using the ID column in each where it's non-zero to decide which of the tables it came from. But I would have to have one for each column and the tables are pretty wide.
Is there some other way to solve this problem?
In order to get them into one set, you would use a combination of UNION and EXISTS() checks. The UNION ALL will give you a single result set that contains data from all three tables, and the EXISTS check on each will confirm the table you are querying from has corresponding records in the other tables.
SELECT *, 'OPEX' AS table_name
FROM OPEX o
WHERE EXISTS (
SELECT 1
FROM NOPEX n
WHERE n.asset_id = o.asset_id)
AND EXISTS (
SELECT 1
FROM CAPEX c
WHERE c.asset_id = o.asset_id)
UNION ALL
SELECT *, 'NOPEX' AS table_name
FROM NOPEX n
WHERE EXISTS (
SELECT 1
FROM Opex o
WHERE o.asset_id = n.asset_id)
AND EXISTS (
SELECT 1
FROM CAPEX c
WHERE c.asset_id = n.asset_id)
UNION ALL
SELECT *, 'CAPEX' AS table_name
FROM CAPEX c
WHERE EXISTS (
SELECT 1
FROM Opex o
WHERE o.asset_id = c.asset_id)
AND EXISTS (
SELECT 1
FROM NOPEX n
WHERE n.asset_id = c.asset_id)
I guess you could also do INNER JOINs?
SELECT c.*, 'CAPEX' AS table_name
FROM CAPEX c
INNER JOIN OPEX o
ON o.asset_id = c.asset_id
INNER JOIN NOPEX n
ON n.asset_id = c.asset_id
UNION ALL
SELECT o.*, 'OPEX' AS table_name
FROM OPEX o
INNER JOIN CAPEX c
ON c.asset_id = o.asset_id
INNER JOIN NOPEX n
ON n.asset_id = o.asset_id
UNION ALL
SELECT n.*, 'NOPEX' AS table_name
FROM NOPEX n
INNER JOIN OPEX o
ON o.asset_id = n.asset_id
INNER JOIN CAPEX c
ON c.asset_id = n.asset_id
Similar answer to dfundako, but resolving sooner where AssetId is in all three tables and less hitting of the indexes on the related tables:
;with cte as (
select
AssetID
from (
select distinct
AssetID
from Opex
union all
select distinct
AssetID
from Nopex
union all
select distinct
AssetID
from Capex
) as AssetIDs
group by AssetId
having count(AssetId) = 3
)
select 'Opex', o.* from Opex as o
inner join cte
on o.AssetID = cte.AssetID
union all
select 'Nopex', n.* from Nopex as n
inner join cte
on n.AssetID = cte.AssetID
union all
select 'Capex', c.* from Capex as c
inner join cte
on c.AssetID = cte.AssetID
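A small runnable version of this CTE approach, sketched with Python's sqlite3 module rather than SQL Server. The `amount` column and the sample rows are invented for illustration. The literal in the first column records which table each row came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE opex  (asset_id INTEGER, amount INTEGER);
    CREATE TABLE nopex (asset_id INTEGER, amount INTEGER);
    CREATE TABLE capex (asset_id INTEGER, amount INTEGER);
    INSERT INTO opex  VALUES (1, 10), (2, 20);
    INSERT INTO nopex VALUES (1, 30), (3, 40);
    INSERT INTO capex VALUES (1, 50), (1, 60), (2, 70);
""")

# in_all holds the asset ids present in all three tables; DISTINCT makes
# each table contribute an id at most once, so a count of 3 means "in all".
rows = conn.execute("""
    WITH in_all AS (
        SELECT asset_id
        FROM (
            SELECT DISTINCT asset_id FROM opex
            UNION ALL SELECT DISTINCT asset_id FROM nopex
            UNION ALL SELECT DISTINCT asset_id FROM capex
        ) AS ids
        GROUP BY asset_id
        HAVING COUNT(*) = 3
    )
    SELECT 'OPEX' AS table_name, o.asset_id, o.amount
    FROM opex AS o JOIN in_all ON in_all.asset_id = o.asset_id
    UNION ALL
    SELECT 'NOPEX', n.asset_id, n.amount
    FROM nopex AS n JOIN in_all ON in_all.asset_id = n.asset_id
    UNION ALL
    SELECT 'CAPEX', c.asset_id, c.amount
    FROM capex AS c JOIN in_all ON in_all.asset_id = c.asset_id
    ORDER BY 1, 3
""").fetchall()

print(rows)
```

Only asset 1 appears in all three tables, so only its rows are returned, each tagged with its source table.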

Inner joined same query returns more result than when it's executed alone

I don't know if I'm wrong, but I've always thought (and I still do) that querying a table alone and querying the inner join of that table with itself would return the same number of records. Like this:
select 'foo' foo from dual;
versus
select * from (select 'foo' foo from dual)q1 inner join
(select 'foo' foo from dual)q2
on q1.foo=q2.foo
Both these queries return one record. But I have a query that returns more records when I inner join it with itself. Here's my query:
SELECT distinct DOC.DOCID
FROM AG_INW_DOC DOC
JOIN LAG_CITIZENS CIT
ON DOC.CITIZENID=CIT.CITIZENID
JOIN
(SELECT TSK.DOCID,
OFCR.DEPID
FROM AG_TASKS TSK
JOIN AG_TASK_EXECUTORS EXEC
ON TSK.TASKID=EXEC.TASKID
JOIN AG_OFFICERS OFCR
ON EXEC.ISSUEDOFFICERID =OFCR.OFFICERID
WHERE EXEC.ISMAINEXECUTOR =1
)TSK ON DOC.DOCID =TSK.DOCID
LEFT JOIN
(SELECT ESCDOCID, UNDERCONTROL,ORGID FROM AG_ESCORTING_DOCUMENTS
) ESC
ON DOC.ESCDOCID =ESC.ESCDOCID
WHERE DOC.CATEGORYID IN (11,12)
AND TRUNC(DOC.RECEIVEDDATE,'DDD') BETWEEN TO_DATE('01.11.2015') AND TO_DATE('30.11.2015')
AND (TSK.DEPID IN ('017','004')
OR (TSK.DEPID ='008'
AND DOC.SUBJECTID IN (1,2,3,4,20,22,23,24) ))
AND DOC.DOCSTAT != 3
UNION ALL
SELECT distinct DOC.DOCID
FROM AG_INW_DOC DOC
JOIN LAG_CITIZENS CIT
ON DOC.CITIZENID=CIT.CITIZENID
LEFT JOIN AG_TASKS TSK
ON DOC.DOCID =TSK.DOCID
LEFT JOIN
(SELECT ESCDOCID, UNDERCONTROL,ORGID FROM AG_ESCORTING_DOCUMENTS
) ESC
ON DOC.ESCDOCID =ESC.ESCDOCID
WHERE DOC.CATEGORYID IN (11,12)
AND DOC.ADDRESSEDOFFICERID IN (9,26)
AND TRUNC(DOC.RECEIVEDDATE,'DDD') BETWEEN TO_DATE('01.11.2015') AND TO_DATE('30.11.2015')
AND DOC.DOCSTAT != 3
If I run this query alone I get 3019 records returned. But if I inner join it with itself and select from this join I get 3023 records.
Now, I don't expect anyone to examine my query and point out the problem. I just need to know what circumstances might cause this behavior.
EDIT
The query returns only distinct values. No duplicates
Your assumption is wrong.
An inner join combines every result from the first select with every result from the second select, and then filters by the join condition. So if your select returns a single column with the following three values:
1, 2, 2
a join with itself on the condition that the values must be equal will yield
(1, 1), (2, 2), (2, 2), (2, 2), (2, 2)
So you get 5 rows instead of 3.
Without looking at your actual select, you probably have non-unique values in the columns of your join condition.
To find the duplicates, wrap your complete query in something like this:
select DOCID, count(*) from (
-- your query here
) group by DOCID
having count(*) > 1
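The 1, 2, 2 example can be checked directly. This is a minimal sketch using Python's sqlite3 module with a hypothetical one-column table `t`. It runs both the self-join and the duplicate-finding query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (v INTEGER);
    INSERT INTO t VALUES (1), (2), (2);
""")

# Each of the two rows with v = 2 matches both rows with v = 2 on the
# other side, so the value 2 contributes 2 * 2 = 4 pairs.
pairs = conn.execute("""
    SELECT q1.v, q2.v
    FROM t AS q1 JOIN t AS q2 ON q1.v = q2.v
    ORDER BY q1.v
""").fetchall()

# The duplicate finder: values that occur more than once.
dups = conn.execute("""
    SELECT v, COUNT(*) FROM t GROUP BY v HAVING COUNT(*) > 1
""").fetchall()

print(len(pairs), pairs)
print(dups)
```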
Your assumption holds only if you join on a primary or unique key.
Here is a small example that demonstrates the opposite.
This query gives two rows, but the key is not unique:
select 'foo, I''M no PK' foo from dual union all
select 'foo, I''M no PK' foo from dual
;
Joining the above row source with itself (using WITH) gives 2 * 2 rows.
with dual2 as (
select 'foo, I''M no PK' foo from dual union all
select 'foo, I''M no PK' foo from dual
)
select * from (select foo from dual2)q1 inner join
(select foo from dual2)q2
on q1.foo=q2.foo
;
foo, I'M no PK foo, I'M no PK
foo, I'M no PK foo, I'M no PK
foo, I'M no PK foo, I'M no PK
foo, I'M no PK foo, I'M no PK
UPDATE
The above is valid, but not relevant to this question.
The problem is in the construction DISTINCT ... UNION ALL ... DISTINCT.
This may pass duplicates, because a value returned by both branches appears twice in the result. UNION must be used instead.
Example:
with tab1 as (
select 1 foo from dual union all
select 1 foo from dual union all
select 2 foo from dual)
, tab2 as (
select 2 foo from dual union all
select 2 foo from dual union all
select 3 foo from dual)
select DISTINCT foo from tab1
UNION ALL
select DISTINCT foo from tab2
order by 1;
gives
1
2
2
3
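The effect on the self-join count can be reproduced with the same tab1/tab2 data. This sketch uses Python's sqlite3 module in place of Oracle, and the helper names are hypothetical. A value returned by both branches occurs twice in the UNION ALL result and therefore produces 2 * 2 = 4 rows in the self-join, which is 2 more than its 2 rows in the plain result. If that is the only effect at work, the jump from 3019 to 3023 in the question would correspond to two DOCIDs being returned by both branches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab1 (foo INTEGER);
    CREATE TABLE tab2 (foo INTEGER);
    INSERT INTO tab1 VALUES (1), (1), (2);
    INSERT INTO tab2 VALUES (2), (2), (3);
""")

union_all_sql = ("SELECT DISTINCT foo FROM tab1 "
                 "UNION ALL SELECT DISTINCT foo FROM tab2")
union_sql = "SELECT foo FROM tab1 UNION SELECT foo FROM tab2"

def row_count(sql):
    """Number of rows the query returns on its own."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]

def self_join_count(sql):
    """Number of rows when the query is inner joined with itself on foo."""
    return conn.execute(
        f"SELECT COUNT(*) FROM ({sql}) AS q1 "
        f"JOIN ({sql}) AS q2 ON q1.foo = q2.foo"
    ).fetchone()[0]

# UNION ALL keeps the 2 that both branches return: 1, 2, 2, 3.
print(row_count(union_all_sql), self_join_count(union_all_sql))
# UNION removes it, so the self-join returns the same count as the query.
print(row_count(union_sql), self_join_count(union_sql))
```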

Problem combining result of two different queries into one

I have two tables (TableA and TableB).
create table TableA
(A int null)
create table TableB
(B int null)
insert into TableA
(A) values (1)
insert into TableB
(B) values (2)
I can't join them together, but I would still like to show the result from them as one row.
Now I can make a select like this:
select
(select A from tableA) as A
, B from TableB
Result:
A B
1 2
But if I now delete from tableB:
delete tableB
Now when I run the same query as before:
select
(select A from tableA) as A
, B from TableB
I see this:
A B
But I was expecting to see the value from TableA,
like this:
Expected Result:
A B
1
Why is this happening, and how can I still see the value from TableA although the select from TableB returns 0 rows?
I am using MS SQL Server 2005.
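Use a LEFT JOIN (although it's more of a cross join in your case).
If your db supports it (note that a CROSS JOIN returns no rows at all when b is empty, so it does not cover the empty-table case):
SELECT a.a, b.b
FROM a
CROSS JOIN b
If not, or if b can be empty, do something like: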
SELECT a.a, b.b
FROM a
LEFT JOIN b ON ( 1=1 )
However, once you have more rows in a or b, this will return the cartesian product:
1 1
1 2
2 1
2 2
This will actually give you what you're looking for, but only if you have at most one row per table:
select
(select A from tableA) as A
, (select B from TableB) as B
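The difference between these variants shows up once TableB is empty. The sketch below runs them with Python's sqlite3 module standing in for SQL Server 2005, which is an assumption made for the sake of a runnable example. The aliases `ta` and `tb` are introduced here only for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (A INTEGER);
    CREATE TABLE TableB (B INTEGER);
    INSERT INTO TableA VALUES (1);
    -- TableB is left empty on purpose
""")

def run(sql):
    return conn.execute(sql).fetchall()

# The outer FROM is TableB, which has no rows, so nothing is returned
# and the scalar subquery on TableA is never evaluated.
original = run("SELECT (SELECT A FROM TableA) AS A, B FROM TableB")

# A cross join with an empty table is empty as well.
cross = run("SELECT ta.A, tb.B FROM TableA AS ta CROSS JOIN TableB AS tb")

# A left join keeps every TableA row and pads B with NULL.
left = run("SELECT ta.A, tb.B FROM TableA AS ta "
           "LEFT JOIN TableB AS tb ON 1 = 1")

# Two scalar subqueries always give exactly one row; an empty
# subquery evaluates to NULL.
scalars = run("SELECT (SELECT A FROM TableA) AS A, "
              "(SELECT B FROM TableB) AS B")

print(original, cross, left, scalars)
```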
Give this a try:
DECLARE @TableA table (A int null)
DECLARE @TableB table (B int null)
insert into @TableA (A) values (1)
insert into @TableB (B) values (2)
--this assumes that you don't have a Numbers table, and generates one on the fly with up to 500 rows, you can increase or decrease as necessary, or just join in your Numbers table instead
;WITH Digits AS
(
SELECT 0 AS nbr
UNION SELECT 1 UNION SELECT 2 UNION SELECT 3
UNION SELECT 4 UNION SELECT 5 UNION SELECT 6
UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
)
, AllNumbers AS
(
SELECT u3.nbr * 100 + u2.nbr * 10 + u1.nbr + 1 AS Number
FROM Digits u1, Digits u2, Digits u3
WHERE u3.nbr * 100 + u2.nbr * 10 + u1.nbr + 1 <= 500
)
, AllRowsA AS
(
SELECT
A, ROW_NUMBER() OVER (ORDER BY A) AS RowNumber
FROM @TableA
)
, AllRowsB AS
(
SELECT
B, ROW_NUMBER() OVER (ORDER BY B) AS RowNumber
FROM @TableB
)
SELECT
a.A,b.B
FROM AllNumbers n
LEFT OUTER JOIN AllRowsA a on n.Number=a.RowNumber
LEFT OUTER JOIN AllRowsB b on n.Number=b.RowNumber
WHERE a.A IS NOT NULL OR b.B IS NOT NULL
OUTPUT:
A           B
----------- -----------
1           2
(1 row(s) affected)
If you DELETE @TableB, the output is:
A B
----------- -----------
1 NULL
(1 row(s) affected)
Try this:
select a, (select b from b) from a
union
select (select a from a), b from b
This should retrieve all the existing data.
You can filter it further by wrapping it in another select.