Spark SQL to select certain records against 3 tables - sql

I have 3 tables and need to fetch the records as below
Table_A,
Table_B,
Table_C
Select only the Table_A records that also exist in both Table_B and Table_C, and ignore those that are not in both. The final result should contain no duplicates.
Approach 1 tried: inner join Table_A with Table_B, separately inner join Table_A with Table_C, and finally union the two results.
Ab = Table_A.join(Table_B, Table_A["id"] == Table_B["id"], "inner").select(common columns)
Ac = Table_A.join(Table_C, Table_A["id"] == Table_C["id"], "inner").select(common columns)
result = Ab.union(Ac)  # got more duplicates
result = result.dropDuplicates(["id"])
But I still got duplicates.
Approach 2 tried with Spark SQL:
FROM Table_A A
LEFT OUTER JOIN Table_B B
ON A.id = B.id
LEFT OUTER JOIN Table_C C
ON A.id = C.id
With this approach there are no duplicates, but I get more records than Table_A has, including the uncommon records.
Any suggestions on the best approach would be appreciated.

In Spark SQL, I would recommend exists:
select a.*
from table_a a
where exists (select 1 from table_b b where b.id = a.id)
and exists (select 1 from table_c c where c.id = a.id)
This does the filtering you want, and will not duplicate records of table_a in the result set, even if there are multiple matches in table_b or table_c.
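To see why EXISTS avoids the duplicates that the join approaches produce, here is a minimal sketch that runs the same query shape against an in-memory SQLite database as a stand-in for Spark SQL. The sample rows and the `val` column are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, val TEXT);
    CREATE TABLE table_b (id INTEGER);
    CREATE TABLE table_c (id INTEGER);
    INSERT INTO table_a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO table_b VALUES (1), (1), (2);   -- id 1 appears twice
    INSERT INTO table_c VALUES (1), (3), (3);   -- id 3 appears twice
""")

# Semi-join: keep a table_a row only if both other tables contain its id.
semi_join_rows = conn.execute("""
    SELECT a.id, a.val
    FROM table_a a
    WHERE EXISTS (SELECT 1 FROM table_b b WHERE b.id = a.id)
      AND EXISTS (SELECT 1 FROM table_c c WHERE c.id = a.id)
    ORDER BY a.id
""").fetchall()

# For contrast: inner joins repeat the table_a row once per matching pair.
inner_join_rows = conn.execute("""
    SELECT a.id, a.val
    FROM table_a a
    JOIN table_b b ON b.id = a.id
    JOIN table_c c ON c.id = a.id
    ORDER BY a.id
""").fetchall()

print(semi_join_rows)   # [(1, 'x')]
print(inner_join_rows)  # [(1, 'x'), (1, 'x')]
```

Only id 1 is present in both table_b and table_c, and it comes back once from the EXISTS query even though table_b holds it twice.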

Related

How to join large subset of data with smaller subset data

I have three tables in SQL Server
TABLE_A - contains 500 rows
TABLE_B - contains 1 million rows
TABLE_C - contains 1 million rows
I want to select the rows from TABLE_B and TABLE_C join with TABLE_A based on a row number position from TABLE_B and TABLE_C tables.
Below is my sample query:
SELECT TOP (50) *
INTO ##tempResult
FROM TABLE_A
LEFT JOIN
(SELECT *
FROM
(SELECT
memberID,
ROW_NUMBER() OVER (PARTITION BY TABLE_A.member_id ORDER BY TABLE_A EM.UTupdateDate DESC) AS rowNum,
FROM
TABLE_B
JOIN
TABLE_C ON TABLE_B.memberID = TABLE_C.memberID
) AS TABLE_subset
WHERE
TABLE_subset.rowNum <=2
) AS TABLE_INC ON TABLE_A.memberID = TABLE_INC.memberID
WHERE TABLE_A.colA = 'XYZ'
Here TABLE_subset joins all the records in TABLE_B and TABLE_C, but I want to join only the top 50 records with TABLE_A.
Is there any way to achieve this?
Your question and query don't match exactly, but CROSS APPLY is probably your friend here.
The general idea is:
select TOP 50 *
from tableA a
CROSS APPLY (
SELECT TOP 2 b.id, c.otherid
from tableB b
inner join tableC c
ON c.id = b.id
where b.id = a.id -- Here you match field between A and B
order by b.date DESC -- order by something
) data
Now you just need to adapt it to your needs.
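CROSS APPLY is specific to SQL Server, so it cannot be run in a quick local test. As a rough illustration of the same "newest two rows per row of A" idea, the sketch below uses SQLite with a correlated subquery and LIMIT instead; that is a swapped-in technique, not CROSS APPLY itself. The table contents and the `updated` column are hypothetical.

```python
import sqlite3

# SQLite stand-in; CROSS APPLY itself is not available here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER);
    CREATE TABLE table_b (id INTEGER, updated TEXT);
    INSERT INTO table_a VALUES (1), (2);
    INSERT INTO table_b VALUES
        (1, '2020-01-01'), (1, '2020-01-02'), (1, '2020-01-03'),
        (2, '2020-02-01'),
        (9, '2020-03-01');   -- no matching row in table_a
""")

# For each row of table_a, the correlated subquery picks the rowids of the
# two most recent matching table_b rows, much like TOP 2 inside CROSS APPLY.
top_two = conn.execute("""
    SELECT a.id, b.updated
    FROM table_a a
    JOIN table_b b ON b.id = a.id
    WHERE b.rowid IN (
        SELECT b2.rowid
        FROM table_b b2
        WHERE b2.id = a.id
        ORDER BY b2.updated DESC
        LIMIT 2
    )
    ORDER BY a.id, b.updated DESC
""").fetchall()

print(top_two)
# [(1, '2020-01-03'), (1, '2020-01-02'), (2, '2020-02-01')]
```

The oldest row for id 1 and the unmatched id 9 are both left out.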

SQL inner join with conditional selection

I am new to SQL. Let's say I have two tables, table_A and table_B, and I want to create a view, view_1, from both of them.
table_A:
id foo
1 d
2 e
null f
table_B:
id name
1 a
2 b
3 c
When I use this query:
SELECT DISTINCT table_A.id, table_B.name
FROM table_A
INNER JOIN table_B ON table_B.id = table_A.id
The null row in table_A does not appear in view_1, since it has no match in table_B. I want view_1 to show this null row as well, like:
id name
1 a
2 b
null no entry
Should I create a fourth table? I couldn't find a way.
Try this query:
SELECT DISTINCT a.id, (CASE WHEN b.name IS NULL OR b.name = '' THEN 'No Entry' ELSE b.name END) AS name
FROM table_A a
LEFT JOIN table_B b ON a.id = b.id
You are looking for an outer join. Thus you keep all table_A rows and join table_B rows where they exist. If no match exists, the table_B columns in the joined row are NULL.
You replace NULLs with a value with COALESCE.
SELECT a.id, COALESCE(b.name, 'no entry') AS name
FROM table_a a
LEFT OUTER JOIN table_b b ON b.id = a.id
ORDER BY a.id NULLS LAST;
You haven't tagged your request with your DBMS. Not all DBMS support the NULLS LAST clause.
Please note that there is no DISTINCT in my query. It is not needed. And every time you think you must use DISTINCT, think twice. SELECT DISTINCT is very seldom needed. Most often it is used, because the query is kind of flawed and causes the undesired duplicates itself.
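As a quick check of the outer join plus COALESCE approach, here is a small sketch using the sample rows from the question in an in-memory SQLite database. SQLite is only a stand-in, since the DBMS was not specified, and the ordering uses `a.id IS NULL` rather than NULLS LAST so that it does not depend on NULLS LAST support.

```python
import sqlite3

# SQLite stand-in loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, foo TEXT);
    CREATE TABLE table_b (id INTEGER, name TEXT);
    INSERT INTO table_a VALUES (1, 'd'), (2, 'e'), (NULL, 'f');
    INSERT INTO table_b VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# LEFT OUTER JOIN keeps every table_a row; COALESCE fills the missing name.
rows = conn.execute("""
    SELECT a.id, COALESCE(b.name, 'no entry') AS name
    FROM table_a a
    LEFT OUTER JOIN table_b b ON b.id = a.id
    ORDER BY a.id IS NULL, a.id   -- portable way to sort NULL ids last
""").fetchall()

print(rows)  # [(1, 'a'), (2, 'b'), (None, 'no entry')]
```

The row with the NULL id never matches anything in table_b, because NULL = NULL is not true, so it picks up the 'no entry' default.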

SQL Query Duplicating records

I've got two tables.
Let's call them table_A and table_B.
Table_B contains the ForeignKey of table_A.
Table_A
ID Name
1 A
2 B
3 C
Table_B
ID table_a_fk
1 2
2 3
Now I want to get all the names out of table_a IF table_b does not contain the ID of the record in table_a.
I've tried it with this query:
SELECT a.name
FROM table_a a, table_b b
WHERE a.id != b.table_a_fk
This query returns the right names, but I get the result about 5 times and I don't know why.
I hope someone can explain that to me.
Your query creates a cartesian product between your two tables A and B. It is the cartesian product that generates those duplicate values. Instead, you want to use an anti-join, which is most commonly written in SQL using NOT EXISTS:
SELECT a.name
FROM table_a a
WHERE NOT EXISTS (
SELECT *
FROM table_b b
WHERE a.id = b.table_a_fk
)
Another way to express an anti-join is with NOT IN (safe only if table_b.table_a_fk can never be NULL):
SELECT a.name
FROM table_a a
WHERE a.id NOT IN (
SELECT b.table_a_fk
FROM table_b b
)
Another, less common way to express an anti-join:
SELECT a.name
FROM table_a a
LEFT OUTER JOIN table_b b ON a.id = b.table_a_fk
WHERE b.id IS NULL
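To make the difference concrete, here is a small sketch using the sample rows from the question in an in-memory SQLite database (a stand-in, since the question does not name a DBMS). It runs the original cartesian-product query, the NOT EXISTS anti-join, and then shows what happens to NOT IN once table_b contains a NULL foreign key. The extra NULL row is added only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, name TEXT);
    CREATE TABLE table_b (id INTEGER, table_a_fk INTEGER);
    INSERT INTO table_a VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO table_b VALUES (1, 2), (2, 3);
""")

def names(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Original query: every (a, b) pair with differing ids survives the filter.
cross_product = names("""
    SELECT a.name FROM table_a a, table_b b
    WHERE a.id != b.table_a_fk ORDER BY a.name
""")

not_exists_sql = """
    SELECT a.name FROM table_a a
    WHERE NOT EXISTS (SELECT * FROM table_b b WHERE a.id = b.table_a_fk)
"""
not_in_sql = """
    SELECT a.name FROM table_a a
    WHERE a.id NOT IN (SELECT b.table_a_fk FROM table_b b)
"""

anti_join = names(not_exists_sql)
not_in_before = names(not_in_sql)

# Illustration only: add a table_b row whose foreign key is NULL.
conn.execute("INSERT INTO table_b VALUES (3, NULL)")
not_exists_after = names(not_exists_sql)
not_in_after = names(not_in_sql)

print(cross_product)     # ['A', 'A', 'B', 'C']
print(anti_join)         # ['A']
print(not_in_before)     # ['A']
print(not_exists_after)  # ['A']
print(not_in_after)      # []
```

With this data the cartesian-product query not only repeats A but also returns B and C, which are referenced in table_b. Once a NULL appears in table_a_fk, NOT IN returns nothing at all, while NOT EXISTS is unaffected.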
Use DISTINCT:
SELECT DISTINCT a.name
FROM table_a a, table_b b
WHERE a.id != b.table_a_fk
or, better:
SELECT DISTINCT a.name
FROM table_a a
WHERE NOT EXISTS (SELECT * FROM table_b b
WHERE b.table_a_fk = a.id)

Select from two different tables by value in third table

I have the following tables.
The first one is A.
A has two columns: A_ID and A_VALUE.
The second table is B. B also has two columns: B_ID and B_VALUE.
In addition I have table C. Table C has C_ID and a boolean column, C_BOOL.
If C_BOOL is true, I need to select the value from A with the given ID.
If C_BOOL is false, I need to select the value from B.
How can I write a SELECT for this?
I use Oracle DB.
Thanks in advance.
SELECT CASE C.C_BOOL WHEN 1 THEN A.A_VALUE ELSE B.B_VALUE END
FROM A
JOIN B
ON B.B_ID = A.A_ID
JOIN C
ON C.C_ID = A.A_ID
Try this query:
SELECT C_ID,CASE WHEN C_BOOL = 1 THEN T3.A_VALUE ELSE T2.B_VALUE END
FROM TABLE_C T1 LEFT OUTER JOIN TABLE_B T2 ON T1.C_ID = T2.B_ID
LEFT OUTER JOIN TABLE_A T3 ON T1.C_ID = T3.A_ID
select decode(C.C_BOOL,1,A.A_VALUE,B.B_VALUE) FROM C
JOIN A
ON A.A_ID=C.C_ID
JOIN B
ON B.B_ID=C.C_ID;
I consider T McKeown's answer valid; this is just an equivalent (but more compact) form, I suppose.
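The CASE-based selection can be tried out with a small sketch like the one below. It uses SQLite rather than Oracle, assumes C_BOOL is stored as 0/1, and uses invented sample rows. Left joins are used so that a row of C is still returned when only one of A or B has a matching ID.

```python
import sqlite3

# SQLite stand-in for Oracle; C_BOOL is assumed to be stored as 0 or 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (A_ID INTEGER, A_VALUE TEXT);
    CREATE TABLE B (B_ID INTEGER, B_VALUE TEXT);
    CREATE TABLE C (C_ID INTEGER, C_BOOL INTEGER);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO C VALUES (1, 1), (2, 0);
""")

# Pick the value from A when C_BOOL is 1, otherwise from B.
picked = conn.execute("""
    SELECT c.C_ID,
           CASE WHEN c.C_BOOL = 1 THEN a.A_VALUE ELSE b.B_VALUE END AS picked
    FROM C c
    LEFT JOIN A a ON a.A_ID = c.C_ID
    LEFT JOIN B b ON b.B_ID = c.C_ID
    ORDER BY c.C_ID
""").fetchall()

print(picked)  # [(1, 'a1'), (2, 'b2')]
```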

SQL joining/unioning help - link table b and c to table a?

I have 3 tables in my database. Each of them has one column, "index" that links the fields across all 3.
Our starting point is table a, and the indexes inside it. If the index is not there, I don't need it.
Tables b and c are very similar, and every index listed in table a will be in b or c, or both. All I need to do is make sure that all the fields in table a are joined to fields in table b or c.
I started with:
SELECT *
FROM `table_a`
JOIN table_b ON table_a.index = table_b.index
Which works great. But it will exclude all the indexes in table a which don't match, which is why I believe, when I add:
UNION
FROM `table_a`
JOIN table_c ON table_a.index = table_c.index
I actually get LESS results, rather than more.
Can someone tell me how to say "if the index isn't in table b, then look in table c?"
I'm not sure this is what you're after, but it will give you all the results of a, and any possible matches from b OR c.
SELECT *
FROM table_a
LEFT OUTER JOIN table_b ON table_a.index = table_b.index
LEFT OUTER JOIN table_c ON table_a.index = table_c.index
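To get the "if it isn't in table b, then look in table c" behaviour out of those two left joins, one option is to wrap the joined columns in COALESCE. Below is a minimal sketch in SQLite (the asker is on MySQL, where the same query shape should apply). The `val` column and the rows are hypothetical, and the key column is named `idx` here because `index` is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (idx INTEGER);
    CREATE TABLE table_b (idx INTEGER, val TEXT);
    CREATE TABLE table_c (idx INTEGER, val TEXT);
    INSERT INTO table_a VALUES (1), (2), (3);
    INSERT INTO table_b VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO table_c VALUES (2, 'c2'), (3, 'c3');
""")

# Both joins hang off table_a, so a miss in table_b does not hide table_c.
# COALESCE prefers the table_b value and falls back to table_c.
fallback = conn.execute("""
    SELECT a.idx, COALESCE(b.val, c.val) AS val
    FROM table_a a
    LEFT OUTER JOIN table_b b ON b.idx = a.idx
    LEFT OUTER JOIN table_c c ON c.idx = a.idx
    ORDER BY a.idx
""").fetchall()

print(fallback)  # [(1, 'b1'), (2, 'b2'), (3, 'c3')]
```

Index 2 exists in both tables and takes its value from table_b; index 3 exists only in table_c and falls back to it.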
Did you try UNION ALL?
SELECT *
FROM `table_a`
JOIN table_b ON table_a.index = table_b.index
UNION ALL
SELECT *
FROM `table_a`
JOIN table_c ON table_a.index = table_c.index
First step:
SELECT * FROM
table_a
LEFT JOIN table_b ON table_a.index = table_b.index
LEFT JOIN table_c ON table_b.index = table_c.index
This will get you all indexes from all 3 tables.
If you want to have the first index which isn't NULL from the tables _a, _b, or _c you can do this like this in MySQL:
SELECT COALESCE(table_a.index, table_b.index, table_c.index) AS firstIndexFromABC
FROM
table_a
LEFT JOIN table_b ON table_a.index = table_b.index
LEFT JOIN table_c ON table_b.index = table_c.index
Or what DB are you using? Update: MySQL
Update after some comments:
Sorry, I still don't get it. That's what the COALESCE method does. You get in 1 column combined the value of a if it's there, if not you get b if it's there, if not you get c.
If you mean, that you want the information, from which table you took the index then try this:
SELECT COALESCE(table_a.index, table_b.index, table_c.index) AS firstIndexFromABC,
CASE WHEN table_a.index IS NULL AND table_b.index IS NULL THEN 'c'
WHEN table_a.index IS NULL AND table_b.index IS NOT NULL THEN 'b'
WHEN table_a.index IS NOT NULL THEN 'a'
END AS whichTable
FROM
table_a
LEFT JOIN table_b ON table_a.index = table_b.index
LEFT JOIN table_c ON table_b.index = table_c.index