How to query two tables in SQL Server with a many-to-many relationship to identify differences - sql

I have two tables with a many-to-many relationship, and I am trying to merge the two tables in a select statement. I want to see all of the records from both tables, but only match one record from table A to one record in table B, so NULL values are OK.
For example, table A has 20 records that match only 15 records from table B. I want to see all 20 records; the 5 that cannot be matched can show NULL.
Table 1
Something | Code#
apple | 75
pizza | 75
orange | 6
Ball | 75
green | 4
red | 6
Table 2
date | id#
Feb-15 | 75
Feb-11 | 75
Jan-10 | 6
Apr-08 | 4
The result I need is
Something | Date | Code# | ID#
apple | Feb-15 | 75 | 75
pizza | Feb-11 | 75 | 75
orange | Jan-10 | 6 | 6
Ball | NULL | 75 | NULL
green | Apr-08 | 4 | 4
red | NULL | 6 | NULL

I'm imagining something like this. You want to pair up the rows side by side, but one side is going to have more rows than the other.
select * /* change to whatever you need */
from
(
    select *, row_number() over (partition by "code#" order by "something") as rn
    from tableA
) as a
full outer join /* sounds like maybe left outer join will work too */
(
    select *, row_number() over (partition by "id#" order by "date" desc) as rn
    from tableB
) as b
    on b."id#" = a."code#" and b.rn = a.rn
Actually, I don't know how you're going to get "Ball" to come after "apple" and "pizza" without some other column to sort on. Rows in SQL tables don't have any ordering; you can't rely on the default listing from select * ... or assume that the order of insertion is significant.
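If you can add a column that records insertion order, the pairing becomes deterministic. A minimal sketch, assuming SQL Server and assuming a new identity column (SortId, not in the original schema) may be added to tableA:
-- Assumption: SortId is a new identity column added only to make the pairing order stable.
alter table tableA add SortId int identity(1,1);

select a.Something, b."date", a."Code#", b."id#"
from
(
    select *, row_number() over (partition by "Code#" order by SortId) as rn
    from tableA
) as a
full outer join
(
    select *, row_number() over (partition by "id#" order by "date" desc) as rn
    from tableB
) as b
    on b."id#" = a."Code#" and b.rn = a.rn;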

A regular LEFT JOIN should do it for you.
select tableA.*
, tableB.*
from tableA
left join tableB
on tableB.PrimaryKey = tableA.PrimaryKey

We would need to see the table structure to tell you for sure, but essentially you join on the full key (if possible):
SELECT * FROM TABLEA A
JOIN TABLEB B ON
A.FULLKEY = B.FULLKEY

Left outer join. (The question changed; make that a full outer join.)
select table1.*, table2.*
from table1
full outer join table2
on table1.Code# = table2.id#
This is probably not a true many-to-many, but I think this is what you are asking for.

Related

Comparing different columns in SQL for each row

After some transformation I have a result from a cross join (of tables a and b) that I want to do some analysis on. The table for this looks like this:
+-----+------+------+------+------+-----+------+------+------+------+
| id | 10_1 | 10_2 | 11_1 | 11_2 | id | 10_1 | 10_2 | 11_1 | 11_2 |
+-----+------+------+------+------+-----+------+------+------+------+
| 111 | 1 | 0 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
| 111 | 1 | 0 | 1 | 0 | 333 | 0 | 0 | 0 | 0 |
| 111 | 1 | 0 | 1 | 0 | 444 | 1 | 0 | 1 | 1 |
| 112 | 0 | 1 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
+-----+------+------+------+------+-----+------+------+------+------+
The ids in the first column are different from the ids in the sixth column.
In a row are always two different IDs that are matched with each other. The other columns always have either 0 or 1 as a value.
I am now trying to find out how many values (meaning both have "1" in 10_1, 10_2, etc.) two IDs have in common on average, but I don't really know how to do so.
I was trying something like this as a start:
SELECT SUM(CASE WHEN a."10_1" = 1 AND b."10_1" = 1 THEN 1 END)
But this would obviously only count how often two IDs have 10_1 in common. I could write something like this for the different columns:
SELECT SUM(CASE WHEN (a."10_1" = 1 AND b."10_1" = 1)
OR (a."10_2" = 1 AND b."10_2" = 1) OR [...] THEN 1 END)
This would count in general how often two IDs have at least one thing in common, but it would of course also count cases where they have two or more things in common. Plus, I would also like to know how often two IDs have two things, three things, etc. in common.
One "problem" in my case is also that I have ~30 columns I want to look at, so I can hardly write down every possible combination for each case.
Does anyone know how I can approach my problem in a better way?
Thanks in advance.
Edit:
A possible result could look like this:
+-----------+---------+
| in_common | count |
+-----------+---------+
| 0 | 100 |
| 1 | 500 |
| 2 | 1500 |
| 3 | 5000 |
| 4 | 3000 |
+-----------+---------+
With the codes as column names, you're going to have to write some code that explicitly references each column name. To keep that to a minimum, you could write those references in a single union statement that normalizes the data, such as:
select id, '10_1' as code from wide_table where "10_1" = 1  -- "wide_table" is a stand-in for your denormalized source table
union
select id, '10_2' from wide_table where "10_2" = 1
union
select id, '11_1' from wide_table where "11_1" = 1
union
select id, '11_2' from wide_table where "11_2" = 1;
This needs to be modified to include whatever additional columns you need to link up different IDs. For the purpose of this illustration, I assume the following data model
create table p (
id integer not null primary key,
sex character(1) not null,
age integer not null
);
create table t1 (
id integer not null,
code character varying(4) not null,
constraint pk_t1 primary key (id, code)
);
Though your data evidently does not currently resemble this structure, normalizing your data into a form like this would allow you to apply the following solution to summarize your data in the desired form.
select
    in_common,
    count(*) as count
from (
    select
        count(*) as in_common
    from (
        select
            a.id as a_id, a.code,
            b.id as b_id, b.code
        from
            (select p.*, t1.code
             from p left join t1 on p.id = t1.id
            ) as a
            inner join (select p.*, t1.code
                        from p left join t1 on p.id = t1.id
                       ) as b
                on b.sex <> a.sex and b.age between a.age-10 and a.age+10
        where
            a.id < b.id
            and a.code = b.code
    ) as c
    group by
        a_id, b_id
) as summ
group by
    in_common;
The proposed solution first takes one step back from the cross-join table, as the identical column names are super annoying. Instead, we take the ids from the two tables and put them in a temporary table. The following query gets the result wanted in the question. It assumes table_a and table_b from the question are the same table, called tbl, but this assumption is not needed: tbl can be replaced by table_a and table_b in the two sub-SELECT queries. It looks complicated and uses the JSON trick to flatten the columns, but it works here:
WITH idtable AS (
SELECT a.id as id_1, b.id as id_2 FROM
-- put cross join of table a and table b here
)
SELECT in_common,
count(*)
FROM
(SELECT idtable.*,
sum(CASE
WHEN meltedR.value::text=meltedL.value::text THEN 1
ELSE 0
END) AS in_common
FROM idtable
JOIN
(SELECT tbl.id,
b.*
FROM tbl, -- change here to table_a
json_each(row_to_json(tbl)) b -- and here too
WHERE KEY<>'id' ) meltedL ON (idtable.id_1 = meltedL.id)
JOIN
(SELECT tbl.id,
b.*
FROM tbl, -- change here to table_b
json_each(row_to_json(tbl)) b -- and here too
WHERE KEY<>'id' ) meltedR ON (idtable.id_2 = meltedR.id
AND meltedL.key = meltedR.key)
GROUP BY idtable.id_1,
idtable.id_2) tt
GROUP BY in_common ORDER BY in_common;
The output here looks like this:
in_common | count
-----------+-------
2 | 2
3 | 1
4 | 1
(3 rows)

Hive / SQL - Left join with fallback

In Apache Hive I have two tables that I would like to left-join, keeping all the data from the left table and adding data where possible from the right table.
For this, the join is based on two fields (a material_id and a location_id).
This works fine with a traditional left join:
SELECT
a.*,
b.*
FROM a
LEFT JOIN (some more complex select) b
ON a.material_id=b.material_id
AND a.location_id=b.location_id;
For the location_id the database only contains two distinct values, say 1 and 2.
We now have the requirement that if there is no "perfect match" (meaning only the material_id can be joined, and the exact combination of material_id and location_id, e.g. material_id=100 and location_id=1, does not exist in the b-table), the join should "default" or "fall back" to the other possible value of the location_id, e.g. material_id=100 and location_id=2, and vice versa. This should only be the case for the location_id.
We have already looked into all possible answers, also with CASE etc., but to no avail. We tried a setup like
...
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;
but could not figure out how to really make it work in Hive query language.
Thank you for your help! Maybe somebody has a smart idea.
Here is some sample data:
Table a
| material_id | location_id | other_column_a |
| 100 | 1 | 45 |
| 101 | 1 | 45 |
| 103 | 1 | 45 |
| 103 | 2 | 45 |
Table b
| material_id | location_id | other_column_b |
| 100 | 1 | 66 |
| 102 | 1 | 76 |
| 103 | 2 | 88 |
Left-join result table
| material_id | location_id | other_column_a | other_column_b
| 100 | 1 | 45 | 66
| 101 | 1 | 45 | NULL (mat. not in b)
| 103 | 1 | 45 | DEFAULT TO where location_id=2 (88)
| 103 | 2 | 45 | 88
PS: As stated here, EXISTS etc. does not work in the join's ON clause.
The solution is to left-join without a.location_id = b.location_id and to number all rows in order of preference, then filter by row_number. In the code below the join will first duplicate rows, because every row with a matching material_id is joined. The row_number() function then assigns 1 to rows where a.location_id = b.location_id and 2 to rows where a.location_id <> b.location_id; if no exact-match row exists for a material, the non-matching row gets 1. b.location_id is added to the order by in the row_number() function so it will "prefer" rows with a lower b.location_id in case there is no exact match. I hope you have caught the idea.
select * from
(
    SELECT
        a.*,
        b.*,
        row_number() over(partition by a.material_id
                          order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id) as rn
    FROM a
    LEFT JOIN (some more complex select) b
        ON a.material_id=b.material_id
) s
where rn=1
;
Maybe this is helpful for somebody in the future:
We also came up with a different approach.
First, we create another table to calculate averages from the table b based on material_id over all (!) locations.
Second, in the join table we create three columns:
c1 - the value where material_id and location_id match (the result of a left join of table a with table b). This column is NULL if there is no perfect match.
c2 - the value from the averages (fallback) table for this material_id, regardless of the location.
c3 - the "actual value" column, where we use a CASE statement: when c1 is NULL (there is no perfect match of material and location), we use the value from c2 (the average over all the other locations for the material) for the further calculations.
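A rough sketch of that approach (assuming a Hive version with CTE support; the helper name b_avg and the use of AVG over other_column_b are illustrative assumptions, while the a/b table and column names come from the sample data above):
-- b_avg: average per material_id over all locations (the fallback table described above)
WITH b_avg AS (
    SELECT material_id, AVG(other_column_b) AS avg_other_column_b
    FROM b
    GROUP BY material_id
)
SELECT
    a.material_id,
    a.location_id,
    a.other_column_a,
    b.other_column_b AS c1,                       -- perfect match, NULL if none
    b_avg.avg_other_column_b AS c2,               -- average over all locations
    CASE WHEN b.other_column_b IS NULL
         THEN b_avg.avg_other_column_b
         ELSE b.other_column_b END AS c3          -- value actually used
FROM a
LEFT JOIN b
    ON a.material_id = b.material_id
   AND a.location_id = b.location_id
LEFT JOIN b_avg
    ON a.material_id = b_avg.material_id;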

Oracle join on first row of a subquery

This may seem simple, but somehow it isn't. I have a table of historical rate data called TBL_A that looks like this:
| id | rate | added_date |
|--------|--------|--------------|
| bill | 7.50 | 1/24/2011 |
| joe | 8.50 | 5/3/2011 |
| ted | 8.50 | 4/17/2011 |
| bill | 9.00 | 9/29/2011 |
In TBL_B, I have hours that need to be joined to a single row of TBL_A in order to get costing info:
| id | hours | added_date |
|--------|---------|--------------|
| bill | 10 | 2/26/2011 |
| ted | 4 | 7/4/2011 |
| bill | 9 | 10/14/2011 |
As you can see, for Bill there are two rates in TBL_A, but they have different dates. To properly get Bill's cost for a period of time, you have to join each row of TBL_B to a row in TBL_A that is appropriate for the date.
I figured this would be easy; because this didn't have to be an exceptionally fast query, I could just do a separate subquery for each row of costing info. However, joined subqueries apparently cannot "see" other tables that they are joined on. This query throws an invalid identifier error (ORA-00904) on anything in the subquery that has the "h" alias:
SELECT h.id, r.rate * h.hours as "COST", h.added_date
FROM TBL_B h
JOIN (SELECT * FROM (
SELECT i.id, i.rate
FROM TBL_A i
WHERE i.id = h.id and i.added_date < h.added_date
ORDER BY i.added_date DESC)
WHERE rownum = 1) r
ON h.id = r.id
If the problem is simply scoping, I don't know if the approach I took can ever work. But all I'm trying to do here is get a single row based on some criteria, so I'm definitely open to other methods.
EDIT: The desired output would be this:
| id | cost | added_date |
|--------|---------|--------------|
| bill | 75 | 2/26/2011 |
| ted | 34 | 7/4/2011 |
| bill | 81 | 10/14/2011 |
Note that Bill has two different rates in the two entries in the table. The first row is 10 * 7.50 = 75 and the second row is 9 * 9.00 = 81.
Try using not exists:
select
b.id,
a.rate,
b.hours,
a.rate*b.hours as "COST",
b.added_date,
a.added_date
from
tbl_b b
inner join tbl_a a on
b.id = a.id
where
a.added_date < b.added_date
and not exists (
select
1
from
tbl_a a2
where
a2.added_date > a.added_date
and a2.added_date < b.added_date
and a2.id = a.id
)
As an explanation why this is happening: Only correlated subqueries are aware of the context in which they're being run, since they're run for each row. A joined subquery is actually executed prior to the join, and so it has no knowledge of the surrounding tables. You need to return all identifying information with it to make the join in the top level of the query, rather than trying to do it within the subquery.
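To illustrate the difference, a correlated scalar subquery in the select list can reference h, because it is evaluated once per row of TBL_B. A sketch (not from either answer; KEEP (DENSE_RANK FIRST) picks the rate with the most recent added_date before the hours record):
SELECT
    h.id,
    (SELECT MAX(r.rate) KEEP (DENSE_RANK FIRST ORDER BY r.added_date DESC)
       FROM TBL_A r
      WHERE r.id = h.id
        AND r.added_date < h.added_date) * h.hours AS cost,
    h.added_date
FROM TBL_B h;
The join-based query below takes the other route: it returns the identifying columns and filters with row_number() at the top level.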
select id, cost, added_date from (
select
h.id,
r.rate * h.hours as "COST",
h.added_date,
-- For each record, assign r=1 for 'newest' rate
row_number() over (partition by h.id, h.added_date order by r.added_date desc) r
from
tbl_b h,
tbl_a r
where
r.id = h.id and
-- Date of rate must be entered before
-- hours billed:
r.added_date < h.added_date
)
where r = 1
;

How to merge MySQL queries with different column counts?

Definitions:
In the results, * denotes an empty column
The data in the tables is such that every field in the table has the value Fieldname + RowCount (so column 'a' in row 1 contains the value 'a1').
2 MySQL Tables
Table1
Fieldnames: a,b,c,d
Table2
Fieldnames: e,f,g,h,i,j
Task:
I want to get the first 4 rows from each of the tables.
Standalone Queries
SELECT Table1.* FROM Table1 WHERE 1 LIMIT 0,4 -- Colcount 4
SELECT Table2.* FROM Table2 WHERE 1 LIMIT 0,4 -- Colcount 6
A simple UNION of the queries fails because the two parts have different column counts.
Version 1: add two empty fields to the first query
SELECT Table1.*,'' AS i,'' AS j FROM Table1 WHERE 1 LIMIT 0,4
UNION
SELECT Table2.* FROM Table2 WHERE 1 LIMIT 0,4
So I will get the following fields in the result set:
a,b,c,d,i,j
a1,b1,c1,d1,*,*,
a2,b2,c2,d2,*,*,
....
....
e1,f1,g1,h1,i1,j1
e2,f2,g2,h2,i2,j2
The problem is that the field names of Table2 are overridden by Table1.
Version 2: shift columns by using empty fields
SELECT Table1.*,'','','','','','' FROM Table1 WHERE 1 LIMIT 0,4
UNION
SELECT '','','','',Table2.* FROM Table2 WHERE 1 LIMIT 0,4
So I will get the following fields in the result set:
a,b,c,d (plus six unnamed columns)
a1,b1,c1,d1,*,*,*,*,*,*,
a2,b2,c2,d2,*,*,*,*,*,*,
....
....
*,*,*,*,e1,f1,g1,h1,i1,j1
*,*,*,*,e2,f2,g2,h2,i2,j2
....
....
The problem is solved, but I get many empty fields.
Is there a known performance issue?
How do you solve this task?
Is there a best practice to solve this issue?
The output from a query should be a table, which is a set of rows, each row with the same set of column names and types. (There are some DBMS that support ragged rows - with different sets of columns, but that is not a mainstream feature.)
You have to decide how to handle two sets of four rows with different sets of columns in the two sets.
The simplest option, usually, is to do the two standalone queries. The two result sets are not comparable, and should not be conflated.
If you choose your Version 1, then you should decide which set of column names is appropriate, or create a composite set of names using 'AS x' column aliases.
If you choose your Version 2, then you should probably name the trailing columns of the first clause of the UNION; at the moment, they all have no name:
SELECT Table1.*, '' AS e, '' AS f, '' AS g, '' AS h, '' AS i, '' AS j
FROM Table1 WHERE 1 LIMIT 0,4
UNION
SELECT '' AS a, '' AS b, '' AS c, '' AS d, Table2.*
FROM Table2 WHERE 1 LIMIT 0,4
(The AS aliases in the second SELECT are redundant, but self-consistent; the two halves of the UNION then have the same column headings explicitly.)
Except that you have provided empty strings instead of NULL, the notation you have chosen corresponds to an 'OUTER UNION'. You can find occasional references to it in selected parts of the literature (E F Codd in the RM/V2 book; C J Date in critiques of all things OUTER). SQL 1999 provided it as a UNION JOIN; SQL 2003 removed UNION JOIN (that's pretty unusual - and damning of the feature).
I'd use two separate queries.
The thing that seems most sensible is your "version 2", except using NULLs instead of empty strings.
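For instance, a sketch of that (the LIMITs are kept from the question; UNION ALL only skips the pointless duplicate-elimination pass):
(SELECT a, b, c, d,
        NULL AS e, NULL AS f, NULL AS g, NULL AS h, NULL AS i, NULL AS j
 FROM Table1 LIMIT 0,4)
UNION ALL
(SELECT NULL, NULL, NULL, NULL,
        e, f, g, h, i, j
 FROM Table2 LIMIT 0,4);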
This took some thinking, and then some MySQL-specific workarounds. The concept is this: a join will produce the table structure you want. What you really want is a full outer join where no row 'matches.' To do this, we need a reliable way to ensure that rows don't match, and then we have to UNION a LEFT JOIN and a RIGHT JOIN, to overcome MySQL's lack of FULL OUTER JOIN.
SQL Fiddle
MySQL 5.6 Schema Setup:
CREATE TABLE A (a int, b int, c int, d int);
CREATE TABLE B (e int, f int, g int, h int, i int, j int);
INSERT INTO A VALUES (1,1,1,1),(2,2,2,2);
INSERT INTO B VALUES (8,8,8,8,8,8),(9,9,9,9,9,9);
Query 1:
SELECT * FROM
(SELECT * FROM (SELECT "TableA" as unique_field) as Ax CROSS JOIN A) as A
LEFT JOIN
(SELECT * FROM (SELECT "TableB" as unique_field) as Bx CROSS JOIN B) AS B
on A.unique_field=B.unique_field
UNION
SELECT * FROM
(SELECT * FROM (SELECT "TableA" as unique_field) as Ax CROSS JOIN A) as A
RIGHT JOIN
(SELECT * FROM (SELECT "TableB" as unique_field) as Bx CROSS JOIN B) AS B
on A.unique_field=B.unique_field
Results:
| unique_field | a | b | c | d | unique_field | e | f | g | h | i | j |
|--------------|--------|--------|--------|--------|--------------|--------|--------|--------|--------|--------|--------|
| TableA | 1 | 1 | 1 | 1 | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| TableA | 2 | 2 | 2 | 2 | (null) | (null) | (null) | (null) | (null) | (null) | (null) |
| (null) | (null) | (null) | (null) | (null) | TableB | 8 | 8 | 8 | 8 | 8 | 8 |
| (null) | (null) | (null) | (null) | (null) | TableB | 9 | 9 | 9 | 9 | 9 | 9 |
This syntax, (SELECT * FROM (SELECT "TableA" as unique_field) as Ax CROSS JOIN A) as A, is more easily understood as (SELECT "TableA" as unique_field, * FROM A) AS A, but MySQL doesn't allow an unqualified * to follow a field specification.

Filter a one-to-many query by requiring all of many meet criteria

Imagine the following tables:
create table boxes( id int, name text, ...);
create table thingsinboxes( id int, box_id int, thing enum('apple','banana','orange'));
And the tables look like:
Boxes:
id | name
1 | orangesOnly
2 | orangesOnly2
3 | orangesBananas
4 | misc
thingsinboxes:
id | box_id | thing
1 | 1 | orange
2 | 1 | orange
3 | 2 | orange
4 | 3 | orange
5 | 3 | banana
6 | 4 | orange
7 | 4 | apple
8 | 4 | banana
How do I select the boxes that contain at least one orange and nothing that isn't an orange?
How does this scale, assuming I have several hundred thousand boxes and possibly a million things in boxes?
I'd like to keep this all in SQL if possible, rather than post-processing the result set with a script.
I'm using both postgres and mysql, so subqueries are probably bad, given that mysql doesn't optimize subqueries (pre version 6, anyway).
SELECT b.*
FROM boxes b JOIN thingsinboxes t ON (b.id = t.box_id)
GROUP BY b.id
HAVING COUNT(DISTINCT t.thing) = 1 AND SUM(t.thing = 'orange') > 0;
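Since the question mentions both MySQL and Postgres: SUM(t.thing = 'orange') relies on MySQL treating a boolean expression as 0/1. A portable sketch of the same logic, grouping by name as well so it does not depend on MySQL's relaxed GROUP BY:
SELECT b.id, b.name
FROM boxes b
JOIN thingsinboxes t ON b.id = t.box_id
GROUP BY b.id, b.name
HAVING COUNT(DISTINCT t.thing) = 1
   AND SUM(CASE WHEN t.thing = 'orange' THEN 1 ELSE 0 END) > 0;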
Here's another solution that does not use GROUP BY:
SELECT DISTINCT b.*
FROM boxes b
JOIN thingsinboxes t1
ON (b.id = t1.box_id AND t1.thing = 'orange')
LEFT OUTER JOIN thingsinboxes t2
ON (b.id = t2.box_id AND t2.thing != 'orange')
WHERE t2.box_id IS NULL;
As always, before you make conclusions about the scalability or performance of a query, you have to try it with a realistic data set, and measure the performance.
I think Bill Karwin's query is just fine; however, if a relatively small proportion of boxes contain oranges, you should be able to speed things up by using an index on the thing field:
SELECT b.*
FROM boxes b JOIN thingsinboxes t1 ON (b.id = t1.box_id)
WHERE t1.thing = 'orange'
AND NOT EXISTS (
SELECT 1
FROM thingsinboxes t2
WHERE t2.box_id = b.id
AND t2.thing <> 'orange'
)
GROUP BY t1.box_id
The NOT EXISTS subquery will only be run once per orange thing, so it's not too expensive, provided there aren't many oranges.
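For what it's worth, the indexes hinted at above might look something like this (the index names are made up; as noted, measure on realistic data before drawing conclusions):
-- Helps find the 'orange' rows for the t1 join:
CREATE INDEX idx_things_thing_box ON thingsinboxes (thing, box_id);
-- Helps the NOT EXISTS probe that looks for non-orange things per box:
CREATE INDEX idx_things_box_thing ON thingsinboxes (box_id, thing);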