SQL join on multiple columns or on single calculated column - sql

I'm migrating the backend of a budget database from Access to SQL Server and I ran into an issue.
I have 2 tables (let's call them t1 and t2) that share many fields in common: Fund, Department, Object, Subcode, TrackingCode, Reserve, and FYEnd.
If I want to join the tables to find records where all 7 fields match, I can create an inner join using each field:
SELECT *
FROM t1
INNER JOIN t2
ON t1.Fund = t2.Fund
AND t1.Department = t2.Department
AND t1.Object = t2.Object
AND t1.Subcode = t2.Subcode
AND t1.TrackingCode = t2.TrackingCode
AND t1.Reserve = t2.Reserve
AND t1.FYEnd = t2.FYEnd;
This works, but it runs very slowly. When the backend was in Access, I was able to solve the problem by adding a calculated column to both tables. It simply concatenated the fields using "-" as a delimiter. The revised query is as follows:
SELECT *
FROM t1 INNER JOIN t2
ON CalculatedColumn = CalculatedColumn
This looks cleaner and runs much faster. The problem is that when I moved t1 and t2 to SQL Server, the same query gives me an error message:
I'm new to SQL Server. Can anyone explain what's going on here? Is there a setting I need to change for the calculated column?

Posting this as an answer from my comment.
Usually, this is an issue with mismatched data types between the two columns referenced. Check that the two fields (CompositeID) have the same data type in both tables.

You have to calculate the columns before joining them as the ON clause can only access columns for the table.
It is no good to have two identical tables anyway so you should rethink your design completely.
SELECT t1a.*,t2a.*
FROM (SELECT CalculatedColumn, * FROM t1) t1a INNER JOIN (SELECT CalculatedColumn, * FROM t2 ) t2a
ON t1a.CalculatedColumn = t2a.CalculatedColumn
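As a rough illustration of the subquery approach above, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server. It only uses three of the seven fields, the sample rows and the Amount, Budget and CompositeKey names are invented for the example, and SQLite's || concatenation operator is not what SQL Server uses. It checks that joining on a concatenated key, with every column reference qualified by its table alias, returns the same rows as the multi-column join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Fund TEXT, Department TEXT, FYEnd INTEGER, Amount REAL);
    CREATE TABLE t2 (Fund TEXT, Department TEXT, FYEnd INTEGER, Budget REAL);
    INSERT INTO t1 VALUES ('100', 'A', 2020, 10.0),
                          ('100', 'B', 2020, 20.0),
                          ('200', 'A', 2021, 30.0);
    INSERT INTO t2 VALUES ('100', 'A', 2020, 15.0),
                          ('200', 'A', 2021, 35.0),
                          ('300', 'C', 2021, 40.0);
""")

# Join on each field separately.
multi_column = conn.execute("""
    SELECT t1.Fund, t1.Department, t1.FYEnd, t1.Amount, t2.Budget
    FROM t1
    INNER JOIN t2
        ON t1.Fund = t2.Fund
       AND t1.Department = t2.Department
       AND t1.FYEnd = t2.FYEnd
    ORDER BY t1.Fund, t1.Department
""").fetchall()

# Build the concatenated key in a derived table, then join on it.
# CompositeKey is a hypothetical name; note the t1a./t2a. qualifiers in ON.
composite_key = conn.execute("""
    SELECT t1a.Fund, t1a.Department, t1a.FYEnd, t1a.Amount, t2a.Budget
    FROM (SELECT t1.*, Fund || '-' || Department || '-' || FYEnd AS CompositeKey
          FROM t1) AS t1a
    INNER JOIN
         (SELECT t2.*, Fund || '-' || Department || '-' || FYEnd AS CompositeKey
          FROM t2) AS t2a
        ON t1a.CompositeKey = t2a.CompositeKey
    ORDER BY t1a.Fund, t1a.Department
""").fetchall()

print(multi_column)
print(composite_key)
```

Both queries return the same two matching rows here. Whether the concatenated key is actually faster depends on the engine and on indexing, which this sketch does not measure.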

Related

Improve SQL script

I would like to know if there is a better way to write the example script stated below.
Table 1 has 1 line for every item.
Table 2 has 1 line for every physique available of an item.
I would write the SQL below. But when I have about 18 physique values, this will increase the join section. I can join the table without specifying the Physique, but this leaves me with a dataset where rows are exploded and I need to run a Distinct or Group By.
select
t2.ItemID, t2.Name, t1_width.Target as 'Width', t1_length.Target as 'Length'
from
t2
left join t1 as t1_width on t1_width.ItemID = t2.ItemID and t1_width.Physique = 'Width'
left join t1 as t1_length on t1_length.ItemID = t2.ItemID and t1_length.Physique = 'Length'
Maybe there is a better way to pick the right values in the SELECT, or to make do with a single join?
You can use PIVOT in this case. PIVOT basically turns rows into columns. That way there is only one INNER JOIN ON t1.ItemID = t2.ItemID.
To start with,
SELECT DISTINCT Physique FROM table2
to get the pivot (column values). There is even a set query to generate this in the example below.
[Use this example to build the query]
Quick tip: you could use either MAX(target) or COUNT(target) as the aggregate function in the PIVOT, depending on the dataset you are trying to generate.
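PIVOT syntax is specific to some engines, so the sketch below swaps in conditional aggregation (MAX over a CASE expression), which gives the same rows-to-columns result with a single join. It runs on Python's built-in sqlite3 module; the table layout follows the query in the question, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t2 (ItemID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE t1 (ItemID INTEGER, Physique TEXT, Target REAL);
    INSERT INTO t2 VALUES (1, 'Shelf'), (2, 'Desk');
    INSERT INTO t1 VALUES (1, 'Width', 80.0),
                          (1, 'Length', 200.0),
                          (2, 'Width', 120.0);
""")

# One join; each physique becomes a column through MAX(CASE ...).
# A missing physique for an item shows up as NULL (None in Python).
rows = conn.execute("""
    SELECT t2.ItemID,
           t2.Name,
           MAX(CASE WHEN t1.Physique = 'Width'  THEN t1.Target END) AS Width,
           MAX(CASE WHEN t1.Physique = 'Length' THEN t1.Target END) AS Length
    FROM t2
    LEFT JOIN t1 ON t1.ItemID = t2.ItemID
    GROUP BY t2.ItemID, t2.Name
    ORDER BY t2.ItemID
""").fetchall()

for row in rows:
    print(row)
```

With 18 physique values the SELECT list grows by one line per value, but the FROM clause keeps a single join.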

BigQuery how to automatically handle "duplicate column names" on left join

I am working with a dataset of tables that (a) often requires joining tables together, but also (b) frequently has duplicate column names. Any time I write a query along the lines of:
SELECT
t1.*, t2.*
FROM t1
LEFT JOIN t2 ON t1.this_id = t2.matching_id
...I get the error Duplicate column names in the result are not supported. Found duplicate(s): this_col, that_col, another_col, more_cols, dupe_col, get_the_idea_col
I understand that with BigQuery it is better to avoid using * when selecting from tables. However, my tables aren't too big and my BigQuery budget is high, and doing these joins with all columns helps significantly with data exploration.
Is there any way BigQuery can automatically handle or rename columns in these situations (e.g. prefix the column with the table name), as opposed to not allowing the query altogether?
Thanks!
The simplest way is to select records rather than columns:
SELECT t1, t2
FROM t1 LEFT JOIN
t2
ON t1.this_id = t2.matching_id;
This is pretty much what I do for ad hoc queries.
If you want the results as columns and not records (they don't look much different in the results), you can use EXCEPT:
SELECT t1.* EXCEPT (duplicate_column_name),
t2.* EXCEPT (duplicate_column_name),
t1.duplicate_column_name as t1_duplicate_column_name,
t2.duplicate_column_name as t2_duplicate_column_name
FROM t1 LEFT JOIN
t2
ON t1.this_id = t2.matching_id;
Is there any way BigQuery can automatically handle or rename columns in these situations (e.g. prefix the column with the table name), as opposed to not allowing the query altogether?
This is possible with BigQuery Legacy SQL, which can be handy for data exploration unless you are dealing with data types or functions/features specific to Standard SQL.
So the query below
#legacySQL
SELECT t1.*, t2.*
FROM table1 AS t1
LEFT JOIN table2 AS t2
ON t1.this_id = t2.matching_id
will produce output where all column names will be prefixed with respective alias like t1_this_id and t2_matching_id

Comparing two Access tables identical in structure but not data

I have two tables in an Access database, Table1 and Table2, with exactly the same structure, but Table1 has more data. I want to figure out which data I am missing from Table2. The primary key for each table is composed of text fields:
CenterName
BuildingName
FloorNo
RoomNo
Each center can have many buildings and two different centers can have a building with the same name. Also room numbers and floor numbers can be the same across different buildings and different centers.
I have tried
SELECT t1.CenterName, t1.BuildingName, t1.FloorNo, t1.RoomNo, t2.CenterName
FROM Table1 as t1 LEFT JOIN Table2 as t2 ON t1.CenterName=t2.CenterName
WHERE t2.CenterName Is Null;
But the above does not return any data, meaning all the Centers are in both tables. But it does not tell me anything about the rest of the fields that might be missing from Table2.
Can anyone please help to re-write my query so it works as intended?
I am used to SQL Server database so building queries in Access is a bit time consuming for me. Before I transfer all the data into SQL Server for analysis I wanted to see if I can get any help here.
Join on all four of the fields which make up the primary key.
SELECT
t1.CenterName,
t1.BuildingName,
t1.FloorNo,
t1.RoomNo,
t2.CenterName
FROM
Table1 AS t1
LEFT JOIN Table2 AS t2
ON
t1.CenterName = t2.CenterName
AND t1.BuildingName = t2.BuildingName
AND t1.FloorNo = t2.FloorNo
AND t1.RoomNo = t2.RoomNo
WHERE t2.CenterName Is Null;
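To see why joining on the full key matters, here is a small sketch using Python's built-in sqlite3 module in place of Access. The room data is invented for illustration. Joining on CenterName alone finds nothing, while joining on all four key fields returns the row that is missing from Table2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (CenterName TEXT, BuildingName TEXT, FloorNo TEXT, RoomNo TEXT);
    CREATE TABLE Table2 (CenterName TEXT, BuildingName TEXT, FloorNo TEXT, RoomNo TEXT);
    INSERT INTO Table1 VALUES ('North', 'Main', '1', '101'),
                              ('North', 'Main', '1', '102'),
                              ('South', 'Main', '1', '101');
    INSERT INTO Table2 VALUES ('North', 'Main', '1', '101'),
                              ('South', 'Main', '1', '101');
""")

# Join on CenterName only: every center exists in Table2, so nothing is NULL.
by_center_only = conn.execute("""
    SELECT t1.CenterName, t1.BuildingName, t1.FloorNo, t1.RoomNo
    FROM Table1 AS t1
    LEFT JOIN Table2 AS t2 ON t1.CenterName = t2.CenterName
    WHERE t2.CenterName IS NULL
""").fetchall()

# Join on the whole primary key: unmatched Table1 rows get NULLs from Table2.
by_full_key = conn.execute("""
    SELECT t1.CenterName, t1.BuildingName, t1.FloorNo, t1.RoomNo
    FROM Table1 AS t1
    LEFT JOIN Table2 AS t2
        ON t1.CenterName = t2.CenterName
       AND t1.BuildingName = t2.BuildingName
       AND t1.FloorNo = t2.FloorNo
       AND t1.RoomNo = t2.RoomNo
    WHERE t2.CenterName IS NULL
""").fetchall()

print(by_center_only)
print(by_full_key)
```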

Number of Records don't match when Joining three tables

Despite going through every material I could possibly find on the internet, I haven't been able to solve this issue myself. I am new to MS Access and would really appreciate any pointers.
Here's my problem - I have three tables
Source1084 with columns - Department, Sub-Dept, Entity, Account, +few more
R12CAOmappingTable with columns - Account, R12_Account
Table4 with columns - R12_Account, Department, Sub-Dept, Entity, New Dept, LOB +few more
I have a total of 1084 records in Source and the result table must also contain 1084 records. I need to draw a table with all the columns from Source + R12_account from R12CAOmappingTable + all columns from Table4.
Here is the query I wrote. It yields the right columns, but depending on which join options I use it returns either more or fewer than 1084 records.
SELECT rmt.r12_account,
srb.version,
srb.fy,
srb.joblevel,
srb.scenario,
srb.department,
srb.[sub-department],
srb.[job function],
srb.entity,
srb.employee,
table4.lob,
table4.product,
table4.newacct,
table4.newdept,
srb.[beg balance],
srb.jan,
srb.feb,
srb.mar,
srb.apr,
srb.may,
srb.jun,
srb.jul,
srb.aug,
srb.sep,
srb.oct,
srb.nov,
srb.dec,
rmt.r12_account
FROM (source1084 AS srb
LEFT JOIN r12caomappingtable AS rmt
ON srb.account = rmt.account)
LEFT JOIN table4
ON ( srb.department = table4.dept )
AND ( srb.[sub-department] = table4.subdept )
AND ( srb.entity = table4.entity )
WHERE ( ( ( srb.[sub-department] ) = table4.subdept )
AND ( ( srb.entity ) = table4.entity )
AND ( ( rmt.r12_account ) = table4.r12_account ) );
In this simple example, Table1 contains 3 rows with unique fld1 values. Table2 contains one row, and the fld1 value in that row matches one of those in Table1. Therefore this query returns 3 rows.
SELECT *
FROM
Table1 AS t1
LEFT JOIN Table2 AS t2
ON t1.fld1 = t2.fld1;
However if I add the WHERE clause as below, that version of the query returns only one row --- the row where the fld1 values match.
SELECT *
FROM
Table1 AS t1
LEFT JOIN Table2 AS t2
ON t1.fld1 = t2.fld1
WHERE t1.fld1 = t2.fld1;
In other words, that WHERE clause counteracts the LEFT JOIN because it excludes rows where t2.fld1 is Null. If that makes sense, notice that the second query is functionally equivalent to this ...
SELECT *
FROM
Table1 AS t1
INNER JOIN Table2 AS t2
ON t1.fld1 = t2.fld1;
Your situation is similar. I suggest you first eliminate the WHERE clause and confirm this query returns at least your expected 1084 rows.
SELECT Count(*) AS CountOfRows
FROM (source1084 AS srb
LEFT JOIN r12caomappingtable AS rmt
ON srb.account = rmt.account)
LEFT JOIN table4
ON ( srb.department = table4.dept )
AND ( srb.[sub-department] = table4.subdept )
AND ( srb.entity = table4.entity );
After you get the query returning the correct number of rows, you can alter the SELECT list to return the columns you want. But the columns aren't really the issue until you can get the correct rows.
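The three-row example from this answer can be reproduced with a short sketch. It uses Python's built-in sqlite3 module rather than Access, and the fld1 values 1, 2 and 3 are invented; only the row counts matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (fld1 INTEGER);
    CREATE TABLE Table2 (fld1 INTEGER);
    INSERT INTO Table1 VALUES (1), (2), (3);
    INSERT INTO Table2 VALUES (2);
""")

def count_rows(sql):
    """Run a COUNT(*) query and return the single number it produces."""
    return conn.execute(sql).fetchone()[0]

# Plain LEFT JOIN keeps every Table1 row.
left_join_count = count_rows("""
    SELECT COUNT(*) FROM Table1 AS t1
    LEFT JOIN Table2 AS t2 ON t1.fld1 = t2.fld1
""")

# The WHERE clause drops rows where t2.fld1 is NULL.
left_join_where_count = count_rows("""
    SELECT COUNT(*) FROM Table1 AS t1
    LEFT JOIN Table2 AS t2 ON t1.fld1 = t2.fld1
    WHERE t1.fld1 = t2.fld1
""")

# INNER JOIN gives the same count as LEFT JOIN plus that WHERE clause.
inner_join_count = count_rows("""
    SELECT COUNT(*) FROM Table1 AS t1
    INNER JOIN Table2 AS t2 ON t1.fld1 = t2.fld1
""")

print(left_join_count, left_join_where_count, inner_join_count)
```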
Without knowing the values in your tables it is hard to give a complete answer to your question. Based on how you described it, the problem is more than likely caused by the type of joins you are using.
The best way I have found to understand which type of join to use is to refer to a Venn diagram that explains the different join types.
Jeff Atwood also has a really good explanation of SQL joins on his site using the above method as well.
Best to just use the query builder. Drop in your main table. Choose the columns you want. Now for any of the other lookup values then simply drop in the other tables, draw the join line(s), double click and use a left join. You can do this for 2 or 30 columns that need to "grab" or lookup other values from other tables. The number of ORIGINAL rows in the base table returned should ALWAYS remain the same.
So just use the query builder and follow the above.
The problem with your posted SQL is you NESTED the joins inside (). Don't do that. (or let the query builder do this for you – they tend to be quite messy but will also work).
Just use this:
FROM source1084 AS srb
LEFT JOIN r12caomappingtable AS rmt
ON srb.account = rmt.account
LEFT JOIN table4
ON ( srb.department = table4.dept )
AND ( srb.[sub-department] = table4.subdept )
AND ( srb.entity = table4.entity )
As noted, I don't see why you are "repeating" the conditions again in the where clause.

INNER JOIN with complex condition dramatically increases the execution time

I have 2 tables with several identical fields that need to be linked in a JOIN condition. E.g. each table has the fields P1 and P2. I want to write the following join query:
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
OR Table1.P2 = Table2.P2
OR Table1.P1 = Table2.P2
OR Table1.P2 = Table2.P1
When the tables are huge, this query takes a very long time to run.
I tried to measure how long a query with only one condition would take. First, I modified the tables so that all data from P1 and P2 was copied as new rows into Table1 and Table2. So my query is simple:
SELECT ... FROM Table1 INNER JOIN Table2 ON Table1.P = Table2.P
The result was more than surprising: the execution time dropped from many hours (the 1st case) to 2-3 seconds!
Why is it so different? Does it mean that complex conditions always reduce performance? How can I improve this? Maybe indexing P1 and P2 will help? I want to keep the 1st DB schema and not move to a single field P.
The queries perform differently because of the join strategies used by the optimizer. There are basically four ways that two tables can be joined:
1. "Hash join": Creates a hash table on one of the tables, which it uses to look up the values from the second.
2. "Merge join": Sorts both tables on the key and then reads the results sequentially for the join.
3. "Index lookup": Uses an index to look up values in one table.
4. "Nested loop": Compares each value in each table to all the values in the other table.
(And there are variations on these, such as using an index instead of a table, working with partitions, and handling multiple processors.) Unfortunately, in SQL Server Management Studio both (3) and (4) are shown as nested loop joins. If you look more closely, you can tell the difference from the parameters in the node.
In any case, your single-condition join uses one of the first three, and it goes fast. These joins can basically only be used on "equi-joins". That is, when the condition joining the two tables includes an equality operator.
When you switch from a single equality to an "in" or set of "or" conditions, the join condition has changed from an equijoin to a non-equijoin. My observation is that SQL Server does a lousy job of optimization in this case (and, to be fair, I think other databases do pretty much the same thing). Your performance hit is the hit of going from a good join algorithm to the nested loops algorithm.
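To make the difference between the algorithms concrete, here is a toy sketch in plain Python. It is not how SQL Server implements its joins; the function names and the sample rows are invented. It shows that a hash join depends on having one equality key to bucket rows by, while a nested loop compares every pair of rows and therefore works for any condition at a much higher cost.

```python
from collections import defaultdict

def nested_loop_join(left, right, key):
    """Compare every left row with every right row: len(left) * len(right) checks."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    """Bucket the right rows by key once, then look up each left row's bucket."""
    buckets = defaultdict(list)
    for r in right:
        buckets[r[key]].append(r)
    return [(l, r) for l in left for r in buckets.get(l[key], [])]

table1 = [{"P": 1, "name": "a"}, {"P": 2, "name": "b"}, {"P": 3, "name": "c"}]
table2 = [{"P": 2, "name": "x"}, {"P": 3, "name": "y"}, {"P": 3, "name": "z"}]

nested_result = nested_loop_join(table1, table2, "P")
hash_result = hash_join(table1, table2, "P")

print(nested_result == hash_result)
print(len(hash_result))
```

Both functions return the same pairs for an equality condition. With a chain of OR conditions there is no single key to bucket on, which is one way to picture why the optimizer may fall back to comparing every pair.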
Without testing, I might suggest some of the following strategies.
Build an index on P1 and P2 in both tables. SQL Server might use the index even for a non-equijoin.
Use the union query suggested in another solution. Each query should be correctly optimized.
Assuming these are 1-1 joins, you can also do this as a set of multiple joins:
from table1 t1 left outer join
table2 t2_11
on t1.p1 = t2_11.p1 left outer join
table2 t2_12
on t1.p1 = t2_12.p2 left outer join
table2 t2_21
on t1.p2 = t2_21.p1 left outer join
table2 t2_22
on t1.p2 = t2_22.p2
And then use case/coalesce logic in the SELECT to get the value that you actually want. Although this may look more complicated, it should be quite efficient.
You can use four queries and UNION their results:
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P1 = Table2.P2
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
ON Table1.P2 = Table2.P2
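The sketch below checks, on a tiny invented dataset, that the four-way UNION returns the same row pairs as the OR join. It uses Python's built-in sqlite3 module instead of SQL Server, and the id columns are an assumption added so that each pair of rows can be identified. UNION removes duplicates, so the two forms only agree when the selected columns identify each row pair uniquely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER PRIMARY KEY, P1 INTEGER, P2 INTEGER);
    CREATE TABLE Table2 (id INTEGER PRIMARY KEY, P1 INTEGER, P2 INTEGER);
    INSERT INTO Table1 VALUES (1, 10, 20), (2, 30, 40), (3, 50, 60);
    INSERT INTO Table2 VALUES (1, 10, 20), (2, 40, 99), (3, 70, 80);
""")

# Single join with four OR-ed conditions.
or_join = conn.execute("""
    SELECT Table1.id, Table2.id
    FROM Table1
    INNER JOIN Table2
        ON Table1.P1 = Table2.P1
        OR Table1.P2 = Table2.P2
        OR Table1.P1 = Table2.P2
        OR Table1.P2 = Table2.P1
    ORDER BY 1, 2
""").fetchall()

# Four equi-joins combined with UNION (duplicates removed).
union_join = conn.execute("""
    SELECT Table1.id, Table2.id FROM Table1 INNER JOIN Table2 ON Table1.P1 = Table2.P1
    UNION
    SELECT Table1.id, Table2.id FROM Table1 INNER JOIN Table2 ON Table1.P1 = Table2.P2
    UNION
    SELECT Table1.id, Table2.id FROM Table1 INNER JOIN Table2 ON Table1.P2 = Table2.P1
    UNION
    SELECT Table1.id, Table2.id FROM Table1 INNER JOIN Table2 ON Table1.P2 = Table2.P2
    ORDER BY 1, 2
""").fetchall()

print(or_join)
print(union_join)
```

The first pair matches on both P1 = P1 and P2 = P2, so it appears in two branches and UNION collapses it to one row. This sketch says nothing about speed; it only checks that the rewrite preserves the result.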
Does using CTEs help performance?
;WITH Table1_cte
AS
(
SELECT
...
[P] = P1
FROM Table1
UNION
SELECT
...
[P] = P2
FROM Table1
)
, Table2_cte
AS
(
SELECT
...
[P] = P1
FROM Table2
UNION
SELECT
...
[P] = P2
FROM Table2
)
SELECT ... FROM Table1_cte x
INNER JOIN
Table2_cte y
ON x.P = y.P
I suspect, as far as the processor is concerned, the above is just different syntax for the same complex conditions.