SQL Performance Inner Join - sql

Let me ask you something I've been thinking about for a while. Imagine that you have two tables with data:
MAIN TABLE (A)
| ID | Date |
|:-----------|------------:|
| 1 | 01-01-1990|
| 2 | 01-01-1991|
| 3 | 01-01-1992|
| 4 | 01-01-2000|
| 5 | 01-01-2001|
| 6 | 01-01-2003|
SECONDARY TABLE (B)
| ID | Date | TOTAL |
|:-----------|------------:|--------:|
| 1 | 01-01-1990| 1 |
| 2 | 01-01-1991| 2 |
| 3 | 01-01-1992| 1 |
| 4 | 01-01-2000| 5 |
| 5 | 01-01-2001| 8 |
| 6 | 01-01-2003| 7 |
and you want to select only ID with date greater than 31-12-1999 and get the following columns: ID, Date and Total. For that we have many options but my question would be which of the following would be better in terms of performance:
OPTION 1
With main as(
select id,
date
from A
where date > '31-12-1999'
)
select main.id,
main.date,
B.total
from main inner join B on main.id = b.id
OPTION 1
With main as(
select id,
date
from A
where date > '31-12-1999'
),
secondary as (
select id,
total
from B
where date > '31-12-1999'
)
select main.id,
main.date,
secondary.total
from main inner join secondary on main.id = b.id
Which of both queries would be better in terms of performance? Thanks in advance!
DATE FOR BOTH TABLES MEANS THE SAME

You don't need to use CTE you can directly join two tables -
select A.id,
A.date,
B.total
from A inner join B on A.id = b.id
where A.date > '31-12-1999'

You would need to test on your data. But there is really no need for CTEs:
select a.id a.date, b.total
from a inner join
b
on a.id = b.id
where a.date > '1999-12-31' and b.date > '1999-12-31';
As for your specific question, the two queries are not the same, because the first is filtering on only one date and the second is filtering on two dates. You should run the query that implements the logic that you intend.

Related

Bigquery SQL - WHERE IN Col Value

so i have reference table (table A) like this
| cust_id | prod |
| 1 | A, B |
| 2 | C, D, E|
This reference table will be joined by transaction history like table (table B)
| trx_id | cust_id | prod | amount
| 1 | 1 | A | 10
| 2 | 1 | B | 5
| 3 | 1 | C | 1
| 4 | 1 | D | 6
i want to get sumup value of table b amount, but the list of products is only obtained from table A.
i tried something like this but doesn't work
SELECT A.cust_id
, SUM(B.amount) AS amount
FROM A
INNER JOIN B ON A.cust_id = B.cust_id
AND B.prod IN(A.prod)
GROUP BY 1
Hmmm . . . Try splitting the prod and join on that:
SELECT A.cust_id, SUM(B.amount) AS amount
FROM A CROSS JOIN
UNNEST(SPLIT(a.prod, ', ')) p JOIN
B
ON A.cust_id = B.cust_id AND B.prod = p
GROUP BY 1;
Note: Storing multiple values in a string is a really bad idea. You can use a separate junction table (one row per customer and product) or use an array.
Below is for BigQuery Standard SQL
#standardSQL
select cust_id,
sum(if(b.prod in unnest(split(a.prod, ', ')), amount, 0)) as amount
from `project.dataset.tableB` b
join `project.dataset.tableA` a
using(cust_id)
group by cust_id
Also, note: for BigQuery - in general - storing multiple values in a string is a really good idea. :o)
See Denormalize data whenever possible for more on this

Join table A on table B and select only the first occurrence from B after specific date from table A

I'm trying to determine the best way to do the following.... Table a has a specific start_date. table b has a bunch of dollar amounts with various dates based on payments received and when. I only want to show the row from table b with the first date occurrence >= the start_date from table a. I also do not want to retrieve duplicates ID numbers which is what I am encountering now.
I have something like this so far...
Select a.ID, a.Start_Date
From a
Left Join (Select ID, Min(Recd_Dt) as Mindate, Total_Recd
Group by ID, Total_Recd) b on a.ID = b.ID and a.Start_Date <= b.Mindate
table a looks like this...
ID | Start_Dt
1 | 11/2/2017
2 | 11/3/2017
table b looks like this...
ID | Recd_Dt | Total_Recd
1 | 11/1/2017 | $600
1 | 11/10/2017 | $800
1 | 11/19/2017 | $100
2 | 11/2/2017 | $200
2 | 11/5/2017 | $600
2 | 11/6/2017 | $100
Id Like to see something like this...
ID | Recd_Dt | Total_Recd | Sum_of_Total_Recd_After_Start
1 | 11/10/2017 | $800 | $900
2 | 11/5/2017 | $600 | $700
furthermore, I'd like to also have a second join on the same table b that will give me a sum of any amount that occurred after the Start_Date
Give this a try:
SELECT
a.ID,
b.Recd_Dt,
b.Total_Recd,
SUM(Total_Recd) OVER(PARTITION BY a.ID) AS Sum_of_Total_Recd_After_Start
FROM a
INNER JOIN b ON a.ID = b.ID AND b.Recd_Dt > a.Start_Dt
QUALIFY ROW_NUMBER() OVER(PARTITION BY a.ID ORDER BY b.Start_Dt) = 1
1) Get all rows from table "a"
2) Get related rows from table "b" with Recd_Dt > Start_Dt
3) ROW_NUMBER orders rows by the earliest Start_Dt per each ID
4) QUALIFY ... = 1 keeps only the first row per ID grouping
5) SUM(Total_Recd) adds up the Total_Recd column per each ID grouping
I haven't tested it, but let me know if it works.

Comparing different columns in SQL for each row

after some transformation I have a result from a cross join (from table a and b) where I want to do some analysis on. The table for this looks like this:
+-----+------+------+------+------+-----+------+------+------+------+
| id | 10_1 | 10_2 | 11_1 | 11_2 | id | 10_1 | 10_2 | 11_1 | 11_2 |
+-----+------+------+------+------+-----+------+------+------+------+
| 111 | 1 | 0 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
| 111 | 1 | 0 | 1 | 0 | 333 | 0 | 0 | 0 | 0 |
| 111 | 1 | 0 | 1 | 0 | 444 | 1 | 0 | 1 | 1 |
| 112 | 0 | 1 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
+-----+------+------+------+------+-----+------+------+------+------+
The ids in the first column are different from the ids in the sixth column.
In a row are always two different IDs that are matched with each other. The other columns always have either 0 or 1 as a value.
I am now trying to find out how many values(meaning both have "1" in 10_1, 10_2 etc) two IDs have on average in common, but I don't really know how to do so.
I was trying something like this as a start:
SELECT SUM(CASE WHEN a.10_1 = 1 AND b.10_1 = 1 then 1 end)
But this would obviously only count how often two ids have 10_1 in common. I could make something like this for example for different columns:
SELECT SUM(CASE WHEN (a.10_1 = 1 AND b.10_1 = 1)
OR (a.10_2 = 1 AND b.10_1 = 1) OR [...] then 1 end)
To count in general how often two IDs have one thing in common, but this would of course also count if they have two or more things in common. Plus, I would also like to know how often two IDS have two things, three things etc in common.
One "problem" in my case is also that I have like ~30 columns I want to look at, so I can hardly write down for each case every possible combination.
Does anyone know how I can approach my problem in a better way?
Thanks in advance.
Edit:
A possible result could look like this:
+-----------+---------+
| in_common | count |
+-----------+---------+
| 0 | 100 |
| 1 | 500 |
| 2 | 1500 |
| 3 | 5000 |
| 4 | 3000 |
+-----------+---------+
With the codes as column names, you're going to have to write some code that explicitly references each column name. To keep that to a minimum, you could write those references in a single union statement that normalizes the data, such as:
select id, '10_1' where "10_1" = 1
union
select id, '10_2' where "10_2" = 1
union
select id, '11_1' where "11_1" = 1
union
select id, '11_2' where "11_2" = 1;
This needs to be modified to include whatever additional columns you need to link up different IDs. For the purpose of this illustration, I assume the following data model
create table p (
id integer not null primary key,
sex character(1) not null,
age integer not null
);
create table t1 (
id integer not null,
code character varying(4) not null,
constraint pk_t1 primary key (id, code)
);
Though your data evidently does not currently resemble this structure, normalizing your data into a form like this would allow you to apply the following solution to summarize your data in the desired form.
select
in_common,
count(*) as count
from (
select
count(*) as in_common
from (
select
a.id as a_id, a.code,
b.id as b_id, b.code
from
(select p.*, t1.code
from p left join t1 on p.id=t1.id
) as a
inner join (select p.*, t1.code
from p left join t1 on p.id=t1.id
) as b on b.sex <> a.sex and b.age between a.age-10 and a.age+10
where
a.id < b.id
and a.code = b.code
) as c
group by
a_id, b_id
) as summ
group by
in_common;
The proposed solution requires first to take one step back from the cross-join table, as the identical column names are super annoying. Instead, we take the ids from the two tables and put them in a temporary table. The following query gets the result wanted in the question. It assumes table_a and table_b from the question are the same and called tbl, but this assumption is not needed and tbl can be replaced by table_a and table_b in the two sub-SELECT queries. It looks complicated and uses the JSON trick to flatten the columns, but it works here:
WITH idtable AS (
SELECT a.id as id_1, b.id as id_2 FROM
-- put cross join of table a and table b here
)
SELECT in_common,
count(*)
FROM
(SELECT idtable.*,
sum(CASE
WHEN meltedR.value::text=meltedL.value::text THEN 1
ELSE 0
END) AS in_common
FROM idtable
JOIN
(SELECT tbl.id,
b.*
FROM tbl, -- change here to table_a
json_each(row_to_json(tbl)) b -- and here too
WHERE KEY<>'id' ) meltedL ON (idtable.id_1 = meltedL.id)
JOIN
(SELECT tbl.id,
b.*
FROM tbl, -- change here to table_b
json_each(row_to_json(tbl)) b -- and here too
WHERE KEY<>'id' ) meltedR ON (idtable.id_2 = meltedR.id
AND meltedL.key = meltedR.key)
GROUP BY idtable.id_1,
idtable.id_2) tt
GROUP BY in_common ORDER BY in_common;
The output here looks like this:
in_common | count
-----------+-------
2 | 2
3 | 1
4 | 1
(3 rows)

SQL Server - Select first row that meets criteria

I have 2 tables that contain IDs. There will be duplicate IDs in one of the tables and I only want to return one row for each matching ID in table B. For example:
Table A
+-----------+-----------+
| objectIdA | objectIdB |
+-----------+-----------+
| 1 | A |
| 1 | B |
| 1 | D |
| 5 | F |
+-----------+-----------+
Table B
+-----------+
| objectIdA |
+-----------+
| 1 |
| 5 |
+-----------+
Would return:
+-----------+-----------+
| objectIdA | objectIdB |
+-----------+-----------+
| 1 | D |
| 5 | F |
+-----------+-----------+
I only need one entry from Table A that matches Table B. It doesn't matter which row of table A is returned.
I'm using SQL Server.
Thanks.
;WITH CTE
AS (
SELECT B.objectIdA
,A.objectIdB
,ROW_NUMBER() OVER (PARTITION BY B.objectIdA ORDER BY A.objectIdB DESC) rn
FROM TableA A
INNER JOIN TableB B ON A.objectIdA = B.objectIdA
)
SELECT C.objectIdA
,C.objectIdB
FROM CTE
WHERE rn = 1
You can do so,by using a subselect for table a to get one entry per objectIdA group
select b.*,a.[objectIdB]
from b
join
(select [objectIdA], max([objectIdB]) [objectIdB]
from a group by [objectIdA]
) a
on(b.[objectIdA] = a.[objectIdA])
Fiddle Demo
Edit deom comments to get a whole row from tablea you can use a self join for tablea
select b.*,a.*
from b
join a
on(b.[objectIdA] = a.[objectIdA])
join (select [objectIdA], max([objectIdB]) [objectIdB]
from a group by [objectIdA]) a1
on(a.[objectIdA] = a1.[objectIdA]
and
a.[objectIdB] = a1.[objectIdB])
Fiddle Demo 2
SELECT MAX(b.ID) AS ID
,MAX(Value) AS Value
,MAX(OtherCol1) AS OtherCol1
,MAX(OtherCol2) AS OtherCol2
,MAX(OtherCol3) AS OtherCol3
FROM TblA AS a
INNER JOIN TblB AS b ON a.TblBID = b.ID
GROUP BY TblBID
Table A
Table B
Table A Data
Table B Data
Query Result
You should use PARTITION OVER to achieve the results.
SELECT
t.objectIdA,
t.objectIdB
FROM (
SELECT
a.objectIdA,
a.objectIdB,
rowid = ROW_NUMBER() OVER (PARTITION BY a.objectIdA ORDER BY a.objectIdB DESC)
FROM TableA a
INNER JOIN TableB b ON (a.objectIdA = b.objectIdA)
) t
WHERE rowid <= 1
Fiddle Code: http://sqlfiddle.com/#!3/a2ccd/1

SQL return max value from child for each parent row

I have 2 tables - 1 with parent records, 1 with child records. For each parent record, I'm trying to return a single child record with the MAX(SalesPriceEach).
Additionally I'd like to only return a value when there is more than 1 child record.
parent - SalesTransactions table:
+-------------------+---------+
|SalesTransaction_ID| text |
+-------------------+---------+
| 1 | Blah |
| 2 | Blah2 |
| 3 | Blah3 |
+-------------------+---------+
child - SalesTransactionLines table
+--+-------------------+---------+--------------+
|id|SalesTransaction_ID|StockCode|SalesPriceEach|
+--+-------------------+---------+--------------+
| 1| 1 | 123 | 99 |
| 2| 1 | 35 | 50 |
| 3| 2 | 15 | 75 |
+--+-------------------+---------+--------------+
desired results
+-------------------+---------+--------------+
|SalesTransaction_ID|StockCode|SalesPriceEach|
+-------------------+---------+--------------+
| 1 | 123 | 99 |
| 2 | 15 | 75 |
+-------------------+---------+--------------+
I found a very similar question here, and based my query on the answer but am not seeing the results I expect.
WITH max_feature AS (
SELECT c.StockCode,
c.SalesTransaction_ID,
MAX(c.SalesPriceEach) as feature
FROM SalesTransactionLines c
GROUP BY c.StockCode, c.SalesTransaction_ID)
SELECT p.SalesTransaction_ID,
mf.StockCode,
mf.feature
FROM SalesTransactions p
LEFT JOIN max_feature mf ON mf.SalesTransaction_ID = p.SalesTransaction_ID
The results from this query are returning multiple rows for each parent, and not even the highest value first!
select stl.SalesTransaction_ID, stl.StockCode, ss.MaxSalesPriceEach
from SalesTransactionLines stl
inner join
(
select stl2.SalesTransaction_ID, max(stl2.SalesPriceEach) MaxSalesPriceEach
from SalesTransactionLines stl2
group by stl2.SalesTransaction_ID
having count(*) > 1
) ss on (ss.SalesTransaction_ID = stl.SalesTransaction_ID and
ss.MaxSalesPriceEach = stl.SalesPriceEach)
OR, alternatively:
SELECT stl1.*
FROM SalesTransactionLines AS stl1
LEFT OUTER JOIN SalesTransactionLines AS stl2
ON (stl1.SalesTransaction_ID = stl2.SalesTransaction_ID
AND stl1.SalesPriceEach < stl2.SalesPriceEach)
WHERE stl2.SalesPriceEach IS NULL;
I know I'm a year late to this party but I always prefer using Row_Number in these situations. It solves the problem when there are two rows that meet your Max criteria and makes sure that only one row is returned:
with z as (
select
st.SalesTransaction_ID
,row=ROW_NUMBER() OVER(PARTITION BY st.SalesTransaction_ID ORDER BY stl.SalesPriceEach DESC)
,stl.StockCode
,stl.SalesPriceEach
from
SalesTransactions st
inner join SalesTransactionLines stl on stl.SalesTransaction_ID = st.SalesTransaction_ID
)
select * from z where row = 1
SELECT SalesTransactions.SalesTransaction_ID,
SalesTransactionLines.StockCode,
MAX(SalesTransactionLines.SalesPriceEach)
FROM SalesTransactions RIGHT JOIN SalesTransactionLines
ON SalesTransactions.SalesTransaction_ID = SalesTransactionLines.SalesTransaction_ID
GROUP BY SalesTransactions.SalesTransaction_ID, alesTransactionLines.StockCode;
select a.SalesTransaction_ID, a.StockCode, a.SalesPriceEach
from SalesTransacions as a
inner join (select SalesTransaction_ID, MAX(SalesPriceEach) as SalesPriceEach
from SalesTransactionLines group by SalesTransaction_ID) as b
on a.SalesTransaction_ID = b.SalesTransaction_ID
and a.SalesPriceEach = b.SalesPriceEach
subquery returns table with trans ids and their maximums so just join it with transactions table itself by those 2 values