I have a Postgres database with 3 tables that looks a little something like this:
table categories
id
type
table games
id
table game_category
id
game_id
category_id
I want to select all games that have more than x categories of a given type.
I have gotten this far:
SELECT * FROM games WHERE id IN (
SELECT game_id FROM game_category GROUP BY game_id HAVING COUNT(*) >= 5
)
This selects all games with at least 5 categories, but doesn't narrow down the categories by their type. How could I expand on this to add the additional check for the type?
You have to join the categories table inside the subquery. Then you can add a WHERE clause on the type. Replace '?' with your actual type, of course.
SELECT * FROM games WHERE id IN (
SELECT game_id FROM game_category
INNER JOIN categories ON (categories.id=game_category.category_id)
WHERE categories.type='?'
GROUP BY game_id HAVING COUNT(*) >= 5
)
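If you want to try the filtered query on a toy dataset first, here is a minimal sketch using Python's built-in sqlite3 module instead of Postgres. The sample rows, the type values 'genre' and 'platform', and the helper name games_with_min_categories are my own assumptions for illustration; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical sample data; only table/column names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE games (id INTEGER PRIMARY KEY);
    CREATE TABLE game_category (
        id INTEGER PRIMARY KEY, game_id INTEGER, category_id INTEGER);

    INSERT INTO categories (id, type) VALUES
        (1, 'genre'), (2, 'genre'), (3, 'platform');
    INSERT INTO games (id) VALUES (10), (20);
    -- Game 10: two 'genre' categories. Game 20: one 'genre', one 'platform'.
    INSERT INTO game_category (game_id, category_id) VALUES
        (10, 1), (10, 2), (20, 1), (20, 3);
""")


def games_with_min_categories(category_type, min_count):
    """Return ids of games linked to at least min_count categories of a type."""
    rows = conn.execute(
        """
        SELECT id FROM games WHERE id IN (
            SELECT gc.game_id
            FROM game_category gc
            INNER JOIN categories c ON c.id = gc.category_id
            WHERE c.type = ?
            GROUP BY gc.game_id
            HAVING COUNT(*) >= ?
        )
        ORDER BY id
        """,
        (category_type, min_count),
    ).fetchall()
    return [row[0] for row in rows]


two_genres = games_with_min_categories("genre", 2)
one_genre = games_with_min_categories("genre", 1)
```

Passing the type and the threshold as bound parameters also avoids pasting user input into the SQL string.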
Considering query response time, you can avoid the IN clause. Mitchel's answer would also work if written as follows (note that this returns only game_id, so join back to games if you need the other columns):
SELECT game_id
FROM game_category gc
inner join categories c on c.id = gc.category_id
WHERE type = 'X'
GROUP BY game_id
HAVING COUNT(game_id) >= 5
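Notice that I used COUNT(game_id) instead of COUNT(*). Because game_id is never NULL within its own group, both return the same number here; in PostgreSQL COUNT(*) is generally no slower than COUNT(column), so treat this as a style choice rather than a guaranteed optimization.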
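To make the difference between the two COUNT forms concrete, here is a small sketch in Python's sqlite3 (not Postgres, though this NULL handling is standard SQL). The rows are made up, including the NULL category_id, which I added purely to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE game_category (game_id INTEGER, category_id INTEGER);
    -- Hypothetical rows; the NULL category_id exists only for this demo.
    INSERT INTO game_category VALUES (1, 1), (1, NULL), (1, 3);
""")

count_star, count_game_id, count_category_id = conn.execute("""
    SELECT COUNT(*), COUNT(game_id), COUNT(category_id)
    FROM game_category
    GROUP BY game_id
""").fetchone()
```

COUNT(*) counts rows, while COUNT(column) skips rows where that column is NULL, so the two only differ when the counted column can be NULL.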
In short: 3 table inner join duplicates records
I have data in BigQuery in 3 tables:
Pageviews with columns:
timestamp
user_id
title
path
Contacts with columns:
website_user_id
email
company_id
Companies with columns:
id
name
I want to display all recorded pageviews and, if the user and/or company is known, display this data next to the pageview.
First, I join contact and pageview data (the SQL is generated by the Metabase business intelligence tool):
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
ORDER BY `timestamp` DESC
It works as expected and I can see pageviews attributed to known contacts.
Next, I'd like to show pageviews of contacts with a known company, along with the company name:
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`,
`Companies`.`name` AS `name`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
INNER JOIN `analytics.companies` `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC
With this query I would expect to see only pageviews where associated contact AND company are known (just another column for company name). The problem is, I get duplicate rows for every pageview (sometimes 5, sometimes 20 identical rows).
I want to avoid selecting DISTINCT timestamps because it can lead to excluding valid pageviews from different users but with identical timestamp.
How should I approach this?
Your description sounds like you have duplicates in companies. This is easy to test for:
select c.id, count(*)
from `analytics.companies` c
group by c.id
having count(*) >= 2;
You can get the details using window functions:
select c.*
from (select c.*, count(*) over (partition by c.id) as cnt
from `analytics.companies` c
) c
where cnt >= 2
order by cnt desc, id;
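Here is a small sketch of why duplicate company rows multiply the pageviews, using Python's sqlite3 rather than BigQuery. All sample rows are invented, and the deduplicating subquery at the end is just one possible fix that assumes the duplicates are exact copies of each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pageviews (timestamp TEXT, user_id INTEGER, title TEXT, path TEXT);
    CREATE TABLE contacts (website_user_id INTEGER, email TEXT, company_id INTEGER);
    CREATE TABLE companies (id INTEGER, name TEXT);  -- deliberately no unique key

    INSERT INTO pageviews VALUES ('2020-01-01 10:00:00', 1, 'Home', '/');
    INSERT INTO contacts VALUES (1, 'a@example.com', 7);
    -- Hypothetical: company 7 was loaded three times.
    INSERT INTO companies VALUES (7, 'Acme'), (7, 'Acme'), (7, 'Acme');
""")

# One pageview joins to three identical company rows.
joined = conn.execute("""
    SELECT p.timestamp, p.path, ct.email, co.name
    FROM pageviews p
    INNER JOIN contacts ct ON p.user_id = ct.website_user_id
    INNER JOIN companies co ON ct.company_id = co.id
""").fetchall()

# The duplicate check from the answer above.
duplicate_ids = conn.execute("""
    SELECT id, COUNT(*) FROM companies GROUP BY id HAVING COUNT(*) >= 2
""").fetchall()

# Joining against one row per company id removes the fan-out.
deduplicated = conn.execute("""
    SELECT p.timestamp, p.path, ct.email, co.name
    FROM pageviews p
    INNER JOIN contacts ct ON p.user_id = ct.website_user_id
    INNER JOIN (SELECT id, MIN(name) AS name FROM companies GROUP BY id) co
        ON ct.company_id = co.id
""").fetchall()
```

The same check is worth running on contacts grouped by website_user_id, since duplicates there would multiply rows in exactly the same way.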
I am attempting to make a complex SQL join on several tables, as shown below. I have also included an image of the DB schema.
Consider table_1:
e_id  name
1     a
2     b
3     c
4     d
and table_2:
e_id  date
1     1/1/2019
1     1/1/2020
2     2/1/2019
4     2/1/2019
The issue here is performance. From tables 2 to 4 we only want the most recent entry for a given e_id, but because these tables contain historical data (more than ~3.5M rows) the query is quite slow. Below is an example of how we're currently trying to achieve this; it only includes one such join of 'table_1' with 'table_x', where we group by e_id and get the max date for it. The other approach we've considered is creating a materialized view, pulling data from that, and refreshing it periodically. Any improvements welcome.
from fds.region as rg
inner join (
select e_id, name, p_id
from fds.table_1
where sec_type = 'S' AND active_flag = 1
) as table_1 on table_1.e_id = rg.e_id
inner join fds.table_2 table_2 on table_2.e_id = rg.e_id
inner join fds.sec sec on sec.p_id = table_1.p_id
inner join fds.entity ent on ent.int_entity_id = sec.int_entity_id
inner join (
SELECT int_1.e_id, int_1.date, int_1.int_price
FROM fds.table_4 int_1
INNER JOIN (
SELECT e_id, MAX(date) date
FROM fds.table_2
GROUP BY e_id
) int_2 ON int_1.e_id = int_2.e_id AND int_1.date = int_2.date
) as table_4 on table_4.e_id = rg.e_id
where rg.region_str like '%US' and ent.sec_type = 'P'
order by table_2.int_price
limit 500;
You can simplify this logic:
(
SELECT int_1.e_id, int_1.date, int_1.int_price
FROM fds.table_4 int_1
INNER JOIN (
SELECT e_id, MAX(date) date
FROM fds.table_2
GROUP BY e_id
) int_2 ON int_1.e_id = int_2.e_id AND int_1.date = int_2.date
) as table_4
To:
(SELECT DISTINCT ON (int_1.e_id) int_1.*
FROM fds.table_4 int_1
ORDER BY int_1.e_id, int_1.date DESC
) table_4
This can take advantage of an index on fds.table_4(e_id, date desc), and might be wicked fast with such an index.
You also want appropriate indexes for the joins and filtering. However, it is hard to be more specific without an execution plan.
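As a rough way to sanity-check the "latest row per e_id" logic outside Postgres, here is a sketch in Python's sqlite3. SQLite has no DISTINCT ON, so I swap in ROW_NUMBER() as a stand-in (this needs SQLite 3.25 or newer). The rows reuse the e_id values from the question, but the ISO date strings and the int_price values are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_4 (e_id INTEGER, date TEXT, int_price REAL);
    -- Hypothetical prices; dates stored as ISO strings so they sort correctly.
    INSERT INTO table_4 VALUES
        (1, '2019-01-01', 10.0),
        (1, '2020-01-01', 11.0),
        (2, '2019-02-01', 20.0),
        (4, '2019-02-01', 40.0);
    CREATE INDEX table_4_e_id_date ON table_4 (e_id, date DESC);
""")

# Original shape: join every row to the MAX(date) of its e_id.
via_max_join = conn.execute("""
    SELECT t.e_id, t.date, t.int_price
    FROM table_4 t
    INNER JOIN (SELECT e_id, MAX(date) AS date FROM table_4 GROUP BY e_id) m
        ON t.e_id = m.e_id AND t.date = m.date
    ORDER BY t.e_id
""").fetchall()

# Stand-in for DISTINCT ON (e_id) ... ORDER BY e_id, date DESC.
via_row_number = conn.execute("""
    SELECT e_id, date, int_price
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY e_id ORDER BY date DESC) AS rn
        FROM table_4 t
    ) AS ranked
    WHERE rn = 1
    ORDER BY e_id
""").fetchall()
```

One difference to keep in mind: if two rows share the same latest date for an e_id, the MAX(date) join returns both, while DISTINCT ON and the ROW_NUMBER() stand-in keep exactly one.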
This is more of a general question.
I am looking for the best way to join 4, maybe 5, different tables. I am trying to create a Power BI report pulling live information from an IBM AS400, where customer service can type one of our part numbers,
see how many parts we have in inventory and, if there are none, see the lead time and whether any orders have already been entered for that part number.
SERI is our inventory table with 37180 records.
(active inventory that is available)
METHDM is our kit table with 37459 records.
(this table contains the bill of materials for custom kits; KIT A123 contains different part numbers, which are in SERI as well.)
STKA is our part lead time table with 76796 records.
(lead time means how long will it take for parts to come in)
OCRI is our sales order table with 6497 records.
(This table contains all customer orders)
I have some knowledge of writing queries, but this one is more challenging than what I have created in the past. Should I start with the table that has the most records and left join the rest?
From STKA 76796 records
Left join METHDM 37459 records on STKA
left join SERI 37180 records on STKA
left join OCRI 6497 records on STKA
Select
STKA.v6part as part,
STKA.v6plnt as plant,
STKA.v6tdys as pur_leadtime,
STKA.v6prpt as Pur_PrepLeadtime,
STKA.v6lead as Mfg_leadtime,
STKA.v6prpt as Mfg_PrepLeadTime,
METHDM.AQMTLP AS COMPONENT,
METHDM.AQQPPC AS QTYNEEDED,
SERI.HTLOTN AS BATCH,
SERI.HTUNIT AS UOM,
(HTQTY - HTQTYC) as ONHAND,
OCRI.DDORD# AS SALESORDER,
OCRI.DDRDAT AS PROMISED
from stka
left join METHDM on STKA.V6PART = METHDM.AQPART
left join SERI on STKA.V6PART = SERI.HTPART
left join OCRI on STKA.V6PART = OCRI.DDPART
Is this the best way to join the tables?
I think you already have your answer, but conceptually, there are a few issues here to deal with, and I figured I would give you a few examples, using data a little bit like yours, but massively simplified.
CREATE TABLE #STKA (V6PART INT, OTHER_DATA VARCHAR(50));
CREATE TABLE #METHDM (AQPART INT, KIT_ID INT, SOME_DATE DATETIME, OTHER_DATA VARCHAR(50));
CREATE TABLE #SERI (HTPART INT, OTHER_DATA VARCHAR(50));
CREATE TABLE #OCRI (DDPART INT, OTHER_DATA VARCHAR(50));
INSERT INTO #STKA SELECT 1, NULL UNION ALL SELECT 2, NULL UNION ALL SELECT 3, NULL; --1, 2, 3 Ids
INSERT INTO #METHDM SELECT 1, 1, '20200108 10:00', NULL UNION ALL SELECT 1, 2, '20200108 11:00', NULL UNION ALL SELECT 2, 1, '20200108 13:00', NULL; --1 Id appears twice, 2 Id once, no 3 Id
INSERT INTO #SERI SELECT 1, NULL UNION ALL SELECT 3, NULL; --1 and 3 Ids
INSERT INTO #OCRI SELECT 1, NULL UNION ALL SELECT 4, NULL; --1 and 4 Ids
So fundamentally we have a few issues here:
- the first problem is that the IDs in the tables differ: one table has an ID #4, but this isn't in any of the others;
- the second issue is that we have multiple rows for the same ID in one table;
- the third issue is that some tables are "missing" IDs that are in other tables, which you already covered by using LEFT JOINs, so I will ignore this.
--This will select ID 1 twice, 2 once, 3 once, and miss 4 completely
SELECT
*
FROM
#STKA
LEFT JOIN #METHDM ON #METHDM.AQPART = #STKA.V6PART
LEFT JOIN #SERI ON #SERI.HTPART = #STKA.V6PART
LEFT JOIN #OCRI ON #OCRI.DDPART = #STKA.V6PART;
So the problem here is that we don't have every ID in our "anchor" table STKA, and in fact there's no single table that has every ID in it. Now your data might be fine here, but if it isn't then you can simply add a step to find every ID, and use this as the anchor.
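If you don't have SQL Server handy, the row counts from that plain LEFT JOIN can be reproduced with a quick sketch in Python's sqlite3. I replaced the # temp tables with ordinary tables and dropped the OTHER_DATA columns; the extra OCRI row inserted at the end is my own addition, to show how the doubling compounds once two joined tables each have several rows for the same part.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STKA (V6PART INTEGER);
    CREATE TABLE METHDM (AQPART INTEGER, KIT_ID INTEGER, SOME_DATE TEXT);
    CREATE TABLE SERI (HTPART INTEGER);
    CREATE TABLE OCRI (DDPART INTEGER);
    INSERT INTO STKA VALUES (1), (2), (3);
    INSERT INTO METHDM VALUES
        (1, 1, '2020-01-08 10:00'), (1, 2, '2020-01-08 11:00'),
        (2, 1, '2020-01-08 13:00');
    INSERT INTO SERI VALUES (1), (3);
    INSERT INTO OCRI VALUES (1), (4);
""")


def rows_per_part():
    """Count how many result rows each STKA part produces after the joins."""
    rows = conn.execute("""
        SELECT STKA.V6PART
        FROM STKA
        LEFT JOIN METHDM ON METHDM.AQPART = STKA.V6PART
        LEFT JOIN SERI ON SERI.HTPART = STKA.V6PART
        LEFT JOIN OCRI ON OCRI.DDPART = STKA.V6PART
    """).fetchall()
    return Counter(part for (part,) in rows)


before = rows_per_part()

# Hypothetical second order for part 1: 2 METHDM rows x 2 OCRI rows.
conn.execute("INSERT INTO OCRI VALUES (1)")
after = rows_per_part()
```

With the original data part 1 comes back twice and part 4 never appears; after the extra order, part 1 comes back four times, because each METHDM row pairs with each OCRI row.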
--This will select each ID, but still doubles up on ID 1
WITH Ids AS (
SELECT V6PART AS ID FROM #STKA
UNION
SELECT AQPART AS ID FROM #METHDM
UNION
SELECT HTPART AS ID FROM #SERI
UNION
SELECT DDPART AS ID FROM #OCRI)
SELECT
*
FROM
Ids I
LEFT JOIN #STKA ON #STKA.V6PART = I.Id
LEFT JOIN #METHDM ON #METHDM.AQPART = I.Id
LEFT JOIN #SERI ON #SERI.HTPART = I.Id
LEFT JOIN #OCRI ON #OCRI.DDPART = I.Id;
That's using a common-table expression, but a subquery would also do the job. However, this still leaves us with an issue where ID 1 appears twice in the list, because it has multiple rows in one of the sub-tables.
One way to fix this is to pick the row with the latest date, or any other ORDER you can apply to the data:
--Pick the best row for the table where it has multiple rows, now we get one row per ID
WITH Ids AS (
SELECT V6PART AS ID FROM #STKA
UNION
SELECT AQPART AS ID FROM #METHDM
UNION
SELECT HTPART AS ID FROM #SERI
UNION
SELECT DDPART AS ID FROM #OCRI),
BestMETHDM AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY AQPART ORDER BY SOME_DATE DESC) AS ORDER_ID
FROM
#METHDM)
SELECT
*
FROM
Ids I
LEFT JOIN #STKA ON #STKA.V6PART = I.Id
LEFT JOIN BestMETHDM ON BestMETHDM.AQPART = I.Id AND BestMETHDM.ORDER_ID = 1
LEFT JOIN #SERI ON #SERI.HTPART = I.Id
LEFT JOIN #OCRI ON #OCRI.DDPART = I.Id;
Of course you could also add some aggregation (SUM, MAX, MIN, AVG, etc.) to fix this problem (if it is indeed an issue). Also, I used a common-table expression, but this would work just as well with a subquery.
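For anyone who wants to run the final query without SQL Server, here is a sketch of the same idea in Python's sqlite3 (window functions need SQLite 3.25 or newer). The # temp tables become ordinary tables and the OTHER_DATA columns are dropped; those simplifications are mine, the data is the sample data from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STKA (V6PART INTEGER);
    CREATE TABLE METHDM (AQPART INTEGER, KIT_ID INTEGER, SOME_DATE TEXT);
    CREATE TABLE SERI (HTPART INTEGER);
    CREATE TABLE OCRI (DDPART INTEGER);
    INSERT INTO STKA VALUES (1), (2), (3);
    INSERT INTO METHDM VALUES
        (1, 1, '2020-01-08 10:00'), (1, 2, '2020-01-08 11:00'),
        (2, 1, '2020-01-08 13:00');
    INSERT INTO SERI VALUES (1), (3);
    INSERT INTO OCRI VALUES (1), (4);
""")

rows = conn.execute("""
    WITH Ids AS (
        SELECT V6PART AS ID FROM STKA
        UNION SELECT AQPART FROM METHDM
        UNION SELECT HTPART FROM SERI
        UNION SELECT DDPART FROM OCRI
    ),
    BestMETHDM AS (
        SELECT AQPART, KIT_ID,
               ROW_NUMBER() OVER (PARTITION BY AQPART
                                  ORDER BY SOME_DATE DESC) AS ORDER_ID
        FROM METHDM
    )
    SELECT I.ID, STKA.V6PART, BestMETHDM.KIT_ID, SERI.HTPART, OCRI.DDPART
    FROM Ids I
    LEFT JOIN STKA ON STKA.V6PART = I.ID
    LEFT JOIN BestMETHDM ON BestMETHDM.AQPART = I.ID
                        AND BestMETHDM.ORDER_ID = 1
    LEFT JOIN SERI ON SERI.HTPART = I.ID
    LEFT JOIN OCRI ON OCRI.DDPART = I.ID
    ORDER BY I.ID
""").fetchall()

ids = [row[0] for row in rows]
```

Each ID now appears once, ID 4 is included even though it only exists in OCRI, and ID 1 keeps the METHDM row with the latest SOME_DATE (KIT_ID 2).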
Expanding on a comment made on the question:
I would say I will start with SERI as that table contains the entire inventory for our facility and should cover the other tables
However the question said
SERI is our inventory table with 37180 records. (active inventory that is available)
In my experience, active inventory isn't the same as all parts.
Normally, in a query like this, I'd expect the first table to be a Parts Master table of some sort that contains every possible part ID.
I'm trying to select all records from table A which are related to X certain records in table B (m:n relationship).
These are my tables:
hs_hotels:
id - name
hs_hotel_options:
hotel_id - option_id
hs_options:
id - name
Now I have, let's say 3, IDs of certain records from table hs_options and I'm looking for all records from table hs_hotels which are related to the given records from hs_options and also fit some other conditions. At the end, I want a set of records from hs_hotels.
This is my current attempt:
SELECT `hs_hotels`.*
FROM `hs_hotels`
RIGHT JOIN `hs_hotel_options` ON (`hs_hotels`.`id` = `hs_hotel_options`.`hotel_id`)
WHERE `hs_hotel_options`.`option_id` IN (1, 3, 5)
GROUP BY `hs_hotels`.`id`
1, 3 and 5 are IDs of records from table hs_options. This just selects all records from hs_hotels which are related to 1 of the 3 given records from hs_options instead of all 3 records. The problem I have is that I don't know how to solve this. I could try to select all records WHERE option_id = 1 AND option_id = 3 AND... but of course this won't work.
Do you have any ideas how to do this? Thanks in advance!
Similar to Tom H.'s answer, you need to count the matching options and then associate that count with the hotel.
You can use INNER JOIN, EXISTS, or IN. I chose INNER JOIN:
SELECT
h.*
FROM
hs_hotels h
INNER JOIN (
SELECT
hotel_id
FROM
hs_hotel_options
WHERE
option_id IN (1, 3, 5)
GROUP BY
hotel_id
-- assumes each (hotel_id, option_id) pair is unique; otherwise use COUNT(DISTINCT option_id)
HAVING COUNT(*) = 3) req_options
ON h.id = req_options.hotel_id
SELECT
<column list>
FROM
hs_hotels H
WHERE
( SELECT COUNT(*)
FROM
hs_hotel_options HO
WHERE
HO.hotel_id = H.id AND
HO.option_id IN (1, 3, 5)
) = 3
I don't know if this subquery syntax is supported under your RDBMS or not.
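Both approaches can be tried side by side with a small sketch in Python's sqlite3 (not MySQL). The hotel and option names, the sample links, and the two helper function names are made up for illustration; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hs_hotels (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE hs_options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE hs_hotel_options (
        hotel_id INTEGER, option_id INTEGER,
        PRIMARY KEY (hotel_id, option_id));
    -- Hypothetical sample data.
    INSERT INTO hs_hotels VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO hs_options VALUES
        (1, 'pool'), (2, 'gym'), (3, 'wifi'), (4, 'bar'), (5, 'spa');
    INSERT INTO hs_hotel_options VALUES
        (1, 1), (1, 2), (1, 3), (1, 5),  -- Alpha: has 1, 3, 5 plus an extra
        (2, 1), (2, 3),                  -- Beta: lacks option 5
        (3, 5);                          -- Gamma: only option 5
""")


def hotels_with_all_options_join(option_ids):
    """Derived-table variant: count matching options per hotel, then join."""
    wanted = sorted(set(option_ids))
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT h.name
        FROM hs_hotels h
        INNER JOIN (
            SELECT hotel_id
            FROM hs_hotel_options
            WHERE option_id IN ({placeholders})
            GROUP BY hotel_id
            HAVING COUNT(DISTINCT option_id) = ?
        ) req_options ON h.id = req_options.hotel_id
        ORDER BY h.name
    """
    return [name for (name,) in conn.execute(sql, (*wanted, len(wanted)))]


def hotels_with_all_options_subquery(option_ids):
    """Correlated-subquery variant: count matching options for each hotel row."""
    wanted = sorted(set(option_ids))
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT h.name
        FROM hs_hotels h
        WHERE (SELECT COUNT(*)
               FROM hs_hotel_options ho
               WHERE ho.hotel_id = h.id
                 AND ho.option_id IN ({placeholders})) = ?
        ORDER BY h.name
    """
    return [name for (name,) in conn.execute(sql, (*wanted, len(wanted)))]


all_three = hotels_with_all_options_join([1, 3, 5])
```

The count to compare against is the number of distinct requested options, so the same query works for any list length. Only the ? placeholders are built into the SQL string; the ids themselves are passed as bound parameters.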