Select rows from a filtered portion of Table A where a column matches a relationship with a column from the row in Table B that matches by ID - sql

I want to select all rows from one table where a column satisfies a comparison against a column in the row of a different table that has the same ID.
Concretely, I have two tables, orders and product_info, that I'm accessing through Amazon Redshift.
Orders
| ID | Date | Amount | Region |
| -- | -------- | ------ | ------ |
| 1 | 2019/4/1 | $120 | A |
| 1 | 2019/4/4 | $100 | A |
| 2 | 2019/4/2 | $50 | A |
| 3 | 2019/4/6 | $70 | B |
The partition keys of orders are region and date.
Product Information
| ID | Release Date | Region |
| ---- | ------------ | ------ |
| 1 | 2019/4/2 | A |
| 2 | 2019/4/3 | A |
| 3 | 2019/4/5 | B |
The primary key of product information is id, and the partition key is region.
I want to get all rows from Orders in region A where the date of the row is greater than the release date value in product information for that ID.
So in this case it should return just one row,
| 1 | 2019/4/4 | $100 | A |
I tried doing
select *
from orders
INNER JOIN product_info ON orders.date>product_info.release_date
AND orders.id=product_info.id
AND orders.region='A'
AND product_info.region='A'
limit 10
The problem is that this query was absurdly slow (cancelled it after 10 minutes). The tables are extremely large, and I have a feeling it was scanning the entire table without restricting it to region first (in reality I have other filters in addition to region that I want to apply to the list of IDs before I do the inner join, but I've limited it to only region for the sake of simplifying the question).
How can I efficiently write this type of query?

The best way to make an SQL query faster is to exclude rows as early as possible.
So, rather than putting conditions like orders.region = 'A' in the JOIN clause, move them to the WHERE clause so that those rows can be eliminated before they are joined.
Also, keep the JOIN condition as simple as possible so that the database can optimize the comparison.
Try something like this:
SELECT *
FROM orders
INNER JOIN product_info ON orders.id = product_info.id
WHERE orders.region = 'A'
AND product_info.region = 'A'
AND orders.date > product_info.release_date
Any further optimization would require consideration of the DISTKEY and SORTKEY on the Redshift tables. (Preferably a DISTKEY of id and a SORTKEY of date).
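As a quick sanity check of the rewritten query's logic (not its Redshift performance), here is a minimal sketch that runs it against the sample data from the question. It assumes an in-memory SQLite database via Python's sqlite3 module as a stand-in for Redshift, and stores the dates as ISO strings so that string comparison matches date order; both are assumptions made only for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the Redshift tables (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, date TEXT, amount INTEGER, region TEXT);
    CREATE TABLE product_info (id INTEGER PRIMARY KEY, release_date TEXT, region TEXT);
    INSERT INTO orders VALUES
        (1, '2019-04-01', 120, 'A'),
        (1, '2019-04-04', 100, 'A'),
        (2, '2019-04-02', 50, 'A'),
        (3, '2019-04-06', 70, 'B');
    INSERT INTO product_info VALUES
        (1, '2019-04-02', 'A'),
        (2, '2019-04-03', 'A'),
        (3, '2019-04-05', 'B');
""")

# Same shape as the answer's query: simple equality join, filters in WHERE.
query = """
    SELECT orders.id, orders.date, orders.amount, orders.region
    FROM orders
    INNER JOIN product_info ON orders.id = product_info.id
    WHERE orders.region = 'A'
      AND product_info.region = 'A'
      AND orders.date > product_info.release_date
"""
rows = conn.execute(query).fetchall()
print(rows)  # the single order placed after its product's release date
```

Only the order for ID 1 dated 2019/4/4 survives, which matches the expected result in the question.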

Related

SQL Server selecting data as array from two tables

I have a database in which there are two tables tableA, tableB. Now for each primary id in tableA there may be multiple rows in tableB.
Table A primary key (ServiceOrderId)
+----------------+-------+-------+-------------+
| ServiceOrderId | Tax | Total | OrderNumber |
+----------------+-------+-------+-------------+
| 12 | 45.00 | 347 | 1011 |
+----------------+-------+-------+-------------+
Table B foreign key (ServiceOrderId)
+----+-------------+---------------------+----------+-------+------+----------------+
| Id | ServiceName | ServiceDescription | Quantity | Price | Cost | ServiceOrderId |
+----+-------------+---------------------+----------+-------+------+----------------+
| 39 | MIN-C | Commercial Pretreat | NULL | 225 | 23 | 12 |
+----+-------------+---------------------+----------+-------+------+----------------+
| 40 | MIN-C | Commercial Pretreat | NULL | 225 | 25 | 12 |
+----+-------------+---------------------+----------+-------+------+----------------+
Is there a way in which I can fetch the values as an array of multiple rows of tableB with single row of tableA. Because when I am saving to database I am using temp table to save multiple rows of tableB with single row of tableA.
Query I am using
SELECT
ordr.*,
info.*
FROM
tblServiceOrder as ordr
JOIN
tblServiceOrderInfo as info ON ordr.ServiceOrderId = info.ServiceOrderId
But above query is giving two rows for each ServiceOrderId. I am using node api to fetch data. I want something like;
Object:{
objectA:{id:12,tax:45.00,total:347,ordernumber:1011},
objectB:[
{id:39,servicename:'MIN-C',description:'Commercial Pretreat',Quantity :NULL,Price:225,Cost:23,ServiceOrderId:12 },
{id:40,servicename:'MIN-C',description:'Commercial Pretreat',Quantity :NULL,Price:225,Cost:25,ServiceOrderId:12}
]
}
There are several solutions. The first is to keep your SELECT but add ORDER BY ServiceOrderId. Then, when converting the rows to objects, loop over them: take the ordr columns only from the first row of each new ServiceOrderId, and append the info columns of every row to that order's array.
Another possibility is to select from the ordr table only and, for every row, run another select on the info table by ServiceOrderId. This solution should not be used for huge tables.
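The first approach (one sorted join result, grouped in application code) might look roughly like the sketch below. It is written in Python rather than Node purely for illustration; the `joined_rows` list, the `nest_rows` helper and the column subsets are hypothetical, standing in for whatever the database driver returns.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical joined rows, already sorted by ServiceOrderId
# (as the ORDER BY in the query would return them).
joined_rows = [
    {"ServiceOrderId": 12, "Tax": 45.00, "Total": 347, "OrderNumber": 1011,
     "Id": 39, "ServiceName": "MIN-C", "Cost": 23},
    {"ServiceOrderId": 12, "Tax": 45.00, "Total": 347, "OrderNumber": 1011,
     "Id": 40, "ServiceName": "MIN-C", "Cost": 25},
]

ORDER_COLUMNS = ("ServiceOrderId", "Tax", "Total", "OrderNumber")
INFO_COLUMNS = ("Id", "ServiceName", "Cost", "ServiceOrderId")


def nest_rows(rows):
    """Collapse sorted join rows into one object per ServiceOrderId."""
    result = []
    for _, group in groupby(rows, key=itemgetter("ServiceOrderId")):
        group = list(group)
        first = group[0]  # order columns repeat on every row, so take them once
        result.append({
            "objectA": {col: first[col] for col in ORDER_COLUMNS},
            "objectB": [{col: row[col] for col in INFO_COLUMNS} for row in group],
        })
    return result


nested = nest_rows(joined_rows)
print(nested)
```

Because `groupby` only merges adjacent rows, the ORDER BY on ServiceOrderId is what makes this single pass correct.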

SQL / Oracle to Tableau - How to combine to sort based on two fields?

I have tables below as follows:
tbl_tasks
+---------+-------------+
| Task_ID | Assigned_ID |
+---------+-------------+
| 1 | 8 |
| 2 | 12 |
| 3 | 31 |
+---------+-------------+
tbl_resources
+---------+-----------+
| Task_ID | Source_ID |
+---------+-----------+
| 1 | 4 |
| 1 | 10 |
| 2 | 42 |
| 4 | 8 |
+---------+-----------+
A task is assigned to at least one person (denoted by the "assigned_ID") and then any number of people can be assigned as a source (denoted by "source_ID"). The ID numbers are all linked to names in another table. Though the ID numbers are named differently, they all return to the same table.
Would there be any way for me to combine the two tables based on ID such that I could search based on someone's ID number? For example, if I filter with WHERE User_ID = 8 to see which tasks user 8 is involved in, I would get back Task 1 and Task 4.
Right now, by joining all the tables together, I can easily filter on "Assigned" but not "Source" due to all the multiple entries in the table.
Use union all:
select distinct task_id
from ((select task_id, assigned_id as id
from tbl_tasks
) union all
(select task_id, source_id
from tbl_resources
)
) ti
where id = ?;
Note that this uses select distinct in case someone is assigned to the same task in both tables. If not, remove the distinct.
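To see the union approach on the sample data, here is a small sketch using an in-memory SQLite database through Python's sqlite3 module (an assumption; the question targets Oracle). SQLite does not accept parentheses around the individual selects of a compound query, so they are dropped here, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_tasks (task_id INTEGER, assigned_id INTEGER);
    CREATE TABLE tbl_resources (task_id INTEGER, source_id INTEGER);
    INSERT INTO tbl_tasks VALUES (1, 8), (2, 12), (3, 31);
    INSERT INTO tbl_resources VALUES (1, 4), (1, 10), (2, 42), (4, 8);
""")

# Stack both ID columns into one "id" column, then filter on it.
query = """
    SELECT DISTINCT task_id
    FROM (
        SELECT task_id, assigned_id AS id FROM tbl_tasks
        UNION ALL
        SELECT task_id, source_id FROM tbl_resources
    ) ti
    WHERE id = ?
    ORDER BY task_id
"""
tasks_for_8 = [row[0] for row in conn.execute(query, (8,))]
print(tasks_for_8)
```

For ID 8 this yields tasks 1 (assigned) and 4 (source), as the question asks.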

1 to Many Query: Help Filtering Results

Problem: SQL Query that looks at the values in the "Many" relationship, and doesn't return values from the "1" relationship.
Tables Example: (this shows two different tables).
+---------------+----------------------------+-------+
| Unique Number | <-- Table 1 -- Table 2 --> | Roles |
+---------------+----------------------------+-------+
| 1 | | A |
| 2 | | B |
| 3 | | C |
| 4 | | D |
| 5 | | |
| 6 | | |
| 7 | | |
| 8 | | |
| 9 | | |
| 10 | | |
+---------------+----------------------------+-------+
When I run my query, I get multiple, unique numbers that show all of the roles associated to each number like so.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 2 | A |
| 2 | B |
| 3 | A |
| 3 | B |
| 4 | C |
| 4 | A |
| 5 | B |
| 5 | C |
| 5 | D |
| 6 | D |
| 6 | A |
+---------------+-------+
I would like to be able to run my query and be able to say, "When the role of A is present, don't even show me the unique numbers that have the role of A".
Maybe if SQL could look at the roles and say, WHEN role A comes up, grab unique number and remove it from column 1.
Based on what I would "like" to happen (I put that in quotations as this might not even be possible) the following is what I would expect my query to return.
+---------------+-------+
| Unique Number | Roles |
+---------------+-------+
| 1 | C |
| 1 | D |
| 5 | B |
| 5 | C |
| 5 | D |
+---------------+-------+
UPDATE:
Query Example: I am querying 8 tables, but I condensed it to 4 for simplicity.
SELECT
c.UniqueNumber,
cp.pType,
p.pRole,
a.aRole
FROM c
JOIN cp ON cp.uniqueVal = c.uniqueVal
JOIN p ON p.uniqueVal = cp.uniqueVal
LEFT OUTER JOIN a ON a.uniqueVal = p.uniqueVal
WHERE
--I do some basic filtering to get to the relevant clients data but nothing more than that.
ORDER BY
c.uniqueNumber
Table sizes: these tables can have anywhere from 50,000 rows to 500,000+
Pretending the table name is t and the column names are alpha and numb:
SELECT t.numb, t.alpha
FROM t
LEFT JOIN t AS s ON t.numb = s.numb
AND s.alpha = 'A'
WHERE s.numb IS NULL;
You can also do a subselect:
SELECT numb, alpha
FROM t
WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A');
Or one of the following if the subselect is being materialized more than once (pick whichever is faster, i.e., the one with the smaller subtable):
SELECT t.numb, t.alpha
FROM t
JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') = 0) AS s USING (numb);
SELECT t.numb, t.alpha
FROM t
LEFT JOIN (SELECT numb FROM t GROUP BY numb HAVING SUM(alpha = 'A') > 0) AS s USING (numb)
WHERE s.numb IS NULL;
But the first one is probably faster and better[1]. Any of these methods can be folded into a larger query with multiple additional tables being joined in.
[1] Straight joins tend to be easier to read and faster to execute than queries involving subselects. The common exceptions are exceptionally rare for self-referential joins, as they require a large mismatch in the size of the tables. You might hit those exceptions, though, if the number of rows that reference the 'A' alpha value is exceptionally small and the column is properly indexed.
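The LEFT JOIN ... IS NULL form and the NOT IN form can be checked against the sample data from the question. The sketch below assumes an in-memory SQLite database via Python's sqlite3 module and the placeholder names t, numb and alpha used above; the ORDER BY clauses are added only so the two results can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (numb INTEGER, alpha TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, "C"), (1, "D"), (2, "A"), (2, "B"), (3, "A"), (3, "B"),
    (4, "C"), (4, "A"), (5, "B"), (5, "C"), (5, "D"), (6, "D"), (6, "A"),
])

# Anti-join: keep rows of t whose numb has no matching 'A' row.
anti_join = """
    SELECT t.numb, t.alpha
    FROM t
    LEFT JOIN t AS s ON t.numb = s.numb AND s.alpha = 'A'
    WHERE s.numb IS NULL
    ORDER BY t.numb, t.alpha
"""
# Uncorrelated subselect: same result via NOT IN.
not_in = """
    SELECT numb, alpha
    FROM t
    WHERE numb NOT IN (SELECT numb FROM t WHERE alpha = 'A')
    ORDER BY numb, alpha
"""
anti_rows = conn.execute(anti_join).fetchall()
not_in_rows = conn.execute(not_in).fetchall()
print(anti_rows)
```

Both forms return only the rows for numbers 1 and 5. One caveat worth keeping in mind: if the subselect can return a NULL numb, NOT IN matches nothing, while the LEFT JOIN form is unaffected.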
There are many ways to do it, and the trade-offs depend on factors such as the size of the tables involved and what indexes are available. On general principles, my first instinct is to avoid a correlated subquery such as another, now-deleted answer proposed, but if the relationship table is small then it probably doesn't matter.
This version instead uses an uncorrelated subquery in the where clause, in conjunction with the not in operator:
select num, role
from one_to_many
where num not in (select otm2.num from one_to_many otm2 where otm2.role = 'A')
That form might be particularly effective if there are many rows in one_to_many, but only a small proportion have role A. Of course you can add an order by clause if the order in which result rows are returned is important.
There are also alternatives involving joining inline views or CTEs, and some of those might have advantages under particular circumstances.

SQL: Bug in Joining two tables

I have an item table from which I want to get the sum of item quantity.
Query:
Select item_id, Sum(qty) from item_tbl group by item_id
Result:
==================
| ID | Quantity |
===================
| 1 | 10 |
| 2 | 20 |
| 3 | 5 |
| 4 | 20 |
The second table is the invoice table, from which I get the item quantity that has been sold. I am joining these two tables as follows:
Query:
Select item_tbl.item_id, Sum(item_tbl.qty) as [item_qty],
-isnull(Sum(invoice.qty),0) as [invoice_qty]
from item_tbl
left join invoice on item_tbl.item_id = invoice.item_id group by item_tbl.item_id
Result:
=================================
| ID | item_qty | invoice_qty |
=================================
| 1 | 10 | -5 |
| 2 | 20 | -20 |
| 3 | 10 | -25 | <------ item_qty raised from 5 to 10 ??
| 4 | 20 | -20 |
I don't know if I am joining these tables in the right way. I want to get everything from the item table and whatever is available from the invoice table in order to maintain the inventory, so I used a left join. Help please.
Modification
When I added group by item_id, qty I got this:
=================================
| ID | item_qty | invoice_qty |
=================================
| 1 | 10 | -5 |
| 2 | 20 | -20 |
| 3 | 5 | -5 |
| 3 | 5 | -20 |
| 4 | 20 | -20 |
As it's a view, the ID is repeated. What should I do to avoid this?
Clearing things up, my answer from the comments explained:
With a left join (A left join B), one result row is created for every B record that matches an A record, so an A record with several matches is repeated. A row is also created for any A record that has no matching B record, with null values filling in the fields from B.
I would advise reading up on Using Joins in SQL when approaching such problems.
Below are 2 possible solutions, using different assumptions.
Solution A
Without any assumptions regarding primary key:
We have to sum the item quantity column to determine the total quantity, so two separate sums need to be performed. I would advise using a sub query for readability and simplicity.
select item_tbl.item_id, Sum(item_tbl.qty) as [item_qty], -isnull(Sum(invoice_grouped.qty),0) as [invoice_qty]
from item_tbl left join
(select invoice.item_id as item_id, Sum(invoice.qty) as qty from invoice group by item_id) invoice_grouped
on (invoice_grouped.item_id = item_tbl.item_id)
group by item_tbl.item_id
Solution B
Assuming item_id is primary key for item_tbl:
Now we know we can rely on the fact that there is only one quantity for each item_id, so we can do without the sub query by selecting any (max) of the item quantities in the join result, resulting in a quicker execution plan.
select item_tbl.item_id, Max(item_tbl.qty) as [item_qty], -isnull(Sum(invoice.qty),0) as [invoice_qty]
from item_tbl left join invoice on (invoice.item_id = item_tbl.item_id)
group by item_tbl.item_id
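The row duplication and the Solution A fix can be reproduced on a small dataset. The sketch below assumes an in-memory SQLite database via Python's sqlite3 module, so isnull is replaced by COALESCE, and the invoice rows are reconstructed from the result tables in the question rather than taken from real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_tbl (item_id INTEGER, qty INTEGER);
    CREATE TABLE invoice (item_id INTEGER, qty INTEGER);
    INSERT INTO item_tbl VALUES (1, 10), (2, 20), (3, 5), (4, 20);
    -- Item 3 has two invoice rows, which is what duplicates its item row.
    INSERT INTO invoice VALUES (1, 5), (2, 20), (3, 5), (3, 20), (4, 20);
""")

# Original shape: join first, then sum. Item 3's qty is counted twice.
naive = """
    SELECT item_tbl.item_id, SUM(item_tbl.qty), -COALESCE(SUM(invoice.qty), 0)
    FROM item_tbl
    LEFT JOIN invoice ON item_tbl.item_id = invoice.item_id
    GROUP BY item_tbl.item_id
    ORDER BY item_tbl.item_id
"""
# Solution A shape: aggregate invoices per item before joining.
pre_aggregated = """
    SELECT item_tbl.item_id, SUM(item_tbl.qty), -COALESCE(SUM(invoice_grouped.qty), 0)
    FROM item_tbl
    LEFT JOIN (
        SELECT item_id, SUM(qty) AS qty FROM invoice GROUP BY item_id
    ) invoice_grouped ON invoice_grouped.item_id = item_tbl.item_id
    GROUP BY item_tbl.item_id
    ORDER BY item_tbl.item_id
"""
naive_rows = conn.execute(naive).fetchall()
fixed_rows = conn.execute(pre_aggregated).fetchall()
print(naive_rows)
print(fixed_rows)
```

In the first result item 3 shows an item quantity of 10; after pre-aggregating the invoices it shows 5 again, while the invoice total stays at -25.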
If your database design is following the common rules, item_tbl.item_id must be unique.
So just change your query:
Select item_tbl.item_id, item_tbl.qty as [item_qty],
-isnull(Sum(invoice.qty),0) as [invoice_qty]
from item_tbl
left join invoice on item_tbl.item_id = invoice.item_id group by item_tbl.item_id, item_tbl.qty

Joining tables if the reference exists

I got a PostgreSQL database with 4 tables:
Table A
---------------------------
| ID | ID_B | ID_C | ID_D |
---------------------------
| 1 | 1 | NULL | NULL |
---------------------------
| 2 | NULL | 1 | NULL |
---------------------------
| 3 | 2 | 2 | 1 |
---------------------------
| 4 | NULL | NULL | 2 |
---------------------------
Table B
-------------
| ID | DATA |
-------------
| 1 | 123 |
-------------
| 2 | 456 |
-------------
Table C
-------------
| ID | DATA |
-------------
| 1 | 789 |
-------------
| 2 | 102 |
-------------
Table D
-------------
| ID | DATA |
-------------
| 1 | 654 |
-------------
| 2 | 321 |
-------------
I'm trying to retrieve a result set that joins the data from table B and the data from table C, but only for rows where at least one of the two IDs is not null.
SELECT "Table_A"."ID", "Table_A"."ID_B", "Table_A"."ID_C", "Table_A"."ID_D", "Table_B"."DATA", "Table_C"."DATA"
FROM "Table_A"
LEFT JOIN "Table_B" on "Table_A"."ID_B" = "Table_B"."ID"
LEFT JOIN "Table_C" on "Table_A"."ID_C" = "Table_C"."ID"
WHERE "Table_A"."ID_B" IS NOT NULL OR "Table_A"."ID_C" IS NOT NULL;
Is this recommended or should I better split this in multiple queries?
Is there a way to do an inner join between these tables?
The result I expect is:
-------------------------------------------------
| ID | ID_B | ID_C | ID_D | DATA (B) | DATA (C) |
-------------------------------------------------
| 1 | 1 | NULL | NULL | 123 | NULL |
-------------------------------------------------
| 2 | NULL | 1 | NULL | NULL | 789 |
-------------------------------------------------
| 3 | 2 | 2 | 1 | 456 | 102 |
-------------------------------------------------
EDIT: ID_B, ID_C, ID_D are foreign keys to the tables table_b, table_c, table_d
The condition WHERE "Table_A"."ID_B" IS NOT NULL OR "Table_A"."ID_C" IS NOT NULL can be replaced by the corresponding condition on the B and C tables: WHERE "Table_B"."ID" IS NOT NULL OR "Table_C"."ID" IS NOT NULL. This also works if table_a.id_b and table_a.id_c are not FKs to the B and C tables. Without FKs, the original condition would let a table_a row such as {5, 5, 5, 5} through with NULL data from both the B and C tables.
SELECT ta."ID" AS a_id
, ta."ID_B" AS b_id
, ta."ID_C" AS c_id
, ta."ID_D" AS d_id
, tb."DATA" AS bdata
, tc."DATA" AS cdata
FROM "Table_A" ta
LEFT JOIN "Table_B" tb on ta."ID_B" = tb."ID"
LEFT JOIN "Table_C" tc on ta."ID_C" = tc."ID"
WHERE tb."ID" IS NOT NULL OR tc."ID" IS NOT NULL
;
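The difference between filtering on Table_A's reference columns and filtering on the joined IDs only shows up when a reference points at nothing. The sketch below illustrates that under stated assumptions: an in-memory SQLite database via Python's sqlite3 module, lower-case unquoted names, no foreign key constraints, and one hypothetical extra row {5, 5, 5, 5} that is not in the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, id_b INTEGER, id_c INTEGER, id_d INTEGER);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, data INTEGER);
    CREATE TABLE table_c (id INTEGER PRIMARY KEY, data INTEGER);
    INSERT INTO table_a VALUES
        (1, 1, NULL, NULL), (2, NULL, 1, NULL), (3, 2, 2, 1), (4, NULL, NULL, 2),
        (5, 5, 5, 5);  -- hypothetical row whose references match nothing
    INSERT INTO table_b VALUES (1, 123), (2, 456);
    INSERT INTO table_c VALUES (1, 789), (2, 102);
""")

base = """
    SELECT ta.id, tb.data, tc.data
    FROM table_a ta
    LEFT JOIN table_b tb ON ta.id_b = tb.id
    LEFT JOIN table_c tc ON ta.id_c = tc.id
    WHERE {condition}
    ORDER BY ta.id
"""
# Filter on the reference columns of table_a (the question's version).
filter_on_a = conn.execute(
    base.format(condition="ta.id_b IS NOT NULL OR ta.id_c IS NOT NULL")).fetchall()
# Filter on the joined tables' IDs (this answer's version).
filter_on_bc = conn.execute(
    base.format(condition="tb.id IS NOT NULL OR tc.id IS NOT NULL")).fetchall()
print(filter_on_a)
print(filter_on_bc)
```

With the dangling row present, the first filter returns row 5 with NULL data from both tables and the second drops it. With foreign keys enforced, such a row cannot exist and the two filters agree.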
Since you have foreign key constraints in place, referential integrity is guaranteed and the query in your question is already the best answer.
Indexes on Table_B.ID and Table_C.ID can also be taken as given.
If matching cases in Table_A are rare (less than ~5 %, depending on row width and data distribution), a partial multi-column index would help performance:
CREATE INDEX table_a_special_idx ON "Table_A" ("ID_B", "ID_C")
WHERE "ID_B" IS NOT NULL OR "ID_C" IS NOT NULL;
In PostgreSQL 9.2 a covering index (index-only scan in Postgres parlance) might help even more - in which case you would include all columns of interest in the index (not in my example). Depends on several factors like row width and frequency of updates in your table.
Given your requirements, your query seems good to me.
An alternative would be to use nested selects in the projection, but depending on your data, indexes and constraints, that might be slower, as nested selects usually result in nested loops, whereas joins can be performed as merge joins or nested loops:
SELECT
"Table_A"."ID",
"Table_A"."ID_B",
"Table_A"."ID_C",
"Table_A"."ID_D",
(SELECT "DATA" FROM "Table_B" WHERE "Table_A"."ID_B" = "Table_B"."ID"),
(SELECT "DATA" FROM "Table_C" WHERE "Table_A"."ID_C" = "Table_C"."ID")
FROM "Table_A"
WHERE "Table_A"."ID_B" IS NOT NULL OR "Table_A"."ID_C" IS NOT NULL;
If Postgres does scalar subquery caching (as Oracle does), then nested selects might help in case you have a lot of data repetition in Table_A
Generally speaking, the recommended way is to do it in one query only and let the database do as much work as possible, especially if you later add other operations like sorting (order by) or pagination (limit ... offset ...). We have done some measurements, and there is no way to sort/paginate faster in Java/Scala if you use any of the higher-level collections like lists.
RDBMSs deal very well with single complex statements but have difficulty handling many small queries. For example, querying the "one" and the "many" side of a relation in one query will be faster than doing it in 1 + n select statements.
As for the outer join, we have done measurements, and there is no real performance penalty compared with inner joins. So if your data model and/or your query require an outer join, just do it. If it was a performance problem, you can tune it later.
As for your null comparisons, it might indicate that your data model could be optimized, but that is just a guess. Chances are that you can improve the design so that null is not allowed in these columns.