I have two tables (orders, agents) that I'm trying to join on either of two columns, but not both. Some records in orders have both of these columns populated, and those come back as duplicates.
orders:
|id|order|agent_id|username|
|--+-----+--------+--------|
| 1| ord1| 5| user1|
| 2| ord2| 6| user2|
| 3| ord3| 7| user3|
agents:
|id|agent|username|FName|LName|
|--+-----+--------+-----+-----|
| 5|agnt5| user2|FNam5|LNam5|
| 6|agnt6| user3|FNam6|LNam6|
| 7|agnt7| user4|FNam7|LNam7|
I tried joining with an OR clause
select o.id, o.order, o.agent_id, o.username, a.Fname, a.LName
from orders o
left join agents a
on a.id = o.agent_id or a.username = o.username
I'm getting the following results
|id|order|agent_id|username|Fname|LName|
|--+-----+--------+--------+-----+-----|
| 1| ord1| 5| user2|FNam5|LNam5|
| 1| ord1| 5| user2|FNam5|LNam5|
| 2| ord2| 6| user3|FNam6|LNam7|
| 2| ord2| 6| user3|FNam6|LNam7|
| 3| ord3| 7| user4|FNam5|LNam5|
Expected Results
|id|order|agent_id|username|Fname|LName|
|--+-----+--------+--------+-----+-----|
| 1| ord1| 5| user2|FNam5|LNam5|
| 2| ord2| 6| user3|FNam6|LNam7|
| 3| ord3| 7| user4|FNam5|LNam5|
It looks like, when both the agent_id and the username match, the join matches on both and duplicates the row in my results. Is there a way to prevent the username match when the agent_id match is present?
You can left join twice, with the condition that evicts the second join if the first one matches:
select
o.id,
o.order,
o.agent_id,
o.username,
coalesce(a1.fname, a2.fname) as fname,
coalesce(a1.lname, a2.lname) as lname
from orders o
left join agents a1 on a1.id = o.agent_id
left join agents a2 on a1.id is null and a2.username = o.username
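To check the double-join idea against the sample data, here is a minimal sketch using Python's built-in sqlite3 module (an assumption; the question does not name a database). Order 4 is a hypothetical extra row with no agent matching on ID, added only to exercise the username fallback. The second join compares a2.username, so it can only contribute when the first join found nothing.

```python
import sqlite3

def fetch_orders_with_agents():
    conn = sqlite3.connect(":memory:")
    # "order" is a reserved word, so the column name is quoted.
    # Order 4 is hypothetical: no agent has id 99, so only the username can match.
    conn.executescript("""
        CREATE TABLE orders (id INTEGER, "order" TEXT, agent_id INTEGER, username TEXT);
        CREATE TABLE agents (id INTEGER, agent TEXT, username TEXT, fname TEXT, lname TEXT);
        INSERT INTO orders VALUES
            (1, 'ord1', 5, 'user1'),
            (2, 'ord2', 6, 'user2'),
            (3, 'ord3', 7, 'user3'),
            (4, 'ord4', 99, 'user2');
        INSERT INTO agents VALUES
            (5, 'agnt5', 'user2', 'FNam5', 'LNam5'),
            (6, 'agnt6', 'user3', 'FNam6', 'LNam6'),
            (7, 'agnt7', 'user4', 'FNam7', 'LNam7');
    """)
    rows = conn.execute("""
        SELECT o.id,
               COALESCE(a1.fname, a2.fname) AS fname,
               COALESCE(a1.lname, a2.lname) AS lname
        FROM orders o
        LEFT JOIN agents a1 ON a1.id = o.agent_id
        LEFT JOIN agents a2 ON a1.id IS NULL AND a2.username = o.username
        ORDER BY o.id
    """).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for row in fetch_orders_with_agents():
        print(row)
```

Each order comes back exactly once: orders 1 to 3 resolve through agent_id, and the hypothetical order 4 falls back to the username match.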
Assuming the ID match takes precedence, you need to add an 'AND' so that the username match only applies when no agent matches on ID:
select o.id, o.order, o.agent_id, o.username, a.Fname, a.LName
from orders o
left join agents a
on a.id = o.agent_id
or (a.username = o.username
and not exists (select 1 from agents a2 where a2.id = o.agent_id))
I am not sure I understood your question well enough. If I did not, please clarify it for me.
To avoid these duplicate results you could use the DISTINCT clause (MySQL DISTINCT documentation).
select distinct o.id, o.order, o.agent_id, o.username, a.Fname, a.LName
from orders o
left join agents a
on a.id = o.agent_id or a.username = o.username
Another option would be to join on only one of the columns. But as I do not know the data this database will contain, I cannot recommend it with 100% certainty.
I have 2 tables:
Product
* product_id
* source_id
Source
* source_id
* priority
Sometimes one product_id is linked to several sources, and my task is to select, for each product, the row with the minimum priority. For example, from:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 2| 9|
| 10| 4| 2|
| 20| 2| 9|
| 20| 4| 2|
| 30| 2| 9|
| 30| 4| 2|
correct result should be like:
| product_id | source_id| priority|
|:----: |:------:| :-----:|
| 10| 4| 2|
| 20| 4| 2|
| 30| 4| 2|
I am using query:
SELECT p.product_id, p.source_id, s.priority FROM Product p
INNER JOIN Source s on s.source_id = p.source_id
WHERE s.priority = (SELECT Min(s1.priority) OVER (PARTITION BY p.product_id) FROM Source s1)
but it returns the error "this type of correlated subquery pattern is not supported yet". As I understand it, I can't use this variant in Redshift. How should this be solved? Are there any other ways?
You just need to unroll the WHERE clause into the second data source, and the easiest way to flag the minimum priority is the ROW_NUMBER() window function. As written, you're asking Redshift to rerun the window function for each JOIN ON test, which creates a lot of inefficiency in a clustered database. Try the following (untested):
SELECT product_id, source_id, priority
FROM (
SELECT p.product_id, p.source_id, s.priority,
ROW_NUMBER() OVER (PARTITION BY p.product_id ORDER BY s.priority) as row_num
FROM Product p
INNER JOIN Source s
on s.source_id = p.source_id) ranked
WHERE row_num = 1
Now the window function only runs once. You can also move the subquery to a CTE if that improves readability for your full case.
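The ROW_NUMBER() approach can be checked on the sample data from the question. This is a minimal sketch using Python's sqlite3 module rather than Redshift (an assumption made so it runs anywhere; it needs SQLite 3.25 or later for window functions). Because Source has no product_id column, the sketch numbers the rows after joining the two tables and filters in an outer query.

```python
import sqlite3

def min_priority_sources():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Product (product_id INTEGER, source_id INTEGER);
        CREATE TABLE Source (source_id INTEGER, priority INTEGER);
        INSERT INTO Product VALUES (10, 2), (10, 4), (20, 2), (20, 4), (30, 2), (30, 4);
        INSERT INTO Source VALUES (2, 9), (4, 2);
    """)
    # Number the joined rows per product by ascending priority, keep row 1.
    rows = conn.execute("""
        SELECT product_id, source_id, priority
        FROM (
            SELECT p.product_id, p.source_id, s.priority,
                   ROW_NUMBER() OVER (PARTITION BY p.product_id
                                      ORDER BY s.priority) AS row_num
            FROM Product p
            INNER JOIN Source s ON s.source_id = p.source_id
        ) ranked
        WHERE row_num = 1
        ORDER BY product_id
    """).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(min_priority_sources())
```

ROW_NUMBER() keeps exactly one row per product, even when two sources share the lowest priority.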
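I have already found the best solution for my case. The window result has to be computed in a subquery, because it cannot be referenced in the WHERE clause of the same SELECT:
SELECT product_id, source_id, priority
FROM (
SELECT
p.product_id
, p.source_id
, s.priority
, Min(s.priority) OVER (PARTITION BY p.product_id) as min_priority
FROM Product p
INNER JOIN Source s
ON s.source_id = p.source_id
) t
WHERE priority = min_priority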
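One behavioural difference is worth knowing when choosing the MIN() OVER variant: it keeps every row tied for the lowest priority, whereas ROW_NUMBER() keeps one. The sketch below illustrates this with Python's sqlite3 module (an assumption, not Redshift; SQLite 3.25 or later). Source 5 and its link to product 10 are hypothetical rows added to create a tie.

```python
import sqlite3

def min_priority_with_ties():
    conn = sqlite3.connect(":memory:")
    # Source 5 (priority 2) is hypothetical; it ties with source 4 for product 10.
    conn.executescript("""
        CREATE TABLE Product (product_id INTEGER, source_id INTEGER);
        CREATE TABLE Source (source_id INTEGER, priority INTEGER);
        INSERT INTO Product VALUES (10, 2), (10, 4), (10, 5), (20, 2), (20, 4);
        INSERT INTO Source VALUES (2, 9), (4, 2), (5, 2);
    """)
    # The window value is computed in a subquery and compared in the outer WHERE.
    rows = conn.execute("""
        SELECT product_id, source_id, priority
        FROM (
            SELECT p.product_id, p.source_id, s.priority,
                   MIN(s.priority) OVER (PARTITION BY p.product_id) AS min_priority
            FROM Product p
            INNER JOIN Source s ON s.source_id = p.source_id
        ) t
        WHERE priority = min_priority
        ORDER BY product_id, source_id
    """).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(min_priority_with_ties())
```

Product 10 returns two rows here. If exactly one row per product is required, a tie-breaking column in a ROW_NUMBER() ordering is the usual way to get it.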
I am doing a simple left outer join in PySpark and it is not giving correct results. Please see below. Value 5 (in column A) is between 1 (column B) and 10 (column C), so B and C should appear in the first row of the output table, but I'm getting nulls. I've tried this in three different RDBMSs (MS SQL, PostgreSQL, and SQLite), all of which give the correct results. Is this a possible bug in Spark?
Table x
+---+
| A|
+---+
| 5|
| 15|
| 20|
| 50|
+---+
Table y
+----+----+---+
| B| C| D|
+----+----+---+
| 1| 10|abc|
| 21| 30|xyz|
|null|null| mn|
| 11| 20| o|
+----+----+---+
SELECT x.a, y.b, y.c, y.d
FROM x LEFT OUTER JOIN
y
ON x.a >= y.b AND x.a <= y.c
+---+----+----+----+
| a| b| c| d|
+---+----+----+----+
| 5|null|null|null|
| 15| 11| 20| o|
| 20| 11| 20| o|
| 50|null|null|null|
+---+----+----+----+
LEFT JOIN syntax:
SELECT column1, column2 ...
FROM table_A
LEFT JOIN table_B ON join_condition
WHERE row_condition
Maybe it will help you
SELECT x.a, y.*
FROM x LEFT JOIN y ON x.id = y.xID
WHERE x.a >= y.b AND x.a <= y.c
The problem was that Spark loaded the columns as strings, not ints. Spark was doing the >= and <= comparisons on strings, which is why the results were off.
Casting the A, B and C columns to int resolved the problem.
x=x.withColumn('A',x['A'].cast('int'))
y=y.withColumn('B',y['B'].cast('int'))
y=y.withColumn('C',y['C'].cast('int'))
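The effect of comparing strings instead of integers can be reproduced without Spark. The following is a plain-Python sketch, not PySpark: left_range_join is a hypothetical helper that mimics the LEFT OUTER JOIN on x.a >= y.b AND x.a <= y.c, run once on string values and once on the same values cast to int.

```python
def left_range_join(xs, ys):
    """Left join xs to ys on b <= a <= c; rows with NULL (None) bounds never match."""
    out = []
    for a in xs:
        matches = [(a, b, c, d) for b, c, d in ys
                   if b is not None and c is not None and b <= a <= c]
        # Keep the left row with NULLs when nothing on the right matches.
        out.extend(matches or [(a, None, None, None)])
    return out

# Values as strings (how Spark loaded them) versus after casting to int.
x_str = ["5", "15", "20", "50"]
y_str = [("1", "10", "abc"), ("21", "30", "xyz"), (None, None, "mn"), ("11", "20", "o")]

x_int = [int(a) for a in x_str]
y_int = [(None if b is None else int(b), None if c is None else int(c), d)
         for b, c, d in y_str]

string_result = left_range_join(x_str, y_str)
int_result = left_range_join(x_int, y_int)

if __name__ == "__main__":
    print(string_result)  # "5" finds no match: "5" <= "10" is False for strings
    print(int_result)     # 5 matches the (1, 10, 'abc') row
```

As strings, "5" sorts after "10" because the comparison goes character by character, so the first row loses its match exactly as in the question's output.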
Imagine I have a main table like:
Table guys
|id| name|profession|
|--|------|----------|
| 1| John| developer|
| 2| Mike| boss|
| 3| Roger| fireman|
| 4| Bob| policeman|
I also have a localized version which is not complete (the boss is missing):
Table guys_bg
|id| name | profession|
|--|------|-----------|
| 1| Джон|разработчик|
| 3|Роджър| пожарникар|
| 4| Боб| полицай|
I want to prioritize guys_bg results while still showing all the guys (The boss is still a guy, right?).
This is the desired result:
|id| name | profession|
|--|------|-----------|
| 1| Джон|разработчик|
| 2| Mike| boss|
| 3|Роджър| пожарникар|
| 4| Боб| полицай|
Take into consideration that both tables may have a lot of (100+) columns so joining the tables and using CASE for every column will be very tedious.
What are my options?
Here is one way using union all:
select gb.*
from guys_bg gb
union all
select g.*
from guys g
where not exists (select 1 from guys_bg gb where gb.id = g.id);
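As a quick check of the UNION ALL plus NOT EXISTS pattern, here is a minimal sketch on the sample tables using Python's sqlite3 module (an assumption; the question does not name a database). It relies on both tables having the same columns in the same order, which is what makes the `*` selects line up.

```python
import sqlite3

def localized_guys():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE guys (id INTEGER, name TEXT, profession TEXT);
        CREATE TABLE guys_bg (id INTEGER, name TEXT, profession TEXT);
        INSERT INTO guys VALUES
            (1, 'John', 'developer'), (2, 'Mike', 'boss'),
            (3, 'Roger', 'fireman'), (4, 'Bob', 'policeman');
        INSERT INTO guys_bg VALUES
            (1, 'Джон', 'разработчик'), (3, 'Роджър', 'пожарникар'),
            (4, 'Боб', 'полицай');
    """)
    # Localized rows first, then only those base rows with no localized version.
    rows = conn.execute("""
        SELECT gb.* FROM guys_bg gb
        UNION ALL
        SELECT g.* FROM guys g
        WHERE NOT EXISTS (SELECT 1 FROM guys_bg gb WHERE gb.id = g.id)
    """).fetchall()
    conn.close()
    return sorted(rows)  # sort by id in Python for a stable order

if __name__ == "__main__":
    for row in localized_guys():
        print(row)
```

No column has to be listed by name, so the query stays the same size however many columns the tables have.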
You can also do it using a FULL JOIN.
SELECT
ISNULL(b.id,g.id) id
, ISNULL(b.name, g.name) name
, ISNULL(b.profession, g.profession) profession
FROM
guys g
FULL JOIN guys_bg b ON g.id = b.id
The scenario is simple. I have 4 tables: A, B, C1 and C2. A is a root-level table, B references A, and C1 and C2 reference B. But each B.ID can only be referenced by either C1 or C2, never both. The results are exported to a .CSV file which is then used for a variety of purposes, and the question here has to do with readability as well as making it easier to manage the information in external software.
I wrote a query that returns all data in all 4 tables keeping the relations intact, ordering them by A, B, C1 and C2.
SELECT A.*, B.*, C1.*, C2.*
FROM A
JOIN B
LEFT JOIN C1
LEFT JOIN C2
ORDER BY A.ID, B.ID, etc.
And got this:
A.ID | B.ID | C1.ID | C2.ID
1| 1| 1| NULL
1| 1| 2| NULL
1| 2| 1| NULL
1| 2| 2| NULL
1| 2| 3| NULL
2| 1| NULL| 1
2| 1| NULL| 2
....
Now, the question here is this: how do I return only the first distinct row for each join, so that the result set doesn't get clogged with redundant data? Basically, the result above should produce this:
A.ID | B.ID | C1.ID | C2.ID
1| 1| 1| NULL
| | 2| NULL
| 2| 1| NULL
| | 2| NULL
| | 3| NULL
2| 1| NULL| 1
| | NULL| 2
....
I can probably do this by making each join a subquery and partitioning the results by rank, or alternatively creating a temporary table and slam the results there with the required logic, but since this will be used in a console app, I'd like to keep the solution as clean, simple and optimized as possible.
Any ideas?
This is reporting / formatting, not data, so it should be handled by the application, not by SQL.
That said, this will produce something close to your requirements
select
case arn when 1 then convert(varchar(10),aid) else '' end as aid,
case brn when 1 then convert(varchar(10),bid) else '' end as bid,
case crn when 1 then convert(varchar(10),c1id) else '' end as c1id,
c2id
from
(
select a.id aid, b.id bid, c1.id c1id, c2.id c2id,
ROW_NUMBER() over(partition by a.id order by a.id,b.id,c1.id,c2.id) arn,
ROW_NUMBER() over(partition by a.id,b.id order by a.id,b.id,c1.id,c2.id) brn,
ROW_NUMBER() over(partition by a.id,b.id,c1.id order by a.id,b.id,c1.id,c2.id) crn
FROM A
JOIN B
LEFT JOIN C1
LEFT JOIN C2
) v
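Since the answer argues that this formatting belongs in the application, here is a minimal Python sketch of doing it there (an assumption; the question only says the consumer is a console app). blank_repeats is a hypothetical helper that expects rows already sorted by A.ID and B.ID, and blanks a key column whenever it repeats the previous row's value under the same parent.

```python
def blank_repeats(rows, key_cols=2):
    """Blank the leading key columns that repeat the previous row's values."""
    out = []
    prev = None
    for row in rows:
        row = tuple(row)
        shown = list(row)
        if prev is not None:
            for i in range(key_cols):
                # Column i is blanked only if it and all columns left of it repeat.
                if row[:i + 1] == prev[:i + 1]:
                    shown[i] = ""
                else:
                    break
        out.append(tuple(shown))
        prev = row
    return out

# Rows in the shape of the question's result: (A.ID, B.ID, C1.ID, C2.ID).
sample = [
    (1, 1, 1, None), (1, 1, 2, None),
    (1, 2, 1, None), (1, 2, 2, None), (1, 2, 3, None),
    (2, 1, None, 1), (2, 1, None, 2),
]

if __name__ == "__main__":
    for line in blank_repeats(sample):
        print(line)
```

The query stays a plain ordered join, and the blanking rule lives in one small function that is easy to change per report.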
I've come up with two approaches to the same idea and would like to avoid any obvious pitfalls by using one over the other. I have a table (tbl_post) where a single row can have many relationships to other tables (tbl_category, tbl_site, tbl_team). I have a relationship table to join these but don't know which structure to go with, conditional or direct? Hopefully the following will explain...
tbl_post (simple post, can be associated with many categories, teams and sites)
* id
* title
* content
tbl_category
* id
* title
* other category only columns
tbl_team
* id
* title
* other team only columns
tbl_site
* id
* title
* other site only columns
----------------------------------------------------------
tbl_post_relationship
* id (pk)
* post_id (fk tbl_post)
* related_id (fk, dependent on related_type to either tbl_category, tbl_site or tbl_team)
* related_type (category, site or team)
____________________________________
|id|post_id|related_id|related_type|
|--|-------|----------|------------|
| 1| 1| 6| category|
| 2| 1| 4| site|
| 3| 1| 9| category|
| 4| 1| 3| team|
------------------------------------
SELECT c.*
FROM tbl_category c
JOIN tbl_post_relationship r ON
r.post_id = 1
AND r.related_type = 'category'
AND c.id = r.related_id
------------- OR ---------------
tbl_post_relationship
* id (pk)
* post_id (fk tbl_post)
* category_id (fk tbl_category)
* site_id (fk tbl_site)
* team_id (fk tbl_team)
________________________________________
|id|post_id|category_id|site_id|team_id|
|--|-------|-----------|-------|-------|
| 1| 1| 6| NULL| NULL|
| 2| 1| NULL| 4| NULL|
| 3| 1| 9| NULL| NULL|
| 4| 1| NULL| NULL| 3|
----------------------------------------
SELECT c.*
FROM tbl_category c
JOIN tbl_post_relationship r ON
r.post_id = 1
AND r.category_id = c.id
So with one approach I'll end up with lots of columns (there might be more tables) full of NULLs. With the other I end up with one simple table to maintain, but every join is based on a "type". I also know I could have a table per relationship, but again that feels like too many tables. Any ideas / thoughts?
You are best off with one table per relationship. You should not worry about the number of tables. The drawbacks of a single relationship table are several, and quite risky:
1) You cannot enforce foreign keys if the related tables vary from row to row, so your data integrity is at risk... and sooner or later you will have orphaned data.
2) Queries are more complex because you have to use the related_type to filter out the relations in many places.
3) Query maintenance is more costly, for the same reasons as 2), and because you have to explicitly use the related_type constants in many places... it'll be hell when you need to change them or add some.
I'd suggest you use the orthodox design... just go with 3 distinct relationship tables: post_category, post_team, post_site.
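Point 1) can be seen in action with a dedicated relationship table. The sketch below uses Python's sqlite3 module (an assumption; the question does not name a database) and shows only post_category; post_team and post_site would follow the same shape. The helper names and the category titles are hypothetical.

```python
import sqlite3

def make_schema():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE tbl_post (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE tbl_category (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE post_category (
            post_id INTEGER NOT NULL REFERENCES tbl_post(id),
            category_id INTEGER NOT NULL REFERENCES tbl_category(id),
            PRIMARY KEY (post_id, category_id)
        );
        INSERT INTO tbl_post VALUES (1, 'first post');
        INSERT INTO tbl_category VALUES (6, 'news'), (9, 'sport');
    """)
    return conn

def link_category(conn, post_id, category_id):
    """Return True if the link was stored, False if a foreign key rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO post_category VALUES (?, ?)",
                         (post_id, category_id))
        return True
    except sqlite3.IntegrityError:
        return False

def categories_for_post(conn, post_id):
    # No related_type filter is needed: the table itself identifies the relation.
    return conn.execute("""
        SELECT c.id, c.title
        FROM tbl_category c
        JOIN post_category pc ON pc.category_id = c.id
        WHERE pc.post_id = ?
        ORDER BY c.id
    """, (post_id,)).fetchall()

if __name__ == "__main__":
    conn = make_schema()
    print(link_category(conn, 1, 6))    # existing category
    print(link_category(conn, 1, 999))  # no such category, rejected by the FK
    print(categories_for_post(conn, 1))
```

With a single related_id column that points at different tables depending on related_type, the database has no way to declare this constraint, so the orphan check would have to live in application code.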