SQL: insert value from another table with original NULLs but not unmatched entries

OK. So this is a hard one to explain, but I am replacing the type of a foreign key in a database. To do this I need to update the values in a table that references it. That is all fine and good, and nice and easy to do.
I'm inserting this stuff into a temporary table which will replace the original table. The insert query isn't at all difficult; it's the select that I get the values from.
However, I also want to keep any entries where the original reference was NULL. Also not hard; I could use a LEFT JOIN for that.
But we're not done yet: I don't want the entries for which there is no match in the second table. I've been dinking around with this for 2 hours now, and am no closer to figuring this out than I am to the moon.
Let me give you an example data set:
Inventory          Customer
ID | Cust          ID | Name
---+------         ---+------
 1 | A              1 | A
 2 | B              2 | B
 3 | E              3 | C
 4 | NULL           4 | D
Let's say the database used to use the Customer.Name field as its Primary Key, and I need to change it to a standard int identity(1,1) not null ID. I've added the field with no issues in the Customer table, and kept the Name because I need it for other stuff. I have had no trouble with this in all the tables that do not allow NULLs, but since the "Inventory" table allows something to be associated with No customer, I'm running into troubles.
If I did a plain left join, my results would be:
Results
ID | Cust
---+------
 1 | 1
 2 | 2
 3 | NULL
 4 | NULL
However, Inventory #3 was referencing a customer which does not exist. I want that to be filtered out.
This database is my development database, where I hack, slash, and destroy things with wanton disregard for validity. So a lot of links in these tables are no longer valid.
The next step is replicating this process in the beta-testing environment, where bad records shouldn't exist, but I can't guarantee that. So I'd like to keep the filter, if possible.
The query I have right now is using a sub-query to find all rows in Inventory whose CustID either exists in Customers, or is null. It then tries to only grab the value from those rows which the subquery found. Here's the translated query:
insert into results (ID, Cust)
select inv.ID, cust.ID
from Inventory inv, Customer cust
where inv.ID in
(
    select inv.ID
    from Inventory inv, Customer cust
    where inv.Cust is null
       or cust.Name = inv.Cust
)
and cust.Name = inv.Cust
But, as I'm sure you can see, this query isn't right. I've tried using 2, 3 subqueries, inner joins, left joins, bleh. The results of this query, and many others I've tried (that weren't horribly, horribly wrong) are:
Results
ID | Cust
---+------
 1 | 1
 2 | 2
Which is essentially an inner join. Considering my actual data has around 1100 records which have NULL values in that field, I don't think dropping them is the answer.
The answer I'm looking for is:
Results
ID | Cust
---+------
 1 | 1
 2 | 2
 4 | NULL
The trickiest part of this insert-into-select is that, for each row, I'm looking to insert either a value from another table or the literal NULL. That just isn't something I know how to do; I'm still getting the hang of SQL.
Since I'm inserting the results of this query into a table, I've considered doing the insert using a select which leaves out the NULL values and un-matched records, then going back through and adding in all the NULL records, but I really want to learn how to do the more advanced queries like this.
So do any of yous folks have any ideas? 'Cause I'm lost.

How about a union?
Select all records where ID and Cust match, and union that with all records where ID matches and Inventory.Cust is null.
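A minimal sketch of that union, using the table and column names from the question (it assumes the Results table already exists with the new int ID column):
insert into Results (ID, Cust)
-- rows whose customer still exists: swap the Name reference for the new numeric ID
select inv.ID, cust.ID
from Inventory inv
inner join Customer cust on cust.Name = inv.Cust
union all
-- rows that were NULL to begin with: keep them, with a NULL reference
select inv.ID, null
from Inventory inv
where inv.Cust is null;
The inner join silently drops Inventory #3 (its customer no longer exists), and the second branch re-adds only the rows that were genuinely NULL. An equivalent single-query form is a left join from Inventory to Customer filtered with where cust.ID is not null or inv.Cust is null.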

Related

Why is INNER JOIN producing more records than original file?

I have two tables. Table A & Table B. Table A has 40516 rows, and records sales by seller_id. The first column in Table A is the seller_id that repeats every time a sale is made.
Example: Table A (40516 rows)
seller_id | item   | cost
----------+--------+------
1         | dog    | 5000
1         | cat    |   50
4         | lizard |   80
5         | bird   |   20
5         | fish   |   90
The seller_id is also present in Table B, and also contains the corresponding name of the seller.
Example: Table B (5851 rows)
seller_id | seller_name
----------+----------------------
1         | Dog and Cat World INC
4         | Reptile Love.com
5         | Ocean Dogs Inc
I want to join these two tables, but only display the seller name from Table B and all other columns from Table A. When I do this with an INNER JOIN I get 40864 rows (348 extra rows). Shouldn't the query produce only the original 40516 rows?
Also not sure if this matters, but the seller_id can contain several zeros before the number (e.g., 0000845, 0000549).
I've looked around on here and haven't really found an answer. I've tried LEFT and RIGHT joins and get the same results for one and way more results for the other.
SQL Code Example:
SELECT public.table_B.seller_name, *
FROM public.table_A
INNER JOIN public.table_B
        ON public.table_A.seller_id = public.table_B.seller_id;
Expected Results:
seller_name           | seller_id | item   | cost
----------------------+-----------+--------+------
Dog and Cat World INC | 1         | dog    | 5000
Dog and Cat World INC | 1         | cat    |   50
Reptile Love.com      | 4         | lizard |   80
Ocean Dogs Inc        | 5         | bird   |   20
Ocean Dogs Inc        | 5         | fish   |   90
I expected the results to contain the same number of rows as Table A. Instead I got names matching up and an additional 348 rows...
Update:
I changed "unique_id" to "seller_id" in the question.
I guess I should have chosen a better name for unique_id in the original example. I didn't mean it to be unique in the sense of a key. It is just the seller's id that repeats every time there is a sale (in Table A). The seller's ID does repeat in Table A because it is supposed to. I simply want to pair up the seller IDs with the seller names.
Thanks again everyone for their help!
unique_id is already not correctly named in the first table, so there is no reason to assume it is unique in the second table either.
Run this query to find the duplicates:
select unique_id
from table_b
group by unique_id
having count(*) > 1;
You can fix the query using distinct on:
SELECT b.seller_name, a.*
FROM public.table_A a JOIN
     (SELECT DISTINCT ON (b.unique_id) b.*
      FROM public.table_B b
      ORDER BY b.unique_id
     ) b
     ON a.unique_id = b.unique_id;
In this case, you may get fewer records, if there are no matches. To fix that, use a LEFT JOIN.
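A hedged sketch of that LEFT JOIN variant, with the same table names as above:
-- keeps every row of table_A even when no seller matches;
-- unmatched rows show a NULL seller_name instead of disappearing
SELECT b.seller_name, a.*
FROM public.table_A a LEFT JOIN
     (SELECT DISTINCT ON (unique_id) *
      FROM public.table_B
      ORDER BY unique_id
     ) b
     ON a.unique_id = b.unique_id;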
Because the unique_id column is not unique.
Gordon Linoff was correct. The seller_id (formerly listed as unique_id) was indeed duplicated throughout the data set. I foolishly assumed otherwise. Also the seller_name had many duplicates too! In the end I had to use the CONCAT() function to join the seller_id with a second identifier to create a usable join key. After I did this the join worked as expected. Thanks everyone!

SQL Help - Join small lookup table where not all columns are required (and an other option)

I have one large table with transactions and a smaller lookup table with values I want to add based on 4 common columns. The trick here is that not every combination of these 4 columns will exist in the lookup table, and there are scenarios where I want it to stop checking and accept the match instead of going to the next column. I also have an "Other" option to default to if it doesn't match any of the options.
Table structures are something like this:
transaction_table
country, trans_id, store_type, store_name, channel, browser, purchase_amount, currency
lookup_table
country, store_name, channel, browser, trans_fee
The data could be something like this:
transaction_table:
country | trans_id | store_type | store_name  | channel | browser | amt  | currency
--------+----------+------------+-------------+---------+---------+------+---------
US      | 001      | Big Box    | Target      | B&M     | N/A     | 1.45 | USD
US      | 002      | Big Box    | Target      | Online  | Chrome  | 1.79 | USD
US      | 003      | Small      | Bob's Store | B&M     | N/A     | 2.50 | USD
US      | 004      | Big Box    | Walmart     | B&M     | N/A     | 1.12 | USD
US      | 005      | Big Box    | Walmart     | Online  | Firefox | 3.79 | USD
US      | 006      | Big Box    | Amazon      | Online  | IE      | 4.54 | USD
US      | 007      | Small      | Jim's Plc   | B&M     | IE      | 2.49 | USD
lookup_table:
country | store_name | channel | browser | trans_fee
--------+------------+---------+---------+----------
US      | Target     | B&M     | N/A     | 0.25
US      | Target     | Online  |         | 0.15
US      | Walmart    |         |         | 0.30
US      | Other      |         |         | 0.45
So looking at the lookup_table data:
- Row 1 is very specific and would be a match on all 4 of the join columns.
- Row 2 would not care what browser was used to shop at Target, so regardless of the "browser" value the trans_fee should come back the same (other stores may care though).
- Row 3 is saying any transaction with country='US' and store_name='Walmart', regardless of the rest of the join columns, would have the same trans_fee.
- Row 4 is the "Other" scenario, where it should look first at the store_name column and, if it doesn't find a match, go to Other.
The lookup_table data can change and may end up being time dependent (start_date and end_date columns added) so it really wouldn't be a good candidate for a long, complex CASE statement.
I was thinking of a combination of checking each column with an IF IN statement but I'm hoping there's a more straightforward conditional join type statement I can use to go column by column and have an other option.
Thanks!
edit: I didn't specify this but I want to basically return all of the data from transaction_table and add the corresponding trans_fee to each line.
You will need to use a conditional JOIN.
Something like this:
SELECT *
FROM lookup_table
LEFT OUTER JOIN transaction_table
    ON CASE WHEN lookup_table.store_name IS NOT NULL
            THEN transaction_table.store_name = lookup_table.store_name END
Such partial matching is tricky. And your problem is not really that well set up. You seem to have NULLs in some columns and general values in others.
In any case, you can solve this by matching what you can and then using order by to get the best match. In your case, I think this looks like this:
select tt.*,
       (select l.trans_fee
        from lookup_table l
        where l.country = tt.country and
              l.store_name in ('Other', tt.store_name) and
              (l.channel = tt.channel or l.channel is null) and
              (l.browser = tt.browser or l.browser is null)
        order by (case when l.store_name = tt.store_name then 1 else 2 end),
                 (case when l.channel = tt.channel then 1 else 2 end),
                 (case when l.browser = tt.browser then 1 else 2 end)
        fetch first 1 row only
       ) as trans_fee
from transaction_table tt;
This is generic SQL. But the same idea should work in any database.

New column referencing second table - do I need a join?

I have two tables (first two shown) and need to make a third from the first two - do I need to do a join or can you reference a table without joining?
The third table shown is the desired output. Thanks for any help!
| ACC | CALL DATE |
+-----+-----------+
| 1 1 | 2/1/18    |
+-----+-----------+

| ACC | PURCHASE DATE |
+-----+---------------+
| 1 1 | 1/1/18        |
+-----+---------------+

| ACC | CALL DATE | PRIOR MONTH PURCHASE |
+-----+-----------+----------------------+
| 1 1 | 2/1/18    | YES                  |
+-----+-----------+----------------------+
Of course you can have a query that references multiple tables without joining. union all is an example of an operator that does that.
There is also the question of what you mean by "joining" in the question. If you mean explicit joins, there are ways around that -- such as correlated subqueries. However, these are implementing some form of "join" in the database engine.
As for your query, you would want to use exists with a correlated subquery:
select t1.*,
(case when exists (select 1
from table2 t2
where t2.acc = t1.acc and
datediff(month, t2.purchase_date, t1.call_date) = 1
)
then 'Yes' else 'No'
end) as prior_month_purchase
from table1 t1;
This is "better" than a join because it does not multiply or remove rows. The result set has exactly the rows in the first table, with the additional column.
The syntax assumes SQL Server (which was an original tag). Similar logic can be expressed in other databases, although date functions are notoriously database-dependent.
Let's check the options.
If you were to create a new third table from the data in the first two, then every update/insert/delete to either of those tables would also have to propagate into the third table.
If you instead create a view, there is no third table to maintain, and it gives you the data you need from the first two each time you query it:
create view third_table as
select a.acc, a.call_date,
       case when dateadd(mm, -1, a.call_date) = b.purchase_date then 'Yes' else 'No' end as prior_month_purchase
from first_table a
left join second_table b
  on a.acc = b.acc

Select ID given the list of members

I have a table for the link/relationship between two other tables, a table of customers and a table of groups. A group is made up of one or more customers. The link table is like
APP_ID | GROUP_ID | CUSTOMER_ID
1 | 1 | 123
1 | 1 | 124
1 | 1 | 125
1 | 2 | 123
1 | 2 | 125
2 | 3 | 123
3 | 1 | 123
3 | 1 | 124
3 | 1 | 125
I now have a need, given a list of customer IDs, to be able to get the group ID for that list of customer IDs. Group ID may not be unique: the same group ID will contain the same list of customer IDs, but that group may exist in more than one app_id.
I'm thinking that
SELECT APP_ID, GROUP_ID, COUNT(CUSTOMER_ID) AS COUNT
FROM GROUP_CUST_REL
WHERE CUSTOMER_ID IN ( <list of ids> )
GROUP BY APP_ID, GROUP_ID
HAVING COUNT(CUSTOMER_ID) = <number of ids in list>
will return me all of the group IDs that contain all of the customer ids in the given list. So for a list of (123,125) I was expecting only group id 2 to be returned from the above example, though as written the query also returns groups containing the listed customers plus others (group 1 holds both 123 and 125, as well as 124); an exact-match variant is sketched after the answer below.
I will then have to link with the app table to use its created timestamp to identify the most recent application that the group existed in so that I can then pull the correct/most up to date info from the group table.
Does anyone have any thoughts on whether this is the most efficient way to do this? If there is another quicker/cleaner way I'd appreciate your thoughts.
This smells like a division:
Division sample
Other related stack overflow question
Taking a look at the provided links, you'll see solutions to similar issues from relational algebra's point of view; they don't seem to be quicker, and are only arguably cleaner.
I didn't look at your solution at first, and when I solved this I turned out to have solved this the same way you did.
Actually, I thought this:
<number of ids in list>
Could be turned into something like this (so that you don't need the extra parameter):
select count(*) from (<list of ids>) as t
But clearly, I was wrong. I'd stay with your current solution if I were you.
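For completeness, a hedged sketch of the exact-membership variant mentioned above: the HAVING count check alone accepts any group containing all the listed IDs, so a second condition on the group's total size is needed if groups with extra members should be excluded.
SELECT APP_ID, GROUP_ID
FROM GROUP_CUST_REL
GROUP BY APP_ID, GROUP_ID
-- every ID in the list is present...
HAVING COUNT(CASE WHEN CUSTOMER_ID IN (123, 125) THEN 1 END) = 2
-- ...and the group has no members beyond the list
   AND COUNT(*) = 2;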

MySQL, reading this EXPLAIN statement

I have a query which is starting to cause some concern in my application. I'm trying to understand this EXPLAIN statement better to understand where indexes are potentially missing:
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | s | ref | client_id | client_id | 4 | const | 102 | Using temporary; Using filesort |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | www_foo_com.s.user_id | 1 | |
| 1 | SIMPLE | a | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | h | ref | email_id | email_id | 4 | www_foo_com.a.email_id | 10 | Using index |
| 1 | SIMPLE | ph | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | em | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | pho | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | c | ALL | userfield | NULL | NULL | NULL | 1108 | |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
8 rows in set (0.00 sec)
I'm trying to understand where my indexes are missing by reading this EXPLAIN statement. Is it fair to say that one can understand how to optimize this query without seeing the query at all, just by looking at the results of the EXPLAIN?
It appears that the ALL scan against the 'c' table is the Achilles heel. What's the best way to index this based on constant values as recommended in MySQL's documentation?
Note, I also added an index to userfield in the cdr table and that hasn't done much good either.
Thanks.
--- edit ---
Here's the query, sorry -- don't know why I neglected to include it the first pass through.
SELECT s.`session_id` id,
DATE_FORMAT(s.`created`,'%m/%d/%Y') date,
u.`name`,
COUNT(DISTINCT c.id) calls,
COUNT(DISTINCT h.id) emails,
SEC_TO_TIME(MAX(DISTINCT c.duration)) duration,
(COUNT(DISTINCT em.email_id) + COUNT(DISTINCT pho.phone_id) > 0) status
FROM `fa_sessions` s
LEFT JOIN `fa_users` u ON s.`user_id`=u.`user_id`
LEFT JOIN `fa_email_aliases` a ON a.session_id = s.session_id
LEFT JOIN `fa_email_headers` h ON h.email_id = a.email_id
LEFT JOIN `fa_phones` ph ON ph.session_id = s.session_id
LEFT JOIN `fa_email_aliases` em ON em.session_id = s.session_id AND em.status = 1
LEFT JOIN `fa_phones` pho ON pho.session_id = s.session_id AND pho.status = 1
LEFT JOIN `cdr` c ON c.userfield = ph.phone_id
WHERE s.`partner_id`=1
GROUP BY s.`session_id`
I assume you've looked here to get more info about what it is telling you. Obviously the ALL means it's going through all rows. The Using temporary and Using filesort notes are talked about on that page. You might want to look at that.
From the page:
Using filesort
MySQL must do an extra pass to find out how to retrieve the rows in sorted order. The sort is done by going through all rows according to the join type and storing the sort key and pointer to the row for all rows that match the WHERE clause. The keys then are sorted and the rows are retrieved in sorted order. See Section 7.2.12, “ORDER BY Optimization”.
Using temporary
To resolve the query, MySQL needs to create a temporary table to hold the result. This typically happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
I agree that seeing the query might help to figure things out better.
My advice?
Break the query into 2 and use a temporary table in the middle.
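A minimal sketch of that split, reusing the tables from the query above and showing just the calls count (the remaining joins and columns would be added the same way):
-- Step 1: materialise only this partner's sessions, so the big cdr
-- table is probed from a small driving set.
CREATE TEMPORARY TABLE tmp_sessions AS
SELECT s.session_id, s.user_id, s.created
FROM fa_sessions s
WHERE s.partner_id = 1;

-- Step 2: aggregate against the much smaller temporary table.
SELECT t.session_id,
       COUNT(DISTINCT c.id) AS calls
FROM tmp_sessions t
LEFT JOIN fa_phones ph ON ph.session_id = t.session_id
LEFT JOIN cdr c ON c.userfield = ph.phone_id
GROUP BY t.session_id;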
Reasoning
The problem appears to be that table c is being table scanned, and that this is the last table in the query. This is probably bad: if you have a table scan, you want to do it at the start of the query, so it's only done once.
I'm not a MySQL guru, but I have spent a whole lot of time optimising queries on other DBs. It looks to me like the optimiser hasn't worked out that it should start with c and work backwards.
The other thing that strikes me is that there are probably too many tables in the join. Most optimisers struggle with more than 4 tables (because the number of possible table orders grows exponentially, so checking them all becomes impractical).
Having too many tables in a join is the root of 90% of performance problems I have seen.
Give it a go, and let us know how you get on. If it doesn't help, please post the SQL, table definitions and indexes, and I'll take another look.
General Tips
Feel free to look at this answer I gave on general performance tips.
A great resource
MySQL Documentation for EXPLAIN
Well looking at the query would be useful, but there's at least one thing that's obviously worth looking into - the final line shows the ALL type for that part of the query, which is generally not great to see. If the suggested possible key (userfield) makes sense as an added index to table c, it might be worth adding it and seeing if that reduces the rows returned for that table in the search.
Query Plan
The query plan we might hope the optimiser would choose would be something like:
start with sessions where partner_id=1, possibly using an index on partner_id,
join sessions to users, using an index on user_id
join sessions to phones, where status=1, using an index on session_id and possibly status
join sessions to phones again using an index on session_id and phone_id **
join phones to cdr using an index on userfield
join sessions to email_aliases, where status=1 using an index on session_id and possibly status
join sessions to email_aliases again using an index on session_id and email_id **
join email_aliases to email_headers using an index on email_id
** by putting 2 fields in these indexes, we enable the optimiser to join to the table using session_id, and immediately find out the associated phone_id or email_id without having to read the underlying table. This technique saves us a read, and can save a lot of time.
Indexes I would create:
The above query plan suggests these indexes:
fa_sessions ( partner_id, session_id )
fa_users ( user_id )
fa_email_aliases ( session_id, email_id )
fa_email_headers ( email_id )
fa_email_aliases ( session_id, status )
fa_phones ( session_id, status, phone_id )
cdr ( userfield )
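As a hedged sketch, the corresponding DDL might look like this (index names are made up; fa_users.user_id is presumably the primary key already):
CREATE INDEX idx_sessions_partner ON fa_sessions (partner_id, session_id);
CREATE INDEX idx_aliases_sess_email ON fa_email_aliases (session_id, email_id);
CREATE INDEX idx_headers_email ON fa_email_headers (email_id);
CREATE INDEX idx_aliases_sess_status ON fa_email_aliases (session_id, status);
CREATE INDEX idx_phones_sess ON fa_phones (session_id, status, phone_id);
CREATE INDEX idx_cdr_userfield ON cdr (userfield);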
Notes
You will almost certainly get acceptable performance without creating all of these.
If any of the tables are small (less than 100 rows) then it's probably not worth creating an index.
fa_email_aliases might work with (session_id, status, email_id), depending on how the optimiser works.