Trying to find non-duplicate entries in mostly identical tables (Access) - SQL

I have 2 different databases. They track different things about inventory, but in essence they share 3 common fields: location, item number, and quantity. I've extracted these into 2 tables with only those fields. Every answer I find handles only some of the test cases, not all of them.
Items can be in multiple locations, and in turn each location can hold multiple items. The primary key would be the combination of location and item number.
I need to flag when an entry doesn't match on all three fields.
I've only been able to find queries that match on a single ID, or whose logic is beyond my comprehension. For the data below, I'd need a query showing that rows 1, 2, and 5 have issues. I'd run it on each table and verify the results against a physical inventory.
Please refrain from commenting on how silly it is to have this information in 2 different databases; all I get in response is to deal with it =P
Table A
Location | ItemNum | QTY
-------------------------
1a1a     | as1001  | 5
1a1b     | as1003  | 10
1a1b     | as1004  | 2
1a1c     | as1005  | 15
1a1d     | as1005  | 15
Table B
Location | ItemNum | QTY
-------------------------
1a1a     | as1001  | 10
1a1d     | as1003  | 10
1a1b     | as1004  | 2
1a1c     | as1005  | 15
1a1e     | as1005  | 15
This article seemed to do what I wanted, but I couldn't get it to work.

To find entries in Table A that don't have an exactly matching entry in Table B:
select A.*
from A
left join B on A.Location = B.Location and A.ItemNum = B.ItemNum and A.QTY = B.QTY
where B.Location is null
Just swap all the A's and B's to get the list of entries in B with no matching entry in A.
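If you want both directions in one pass, the two mirrored queries can be combined with a UNION. A minimal sketch, assuming the extracted tables really are named A and B as above; the first column is just a label for readability:
select 'missing from B' as problem, A.*
from A
left join B on A.Location = B.Location and A.ItemNum = B.ItemNum and A.QTY = B.QTY
where B.Location is null
union all
select 'missing from A', B.*
from B
left join A on B.Location = A.Location and B.ItemNum = A.ItemNum and B.QTY = A.QTY
where A.Location is null;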

Related

Deleting duplicate rows with primary keys that are connected to other tables

A process was causing duplicate rows in a table where there were not supposed to be any. There are several great answers online for deleting duplicate rows. But what if those duplicate rows, each with its own ID primary key, have data in other tables tied to them?
Is there a way to delete all duplicates in the first table and migrate all data tied to those keys to the single PK ID that wasn't deleted?
For example:
TABLE 1
+--------+-------+----------+-------+
| ID(PK) | Model | ItemType | Color |
+--------+-------+----------+-------+
| 1      | 4     | B        | Red   |
| 2      | 4     | B        | Red   |
| 3      | 5     | A        | Blue  |
+--------+-------+----------+-------+
TABLE 2
+--------+---------+-------+
| ID(PK) | OtherID | Type  |
+--------+---------+-------+
| 1      | 1       | Type1 |
| 2      | 1       | Type2 |
| 3      | 2       | Type3 |
| 4      | 2       | Type4 |
| 5      | 2       | Type5 |
+--------+---------+-------+
So I would theoretically want to delete the entry with ID: 2 from TABLE 1, and then have the OtherID fields in TABLE 2 switch to 1. This would actually be needed for X number of tables. This particular situation has 4 tables connected to its ID PK.
You cannot do this automatically, but you can do it with some queries. First, set all the foreign keys to the correct id, which is presumably the smallest one:
with ids as (
    select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
    from table1 t1
)
update t2
set t2.otherid = ids.min_id
from table2 t2
join ids on t2.otherid = ids.id
where ids.id <> ids.min_id;
Then delete the ids that are either duplicated or not referenced in table2 (depending on which you actually want):
with ids as (
    select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
    from table1 t1
)
delete from ids
where id <> min_id;
Note: If the database has concurrent users, you might want to put it in single user mode for this operation or lock the tables so they are not modified during these two operations.
To do this right, you want to wrap everything in a single transaction and perform this during a regular maintenance period. Anything else could leave things as inconsistent as they are now.
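For instance, a minimal sketch of that wrapper in SQL Server syntax, reusing the two statements from above:
begin transaction;

with ids as (
    select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
    from table1 t1
)
update t2
set t2.otherid = ids.min_id
from table2 t2
join ids on t2.otherid = ids.id
where ids.id <> ids.min_id;

with ids as (
    select t1.*, min(id) over (partition by Model, ItemType, Color) as min_id
    from table1 t1
)
delete from ids
where id <> min_id;

commit transaction;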
1. Make a determination as to which "key" you will use.
2. Update all of the child tables to use the new "key" where the value is the old "key".
3. There should then be no FK dependencies on the duplicate records; delete them.
4. Once all ambiguities are resolved, place a unique constraint on (ItemType, Color) (or whatever the real columns are).
If there are a lot of instances, you may need to write a script to handle this, using the information in sys.foreign_keys and sys.foreign_key_columns to determine which records to update and in which order.
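A starting point for such a script could be a query like the following sketch against SQL Server's catalog views (dbo.table1 is a placeholder for the real parent table); it lists every child table and column whose foreign key points at the parent:
select fk.name as fk_name,
       object_name(fkc.parent_object_id) as child_table,
       col_name(fkc.parent_object_id, fkc.parent_column_id) as child_column
from sys.foreign_keys fk
join sys.foreign_key_columns fkc
    on fkc.constraint_object_id = fk.object_id
where fk.referenced_object_id = object_id('dbo.table1');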

Why is INNER JOIN producing more records than original file?

I have two tables. Table A & Table B. Table A has 40516 rows, and records sales by seller_id. The first column in Table A is the seller_id that repeats every time a sale is made.
Example: Table A (40516 rows)
seller_id | item   | cost
--------------------------
1         | dog    | 5000
1         | cat    | 50
4         | lizard | 80
5         | bird   | 20
5         | fish   | 90
The seller_id is also present in Table B, alongside the corresponding name of the seller.
Example: Table B (5851 rows)
seller_id | seller_name
-------------------------
1 | Dog and Cat World INC
4 | Reptile Love.com
5 | Ocean Dogs Inc
I want to join these two tables, but only display the seller name from Table B and all other columns from Table A. When I do this with an INNER JOIN I get 40864 rows (348 extra rows). Shouldn't the query produce only the original 40516 rows?
Also not sure if this matters, but the seller_id can contain several zeros before the number (e.g., 0000845, 0000549).
I've looked around on here and haven't really found an answer. I've tried LEFT and RIGHT joins and get the same results for one and way more results for the other.
SQL Code Example:
SELECT public.table_B.seller_name, *
FROM public.table_A
INNER JOIN public.table_B
    ON public.table_A.seller_id = public.table_B.seller_id;
Expected Results:
seller_name           | seller_id | item   | cost
--------------------------------------------------
Dog and Cat World INC | 1         | dog    | 5000
Dog and Cat World INC | 1         | cat    | 50
Reptile Love.com      | 4         | lizard | 80
Ocean Dogs Inc        | 5         | bird   | 20
Ocean Dogs Inc        | 5         | fish   | 90
I expected the results to contain the same number of rows as Table A. Instead I got the names matched up, plus an additional 348 rows...
Update:
I changed "unique_id" to "seller_id" in the question.
I guess I should have chosen a better name for unique_id in the original example. I didn't mean it to be unique in the sense of a key. It is just the seller's id that repeats every time there is a sale (in Table A). The seller's ID does repeat in Table A because it is supposed to. I simply want to pair up the seller IDs with the seller names.
Thanks again everyone for their help!
unique_id is already not correctly named in the first table, so there is no reason to assume it is unique in the second table either.
Run this query to find the duplicates:
select unique_id
from table_b
group by unique_id
having count(*) > 1;
You can fix the query using distinct on:
SELECT b.seller_name, a.*
FROM public.table_A a JOIN
(SELECT DISTINCT ON (b.unique_id) b.*
FROM public.table_B b
ORDER BY b.unique_id
) b
ON a.unique_id = b.unique_id;
In this case, you may get fewer records, if there are no matches. To fix that, use a LEFT JOIN.
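That is, something like this sketch, which keeps every row of table_A even when no seller matches:
SELECT b.seller_name, a.*
FROM public.table_A a LEFT JOIN
     (SELECT DISTINCT ON (b.unique_id) b.*
      FROM public.table_B b
      ORDER BY b.unique_id
     ) b
     ON a.unique_id = b.unique_id;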
Because the unique_id column is not unique.
Gordon Linoff was correct. The seller_id (formerly listed as unique_id) was indeed duplicated throughout the data set. I foolishly assumed otherwise. Also, the seller_name had many duplicates! In the end I had to use the CONCAT() function to join the seller_id with a second identifier to create a kind of composite key. After I did this, the join worked as expected. Thanks everyone!
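For reference, the composite-key join the asker describes could look like this sketch, where second_id stands in for whatever the second identifier actually is; the separator guards against accidental collisions such as '12' + '3' vs '1' + '23':
SELECT b.seller_name, a.*
FROM public.table_A a
JOIN public.table_B b
    ON CONCAT(a.seller_id, '-', a.second_id) = CONCAT(b.seller_id, '-', b.second_id);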

PostgreSQL: Distribute rows evenly and according to frequency

I have trouble with a complex ordering problem. I have the following example data:
table "categories"
id | frequency
1  | 0
2  | 4
3  | 0
table "entries"
id | category_id | type
1  | 1           | a
2  | 1           | a
3  | 1           | a
4  | 2           | b
5  | 2           | c
6  | 3           | d
I want to put the entries rows in an order so that category_id and type are distributed evenly.
More precisely, I want to order entries in a way that:
1. category_ids that refer to a category that has frequency=0 should be
distributed evenly, so that a row is followed by a different category_id
whenever possible, e.g. category_ids of rows: 1,2,1,3,1,2.
2. Rows with category_ids of categories with frequency<>0 should
be inserted starting near the beginning, with a minimum of frequency rows between
them (the gap sizes should vary). In my example these are the rows with category_id=2.
So the result could start with row id #1, then #4, then a minimum of 4 rows of other
categories, then #5.
3. In the end result, rows with the same type should not be next to each other.
Example result:
id | category_id | type
1  | 1           | a
4  | 2           | b
2  | 1           | a
6  | 3           | d
.. some other row ..
.. some other row ..
.. some other row ..
5  | 2           | c
entries are like a stream of things the user gets (one at a time).
The whole ordering should give users some variation. It's just there so they
aren't presented with similar entries all the time, so it doesn't have to be perfect.
The query also does not have to give the same result on each call - using
random() is totally fine.
frequencies are there to give entries of certain categories a higher
priority so that they are not distributed across the whole range, but are placed more
at the beginning of the result list. Even if there are a lot of these entries, they
should not completely crowd out the frequency=0 entries at the beginning, though.
I'm not sure how to start this. I think I can use window functions and
ntile() to distribute rows by category_id and type, but I have no idea
how to insert the non-zero-frequency entries afterwards.
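As a starting point for the frequency=0 part alone, window functions already give a round-robin interleave. A minimal sketch against the schema above; it ignores the frequency<>0 placement and the type constraint:
SELECT e.*
FROM entries e
JOIN categories c ON c.id = e.category_id
WHERE c.frequency = 0
ORDER BY row_number() OVER (PARTITION BY e.category_id ORDER BY random()),
         random();
This emits the first entry of each category (in random category order), then the second of each, and so on, so neighboring rows come from different categories whenever possible.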

sql insert value from another table with original nulls but not unmatched entries

OK. So this is a hard one to explain, but I am replacing the type of a foreign key in a database. To do this I need to update the values in a table that references it. That is all fine and good, and nice and easy to do.
I'm inserting the results into a temporary table which will replace the original table. The insert query isn't at all difficult; it's the select that gets the values that's giving me trouble.
However, I also want to keep any entries where the original reference was NULL. Also not hard; I could use a LEFT JOIN for that.
But we're not done yet: I don't want the entries for which there is no match in the second table. I've been dinking around with this for 2 hours now, and am no closer to figuring this out than I am to the moon.
Let me give you an example data set:
____________________________
| Inventory  ||  Customer  |
|============||============|
|  ID  Cust  ||  ID  Name  |
|------------||------------|
|  1   A     ||  1   A     |
|  2   B     ||  2   B     |
|  3   E     ||  3   C     |
|  4   NULL  ||  4   D     |
|____________||____________|
Let's say the database used to use the Customer.Name field as its Primary Key, and I need to change it to a standard int identity(1,1) not null ID. I've added the field with no issues in the Customer table, and kept the Name because I need it for other stuff. I have had no trouble with this in all the tables that do not allow NULLs, but since the "Inventory" table allows something to be associated with No customer, I'm running into troubles.
If I did a left join, my results would be:
______________
|  Results   |
|============|
|  ID  Cust  |
|------------|
|  1   1     |
|  2   2     |
|  3   NULL  |
|  4   NULL  |
|____________|
However, Inventory #3 was referencing a customer which does not exist. I want that to be filtered out.
This database is my development database, where I hack, slash, and destroy things with wanton disregard for validity. So a lot of links in these tables are no longer valid.
The next step is replicating this process in the beta-testing environment, where bad records shouldn't exist, but I can't guarantee that. So I'd like to keep the filter, if possible.
The query I have right now is using a sub-query to find all rows in Inventory whose CustID either exists in Customers, or is null. It then tries to only grab the value from those rows which the subquery found. Here's the translated query:
insert into results
(
ID,
Cust
)
select
inv.ID, cust.ID
from Inventory inv, Customer cust
where inv.ID in
(
select inv.ID from Inventory inv, Customer cust
where inv.Cust is null
or cust.Name = inv.Cust
)
and cust.Name = inv.Cust
But, as I'm sure you can see, this query isn't right. I've tried using 2, 3 subqueries, inner joins, left joins, bleh. The results of this query, and many others I've tried (that weren't horribly, horribly wrong) are:
______________
|  Results   |
|============|
|  ID  Cust  |
|------------|
|  1   1     |
|  2   2     |
|____________|
Which is essentially an inner-join. Considering my actual data has around 1100 records which have NULL values in that field, I don't think truncating them is the answer.
The answer I'm looking for is:
______________
|  Results   |
|============|
|  ID  Cust  |
|------------|
|  1   1     |
|  2   2     |
|  4   NULL  |
|____________|
The trickiest part of this insert into select is the fact that I'm looking to insert either a value from another table, or essentially a value from this table or the literal NULL. That just isn't something I know how to do; I'm still getting the hang of SQL.
Since I'm inserting the results of this query into a table, I've considered doing the insert using a select which leaves out the NULL values and un-matched records, then going back through and adding in all the NULL records, but I really want to learn how to do the more advanced queries like this.
So do any of yous folks have any ideas? 'Cause I'm lost.
How about a union?
Select all records where Inventory.Cust matches a Customer.Name, and union that with all records from Inventory where Cust is null.
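A minimal sketch of that union against the example tables, matching on the Name/Cust text as described:
insert into results (ID, Cust)
select inv.ID, cust.ID
from Inventory inv
inner join Customer cust on cust.Name = inv.Cust
union all
select inv.ID, null
from Inventory inv
where inv.Cust is null;
The inner join drops the dangling Inventory #3, while the second branch carries the NULL rows through, giving exactly rows 1, 2, and 4.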

Advanced Query with Join

I'm trying to convert a product table that contains all the details of the product into separate tables in SQL. I've got everything done except for the duplicated descriptor details.
The problem I am having is that the products share size/color/style/other values with many other products. I want to have only one row per descriptor and reuse its ID across all the products that carry it; I believe that makes it a parent key referenced by the products as a foreign key. The only problem is that every descriptor would end up referenced by multiple products. So I was thinking of skipping the parent-key bookkeeping for each descriptor and instead just checking whether that descriptor already exists, and if it does, reusing its key.
Data Table
PI | Colo | Sz | OTHER
1  | Blue | 5  | Vintage
2  | Blue | 6  | Vintage
3  | Blac | 5  | Simple
4  | Blac | 6  | Simple
===================================
Its destination table is this
===================================
DI | Description
1  | Blue
2  | Blac
3  | 5
4  | 6
6  | Vintage
7  | Simple
=============================
My rough pseudocode so far: select from Data.Table the unique values of
Data.Table.Colo, Data.Table.Sz, and Data.Table.Other.
=======================================
Then the second part of the question: after we create all the descriptors, how do we run a new query to assign the product IDs to the descriptors?
PI | DI
1 | 1
1 | 3
1 | 4
2 | 1
2 | 3
2 | 4
By figuring out how to do this I should be able to duplicate the pattern for all 300+ columns in the product table. Some of these fields are 60+ characters long, so it's going to save a ton of space.
Do I use an array?
Okay, if I understand you correctly, you want all unique attributes converted from columns into rows in a single table (detailstable) that has an id and a description field:
Assuming the schema:
datatable
------------------
PI [PK]
Colo
Sz
OTHER
detailstable
------------------
DI [PK]
Description
You can first get all of the unique attributes into their own table with:
INSERT INTO detailstable (Description)
SELECT
a.description
FROM
(
SELECT DISTINCT Colo AS description
FROM datatable
UNION
SELECT DISTINCT Sz AS description
FROM datatable
UNION
SELECT DISTINCT OTHER AS description
FROM datatable
) a
Then to link up the datatable to the detailstable, I'm assuming you have a cross-reference table defined like:
datadetails
------------------
PI [PK]
DI [PK]
You can then do:
INSERT INTO datadetails (PI, DI)
SELECT
a.PI,
b.DI
FROM
datatable a
INNER JOIN
detailstable b ON b.Description IN (a.Colo, a.Sz, a.OTHER)
I reckon you want to split the description table into separate tables for the different categories, like colorDescription, sizeDescription, etc.
If that is not practical, then I would recommend having an extra column holding a category attribute:
DI | Description | Category
1  | Blue        | Color
2  | Blac        | Color
3  | 5           | Size
4  | 6           | Size
6  | Vintage     | Other
7  | Simple      | Other
And then make the primary key of this table the combination of the ID and Category columns.
This leaves fewer chances of injecting data errors, and it also makes problems easier to track down.
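A sketch of that definition; the column sizes are assumptions:
CREATE TABLE detailstable (
    DI          INT         NOT NULL,
    Description VARCHAR(80) NOT NULL,
    Category    VARCHAR(10) NOT NULL,
    PRIMARY KEY (DI, Category)
);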