Excluding data pairs from a query based on a table? - sql

I have a massive and messy database of facilities where there are many duplicates. Addresses have been entered in such a haphazard way that I will be making many queries to identify possible duplicates. My objective is for each query to identify the possible duplicates, and then a person actually goes through the list and marks each pairing as either "not a duplicate" or "possible duplicate."
When someone marks a facility pair as not a duplicate, I want to record that data pair in a table so when that when one of the queries would otherwise return that pairing, it is instead excluded. I am at a loss for how to do this. I'm currently using MS Access for SQL queries, and have rudimentary visual basic knowledge.
Sample of how it should work
Query 1 is run to find duplicates based on city and company name. It brings back that facilities 1 and 2, 3 and 4, 5 and 6 are possible duplicates. The first two pairings are duplicates I need to go fix, but that 5 and 6 are indeed separate facilities. I click to record that facilities 5 and 6 are not duplicates, which records the data in a table. When query 1 is run again it does not return that 5 and 6 are possible duplicates.
For reference, the address duplicates look something like this, which is why there need to be multiple queries
Frank's Garage, 123 2nd St
Frank's Garage LLC, LLC, 123 Second st
Frank's Garage and muffler, 123 2nd Street
Frank's, 12 2nd st

The only way I know to fix this is to create a master table of company names and associate this table PK with records in original table. It will be a difficult and tedious process to review records and eliminate duplicates from master and associate remaining PK of a duplicate group to the original records (as you have discovered).
Create a master table of DISTINCT company and address data from original table. Include autonumber field to generate key. Join tables on company/address fields and UPDATE a field in original table with this key. Have another field in original table to receive a replacement foreign key.
Have a number field (ReplacementPK) in master table. Sort and review records and enter the key you want to retain for company/address duplicates group. Build a query joining tables on original key fields, update NewFK field in original table with selected ReplacementPK from master.
When all looks good:
Delete company and address and original FK fields from original table.
Delete records from master where PK does not match ReplacementPK.

Related

SQL Best way to return data from one table along with mapped data from another table

I have the following problem.
I have a table Entries that contains 2 columns:
EntryID - unique identifier
Name - some name
I have another EntriesMapping table (many to many mapping table) that contains 2 columns :
EntryID that refers to the EntryID of the Entries table
PartID that refers to a PartID in a seprate Parts table.
I need to write a SP that will return all data from Entries table, but for each row in the Entries table I want to provide a list of all PartID's that are registered in the EntriesMapping table.
My question is how do I best approach the deisgn of the solution to this, given that the results of the SP would regularly be processed by an app so performance is quite important.
1.
Do I write a SP that will select multiple rows per entry - where if there are more than one PartID's registered for a given entry - I will return multiple rows each having the same EntryID and Name but different PartID's
OR
2.
Do I write a SP that will select 1 row per entry in the Entries table, and have a field that is a string/xml/json that contains all the different PartID's.
OR
3. There is some other solution that I am not thinking of?
Solution 1 seems to me to be the better way to go, but I will be passing lots of repeating data.
Solution 2 wont pass extra data, but the string/json/xml would need to be processed additionally, resuling in larger cpu time per item.
PS: I feel like this is quite a common problem to solve, but I was unable to find any resource that can provide common solutions or some pros/cons to different approaches.
I think you need simple JOIN:
SELECT e.EntryId, e.Name, em.PartId
FROM Entries e
JOIN EntriesMapping em ON e.EntryId = em.EntryId
This will return what you want, no need for stored procedure for that.

SQL 2 JOINS USING SINGLE REFERENCE TABLE

I'm trying to achieve 2 joins. If I run the 1st join alone it pulls 4 lots of results, which is correct. However when I add the 2nd join which queries the same reference table using the results from the select statement it pulls in additional results. Please see attached. The squared section should not be being returned
So I removed the 2nd join to try and explain better. See pic2. I'm trying to get another column which looks up InvolvedInternalID against the initial reference table IRIS.Practice.idvClient.
Your database is simply doing as you tell it. When you add in the second join (confusingly aliased as tb1 in a 3 table query) the database is finding matching rows that obey the predicate/truth statement in the ON part of the join
If you don't want those rows in there then one of two things must be the case:
1) The truth you specified in the ON clause is faulty; for example saying SELECT * FROM person INNER JOIN shoes ON person.age = shoes.size is faulty - two people with age 13 and two shoes with size 13 will produce 4 results, and shoe size has nothing to do with age anyway
2) There were rows in the table joined in that didn't apply to the results you were looking for, but you forgot to filter them out by putting some WHERE (or additional restriction in the ON) clause. Example, a table holds all historical data as well as current, and the current record is the one with a NULL in the DeletedOn column. If you forget to say WHERE deletedon IS NULL then your data will multiply as all the past rows that don't apply to your query are brought in
Don't alias tables with tbX, tbY etc.. Make the names meaningful! Not only do aliases like tbX have no relation to the original table name (so you encounter tbX, and then have to go searching the rest of the query to find where it's declared so you can say "ah, it's the addresses table") but in this case you join idvclient in twice, but give them unhelpful aliases like tb1, tb3 when really you should have aliased them with something that describes the relationship between them and the rest of the query tables
For example, ParentClient and SubClient or OriginatingClient/HandlingClient would be better names, if these tables are in some relationship with each other.
Whatever the purpose of joining this table in twice is, alias it in relation to the purpose. It may make what you've done wriong easier to spot, for example "oh, of course.. i'm missing a WHERE parentclient.type = 'parent'" (or WHERE handlingclient.handlingdate is not null etc..)
The first step to wisdom is by calling things their proper names

MS-Access 2007: Query for names that have two or more different values in another field

Hello & thank you in advance.
I have an access db that has the following information about mammals we captured. Each capture has a unique ID, which is the capture table's primary key: "capture_id". The mammals (depending on species) have ear tags that we use to track them from year to year and day to day. These are in a field called "id_code". I have the sex of the mammal as it was recorded at capture time in another field called sex.
I want a query that will return all instances of an id_code IF the sex changes even once for that id.
Example: Animal E555 was caught 4 times, 3 times someone recorded this animal as a F and once as a M.
I've managed to get it to display this info by stacking about 5 queries on top of each other (Query for recaptured animals -> Query for all records of animals from 1st query -> Query for unique combo of id & sex (via just using those two columns & requiring "Unique Values") -> Query that pulls only duplicate id values from that last one and pulls back up all capture records of those ids). HOwever, this is clearly not the right way to do this, it is then not updateable (which I need since this is for data quality control) and for some reason it also returns duplicates of each of those records...
I realize that this could be solved two other ways:
Using R to pull up these records (I want none of this data to have to leave the database though, because we're working on getting it into one place after 35 years of collecting! And my boss can't use R and I'm seasonal, so I want him to just have to open a query)
Creating a table that tracks all animal id's as an animal index. However, this would make entering the data more difficult and also require someone to go back through 20,000 records and create a brand new animal id for every one because you can't give ear tags to voles & things so they don't get a unique identifier in the field.
Help!
It is quite simple to do with a single query. As a bonus, the query will be updatable, not duplicated, and simple to use:
SELECT mammals.ID, mammals.Sex, mammals.id_code, mammals.date_recorded
FROM mammals
WHERE mammals.id_code In
(select id_code from
(select distinct id_code, sex from [mammals]) a
group by id_code
having count(*)>1
);
The reason why you see a sub-query inside a sub-query is because Access does not support COUNT(DISTINCT). With any other "normal" database you would write:
SELECT mammals.ID, mammals.Sex, mammals.id_code, mammals.date_recorded
FROM mammals
WHERE mammals.id_code In
(select id_code
from [mammals]
group by id_code
having count(DISTINCT Sex)>1
);

Can this table structure work or should it change

I have an existing table which is expected to work for a new piece of functionality. I have the opinion that a new table is needed to achieve the objective and would like an opinion if it can work as is, or is the new table a must? The issue is a query returning more records than it should, I believe this is why:
There is a table called postcodes. Over time this has really become a town table because different town names have been entered so it has multiple records for most postcodes. In reference to the query below the relevant fields in the postcode table are:
postcode.postcode - the actual postcode, as mentioned this is not unique
postcode.twcid - is a foreign key to the forecast table, this is not unique either
The relevant fields in the forecast table are:
forecast.twcid - identifyer for the table however not unique because there four days worth of forecasts in the table. Only ever four, newver more, never less.
And here is the query:
select * from forecast
LEFT OUTER JOIN postcodes ON forecast.TWCID = postcodes.TWCID
WHERE postcodes.postcode = 3123
order by forecast.twcid, forecast.theDate;
Because there are two records in the postcode table for 3123 the results are doubled up. Two forecasts for day 1, two for day 2 etc......
Given that the relationship between postcodes and forecast is many to many (there are multiple records in the postcode tables for each postcode and twcid. And there are multiple records for each twcid in the forecast table because it always holds four days worth of forecasts) is there a way to re-write the query to only get four forecast records for a post code?
Or is my thought of creating a new postcode table which has unique records for each post code necessary?
You have a problem that postcodes can be in multiple towns. And towns can have multiple postcodes. In the United States, the US Census Bureau and the US Post Office have defined very extensive geographies for various coding schemes. The basic idea is that a zip code has a "main" town.
I would suggest that you either create a separate table with one row per postcode and the main town. Or, add a field to your database indicating a main town. You can guarantee uniqueness of this field with a filtered index:
create unique index postcode_postcode_maintown on postcodes(postcode) where IsMainTown = 1;
You might need the same thing for IsMainPostcode.
(Filtered indexes are a very nice feature in SQL Server.)
With this construct, you can change your query to:
select *
from forecast LEFT OUTER JOIN
postcodes
ON forecast.TWCID = postcodes.TWCID and postcodes.IsMainPostcode = 1
WHERE postcodes.postcode = 3123
order by forecast.twcid, forecast.theDate;
You should really never have a table without a primary key. Primary keys are, by definition, unique. The primary key should be the target for your foreign keys.
You're having problems because you're fighting against a poor database design.

How to merge two identical database data to one?

Two customers are going to merge. They are both using my application, with their own database. About a few weeks they are merging (they become one organisation). So they want to have all the data in 1 database.
So the two database structures are identical. The problem is with the data. For example, I have Table Locations and persons (these are just two tables of 50):
Database 1:
Locations:
Id Name Adress etc....
1 Location 1
2 Location 2
Persons:
Id LocationId Name etc...
1 1 Alex
2 1 Peter
3 2 Lisa
Database 2:
Locations:
Id Name Adress etc....
1 Location A
2 Location B
Persons:
Id LocationId Name etc...
1 1 Mark
2 2 Ashley
3 1 Ben
We see that person is related to location (column locationId). Note that I have more tables that is referring to the location table and persons table.
The databases contains their own locations and persons, but the Id's can be the same. In case, when I want to import everything to DB2 then the locations of DB1 should be inserted to DB2 with the ids 3 and 4. The the persons from DB1 should have new Id 4,5,6 and the locations in the person table also has to be changed to the ids 4,5,6.
My solution for this problem is to write a query which handle everything, but I don't know where to begin.
What is the best way (in a query) to renumber the Id fields also having a cascade to the childs? The databases does not containing referential integrity and foreign keys (foreign keys are NOT defined in the database). Creating FKeys and Cascading is not an option.
I'm using sql server 2005.
You say that both customers are using your application, so I assume that it's some kind of "shrink-wrap" software that is used by more customers than just these two, correct?
If yes, adding special columns to the tables or anything like this probably will cause pain in the future, because you either would have to maintain a special version for these two customers that can deal with the additional columns. Or you would have to introduce these columns to your main codebase, which means that all your other customers would get them as well.
I can think of an easier way to do this without changing any of your tables or adding any columns.
In order for this to work, you need to find out the largest ID that exists in both databases together (no matter in which table or in which database it is).
This may require some copy & paste to get a lot of queries that look like this:
select max(id) as maxlocationid from locations
select max(id) as maxpersonid from persons
-- and so on... (one query for each table)
When you find the largest ID after running the query in both databases, take a number that's larger than that ID, and add it to all IDs in all tables in the second database.
It's very important that the number needs to be larger than the largest ID that already exists in both databases!
It's a bit difficult to explain, so here's an example:
Let's say that the largest ID in any table in both databases is 8000.
Then you run some SQL that adds 10000 to every ID in every table in the second database:
update Locations set Id = Id + 10000
update Persons set Id = Id + 10000, LocationId = LocationId + 10000
-- and so on, for each table
The queries are relatively simple, but this is the most work because you have to build a query like this manually for each table in the database, with the correct names of all the ID columns.
After running the query on the second database, the example data from your question will look like this:
Database 1: (exactly like before)
Locations:
Id Name Adress etc....
1 Location 1
2 Location 2
Persons:
Id LocationId Name etc...
1 1 Alex
2 1 Peter
3 2 Lisa
Database 2:
Locations:
Id Name Adress etc....
10001 Location A
10002 Location B
Persons:
Id LocationId Name etc...
10001 10001 Mark
10002 10002 Ashley
10003 10001 Ben
And that's it! Now you can import the data from one database into the other, without getting any primary key violations at all.
If this were my problem, I would probably add some columns to the tables in the database I was going to keep. These would be used to store the pk values from the other db. Then I would insert records from the other tables. For the ones with foreign keys, I would use a known value. Then I would update as required and drop the columns I added.