Is there a way to exclude SQL results that are ALMOST duplicates? - sql

I have a query that runs daily that shows old and new member addresses as they are updated. The query works fine except for the times when a USPS address match is done in our core system and just changes some of the abbreviations.
For example:
Old Address - 1234 East Main Street
New Address - 1234 E Main St
I don't need to see these results.
I have tried removing rows based on unique fields in the core; however, the USPS match process creates all new fields, so the query can't remove rows based on that information.
The main SP for this is:
INSERT INTO #results
SELECT
distinct i.INDIVIDUAL_ID,
i.FIRST_NAME,
i.MIDDLE_NAME,
i.LAST_NAME,
i.D1NAME,
CurrentAddress.ADDRESS1,
PreviousAddress.ADDRESS1,
CurrentAddress.ADDRESS2,
PreviousAddress.ADDRESS2,
CurrentAddress.ADDRESS3,
PreviousAddress.ADDRESS3,
CurrentAddress.CITY,
PreviousAddress.CITY,
CurrentAddress.STATE,
PreviousAddress.STATE,
CurrentAddress.ZIP_STR,
PreviousAddress.ZIP_STR,
CurrentAddress.ZIP4_STR,
PreviousAddress.ZIP4_STR,
CurrentAddress.COUNTRY,
PreviousAddress.COUNTRY
FROM INDIVIDUAL i
INNER JOIN MEMBERSHIPPARTICIPANT mpt
ON i.INDIVIDUAL_ID = mpt.INDIVIDUAL_ID
AND i.DL_LOAD_DATE = mpt.DL_LOAD_DATE
INNER JOIN AGR_MEMBERTOTAL_TODAY m
ON mpt.MEMBER_NBR = m.MEMBER_NBR
AND mpt.DL_LOAD_DATE = m.DL_LOAD_DATE
INNER JOIN BRANCH b
ON i.BRANCH_NBR = b.BRANCH_NBR
CROSS APPLY dbo.GetCurrentAddress(i.INDIVIDUAL_ID, @latestDate) AS CurrentAddress
CROSS APPLY dbo.GetCurrentAddress(i.INDIVIDUAL_ID, @previousDate) AS PreviousAddress
WHERE i.DL_LOAD_DATE = @latestDate
AND ( m.OPN_LN_ALL_CNT > 0 OR m.OPN_SV_ALL_CNT > 0 )
order by i.FIRST_NAME asc
DELETE #results
WHERE Address1_Today = Address2_Yesterday
AND Address2_Today = Address1_Yesterday
SELECT *
FROM #results
WHERE (Address1_Today != Address1_Yesterday
OR Address2_Today != Address2_Yesterday
OR Address3_Today != Address3_Yesterday
OR City_Today != City_Yesterday
OR State_Today != State_Yesterday
OR ZipCode_Today != ZipCode_Yesterday
--OR FullZip_Today != FullZip_Yesterday
OR Country_Today != Country_Yesterday)
I'd like to remove the almost duplicate rows
For example:
Old Address - 1234 East Main Street
New Address - 1234 E Main St

There isn't a built-in way to test this via SQL; it will have to be defined by logic in a procedure. The first thing I'd do is group the substrings in both Old Address and New Address by the count of those substrings. For the rows where the counts equal each other, you can split the address by spaces and break it up. Think of each address field as three parts: [street_nbr, street_nm, street_suffix]. The street_nm can have an abbreviated prefix, which is why grouping by the count of substrings is important, since a prefix increases the count past 3. Secondary lookup tables that match the words/abbreviations you identify can then be used to "un-duplicate" those suffixes and prefixes.
CREATE TABLE lookup_abbreviations(
unabbreviated_name varchar(50),
abbreviated_name varchar(50));
INSERT INTO lookup_abbreviations(unabbreviated_name, abbreviated_name)
VALUES ('East', 'E')
INSERT INTO lookup_abbreviations(unabbreviated_name, abbreviated_name)
VALUES ('Street', 'St');
-- Use Cross Applies and functions(LEN, LEFT, RIGHT, CHARINDEX, SUBSTRING) to split the address
-- into equal parts. This is where you'll have to figure out the best logic for grouping.
SELECT DISTINCT
    Old_Street_Nbr = SUBSTRING(Old_Address, 1, CHARINDEX(' ', Old_Address) - 1),
    Old_Street_Nm_Prefix = CASE WHEN /*Here is where the count of substrings is tested*/ END,
    Old_Street_Nm = CASE WHEN /*Here is where the count of substrings is tested*/ END,
    Old_Street_Suffix = [] /* the last substring, after the final space */
INTO #AbbreviationSort
FROM Results;
SELECT
    Old_Street_Nbr,
    Old_Street_Nm_Prefix = CASE
                               WHEN Old_Street_Nm_Prefix IN (SELECT abbreviated_name
                                                             FROM lookup_abbreviations)
                               THEN (SELECT unabbreviated_name
                                     FROM lookup_abbreviations
                                     WHERE abbreviated_name = Old_Street_Nm_Prefix)
                               ELSE Old_Street_Nm_Prefix
                           END
INTO #SortedAddresses
FROM #AbbreviationSort;
-- The New_* columns are assumed to have been split and un-abbreviated the same way.
SELECT DISTINCT * FROM
(
    SELECT Old_Street_Nbr, Old_Street_Nm_Prefix FROM #SortedAddresses
    UNION ALL
    SELECT New_Street_Nbr, New_Street_Nm_Prefix FROM #SortedAddresses
) AS DupSearch
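To tie the lookup table back to the original #results query, here is a minimal sketch of another way to use it: normalize both versions of the address by swapping every unabbreviated word for its abbreviation, then delete the rows whose normalized strings match. The function name dbo.NormalizeAddress is hypothetical, and the word-boundary handling (padding with spaces) is deliberately crude; it assumes the lookup table stays small.

CREATE FUNCTION dbo.NormalizeAddress (@address varchar(200))
RETURNS varchar(200)
AS
BEGIN
    -- Pad with spaces so ' East ' only matches whole words, and upper-case for comparison.
    DECLARE @result varchar(210) = ' ' + UPPER(@address) + ' ';
    DECLARE @word varchar(50), @abbr varchar(50);

    DECLARE abbr_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT unabbreviated_name, abbreviated_name FROM lookup_abbreviations;

    OPEN abbr_cur;
    FETCH NEXT FROM abbr_cur INTO @word, @abbr;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @result = REPLACE(@result, ' ' + UPPER(@word) + ' ', ' ' + UPPER(@abbr) + ' ');
        FETCH NEXT FROM abbr_cur INTO @word, @abbr;
    END;
    CLOSE abbr_cur;
    DEALLOCATE abbr_cur;

    RETURN LTRIM(RTRIM(@result));
END;
GO

-- Rows whose only difference is abbreviation style normalize to the same string,
-- so they can be removed from #results before the final SELECT.
DELETE r
FROM #results r
WHERE dbo.NormalizeAddress(r.Address1_Today) = dbo.NormalizeAddress(r.Address1_Yesterday);

With the two lookup rows inserted above, '1234 East Main Street' and '1234 E Main St' both normalize to '1234 E MAIN ST', so that pair no longer shows up in the daily report. The same comparison can be repeated for Address2, City, State and so on if those can also differ only by abbreviation.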


Access query inconsistently treats empty string as null

I have an application that is grabbing data from an Access database. I am seeking the minimum value of a column and the results I am getting back are inconsistent.
Have I run into a feature where Access inconsistently treats an empty string as a null depending on whether I add a filter or not, or is there something wrong with the way I am querying the data?
The column contains one blank value (not null) and several non-blank values that are all identical (about 30 instances of 'QLD'). The query I am using has a filter that involves multiple other tables, so that only the blank value and about half of the 'QLD' values are eligible.
It's probably easier to show the code and the effects than to describe them. I have created a series of unioned queries which 'should' bring back identical results but do not.
Query:
SELECT 'min(LOC_STATE)' as Category
, min(LOC_STATE) as Result
FROM pay_run, pay_run_employee, employee, department, location
WHERE pr_id = pre_prid
AND em_location = loc_id
AND pre_empnum = em_empnum
AND em_department = dm_id
AND pr_date >= #2/24/2015#
AND pr_date <= #2/24/2016#
UNION ALL
(SELECT TOP 1 'top 1 LOC_STATE'
, LOC_STATE
FROM pay_run, pay_run_employee, employee, department, location
WHERE pr_id = pre_prid
AND em_location = loc_id
AND pre_empnum = em_empnum
AND em_department = dm_id
AND pr_date >= #2/24/2015#
AND pr_date <= #2/24/2016#
ORDER BY LOC_STATE)
UNION ALL
SELECT 'min unfiltered', min(loc_state)
FROM location
UNION ALL
(SELECT TOP 1 'iif is null', iif(loc_state is null, 'a', loc_state)
FROM location
ORDER BY loc_state)
Results:
Category Result
min(LOC_STATE) 'QLD'
top 1 LOC_STATE ''
min unfiltered ''
iif is null ''
If I do a minimum with the filter it brings back 'QLD' and not the empty string. At this stage it is possible that the empty string is not being included because it is treated as a null or the filter removes it.
The second query, which brings back the top 1 state using the filter, shows that the empty string is not filtered out, which means that the Min function is ignoring the empty string.
The third query, which gets the minimum of the unfiltered table, brings back the empty string - so the minimum function does not exclude empty strings / treat them as null.
The fourth query ensures that there is not a null in the empty string position.
My conclusion is that perhaps the inclusion of other tables and filter criteria is causing the empty string value to be treated as a null, but I feel that I must be missing something.
NB: I have a very similar query (date literals altered) that executes against the same data imported into a SQL Server database. It is correctly returning '' for all 4 queries.
Does anyone know why the empty string is ignored by the Min function in the first query?
PS: for those who prefer a query with joins
SELECT 'min(LOC_STATE)' as Category
, min(LOC_STATE) as Result
FROM (((pay_run
INNER JOIN pay_run_employee ON pay_run.pr_id = pay_run_employee.pre_prid)
INNER JOIN employee ON pay_run_employee.pre_empnum = employee.em_empnum)
INNER JOIN department ON employee.em_department = department.dm_id)
INNER JOIN location on employee.em_location = location.loc_id
WHERE
PR_DATE >= #2/24/2015# and
PR_DATE <= #2/24/2016#
union all
(SELECT TOP 1 'TOP 1 LOC_STATE'
, LOC_STATE
FROM (((pay_run
INNER JOIN pay_run_employee ON pay_run.pr_id = pay_run_employee.pre_prid)
INNER JOIN employee ON pay_run_employee.pre_empnum = employee.em_empnum)
INNER JOIN department ON employee.em_department = department.dm_id)
INNER JOIN location on employee.em_location = location.loc_id
WHERE
PR_DATE >= #2/24/2015# and
PR_DATE <= #2/24/2016#
order by LOC_STATE)
union all
select 'min unfiltered', min(loc_state)
from location
This has got nothing to do with corrupt data or unions or joins. The problem can easily be made visible by executing the following queries in Access:
create table testbug (Field1 varchar (255) NULL)
insert into testbug (Field1) values ('a')
insert into testbug (Field1) values ('')
insert into testbug (Field1) values ('c')
select min(field1) from testbug
In my opinion this is a bug in MS Access. When the MIN function in MS Access comes across an empty string ('') it forgets all the values it has come across so far and returns the minimum of the values below the empty string (in my simple example, only the value 'c').
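If you just need the smallest non-null value and cannot rely on Min(), one possible workaround (sketched against the same testbug table) is to let ORDER BY do the work, since sorting in Access does place the empty string first, as the "top 1 LOC_STATE" result in the question already shows:

SELECT TOP 1 Field1
FROM testbug
WHERE Field1 IS NOT NULL
ORDER BY Field1

On the test data above this returns '' rather than 'a', which is the value Min() should have produced.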

How to group by more than one row value?

I am working with PostgreSQL and I can't figure out how to solve a problem. I have a model called Foobar. Some of its attributes are:
FOOBAR
check_in:datetime
qr_code:string
city_id:integer
In this table there is a lot of redundancy (qr_code is not unique), but that is not my problem right now. What I am trying to get are the foobars that have the same qr_code, have been in a well-known group of cities, and have checked in at different moments.
I got this by querying:
SELECT * FROM foobar AS a
WHERE a.city_id = 1
AND EXISTS (
SELECT * FROM foobar AS b
WHERE a.check_in < b.check_in
AND a.qr_code = b.qr_code
AND b.city_id = 2
AND EXISTS (
SELECT * FROM foobar as c
WHERE b.check_in < c.check_in
AND c.qr_code = b.qr_code
AND c.city_id = 3
AND EXISTS(...)
)
)
where '...' represents more nested queries to get more persons with the same qr_code, different check_in dates, and those well-known cities.
My problem is that I want to group this by qr_code, and I want to show the check_in fields of each qr_code like this:
2015-11-11 14:14:14 => [2015-11-11 14:14:14, 2015-11-11 16:16:16, 2015-11-11 17:18:20] (this for each different qr_code)
where the date at the left is the 'smallest' date for that qr_code, and the right part is all the dates for that qr_code, including the first one.
Is this possible to do with a SQL query only? I am asking because I am actually building this app with Rails, and I know that I could take a different approach with Ruby array methods (a solution along those lines would be well received too).
You could solve that with a recursive CTE - if I interpret your question correctly:
Assuming you have a given list of cities that must be visited in order by the same qr_code. Your text doesn't say so, but your query indicates as much.
WITH RECURSIVE
c AS (SELECT '{1,2,3}'::int[] AS cities) -- your list of city_id's here
, route AS (
SELECT f.check_in, f.qr_code, 2 AS idx
FROM foobar f
JOIN c ON f.city_id = c.cities[1]
UNION ALL
SELECT f.check_in, f.qr_code, r.idx + 1
FROM route r
JOIN foobar f USING (qr_code)
JOIN c ON f.city_id = c.cities[r.idx]
WHERE r.check_in < f.check_in
)
SELECT qr_code, array_agg(check_in) AS check_in_list
FROM (
   SELECT *
   FROM   route
   ORDER  BY qr_code, idx -- or check_in
   ) sub
GROUP  BY 1
HAVING count(*) = (SELECT array_length(cities, 1) FROM c);
Provide the list as array in the first (non-recursive) CTE c.
In the recursive part start with any rows in the first city and travel along your array until the last element.
In the final SELECT aggregate your check_in column in order. Only return qr_code that have visited all cities of the array.
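To make the behaviour concrete, here is a hypothetical run with the city list {1,2,3} and two qr_codes; the table and column names come from the question, the rows are invented for illustration:

-- 'abc' visits cities 1, 2 and 3 in chronological order; 'xyz' skips city 2.
INSERT INTO foobar (check_in, qr_code, city_id) VALUES
  ('2015-11-11 14:14:14', 'abc', 1),
  ('2015-11-11 16:16:16', 'abc', 2),
  ('2015-11-11 17:18:20', 'abc', 3),
  ('2015-11-11 09:00:00', 'xyz', 1),
  ('2015-11-11 10:00:00', 'xyz', 3);

-- The query above then returns a single row:
-- qr_code | check_in_list
-- abc     | {"2015-11-11 14:14:14","2015-11-11 16:16:16","2015-11-11 17:18:20"}
-- 'xyz' is filtered out by the HAVING clause because it never reaches city 2.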
Similar:
Recursive query used for transitive closure

SQL Combining results from multiple tables, and rows, in to one row in one table

so here's my situation.
I have two tables: keysetdata115, which contains vendor information, and keysetdata117, which contains either a Remit or a Payment address.
Here are the structures with one sample entry:
keysetdata115:
keysetnum ks183 ks178 ks184 ks185 ks187 usagecount
2160826 1 6934 AUDIO DIGEST FOUNDATION 26-1180877 A 0
keysetdata117 (I truncated values for ks192 and ks191 to fit formatting)
keysetnum ks183 ks178 ks188 ks189 ks190 ks192 ks191 usagecount
2160827 1 6934 P001 P EBSCO... TOP OF... A 0
2160828 1 6934 R002 R EBSCO... 123 SE... A 0
There is no 1:1 relationship, and the only thing that makes a record unique is the combination of Remit Code, Payment Code, vendor number and vendor group. The codes can only be obtained by referencing the address and/or name.
Ideally what I'd like to do is set this up so that I can pass in the addresses and return all the related values.
I'm dumping this into a table called 'dbo.test' right now (for testing, obviously), which has the following entries and what they correspond to in the above tables: vengroup (ks183), vendnum (ks178), remit (ks188), payment (ks188)... ks188 will be a remit or a payment based on the value in ks189.
This is what I'm doing so far, using 3 select queries and it works, but there's a lot of redundancy and it's very inefficient.
Any suggestions on how I can streamline it would be MUCH appreciated.
insert into dbo.test (vengroup,vendnum)
select ks183, ks178
from hsi.keysetdata115
where ks184 like 'AUDIO DIGEST%'
update dbo.test
set dbo.test.remit = y.remit
from
dbo.test tst
INNER JOIN
(Select ksd.ks188 as remit, ksd.ks183 as vengroup, ksd.ks178 as vendnum
from hsi.keysetdata117 ksd
inner join dbo.test tst
on tst.vengroup = ksd.ks183 and tst.vendnum = ksd.ks178
where ksd.ks190 like 'EBSCO%' and ks189 = 'R') y
on tst.vengroup = y.vengroup and tst.vendnum = y.vendnum
update dbo.test
set dbo.test.payment = y.payment
from
dbo.test tst
INNER JOIN
(Select ksd.ks188 as payment, ksd.ks183 as vengroup, ksd.ks178 as vendnum
from hsi.keysetdata117 ksd
inner join dbo.test tst
on tst.vengroup = ksd.ks183 and tst.vendnum = ksd.ks178
where ksd.ks190 like 'EBSCO%' and ks189 = 'P') y
on tst.vengroup = y.vengroup and tst.vendnum = y.vendnum
Thanks so much for any suggestions!
You can do what you want in one statement. You just have to do the selection on the fly. The way the statement below is written, if Remit gets the value, Payment gets a null and vice versa. If you want the other value to be non-null, just add an ELSE clause to the cases, like then b.ks188 else 0 end.
INSERT INTO dbo.TEST( vengroup, vendnum, remit, payment )
SELECT a.ks183, a.ks178,
CASE b.ks189 WHEN 'R' THEN b.ks188 END,
CASE b.ks189 WHEN 'P' THEN b.ks188 END
FROM keysetdata115 a
JOIN keysetdata117 b
ON b.ks183 = a.ks183
AND b.ks178 = a.ks178
AND b.ks190 LIKE 'EBSCO%'
WHERE a.ks184 LIKE 'AUDIO DIGEST%';
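Note that with one 'R' row and one 'P' row per vendor in keysetdata117, the statement above still yields two rows per vendor (one with remit filled, one with payment filled). If you want them collapsed into a single row, here is a sketch using conditional aggregation, assuming ks183 + ks178 identify the vendor:

INSERT INTO dbo.TEST( vengroup, vendnum, remit, payment )
SELECT a.ks183, a.ks178,
       MAX(CASE b.ks189 WHEN 'R' THEN b.ks188 END) AS remit,
       MAX(CASE b.ks189 WHEN 'P' THEN b.ks188 END) AS payment
FROM keysetdata115 a
JOIN keysetdata117 b
    ON  b.ks183 = a.ks183
    AND b.ks178 = a.ks178
    AND b.ks190 LIKE 'EBSCO%'
WHERE a.ks184 LIKE 'AUDIO DIGEST%'
GROUP BY a.ks183, a.ks178;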

SQL query to retrieve discrepancies in punch order

Consider the table below.
The rule is: an employee cannot take a break (he needs to clock out) from job num 1 before clocking in to job num 2. In this case the employee "A" was supposed to clock OUT instead of BREAK on jobnum 1, because he later clocked in to jobnum 2.
Is it possible to write a query to find this in plain SQL?
The idea is to check whether the next record is the proper one. To find the next record, one has to find the first punchtime after the current one for the same employee. Once this information is retrieved, one can isolate the record itself and check the fields of interest, specifically whether jobnum is the same and [optionally] whether punch_type is 'IN'. If it is not, NOT EXISTS evaluates to true and the record is output.
select *
from #punch p
-- Isolate breaks only
where p.punch_type = 'BREAK'
-- The ones having no proper entry
and not exists
(
select null
-- The same table
from #punch a
where a.emplid = p.emplid
and a.jobnum = p.jobnum
-- Next record has punchtime from subquery
and a.punchtime = (select min (n.punchtime)
from #punch n
where n.emplid = p.emplid
and n.punchtime > p.punchtime
)
-- Optionally you might force next record to be 'IN'
and a.punch_type = 'IN'
)
Replace #punch with your table name. -- marks a comment in SQL Server; if you are not using that database, remove those lines. It is a good idea to tag your database and version, as there are probably faster/better ways to do this.
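The same idea can also be expressed with a window function instead of the correlated subqueries; here is a sketch assuming SQL Server 2012 or later (for LEAD), again against #punch:

SELECT x.*
FROM (
    SELECT p.*,
           -- What the very next punch for this employee looks like
           LEAD(p.jobnum)     OVER (PARTITION BY p.emplid ORDER BY p.punchtime) AS next_jobnum,
           LEAD(p.punch_type) OVER (PARTITION BY p.emplid ORDER BY p.punchtime) AS next_punch_type
    FROM #punch p
) x
WHERE x.punch_type = 'BREAK'
  -- A BREAK is only proper if the next punch is an IN on the same job
  AND (x.next_jobnum IS NULL
       OR x.next_jobnum <> x.jobnum
       OR x.next_punch_type <> 'IN');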
Here is the SQL
select *
from employees e1
cross join employees e2
where e2.JOBNUM = (e1.JOBNUM + 1)
  and e1.PUNCH_TYPE = 'BREAK' and e2.PUNCH_TYPE = 'IN'
  and e1.PUNCHTIME < e2.PUNCHTIME
  and e1.EMPLID = e2.EMPLID

Selecting elements that don't exist

I am working on an application that has to assign numeric codes to elements. These codes are not consecutive, and my idea is not to insert them in the database until the related element exists, but I would like to find, in SQL, the codes that are not yet assigned, and I don't know how to do it.
Any ideas?
Thanks!!!
Edit 1
The table can be so simple:
code | element
-----------------
3 | three
7 | seven
2 | two
And I would like something like this: 1, 4, 5, 6. Without any other table.
Edit 2
Thanks for the feedback, your answers have been very helpful.
This will return NULL if a code is not assigned:
SELECT assigned_codes.code
FROM codes
LEFT JOIN
assigned_codes
ON assigned_codes.code = codes.code
WHERE codes.code = #code
This will return all non-assigned codes:
SELECT codes.code
FROM codes
LEFT JOIN
assigned_codes
ON assigned_codes.code = codes.code
WHERE assigned_codes.code IS NULL
There is no pure SQL way to do exactly the thing you want.
In Oracle, you can do the following:
SELECT lvl
FROM (
SELECT level AS lvl
FROM dual
CONNECT BY
level <=
(
SELECT MAX(code)
FROM elements
)
)
LEFT OUTER JOIN
elements
ON code = lvl
WHERE code IS NULL
In PostgreSQL, you can do the following:
SELECT lvl
FROM generate_series(
1,
(
SELECT MAX(code)
FROM elements
)) lvl
LEFT OUTER JOIN
elements
ON code = lvl
WHERE code IS NULL
Contrary to the assertion that this cannot be done using pure SQL, here is a counter example showing how it can be done. (Note that I didn't say it was easy - it is, however, possible.) Assume the table's name is value_list with columns code and value as shown in the edits (why does everyone forget to include the table name in the question?):
SELECT b.bottom, t.top
FROM (SELECT l1.code - 1 AS top
FROM value_list l1
WHERE NOT EXISTS (SELECT * FROM value_list l2
WHERE l2.code = l1.code - 1)) AS t,
(SELECT l1.code + 1 AS bottom
FROM value_list l1
WHERE NOT EXISTS (SELECT * FROM value_list l2
WHERE l2.code = l1.code + 1)) AS b
WHERE b.bottom <= t.top
AND NOT EXISTS (SELECT * FROM value_list l2
WHERE l2.code >= b.bottom AND l2.code <= t.top);
The two parallel queries in the from clause generate values that are respectively at the top and bottom of a gap in the range of values in the table. The cross-product of these two lists is then restricted so that the bottom is not greater than the top, and such that there is no value in the original list in between the bottom and top.
On the sample data, this produces the range 4-6. When I added an extra row (9, 'nine'), it also generated the range 8-8. Clearly, you also have two other possible ranges for a suitable definition of 'infinity':
-infinity .. MIN(code)-1
MAX(code)+1 .. +infinity
Note that:
If you are using this routinely, there will generally not be many gaps in your lists.
Gaps can only appear when you delete rows from the table (or you ignore the ranges returned by this query or its relatives when inserting data).
It is usually a bad idea to reuse identifiers, so in fact this effort is probably misguided.
However, if you want to do it, here is one way to do so.
This is the same idea that Quassnoi has published; I just linked all the ideas together in T-SQL-like code.
CREATE TABLE #series (n int)

DECLARE
    @max_n int,
    @i int

SET @i = 1

-- max value in elements table
SELECT @max_n = MAX(code) FROM elements

-- fill #series table with numbers from 1 to n
WHILE @i < @max_n BEGIN
    INSERT INTO #series (n) VALUES (@i)
    SET @i = @i + 1
END

-- unassigned codes -- the ones without a pair in the elements table
SELECT
    series.n
FROM
    #series AS series
LEFT JOIN
    elements
ON
    elements.code = series.n
WHERE
    elements.code IS NULL
EDIT:
This is, of course, not an ideal solution. If you have a lot of elements, or you check for non-existing codes often, this could cause performance issues.
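A set-based sketch of the same approach for SQL Server, which avoids the row-by-row loop by generating the series with ROW_NUMBER() over a catalog view (any sufficiently large row source would do; sys.all_objects is just convenient):

DECLARE @max_n int = (SELECT MAX(code) FROM elements);

WITH series AS (
    SELECT TOP (@max_n)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b   -- plenty of rows even for large code ranges
)
SELECT series.n
FROM series
LEFT JOIN elements
    ON elements.code = series.n
WHERE elements.code IS NULL;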