Finding Duplicate and Missing (null) Values in Oracle Table

Finding Duplicate and Missing (null) Values in Oracle Table - sql

I am looking for errors in a table and want to report both duplicates and missing values. I am unsure of the best way to do this and am looking for advice on a better way to accomplish this. This is in Oracle 12c.
This appears to achieve the desired result:
SELECT a.id,
a.mainfield,
a.location,
b.counter
FROM maintable a
INNER JOIN (
SELECT mainfield,
Count(*) counter
FROM maintable
GROUP BY mainfield
HAVING Count(mainfield) > 1 OR mainfield IS NULL
) b ON a.mainfield = b.mainfield OR
( a.mainfield IS NULL AND b.mainfield IS NULL )
ORDER BY a.mainfield;
This works and gives me the ID, the potentially null MAINFIELD, the location and a count of either the duplicate MAINFIELD values or the null MAINFIELD values.
Is there something simpler or potentially more efficient that I could be using? I have to admit that my SQL skills are quite rusty.
Sample data may or may not help, but the ID is the primary key, is a number and not nullable. The other fields are both NVARCHAR2 and nullable. None of those are indexed. Here is what the output could look like. Some records are outright errors. Some are obvious typos. Some appear to be test data.
ID MAINFIELD LOCATION COUNTER
------- --------- --------------------------------- -------
16626 206000650 9A OLIVER ST CENTRAL STATION 2
18805 206000650 3 SWIFT CT CENTRAL STATION 2
22409 940000170 2 MARKET ST NEWARK DE 2
22003 940000170 1 MARKET ST NEWARK NJ 2
29533 970000030 95 MILL RD ANDOVER 2
20256 970000030 12 RAILROAD AVE 2
29018 978900050 44 BROAD STREET 2
28432 978900050 WASHINGTON ST AND HAMILTON AVE 2
21831 980700050 BROADWAY NEWTOWN 2
24147 980700050 MAIN STREET LEVITTOWN 2
26418 3
26738 TEST DATA 3
26755 3
The last three rows have a null MAINFIELD and there are three such records (two of which have the location null also).
After adding some insight into the data above, I realized I might consider using NVL to eliminate part of the conditions, like this (assuming the value I chose would not be a valid value in the mainfield):
SELECT a.id,
a.mainfield,
a.location,
b.counter
FROM maintable a
INNER JOIN (
SELECT mainfield,
Count(*) counter
FROM maintable
GROUP BY mainfield
HAVING Count(mainfield) > 1 OR mainfield IS NULL
) b ON NVL(a.mainfield,'***NULL***') = NVL(b.mainfield.'***NULL***')
ORDER BY a.mainfield;
This executes a bit quicker and seems to produce the desired result. I have been trying other alternatives without success, so this may be the best alternative.
One alternative that I discarded which might be appropriate for a slightly different scenario (but was the worst performer for me) is this one:
SELECT id,
mainfield,
location,
COUNT (id) OVER (PARTITION BY mainfield) counter
FROM maintable a
WHERE mainfield IS NULL
OR EXISTS(SELECT 1 from maintable b
WHERE mainfield = a.mainfield AND ROWID <> a.ROWID)
ORDER BY a.mainfield;
I just really liked the way this was put together and was hopeful that it would be somewhat efficient. We are not talking that it runs for days, but I am trying to re-learn in Oracle what might have once been a skill back when I was coding with SQL/DS.
If any of the above gives anyone an idea of a better alternative, I am all ears. (For example, is there a way to reference the counter [the COUNT (id) over PARTITION BY mainfield] in the WHERE clause?)
Thanks again.

This seems to be a good compromise between readability and reliability and efficiency which was offered by Balazs Papp on the dba.stackexchange.com board:
https://dba.stackexchange.com/a/210998/154392
SELECT * FROM (
SELECT id,
mainfield,
location,
COUNT (id) OVER (PARTITION BY mainfield) counter
FROM maintable a
) where counter > 1 or mainfield IS NULL
ORDER BY mainfield;
This is a simplification of the last alternative of the original post. It does not appear to be more inefficient than my original alternative (as far as I can tell), but to me it is more readable.

Related

Snowflake Unpivot data with boolean

I have data as following
STORE_NO STORE_ADDRESS STORE_TYPE STORE_OWNER STORE_HOURS
1 123 Drive Thru Harpo 24hrs
1 123 Curbside Harpo 24hrs
1 123 Counter Harpo 24hrs
2 456 Drive Thru Groucho 9 to 9
2 456 Counter Groucho 9 to 9
And I want to pivot it as following.
STORE_NO STORE_ADDRESS Drive Thru Curbside Counter STORE_OWNER STORE_HOURS
1 123 TRUE TRUE TRUE Harpo 24hrs
2 456 TRUE FALSE TRUE Groucho 9 to 9
Here is what I have
select *
from stores
pivot(count(STORE_TYPE) for STORE_TYPE in ('Drive Thru', 'Curbside', 'Counter'))
as store_flattened;
But this returns a 1 or a 0. How do I convert to TRUE / FALSE without making this a CTE?

If you are ok with putting column names rather then select *, then following can be used -
select STORE_NO,STORE_ADDRESS,STORE_OWNER,STORE_HOURS,
"'Drive Thru'"=1 as drivethru,
"'Curbside'"=1 as curbside,
"'Counter'"=1 as counter
from stores
pivot(count(STORE_TYPE) for STORE_TYPE in ('Drive Thru', 'Curbside', 'Counter'))
as store_flattened;

I honestly think you should leave it as is. Any attempt at a workaround will result in either complicating the pivot logic or having to manually specify the columns names in multiple places; especially with pivoted columns appearing before the rest. Having said that, if you must find a way to do this, here is an attempt.
I know you wanted to avoid a CTE, but I am using it for a purpose different than what you might had in mind. General idea in steps--
In a CTE, sub-select some of the columns you want to appear before
the pivoted columns. Create a flag based on whether store_type
(b.value) from the lateral flatten matches existing store_type
for a given row. You'll notice the values passed to input=> can be easily copy-pasted to the pivot clause
Pivot using max(flag) which will turn (false,true)->true and
(false,false)->false. You can run the CTE portion to see why that
matters and how it solves the main issue
Finally, use a natural join with the main table to append the rest of
the columns (this is the first time I found a natural join useful enough to keep it. I actively avoid them otherwise)
with cte (store_no, store_address, store_type, flag) as
(select store_no, store_address, b.value::string, b.value::string = store_type
from t, lateral flatten(input=>['Drive Thru', 'Curbside', 'Counter']) b)
select *
from cte pivot(max(flag) for store_type in ('Drive Thru', 'Curbside', 'Counter'))
natural join (select distinct store_no, store_owner, store_hours from t)
Outputs:

SQL server - How to find the highest number in '<> ' in a text column?

Lets say I have the following data in the Employee table: (nothing more)
ID FirstName LastName x
-------------------------------------------------------------------
20 John Mackenzie <A>te</A><b>wq</b><a>342</a><d>rt21</d>
21 Ted Green <A>re</A><b>es</b><1>t34w</1><4>65z</4>
22 Marcy Nate <A>ds</A><b>tf</b><3>fv 34</3><6>65aa</6>
I need to search in the X column and get highest number in <> these brackets
What sort of SELECT statement can get me, for example, the number 6 like in <6>, in the x column?

This type of query generally works on finding patterns, I consider that the <6> is at the 9th position from left.
Please note if the pattern changes the below query will not work.
SELECT A.* FROM YOURTABLE A INNER JOIN
(SELECT TOP 1 ID,Firstname,Lastname,SUBSTRING(X,LEN(X)-9,1) AS [ORDER]
FROM YOURTABLE
WHERE ISNUMERIC(SUBSTRING(X,LEN(X)-9,1))=1
ORDER BY SUBSTRING(X,LEN(X)-9,1))B
ON
A.ID=B.ID AND
A.FIRSTNAME=B.FIRSTNAME AND
A.LASTNAME=B.LASTNAME

Select distinct values for a particular column choosing arbitrarily from duplicates

I have health data relating to deaths. Individual should die once maximum. In the database they sometimes don't; probably because causes of death were changed but the original entry was not deleted. I don't really understand how this was allowed to happen, but it has. So, as a made up example, I have:
Row_number | Individual_ID | Cause_of_death | Date_of_death
------------+---------------+-----------------------+---------------
1 | 1 | Stroke | 3 march 2008
2 | 2 | Myocardial infarction | 1 jan 2009
3 | 2 | Pulmonary Embolus | 1 jan 2009
I want each individual to have only one cause of death.
In the example, I want a query that returns row 1 and either row 2 or row 3 (not both). I have to make an arbitrary choice between rows 2 and 3 because there is no timestamp in any of the fields that can be used to determine which is the revision; it's not ideal but is unavoidable.
I can't make the SQL work to do this. I've tried inner joining distinct Individual_ID to the other fields, but this still gives all the rows. I've tried adding a 'having count(Individual_ID) = 1' clause with it. This leaves out people with more than one cause of death completely. Suggestions on the internet seem to be based on using a timestamped field to choose the most recent, but I don't have that.
IBM DB2. Windows XP. Any thoughts gratefully received.

Have you tried using MIN (or MAX) against the cause of death. (and the date of death, if they died on two different dates)
SELECT IndividualID, MIN(Cause_Of_Death), MIN (Date_Of_Death)
from deaths
GROUP BY IndividualID

I don't know DB2 so I'll answer in general. There are two main approaches:
select *
from T
join (
select keys, min(ID) as MinID
from T
group by keys
) on T.ID = MinID
And
select *, row_number() over (partition by keys) as r
from T
where r = 1
Both return all rows, no matter if duplicate or not. But they returns only one duplicate per "key".
Notice, that both statements are pseudo-SQL.

The row_number() approach is probably preferable from a performance standpoint. Here is usr's example, in DB2 syntax:
select * from (
select T.*, row_number() over (partition by Individual_ID) as r
from T
)
where r=1;

How to group by a column

Hi I know how to use the group by clause for sql. I am not sure how to explain this so Ill draw some charts. Here is my original data:
Name Location
----------------------
user1 1
user1 9
user1 3
user2 1
user2 10
user3 97
Here is the output I need
Name Location
----------------------
user1 1
9
3
user2 1
10
user3 97
Is this even possible?

The normal method for this is to handle it in the presentation layer, not the database layer.
Reasons:
The Name field is a property of that data row
If you leave the Name out, how do you know what Location goes with which name?
You are implicitly relying on the order of the data, which in SQL is a very bad practice (since there is no inherent ordering to the returned data)
Any solution will need to involve a cursor or a loop, which is not what SQL is optimized for - it likes working in SETS not on individual rows

Hope this helps
SELECT A.FINAL_NAME, A.LOCATION
FROM (SELECT DISTINCT DECODE((LAG(YT.NAME, 1) OVER(ORDER BY YT.NAME)),
YT.NAME,
NULL,
YT.NAME) AS FINAL_NAME,
YT.NAME,
YT.LOCATION
FROM YOUR_TABLE_7 YT) A
As Jirka correctly pointed out, I was using the Outer select, distinct and raw Name unnecessarily. My mistake was that as I used DISTINCT , I got the resulted sorted like
1 1
2 user2 1
3 user3 97
4 user1 1
5 3
6 9
7 10
I wanted to avoid output like this.
Hence I added the raw id and outer select
However , removing the DISTINCT solves the problem.
Hence only this much is enough
SELECT DECODE((LAG(YT.NAME, 1) OVER(ORDER BY YT.NAME)),
YT.NAME,
NULL,
YT.NAME) AS FINAL_NAME,
YT.LOCATION
FROM SO_BUFFER_TABLE_7 YT
Thanks Jirka

If you're using straight SQL*Plus to make your report (don't laugh, you can do some pretty cool stuff with it), you can do this with the BREAK command:
SQL> break on name
SQL> WITH q AS (
SELECT 'user1' NAME, 1 LOCATION FROM dual
UNION ALL
SELECT 'user1', 9 FROM dual
UNION ALL
SELECT 'user1', 3 FROM dual
UNION ALL
SELECT 'user2', 1 FROM dual
UNION ALL
SELECT 'user2', 10 FROM dual
UNION ALL
SELECT 'user3', 97 FROM dual
)
SELECT NAME,LOCATION
FROM q
ORDER BY name;
NAME LOCATION
----- ----------
user1 1
9
3
user2 1
10
user3 97
6 rows selected.
SQL>

I cannot but agree with the other commenters that this kind of problem does not look like it should ever be solved using SQL, but let us face it anyway.
SELECT
CASE main.name WHERE preceding_id IS NULL THEN main.name ELSE null END,
main.location
FROM mytable main LEFT JOIN mytable preceding
ON main.name = preceding.name AND MIN(preceding.id) < main.id
GROUP BY main.id, main.name, main.location, preceding.name
ORDER BY main.id
The GROUP BY clause is not responsible for the grouping job, at least not directly. In the first approximation, an outer join to the same table (LEFT JOIN below) can be used to determine on which row a particular value occurs for the first time. This is what we are after. This assumes that there are some unique id values that make it possible to arbitrarily order all the records. (The ORDER BY clause does NOT do this; it orders the output, not the input of the whole computation, but it is still necessary to make sure that the output is presented correctly, because the remaining SQL does not imply any particular order of processing.)
As you can see, there is still a GROUP BY clause in the SQL, but with a perhaps unexpected purpose. Its job is to "undo" a side effect of the LEFT JOIN, which is duplication of all main records that have many "preceding" ( = successfully joined) records.
This is quite normal with GROUP BY. The typical effect of a GROUP BY clause is a reduction of the number of records; and impossibility to query or test columns NOT listed in the GROUP BY clause, except through aggregate functions like COUNT, MIN, MAX, or SUM. This is because these columns really represent "groups of values" due to the GROUP BY, not just specific values.

If you are using SQL*Plus, use the BREAK function. In this case, break on NAME.
If you are using another reporting tool, you may be able to compare the "name" field to the previous record and suppress printing when they are equal.

If you use GROUP BY, output rows are sorted according to the GROUP BY columns as if you had an ORDER BY for the same columns. To avoid the overhead of sorting that GROUP BY produces, add ORDER BY NULL:
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
Relying on implicit GROUP BY sorting in MySQL 5.6 is deprecated. To achieve a specific sort order of grouped results, it is preferable to use an explicit ORDER BY clause. GROUP BY sorting is a MySQL extension that may change in a future release; for example, to make it possible for the optimizer to order groupings in whatever manner it deems most efficient and to avoid the sorting overhead.
For full information - http://academy.comingweek.com/sql-groupby-clause/

SQL GROUP BY STATEMENT
SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups.
Syntax:
1. SELECT column_nm, aggregate_function(column_nm) FROM table_nm WHERE column_nm operator value GROUP BY column_nm;
Example :
To understand the GROUP BY clauserefer the sample database.Below table showing fields from “order” table:
1. |EMPORD_ID|employee1ID|customerID|shippers_ID|
Below table showing fields from “shipper” table:
1. | shippers_ID| shippers_Name |
Below table showing fields from “table_emp1” table:
1. | employee1ID| first1_nm | last1_nm |
Example :
To find the number of orders sent by each shipper.
1. SELECT shipper.shippers_Name, COUNT (orders.EMPORD_ID) AS No_of_orders FROM orders LEFT JOIN shipper ON orders.shippers_ID = shipper.shippers_ID GROUP BY shippers_Name;
1. | shippers_Name | No_of_orders |
Example :
To use GROUP BY statement on more than one column.
1. SELECT shipper.shippers_Name, table_emp1.last1_nm, COUNT (orders.EMPORD_ID) AS No_of_orders FROM ((orders INNER JOIN shipper ON orders.shippers_ID=shipper.shippers_ID) INNER JOIN table_emp1 ON orders.employee1ID = table_emp1.employee1ID)
2. GROUP BY shippers_Name,last1_nm;
| shippers_Name | last1_nm |No_of_orders |
for more clarification refer my link
http://academy.comingweek.com/sql-groupby-clause/

MySQL: Getting highest score for a user

I have the following table (highscores),
id gameid userid name score date
1 38 2345 A 100 2009-07-23 16:45:01
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
5 38 2345 A 50 2009-07-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
7 32 2345 A 100 2009-07-20 16:45:01
Now in the above structure, a user can play a game multiple times but I want to display the "Games Played" by a specific user. So in games played section I can't display multiple games. So the concept should be like if a user played a game 3 times then the game with highest score should be displayed out of all.
I want result data like:
id gameid userid name score date
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
I tried following query but its not giving me the correct result:
SELECT id,
gameid,
userid,
date,
MAX(score) AS score
FROM highscores
WHERE userid='2345'
GROUP BY gameid
Please tell me what will be the query for this?
Thanks

Requirement is a bit vague/confusing but would something like this satisfy the need ?
(purposely added various aggregates that may be of interest).
SELECT gameid,
MIN(date) AS FirstTime,
MAX(date) AS LastTime,
MAX(score) AS TOPscore.
COUNT(*) AS NbOfTimesPlayed
FROM highscores
WHERE userid='2345'
GROUP BY gameid
-- ORDER BY COUNT(*) DESC -- for ex. to have games played most at top
Edit: New question about adding the id column to the the SELECT list
The short answer is: "No, id cannot be added, not within this particular construct". (Read further to see why) However, if the intent is to have the id of the game with the highest score, the query can be modified, using a sub-query, to achieve that.
As explained by Alex M on this page, all the column names referenced in the SELECT list and which are not used in the context of an aggregate function (MAX, MIN, AVG, COUNT and the like), MUST be included in the ORDER BY clause. The reason for this rule of the SQL language is simply that in gathering the info for the results list, SQL may encounter multiple values for such an column (listed in SELECT but not GROUP BY) and would then not know how to deal with it; rather than doing anything -possibly useful but possibly silly as well- with these extra rows/values, SQL standard dictates a error message, so that the user can modify the query and express explicitly his/her goals.
In our specific case, we could add the id in the SELECT and also add it in the GROUP BY list, but in doing so the grouping upon which the aggregation takes place would be different: the results list would include as many rows as we have id + gameid combinations the aggregate values for each of this row would be based on only the records from the table where the id and the gameid have the corresponding values (assuming id is the PK in table, we'd get a single row per aggregation, making the MAX() and such quite meaningless).
The way to include the id (and possibly other columns) corresponding to the game with the top score, is with a sub-query. The idea is that the subquery selects the game with TOP score (within a given group by), and the main query's SELECTs any column of this rows, even when the fieds wasn't (couldn't be) in the sub-query's group-by construct. BTW, do give credit on this page to rexem for showing this type of query first.
SELECT H.id,
H.gameid,
H.userid,
H.name,
H.score,
H.date
FROM highscores H
JOIN (
SELECT M.gameid, hs.userid, MAX(hs.score) MaxScoreByGameUser
FROM highscores H2
GROUP BY H2.gameid, H2.userid
) AS M
ON M.gameid = H.gameid
AND M.userid = H.userid
AND M.MaxScoreByGameUser = H.score
WHERE H.userid='2345'
A few important remarks about the query above
Duplicates: if there the user played several games that reached the same hi-score, the query will produce that many rows.
GROUP BY of the sub-query may need to change for different uses of the query. If rather than searching for the game's hi-score on a per user basis, we wanted the absolute hi-score, we would need to exclude userid from the GROUP BY (that's why I named the alias of the MAX with a long, explicit name)
The userid = '2345' may be added in the [now absent] WHERE clause of the sub-query, for efficiency purposes (unless MySQL's optimizer is very smart, currently all hi-scores for all game+user combinations get calculated, whereby we only need these for user '2345'); down side duplication; solution; variables.
There are several ways to deal with the issues mentioned above, but these seem to be out of scope for a [now rather lenghty] explanation about the GROUP BY constructs.

Every field you have in your SELECT (when a GROUP BY clause is present) must be either one of the fields in the GROUP BY clause, or else a group function such as MAX, SUM, AVG, etc. In your code, userid is technically violating that but in a pretty harmless fashion (you could make your code technically SQL standard compliant with a GROUP BY gameid, userid); fields id and date are in more serious violation - there will be many ids and dates within one GROUP BY set, and you're not telling how to make a single value out of that set (MySQL picks a more-or-less random ones, stricter SQL engines might more helpfully give you an error).
I know you want the id and date corresponding to the maximum score for a given grouping, but that's not explicit in your code. You'll need a subselect or a self-join to make it explicit!

Use:
SELECT t.id,
t.gameid,
t.userid,
t.name,
t.score,
t.date
FROM HIGHSCORES t
JOIN (SELECT hs.gameid,
hs.userid,
MAX(hs.score) 'max_score'
FROM HIGHSCORES hs
GROUP BY hs.gameid, hs.userid) mhs ON mhs.gameid = t.gameid
AND mhs.userid = t.userid
AND mhs.max_score = t.score
WHERE t.userid = '2345'

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas