query duplicates from multiple matching JSON fields in a row - sql

JSON format is new to me and I am trying to understand how I can apply searches within those JSON strings.
In this situation I have a players table with two critical columns. The first is a standard text column, Name. The second is a JSON string, charinfo, that holds various player information. For this question I need two values from it: $.firstname and $.lastname. The challenge is that I need a query that lists every row where all three values match those of another row. The purpose is to detect and list players that have duplicate characters.
My standard query to list all rows is:
SELECT
Name,
json_extract(charinfo, '$."firstname"') AS Firstname,
json_extract(charinfo, '$."lastname"') AS Lastname
FROM players
ORDER BY Firstname;
What I have currently is:
SELECT
Name,
json_extract(charinfo, '$."firstname"') AS Firstname,
json_extract(charinfo, '$."lastname"') AS Lastname,
COUNT(*) AS qty, license
FROM players
GROUP BY NAME, Firstname, Lastname HAVING COUNT(*)> 1
ORDER BY Firstname;
While this works, I would like it to display the duplicate rows themselves, not just count them, and I'm not sure of the correct way to make that adjustment.
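One possible approach, sketched under assumptions rather than tested against the asker's database: join every row back to the grouped query, so that each member of a duplicate group is listed instead of one summary row per group. The sketch uses Python's sqlite3 module and SQLite's json_extract as a stand-in for the real database (this needs a SQLite build with the JSON functions enabled). The sample rows and license values are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (license TEXT, Name TEXT, charinfo TEXT)")
sample = [
    ("lic1", "alice", {"firstname": "Ann", "lastname": "Lee"}),
    ("lic2", "alice", {"firstname": "Ann", "lastname": "Lee"}),
    ("lic3", "bob", {"firstname": "Bo", "lastname": "Kim"}),
]
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [(lic, name, json.dumps(info)) for lic, name, info in sample],
)

# The inner query finds the (Name, Firstname, Lastname) groups that occur
# more than once; the join then returns every row belonging to such a group.
DUPLICATES_SQL = """
SELECT p.license, p.Name,
       json_extract(p.charinfo, '$.firstname') AS Firstname,
       json_extract(p.charinfo, '$.lastname')  AS Lastname
FROM players p
JOIN (
    SELECT Name,
           json_extract(charinfo, '$.firstname') AS Firstname,
           json_extract(charinfo, '$.lastname')  AS Lastname
    FROM players
    GROUP BY Name, Firstname, Lastname
    HAVING COUNT(*) > 1
) d ON d.Name = p.Name
   AND d.Firstname = json_extract(p.charinfo, '$.firstname')
   AND d.Lastname  = json_extract(p.charinfo, '$.lastname')
ORDER BY p.Name, p.license
"""
duplicates = conn.execute(DUPLICATES_SQL).fetchall()
for row in duplicates:
    print(row)
```

Rows whose Name, firstname or lastname is NULL never match in the join, so they are not reported. On MySQL, json_extract returns quoted JSON strings, which still compare equal to each other but may need unquoting for display.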

Related

How can I select all fields except for those with non-distinct values?

I have a table that holds data for people who have applied. Each person has one PERSON_ID but can have multiple APP_IDs. I want to select all of the columns except APP_ID (because its values aren't distinct) for all of the distinct people in the table.
I can list every field individually in both the select and group by clauses.
This works:
select PERSON_ID, FIRST,LAST,MIDDLE,BIRTHDATE,SEX,EMAIL,PRIMARY_PHONE from
applications
where first = 'Rob' and last='Robot'
group by PERSON_ID,FIRST,LAST,MIDDLE,BIRTHDATE,SEX,EMAIL,PRIMARY_PHONE
But there are twenty more fields that I may or may not use at any given time.
Is there a shorter way to achieve this sort of selection without being so verbose?
select distinct is shorter:
select distinct PERSON_ID, FIRST, LAST, MIDDLE, BIRTHDATE, SEX, EMAIL, PRIMARY_PHONE
from applications
where first = 'Rob' and last = 'Robot';
But you still have to list out the columns once.
Some more modern databases support an except clause that lets you remove columns from the wildcard list. To the best of my knowledge, Oracle has no similar concept.
You could write a query to bring the columns together from the system tables. That could simplify writing the query and help prevent misspellings.
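As a rough sketch of that last idea, the column list can be read from the catalog and the query text generated from it. The example below uses Python's sqlite3 module with SQLite's PRAGMA table_info as a stand-in; on Oracle the column names would come from a dictionary view such as USER_TAB_COLUMNS instead. The helper name distinct_query_without and the simplified column set (FIRST_NAME, LAST_NAME) are assumptions made for illustration.

```python
import sqlite3


def distinct_query_without(conn, table, excluded):
    """Build a SELECT DISTINCT over every column of `table` except `excluded`."""
    # PRAGMA table_info is SQLite's catalog lookup; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    skip = {name.upper() for name in excluded}
    kept = [col for col in columns if col.upper() not in skip]
    return f"SELECT DISTINCT {', '.join(kept)} FROM {table}"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applications "
    "(PERSON_ID INTEGER, APP_ID INTEGER, FIRST_NAME TEXT, LAST_NAME TEXT)"
)
conn.executemany(
    "INSERT INTO applications VALUES (?, ?, ?, ?)",
    [(1, 10, "Rob", "Robot"), (1, 11, "Rob", "Robot")],
)

sql = distinct_query_without(conn, "applications", ["APP_ID"])
people = conn.execute(sql).fetchall()
print(sql)
print(people)
```

The table name is interpolated into the SQL text, so this is only suitable when the name comes from trusted code and not from user input.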

Merge with multiple matching conditions

I have to write a T-SQL merge statement that has to meet multiple conditions to match.
Table column names:
ID,
emailaddress,
firstname,
surname,
title,
mobile,
dob,
accountnumber,
address,
postcode
The main problem is that the database I am working with has no mandatory fields and no primary keys to compare, and the source table can contain duplicate records as well. As a result, there are many combinations to check when looking for duplicates between the source table and the target table. My manager has come up with the following scenarios:
Two people could share the same email address, so a match on emailaddress, firstname and surname is a 100% match (assuming all other columns are empty).
A match on mobile and accountnumber is a 100% match (assuming all other columns are empty).
A match on title, surname, postcode and dob is a 100% match (assuming all other columns are empty).
I was given this task without access to the data, because I am a new recruit and my employer does not want me to see the data for the moment. So I am more or less working from my imagination.
The solution
Now, I am thinking that rather than checking the existing records of the source against the target database, I will cleanse the source data using stored procedure statements, where if a row meets one duplicate condition it skips the remaining duplicate-removal statements, and then insert the data into the target table.
with cte_duplicate1 AS
(
select emailaddress, sname, ROW_NUMBER() over(partition by emailaddress, sname order by emailaddress) as dup1
from DuplicateRecordTable1
)
delete from cte_duplicate1
where dup1>1;
(if the first cte_duplicate1 code was executed then it will skip the cte_duplicate2)
with cte_duplicate2 AS
(
select emailaddress, fname, ROW_NUMBER() over(partition by emailaddress, fname order by emailaddress) as dup2
from DuplicateRecordTable1
)
delete from cte_duplicate2
where dup2>1;
That is the vague plan at the moment. I do not yet know whether it is achievable.
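A minimal sketch of that plan, assuming the three match rules from the manager's scenarios are applied in priority order. It uses Python's sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or later). The table name source_people and the sample rows are invented, and SQLite's rowid replaces the T-SQL trick of deleting through the CTE. A row removed by an earlier rule is never seen by a later one, which approximates the "skip the next statement" idea.

```python
import sqlite3

# Match rules in priority order; each tuple lists the columns that must all agree.
RULES = [
    ("emailaddress", "firstname", "surname"),
    ("mobile", "accountnumber"),
    ("title", "surname", "postcode", "dob"),
]


def remove_duplicates(conn, table="source_people"):
    """Apply each rule in turn, keeping the lowest rowid of every matching group."""
    removed = 0
    for rule in RULES:
        keys = ", ".join(rule)
        # Rows with a NULL key column are skipped so that blanks never match each other.
        not_null = " AND ".join(f"{col} IS NOT NULL" for col in rule)
        cur = conn.execute(f"""
            DELETE FROM {table}
            WHERE rowid IN (
                SELECT rid FROM (
                    SELECT rowid AS rid,
                           ROW_NUMBER() OVER (PARTITION BY {keys} ORDER BY rowid) AS dup
                    FROM {table}
                    WHERE {not_null}
                )
                WHERE dup > 1
            )""")
        removed += cur.rowcount
    return removed


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE source_people (
    emailaddress TEXT, firstname TEXT, surname TEXT, title TEXT,
    mobile TEXT, dob TEXT, accountnumber TEXT, postcode TEXT)""")
conn.executemany(
    "INSERT INTO source_people VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("a@x.org", "Ann", "Lee", None, None, None, None, None),
        ("a@x.org", "Ann", "Lee", None, None, None, None, None),  # rule 1 duplicate
        ("a@x.org", "Tom", "Lee", None, None, None, None, None),  # shared email only
        (None, None, None, None, "0700", None, "AC1", None),
        (None, None, None, None, "0700", None, "AC1", None),      # rule 2 duplicate
    ],
)
removed = remove_duplicates(conn)
remaining = conn.execute("SELECT COUNT(*) FROM source_people").fetchone()[0]
print(removed, remaining)
```

Excluding rows with NULL key columns is a design choice of this sketch. Without it, every row with empty fields would fall into the same partition and be treated as a duplicate of the others.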

Retrieving duplicate and original rows from a table using sql query

Say I have a student table with the following fields: student id, student name, age, gender, marks, class. Assume that due to some error there are multiple entries for each student. My requirement is to identify the duplicate rows in the table, where the filter criterion is the student name and the class. In the query result, in addition to identifying the duplicate records, I also need to find the original student record that got duplicated. Is there any method to do this? I went through this answer: SQL: How to find duplicates based on two fields?. But it only shows how to find the duplicate rows, not how to identify the actual row that was duplicated. Kindly throw some light on a possible solution. Thanks.
First of all: if the columns you've listed are all in the same table, it looks like your database structure could use some normalization.
In terms of your question: I'm assuming your StudentID field is a database-generated primary key and so has not been duplicated. (If this is not the case, I think you have bigger problems than just duplicates.)
I'm also assuming the duplicate row has a higher value for StudentID than the original row.
I think the following should work. (Note: I haven't created a table to verify this, so it might not be perfect straight away. If it isn't, it should be fairly close.)
select dup.StudentID as DuplicateStudentID,
dup.StudentName, dup.Age, dup.Gender, dup.Marks, dup.Class,
orig.StudentID as OriginalStudentId
from StudentTable dup
inner join (
-- Find first student record for each unique combination
select Min(StudentId) as StudentID, StudentName, Age, Gender, Marks, Class
from StudentTable t
group by StudentName, Age, Gender, Marks, Class
) orig on dup.StudentName = orig.StudentName
and dup.Age = orig.Age
and dup.Gender = orig.Gender
and dup.Marks = orig.Marks
and dup.Class = orig.Class
and dup.StudentID > orig.StudentID -- Don't identify the original record as a duplicate
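For a quick check of this pattern, here is a small runnable sketch using Python's sqlite3 module as a stand-in database. It is a variant, not the answer's exact query: it matches on StudentName and Class only, as the question asks, and the sample students are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE StudentTable (
    StudentID INTEGER PRIMARY KEY, StudentName TEXT, Age INTEGER,
    Gender TEXT, Marks INTEGER, Class TEXT)""")
conn.executemany(
    "INSERT INTO StudentTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Asha", 14, "F", 82, "9A"),
        (2, "Ravi", 15, "M", 74, "9B"),
        (3, "Asha", 14, "F", 82, "9A"),  # duplicate of StudentID 1
        (4, "Asha", 14, "F", 82, "9A"),  # duplicate of StudentID 1
    ],
)

# The lowest StudentID in each (StudentName, Class) group is treated as the
# original; every later row in the same group is reported as its duplicate.
PAIRS_SQL = """
SELECT dup.StudentID AS DuplicateStudentID, orig.StudentID AS OriginalStudentID
FROM StudentTable dup
JOIN (
    SELECT MIN(StudentID) AS StudentID, StudentName, Class
    FROM StudentTable
    GROUP BY StudentName, Class
) orig ON dup.StudentName = orig.StudentName
      AND dup.Class = orig.Class
      AND dup.StudentID > orig.StudentID
ORDER BY dup.StudentID
"""
pairs = conn.execute(PAIRS_SQL).fetchall()
print(pairs)
```

Students without any duplicate, such as StudentID 2 here, do not appear in the result at all.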

Return all Fields and Distinct Rows

What's the best way to do this when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, i.e. address1, city etc., but not include them in the DISTINCT row check.
Then you have to decide what to do when there are multiple rows with the same value in the column you want the distinct check to apply to, but with different values in the other columns. In that case, how does the query processor know which of the multiple values in the other columns to output? If you don't care, just write a GROUP BY on the distinct column, with MIN() or MAX() on all the other ones.
EDIT: I agree with the comments from others that as long as you have multiple dependent columns in the same table (e.g., Address1, Address2, City, State), this approach is going to give you mixed (and therefore inconsistent) results. If each column attribute in the table is independent (for example, if addresses are all in an Address table and only an AddressId is in this table), then it's not as significant an issue, because at least all the columns from a join to the Address table will contain data for the same address. But you are still getting a more or less random selection of one of the set of multiple addresses.
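The mixing problem described in the edit is easy to reproduce. The sketch below, which assumes a made-up contacts table and uses Python's sqlite3 module, shows MAX() on each column assembling a city and state that never appear together in any stored row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?)",
    [(1, "abc", "Austin", "TX"), (2, "abc", "Boston", "MA")],
)

# Each MAX() is computed independently, so the values can come from different rows.
mixed = conn.execute(
    "SELECT name, MAX(city), MAX(state) FROM contacts GROUP BY name"
).fetchone()
stored = set(conn.execute("SELECT name, city, state FROM contacts"))

print(mixed)            # the aggregated row
print(mixed in stored)  # whether that combination exists in the table
```

Here MAX(city) picks Boston and MAX(state) picks TX, so the output row pairs a city with the wrong state.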
This will not mix and match your city, state, etc. and should give you the last one added even:
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id
When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma-delimited string of all the addresses for that name, each listed once. cities does the same for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.
I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
If you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).
You can do
SELECT Name, Address, MAX(Address1), MAX(City)
FROM table
GROUP BY Name, Address
Use @JBrooks' answer below. He has a better answer.
If you're using SQL Server 2005 or above you can use the ROW_NUMBER function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them to the PARTITION BY section of the ROW_NUMBER call.
SELECT id, Name, Address, ...
FROM (SELECT id, Name, Address, ...,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
FROM table) sub
WHERE RowNo = 1
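A runnable version of the same idea, as a sketch: it uses Python's sqlite3 module in place of SQL Server (window functions need SQLite 3.25 or later), and the people table and its rows are invented, reusing the abc / 123 values from the example further up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, Name TEXT, Address TEXT, City TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        (1, "abc", "123", "def"),
        (2, "abc", "123", "hij"),
        (3, "xyz", "9", "klm"),
    ],
)

# Number the rows within each Name by id, then keep only the first of each group.
# Every output row is a complete stored row, so columns are never mixed.
FIRST_PER_NAME_SQL = """
SELECT id, Name, Address, City
FROM (
    SELECT id, Name, Address, City,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
    FROM people
) sub
WHERE RowNo = 1
ORDER BY id
"""
first_rows = conn.execute(FIRST_PER_NAME_SQL).fetchall()
print(first_rows)
```

Changing the window to ORDER BY id DESC would keep the most recently added row for each name instead.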

How to randomize order of data in 3 columns

I have 3 columns of data in SQL Server 2005 :
LASTNAME
FIRSTNAME
CITY
I want to randomly re-order these 3 columns (and munge the data) so that the data is no longer meaningful. Is there an easy way to do this? I don't want to change any data, I just want to re-order the index randomly.
When you say "re-order" these columns, do you mean that you want some of the last names to end up in the first name column? Or do you mean that you want some of the last names to get associated with a different first name and city?
I suspect you mean the latter, in which case you might find a programmatic solution easier (as opposed to a straight SQL solution). Sticking with SQL, you can do something like:
UPDATE the_table
SET lastname = (SELECT lastname FROM the_table ORDER BY RAND())
Depending on what DBMS you're using, this may work for only one line, may make all the last names the same, or may require some variation of syntax to work at all, but the basic approach is about right. Certainly some trials on a copy of the table are warranted before trying it on the real thing.
Of course, to get the first names and cities to also be randomly reordered, you could apply a similar query to either of those columns. (Applying it to all three doesn't make much sense, but wouldn't hurt either.)
Since you don't want to change your original data, you could do this in a temporary table populated with all rows.
Finally, if you just need a single random value from each column, you could do it in place without making a copy of the data, with three separate queries: one to pick a random first name, one a random last name, and the last a random city.
I suggest using NEWID() with CHECKSUM() for the randomization:
SELECT LASTNAME, FIRSTNAME, CITY FROM table ORDER BY CHECKSUM(NEWID())
In SQL Server 2005+ you could prepare a ranked rowset containing the three target columns and three additional computed columns filled with random rankings (one for each of the three target columns). Then the ranked rowset would be joined with itself three times using the ranking columns, and finally each of the three target columns would be pulled from their own instance of the ranked rowset. Here's an illustration:
WITH sampledata (FirstName, LastName, CityName) AS (
SELECT 'John', 'Doe', 'Chicago' UNION ALL
SELECT 'James', 'Foe', 'Austin' UNION ALL
SELECT 'Django', 'Fan', 'Portland'
),
ranked AS (
SELECT
*,
FirstNameRank = ROW_NUMBER() OVER (ORDER BY NEWID()),
LastNameRank = ROW_NUMBER() OVER (ORDER BY NEWID()),
CityNameRank = ROW_NUMBER() OVER (ORDER BY NEWID())
FROM sampledata
)
SELECT
fnr.FirstName,
lnr.LastName,
cnr.CityName
FROM ranked fnr
INNER JOIN ranked lnr ON fnr.FirstNameRank = lnr.LastNameRank
INNER JOIN ranked cnr ON fnr.FirstNameRank = cnr.CityNameRank
This is the result:
FirstName LastName CityName
--------- -------- --------
James Fan Chicago
John Doe Portland
Django Foe Austin
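The same rank-and-self-join technique can be tried outside SQL Server. The sketch below is an adaptation, not the answer's exact query: it uses Python's sqlite3 module, SQLite's random() in place of NEWID(), and a temporary table (an assumption of this sketch) to store one random sort key per column, so that the ranks stay fixed while the rowset is joined with itself. The output order differs on every run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (FirstName TEXT, LastName TEXT, CityName TEXT)")
original = [
    ("John", "Doe", "Chicago"),
    ("James", "Foe", "Austin"),
    ("Django", "Fan", "Portland"),
]
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", original)

# Store one random sort key per target column so the ranks are stable.
conn.execute("""
    CREATE TEMP TABLE keyed AS
    SELECT FirstName, LastName, CityName,
           random() AS k1, random() AS k2, random() AS k3
    FROM people""")

# Rank the rows three times, once per key, then pull each column from the
# row holding the matching rank in that column's own ordering.
SHUFFLE_SQL = """
WITH ranked AS (
    SELECT FirstName, LastName, CityName,
           ROW_NUMBER() OVER (ORDER BY k1) AS FirstNameRank,
           ROW_NUMBER() OVER (ORDER BY k2) AS LastNameRank,
           ROW_NUMBER() OVER (ORDER BY k3) AS CityNameRank
    FROM keyed
)
SELECT fnr.FirstName, lnr.LastName, cnr.CityName
FROM ranked fnr
JOIN ranked lnr ON fnr.FirstNameRank = lnr.LastNameRank
JOIN ranked cnr ON fnr.FirstNameRank = cnr.CityNameRank
"""
shuffled = conn.execute(SHUFFLE_SQL).fetchall()
for row in shuffled:
    print(row)
```

Every first name, last name and city still appears exactly once; only the association between them changes. With so few rows, a run can occasionally reproduce an original combination by chance.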
select *, rand() from table order by rand();
I understand some versions of SQL have a rand() that doesn't change for each row. Check yours. This works on MySQL.