How to select most populated record? - sql

I have the unfortunate luck of having to deal with a db that contains duplicates of particular records. I am looking for a quick way to say "get the most populated record and update the duplicates to match it".
From there I can select distinct records and get a useful set of records.
Any ideas?
It's mainly names and addresses if that helps...
OK, lots of questions were asked here, so I'll add a little bit more:
Firstly I want to pull the most "populated" not most "popular", this means the row with the most values that are not null.
Once I have the set (which is easy because in my case the id's match) I can then populate the missing values in the other rows.
I don't want to destroy data and I only intend to update data based on an accurate match (e.g. by id).
My problem at the moment is figuring out which of a set of rows has the most populated fields. Having said that, since posting this question I have found a different way to solve my bigger problem, which is what to send to a remote server. However, I'm still interested to know what the solution to this might be.
Sample data might look something like this ...
id  name  addr1         addr2      etc.
1   fred  1 the street  Some town  ...
1   fred  null          null       null
Given a table full of matching pairs like this I want to find the pairs then grab the one with the info in it and insert those values where there is a null in the other row.

Keep in mind that you will be potentially destroying data here. Just because a row has fewer columns filled doesn't mean that it's less accurate in the columns that are filled.
I've assumed that duplicates are determined by a column called "name". You'll need to adjust based on your definition of duplicates. Also, since you didn't give any rules on how to deal with ties for "most populated" I just chose the row with the lowest id.
UPDATE
    T1
SET
    col_1 = T2.col_1,
    col_2 = T2.col_2,
    ....
FROM
    My_Table T1
INNER JOIN My_Table T2 ON
    T2.name = T1.name AND
    T2.id =
    (
        SELECT TOP 1
            T3.id
        FROM
            My_Table T3
        WHERE
            T3.name = T1.name
        ORDER BY
            CASE WHEN T3.col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN T3.col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            T3.id ASC
    )
EDIT: I just reread your question and you mention, "From there I can select distinct records and get a useful set of records." If that's what you really want, then don't bother updating the other rows, just select the ones that you want in the first place and leave everything else intact:
SELECT
    T1.id,
    T1.name,
    T1.col_1,
    T1.col_2,
    ...
FROM
    My_Table T1
WHERE
    T1.id =
    (
        SELECT TOP 1
            T2.id
        FROM
            My_Table T2
        WHERE
            T2.name = T1.name
        ORDER BY
            CASE WHEN T2.col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN T2.col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            T2.id ASC
    )
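As a rough, runnable sketch of the idea above (count the non-NULL columns per row, keep the best row per duplicate set, then fill gaps only where a value is missing), here is a small Python example using the standard sqlite3 module. The table name people, the surrogate key row_no and the sample rows are assumptions made for illustration, and LIMIT 1 stands in for TOP 1 because SQLite has no TOP clause.

```python
import sqlite3

# Hypothetical table modelled on the sample data in the question.
# row_no is an assumed surrogate key used to tell duplicate rows apart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (
        row_no INTEGER PRIMARY KEY,
        id INTEGER, name TEXT, addr1 TEXT, addr2 TEXT
    );
    INSERT INTO people (id, name, addr1, addr2) VALUES
        (1, 'fred', '1 the street', 'Some town'),
        (1, 'fred', NULL, NULL),
        (2, 'anna', NULL, NULL),
        (2, 'anna', '5 high road', NULL);
""")

# For each id, pick the row with the most non-NULL columns.
# Ties are broken by the lowest row_no.
MOST_POPULATED = """
    SELECT p.id, p.name, p.addr1, p.addr2
    FROM people p
    WHERE p.row_no = (
        SELECT q.row_no
        FROM people q
        WHERE q.id = p.id
        ORDER BY
            CASE WHEN q.name  IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN q.addr1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN q.addr2 IS NOT NULL THEN 1 ELSE 0 END DESC,
            q.row_no ASC
        LIMIT 1
    )
    ORDER BY p.id
"""
best_rows = conn.execute(MOST_POPULATED).fetchall()

# Fill only the NULL columns of the duplicates; COALESCE keeps any
# value that is already present, so no existing data is overwritten.
for rec_id, name, addr1, addr2 in best_rows:
    conn.execute(
        """UPDATE people
           SET name  = COALESCE(name,  ?),
               addr1 = COALESCE(addr1, ?),
               addr2 = COALESCE(addr2, ?)
           WHERE id = ?""",
        (name, addr1, addr2, rec_id),
    )

distinct_rows = conn.execute(
    "SELECT DISTINCT id, name, addr1, addr2 FROM people ORDER BY id"
).fetchall()
```

After the update, SELECT DISTINCT collapses each duplicate set to one row, which matches the "useful set of records" the question asks for.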

Related

SQL group by selecting top rows with possible nulls

The example table:
id  name  create_time          group_id
---------------------------------------
1   a     2022-01-01 12:00:00  group1
2   b     2022-01-01 13:00:00  group1
3   c     2022-01-01 12:00:00  NULL
4   d     2022-01-01 13:00:00  NULL
5   e     NULL                 group2
I need to get the top 1 row (with the minimal create_time) per group_id under these conditions:
create_time can be null - it should be treated as a minimal value
group_id can be null - all rows with a null group_id should be returned (if that's not possible, we can use coalesce(group_id, id) or something like that, assuming that ids are unique and never collide with group ids)
it should be possible to apply pagination to the query (so a join can be a problem)
the query should be as universal as possible (so no vendor-specific things). Again, if that's not possible, it should work in MySQL 5 & 8, PostgreSQL 9+ and H2
The expected output for the example:
id  name  create_time          group_id
---------------------------------------
1   a     2022-01-01 12:00:00  group1
3   c     2022-01-01 12:00:00  NULL
4   d     2022-01-01 13:00:00  NULL
5   e     NULL                 group2
I've already read similar questions on SO but 90% of answers are with specific keywords (numerous answers with PARTITION BY like https://stackoverflow.com/a/6841644/5572007) and others don't honor null values in the group condition columns and probably pagination (like https://stackoverflow.com/a/14346780/5572007).
You can combine two queries with UNION ALL. E.g.:
select id, name, create_time, group_id
from mytable
where group_id is not null
  and not exists
  (
    select null
    from mytable older
    where older.group_id = mytable.group_id
      and (older.create_time < mytable.create_time
           -- a NULL create_time counts as the minimal value
           or (older.create_time is null and mytable.create_time is not null))
  )
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id;
This is standard SQL and very basic at that. It should work in about every RDBMS.
As to pagination: This is usually costly, as you run the same query again and again in order to always pick the "next" part of the result, instead of running the query only once. The best approach is usually to use the primary key to get to the next part so an index on the key can be used. In the above query we'd ideally add where id > :last_biggest_id to the queries and limit the result, which would be fetch next <n> rows only in standard SQL. Every time we run the query, we use the last read ID as :last_biggest_id, so we read on from there.
Variables, however, are dealt with differently in the various DBMS; most commonly they are preceded by either a colon, a dollar sign or an at sign. And the standard fetch clause, too, is supported by only some DBMS, while others have a LIMIT or TOP clause instead.
If these little differences make it impossible to apply them, then you must find a workaround. For the variable this can be a one-row-table holding the last read maximum ID. For the fetch clause this can mean you simply fetch as many rows as you need and stop there. Of course this isn't ideal, as the DBMS doesn't know then that you only need the next n rows and cannot optimize the execution plan accordingly.
And then there is the option not to do the pagination in the DBMS, but read the complete result into your app and handle pagination there (which then becomes a mere display thing and allocates a lot of memory of course).
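To make the "continue after the last read ID" idea concrete, here is a hedged sketch in Python with the standard sqlite3 module. The helper fetch_page and the page size are illustrative assumptions. The two UNION ALL branches are folded into one WHERE clause with OR, which is a variation and not the query exactly as written above, and LIMIT is used because SQLite does not support the standard fetch clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (
        id INTEGER PRIMARY KEY, name TEXT, create_time TEXT, group_id TEXT
    );
    INSERT INTO mytable VALUES
        (1, 'a', '2022-01-01 12:00:00', 'group1'),
        (2, 'b', '2022-01-01 13:00:00', 'group1'),
        (3, 'c', '2022-01-01 12:00:00', NULL),
        (4, 'd', '2022-01-01 13:00:00', NULL),
        (5, 'e', NULL, 'group2');
""")

# Keyset pagination: each page starts after the largest id already read,
# so the primary key index can be used to skip earlier rows.
PAGE_QUERY = """
    SELECT id, name, create_time, group_id
    FROM mytable
    WHERE id > :last_id
      AND (group_id IS NULL
           OR NOT EXISTS (
               SELECT NULL
               FROM mytable older
               WHERE older.group_id = mytable.group_id
                 AND (older.create_time < mytable.create_time
                      OR (older.create_time IS NULL
                          AND mytable.create_time IS NOT NULL))))
    ORDER BY id
    LIMIT :page_size
"""

def fetch_page(last_id, page_size):
    """Return the next page of rows whose id is greater than last_id."""
    params = {"last_id": last_id, "page_size": page_size}
    return conn.execute(PAGE_QUERY, params).fetchall()

page1 = fetch_page(0, 2)
page2 = fetch_page(page1[-1][0], 2)   # continue after the last id read
```

With the sample data, the first page holds ids 1 and 3 and the second page holds ids 4 and 5, which together match the expected output in the question.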
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
Not sure how you imagine "pagination" should work. Here's one way:
and (
select count(distinct coalesce(t2.group_id, t2.id)) from T t2
where coalesce(t2.group_id, t2.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5 /* for example */
order by coalesce(t1.group_id, t1.id)
I'm assuming there's an implicit cast from 0 to a date value with a resulting value lower than all those in your database. Not sure if that's reliable. (Try '19000101' instead?) Otherwise the rest should be universal. You could probably also parameterize that in the same way as the page range.
You've also got a potential complication with collisions between the group_id and id spaces. Yours don't appear to have that problem, though having mixed data types creates its own issues.
This all gets more difficult when you want to order by other columns like name:
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
) and (
select count(*) from (
select * from T t1
where coalesce(create_time, 0) = (
select min(coalesce(create_time, 0)) from T t2
where coalesce(t2.group_id, t2.id) = coalesce(t1.group_id, t1.id)
)
) t3
where t3.name < t1.name or t3.name = t1.name
and coalesce(t3.group_id, t3.id) <= coalesce(t1.group_id, t1.id)
) between 2 and 5
order by t1.name;
That does handle ties but also makes the simplifying assumption that name can't be null which would add yet another small twist. At least you can see that it's possible without CTEs and window functions but expect these to also be a lot less efficient to run.
https://dbfiddle.uk/?rdbms=mysql_5.5&fiddle=9697fd274e73f4fa7c1a3a48d2c78691
I would guess
SELECT id, name, MAX(create_time), group_id
FROM tb GROUP BY group_id
UNION ALL
SELECT id, name, create_time, group_id
FROM tb WHERE group_id IS NULL
ORDER BY name
I should point out that 'name' is a reserved word.

Nested SQL Queries with Self JOIN - How to filter rows OUT

I have an SQLite3 database with a table that I need to filter by several factors. One such factor is to filter out rows based on the content of other rows within the same table.
From what I've researched, a self JOIN is going to be required, but I am not sure how I would do that to filter the table by several factors.
Here is a sample table of the data:
Name Part # Status Amount
---------------------------------
Item 1 12345 New $100.00
Item 2 12345 New $15.00
Item 3 35864 Old $132.56
Item 4 12345 Old $15.00
What I need to do is find any Items that have the same Part #, one of them has an "Old" Status and the Amount is the same.
So, first we would get all rows with Part # "12345," and then check if any of the rows have an "Old" status with a matching Amount. In this example, we would have Item 2 and Item 4 as a result.
What now would need to be done is to return the REST of the rows within the table, that have a "New" Status, essentially discarding those two items.
Desired Output:
Name Part # Status Amount
---------------------------------
Item 1 12345 New $100.00
Removed all "Old" status rows and any "New" that had a matching "Part #" and "Amount" with an "Old" status. (I'm sorry, I know that's very confusing, hence my need for help).
I have looked into the following resources to try and figure this out on my own, but there are so many levels that I am getting confused.
Self-join of a subquery
ZenTut
Compare rows and columns of same table
The first two links dealt with comparing columns within the same table. The third one does seem to be a pretty similar question, but does not have a readable answer (for me, anyway).
I do Java development as well and it would be fairly simple to do this there, but I am hoping for a single SQL query (nested), if possible.
The "not exists" statement should do the trick:
select * from table t1
where t1.Status = 'New'
and not exists (select * from table t2
where t2.Status = 'Old'
and t2.Part = t1.Part
and t2.Amount = t1.Amount);
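Here is a small runnable check of the NOT EXISTS approach against the sample data from the question, using Python's standard sqlite3 module. The table name items and the column names part and amount are assumptions chosen for the sketch, since "Part #" and a table literally called table would need quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table and column names for the sample data.
    CREATE TABLE items (name TEXT, part TEXT, status TEXT, amount REAL);
    INSERT INTO items VALUES
        ('Item 1', '12345', 'New', 100.00),
        ('Item 2', '12345', 'New',  15.00),
        ('Item 3', '35864', 'Old', 132.56),
        ('Item 4', '12345', 'Old',  15.00);
""")

# Keep 'New' rows only when no 'Old' row shares both part and amount.
NEW_WITHOUT_OLD_MATCH = """
    SELECT t1.name, t1.part, t1.status, t1.amount
    FROM items t1
    WHERE t1.status = 'New'
      AND NOT EXISTS (
          SELECT 1
          FROM items t2
          WHERE t2.status = 'Old'
            AND t2.part = t1.part
            AND t2.amount = t1.amount
      )
    ORDER BY t1.name
"""
remaining = conn.execute(NEW_WITHOUT_OLD_MATCH).fetchall()
```

Item 2 is dropped because Item 4 is an "Old" row with the same part and amount, so only Item 1 remains, as in the desired output.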
This is a T-SQL answer; hopefully it is translatable. If you have a big data set of matches, you might change the NOT IN to NOT EXISTS.
select *
from table
where Name not in (
    select t1.Name
    from table t1
    join table t2
      on t1.PartNumber = t2.PartNumber
     and t1.Status = 'New'
     and t2.Status = 'Old'
     and t1.Amount = t2.Amount)
  and Status = 'New'
You could also use an inner join on a grouped select to find the part/amount combinations that have an 'Old' status in addition to another status:
select * from
my_table
INNER JOIN (
    select
        Part_#
        , Amount
        , count(distinct Status) as status_count
        , sum(case when Status = 'Old' then 1 else 0 end) as old_count
    from my_table
    group by Part_#, Amount
    having count(distinct Status) > 1
       and sum(case when Status = 'Old' then 1 else 0 end) > 0
) t on t.Part_# = my_table.Part_#
   and my_table.Status = 'New'
   and my_table.Amount <> t.Amount
Tried to understand what you want best I could...
SELECT DISTINCT yt.PartNum, yt.Status, yt.Amount
FROM YourTable yt
JOIN YourTable yt2
ON yt2.PartNum = yt.PartNum
AND yt2.Status = 'Old'
AND yt2.Amount != yt.Amount
WHERE yt.Status = 'New'
This gives everything with a new status that has an old status with a different price.

SQL Data Duplication Query

Greetings of the day!!!!
I have a table having multiple columns of data with different status.
Assume I have 500 rows of data with status 'Valid' and 150 rows of data with status 'chkDuplicate'.
Now I have to write a query to update the status of these 150 records to Valid or Invalid by comparing a few columns for duplication, such as Address, City, State.
How can I achieve this? It needs to support large tables as well.
Thanks in advance....
TABLE DEFINITION
CREATE TABLE XYZ
(
ID bigint,
ADDRESS nvarchar,
CITY nvarchar,
STATE nvarchar,
ZIP nvarchar,
STATUS
)
Status should update based on duplication query.
Important!!!! For duplicate data, the first record should be valid and the others should be invalid. If I re-process the invalid data again, it should not disturb the valid records.
If I run the query again, the above table should stay the same. Records 1,3 should be 'Success' and 3,4 should be 'Duplicate'. Even if I add a few more, 1,3 should always remain 'Success' and the other duplicates should be updated to 'Duplicate'.
This query returned duplicate rows.
select tbl.data1, tbl.data2, tbl.data3
from TestTable1 tbl
inner join (
SELECT data1 , data2, data3 , COUNT(*) AS dupCount
FROM TestTable1
GROUP BY data1, data2, data3
HAVING COUNT(*) > 1
) oc on tbl.data1 = oc.data1 and tbl.data2 = oc.data2 and tbl.data3 = oc.data3
Then use a cursor to update the duplicate rows.
Cursor Example
Added ID for ORDER BY clause then it works for me even if I re-process the duplication call multiple times.
WITH TABLE_DATA_DUPLICATE AS
(SELECT * ,ROW_NUMBER() OVER(
PARTITION BY STREET1,CITY,STATE,ZIP
ORDER BY STREET1,CITY,STATE,ZIP,ID
) NO_OF_REPEATS
FROM YOURTABLE(NOLOCK))
UPDATE TABLE_DATA_DUPLICATE SET STATUS = (CASE WHEN NO_OF_REPEATS = 1 THEN 'VALID' ELSE 'DUPLICATE' END)
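For readers without SQL Server at hand, the same "lowest ID wins, and re-running changes nothing" behaviour can be sketched in Python with the standard sqlite3 module. This is a swapped-in, portable variant and not the CTE above: SQLite cannot update through a CTE, so a correlated MIN(id) subquery replaces ROW_NUMBER(). The table name xyz follows the question, while the sample rows and the helper mark_duplicates are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE xyz (
        id INTEGER PRIMARY KEY,
        address TEXT, city TEXT, state TEXT, zip TEXT, status TEXT
    );
    -- Made-up sample rows: ids 1/3 and 2/4 share the same address data.
    INSERT INTO xyz VALUES
        (1, '1 Main St', 'Springfield', 'IL', '62701', 'VALID'),
        (2, '9 Oak Ave', 'Dayton',      'OH', '45402', 'chkDuplicate'),
        (3, '1 Main St', 'Springfield', 'IL', '62701', 'chkDuplicate'),
        (4, '9 Oak Ave', 'Dayton',      'OH', '45402', 'chkDuplicate');
""")

# The row with the lowest id in each (address, city, state, zip) group
# is marked VALID; every other row in the group becomes DUPLICATE.
MARK_DUPLICATES = """
    UPDATE xyz
    SET status = CASE
        WHEN id = (SELECT MIN(o.id)
                   FROM xyz o
                   WHERE o.address = xyz.address
                     AND o.city = xyz.city
                     AND o.state = xyz.state
                     AND o.zip = xyz.zip)
        THEN 'VALID'
        ELSE 'DUPLICATE'
    END
"""

def mark_duplicates():
    """Apply the update and return (id, status) pairs ordered by id."""
    conn.execute(MARK_DUPLICATES)
    return conn.execute("SELECT id, status FROM xyz ORDER BY id").fetchall()

first_run = mark_duplicates()
second_run = mark_duplicates()   # re-processing must not change anything
```

Because the winner is decided by ID alone and not by the current status, running the update repeatedly gives the same result. Note that rows with NULL in any compared column would never match themselves here and would all be marked DUPLICATE, so this sketch assumes those columns are populated.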
Thanks everyone for support.... Cheers!!!!

(TRANSACT SQL) how to create Master-Detail in a row using sql?

I have two tables
table1
------
ID
NAME
ADDRESS
table2
-------
ID
PHONE
EMAIL
How can I create a report like this?
------------------------------------
01 Dave 123 Veneu
555-5 A#YAHOO.COM
66-66 B#Yahoo.co.id
213-1 D#c.com
02 John 23 Park
322-1 C#you.com
54-23 D#Net.com
231-2 me#you.com
I'm using SQL Server 2005 Express. Thank you in advance.
Not sure why you would ever want to write this in anything other than a report designer, but just for the hell of it:
SELECT ID AS Column1, NAME AS Column2, ADDRESS AS Column3, ID AS SortColumn1, 1 AS SortColumn2
FROM table1
UNION
SELECT '', PHONE, EMAIL, ID AS SortColumn1, 2 AS SortColumn2
FROM table2
ORDER BY SortColumn1, SortColumn2
The output is basically going to be a load of gibberish, and you've got the two extra columns on the end to get rid of.
You shouldn't. It's a general principle that formatting should not be done in the database layer.
SQL Server should be used to generate data, then your application should process the data, including the formatting.
I would open two queries: one that loads table one, ordered by the ID column, and another that loads table two, also ordered by the ID column. You can then iterate through both record sets at the same time, something like the following pseudo-code...
rs1 = SQL.Execute("SELECT * FROM table1 ORDER BY ID")
rs2 = SQL.Execute("SELECT * FROM table2 ORDER BY ID, phone")
rs2.Next()
WHILE rs1.Next()
Output The Address Info Here
WHILE rs1.ID = rs2.ID
Output The Phone/Email Info Here
rs2.Next()
END WHILE
END WHILE
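The pseudo-code above could be realised along these lines in Python with the standard sqlite3 module. This is only a sketch: the function build_report, the sample rows and the line layout are assumptions, and the e-mail addresses are made up. It assumes, as the pseudo-code does, that table2.ID holds the ID of the matching table1 row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT, address TEXT);
    CREATE TABLE table2 (id INTEGER, phone TEXT, email TEXT);
    INSERT INTO table1 VALUES (1, 'Dave', '123 Veneu'), (2, 'John', '23 Park');
    INSERT INTO table2 VALUES
        (1, '555-5', 'a@example.com'),
        (1, '66-66', 'b@example.com'),
        (2, '322-1', 'c@example.com');
""")

def build_report(connection):
    """Walk both ordered result sets in step and emit report lines."""
    masters = connection.execute(
        "SELECT id, name, address FROM table1 ORDER BY id")
    details = connection.execute(
        "SELECT id, phone, email FROM table2 ORDER BY id, phone")
    lines = []
    detail = details.fetchone()
    for master_id, name, address in masters:
        lines.append(f"{master_id:02d} {name} {address}")
        # Skip detail rows that belong to no master row.
        while detail is not None and detail[0] < master_id:
            detail = details.fetchone()
        # Emit every detail row for the current master.
        while detail is not None and detail[0] == master_id:
            lines.append(f"   {detail[1]} {detail[2]}")
            detail = details.fetchone()
    return lines

report = build_report(conn)
```

Each table is read exactly once, and all formatting happens in the application, in line with the advice above.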
In order to have one-to-many relationship, as in your example (one person has multiple phones and emails), you need to add some kind of link column to the second table, which would contain the ID of the person the email / phone belongs to.
So your table structure should look like this:
table1
------
ID
NAME
ADDRESS
table2
-------
ID
TABLE1_ID
PHONE
EMAIL
Then, you could query your data using joins:
SELECT table1.name, table1.address, table2.phone, table2.email
FROM table1
INNER JOIN table2 ON table2.table1_id = table1.id
I strongly agree with dems, but if you really need to come up with something like that, the following could work (albeit without the empty lines)
SELECT case
         when group_rn = 1 then id
         else ''
       end as id,
       case
         when group_rn = 1 then name
         else phone
       end as name_phone_column,
       case
         when group_rn = 1 then address
         else email
       end as address_email_column
FROM (
  SELECT t1.id,
         t1.name,
         t1.address,
         t2.email,
         t2.phone,
         row_number() over (partition by t1.id order by t1.name) as group_rn
  FROM table1 t1
  LEFT JOIN table2 t2 ON t1.id = t2.id
) t
ORDER BY t.id, t.group_rn
This assumes that phone and name both have the same datatype, just like address and email.

SQL First Match or just First

Because this is part of a larger SQL SELECT statement, I want a wholly SQL query that selects the first item matching a criteria or, if no items match the criteria, just the first item.
I.e. using Linq I want:
Dim t1 = From t In Tt
Dim t2 = From t In t1 Where Criteria(t)
Dim firstIfAny = From t In If(t2.Any, t2, t1) Take 1 Select t
Because If is not part of Linq, LinqPad doesn't show a single SQL statement, but two, the second depending upon whether the Criteria matches any of the Tt values.
I know it will be SELECT TOP 1 etc. and I can add ORDER BY clauses to get the specific first one I want, but I'm having trouble thinking of the most straightforward way to get the first of two criteria. (It was at exactly this point when I was able to solve this myself.)
Seeing as I don't see an existing question for this, I will let it stand. I'm sure someone else will see the answer quickly.
select top 1 *
from (
    select * from (
        select top 1 *, 1 as Rank from MyTable where SomeColumn = MyCriteria order by MyOrderColumn
    ) m
    union all
    select * from (
        select top 1 *, 2 as Rank from MyTable order by MyOrderColumn
    ) f
) a
order by Rank
I've gone with this:
SELECT TOP 1 *
FROM MyTable
WHERE SomeColumn = MyCriteria
OR NOT (EXISTS (SELECT NULL FROM MyTable WHERE SomeColumn = MyCriteria))
ORDER BY MyOrdering
My actual SomeColumn = MyCriteria is rather more complex of course, as well as other unrelated where clauses.
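The "match, or else just the first row" query can be tried out with a small Python sketch using the standard sqlite3 module. The table my_table, its colour column and the helper first_match_or_first are assumptions for illustration, and LIMIT 1 replaces TOP 1 because SQLite has no TOP clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table; id doubles as the ordering column.
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, colour TEXT);
    INSERT INTO my_table VALUES (1, 'red'), (2, 'green'), (3, 'blue');
""")

# If any row matches, the NOT EXISTS branch is false for every row, so
# only matching rows survive; if none match, it is true for all rows and
# the ORDER BY / LIMIT simply returns the first row of the table.
FIRST_MATCH_OR_FIRST = """
    SELECT id, colour
    FROM my_table
    WHERE colour = :wanted
       OR NOT EXISTS (SELECT NULL FROM my_table WHERE colour = :wanted)
    ORDER BY id
    LIMIT 1
"""

def first_match_or_first(wanted):
    """Return the first matching row, or the first row if none match."""
    return conn.execute(FIRST_MATCH_OR_FIRST, {"wanted": wanted}).fetchone()

matched = first_match_or_first("blue")      # a row matches the criterion
fallback = first_match_or_first("purple")   # nothing matches
```

Here matched is the 'blue' row even though it is not first in the ordering, while fallback is the first row overall because no row has the colour 'purple'.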