Finding duplicate rows and related records

I have a database that was migrated into a new schema. The old database had no referential integrity and so I need to get rid of lots of duplicates.
I have a table of RegisteredVehicles:
id | plate | state
# | 1425 | il
# | 3322 | il
And a table of ParkingRequests:
id | date | registeredVehicleId (FK)
# | 2/2/12 | #
The relationship is one to many: one registered vehicle to many requests.
The following query gets me each duplicate record by Plate and State and also outputs each RegisteredVehicle's Id.
select Id, Plate, [State] from RegisteredVehicles where Plate in (
select plate from RegisteredVehicles group by Plate having count(*) > 1
)
Which gives me something like this
Id Plate State
036d59f1-d928-40f2-b373-049122202bff 0000000 IL
615e2fab-8b43-4e42-b6f0-268038bba949 0000000
I am trying to get a count of parking requests for each vehicle row returned by the above query. Something like this
Id | Plate | State | # Requests
1 | 222 | IL | 2
2 | 333 | IL | 4
But I am having issues making the query more complex than it already is. This itself took me quite a while to get working.

Please try this query :
SELECT
A.ID,
A.PLATE,
A.STATE AS [STATE],
COUNT(B.ID) AS [NO OF REQUESTS] -- count matched requests, so vehicles with none show 0
FROM REGISTEREDVEHICLES A
LEFT JOIN PARKINGREQUESTS B
ON B.REGISTEREDVEHICLEID = A.ID
WHERE
A.PLATE IN(
SELECT
PLATE
FROM REGISTEREDVEHICLES
GROUP BY PLATE
HAVING COUNT(*) > 1
)
GROUP BY
A.ID,
A.PLATE,
A.STATE
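
As a rough check of this approach, here is a small sketch that runs the same kind of query against an in-memory SQLite database from Python. The sample rows and the function names are made up for illustration, and SQLite is only a stand-in for the asker's database. It counts the request-side key (b.id), so a duplicate vehicle with no parking requests reports 0.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with made-up vehicles and requests."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE RegisteredVehicles (id INTEGER PRIMARY KEY, plate TEXT, state TEXT);
        CREATE TABLE ParkingRequests (id INTEGER PRIMARY KEY, date TEXT, registeredVehicleId INTEGER);
        -- plate 222 is duplicated, plate 333 is not
        INSERT INTO RegisteredVehicles VALUES (1, '222', 'IL'), (2, '222', 'IL'), (3, '333', 'IL');
        -- both requests belong to vehicle 1; vehicle 2 has none
        INSERT INTO ParkingRequests VALUES (1, '2012-02-02', 1), (2, '2012-02-03', 1);
    """)
    return conn


def request_counts_for_duplicate_plates(conn):
    """Return (id, plate, state, request_count) for vehicles whose plate occurs more than once."""
    sql = """
        SELECT a.id, a.plate, a.state, COUNT(b.id) AS no_of_requests
        FROM RegisteredVehicles a
        LEFT JOIN ParkingRequests b ON b.registeredVehicleId = a.id
        WHERE a.plate IN (
            SELECT plate FROM RegisteredVehicles GROUP BY plate HAVING COUNT(*) > 1
        )
        GROUP BY a.id, a.plate, a.state
        ORDER BY a.id
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in request_counts_for_duplicate_plates(build_sample_db()):
        print(row)
```

With this sample data only the two vehicles sharing plate 222 come back, one with two requests and one with none.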

Related

How to get the frequency in postgresql

I have a table (table_1) like this (I have simplified it)
code | description | season
----------+------------------+--------------
500 | info 1 | fall
500 | info 4 | fall
500 | info 8 | fall
500 | info 1 | winter
300 | info 1 | spring
400 | info 1 | fall
And I want a table like below, where I have the frequency of codes in each season
season | Number of Unique Codes
----------+------------------------
fall | 2
winter | 1
spring | 1
So far I have this:
SELECT
season,
count(DISTINCT code) AS "Number of Unique Codes"
FROM table_1
WHERE code IS NOT NULL
GROUP BY season
ORDER BY code desc;
However, I am running into a few issues.

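Your error is in the ORDER BY: after GROUP BY season, code is neither a grouping column nor inside an aggregate, so it cannot be used for sorting. Change your ORDER BY to sort by the alias you created.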
SELECT
season,
count(distinct code) AS "Number of Unique Codes"
FROM table_1
WHERE code IS NOT NULL
GROUP BY season
ORDER BY "Number of Unique Codes" DESC;
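
To try this out, a minimal sketch follows that loads the sample rows from the question into an in-memory SQLite database (used here only as a convenient stand-in for PostgreSQL) and sorts by the alias. The function name and the extra tie-breaker on season are assumptions added so the output order is deterministic.

```python
import sqlite3


def unique_codes_per_season():
    """Count distinct codes per season, sorted by that count (ties broken by season)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table_1 (code INTEGER, description TEXT, season TEXT);
        INSERT INTO table_1 VALUES
            (500, 'info 1', 'fall'), (500, 'info 4', 'fall'), (500, 'info 8', 'fall'),
            (500, 'info 1', 'winter'), (300, 'info 1', 'spring'), (400, 'info 1', 'fall');
    """)
    sql = """
        SELECT season, count(DISTINCT code) AS "Number of Unique Codes"
        FROM table_1
        WHERE code IS NOT NULL
        GROUP BY season
        ORDER BY "Number of Unique Codes" DESC, season
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for season, n_codes in unique_codes_per_season():
        print(season, n_codes)
```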

Compare Two Relations in SQL

I just started studying SQL and this is a demo given by the teacher in an online course and it works fine. The statement is looking for "students such that number of other students with same GPA is equal to number of other students with same sizeHS":
select *
from Student S1
where (
select count(*)
from Student S2
where S2.sID <> S1.sID and S2.GPA = S1.GPA
) = (
select count(*)
from Student S2
where S2.sID <> S1.sID and S2.sizeHS = S1.sizeHS
);
It seems that in this where clause we're comparing two relations (because the result of a subquery is a relation), but most of the time we are comparing attributes (as far as I've seen).
So I'm thinking about whether there are requirements for how many attributes, and how many tuples, the RELATION should contain when comparing two RELATIONS. If not, how do we compare two RELATIONS when there're multiple attributes or multiple tuples and what do we get for result?
Note:
Student relation has 4 attributes: sID, sName, GPA, sizeHS. And here's the data:
+-----+--------+-----+--------+
| sID | sName | GPA | sizeHS |
+-----+--------+-----+--------+
| 123 | Amy | 3.9 | 1000 |
| 234 | Bob | 3.6 | 1500 |
| 345 | Craig | 3.5 | 500 |
| 456 | Doris | 3.9 | 1000 |
| 567 | Edward | 2.9 | 2000 |
| 678 | Fay | 3.8 | 200 |
| 789 | Gary | 3.4 | 800 |
| 987 | Helen | 3.7 | 800 |
| 876 | Irene | 3.9 | 400 |
| 765 | Jay | 2.9 | 1500 |
| 654 | Amy | 3.9 | 1000 |
| 543 | Craig | 3.4 | 2000 |
+-----+--------+-----+--------+
and the result of this query is:
+-----+--------+-----+---------+
| sID | sName | GPA | sizeHS |
+-----+--------+-----+---------+
| 345 | Craig | 3.5 | 500 |
| 567 | Edward | 2.9 | 2000 |
| 678 | Fay | 3.8 | 200 |
| 789 | Gary | 3.4 | 800 |
| 765 | Jay | 2.9 | 1500 |
| 543 | Craig | 3.4 | 2000 |
+-----+--------+-----+---------+

"because the result of a subquery is a relation"
Relation is the scientific name for what we call a table in a database and I like the name "table" much better than "relation". A table is easy to imagine. We know them from our school time schedule for instance. Yes, we relate things here inside a table (day and time and the subject taught in school), but we can also relate tables to tables (pupils' timetables with the table of class rooms, the overall subject schedule, and the teacher's timetables). As such, tables in an RDBMS are also related to each other (hence the name relational database management system). I find the name relation for a table quite confusing (and many people use the word "relation" to describe the relations between tables instead).
So, yes, a query result itself is again a table ("relation"). And from tables we can of course select:
select * from (select * from b) as subq;
And then there are scalar queries that return exactly one row and one column. select count(*) from b is such a query. While this is still a table we can select from
select * from (select count(*) as cnt from b) as subq;
we can even use them where we usually have single values, e.g. in the select clause:
select a.*, (select count(*) from b) as cnt from a;
In your query you have two scalar subqueries in your where clause.
With subqueries there is another distinction to make: we have correlated and non-correlated subqueries. The last query I have just shown contains a non-correlated subquery. It selects the count of b rows for every single result row, no matter what that row contains otherwise. A correlated subquery on the other hand may look like this:
select a.*, (select count(*) from b where b.x = a.y) as cnt from a;
Here, the subquery is related to the main table. For every result row we look up the count of b rows matching the a row we are displaying via where b.x = a.y, so the count is different from row to row (but we'd get the same count for a rows sharing the same y value).
Your subqueries are also correlated. As with the select clause, the where clause deals with one row at a time (in order to keep or dismiss it). So we look at one student S1 at a time. For this student we count other students (S2, where S2.sID <> S1.sID) who have the same GPA (and S2.GPA = S1.GPA) and count other students who have the same sizeHS. We only keep students (S1) where there are exactly as many other students with the same GPA as there are with the same sizeHS.
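
The row-by-row evaluation described above can be reproduced with a short sketch. It loads the Student data from the question into an in-memory SQLite database (an assumption made for convenience; the course may use a different DBMS) and runs the two correlated scalar subqueries. The function name is made up for illustration.

```python
import sqlite3

# Sample rows copied from the question: (sID, sName, GPA, sizeHS)
STUDENTS = [
    (123, "Amy", 3.9, 1000), (234, "Bob", 3.6, 1500), (345, "Craig", 3.5, 500),
    (456, "Doris", 3.9, 1000), (567, "Edward", 2.9, 2000), (678, "Fay", 3.8, 200),
    (789, "Gary", 3.4, 800), (987, "Helen", 3.7, 800), (876, "Irene", 3.9, 400),
    (765, "Jay", 2.9, 1500), (654, "Amy", 3.9, 1000), (543, "Craig", 3.4, 2000),
]


def students_with_equal_counts():
    """Return sIDs where #others with same GPA equals #others with same sizeHS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Student (sID INTEGER, sName TEXT, GPA REAL, sizeHS INTEGER)")
    conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", STUDENTS)
    sql = """
        SELECT S1.sID
        FROM Student S1
        WHERE (
            -- scalar subquery 1: one number for the current S1 row
            SELECT count(*) FROM Student S2
            WHERE S2.sID <> S1.sID AND S2.GPA = S1.GPA
        ) = (
            -- scalar subquery 2: again one number for the same S1 row
            SELECT count(*) FROM Student S2
            WHERE S2.sID <> S1.sID AND S2.sizeHS = S1.sizeHS
        )
    """
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    print(sorted(students_with_equal_counts()))
```

Each subquery yields a single count per outer row, so the where clause ends up comparing two plain numbers, not two multi-row tables.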
UPDATE
As to comparing tuples (several columns at once), as in
select *
from Student S1
where (
select count(*), avg(grade)
from Student S2
where S2.sID <> S1.sID and S2.GPA = S1.GPA
) = (
select count(*), avg(grade)
from Student S2
where S2.sID <> S1.sID and S2.sizeHS = S1.sizeHS
);
this is possible in some DBMS, but not in SQL Server, which does not support comparing tuples (row values).
But there are other means to achieve the same. You could just add two subqueries:
select * from student s1
where (...) = (...) -- compare counts here
and (...) = (...) -- compare averages here
Or get the data in the FROM clause and then deal with it. E.g.:
select *
from Student S1
cross apply
(
select count(*) as cnt, avg(grade) as avg_grade
from Student S2
where S2.sID <> S1.sID and S2.GPA = S1.GPA
) sx
cross apply
(
select count(*) as cnt, avg(grade) as avg_grade
from Student S2
where S2.sID <> S1.sID and S2.sizeHS = S1.sizeHS
) sy
where sx.cnt = sy.cnt and sx.avg_grade = sy.avg_grade;

There are relational operations:
The intersection operator produces the set of tuples that two relations share in common. Intersection is implemented in SQL in the form of the INTERSECT operator.
The difference operator acts on two relations and produces the set of tuples from the first relation that do not exist in the second relation. Difference is implemented in SQL in the form of the EXCEPT or MINUS operator.
So, in the context of SQL Server, for example, you can do:
SELECT *
FROM R1
EXCEPT
SELECT *
FROM R2
to get the rows in R1 that are not in R2. Run the reverse query (R2 EXCEPT R1) as well to get all differences.
Of course, both queries must return the same attributes; if the tables differ, you need to list the attributes explicitly in each SELECT.
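
A minimal sketch of the two-directional difference, run against SQLite from Python since SQLite also implements EXCEPT. The tables R1 and R2 follow the answer; their two columns (a, b) and the helper name are assumptions made up for this example.

```python
import sqlite3


def table_differences(rows_r1, rows_r2):
    """Return (rows only in R1, rows only in R2) using EXCEPT in both directions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE R1 (a INTEGER, b TEXT);
        CREATE TABLE R2 (a INTEGER, b TEXT);
    """)
    conn.executemany("INSERT INTO R1 VALUES (?, ?)", rows_r1)
    conn.executemany("INSERT INTO R2 VALUES (?, ?)", rows_r2)
    # EXCEPT compares whole rows, so both SELECTs must list the same attributes
    only_r1 = conn.execute("SELECT a, b FROM R1 EXCEPT SELECT a, b FROM R2").fetchall()
    only_r2 = conn.execute("SELECT a, b FROM R2 EXCEPT SELECT a, b FROM R1").fetchall()
    return only_r1, only_r2


if __name__ == "__main__":
    print(table_differences([(1, "x"), (2, "y")], [(2, "y"), (3, "z")]))
```

If both lists come back empty, the two tables contain the same set of distinct rows; EXCEPT ignores duplicate rows, so it does not detect differing row counts.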

How can I do a group-concat call with a max value?

I'm tracking game prices across multiple stores. I have a games table:
id | title | platform_id
---|-------------|-----------
1 | Super Mario | 1
2 | Tetris | 3
3 | Sonic | 2
a stores table:
id | title
---|-------------
1 | Target
2 | Amazon
3 | EB Games
and a copies table with one entry for Target's copy of a given game, one entry for Amazon's, etc. I store the SKU so I can use it when scraping their websites.
game_id | store_id | sku
--------|----------|----------
1 | 2 | AMZ-3F4YK
1 | 3 | 001481
I run one scrape a day or a week or however long, and I store the result as cents in a prices table:
sku | price | time
----------|---------|------
AMZ-3F4YK | 4010 | 13811101
001481 | 3210 | 13811105
Plus a platforms table that just maps IDs to names.
Here's where I get confused and stuck.
I want to issue a query that selects each game, plus its most recent price at each store. So it would net results like
games.title | platform_name | info
------------|---------------|------
Super Mario | NES | EB Games,1050;Amazon,3720;Target,5995
Tetris | Game Boy | EB Games,3720;Amazon,410;Target,5995
My best attempt thus far is
select
games.title as title,
platforms.name as platform,
group_concat(distinct(stores.name) || "~" || prices.price) as price_info
from games
join platforms on games.platform_id = platforms.id
join copies on copies.game_id = games.id
join prices on prices.sku = copies.sku
join stores on stores.id = copies.store_id
group by title
Which nets results like
Super Mario | NES | EB Games~2300,Target~2300,Target~3800
that is, it includes every price listed, when I only want one per store (and for it to be the most recent). Figuring out how to integrate the 'select price where id = (select id from max(time)...' etc subquery to sort this out has totally stumped me all night and I'd appreciate any advice anyone could offer me.
I'm using SQLite, but if there's a better option in Postgres I could do it there.

You need two levels of aggregation . . . And, Postgres is much simpler for this, so I'll use Postgres syntax:
select title, platform,
string_agg(store || '~' || price, ',' order by store) as price_info
from (select distinct on (g.title, p.name, s.name) g.title as title, p.name as platform, s.name as store, pr.price
from games g join
platforms p
on g.platform_id = p.id join
copies c
on c.game_id = g.id join
prices pr
on pr.sku = c.sku join
stores s
on s.id = c.store_id
order by g.title, p.name, s.name, pr.time desc -- newest price first within each game/platform/store
) gps
group by title, platform
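
Since the question mentions SQLite, here is a hedged sketch of the same two-step idea there, driven from Python. It swaps in a different technique from the answer: instead of DISTINCT ON (which SQLite lacks), a correlated subquery keeps only the row with the latest time per SKU before group_concat runs. Table and column names follow the asker's query (stores.name, platforms.name); the second, newer price for each SKU is made-up sample data, and the function names are hypothetical.

```python
import sqlite3


def build_price_db():
    """In-memory database with one game, two stores and two scrapes per SKU."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE platforms (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE games (id INTEGER PRIMARY KEY, title TEXT, platform_id INTEGER);
        CREATE TABLE stores (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE copies (game_id INTEGER, store_id INTEGER, sku TEXT);
        CREATE TABLE prices (sku TEXT, price INTEGER, time INTEGER);
        INSERT INTO platforms VALUES (1, 'NES');
        INSERT INTO games VALUES (1, 'Super Mario', 1);
        INSERT INTO stores VALUES (1, 'Target'), (2, 'Amazon'), (3, 'EB Games');
        INSERT INTO copies VALUES (1, 2, 'AMZ-3F4YK'), (1, 3, '001481');
        -- the older rows come from the question, the newer ones are invented
        INSERT INTO prices VALUES
            ('AMZ-3F4YK', 4010, 13811101), ('AMZ-3F4YK', 3720, 13811200),
            ('001481', 3210, 13811105), ('001481', 1050, 13811300);
    """)
    return conn


def latest_price_info(conn):
    """Return (title, platform, 'store~price,...') using only the newest price per SKU."""
    sql = """
        SELECT g.title, p.name, group_concat(s.name || '~' || pr.price) AS price_info
        FROM games g
        JOIN platforms p ON g.platform_id = p.id
        JOIN copies c ON c.game_id = g.id
        JOIN stores s ON s.id = c.store_id
        JOIN prices pr ON pr.sku = c.sku
        -- keep only the most recent scrape for this SKU
        WHERE pr.time = (SELECT max(time) FROM prices WHERE sku = c.sku)
        GROUP BY g.id, g.title, p.name
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(latest_price_info(build_price_db()))
```

The order of entries inside the group_concat string is not guaranteed in SQLite, so code that consumes it should split the string and not rely on position.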

SQL Query to return a distinct count of one column while allowing a full summation of a second column, grouped by a third

I'm writing a query in Access 2010 and I can't use count(distinct ...), so I'm running into a bit of trouble with the problem described below:
An example of my table is as follows
Provider | Member ID | Dollars | Status
FacilityA | 1001 | 50 | Pended
FacilityA | 1001 | 100 | Paid
FacilityA | 1002 | 200 | Paid
FacilityB | 1005 | 30 | Pended
FacilityB | 1009 | 90 | Pended
FacilityC | 1001 | 100 | Paid
FacilityC | 1008 | 500 | Paid
I want to return the total # of unique members that have visited each facility, but I also want to get the total dollar amount that is Pended, so for this example the ideal output would be
Provider | # members | Total Pended charges
FacilityA | 2 | 50
FacilityB | 2 | 120
FacilityC | 2 | 0
I tried using some code I found here: Count Distinct in a Group By aggregate function in Access 2007 SQL
and here:
SQL: Count distinct values from one column based on multiple criteria in other columns
Copying the code from the first link provided by gzaxx:
SELECT cd.DiagCode, Count(cd.CustomerID)
FROM (select distinct DiagCode, CustomerID from CustomerTable) as cd
Group By cd.DiagCode;
I can make this work for counting the members:
SELECT cd.Provider_Number, Count(cd.Member_ID)
FROM (select distinct Provider_Number, Member_ID from Claims_Table) as cd
Group By cd.Provider_Number;
However, no matter what I try I can't get a second portion dealing with the dollars to work without causing an error or messing up the calculation on the member count.

SELECT cd.Provider_Number,
-- claims_table.Member_ID, claims_table.Dollars
SUM(IIF ( Claims_Table.Status = 'Pended' , Claims_Table.Dollars , 0 )) as Dollars_Pending,
Count(cd.Member_ID) as Uniq_Members,
Sum(Dollars) as Dollar_Wrong
FROM (select distinct Provider_Number, Member_ID from Claims_Table) as cd inner join #claims_table
ON claims_table.Provider_Number=cd.Provider_Number and claims_table.Member_ID = cd.Member_ID
Group By cd.Provider_Number;

This should work fine based only on the table you described (named Tabelle1):
SELECT Provider, count(MemberID) as [# Members],
NZ(SUM(SWITCH([Status]='Pended', Dollars)),0) as [Total pending charges]
FROM Tabelle1
GROUP BY Provider;
Explanation
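The first column is self-explanatory. The second, count(MemberID), counts rows rather than distinct members, so a member with several rows at one provider (such as 1001 at FacilityA) is counted once per row; to count unique members, select from a subquery of distinct Provider and MemberID pairs as in the question.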
The third column is where most things are done. The SWITCH([Status]='Pended', Dollars) returns the Dollars only if the status is pending. This then gets summed up by SUM. The NZ(..,0) will set the column to 0 if the SUM returns a NULL.
EDIT: This was tested on Access 2016
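
One way to get both the unique member count and the pended total without COUNT(DISTINCT ...) is two levels of aggregation: first collapse to one row per provider and member, then aggregate per provider. The sketch below shows that idea on the sample data from the question. It runs in SQLite from Python, not in Access, so CASE stands in for IIF/SWITCH, and the table name claims, the column name MemberID and the function name are assumptions for illustration.

```python
import sqlite3

# Sample rows from the question: (Provider, MemberID, Dollars, Status)
CLAIMS = [
    ("FacilityA", 1001, 50, "Pended"), ("FacilityA", 1001, 100, "Paid"),
    ("FacilityA", 1002, 200, "Paid"), ("FacilityB", 1005, 30, "Pended"),
    ("FacilityB", 1009, 90, "Pended"), ("FacilityC", 1001, 100, "Paid"),
    ("FacilityC", 1008, 500, "Paid"),
]


def members_and_pended_by_provider():
    """Return (provider, unique_members, total_pended) per provider."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE claims (Provider TEXT, MemberID INTEGER, Dollars INTEGER, Status TEXT)")
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?)", CLAIMS)
    sql = """
        SELECT Provider, COUNT(*) AS members, SUM(pended) AS total_pended
        FROM (
            -- inner level: exactly one row per provider and member
            SELECT Provider, MemberID,
                   SUM(CASE WHEN Status = 'Pended' THEN Dollars ELSE 0 END) AS pended
            FROM claims
            GROUP BY Provider, MemberID
        ) AS per_member
        GROUP BY Provider
        ORDER BY Provider
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in members_and_pended_by_provider():
        print(row)
```

Because the dollars are summed inside the per-member subquery, joining back to the claims table is not needed and the member count is not inflated by repeated rows.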

Fetch Id's that are related to a specific set of items, but not others

Good morning all, apologies for the title... I had trouble simplifying the problem down to a line. My database platform is Teradata.
I am working with a table like the following (let's call it "t1")
+------------+----------------------------------------+
| Service_Id | Product |
+------------+----------------------------------------+
| 1 | Traffic |
| 1 | Weather |
| 1 | Travel |
| 1 | Audio |
| 1 | Audio Add-on |
| 2 | Traffic |
| 2 | Weather |
| 2 | Travel |
+------------+----------------------------------------+
I am trying to select service_id's that are related to the following products AND ONLY the following products: Traffic, Weather, Travel
"Service_Id = 1" does not apply here because while it has the required products, it also has an "audio" product related to it... so we have to leave it out. I was able to successfully do this through a series of temp (volatile) tables but it's feeling really hacky and I feel there's got to be a better way. Thanks for your assistance.

I'm doing stuff like that (find a subset/superset/exact match for a set of rows) in my training classes using pizzas :-)
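There are several ways to get your result, but for an exact match the easiest way is a SUM using the following logic: each wanted product adds 1 and any other product subtracts 1, so the total reaches 3 only when all three wanted products are present and nothing else is.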
SELECT service_id
FROM t1
GROUP BY 1
HAVING
SUM(CASE WHEN Product IN ('Traffic', 'Weather', 'Travel') THEN 1 ELSE -1 END) = 3
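
To see the scoring at work, here is a small sketch using the sample table from the question, run in SQLite from Python as a stand-in for Teradata. It assumes each product appears at most once per service; the function name is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (Service_Id, Product)
T1_ROWS = [
    (1, "Traffic"), (1, "Weather"), (1, "Travel"), (1, "Audio"), (1, "Audio Add-on"),
    (2, "Traffic"), (2, "Weather"), (2, "Travel"),
]


def services_with_exact_products():
    """Return service ids that have Traffic, Weather and Travel and nothing else."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (Service_Id INTEGER, Product TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", T1_ROWS)
    sql = """
        SELECT Service_Id
        FROM t1
        GROUP BY Service_Id
        -- wanted products score +1, anything else scores -1
        HAVING SUM(CASE WHEN Product IN ('Traffic', 'Weather', 'Travel')
                        THEN 1 ELSE -1 END) = 3
        ORDER BY Service_Id
    """
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    print(services_with_exact_products())
```

Service 1 scores 3 - 2 = 1 because of its two audio products and is dropped, while service 2 scores exactly 3.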

Assuming that Product is unique for every service_ID.
SELECT service_ID
FROM tableName a
WHERE Product IN ('Traffic', 'Weather', 'Travel') AND
EXISTS
(
SELECT 1
FROM tableName b
WHERE a.Service_ID = b.Service_ID
GROUP BY b.Service_ID
HAVING COUNT(*) = 3 -- <<== total number of products
)
GROUP BY service_ID
HAVING COUNT(*) = 3 -- <<== total number of products
SQLFiddle Demo (demo is running under MySQL database, not sure if it will work on teradata)
