Finding most popular and most unique records using SQL - sql

My mom wanted a baby name game for my brother's baby shower. Wanting to learn python, I volunteered to do it. I pretty much have the python bit, it's the SQL that is throwing me.
The way the game is supposed to work is everyone at the shower writes down names on paper, I manually enter them into Excel (normalizing spellings as much as possible) and export to MS Access. Then I run my python program to find the player with the most popular names and the player with the most unique names. The database, called "babynames", is just four columns.
ID | BabyFirstName | BabyMiddleName | PlayerName
---|---------------|----------------|-----------
My mom has changed things every so often, but as they stand right now, I have to figure out:
a) The most popular name (or names if there is a tie) out of all first and middle names
b) The most unique name (or names if there is a tie) out of all the first and middle names
c) The player that has the most number of popular names (wins a prize)
d) The player that has the most number of unique names (wins a prize)
I've been working on this for about a week now and can't even get a SQL query for a) and b) to work, much less c) and d). I'm more than just a bit frustrated.
BTW, I'm just looking at spellings of the names, not phonetics. As I manually enter names, I will change names like "Kris" to "Chris" and "Xtina" to "Christina" etc.
Editing to add a couple of the most recent queries I tried for a)
SELECT [BabyFirstName],
COUNT ([BabyFirstName]) AS 'FirstNameOccurrence'
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY 'FirstNameOccurrence' DESC
LIMIT 1
and
SELECT [BabyFirstName]
FROM [babynames]
GROUP BY [BabyFirstName]
HAVING COUNT(*) =
(SELECT COUNT(*)
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY COUNT(*) DESC
LIMIT 1)
These both lead to syntax errors.
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Microsoft Access Driver] Syntax error in ORDER BY clause. (-3508) (SQLExecDirectW)')
I've tried using [FirstNameOccurrence] and just FirstNameOccurrence as well with the same error. Not sure why it's not recognizing it by that column name to order by.
pyodbc.ProgrammingError: ('42000', "[42000] [Microsoft][ODBC Microsoft Access Driver] Syntax error. in query expression 'COUNT(*) = (SELECT COUNT(*) FROM [babynames] GROUP BY [BabyFirstName] ORDER BY COUNT(*) DESC LIMIT 1)'. (-3100) (SQLExecDirectW)")
I'll admit that I'm not really grokking all of the COUNT(*) commands here, but this was a solution for a similar issue here in stackoverflow that I figured I'd try when my other idea didn't pan out.

For A and B, use a GROUP BY clause in your SQL, count the rows in each group, and order by that count. Use descending order for A and ascending order for B, and take the first result for each.
For C and D, use essentially the same strategy but add PlayerName to the grouping (e.g. GROUP BY babyname, playername), then apply the same descending/ascending ordering.
Here's Microsoft's write-up for a group by clause in MS Access: https://office.microsoft.com/en-us/access-help/group-by-clause-HA001231482.aspx
Here's an even better write-up demonstrating how to do both group by and order by at the same time: http://rogersaccessblog.blogspot.com/2009/06/select-queries-part-3-sorting-and.html
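Since the game already runs through Python, the counting can also be cross-checked outside the database. The following is only a minimal sketch: it assumes the rows have already been fetched as (BabyFirstName, BabyMiddleName, PlayerName) tuples, the sample data is made up, and the helper names (names_with_count, player_scores) are hypothetical. It pools first and middle names together and keeps ties rather than taking a single first row.

```python
from collections import Counter

# Hypothetical rows as (BabyFirstName, BabyMiddleName, PlayerName).
rows = [
    ("Emma", "Grace", "Ann"),
    ("Liam", "James", "Ann"),
    ("Emma", "Rose", "Bob"),
    ("Noah", "James", "Bob"),
    ("Olivia", "Grace", "Cat"),
]

# Count every first and middle name in one shared pool.
name_counts = Counter()
for first, middle, _player in rows:
    name_counts.update([first, middle])


def names_with_count(counts, target):
    """Return every name whose count equals target, sorted (ties are kept)."""
    return sorted(name for name, c in counts.items() if c == target)


# (a) most popular and (b) most unique names, including ties.
most_popular = names_with_count(name_counts, max(name_counts.values()))
most_unique = names_with_count(name_counts, min(name_counts.values()))


def player_scores(rows, winning_names):
    """Count, per player, how many submitted names are in winning_names."""
    scores = Counter()
    for first, middle, player in rows:
        scores[player] += sum(n in winning_names for n in (first, middle))
    return scores


# (c) and (d): per-player tallies of popular and unique names.
popular_by_player = player_scores(rows, set(most_popular))
unique_by_player = player_scores(rows, set(most_unique))

print(most_popular, most_unique)
print(popular_by_player.most_common(), unique_by_player.most_common())
```

The winners for (c) and (d) are whichever players hold the highest tally in each Counter; if several players share that tally, they tie.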

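For the first query you tried, change it to the following. Access treats 'FirstNameOccurrence' in single quotes as a string literal and does not resolve column aliases in ORDER BY, so the aggregate is repeated there:
SELECT TOP 1 [BabyFirstName],
COUNT([BabyFirstName]) AS FirstNameOccurrence
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY COUNT([BabyFirstName]) DESC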
For the second, change it to:
SELECT [BabyFirstName]
FROM [babynames]
GROUP BY [BabyFirstName]
HAVING COUNT(*) =
(SELECT TOP 1 COUNT(*)
FROM [babynames]
GROUP BY [BabyFirstName]
ORDER BY COUNT(*) DESC)
Limiting the number of records returned by a SQL statement in Access is achieved by adding a TOP predicate directly after SELECT, not by appending LIMIT after ORDER BY.
Also, TOP in Access returns every record that ties on the ORDER BY fields for the last place, so if two or more names share the highest count and TOP 1 is specified, you'll see them all.
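To see the tie-aware HAVING pattern in action without Access at hand, here is a rough sketch that runs against an in-memory SQLite database as a stand-in. Two things are assumptions on my part: the sample rows are invented, and the inner query uses MAX over the grouped counts instead of TOP 1, because SQLite has no TOP. The idea is the same: keep every name whose count equals the highest count.

```python
import sqlite3

# In-memory SQLite database standing in for the Access file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE babynames ("
    "ID INTEGER PRIMARY KEY, BabyFirstName TEXT, "
    "BabyMiddleName TEXT, PlayerName TEXT)"
)
conn.executemany(
    "INSERT INTO babynames (BabyFirstName, BabyMiddleName, PlayerName) "
    "VALUES (?, ?, ?)",
    [
        ("Emma", "Grace", "Ann"),
        ("Liam", "James", "Ann"),
        ("Emma", "Rose", "Bob"),
        ("Liam", "Lee", "Bob"),
        ("Noah", "James", "Cat"),
    ],
)

# Keep every first name whose count equals the largest group count,
# so tied names are all returned.
query = """
SELECT BabyFirstName, COUNT(*) AS FirstNameOccurrence
FROM babynames
GROUP BY BabyFirstName
HAVING COUNT(*) = (
    SELECT MAX(cnt)
    FROM (SELECT COUNT(*) AS cnt
          FROM babynames
          GROUP BY BabyFirstName) AS counts
)
ORDER BY BabyFirstName
"""
top_first_names = conn.execute(query).fetchall()
print(top_first_names)
```

Swapping MAX for MIN in the inner query gives the least common names in the same way.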

Related

row_number error when trying to rank items

I'm trying to get back into SQL query and am having a frustrating problem. I have two questions:
I'm trying to take all items in my dataset and rank them by partitions. I researched this and think it should look like this:
select g.ticker, g.sector, g.industry, g.countryname, g.exchange, c.carbon, c.year,
ROW_NUMBER() OVER (
PARTITION BY g.sector, g.industry, g.countryname, g.exchange
ORDER BY c.carbon DESC
) AS 'Rank'
from "General" g
INNER JOIN carbon c ON upper(c.ticker) =g.ticker ;
The output would be a rank for each group in the partition in this case it would be sector, industry, country name and exchange then the rows are ranked based on their carbon emissions.
I'm getting this error:
Error occurred during SQL script execution
Reason:
SQL Error [42601]: ERROR: syntax error at or near "'Rank'"
Position: 1305
If I remove the rank section, the data joins and provides results (obviously not ranked like I want, but I know the base query works). What am I doing wrong?
Second (related) question: I forgot how much I hated SQL error messages. The error above tells me there's a syntax error, but when I went to the docs I couldn't see anything different between my code and their example. Given my lack of experience, is there a better way to get actionable error messages (e.g. in Python I get a stack trace that I can read to see what part of my code went wrong)?
Thank you!
Don't use single quotes for column aliases. Also, I would suggest avoiding an alias that is already a name in standard SQL (which has a rank() function). I often use seqnum:
select g.ticker, g.sector, g.industry, g.countryname, g.exchange, c.carbon, c.year,
row_number() over (
partition by g.sector, g.industry, g.countryname, g.exchange
order by c.carbon desc
) as seqnum
from "General" g join
carbon c
on upper(c.ticker) = g.ticker ;
Note: You should only use single quotes for string and date constants. If you want to escape a column name, use double quotes (just as your query does for the table name General).
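For anyone who wants to try the partitioned numbering without a database server, here is a small sketch using Python's built-in sqlite3 module as a stand-in. The cut-down tables and values are invented for illustration, the partition is reduced to sector alone, and window functions need SQLite 3.25 or newer, so treat it as an approximation of the query above rather than the same thing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down versions of the two tables.
conn.executescript("""
CREATE TABLE general (ticker TEXT, sector TEXT);
CREATE TABLE carbon (ticker TEXT, carbon REAL);
INSERT INTO general VALUES ('AAA', 'Energy'), ('BBB', 'Energy'), ('CCC', 'Tech');
INSERT INTO carbon VALUES ('aaa', 10.0), ('bbb', 25.0), ('ccc', 5.0);
""")

# Number rows within each sector, highest carbon first.
# The alias is a plain identifier, not a single-quoted string.
rows = conn.execute("""
SELECT g.ticker, g.sector, c.carbon,
       ROW_NUMBER() OVER (
           PARTITION BY g.sector
           ORDER BY c.carbon DESC
       ) AS seqnum
FROM general g
JOIN carbon c ON upper(c.ticker) = g.ticker
ORDER BY g.sector, seqnum
""").fetchall()

for row in rows:
    print(row)
```

Each sector restarts its numbering at 1, which is the per-partition behaviour the question was after.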

How can I get Access SQL to return a dataset of the largest value in each category?

This has been driving me crazy all day, and I've gone through every solution I can find on here. This should be a very simple thing.
I have a table in Access that contains a list of applications:
ApplicantNumber | Region
There are many more columns, but those are the two I care about at the moment. Each row is a separate application, and each applicant can submit multiple applications.
I have a query in Access that finds the count per applicant of applications in each region:
ApplicantNumber | Region | CountOfAPplications
How the ##&*!!! do I pull out of that the region with the most applications for each ApplicantNumber?
As far as I can tell, the following should work fine but it just provides the same output as the initial query with the full count per applicant:
SELECT myQry.ApplicantNumber, myQRY.Region, Max(myQRY.CountOfRegion)
FROM (SELECT AppliedCensusBlocks.ApplicantNumber, AppliedCensusBlocks.Region, Count(AppliedCensusBlocks.Region) AS CountOfRegion
FROM AppliedCensusBlocks
GROUP BY AppliedCensusBlocks.ApplicantNumber, AppliedCensusBlocks.Region) AS myQRY
GROUP BY myQry.ApplicantNumber, myQry.Region
What am I doing wrong? If I remove the Region field, Access will work as I'd expect and just show the ApplicantNumber and maximum count. But I'm really trying to get at the region name associated with the maximum count.
This is a bit tricky. MS Access is not the best suited for this sort of query. But here is one way:
SELECT acb.ApplicantNumber, acb.Region, Count(*) AS CountOfRegion
FROM AppliedCensusBlocks as acb
GROUP BY acb.ApplicantNumber, acb.Region
HAVING COUNT(*) = (SELECT TOP 1 COUNT(*)
FROM AppliedCensusBlocks as acb2
WHERE acb2.ApplicantNumber = acb.ApplicantNumber
GROUP BY acb2.Region
ORDER BY COUNT(*) DESC, acb2.Region
);
SELECT TOP 1 ApplicantNumber, Region, COUNT(*) AS Applications
FROM AppliedCensusBlocks
GROUP BY ApplicantNumber, Region
ORDER BY COUNT(*) DESC
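The per-applicant filter (the HAVING COUNT(*) = (correlated subquery) form) can be tried out on a tiny data set. This is only a sketch under a few assumptions: it uses in-memory SQLite rather than Access, so LIMIT 1 takes the place of TOP 1, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AppliedCensusBlocks (ApplicantNumber INTEGER, Region TEXT)"
)
# Hypothetical applications: applicant 1 favours North, applicant 2 favours West.
conn.executemany(
    "INSERT INTO AppliedCensusBlocks VALUES (?, ?)",
    [
        (1, "North"), (1, "North"), (1, "South"),
        (2, "East"), (2, "West"), (2, "West"), (2, "West"),
    ],
)

# For each (applicant, region) group, keep it only when its count
# equals that applicant's largest region count.
rows = conn.execute("""
SELECT acb.ApplicantNumber, acb.Region, COUNT(*) AS CountOfRegion
FROM AppliedCensusBlocks AS acb
GROUP BY acb.ApplicantNumber, acb.Region
HAVING COUNT(*) = (
    SELECT COUNT(*)
    FROM AppliedCensusBlocks AS acb2
    WHERE acb2.ApplicantNumber = acb.ApplicantNumber
    GROUP BY acb2.Region
    ORDER BY COUNT(*) DESC
    LIMIT 1
)
ORDER BY acb.ApplicantNumber
""").fetchall()
print(rows)
```

If an applicant has two regions with the same highest count, both rows pass the filter, since the comparison is on the count rather than on the region.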

How to get most popular name by year in SQL Server

I am practicing SQL in Microsoft SQL Server 2012 (not a homework question), and have a table Names. The table shows baby names by year, with columns Sex (gender of name), N (number of babies having that name), Yr (year), and Name (the name itself).
I need to write a query using only one SELECT statement that returns the most popular baby name by year, with gender, the year, and the number of babies named. So far I have;
SELECT *
From Names
ORDER By N DESC;
Which gives the highest values of N in DESC order, repeating years. I need to limit it to only the highest value in each year, and everything I have tried to do so has thrown errors. Any advice you can give me for this would be appreciated.
Off the top of my head, something like the following would normally let you do it in (technically) one SELECT statement. That statement includes sub-SELECTs, but I'm not immediately seeing an alternative that wouldn't.
When there are joint top-ranking names, both queries should bring back all joint top results, so there may not be exactly one answer. If you then just need a single representative row from those results, look at using SELECT TOP 1, perhaps adding ORDER BY to get the first alphabetically.
Most popular by year regardless of gender:
-- ONE PER YEAR:
SELECT n.Year, n.Name, n.Gender, n.Qty FROM Name n
WHERE NOT EXISTS (
SELECT 1 FROM Name n2
WHERE n2.Year = n.Year
AND n2.Qty > n.Qty
)
Most popular by year for each gender:
-- ONE PER GENDER PER YEAR:
SELECT n.Year, n.Name, n.Gender, n.Qty FROM Name n
WHERE NOT EXISTS (
SELECT 1 FROM Name n2
WHERE n2.Year = n.Year
AND n2.Gender = n.Gender
AND n2.Qty > n.Qty
)
Performance is, despite the verbosity of the SQL, usually on a par with alternatives when using this pattern (often better).
There are other approaches, including using GROUP statements, but personally I find this one more readable and standard cross-DBMS.
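As a quick way to experiment with the NOT EXISTS pattern, here is a sketch that runs it on an in-memory SQLite database from Python. It uses the column names from the question (Sex, N, Yr, Name) rather than the ones in the queries above; the rows themselves are made up, and SQLite is only a stand-in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Names (Sex TEXT, N INTEGER, Yr INTEGER, Name TEXT)")
# Hypothetical data; 2001 has a deliberate tie for first place.
conn.executemany(
    "INSERT INTO Names VALUES (?, ?, ?, ?)",
    [
        ("F", 300, 2000, "Emily"),
        ("M", 350, 2000, "Jacob"),
        ("F", 280, 2001, "Emily"),
        ("M", 280, 2001, "Jacob"),
        ("F", 100, 2001, "Hannah"),
    ],
)

# A row survives only if no other row in the same year has a larger N.
top_per_year = conn.execute("""
SELECT n.Yr, n.Name, n.Sex, n.N
FROM Names AS n
WHERE NOT EXISTS (
    SELECT 1 FROM Names AS n2
    WHERE n2.Yr = n.Yr
      AND n2.N > n.N
)
ORDER BY n.Yr, n.Name
""").fetchall()

for row in top_per_year:
    print(row)
```

Adding AND n2.Sex = n.Sex inside the subquery gives the per-gender variant.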

Nested subquery in Access alias causing "enter parameter value"

I'm using Access (I normally use SQL Server) for a little job, and I'm getting "enter parameter value" for Night.NightId in the statement below that has a subquery within a subquery. I expect it would work if I wasn't nesting it two levels deep, but I can't think of a way around it (query ideas welcome).
The scenario is pretty simple, there's a Night table with a one-to-many relationship to a Score table - each night normally has 10 scores. Each score has a bit field IsDouble which is normally true for two of the scores.
I want to list all of the nights, with a number next to each representing how many of the top 2 scores were marked IsDouble (would be 0, 1 or 2).
Here's the SQL, I've tried lots of combinations of adding aliases to the column and the tables, but I've taken them out for simplicity below:
select Night.*
,
( select sum(IIF(IsDouble,1,0)) from
(SELECT top 2 * from Score where NightId=Night.NightId order by Score desc, IsDouble asc, ID)
) as TopTwoMarkedAsDoubles
from Night
This is a bit of speculation. However, some databases have issues with correlation conditions in multiply nested subqueries. MS Access might have this problem.
If so, you can solve this by using aggregation with a where clause that chooses the top two values:
select s.nightid,
sum(IIF(IsDouble, 1, 0)) as TopTwoMarkedAsDoubles
from Score as s
where s.id in (select top 2 s2.id
from score as s2
where s2.nightid = s.nightid
order by s2.score desc, s2.IsDouble asc, s2.id
)
group by s.nightid;
If this works, it is a simple matter to join Night back in to get the additional columns.
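The single-level version (filtering Score by an IN subquery that picks the top two IDs per night) can be checked on sample data. This sketch is an assumption-laden stand-in: it runs on in-memory SQLite, so LIMIT 2 replaces TOP 2 and a CASE expression replaces IIF, and the scores are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Score ("
    "ID INTEGER PRIMARY KEY, NightId INTEGER, Score INTEGER, IsDouble INTEGER)"
)
# Hypothetical scores as (NightId, Score, IsDouble).
conn.executemany(
    "INSERT INTO Score (NightId, Score, IsDouble) VALUES (?, ?, ?)",
    [
        (1, 90, 1), (1, 80, 0), (1, 70, 1),
        (2, 60, 1), (2, 55, 1), (2, 10, 0),
        (3, 40, 0), (3, 30, 0), (3, 20, 1),
    ],
)

# Keep only each night's two best scores, then count how many are doubles.
rows = conn.execute("""
SELECT s.NightId,
       SUM(CASE WHEN s.IsDouble THEN 1 ELSE 0 END) AS TopTwoMarkedAsDoubles
FROM Score AS s
WHERE s.ID IN (
    SELECT s2.ID
    FROM Score AS s2
    WHERE s2.NightId = s.NightId
    ORDER BY s2.Score DESC, s2.IsDouble ASC, s2.ID
    LIMIT 2
)
GROUP BY s.NightId
ORDER BY s.NightId
""").fetchall()
print(rows)
```

The correlation here is only one level deep, which is the point of restructuring the query this way.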
Your subquery can only see one level above it, so Night.NightId is unknown to the innermost query, which is why you are being prompted to enter a value. You can use a GROUP BY to get the value you want for each NightId and then join that back to the original Night table.
Select *
From Night
left join (
Select N.NightId
, sum(IIF(S.IsDouble,1,0)) as [Number of Doubles]
from Night N
inner join Score S
on S.NightId = N.NightId
group by N.NightId) NightsWithScores
on Night.NightId = NightsWithScores.NightId
Because of the IIF(S.IsDouble,1,0) I don't see the point in using TOP.

ms-access: runtime error 3354

I'm having a problem running a SQL statement in MS Access. I'm using this code:
SELECT readings_miu_id, ReadDate, ReadTime, RSSI, Firmware, Active, OriginCol, ColID, Ownage, SiteID, PremID, prem_group1, prem_group2
INTO analyzedCopy2
FROM analyzedCopy AS A
WHERE ReadTime = (SELECT TOP 1 analyzedCopy.ReadTime FROM analyzedCopy WHERE analyzedCopy.readings_miu_id = A.readings_miu_id AND analyzedCopy.ReadDate = A.ReadDate ORDER BY analyzedCopy.readings_miu_id, analyzedCopy.ReadDate, analyzedCopy.ReadTime)
ORDER BY A.readings_miu_id, A.ReadDate ;
Before this, I'm filling in the analyzedCopy table from other tables given certain criteria. For one set of criteria this code works just fine, but for others it keeps giving me runtime error '3354'. The only difference I can see is that with the criteria that work, the table is around 4145 records long, whereas with the criteria that don't work, the table I'm running this code on is over 9000 records long. Any suggestions?
Is there any way to tell it to pull only half of the information, then run the same SELECT on the other half of the table I'm pulling from and add those results to the results from the first half?
The full text for run-time error '3354' is that it is "At most one record can be returned by this subquery."
I just tried to run this query on the first 4000 records and it failed again with the same error code, so I don't think it can be the number of records.
See this:
http://allenbrowne.com/subquery-02.html#AtMostOneRecord
What is happening is that your subquery finds two records that are identical on the ORDER BY fields, so TOP 1 actually returns two records (yes, that's how Access handles TOP). You need to add fields to the ORDER BY to make it unique, preferably a unique ID (you do have a unique PK, don't you?).
As Andomar below stated DISTINCT TOP 1 will work as well.
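To make the tie behaviour concrete, here is a small pure-Python emulation of it. It does not call Access; access_top is a hypothetical helper that mimics the rule described above (keep the first n rows plus any rows that tie with the nth on the sort fields), and the readings are invented.

```python
def access_top(rows, n, sort_key):
    """Emulate Access TOP n: first n rows by sort_key, plus ties with the nth."""
    ordered = sorted(rows, key=sort_key)
    if len(ordered) <= n:
        return ordered
    cutoff = sort_key(ordered[n - 1])
    return [row for row in ordered if sort_key(row) <= cutoff]


# Hypothetical rows as (pk, readings_miu_id, ReadDate, ReadTime).
readings = [
    (1, 100, "2010-01-05", "08:00"),
    (2, 100, "2010-01-05", "08:00"),  # same device, date, and time as pk 1
    (3, 100, "2010-01-05", "09:30"),
]

# Ordering only by the three fields leaves pk 1 and pk 2 tied,
# so "TOP 1" hands back two rows.
tied = access_top(readings, 1, lambda r: (r[1], r[2], r[3]))

# Appending the primary key to the sort makes every row distinct.
with_pk = access_top(readings, 1, lambda r: (r[1], r[2], r[3], r[0]))

# DISTINCT on the selected column collapses the tied rows to one value.
distinct_times = sorted({r[3] for r in tied})

print(len(tied), [r[0] for r in with_pk], distinct_times)
```

Both fixes get the subquery down to a single value; the unique sort key does so by removing the tie, DISTINCT by merging identical ReadTime values.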
What does MS-ACCESS return when you run the subquery?
SELECT TOP 1 analyzedCopy.ReadTime
FROM analyzedCopy
WHERE analyzedCopy.readings_miu_id = A.readings_miu_id
AND analyzedCopy.ReadDate = A.ReadDate
ORDER BY analyzedCopy.readings_miu_id, analyzedCopy.ReadDate,
analyzedCopy.ReadTime
If it returns multiple rows, maybe it can be fixed with DISTINCT:
SELECT DISTINCT TOP 1 analyzedCopy.ReadTime
FROM ... rest of query ...
I don't know if this would work or not (and I no longer have a copy of Access to test on), so I apologize up front if I'm way off.
First, just do a select on the primary key of analyzedCopy to get the mid-point ID. Something like:
SELECT TOP 4500 readings_miu_id FROM analyzedCopy ORDER BY readings_miu_id, ReadDate;
Then, when you have the mid-point ID, you can add that to the WHERE statement of your original statement:
SELECT ...
INTO ...
FROM ...
WHERE ... AND readings_miu_id <= {ID from above}
ORDER BY ...
Then SELECT the other half:
SELECT ...
INTO ...
FROM ...
WHERE ... AND readings_miu_id > {ID from above}
ORDER BY ...
Again, sorry if I'm way off.