Getting top 10 most popular within an array column using SQL UNNEST

I am working with a sample data set which gives the following result:
Continuing from there, I am now trying to get the top 10 production companies (based on the "production_companies" field) that made the most movies in the most popular genre for a given year.
The expected output:
Rank | Production Company | Popular Genre | Movie Count
I thought breaking this down to getting the most popular genre for the year would be the 1st step with the following query:
select
genres.name AS _genre,
FROM
commons.movies m,
UNNEST(m.genres) as genres
WHERE
SUBSTR(m.release_date, 1, 4) = '2008'
GROUP BY
genres.name
ORDER BY
COUNT(genres.name) DESC
LIMIT
1
I have now got 'Drama' as the most popular genre for the year 2008.
Getting the most popular production companies and their movie counts has been more challenging, and my attempts have failed several times.
After several tries I have got to:
select
o_prd_cmp.name,
o_mov.title
from
commons.movies o_mov,
unnest(o_mov.genres) as o_gnr,
UNNEST(o_mov.production_companies) AS o_prd_cmp
where
SUBSTR(o_mov.release_date, 1, 4) = '2008'
AND o_gnr.name = (
select
genres.name AS _genre,
FROM
commons.movies m,
UNNEST(m.genres) as genres
WHERE
SUBSTR(m.release_date, 1, 4) = '2008'
GROUP BY
genres.name
ORDER BY
COUNT(genres.name) DESC
LIMIT
1
)
Any help with this is greatly appreciated.
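One possible way to finish the query is sketched below. It is not BigQuery code: it runs on SQLite through Python's sqlite3 module, and because SQLite has no UNNEST, the two array columns are modelled as hypothetical child tables (movie_genres, movie_companies) holding one row per array element, which is the shape UNNEST produces. All table contents and studio names are made up for illustration.

```python
import sqlite3

# Hypothetical, flattened stand-in for commons.movies: every array element
# becomes one row in a child table, mirroring what UNNEST yields on the fly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, release_date TEXT);
CREATE TABLE movie_genres (movie_id INTEGER, name TEXT);
CREATE TABLE movie_companies (movie_id INTEGER, name TEXT);
INSERT INTO movies VALUES
  (1, 'A', '2008-01-10'), (2, 'B', '2008-03-05'),
  (3, 'C', '2008-07-21'), (4, 'D', '2007-11-02');
INSERT INTO movie_genres VALUES
  (1, 'Drama'), (1, 'Comedy'), (2, 'Drama'),
  (3, 'Drama'), (3, 'Action'), (4, 'Comedy');
INSERT INTO movie_companies VALUES
  (1, 'Studio X'), (2, 'Studio X'), (2, 'Studio Y'),
  (3, 'Studio Z'), (4, 'Studio Y');
""")

TOP_COMPANIES = """
WITH top_genre AS (              -- step 1: most popular genre of the year
  SELECT g.name
  FROM movies m JOIN movie_genres g ON g.movie_id = m.id
  WHERE substr(m.release_date, 1, 4) = :year
  GROUP BY g.name
  ORDER BY COUNT(*) DESC, g.name
  LIMIT 1
)
SELECT c.name AS production_company,   -- step 2: companies within that genre
       g.name AS popular_genre,
       COUNT(DISTINCT m.id) AS movie_count
FROM movies m
JOIN movie_genres g ON g.movie_id = m.id
JOIN movie_companies c ON c.movie_id = m.id
WHERE substr(m.release_date, 1, 4) = :year
  AND g.name = (SELECT name FROM top_genre)
GROUP BY c.name, g.name
ORDER BY movie_count DESC, c.name
LIMIT 10
"""

top_companies = conn.execute(TOP_COMPANIES, {"year": "2008"}).fetchall()
# The rank column is simply the position in the ordered result.
for rank, row in enumerate(top_companies, start=1):
    print(rank, *row)
```

In BigQuery the two JOINs would presumably be replaced by the comma-UNNEST cross joins already used in the question; the GROUP BY / ORDER BY / LIMIT 10 part is the piece the second attempt was missing.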

Related

count the number of times a combination of values occurs

Dataset looking at the types of crime for a given city.
Incident ID | Incident Code | Incident Category | Incident Subcategory | Incident Description
618691 | 4134 | Assault | Simple Assault | Battery
618691 | 15300 | Offences Against The Family And Children | Other | Hate Crime (secondary only)
618701 | 7053 | Vehicle Impounded | Vehicle Impounded | Vehicle, Impounded
618701 | 65010 | Traffic Violation Arrest | Traffic Violation Arrest | Traffic Violation Arrest
618701 | 65050 | Other Miscellaneous | Other | Driving While Under The Influence Of Alcohol
626010 | 5043 | Burglary | Burglary - Residential | Burglary, Residence, Unlawful Entry
626010 | 6381 | Larceny Theft | Larceny Theft - Other | Embezzlement from Dependent or Elder Adult by Caretaker
626010 | 7041 | Recovered Vehicle | Recovered Vehicle | Vehicle, Recovered, Auto
626010 | 16650 | Drug Offense | Drug Violation | Methamphetamine Offense
Each IncidentID has 2, 3, or 4 Incident Codes associated with it.
I want to be able to count the number of times each combination of 2, 3, or 4 Incident Codes appears in the entire dataset.
For example:
Incident Codes 4134, 15300: x amount of times
Incident Codes 7053, 65010, 65050: x amount of times
Incident Codes 5043, 6381, 7041, 16650: x amount of times
I apologize if I've given a poor explanation - this is my first post on SO and quite frankly I don't know how to best communicate this question.
I don't know what SQL code to run to get my answer. The closest I've come to finding an answer is this post, Select combination of two columns, and count occurrences of this combination, but there the data is already separated into two columns, which mine is not.
My thought is to split the additional codes into other columns, but perhaps there is a way to avoid doing that by having the code run the calculation for me without it.
I appreciate any and all input you may be able to give!

Let's suppose your table is named "TableX". I think this query should be near to what you need:
Select T1.IncidentCode, T2.IncidentCode, T3.IncidentCode, T4.IncidentCode, Count(1) AS AmountOfTimes
From TableX T1
Join TableX T2 ON T2.IncidentID = T1.IncidentID AND
T2.IncidentCode <> T1.IncidentCode
Left Join TableX T3 ON T3.IncidentID = T1.IncidentID AND
T3.IncidentCode <> T1.IncidentCode AND
T3.IncidentCode <> T2.IncidentCode
Left Join TableX T4 ON T4.IncidentID = T1.IncidentID AND
T4.IncidentCode <> T1.IncidentCode AND
T4.IncidentCode <> T2.IncidentCode AND
T4.IncidentCode <> T3.IncidentCode
Group By T1.IncidentCode, T2.IncidentCode, T3.IncidentCode, T4.IncidentCode

You would probably be best off NOT trying to get all three combinations in one query, and here is why. Let's say one officer enters the codes as 1, 2, 3, another enters them as 3, 1, 2, and yet another as 2, 3, 1. They are all the same "set" of codes, just in a different order. If you rely on the codes appearing in the same positions, you would get 3 different rows for the same set, each with a count of 1.
You would be better served by running 3 distinct queries, each with a WHERE and HAVING clause based on just the codes in the "set" you are interested in. Something simple like
select
YT.IncidentID,
count(*) HowMany
from
YourTable YT
where
YT.IncidentCode in ( 4134, 15300 )
group by
YT.IncidentID
having
count(*) = 2
This returns every incident that has BOTH codes, even if the incident is also associated with a 3rd and/or 4th code. The number of rows returned IS your count.
So, take your codes of interest, e.g. 1 & 2. Each incident can carry up to 2 more codes, which adds 30+ possible combinations of 3rd and 4th codes into the mix. If you don't care about those "extra" codes, they do not distort the count for the precise set you are looking for.
Then, to get the counts for your other "what if" scenarios, all you have to do is change the IN clause and the HAVING count to match. Since you filter only on the specific codes in question, you keep only the incidents that have all of them, regardless of any extra incident codes, as in these examples:
YT.IncidentCode in ( 7053, 65010, 65050 )
group by
YT.IncidentID
having
count(*) = 3
YT.IncidentCode in ( 5043, 6381, 7041, 16650 )
group by
YT.IncidentID
having
count(*) = 4
Now, if you only really care about the final count of each respectively, just wrap that up one more to get the count of rows returned such as
select
count(*) NumberOfIncidents
from
( select
YT.IncidentID,
count(*) HowMany
from
YourTable YT
where
YT.IncidentCode in ( 4134, 15300 )
group by
YT.IncidentID
having
count(*) = 2 ) PreQualified
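To check the IN / HAVING approach against the sample rows from the question, here is a small sketch using Python's sqlite3 module. The helper name count_incidents_with and the extra incident 700001 are made up for illustration, and the sketch uses COUNT(DISTINCT IncidentCode) instead of count(*) as a guard in case the same code is ever recorded twice for one incident.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (IncidentID INTEGER, IncidentCode INTEGER)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [
        (618691, 4134), (618691, 15300),
        (618701, 7053), (618701, 65010), (618701, 65050),
        (626010, 5043), (626010, 6381), (626010, 7041), (626010, 16650),
        # Hypothetical extra incident: the pair of interest plus one more code,
        # entered in a different order.
        (700001, 15300), (700001, 4134), (700001, 7053),
    ],
)

def count_incidents_with(codes):
    """Count incidents that contain every code in `codes` (extra codes allowed)."""
    placeholders = ", ".join("?" for _ in codes)
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT IncidentID
            FROM YourTable
            WHERE IncidentCode IN ({placeholders})
            GROUP BY IncidentID
            HAVING COUNT(DISTINCT IncidentCode) = ?
        ) PreQualified
    """
    return conn.execute(sql, [*codes, len(codes)]).fetchone()[0]

pair_count = count_incidents_with([4134, 15300])
triple_count = count_incidents_with([7053, 65010, 65050])
print(pair_count, triple_count)
```

Incident 700001 is counted for the pair even though it carries a third code, while it is not counted for the triple because it has only one of those three codes.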
Then, if you wanted to do this on some time period basis such as you have a given date of the incident, and you wanted to keep running the same query / counts, you could expand and do something like this by doing a UNION to each query.
select
'Assault and Offenses against Family and Children' as Activity,
count(*) NumberOfIncidents
from
( select
YT.IncidentID,
count(*) HowMany
from
YourTable YT
where
YT.IncidentCode in ( 4134, 15300 )
AND WhateverDateFilters...
group by
YT.IncidentID
having
count(*) = 2 ) PreQualified
UNION
select
'Vehicle Impound, Traffic Arrest, Other Misc' as Activity,
count(*) NumberOfIncidents
from
( select
YT.IncidentID,
count(*) HowMany
from
YourTable YT
where
YT.IncidentCode in ( 7053, 65010, 65050 )
AND WhateverDateFilters...
group by
YT.IncidentID
having
count(*) = 3 ) PreQualified
UNION
select
'Burglary, Theft, Drugs and Vehicle Recovery' as Activity,
count(*) NumberOfIncidents
from
( select
YT.IncidentID,
count(*) HowMany
from
YourTable YT
where
YT.IncidentCode in ( 5043, 6381, 7041, 16650 )
AND WhateverDateFilters...
group by
YT.IncidentID
having
count(*) = 4 ) PreQualified
Notice each query in the UNION returns the same number, and order of columns. So it will just return a list (in this case) of 3 rows with a description and count per category regardless of the physical order the incident codes were entered, even IF they were entered in the 3rd and 4th when only looking for 2 code possibilities.
Sometimes a generic query (as in the left-join sample) is ok, and nothing wrong with it, but ask yourself the flexibility and do you want to drill into each permutation just to get your final result numbers.
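If you do want every exact combination counted in one pass, one option is to normalise the order outside SQL. The sketch below (Python with sqlite3, using the TableX name from the left-join answer and one made-up incident, 700002) reads the codes sorted per incident, turns each incident into a sorted tuple, and counts identical tuples. Unlike the IN / HAVING queries, this counts exact sets only, so an incident with an extra code is a different combination.

```python
import sqlite3
from collections import Counter
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableX (IncidentID INTEGER, IncidentCode INTEGER)")
conn.executemany(
    "INSERT INTO TableX VALUES (?, ?)",
    [
        (618691, 4134), (618691, 15300),
        (618701, 7053), (618701, 65010), (618701, 65050),
        # Hypothetical incident: same pair as 618691, entered in reverse order.
        (700002, 15300), (700002, 4134),
    ],
)

# ORDER BY makes each incident's codes arrive sorted, so the tuple built for
# an incident is the same no matter what order the codes were entered in.
rows = conn.execute(
    "SELECT DISTINCT IncidentID, IncidentCode FROM TableX "
    "ORDER BY IncidentID, IncidentCode"
)
combo_counts = Counter(
    tuple(code for _, code in group)
    for _, group in groupby(rows, key=lambda r: r[0])
)

for combo, n in combo_counts.most_common():
    print(combo, n)
```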

find an average of a column using group with inner join and then filtering through the groups

I've been trying to solve an sqlite question where I have two tables: Movies and movie_cast.
Movies has the columns: id, movie_title, and score. Here is a sample of the data:
11|Star Wars|76.496
62|2001:Space Odyssey|39.064
152|Start Trek|26.551
movie_cast has the columns: movie_id, cast_id, cast_name, birthday, popularity. Here is a sample.
11|2|Mark Hamill|9/25/51|15.015
11|3|Harrison Ford|10/21/56|8.905
11|5|Peter Cushing|05/26/13|6.35
In this case movies.id and movie_cast.movie_id are the same.
The question is to Find the top ten cast members who have the highest average movie scores.
Do not include movies with score <25 in the average score calculation.
▪ Exclude cast members who have appeared in two or fewer movies.
My query is as below but it doesn't seem to get me the right answer.
SELECT movie_cast.cast_id,
movie_cast.cast_name,
printf("%.2f",CAST(AVG(movies.score) as float)),
COUNT(movie_cast.cast_name)
FROM movies
INNER JOIN movie_cast ON movies.id = movie_cast.movie_id
WHERE movies.score >= 25
GROUP BY movie_cast.cast_id
HAVING COUNT(movie_cast.cast_name) > 2
ORDER BY AVG(movies.score ) DESC, movie_cast.cast_name ASC
LIMIT 10
The answers I get are in the format cast_id, cast_name, avg score.
An example is: 3 Harrison Ford 52.30
I've analyzed and re-analyzed my logic but to no avail. I'm not sure where I'm going wrong. Any help would be great!
Thank you!

This is how I would write the query:
SELECT mc.cast_id,
mc.cast_name,
PRINTF('%.2f', AVG(m.score)) avg_score
FROM movie_cast mc INNER JOIN movies m
ON m.id = mc.movie_id
WHERE m.score >= 25
GROUP BY mc.cast_id, mc.cast_name
HAVING COUNT(*) > 2
ORDER BY AVG(m.score) DESC, mc.cast_name ASC
LIMIT 10;
I use aliases for the tables to shorten the code and make it more readable.
There is no need to cast the average to a float because the average in SQLite is always a real number.
Both COUNT(movie_cast.cast_name) can be simplified to COUNT(*) but the 1st one in the SELECT list is not needed by your requirement (if it is then add it).
The function PRINTF() returns a string, but if you want a number returned then use ROUND():
ROUND(AVG(m.score), 2) avg_score
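To see how the WHERE and HAVING clauses interact, here is a small sketch run through Python's sqlite3 module. The rows are invented for illustration (cast members A, B and C, and a movie_cast table trimmed to three columns). Movies scoring below 25 are dropped before grouping, so they count neither toward the average nor toward the "more than two movies" test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, movie_title TEXT, score REAL);
CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, cast_name TEXT);
-- Made-up sample data
INSERT INTO movies VALUES
  (1, 'M1', 80), (2, 'M2', 60), (3, 'M3', 20),
  (4, 'M4', 40), (5, 'M5', 90), (6, 'M6', 45);
INSERT INTO movie_cast VALUES
  (1, 1, 'A'), (2, 1, 'A'), (3, 1, 'A'),               -- only 2 movies score >= 25
  (1, 2, 'B'), (2, 2, 'B'), (3, 2, 'B'), (4, 2, 'B'),  -- 3 qualifying movies
  (5, 3, 'C'), (2, 3, 'C'), (6, 3, 'C');               -- 3 qualifying movies
""")

top_cast = conn.execute("""
    SELECT mc.cast_id,
           mc.cast_name,
           ROUND(AVG(m.score), 2) AS avg_score
    FROM movie_cast mc
    INNER JOIN movies m ON m.id = mc.movie_id
    WHERE m.score >= 25            -- filters rows before grouping
    GROUP BY mc.cast_id, mc.cast_name
    HAVING COUNT(*) > 2            -- filters groups after aggregation
    ORDER BY AVG(m.score) DESC, mc.cast_name ASC
    LIMIT 10
""").fetchall()

print(top_cast)
```

Cast member A appears in three movies but is excluded, because only two of them pass the score filter. If the assignment meant "appeared in more than two movies of any score", the count would have to be computed separately from the filtered average.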

SQL - Get consecutively minimum numbers

Title may not make sense so I will provide some context.
I have a table, call it Movies.
A movie tuple has the values: Name, Director, Genre, Year
I'm trying to create a query that allows me to return all Directors who have never released two consecutive Horror films with more than 4 years apart.
I'm not sure where I'd begin but I'm trying to start off by creating a query that given some specific year, returns the next minimum year, so that I can check if the difference between these two is less than 4, and keep doing that for all movies.
My attempt was:
SELECT D1.Director
FROM Movies D1
WHERE D1.Director NOT IN
(SELECT D2.Director FROM Director D2
WHERE D2.Director = D1.Director
AND D2.Genre = 'Horror'
AND D1.Genre = 'Horror' AND D2.Year - D1.Year > 4
OR D1.Year - D2.Year > 4)
which does not work for obvious reasons.
I've also had a few attempts using joins, and it works on films that follow a pattern such as 2000, 2003, 2006, but fail if more than 3 films.

You could try this:
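Select the horror films and use LAG (or LEAD) to fetch the previous (or next) release year for the same director. Then look at the difference between the two years.
WITH TempTable AS (
SELECT
Director,
Year,
-- Year of the same director's previous horror film (NULL for the first one)
LAG(Year) OVER (PARTITION BY Director ORDER BY Year ASC) AS PriorYear
FROM
Movies
WHERE
Genre = 'Horror'
)
SELECT
Director
FROM
TempTable
GROUP BY
Director
HAVING
-- Largest gap between consecutive horror films is at most 4 years.
-- COALESCE keeps directors with a single horror film; directors with
-- no horror films at all are not returned by this query.
COALESCE(MAX(Year - PriorYear), 0) <= 4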
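Here is a runnable sketch of the LAG idea on a few invented rows, using Python's sqlite3 module (window functions need SQLite 3.25 or newer). It assumes the window is partitioned by Director only, so that each film is compared with the same director's previous horror film, and it takes "more than 4 years apart" from the question as the threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (Name TEXT, Director TEXT, Genre TEXT, Year INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?, ?)",
    [
        # Made-up sample data
        ("a1", "A", "Horror", 2000), ("a2", "A", "Horror", 2003),
        ("a3", "A", "Horror", 2006), ("a4", "A", "Comedy", 2015),
        ("b1", "B", "Horror", 2000), ("b2", "B", "Horror", 2002),
        ("b3", "B", "Horror", 2009),
        ("c1", "C", "Horror", 2001),
    ],
)

directors = [
    row[0]
    for row in conn.execute("""
        WITH horror AS (
            SELECT Director,
                   Year,
                   LAG(Year) OVER (PARTITION BY Director ORDER BY Year) AS PriorYear
            FROM Movies
            WHERE Genre = 'Horror'
        )
        SELECT Director
        FROM horror
        GROUP BY Director
        -- MAX ignores the NULL PriorYear of each director's first film;
        -- COALESCE covers directors with only one horror film.
        HAVING COALESCE(MAX(Year - PriorYear), 0) <= 4
        ORDER BY Director
    """)
]

print(directors)
```

Director B drops out because of the 2002 to 2009 gap, and A's 2015 comedy is ignored by the genre filter. Directors with no horror films never enter the CTE, so they would need a separate UNION or LEFT JOIN if they should count as "never".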

Try this:
SELECT * FROM (
SELECT director, min(diff) as diff FROM (
SELECT m1.director, m1.year as year1, m2.year as year2, m2.year-m1.year as diff
FROM `movies` m1, movies m2
WHERE m1.director = m2.director and m1.name <> m2.name and m1.year<=m2.year
and m1.genre='horror' and m2.genre='horror'
) d1 group by director
) d2 WHERE diff>4
The inner SELECT lists every pair of horror movies by the same director along with the year difference. The middle query then takes the minimum difference per director (the closest pair is always a consecutive pair), and the outer query keeps the directors whose minimum difference is more than 4 years.

Finding the decade with largest records, SQL Server

I have the following db diagram :
I want to find the decade (for example 1990 to 2000) that has the most number of movies.
Actually it only deals with "Movies" table.
Any idea on how to do that?

You can use the LEFT function in SQL Server to get the decade from the year. The decade is the first 3 digits of the year. You can group by the decade and then count the number of movies. If you sort, or order, the results by the number of movies - the decade with the largest number of movies will be at the top. For example:
select
count(id) as number_of_movies,
left(cast([year] as varchar(4)), 3) + '0s' as decade
from movies
group by left(cast([year] as varchar(4)), 3)
order by number_of_movies desc

An alternative to the string approach is to use integer division to get the decade:
SELECT [Year]/10*10 as [Decade]
, COUNT(*) as [CountMovies]
FROM Movies
GROUP BY [Year]/10*10
ORDER BY [CountMovies] DESC
This returns all, ordered by the decade(s) with the most movies. You could add a TOP (1) to only get the top, but then you'd need to consider tiebreaker scenarios to ensure you get deterministic results.
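As a quick check of the integer-division trick, here is a sketch on SQLite via Python's sqlite3 module, where dividing two integers also truncates. The years are invented, and LIMIT 1 stands in for SQL Server's TOP (1); the extra ORDER BY key is one possible tiebreaker (earliest decade wins).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (id INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany(
    "INSERT INTO Movies (year) VALUES (?)",
    [(1989,), (1991,), (1995,), (1999,), (2001,), (2005,)],  # made-up years
)

# 1995 / 10 = 199 (integer division), 199 * 10 = 1990
top_decade = conn.execute("""
    SELECT year / 10 * 10 AS decade,
           COUNT(*)       AS count_movies
    FROM Movies
    GROUP BY decade
    ORDER BY count_movies DESC, decade ASC   -- second key breaks ties deterministically
    LIMIT 1
""").fetchone()

print(top_decade)
```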

select substring(cast([year] as varchar), 1, 3) as Decade,
Count(1) [Count]
from Movies
group by substring(cast([year] as varchar), 1, 3)
order by 2 desc

SELECT floor(Year(getdate())/10)*10
, floor(year('5/11/2004')/10)*10
, floor(Year('7/23/1689')/10)*10
, floor(Year('7/09/1989')/10)*10

Although this is an old question, I found this solution by experimenting (these are PostgreSQL functions, not SQL Server):
DATE_PART('decade',(year::date)) AS decade,
DATE_TRUNC('decade',(year::date)) AS decade_truncated,
The same approach also works for centuries:
DATE_PART('century',(year::date)) AS century,
DATE_TRUNC('century',(year::date)) AS century_truncated,

In my case, year is a string column. To get the movies grouped by decade:
SELECT DISTINCT NUMRANGE(
CAST(FLOOR(CAST(year AS INT)/ 10) * 10 AS INT),
CAST((FLOOR(CAST(year AS INT)/ 10) * 10) + 9 AS INT)
) AS "decades", COUNT(*) AS "movie_count"
FROM movies
WHERE year IS NOT NULL AND year != ''
GROUP BY decades
ORDER BY movie_count DESC;
This gives the number of movies in that decade. Hope this one helps someone...

SQL query MAX(SUM(..)) [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 8 years ago.
Table Structure:
Article(
model int(key),
year int(key),
author varchar(key),
num int)
num: number of articles written during the year
Find all the authors who, in at least one year, wrote the maximal number of articles (relative to all the other authors).
I tried:
SELECT author FROM Article,
(SELECT year,max(sumnum) s FROM
(SELECT year,author,SUM(num) sumnum FROM Article GROUP BY year,author)
GROUP BY year) AS B WHERE Article.year=B.year and Article.num=B.s;
Is this the right answer?
Thanks.

You might want to try a self-JOIN to get what you are looking for:
SELECT Main.author
FROM Article AS Main
INNER JOIN (
SELECT year
,author
,SUM(num) AS sumnum
FROM Article
GROUP BY year
,author
) AS SumMain
ON SumMain.year = Main.year
AND SumMain.author = Main.author
GROUP BY Main.author
HAVING SUM(Main.num) = MAX(SumMain.sumnum)
;
This would guarantee (as it is ANSI) you are getting the MAX of the SUMmed nums and only bringing back results for what you need. Keep in mind I only JOINed on those two fields because of the information provided ... if you have a unique ID you can JOIN on, or you require more specificity to get a 1-to-1 match, adjust accordingly.
Depending on what DBMS you are using, it can be simplified one of two ways:
SELECT author
FROM (
SELECT year
,author
,SUM(num) AS sumnum
FROM Article
GROUP BY year
,author
HAVING SUM(num) = MAX(sumnum)
) AS Main
;
Some DBMSes allow you to do multiple aggregate functions, and this could work there.
If your DBMS allows you to do OLAP functions, you can do something like this:
SELECT author
FROM (
SELECT year
,author
,SUM(num) AS sumnum
FROM Article
GROUP BY year
,author
) AS Main
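QUALIFY (
RANK() OVER (
PARTITION BY year
ORDER BY sumnum DESC
) = 1
)
;
This limits the result set to the author(s) with the highest sumnum in each year. Partitioning by year alone matters here: after GROUP BY year, author there is exactly one row per (author, year) pair, so partitioning by both would let every row through. RANK() rather than ROW_NUMBER() keeps authors who tie for the top spot. QUALIFY is not standard SQL; it is available in Teradata, Snowflake and BigQuery, among others.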
Hope this helps!
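For DBMSes without QUALIFY, the same MAX-of-SUM logic can be written with two CTEs. The sketch below runs on SQLite via Python's sqlite3 module with invented rows, and assumes an author can have several rows per year (which is why the SUM is needed before the MAX).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Article (model INTEGER, year INTEGER, author TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO Article VALUES (?, ?, ?, ?)",
    [
        # Made-up sample data: several rows per author and year
        (1, 2013, "A", 10), (2, 2013, "A", 9),   # A totals 19 in 2013
        (3, 2013, "C", 18),
        (1, 2014, "X", 18),
        (1, 2014, "Y", 10), (2, 2014, "Y", 8),   # Y totals 18 in 2014, tying X
        (1, 2014, "A", 16),
    ],
)

top_authors = [
    row[0]
    for row in conn.execute("""
        WITH totals AS (          -- articles per author per year
            SELECT year, author, SUM(num) AS sumnum
            FROM Article
            GROUP BY year, author
        ),
        best AS (                 -- highest yearly total in each year
            SELECT year, MAX(sumnum) AS max_sum
            FROM totals
            GROUP BY year
        )
        SELECT DISTINCT t.author
        FROM totals t
        JOIN best b ON b.year = t.year AND b.max_sum = t.sumnum
        ORDER BY t.author
    """)
]

print(top_authors)
```

DISTINCT keeps an author from being listed twice when they lead in more than one year, and the equality join keeps ties such as X and Y.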

You mention this is for homework and you made a valid attempt, even though it is incorrect.
This is under a premise (unclear since no sample data) that the model column is like an auto-increment, and there is only going to be one entry per author per year and never multiple records for the same author within the same year. Ex:
model year author num
===== ==== ====== ===
1 2013 A 15
2 2013 C 18
3 2013 X 17
4 2014 A 16
5 2014 B 12
6 2014 C 16
7 2014 X 18
8 2014 Y 18
So the result expected is highest article count in 2013 = 18 and would only return author "C". In 2014, highest article count is 18 and would return authors "X" and "Y"
First, get a query of what was the maximum number of articles written...
select
year,
max( num ) as ArticlesPerYear
from
Article
GROUP BY
year
This would give you one record per year, and the maximum number of articles published... so if you had data for years 2010-2014, you would at MOST have 5 records returned. Now, it is as simple as joining this to the original table that had the matching year and articles
select
A2.*
from
( select
year,
max( num ) as ArticlesPerYear
from
Article
GROUP BY
year ) PreQuery
JOIN Article A2
on PreQuery.Year = A2.Year
AND PreQuery.ArticlesPerYear = A2.num
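To confirm the expected result stated above, here is the same pre-query join run against the sample table through Python's sqlite3 module. The ORDER BY is added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Article (model INTEGER, year INTEGER, author TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO Article VALUES (?, ?, ?, ?)",
    [
        (1, 2013, "A", 15), (2, 2013, "C", 18), (3, 2013, "X", 17),
        (4, 2014, "A", 16), (5, 2014, "B", 12), (6, 2014, "C", 16),
        (7, 2014, "X", 18), (8, 2014, "Y", 18),
    ],
)

top_rows = conn.execute("""
    SELECT A2.*
    FROM ( SELECT year, MAX(num) AS ArticlesPerYear
           FROM Article
           GROUP BY year ) PreQuery
    JOIN Article A2
      ON PreQuery.year = A2.year
     AND PreQuery.ArticlesPerYear = A2.num
    ORDER BY A2.year, A2.author
""").fetchall()

for row in top_rows:
    print(row)
```

This relies on the stated premise of one row per author per year; with several rows per author the nums would have to be summed first.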

I suggest a CTE
WITH maxyear AS
(SELECT year, max(num) AS max_articles
FROM article
GROUP BY year)
SELECT DISTINCT author
FROM article a
JOIN maxyear m
ON a.year=m.year AND a.num=m.max_articles;
and compare that in performance to a partition, which is another way
SELECT DISTINCT author FROM
(SELECT author, rank()
OVER (PARTITION BY year ORDER BY num DESC) AS r
FROM article) AS subq
WHERE r = 1;
Standard SQL does not allow a window function such as rank() in HAVING, but some RDBMSs (Teradata and Snowflake, for example) offer a QUALIFY clause for this, so you don't need to nest queries.
