I am pretty new to using SQL (currently Standard SQL via BigQuery) and unfortunately my Google-fu could not find me a solution to this issue.
I'm working with a dataset where each row is a different person and each column is an attribute (name, age, gender, weight, ethnicity, height, bmi, education level, GPA, etc.). I am trying to 'cluster' these people into all of the feature combinations that match 5 or more people.
Originally I did this manually with 3 feature columns: I would essentially concatenate a 'cluster name' column and then write 7 SELECT queries, one per grouping, each filtered to combinations matching 5 or more people, which I then UNIONed together:
gender
age
ethnicity
gender + age
gender + ethnicity
age + ethnicity
gender + age + ethnicity
Unfortunately, doing it this way balloons the number of combinations, and with my anticipated ~15 total features it seems really infeasible. I'd also like a less manual approach so that if a new feature is added in the future it does not require major edits to include it in my cluster identification.
Is there a function or existing process that could accomplish something like this? Ideally I'd like to identify ALL combinations that meet my minimum user count per combination (so it's expected the same rows would match multiple different clusters here). Any advice or help would be appreciated! Thanks.
If only BQ supported grouping sets or cube, this would be simple. One method that is pretty generalizable enumerates the 7 groups and then uses bits to figure out what to aggregate:
select (case when n & 1 > 0 then gender end) as gender,
       (case when n & 2 > 0 then age end) as age,
       (case when n & 4 > 0 then ethnicity end) as ethnicity,
       count(*)
from t cross join
     unnest(generate_array(1, 7)) n
group by n, 1, 2, 3;
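The question's minimum of 5 people per combination can be applied with a HAVING clause on the same query, and the pattern scales: with k feature columns you would use generate_array(1, 2^k - 1) and k CASE expressions (so generate_array(1, 32767) for ~15 features). A sketch, using the same table t and columns as above:
select (case when n & 1 > 0 then gender end) as gender,
       (case when n & 2 > 0 then age end) as age,
       (case when n & 4 > 0 then ethnicity end) as ethnicity,
       count(*) as people
from t cross join
     unnest(generate_array(1, 7)) n
group by n, 1, 2, 3
having count(*) >= 5;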
Another method which is trickier is to reconstruct the groups using rollup(). Something like this:
select gender, age, ethnicity, count(*)
from t
group by rollup(gender, age, ethnicity);
Produces three of the groups you want (gender + age + ethnicity, gender + age, and gender alone), plus an overall total row. So:
select gender, age, ethnicity, count(*)
from t
group by rollup(gender, age, ethnicity)
union all
select gender, null, ethnicity, count(*)
from t
group by gender, ethnicity
union all
select null, age, ethnicity, count(*)
from t
group by rollup (ethnicity, age)
union all
select null, age, null, count(*)
from t
group by age;
The above reconstructs all seven of your groups using rollup(). Note that the grand-total rows (where every column is NULL) appear twice and can be filtered out if you do not want them.
Hi, I'm trying to sum two columns into a named column and then filter on it, but I'm getting an error on the 'Rebounds' name on line 3 of the query (the WHERE clause).
The offreb and defreb columns have an integer type, and some values are stored as 0 (zero).
SELECT team, CONCAT(firstname,' ',lastname) AS name, SUM(offreb + defreb) AS Rebounds
FROM boxscore
WHERE round = 'Finals' AND game = 7 AND Rebounds > 0
ORDER BY team, Rebounds;
You are trying to filter in the WHERE clause on a column that has not yet been calculated when the WHERE clause is evaluated. You can use a sub-query or HAVING instead.
It should be something like this:
SELECT team, CONCAT(firstname,' ',lastname) AS name, SUM(offreb + defreb) AS Rebounds
FROM boxscore
WHERE round = 'Finals' AND game = 7
GROUP BY team, CONCAT(firstname,' ',lastname)
HAVING SUM(offreb + defreb) > 0
ORDER BY team, Rebounds;
Here, using the HAVING clause solves your issue.
If a table has been grouped using GROUP BY, but only certain groups
are of interest, the HAVING clause can be used, much like a WHERE
clause, to eliminate groups from the result.
Official postgres docs
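The sub-query alternative mentioned above would look roughly like this (a sketch, assuming the same boxscore columns):
SELECT team, name, Rebounds
FROM (
    SELECT team, CONCAT(firstname,' ',lastname) AS name, SUM(offreb + defreb) AS Rebounds
    FROM boxscore
    WHERE round = 'Finals' AND game = 7
    GROUP BY team, CONCAT(firstname,' ',lastname)
) AS totals
WHERE Rebounds > 0
ORDER BY team, Rebounds;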
SELECT
team,
CONCAT(firstname,' ',lastname) AS name,
SUM(offreb + defreb) AS "Rebounds"
FROM
boxscore
WHERE
round = 'Finals' AND game = 7
GROUP BY
team,
CONCAT(firstname,' ',lastname)
HAVING
SUM(offreb + defreb) > 0
ORDER BY
team, "Rebounds";
Note that you cannot use a column alias in the WHERE clause (and, in standard SQL, not in the GROUP BY clause either), but it can be used in the ORDER BY clause; wrapping it in double quotes preserves its case.
I have a table in PostgreSQL that contains demographic data for each province of my country.
Columns are: Province_name, professions, Number_of_people.
As you can see, the Province_name values are repeated for each profession.
How can I get each province name to appear only once, with the professions as separate columns?
It sounds like you want to pivot your table (Really: It is better to show data and expected output in your question!)
demo:db<>fiddle
This is the PostgreSQL way (since 9.4) to do that using the FILTER clause
SELECT
province,
SUM(people) FILTER (WHERE profession = 'teacher') AS teacher,
SUM(people) FILTER (WHERE profession = 'banker') AS banker,
SUM(people) FILTER (WHERE profession = 'supervillian') AS supervillian
FROM mytable
GROUP BY province
If you want to go a more common way that works across databases, you can use a CASE expression
SELECT
province,
SUM(CASE WHEN profession = 'teacher' THEN people ELSE 0 END) AS teacher,
SUM(CASE WHEN profession = 'banker' THEN people ELSE 0 END) AS banker,
SUM(CASE WHEN profession = 'supervillian' THEN people ELSE 0 END) AS supervillian
FROM mytable
GROUP BY province
What you want to do is a pivot, which is a little more complicated in PostgreSQL than in some other RDBMSs. You can use the crosstab function. Find an introduction here: https://www.vertabelo.com/blog/technical-articles/creating-pivot-tables-in-postgresql-using-the-crosstab-function
For you it would look something like this:
SELECT *
FROM crosstab( 'select Province_name, professions, Number_of_people from table1 order by 1,2')
AS final_result(Province_name TEXT, data_scientist NUMERIC,data_engineer NUMERIC,data_architect NUMERIC,student NUMERIC);
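Note that crosstab() comes from PostgreSQL's tablefunc extension, which has to be enabled once per database before the query above will run:
CREATE EXTENSION IF NOT EXISTS tablefunc;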
I am having some trouble with a SQL query on a table, let's call it Reports:
I want to group all the reports by the name column.
Then for each of those name groups I want to go to the rating column and count the number of times the rating was 15 or less. Let's say this happened 10 times for one of the groups with the name BOBBO.
I also want to know the total number of times ratings were submitted (the same as the total number of records for each name group). So using the name group BOBBO, let's say he has 20 ratings.
So in that case the group BOBBO has a rating of 15 or less 50% of the time, and that percentage per name is what I want to compute.
I've seen these posts -- I am still having some trouble cracking this.
using-count-and-return-percentage-against-sum-of-records
getting-two-counts-and-then-dividing-them
getting-a-percentage-from-mysql-with-a-group-by-condition-and-precision
divide-two-counts-from-one-select
After reading those I tried queries like these:
ActiveRecord::Base.connection.execute
("SELECT COUNT(*) Matched,
(select COUNT(rating) from reports group by name) Total,
CAST(COUNT(*) AS FLOAT)/CAST((SELECT COUNT(*) FROM reports group by name) AS FLOAT)*100 Percentage from reports
where rating <= 15 order by Percentage")
ActiveRecord::Base.connection.execute
("select name, sum(rating) / count(rating) as bad_rating
from reports group by name having bad_rating <= 15")
Any help would be very much appreciated!
Consider a conditional aggregate for the bad ratings divided by full count:
SELECT [name],
SUM(CASE WHEN [rating] <= 15 THEN 1 ELSE 0 END) / Count(*) AS bad_rating
FROM Reports
GROUP BY [name]
Or, as @CL. points out, a shorter conditional aggregate (where the logical expression is summed directly):
SELECT [name],
SUM([rating] <= 15) / Count(*) AS bad_rating
FROM Reports
GROUP BY [name]
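Depending on the database, the division above is integer division and will truncate to 0; multiplying by 100.0 forces a non-integer result and also gives the percentage described in the question. A sketch against the same reports table:
SELECT name,
       SUM(CASE WHEN rating <= 15 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS bad_rating_pct
FROM reports
GROUP BY name;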
I am a bit confused about when I have to use INTERSECT in SQL. The example that I am given is the following:
I have two tables:
MovieStar(name, address, gender, birthdate)
MovieExec(name, address, cert#, netWorth)
The example asks to find the name and address of all female movie stars who are also movie executives and have a netWorth over 10000000. The solution of the example in the book is the following:
(SELECT name, address
FROM MovieStar
WHERE gender = 'F')
INTERSECT
(SELECT name, address
FROM MovieExec
WHERE netWorth > 10000000);
So my problem is why I have to use INTERSECT when I could use the "AND" operator like this:
SELECT name, address
FROM MovieStar, MovieExec
WHERE gender = 'F' AND netWorth > 10000000
Is there any trick to figure out when it is better to use INTERSECT rather than "AND"?
Use INTERSECT when it suits you and gives correct results. Second, always compare the execution plans and statistics, because the way the result is produced may vary.
SqlFiddleDemo
1)
SELECT name, address
FROM MovieStar
WHERE gender = 'F'
INTERSECT
SELECT name, address
FROM MovieExec
WHERE netWorth > 10000000;
It means: take name and address from MovieStar where gender is 'F',
take name and address from MovieExec where netWorth > 10000000, and find the records which are in both sets.
2)
SELECT ms.name, ms.address
FROM MovieStar AS ms, MovieExec AS me
WHERE gender = 'F' AND netWorth > 10000000
It means that you generate a CROSS JOIN Cartesian product (M x N records) and then keep only the records where gender = 'F' AND netWorth > 10000000. Note that without a join condition on name and address this is not equivalent to the INTERSECT query: every female star is returned as long as any executive at all has netWorth > 10000000.
I guess that the first approach will return the result faster and use less memory (but the query optimizer can do a lot).
When you should use INTERSECT:
you want to get the intersection of both sets and you cannot JOIN them explicitly
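For comparison, a sketch of an explicit join that returns the same rows (note that INTERSECT also removes duplicates and matches NULLs, which a plain join does not):
SELECT DISTINCT ms.name, ms.address
FROM MovieStar AS ms
JOIN MovieExec AS me
  ON me.name = ms.name AND me.address = ms.address
WHERE ms.gender = 'F'
  AND me.netWorth > 10000000;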
I'm trying to add the medal counts together and output the person with the highest total.
The question is: Display the person with the most medals (gold as place = 1, silver as place = 2, bronze as place = 3)
Add all the medals together and display the person with the most medals
Below is the code I have thought about (it obviously doesn't work).
Any ideas?
Select cm.Givenname, cm.Familyname, count(*)
FROM Competitors cm JOIN Results re ON cm.competitornum = re.competitornum
WHERE re.place between '1' and '3'
group by cm.Givenname, cm.Familyname
having max (count(re.place = 1) + count(re.place = 2) + count(re.place = 3))
Sorry, forgot to add that we're not allowed to use ORDER BY.
Some data from the tables:
Competitors Table
Competitornum  GivenName  Familyname  gender  Dateofbirth  Countrycode
219153         Imri       Daniel      Male    1988-02-02   Aus
Results Table
Eventid  Competitornum  Place  Lane  Elapsedtime
SWM111   219153         1      2     20 02
From what you've described it sounds like you just need to take the "Top" individual in the total medal count. In order to do that you would write something like this.
Select top 1 cm.Givenname, cm.Familyname, count(*)
FROM Competitors cm JOIN Results re ON cm.competitornum = re.competitornum
WHERE re.place between '1' and '3'
group by cm.Givenname, cm.Familyname
order by count(*) desc
Without using ORDER BY you have a couple of other options, though I'm glossing over whatever syntax peculiarities SQLFire may have.
You could determine the max medal count of any user and then only select competitors that have that count. You could do this by saving it out to a variable or using a subquery.
Select cm.Givenname, cm.Familyname, count(*)
FROM Competitors cm JOIN Results re ON cm.competitornum = re.competitornum
WHERE re.place between '1' and '3'
group by cm.Givenname, cm.Familyname
having count(*) = (
Select max( count(*) )
FROM Competitors cm JOIN Results re ON cm.competitornum = re.competitornum
WHERE re.place between '1' and '3'
group by cm.Givenname, cm.Familyname
)
Just a note here: this second method is highly inefficient because we recalculate the max medal count for every row in the parent table. If SQLFire supports it, you would be much better served by calculating this ahead of time, storing it in a variable, and using that in the HAVING clause.
You are grouping by re.place, is that what you want? You want the results per ... ? :)
[edit] Good, now that's fixed you're almost there :)
The HAVING is not needed in this case; you simply need to add a COUNT(re.Eventid) to your SELECT, wrap it in a subquery, and take the MAX of that count column, as sketched below.
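A rough sketch of that suggestion (assuming SQLFire accepts standard derived tables; the column names follow the sample data above):
SELECT Givenname, Familyname, medal_count
FROM (
    SELECT cm.Givenname, cm.Familyname, COUNT(re.Eventid) AS medal_count
    FROM Competitors cm JOIN Results re ON cm.Competitornum = re.Competitornum
    WHERE re.Place BETWEEN 1 AND 3
    GROUP BY cm.Givenname, cm.Familyname
) AS totals
WHERE medal_count = (
    SELECT MAX(medal_count)
    FROM (
        SELECT COUNT(re.Eventid) AS medal_count
        FROM Competitors cm JOIN Results re ON cm.Competitornum = re.Competitornum
        WHERE re.Place BETWEEN 1 AND 3
        GROUP BY cm.Givenname, cm.Familyname
    ) AS per_person
);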