Retrieve names by ratio of their occurrence - sql

I'm somewhat new to SQL queries, and I'm struggling with this particular problem.
Let's say I have query that returns the following 3 records (kept to one column for simplicity):
Tom
Jack
Tom
And I want to have those results grouped by the name and also include the fraction (ratio) of the occurrence of that name out of the total records returned.
So, the desired result would be (as two columns):
Tom | 2/3
Jack | 1/3
How would I go about it? Determining the numerator is pretty easy (I can just use COUNT() and GROUP BY name), but I'm having trouble translating that into a ratio out of the total rows returned.

SELECT name, COUNT(name)/(SELECT COUNT(1) FROM names) FROM names GROUP BY name;

Since the denominator is fixed, the "ratio" is directly proportional to the numerator. Unless you really need to show the denominator, it'll be a lot easier to just use something like:
select name, count(*) from your_table_name
group by name
order by count(*) desc
and you'll get the right data in the right order, but the number that's shown will be the count instead of the ratio.
If you really want that denominator, you'd do a count(*) on a non-grouped version of the same select -- but depending on how long the select takes, that could be pretty slow.

Related

SQL aggregation for identical values

Suppose I have data something like this
id
project-id
thing-count
country
1
1
4
GBR
2
1
2
GBR
3
1
8
GBR
4
2
1
USA
5
2
4
USA
6
2
9
USA
I want to group the data using the project-id and keep the country. I know that the country does not vary within a project. There seem to be two ways I can do this:
SELECT
project-id,
MIN(country) AS country
FROM data
GROUP BY project-id
or
SELECT
project-id,
country
FROM data
GROUP BY
project-id,
country
These both work, but neither seem right. The first puts an extra burden on the GROUP BY since there's an unnecessary MIN calculation, while the second suggests to anyone reading the query that I want to GROUP BY the country data.
I'm always surprised that there is no FIRST, LAST, or ANY aggregation function, but as far as I can tell neither SQL Server, MySQL, nor Postgres have that.
How can I write his query so that the GROUP BY does not need to do extra work aggregating a column whose entries are identical, in a way that makes it obvious in the SQL that I do not care which of the values being aggregated is chosen to represent the set of values?
Two errors in my original question: FIRST and LAST. I need to constantly remind myself that SQL rows have no implicit ordering (think sets not lists) and so these aggregate functions would not make sense. But ANY does make sense, and I have expected it to be available time and time again.
In a comment on the question #amir-saleem points out that this does exist in MySQL. Using the ANY_VALUE aggregation function I could write
SELECT project-id, ANY_VALUE(country) as country, FROM data GROUP BY project-id
(N.B. ANY_VALUE looks like a useful aggregation function, but it is included in the Miscellaneous Functions documentation rather than the Aggregate Functions documentation; I do not know why.)
There is no similar aggregate function in Postgres as far as I can tell, though I could use a custom aggregate function to achieve this. Here are some community provided examples that would work in my case:
First/last (aggregate)
Aggregate_Random
(I have not investigated SQL Server based solutions.)
If I'm understanding correctly, you basically want a list of all project IDs and their associated countries?
If so, you can do some by simply adding a distinct clause after select as shown below
Select distinct project-id,country from data

Calculating the mode/median/most frequent observation in categorical variables in SQL impala

I would like to calculate the mode/median or better, most frequent observation of a categorical variable within my query.
E.g, if the variable has the following string values:
dog, dog, dog, cat, cat and I want to get dog since its 3 vs 2.
Is there any function that does that? I tried APPX_MEDIAN() but it only returns the first 10 characters as median and I do not want that.
Also, I would like to get the most frequent observation with respect to date if there is a tie-break.
Thank you!
the most frequent observation is mode and you can calculate it like this.
Single value mode can be calculated like this on a value column. Get the count and pick up row with max count.
select count(*),value from mytable group by value order by 1 desc limit 1
now, in case you have multiple modes, you need to join back to the main table to find all matches.
select orig.value from
(select count(*) c, value v from mytable) orig
join (select count(*) cmode from mytable group by value order by 1 desc limit 1) cmode
ON orig.c= cmode.cmode
This will get all count of values and then match them based on count. Now, if one value of count matches to max count, you will get 1 row, if you have two value counts matches to max count, you will get 2 rows and so on.
Calculation of median is little tricky - and it will give you middle value. And its not most frequent one.

HANA concat rows

I use SAP-HANA database. I have a simple 2 column table whose columns are number, name, noodles, fish . The rows are these:
number name noodles fish
1 tom x
1 tom x
1 jack
2 jack x
I would like to group the rows by the id, and concatenate the names into a field, and thus obtain this:
number name noodles fish
1 tom x x
2 jack x
Can you please tell me how we can perform this operation in sap-hana? Thanks in advance.
Well, you did not really concatenate the names, but instead kept the same ones (if you would have concatenated the names as well, you would get something like jackjack in your result). I guess your x's indicate some sort of ABAP-style flags.
In any case, you would do this with grouping. This is a completely non-HANA thing (you can use the same basic SQL for any DB). You can group against several columns. All other columns that you want to select must be used in an aggregated expression (e.g. a SUM, MAX, COUNT, etc.).
To get the output from your question, I wrote the following code:
SELECT "ID", "NAME", MAX("FISH"), MAX("NOODLES")
FROM #TEST GROUP BY "ID", "NAME";
And got the same output as you. I used the MAX function based on the following assumption: you would want to get X if there is any X in the "concatenated" (aggregated) rows in that column. You get nothing / space if all the "concatenated" rows have space in them.

How can an get count of the unique lengths of a string in database rows?

I am using Oracle and I have a table with 1000 rows. There is a last name field and
I want to know the lengths of the name field but I don't want it for every row. I want a count of the various lengths.
Example:
lastname:
smith
smith
Johnson
Johnson
Jackson
Baggins
There are two smiths length of five. Four others, length of seven. I want my query to return
7
5
If there were 1,000 names I expect to get all kinds of lengths.
I tried,
Select count(*) as total, lastname from myNames group by total
It didn't know what total was. Grouping by lastname just groups on each individual name unless it's a different last name, which is as expected but not what I need.
Can this be done in one SQL query?
SELECT Length(lastname)
FROM MyTable
GROUP BY Length(lastname)
select distinct(LENGTH(lastname)) from mynames;
Select count(*), Length(column_name) from table_name group by Length(column_name);
This will work for the different lengths in a single column.

MySQL: Getting highest score for a user

I have the following table (highscores),
id gameid userid name score date
1 38 2345 A 100 2009-07-23 16:45:01
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
5 38 2345 A 50 2009-07-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
7 32 2345 A 100 2009-07-20 16:45:01
Now in the above structure, a user can play a game multiple times but I want to display the "Games Played" by a specific user. So in games played section I can't display multiple games. So the concept should be like if a user played a game 3 times then the game with highest score should be displayed out of all.
I want result data like:
id gameid userid name score date
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
I tried following query but its not giving me the correct result:
SELECT id,
gameid,
userid,
date,
MAX(score) AS score
FROM highscores
WHERE userid='2345'
GROUP BY gameid
Please tell me what will be the query for this?
Thanks
Requirement is a bit vague/confusing but would something like this satisfy the need ?
(purposely added various aggregates that may be of interest).
SELECT gameid,
MIN(date) AS FirstTime,
MAX(date) AS LastTime,
MAX(score) AS TOPscore.
COUNT(*) AS NbOfTimesPlayed
FROM highscores
WHERE userid='2345'
GROUP BY gameid
-- ORDER BY COUNT(*) DESC -- for ex. to have games played most at top
Edit: New question about adding the id column to the the SELECT list
The short answer is: "No, id cannot be added, not within this particular construct". (Read further to see why) However, if the intent is to have the id of the game with the highest score, the query can be modified, using a sub-query, to achieve that.
As explained by Alex M on this page, all the column names referenced in the SELECT list and which are not used in the context of an aggregate function (MAX, MIN, AVG, COUNT and the like), MUST be included in the ORDER BY clause. The reason for this rule of the SQL language is simply that in gathering the info for the results list, SQL may encounter multiple values for such an column (listed in SELECT but not GROUP BY) and would then not know how to deal with it; rather than doing anything -possibly useful but possibly silly as well- with these extra rows/values, SQL standard dictates a error message, so that the user can modify the query and express explicitly his/her goals.
In our specific case, we could add the id in the SELECT and also add it in the GROUP BY list, but in doing so the grouping upon which the aggregation takes place would be different: the results list would include as many rows as we have id + gameid combinations the aggregate values for each of this row would be based on only the records from the table where the id and the gameid have the corresponding values (assuming id is the PK in table, we'd get a single row per aggregation, making the MAX() and such quite meaningless).
The way to include the id (and possibly other columns) corresponding to the game with the top score, is with a sub-query. The idea is that the subquery selects the game with TOP score (within a given group by), and the main query's SELECTs any column of this rows, even when the fieds wasn't (couldn't be) in the sub-query's group-by construct. BTW, do give credit on this page to rexem for showing this type of query first.
SELECT H.id,
H.gameid,
H.userid,
H.name,
H.score,
H.date
FROM highscores H
JOIN (
SELECT M.gameid, hs.userid, MAX(hs.score) MaxScoreByGameUser
FROM highscores H2
GROUP BY H2.gameid, H2.userid
) AS M
ON M.gameid = H.gameid
AND M.userid = H.userid
AND M.MaxScoreByGameUser = H.score
WHERE H.userid='2345'
A few important remarks about the query above
Duplicates: if there the user played several games that reached the same hi-score, the query will produce that many rows.
GROUP BY of the sub-query may need to change for different uses of the query. If rather than searching for the game's hi-score on a per user basis, we wanted the absolute hi-score, we would need to exclude userid from the GROUP BY (that's why I named the alias of the MAX with a long, explicit name)
The userid = '2345' may be added in the [now absent] WHERE clause of the sub-query, for efficiency purposes (unless MySQL's optimizer is very smart, currently all hi-scores for all game+user combinations get calculated, whereby we only need these for user '2345'); down side duplication; solution; variables.
There are several ways to deal with the issues mentioned above, but these seem to be out of scope for a [now rather lenghty] explanation about the GROUP BY constructs.
Every field you have in your SELECT (when a GROUP BY clause is present) must be either one of the fields in the GROUP BY clause, or else a group function such as MAX, SUM, AVG, etc. In your code, userid is technically violating that but in a pretty harmless fashion (you could make your code technically SQL standard compliant with a GROUP BY gameid, userid); fields id and date are in more serious violation - there will be many ids and dates within one GROUP BY set, and you're not telling how to make a single value out of that set (MySQL picks a more-or-less random ones, stricter SQL engines might more helpfully give you an error).
I know you want the id and date corresponding to the maximum score for a given grouping, but that's not explicit in your code. You'll need a subselect or a self-join to make it explicit!
Use:
SELECT t.id,
t.gameid,
t.userid,
t.name,
t.score,
t.date
FROM HIGHSCORES t
JOIN (SELECT hs.gameid,
hs.userid,
MAX(hs.score) 'max_score'
FROM HIGHSCORES hs
GROUP BY hs.gameid, hs.userid) mhs ON mhs.gameid = t.gameid
AND mhs.userid = t.userid
AND mhs.max_score = t.score
WHERE t.userid = '2345'