Dynamically labeling users by score - sql

I've got a table with users and their score, a decimal number ranging from 1 to 10:
Table(user_id, score)
I also have a 1-row query with the average of the scores (about 6) and the standard deviation (about 0.5). I use MS Access 2007.
I want to label the users A, B, C, D, where:
A has a score higher than (avg+stdev);
B has a score lower than A, but higher than the average;
C has a score lower than the average, but higher than (avg-stdev)
D has a score lower than (avg-stdev).
If I export all data to Excel, I can calculate the values easily, and import them back to the database. Obviously, this isn't the most elegant way. I would like to do this with SQL as a Query. The result should be a table (user_id, label)
But how?

You can use a cross join to join up your users to the 1-row stats query. Then you can use a nested iif to calculate the grade.
Something like this...
SELECT users.*,grade.*
,iif(users.score>grade.high,"A",iif(users.score>grade.average,"B",iif(users.score>grade.low,"C","D"))) as label
FROM (SELECT round(avg(users.score)-stdev(users.score),1) as low
,round(avg(users.score),1) as average
,round(avg(users.score)+stdev(users.score),1) as high
FROM users) AS grade, users;

The IIF did the trick.
I adapted the query with the average scores to add the minimum A, B and C scores:
Table(avg,stdev,Ascore,Bscore,Cscore) as averages
The final query looked like this:
SELECT users.Id, users.avgScore,
IIf(avgScore>averages.Ascore,"A",
IIf(avgScore>averages.Bscore,"B",
IIf(avgScore>averages.Cscore,"C","D"))) AS label
FROM averages, users
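The cross-join-plus-nested-conditional idea above can be tried outside Access. The following is only a minimal sketch, assuming SQLite (through Python's sqlite3 module) instead of MS Access: the nested IIf becomes a CASE expression, and because SQLite has no built-in StDev, the thresholds are computed in Python and stored in a one-row averages table. The function name label_users and the column names low, average and high are illustrative, not taken from the original database.

```python
import sqlite3
import statistics


def label_users(scores):
    """Return {user_id: label} for a {user_id: score} mapping.

    Assumption: SQLite stands in for MS Access, so CASE replaces IIf and
    the sample standard deviation is computed in Python.
    """
    values = list(scores.values())
    avg = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, score REAL)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", scores.items())

    # One-row table holding the thresholds, like the 1-row stats query.
    conn.execute("CREATE TABLE averages (low REAL, average REAL, high REAL)")
    conn.execute("INSERT INTO averages VALUES (?, ?, ?)", (avg - sd, avg, avg + sd))

    # Cross join every user to the single stats row, then pick a label.
    rows = conn.execute(
        """
        SELECT u.user_id,
               CASE WHEN u.score > a.high    THEN 'A'
                    WHEN u.score > a.average THEN 'B'
                    WHEN u.score > a.low     THEN 'C'
                    ELSE 'D'
               END AS label
        FROM users AS u CROSS JOIN averages AS a
        ORDER BY u.user_id
        """
    ).fetchall()
    conn.close()
    return dict(rows)
```

Because the conditions are checked in order, each user falls into the first band whose lower bound the score exceeds, which is the same behaviour as the nested IIf.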

Related

POSTGRESQL: Weighted average instead of average?

I have a list of averages
SELECT tv.id, AVG(ut.rating) FROM user_tvshow AS ut
LEFT JOIN tvshows AS tv ON tv.id = ut.tvshow
WHERE "user" IN (
SELECT follows FROM user_follows WHERE "user" = 1 -- List of users the current user follows
) AND rating IS NOT NULL GROUP BY tv.id;
At the moment it averages the results as expected. Is there any way to weight this average with the number of rows in the group? So that one row of rating 10 won't appear higher than 100 rows of rating 9.
This is not what a weighted average is. It sounds like you are trying to get at a Bayesian average, where you penalize a small set by moving its observed average towards some meta-average. There is no built-in way to do this in PostgreSQL.
Compute the sum and count separately, and then use some mechanism to implement the penalty based on those values. You could do that in the client, or you could write an outer query which takes the results of the subquery and applies the formula.
select id, (the_sum + 10* <metaaverage>)/(the_count+10) from (
SELECT tv.id, sum(ut.rating) as the_sum, count(ut.rating) as the_count FROM user_tvshow AS ut
LEFT JOIN tvshows AS tv ON tv.id = ut.tvshow
WHERE "user" IN (
SELECT follows FROM user_follows WHERE "user" = 1 -- List of users the current user follows
) AND rating IS NOT NULL GROUP BY tv.id
) foobar
How you decide what values to plug in for the 10 and for the <metaaverage> are questions of statistics, not programming.
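To make the formula concrete, here is a small sketch of the same sum-and-count approach. It assumes SQLite via Python's sqlite3 module rather than PostgreSQL, and a simplified ratings(tvshow, rating) table in place of the joins above. The names bayesian_averages, prior_weight and prior_mean are hypothetical; they stand for the 10 and the <metaaverage>, and the defaults are placeholders, not recommendations.

```python
import sqlite3


def bayesian_averages(ratings, prior_weight=10, prior_mean=None):
    """Return {tvshow: shrunk_average} for (tvshow, rating) pairs.

    Assumptions: a simplified ratings table, and the overall mean used as
    the meta-average when prior_mean is not given.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (tvshow INTEGER, rating REAL)")
    conn.executemany("INSERT INTO ratings VALUES (?, ?)", ratings)

    if prior_mean is None:
        prior_mean = conn.execute("SELECT AVG(rating) FROM ratings").fetchone()[0]

    # Inner query: per-show sum and count. Outer query: apply the penalty.
    rows = conn.execute(
        """
        SELECT tvshow,
               (the_sum + :w * :m) / (the_count + :w) AS bayes_avg
        FROM (SELECT tvshow,
                     SUM(rating)   AS the_sum,
                     COUNT(rating) AS the_count
              FROM ratings
              WHERE rating IS NOT NULL
              GROUP BY tvshow)
        ORDER BY tvshow
        """,
        {"w": prior_weight, "m": prior_mean},
    ).fetchall()
    conn.close()
    return dict(rows)
```

With a meta-average of 8 and a weight of 10, a show with a single rating of 10 scores (10 + 80) / 11, about 8.18, while a show with 100 ratings of 9 scores (900 + 80) / 110, about 8.91, so the well-supported show ranks higher.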

SQL: Performing window function and finding percent

I have a table that includes the columns Date, Gender, Age Group and Number of Fans. I need to show the split of page fans across age groups in %.
So far, I have been able to limit the data to the newest data (the most recent entry is 2018-10-06), but I have been unable to write what I assume is needed: a window function that groups the genders (M, F, U) together and then finds the percent per age group. I greatly appreciate any help. Here is as far as I have gotten with success:
SELECT *
FROM fanspergenderage
WHERE fanspergenderage.date >= '2018-10-16'
GROUP BY fanspergenderage.gender, fanspergenderage.agegroup;
I need to show the split of page fans across age groups in %.
I interpret this as the proportion of all fans in each age group. You seem to be asking for something like this:
SELECT f.agegroup,
COUNT(*) as num_fans,
COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () as ratio
FROM fanspergenderage f
WHERE f.date >= '2018-10-16'
GROUP BY f.agegroup;
The * 1.0 is because some databases do integer division.
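A runnable sketch of the same idea follows, assuming SQLite 3.25 or later (through Python's sqlite3 module) and a hypothetical numberoffans column. It sums that column per age group instead of counting rows, on the assumption that each row stores a fan count; if every row represents one fan, COUNT(*) as in the query above is the right aggregate. The grouping is done in a derived table so the window function only has to divide each group's total by the grand total.

```python
import sqlite3


def fan_share_by_age_group(rows, since):
    """Return [(agegroup, fans, ratio)] for rows dated on or after `since`.

    Each input row is (date, gender, agegroup, numberoffans); the last
    column name is an assumption about the real table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fanspergenderage "
        "(date TEXT, gender TEXT, agegroup TEXT, numberoffans INTEGER)"
    )
    conn.executemany("INSERT INTO fanspergenderage VALUES (?, ?, ?, ?)", rows)

    result = conn.execute(
        """
        SELECT agegroup,
               fans,
               fans * 1.0 / SUM(fans) OVER () AS ratio  -- share of grand total
        FROM (SELECT agegroup, SUM(numberoffans) AS fans
              FROM fanspergenderage
              WHERE date >= ?
              GROUP BY agegroup)   -- genders are merged per age group here
        ORDER BY agegroup
        """,
        (since,),
    ).fetchall()
    conn.close()
    return result
```

Multiply ratio by 100 to display it as a percentage.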

recursive geometric query : five closest entities

The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::point datatype with an optimally tessellated index.
User wants the five closest guitar teachers to his location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers, if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of her location.
Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.* from
(
select STUDENTSANDTEACHERSINRADIUS.*,
rowpos = row_number()
over(partition by
STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress
order by DistanceInMiles)
from
(
select
SINGER.name,
SINGER.streetaddress,
SINGER.city,
SINGER.state,
SINGER.zipcode,
TEACHERS.name as TEACHERname,
TEACHERS.streetaddress as TEACHERaddress,
TEACHERS.city as TEACHERcity,
TEACHERS.state as TEACHERstate,
TEACHERS.zipcode as TEACHERzip,
TEACHERS.teacherid,
geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
/ (1.6 * 1000) as DistanceInMiles
from
SINGER left join TEACHERS
on
( TEACHERS.geolocation).STDistance( geography::Point(SINGER.lat, SINGER.lon, 4326))
< (SINGER.radius * (1.6 * 1000 ))
and TEACHERS.primaryInstrument='voice'
) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us
If you just need the closest 5 teachers regardless of radius, you could write something like this. Each student will appear up to 5 times in the result (once per teacher); I'm not sure what output shape you want.
select
S.name,
S.streetaddress,
S.city,
S.state,
S.zipcode,
T.name as TEACHERname,
T.streetaddress as TEACHERaddress,
T.city as TEACHERcity,
T.state as TEACHERstate,
T.zipcode as TEACHERzip,
T.teacherid,
T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
/ (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
select top 5 TT.*
from TEACHERS as TT
where TT.primaryInstrument='voice'
order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T

BigQuery: GROUP BY clause for QUANTILES

Based on the BigQuery query reference, QUANTILES currently does not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate a quantile query per distinct group member, where the group member is a condition in the where clause.
For example I use the below query for every distinct row in column-y if I want to get the desired result.
SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE
<column-y> == <each distinct row in column-y>
Does the BigQuery team plan on having some functionality to allow grouping on quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
While there are efficient algorithms to compute quantiles, they are somewhat memory intensive, so trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
If your per-group size is fixed, you may be able to hack it using a combination of order, nest and nth. For instance, if there are 9 distinct values of f2 per value of f1, for median:
select f1,nth(5,f2) within record from (
select f1,nest(f2) f2 from (
select f1, f2 from table
group by f1,f2
order by f2
) group by f1
);
Not sure if the sorted order in subquery is guaranteed to survive the second group, but it worked in a simple test I tried.
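When the group size is not fixed, a different and fairly common pattern is to number the rows in each group and average the middle one or two. The sketch below is not BigQuery code: it assumes SQLite 3.25 or later through Python's sqlite3 module, and a hypothetical readings(room, data) table modelled on the sensor example above.

```python
import sqlite3


def median_by_group(rows):
    """Return {room: median} for (room, data) pairs.

    Assumption: SQLite window functions stand in for BigQuery here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (room TEXT, data REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

    result = conn.execute(
        """
        SELECT room, AVG(data) AS median
        FROM (SELECT room, data,
                     ROW_NUMBER() OVER (PARTITION BY room ORDER BY data) AS rn,
                     COUNT(*)     OVER (PARTITION BY room)               AS cnt
              FROM readings)
        -- Integer division: odd cnt picks the middle row twice,
        -- even cnt picks the two middle rows.
        WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
        GROUP BY room
        ORDER BY room
        """
    ).fetchall()
    conn.close()
    return dict(result)
```

For an even-sized group this returns the mean of the two middle values, which matches the usual interpolated median.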

Calculating percentages with SQL

I'm trying to create an SQL query to work out the percentage of rows given its number of play counts.
My DB currently has 800 rows of content,
All content has been played a total of 3,000,000 times put together
table:
id, play_count, content
Let's say I'd like to work out the percentage of the first 10 rows.
My attempts have looked similar to this:
SELECT COUNT(*) AS total_content,
SUM(play_count) AS total_played,
content.play_count AS content_plays
FROM bebo_video
How would I put this all together to show a final percentage on each individual row?
SELECT play_count / (SELECT SUM(play_count) FROM bebo_video) * 100 FROM bebo_video
Use ROUND, TRUNCATE, etc. to format the resulting values.
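A quick way to check the query is to run it against a tiny table. This is a minimal sketch assuming SQLite via Python's sqlite3 module, which may not be the engine the question uses. Since SQLite does integer division on integer operands, the sketch multiplies by 100.0 before dividing; the function name play_percentages is made up for the example.

```python
import sqlite3


def play_percentages(videos):
    """Return [(id, percent)] where percent is each row's share of all plays.

    videos: iterable of (id, play_count, content) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bebo_video (id INTEGER PRIMARY KEY, play_count INTEGER, content TEXT)"
    )
    conn.executemany("INSERT INTO bebo_video VALUES (?, ?, ?)", videos)

    # The scalar subquery computes the grand total once; 100.0 forces
    # floating-point division.
    rows = conn.execute(
        """
        SELECT id,
               ROUND(play_count * 100.0 /
                     (SELECT SUM(play_count) FROM bebo_video), 2) AS percent
        FROM bebo_video
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return rows
```

Adding LIMIT 10 to the outer query would restrict the output to the first 10 rows while still dividing by the total over all rows.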