SQL: Performing window function and finding percent - sql

I have a table with the columns Date, Gender, Age Group and Number of Fans. I need to show the split of page fans across age groups in %.
So far, I have been able to limit the data to the newest data (the most recent entry is 2018-10-06), but I have been unable to perform what I assume is needed: a window function to group the genders (M, F, U) together and then find the percent per age group. I greatly appreciate any help. Here is as far as I have gotten with success:
SELECT *
FROM fanspergenderage
WHERE fanspergenderage.date >= '2018-10-16'
GROUP BY fanspergenderage.gender, fanspergenderage.agegroup;

I need to show the split of page fans across age groups in %.
I interpret this as the proportion of all fans in each age group. You seem to be asking for something like this:
SELECT f.agegroup,
COUNT(*) as num_fans,
COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () as ratio
FROM fanspergenderage f
WHERE f.date >= '2018-10-16'
GROUP BY f.agegroup;
The * 1.0 is because some databases do integer division.
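If each row already stores a fan count rather than representing a single fan, the same pattern would presumably sum that column instead of counting rows. Here is a minimal sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later). The column name fans and the sample rows are assumptions for illustration, not taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: `fans` stands in for the question's "Number of Fans" column.
conn.execute(
    "CREATE TABLE fanspergenderage (date TEXT, gender TEXT, agegroup TEXT, fans INTEGER)"
)
conn.executemany(
    "INSERT INTO fanspergenderage VALUES (?, ?, ?, ?)",
    [
        ("2018-10-16", "M", "18-24", 10),
        ("2018-10-16", "F", "18-24", 15),
        ("2018-10-16", "M", "25-34", 40),
        ("2018-10-16", "F", "25-34", 30),
        ("2018-10-16", "U", "25-34", 5),
        ("2018-10-01", "M", "18-24", 99),  # older snapshot, excluded by the WHERE clause
    ],
)

age_split = conn.execute(
    """
    SELECT f.agegroup,
           SUM(f.fans) AS num_fans,
           -- inner SUM: fans per age group; outer windowed SUM: fans over all groups
           100.0 * SUM(f.fans) / SUM(SUM(f.fans)) OVER () AS percent
    FROM fanspergenderage f
    WHERE f.date >= '2018-10-16'
    GROUP BY f.agegroup
    ORDER BY f.agegroup
    """
).fetchall()

for agegroup, num_fans, percent in age_split:
    print(agegroup, num_fans, percent)
```

Grouping by agegroup alone is what folds the M, F and U rows together; gender never needs to appear in the GROUP BY.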

Related

SQL - get an AVG from a COUNT

how you doing?
I'm trying to get the average from a count. The metric is a string, so I get an error.
I've tried several approaches, but none of them work. Thanks for your help.
This is the code
SELECT
user_type, -- works fine
newsletter, -- works fine
COUNT (newsletter) as total, -- works fine
AVG (newsletter) as percentage, -- Error: No matching signature for aggregate function AVG. Supported signatures: AVG(INT64), AVG(NUMERICAL), AVG(FLOAT64)
This is what I've unsuccessfully tried
AVG (newsletter) as percentage
AVG (CAST (newsletter as INT64)) as percentage
COUNT(newsletter) / SUM(newsletter)
I would like to get a table like this
user_type | newsletter | total | percentage
free      | yes        | 4     | x%
premium   | yes        | 7     | x%
To get the ratio of the current row to the whole table...
you already have the value for each individual row
use window functions to get the total for the whole table
then divide the two
(With a "window" of () to represent the whole table)
x * 1.0 / SUM(x) OVER ()
In your case, x is COUNT(newsletter) which gives...
COUNT(newsletter) * 1.0 / SUM(COUNT(newsletter)) OVER ()
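A small sketch of that expression, run through Python's sqlite3 module (SQLite 3.25 or later for window functions). The table name users and its rows are made up for illustration. It also shows a PARTITION BY variant, in case the percentage is meant to be taken within each user_type rather than over the whole table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_type TEXT, newsletter TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("free", "yes"), ("free", "no"), ("premium", "yes"), ("premium", "yes")],
)

shares = conn.execute(
    """
    SELECT user_type,
           newsletter,
           COUNT(newsletter) AS total,
           -- empty window: divide by the count over the whole result
           COUNT(newsletter) * 1.0 / SUM(COUNT(newsletter)) OVER () AS share_of_all,
           -- partitioned window: divide by the count within the same user_type
           COUNT(newsletter) * 1.0
               / SUM(COUNT(newsletter)) OVER (PARTITION BY user_type) AS share_within_type
    FROM users
    GROUP BY user_type, newsletter
    ORDER BY user_type, newsletter
    """
).fetchall()

for row in shares:
    print(row)
```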
If you want the count of rows where newsletter has the value 'yes', you can use a CASE WHEN expression:
SELECT
user_type, -- works fine
newsletter, -- works fine
COUNT (newsletter) as total, -- works fine
sum (case when newsletter = 'yes' then 1 else 0 end)
from yourtable
group by user_type, newsletter
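To turn that conditional count into the percentage column the question asks for, one option is to group by user_type only and divide the 'yes' count by the group size. A minimal sketch with Python's sqlite3 module; the sample rows and the aliases total_yes and percentage are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (user_type TEXT, newsletter TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [
        ("free", "yes"), ("free", "no"), ("free", "no"), ("free", "yes"),
        ("premium", "yes"), ("premium", "yes"), ("premium", "yes"), ("premium", "no"),
    ],
)

subscribed = conn.execute(
    """
    SELECT user_type,
           COUNT(*) AS total,
           SUM(CASE WHEN newsletter = 'yes' THEN 1 ELSE 0 END) AS total_yes,
           -- multiply by 100.0 first so the division is not done on integers
           100.0 * SUM(CASE WHEN newsletter = 'yes' THEN 1 ELSE 0 END) / COUNT(*) AS percentage
    FROM yourtable
    GROUP BY user_type
    ORDER BY user_type
    """
).fetchall()

for row in subscribed:
    print(row)
```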

Grouping a percentage calculation in postgres/redshift

I keep running in to the same problem over and over again, hoping someone can help...
I have a large table with a category column that has 28 entries for donkey breed, then I'm counting two specific values grouped by each of those categories in subqueries like this:
WITH totaldonkeys AS (
SELECT donkeybreed,
COUNT(*) AS total
FROM donkeytable1
GROUP BY donkeybreed
)
,
sickdonkeys AS (
SELECT donkeybreed,
COUNT(*) AS totalsick
FROM donkeytable1
JOIN donkeyhealth on donkeytable1.donkeyid = donkeyhealth.donkeyid
WHERE donkeyhealth.sick IS TRUE
GROUP BY donkeybreed
)
,
It's my goal to end up with a table that has primarily the percentage of sick donkeys for each breed but I always end up struggling like hell with the problem of not being able to group by without using an aggregate function which I cannot do here:
SELECT (CAST(sickdonkeys.totalsick AS float) / totaldonkeys.total) * 100 AS percentsick,
totaldonkeys.donkeybreed
FROM totaldonkeys, sickdonkeys
GROUP BY totaldonkeys.donkeybreed
When I run this I end up with 28 results for each breed of donkey, one correct I believe but obviously hundreds of useless datapoints.
I know I'm probably being really dumb here, but I keep running into this same problem again and again with new donkey data. I should obviously be structuring the whole thing a different way, because you just can't do this final query without an aggregate function. I think I must be missing something significant.
You can easily compute the proportion that are sick by averaging the sick flag from the donkeyhealth table, with no separate CTEs needed:
SELECT d.donkeybreed,
AVG( (dh.sick)::int ) AS proportion_sick
FROM donkeytable1 d JOIN
donkeyhealth dh
ON d.donkeyid = dh.donkeyid
GROUP BY d.donkeybreed
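If you would rather keep the two CTEs from the question, the 28 rows per breed most likely come from `FROM totaldonkeys, sickdonkeys`, which has no join condition and therefore pairs every breed with every other breed. A sketch of the same query with the CTEs joined on donkeybreed, run through Python's sqlite3 module; the sample data is invented, and `sick = 1` stands in for `sick IS TRUE` because SQLite stores booleans as integers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donkeytable1 (donkeyid INTEGER, donkeybreed TEXT)")
conn.execute("CREATE TABLE donkeyhealth (donkeyid INTEGER, sick INTEGER)")
conn.executemany(
    "INSERT INTO donkeytable1 VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (5, "B"), (6, "B")],
)
conn.executemany(
    "INSERT INTO donkeyhealth VALUES (?, ?)",
    [(1, 1), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)],
)

percent_sick = conn.execute(
    """
    WITH totaldonkeys AS (
        SELECT donkeybreed, COUNT(*) AS total
        FROM donkeytable1
        GROUP BY donkeybreed
    ),
    sickdonkeys AS (
        SELECT d.donkeybreed AS donkeybreed, COUNT(*) AS totalsick
        FROM donkeytable1 d
        JOIN donkeyhealth dh ON d.donkeyid = dh.donkeyid
        WHERE dh.sick = 1
        GROUP BY d.donkeybreed
    )
    SELECT t.donkeybreed,
           -- LEFT JOIN + COALESCE keeps breeds that have no sick donkeys at 0%
           100.0 * COALESCE(s.totalsick, 0) / t.total AS percentsick
    FROM totaldonkeys t
    LEFT JOIN sickdonkeys s ON s.donkeybreed = t.donkeybreed
    ORDER BY t.donkeybreed
    """
).fetchall()

for row in percent_sick:
    print(row)
```

Each CTE already holds one row per breed, so once they are matched on donkeybreed the outer query needs neither GROUP BY nor an aggregate.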

SQL Percentage of Occurrences

I'm working on some SQL code as part of my university work. The data is fictitious, just to be clear. I'm trying to count the occurrences of 1 and 0 in the SQL table Fact_Stream; these are stored in the Free_Stream column/attribute as a Boolean/bit value.
As calculations can't be made on bit values (at least in the way I'm trying), I've converted the value to an integer. The table contains information on a streaming company's streams: a 1 indicates the stream was free of charge, a 0 indicates the stream was paid for. My code:
SELECT Fact_Stream.Free_Stream, ((CAST(Free_Stream AS INT)) / COUNT(*) * 100) As 'Percentage of Streams'
FROM Fact_Stream
GROUP BY Free_Stream
The result/output is nearly where I want it to be, but it doesn't display the percentage correctly.
Using MS SQL Management Studio | MS SQL Server 2012 (I believe)
The percentage should be based on all rows, so you need to divide the count per 1/0 by a count of all rows. The easiest way to get this is utilizing a Windowed Aggregate Function:
SELECT Fact_Stream.Free_Stream,
100.0 * COUNT(*) -- count per bit
/ SUM(COUNT(*)) OVER () -- sum of those counts = count of all rows
As "Percentage of Streams"
FROM Fact_Stream
GROUP BY Free_Stream
Both the divisor and the dividend are INTs, so the result is also an INT. Make one of them a decimal (notice that I multiply by 100.0 before dividing). You should also divide the count of rows in each group by the total count of rows in the table:
select Free_Stream,
100.0 * count(*) / (select count(*) from Fact_Stream) as 'Percentage of Streams'
from Fact_Stream
group by Free_Stream
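The order of operations matters here: if the two integer counts are divided before the multiplication by 100.0, the fraction has already been truncated to 0. A quick demonstration with Python's sqlite3 module (SQLite truncates integer division, as SQL Server does); the four sample rows and the column aliases are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Fact_Stream (Free_Stream INTEGER)")
conn.executemany("INSERT INTO Fact_Stream VALUES (?)", [(1,), (0,), (0,), (0,)])

comparison = conn.execute(
    """
    SELECT Free_Stream,
           -- integer / integer happens first and truncates to 0
           COUNT(*) / (SELECT COUNT(*) FROM Fact_Stream) * 100.0 AS truncated,
           -- multiplying by 100.0 first makes the whole expression floating point
           100.0 * COUNT(*) / (SELECT COUNT(*) FROM Fact_Stream) AS pct
    FROM Fact_Stream
    GROUP BY Free_Stream
    ORDER BY Free_Stream
    """
).fetchall()

for row in comparison:
    print(row)
```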
Your equation is dividing the identifier (1 or 0) by the number of streams for each one, instead of dividing the count of free or paid by the total count. One way to do this is to get the total count first, then use it in your query:
declare @totalcount real;
select @totalcount = count(*) from Fact_Stream;
SELECT Fact_Stream.Free_Stream,
(Cast(Count(*) as real) / @totalcount)*100 AS 'Percentage of Streams'
FROM Fact_Stream
group by Fact_Stream.Free_Stream

Min function in postgresql

I am trying to find the division with the lowest population density. To do so, I did the following:
SELECT P.edname, MIN((P.total_area*1000)/P.total2011) AS "Lowest population density"
FROM eds_census2011 P
GROUP BY P.edname
HAVING COUNT (*)> 1
total_area is multiplied by 1000 (so it is in square metres) and divided by the total population.
I want only one record displaying the division (edname) and the calculated population density (MIN((P.total_area*1000)/P.total2011)); instead I get all the records, not even sorted...
The problem is that I have to group it by edname; if I leave out the GROUP BY and HAVING lines I get an error. Any help is greatly appreciated!
Try
SELECT edname, (total_area*1000/total2011) density
FROM eds_census2011
WHERE (total_area*1000/total2011) = (SELECT MIN(total_area*1000/total2011) FROM eds_census2011)
A 'Return only one row' rule could be easily enforced by using LIMIT 1 if it's really necessary
Without subquery:
SELECT p.edname, min((p.total_area * 1000)/p.total2011) AS lowest_pop
FROM eds_census2011 p
GROUP BY p.edname
HAVING COUNT (*) > 1
ORDER BY 2
LIMIT 1;
This one returns only 1 row (if any qualify), even if multiple rows have equally low density.
If you just want the lowest density, period, this can be much simpler:
SELECT edname, (total_area * 1000)/total2011 AS lowest_pop
FROM eds_census2011
ORDER BY 2
LIMIT 1;
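The practical difference between the two approaches shows up when several divisions tie for the minimum: the MIN subquery returns every tied row, while ORDER BY ... LIMIT 1 returns exactly one of them. A small sketch using Python's sqlite3 module, with three invented rows where A and B tie:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eds_census2011 (edname TEXT, total_area REAL, total2011 INTEGER)")
conn.executemany(
    "INSERT INTO eds_census2011 VALUES (?, ?, ?)",
    [("A", 2.0, 1000), ("B", 1.0, 500), ("C", 5.0, 1000)],  # A and B both come to 2.0
)

# Subquery version: keeps every row whose value equals the overall minimum.
all_ties = conn.execute(
    """
    SELECT edname, total_area * 1000 / total2011 AS density
    FROM eds_census2011
    WHERE total_area * 1000 / total2011 =
          (SELECT MIN(total_area * 1000 / total2011) FROM eds_census2011)
    ORDER BY edname
    """
).fetchall()

# ORDER BY / LIMIT version: one row only; which of the tied rows is unspecified.
single_row = conn.execute(
    """
    SELECT edname, total_area * 1000 / total2011 AS density
    FROM eds_census2011
    ORDER BY density
    LIMIT 1
    """
).fetchall()

print(all_ties)
print(single_row)
```

If a deterministic pick among ties matters, adding a second sort key such as edname to the ORDER BY would settle it.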

Distribution of table in time

I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
SELECT @rownum:=@rownum+1 AS rownum, e.*
FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
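On engines with window functions (MySQL 8.0 and later, for example), the same derived-table idea can be written with ROW_NUMBER() instead of a user variable. This is a different technique from the one above, not a drop-in for older MySQL. The sketch below uses Python's sqlite3 module with made-up data and numbers only the requested user's rows in timefield order, so the kept rows are evenly spaced in rank:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (uid INTEGER, timefield TEXT)")
# 20 rows for uid 1, inserted in reverse chronological order, plus one row for uid 2.
rows = [(1, f"2020-01-{day:02d}") for day in range(20, 0, -1)] + [(2, "2020-01-03")]
conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)

step = 5  # keep every 5th row here; the question's 3000-row case would use 150
sampled = [
    r[0]
    for r in conn.execute(
        """
        SELECT timefield
        FROM (
            SELECT timefield,
                   ROW_NUMBER() OVER (ORDER BY timefield) AS rn
            FROM entries
            WHERE uid = ?
        ) AS numbered
        WHERE rn % ? = 0
        ORDER BY timefield
        """,
        (1, step),
    )
]

print(sampled)
```

Filtering on uid inside the derived table, before numbering, keeps other users' rows from consuming row numbers.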
Something like this came to my mind
select @rownum:=@rownum+1 rownum, entries.*
from (select @rownum:=0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at my hand but maybe this will help ...
As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
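Note that DATEADD(day, DATEDIFF(day, 0, timefield), 0) is SQL Server syntax; on MySQL the day bucket would presumably be DATE(timefield). Here is a rough equivalent of the per-user bucket query using Python's sqlite3 module, where SQLite's date() plays the same role; the rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (uid INTEGER, timefield TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [
        (1, "2020-01-01 08:00:00"),
        (1, "2020-01-01 17:30:00"),
        (1, "2020-01-02 09:00:00"),
        (2, "2020-01-01 10:00:00"),
    ],
)

buckets = conn.execute(
    """
    SELECT uid, bucket, COUNT(*) AS measure
    FROM (
        SELECT uid, date(timefield) AS bucket  -- truncate each timestamp to its day
        FROM entries
    ) AS buckets
    GROUP BY uid, bucket
    ORDER BY uid, bucket
    """
).fetchall()

for row in buckets:
    print(row)
```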
@Michal
For whatever reason, your example only works when the where @recnum uses a less than operator. I think when the where filters out a row, the rownum doesn't get incremented, and it can't match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?
AVG
STDDEV_POP
VARIANCE
TO_DAYS
select timefield
from entries
where rand() < .01 -- returns roughly 1% of rows; adjust as needed.
Not a mysql expert so I'm not sure how rand() operates in this environment.
For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/
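For anyone without a Postgres instance handy, here is a plain Python sketch of what percentile_disc computes as I understand it: the first value in sorted order whose position, as a fraction of the row count, reaches the requested fraction. The function and the sample response times are illustrative only, not the Postgres implementation:

```python
def percentile_disc(values, fraction):
    """Return the first sorted value whose relative position reaches `fraction`."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    if not values:
        return None  # mirrors an aggregate over zero rows returning NULL
    ordered = sorted(values)
    n = len(ordered)
    for position, value in enumerate(ordered, start=1):
        # position / n is the share of rows at or before this one
        if position / n >= fraction:
            return value


# Hypothetical response times in milliseconds.
response_times = [120, 80, 95, 300, 110, 90, 105, 100, 85, 2500]
p95 = percentile_disc(response_times, 0.95)
median = percentile_disc(response_times, 0.5)
print(p95, median)
```

Unlike an interpolating percentile, the result is always one of the input values.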