How can I solve my question in SQL without using a JOIN? - sql

Data:
class  type  number
-------------------
2021   1     5
2021   1     4.6
2021   0     4.8
2021   null  null
2022   1     4.2
2022   1     5
2022   0     4.2
2022   null  null
Desired result:
class  rate  score
------------------
2021   0.5   4.8
2022   0.5   4.6
rate = (number of rows with type = 1) / (number of all rows), grouped by class.
score = AVG(number) where type = 1, grouped by class.
I want to do something like this:
SELECT
    a.class, SUM(type) / COUNT(*) AS rate, b.score
FROM
    data AS a
LEFT JOIN
    (SELECT
        class, AVG(number) AS score
    FROM
        data
    WHERE
        type = 1
    GROUP BY
        class) AS b ON a.class = b.class
GROUP BY
    a.class, b.score
Is there any method to do this without JOIN?

First, some issues should be named:
Do not use SQL keywords such as type or number as column or table names.
Do not divide without ruling out a possible division by zero.
Anyway, if your description is correct, you can do the following:
SELECT class,
       ROUND(AVG(CAST(COALESCE(type,0) AS FLOAT)),2) AS rate,
       ROUND(AVG(CASE WHEN type = 1 THEN COALESCE(number,0) END),2) AS score
FROM data
GROUP BY class;
You can see here it's working correctly: db<>fiddle
Some explanations:
AVG builds the average without an explicit, risky division.
COALESCE replaces NULL values with zero. AVG skips NULLs, so without it the rows with a NULL type would not count toward the denominator of the rate.
ROUND makes sure the average is shown as, for example, 0.33 rather than 0.33333...
If this is not sufficient for you, please be more precise about what exactly you want to do.
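To check the single-pass query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory SQLite database and the variable names are assumptions made for this illustration; the answer does not say which database engine the fiddle used.

```python
import sqlite3

# In-memory SQLite database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (class INTEGER, type INTEGER, number REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (2021, 1, 5), (2021, 1, 4.6), (2021, 0, 4.8), (2021, None, None),
        (2022, 1, 4.2), (2022, 1, 5), (2022, 0, 4.2), (2022, None, None),
    ],
)

# Conditional aggregation: one pass over the table, no JOIN.
rows = conn.execute(
    """
    SELECT class,
           ROUND(AVG(CAST(COALESCE(type, 0) AS FLOAT)), 2) AS rate,
           ROUND(AVG(CASE WHEN type = 1 THEN COALESCE(number, 0) END), 2) AS score
    FROM data
    GROUP BY class
    ORDER BY class
    """
).fetchall()

for row in rows:
    print(row)
```

The CASE expression has no ELSE branch, so rows with any other type yield NULL and are skipped by AVG when the score is computed.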

Related

Assign an age to a person based on known population average but no Date of birth

I would like to use PostgreSQL to assign an age category to a list of households, where we don't know the date of birth of any of the family members.
Dataset looks like:
household_id  household_size
x1            5
x2            1
x3            8
...           ...
I then have a set of percentages for each age group with that dataset looking like:
age_group  percentage
0-18       30
19-30      40
31-100     30
I want the query to calculate an assignment that brings the whole dataset as close to the percentages as possible and, if possible, is similar at the household level (not as important). The dataset will end up looking like:
household_id  household_size  0-18  19-30  31-100
x1            5               2     2      1
x2            1               0     1      0
x3            8               3     3      2
...           ...             ...   ...    ...
I have looked at the ntile function but any pointers as to how I could handle this with postgres would be really helpful.
I didn't want to post an answer with just a link so I figured I'll give it a shot and see if I can simplify depesz weighted_random to plain sql. The result is this slower, less readable, worse version of it, but in shorter, plain sql:
CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT language sql as $$
    select choice
    from
      ( select case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
                    then choice end as choice
        from ( select unnest(p_choices) as choice,
                      unnest(p_weights) as weight ) inputs,
             ( select sum(weight)*random() as hit
               from unnest(p_weights) a(weight) ) as random_hit
      ) chances
    where choice is not null
    limit 1
$$;
It's not inlineable because of aggregate and window function calls. It's faster if you assume weights will only be probabilities that sum up to 1.
The principle is that you provide any array of choices and an equal length array of weights (those can be percentages but don't have to, nor do they have to sum up to any specific number):
update test_area t
set ("0-18",
     "19-30",
     "31-100")
  = (with cte AS (
       select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}')
                as age_group
       from generate_series(1,household_size,1))
     select count(*) filter (where age_group='0-18')   as "0-18",
            count(*) filter (where age_group='19-30')  as "19-30",
            count(*) filter (where age_group='31-100') as "31-100"
     from cte)
returning *;
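For readers who want to see the selection principle outside SQL, here is a rough Python sketch of the same idea: scale a random number by the total weight, then return the first choice whose running total reaches it. The name weighted_pick and the optional u parameter (a fixed random draw, handy for testing) are assumptions made for this illustration and are not part of the function above.

```python
import random
from itertools import accumulate


def weighted_pick(choices, weights, u=None):
    """Pick one choice with probability proportional to its weight."""
    if u is None:
        u = random.random()  # uniform draw in [0, 1)
    hit = u * sum(weights)
    # Walk the running total and return the first choice that reaches the hit.
    for choice, running_total in zip(choices, accumulate(weights)):
        if running_total >= hit:
            return choice
    return choices[-1]  # guard against floating-point round-off


groups = ["0-18", "19-30", "31-100"]
weights = [30, 40, 30]

# Draw one age group per member of a household of size 5.
household = [weighted_pick(groups, weights) for _ in range(5)]
print(household)
```

In Python itself, random.choices(groups, weights=weights, k=5) from the standard library does the same job; the explicit loop is only there to mirror the running-sum logic of the SQL function.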
Online demo showing that both his version and mine are statistically reliable.
A minimum start could be:
SELECT
    household_id,
    MIN(household_size) as size,
    ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
    ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
    ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
    SELECT
        h.household_id,
        h.household_size,
        p.agegroup_from,
        p.percentage/100.0 * h.household_size as g
    FROM households h
    CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;
Output:
household_id  size  g1   g2   g3
x1            5     1.5  2.0  1.5
x2            1     0.3  0.4  0.3
x3            8     2.4  3.2  2.4
see: DBFIDDLE
Notes:
Of course you should round the g columns to whole numbers, taking into account the complete split (g1+g2+g3 = total).
Because g1, g2 and g3 are based on percentages, their values can change as long as the total split is OK (for more info, see: Return all possible combinations of values on columns in SQL).
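One possible way to do that rounding is the largest-remainder method, which is not something the answer above prescribes; it is swapped in here purely as an illustration. The sketch below assumes integer weights, and the function name largest_remainder is hypothetical. It only balances each household on its own and does not try to optimise the dataset-wide totals.

```python
def largest_remainder(size, weights):
    """Split `size` into whole numbers proportional to integer `weights`."""
    total = sum(weights)
    # Whole part of each quota, using integer arithmetic to avoid float error.
    counts = [size * w // total for w in weights]
    remainders = [size * w % total for w in weights]
    leftover = size - sum(counts)
    # Hand the leftover units to the groups with the largest remainders;
    # ties go to the earlier group.
    by_remainder = sorted(range(len(weights)), key=lambda i: (-remainders[i], i))
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


percentages = [30, 40, 30]  # 0-18, 19-30, 31-100
for household_id, household_size in [("x1", 5), ("x2", 1), ("x3", 8)]:
    print(household_id, largest_remainder(household_size, percentages))
```

Because ties always go to the earlier age group, the totals over many households can drift away from the target percentages, so a dataset-wide correction step would still be needed for the primary goal stated in the question.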

How do you write a rolling "Case When" statement in PostgreSQL?

Let's say I have a table called "means" that looks like this:
year mean
1990 1.5
1991 1.0
1992 1.3
1993 1.0
And I have a second table called "values" that looks like this:
year tag value
1990 A 0.25
1991 B 1.10
1992 C 2.32
1993 A 0.70
I want to create another column where if the value for a given year is greater than the mean for a given year, the value of that column should be "Greater". If it's less than the mean for a given year, it should be "Less" and if it's equal to the mean, it should be "Equal".
Essentially, I want to create a series of Case When statements that are indexed to the year given in the table.
How would I go about doing that?
That's a join with conditional logic (the query refers to the second table as vals):
select
    v.*,
    case
        when v.value > m.mean then 'Greater'
        when v.value < m.mean then 'Less'
        else 'Equal'
    end comp
from vals v
inner join means m on m.year = v.year
Note: this is called a case expression, not a case statement (the former is a conditional expression, while the latter is a control flow structure).
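As a quick check of the join against the sample rows from the question, here is a small sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for PostgreSQL here, which is an assumption of this illustration; the CASE expression itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE means (year INTEGER, mean REAL);
    CREATE TABLE vals (year INTEGER, tag TEXT, value REAL);
    INSERT INTO means VALUES (1990, 1.5), (1991, 1.0), (1992, 1.3), (1993, 1.0);
    INSERT INTO vals VALUES (1990, 'A', 0.25), (1991, 'B', 1.10),
                            (1992, 'C', 2.32), (1993, 'A', 0.70);
    """
)

# Each value is compared with the mean of its own year via the join.
rows = conn.execute(
    """
    SELECT v.year, v.tag, v.value,
           CASE
               WHEN v.value > m.mean THEN 'Greater'
               WHEN v.value < m.mean THEN 'Less'
               ELSE 'Equal'
           END AS comp
    FROM vals v
    INNER JOIN means m ON m.year = v.year
    ORDER BY v.year
    """
).fetchall()

for row in rows:
    print(row)
```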

How to group years in decades in sqlite3 in jupyter notebook?

I'm supposed to find the decade D with the largest number of films and the total number of films in D. A decade is a sequence of 10 consecutive years. For example, say in your database you have movie information starting from 1965. Then the first decade is 1965, 1966, ..., 1974; the second one is 1967, 1968, ..., 1976 and so on.
I'm supposed to implement this in a Jupyter notebook, where I imported sqlite3.
I wrote the following code for it.
Select count(*) as total_films,concat(decade,'-',decade+9)
FROM (Select floor(YEAR('year')/10)*10 as decade FROM movie) t
GROUP BY decade
Order BY total_films desc;
However, the notebook threw errors like "no such function: floor", "no such function: Year" and "no such function: concat".
Therefore, after going through sqlite documentation I changed code to
Select count(*) as total_films,decade||'-'||decade+9
FROM (Select cast(strftime('%Y',year)/10 as int)*10 as decade FROM movie) t
GROUP BY decade
Order BY total_films desc;
However, I got an incorrect output :
count(*) decade||'-'||decade+9
0 117 NaN
1 3358 -461.0
Would appreciate insights on why this is happening.
Updating the question after going through the comments by c.Perkins:
1) I began by checking the type of the year column
using the query PRAGMA table_info(movie)
Got the following result
cid name type notnull dflt_value pk
0 0 index INTEGER 0 None 0
1 1 MID TEXT 0 None 0
2 2 title TEXT 0 None 0
3 3 year TEXT 0 None 0
4 4 rating REAL 0 None 0
5 5 num_votes INTEGER 0 None 0
Since the year column is of type TEXT, I converted it to an integer with CAST and checked for NULLs or NaN using SELECT CAST(year as int) as yr FROM MOVIE WHERE yr is null
I didn't get any results, so there appear to be no NULLs. However, the query SELECT CAST(year as int) as yr FROM MOVIE order by yr asc shows a lot of zeros in the year column:
yr
0 0
1 0
2 0
3 0
4 0
-
-
-
-
3445 2018
3446 2018
3447 2018
3448 2018
3449 2018
3450 2018
From the above we see that the year is stored as a plain year rather than as a timestamp, so strftime('%Y', year) did not yield a usable result, as mentioned in the comment.
Keeping all of the above in mind, I changed the inner query to
SELECT (CAST( (year/10) as int) *10) as decade FROM MOVIE WHERE decade!=0 order by decade asc
Output for the above query :
decade
0 1930
1 1930
2 1930
3 1930
4 1930
5 1930
6 1940
7 1940
8 1940
-
-
-
3353 2010
3354 2010
3355 2010
3356 2010
3357 2010
Finally, placing this inner query, in the first query I wrote above
Select count(*) as total_films,decade||'-'||decade+9 as period
FROM (SELECT (CAST( (year/10) as int) *10) as decade FROM MOVIE WHERE decade!=0 order by decade asc)
GROUP BY decade
Output :
total_films period
0 6 1939
1 12 1949
2 71 1959
3 145 1969
4 254 1979
5 342 1989
6 551 1999
7 959 2009
8 1018 2019
As far as I can see, the only issue is with the period column: instead of showing 1930-1939 it shows only 1939, and so on. If using || is not right, is there any other function that could be used? concat is not working.
Thanks in advance.
Pending updates to the question as requested in comments, here are a few immediate points that might help solve the problem without having all details:
Does the movie.year column contain null values? Or non-numeric or non-date values? The NaN (Not A Number) result likely indicates null or invalid data in the source. (Technically there is no NaN value in SQLite, so I'm assuming that the question data is copied from some other data grid or processed output.)
What type of data is in the column movie.year? Does it contain full ISO-8601 date strings or a Julian-date numeric value? Or does it only contain the year (as the column name implies)? If it only contains the year (as a string or integer), then the function call like strftime('%Y', year) will NOT return what you expect and is unnecessary. Just refer to the column directly.
I suspect this is where the -461.0 is coming from.
The operator / is an "integer division" operator if both operands are integers. A valid isolated year value will be an integer and the literal 10 is of course an integer, so integer division will automatically drop any decimal part and returns only the integer part of the division without having to explicitly cast to integer.
According to the SQLite docs, the concatenation operator || has higher precedence than +. That means in the expression decade||'-'||decade+9, the concatenation is applied first, so that one possible intermediate would be '1930-1930'+9. (Technically I would consider this result undefined since the string does not hold a plain numeric value. In practice, on my system, the string is apparently interpreted as 1930 and the overall result is the integer value 1939. Either way you will get unexpected bogus results rather than the desired string.)
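The precedence point can be checked directly from Python's sqlite3 module, along with one possible fix: putting parentheses around decade + 9. The sketch below is only an illustration under stated assumptions. The tiny movie table and its sample years are made up, and the query keeps the fixed calendar decades used in the question's own attempt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# || binds tighter than +, so the concatenated string is what gets 9 added.
unparenthesized = conn.execute("SELECT 1930 || '-' || 1930 + 9").fetchone()[0]
parenthesized = conn.execute("SELECT 1930 || '-' || (1930 + 9)").fetchone()[0]
print(unparenthesized, parenthesized)

# Hypothetical sample data: year stored as TEXT, one non-numeric entry.
conn.execute("CREATE TABLE movie (title TEXT, year TEXT)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [("a", "1931"), ("b", "1935"), ("c", "1942"), ("d", "2011"),
     ("e", "2015"), ("f", "2018"), ("g", "unknown")],
)

rows = conn.execute(
    """
    SELECT COUNT(*) AS total_films,
           decade || '-' || (decade + 9) AS period
    FROM (SELECT CAST(year AS INTEGER) / 10 * 10 AS decade FROM movie)
    WHERE decade != 0          -- non-numeric years cast to 0 and are dropped
    GROUP BY decade
    ORDER BY total_films DESC
    """
).fetchall()

for row in rows:
    print(row)
```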

Total Sum SQL Server

I have a query that collects many different columns, and I want to include a column that sums the price of every component in an order. Right now, I already have a column that simply shows the price of every component of an order, but I am not sure how to create this new column.
I would think that the code would go something like this, but I am not really clear on what an aggregate function is or why I get an error regarding the aggregate function when I try to run this code.
SELECT ID, Location, Price, (SUM(PriceDescription) FROM table GROUP BY ID WHERE PriceDescription LIKE 'Cost.%' AS Summary)
FROM table
When I say each component, I mean that every ID I have has many different items that make up the general price. I only want to find out how much money I spend on the supplies I need for my pressure washers, which is why I wrote WHERE PriceDescription LIKE 'Cost.%'.
To further explain, I have receipts for every customer I've worked with, and in these receipts I write down my cost for the soap that I use and the tools for the pressure washer that I rent. I label all of these with 'Cost.' so it looks like (Cost.Water), (Cost.Soap), (Cost.Gas), (Cost.Tools). I would like a column that sums all the Cost._ prices for Order 1, and likewise sums all the Cost._ prices for Order 2. I should also mention that each order does not have the same number of costs (sometimes when I use my power washer I might not have to buy gas, and occasionally soap).
I hope this makes sense, if not please let me know how I can explain further.
ID  Location  Price  PriceDescription
1   Park      10     Cost.Water
1   Park      8      Cost.Gas
1   Park      11     Cost.Soap
2   Tom       20     Cost.Water
2   Tom       6      Cost.Soap
3   Matt      15     Cost.Tools
3   Matt      15     Cost.Gas
3   Matt      21     Cost.Tools
4   College   32     Cost.Gas
4   College   22     Cost.Water
4   College   11     Cost.Tools
I would like for my query to create a column like such
ID  Location  Price  Summary
1   Park      10     29
1   Park      8
1   Park      11
2   Tom       20     26
2   Tom       6
3   Matt      15     51
3   Matt      15
3   Matt      21
4   College   32     65
4   College   22
4   College   11
But if the 'Summary' was printed on every line instead of just at the top one, that would be okay too.
You just need SUM(Price) OVER (PARTITION BY Location), which gives the total sum as below:
SELECT ID, Location, Price, SUM(Price) over(Partition by Location) AS Summed_Price
FROM yourtable
WHERE PriceDescription LIKE 'Cost.%'
First, if your Price column really contains values that match 'Cost.%', then you can not apply SUM() over it. SUM() expects a number (e.g. INT, FLOAT, REAL or DECIMAL). If it is text then you need to explicitly convert it to a number by adding a CAST or CONVERT clause inside the SUM() call.
Second, your query syntax is wrong: you need GROUP BY, and the SELECT fields are not specified correctly. And you want to SUM() the Price field, not the PriceDescription field (which you can't even sum as I explained)
Assuming that Price is numeric (see my first remark), then this is how it can be done:
SELECT ID
     , Location
     , Price
     , (SELECT SUM(Price)
        FROM table
        WHERE ID = T1.ID AND Location = T1.Location
       ) AS Summed_Price
FROM table AS T1
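To try the correlated-subquery approach on the sample data, here is a minimal sketch using Python's sqlite3 module. Several things are assumptions of this illustration: SQLite stands in for SQL Server, the placeholder table is given the name orders, the subquery is correlated on ID only, the 'Cost.%' filter from the question is added in both places, and the extra 'Fee.Service' row is hypothetical, included only to show that non-cost rows are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (ID INTEGER, Location TEXT, Price INTEGER, PriceDescription TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "Park", 10, "Cost.Water"), (1, "Park", 8, "Cost.Gas"),
        (1, "Park", 11, "Cost.Soap"), (2, "Tom", 20, "Cost.Water"),
        (2, "Tom", 6, "Cost.Soap"), (3, "Matt", 15, "Cost.Tools"),
        (3, "Matt", 15, "Cost.Gas"), (3, "Matt", 21, "Cost.Tools"),
        (4, "College", 32, "Cost.Gas"), (4, "College", 22, "Cost.Water"),
        (4, "College", 11, "Cost.Tools"),
        (1, "Park", 40, "Fee.Service"),  # hypothetical non-cost row
    ],
)

# For every cost row, the subquery sums the cost rows sharing the same ID.
rows = conn.execute(
    """
    SELECT T1.ID, T1.Location, T1.Price,
           (SELECT SUM(T2.Price)
            FROM orders AS T2
            WHERE T2.ID = T1.ID
              AND T2.PriceDescription LIKE 'Cost.%') AS Summed_Price
    FROM orders AS T1
    WHERE T1.PriceDescription LIKE 'Cost.%'
    ORDER BY T1.ID
    """
).fetchall()

for row in rows:
    print(row)
```

The total is repeated on every row of an order, which the question says is acceptable.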
To get the exact result posted in the question:
Select
    T.ID,
    T.Location,
    T.Price,
    CASE WHEN (R) = 1 then RN ELSE NULL END Summary
from (
    select
        ID,
        Location,
        Price,
        SUM(Price) OVER (PARTITION BY Location) RN,
        ROW_NUMBER() OVER (PARTITION BY Location ORDER BY ID) R
    from Table
) T
order by T.ID

Restricting a SQL query so that any particular value in a certain column can only appear 3 times in the results, with respect to a given ordering

Suppose that I have a table in a SQL database with columns like the ones shown below. The table records various performance metrics of the employees in my company each month.
I can easily query the table so that I can see the best monthly sales figures that my employees have ever obtained, along with which employee was responsible and which month the figure was obtained in:
SELECT * FROM EmployeePerformance ORDER BY Sales DESC;
NAME MONTH SALES COMMENDATIONS ABSENCES
Karen Jul 16 36,319.13 2 0
David Feb 16 35,398.03 2 1
Martin Nov 16 33,774.38 1 1
Sandra Nov 15 33,012.55 4 0
Sandra Mar 16 31,404.45 1 0
Karen Sep 16 30,645.78 2 2
David Feb 16 29,584.81 1 1
Karen Jun 16 29,030.00 3 0
Stuart Mar 16 28,877.34 0 1
Karen Nov 15 28,214.42 1 2
Martin May 16 28,091.99 3 0
This query is very simple, but it's not quite what I want. How would I need to change it if I wanted to see only the top 3 monthly figures achieved by each employee in the result set?
To put it another way, I want to write a query that is the same as the one above, but if any employee would appear in the result set more than 3 times, then only their top 3 results should be included, and any further results of theirs should be ignored. In my sample query, Karen's figure from Nov 15 would no longer be included, because she already has three other figures higher than that according to the ordering "ORDER BY Sales DESC".
The specific SQL database I am using is either SQLite or, if what I need is not possible with SQLite, then MySQL.
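In MySQL 8.0 or later you can use a window function. It has to be computed in a subquery, because window functions are not allowed in a WHERE clause, and it needs PARTITION BY Name so that rows are numbered per employee:
SELECT *
FROM (SELECT e.*,
             ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Sales DESC) AS rn
      FROM EmployeePerformance e) ranked
WHERE rn <= 3
ORDER BY Sales DESC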
In SQLite versions before 3.25 window functions aren't available, but you can still count the preceding rows:
SELECT *
FROM EmployeePerformance e
WHERE
(SELECT COUNT(*)
FROM EmployeePerformance ee
WHERE ee.Name=e.Name and ee.Sales>e.Sales)<3
ORDER BY e.Sales DESC
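The counting query can be tried on the sample rows from the question with Python's sqlite3 module. This is only a sketch: the column types and the Python variable names are assumptions, and the sales figures are entered as plain numbers without thousands separators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EmployeePerformance "
    "(Name TEXT, Month TEXT, Sales REAL, Commendations INTEGER, Absences INTEGER)"
)
conn.executemany(
    "INSERT INTO EmployeePerformance VALUES (?, ?, ?, ?, ?)",
    [
        ("Karen", "Jul 16", 36319.13, 2, 0), ("David", "Feb 16", 35398.03, 2, 1),
        ("Martin", "Nov 16", 33774.38, 1, 1), ("Sandra", "Nov 15", 33012.55, 4, 0),
        ("Sandra", "Mar 16", 31404.45, 1, 0), ("Karen", "Sep 16", 30645.78, 2, 2),
        ("David", "Feb 16", 29584.81, 1, 1), ("Karen", "Jun 16", 29030.00, 3, 0),
        ("Stuart", "Mar 16", 28877.34, 0, 1), ("Karen", "Nov 15", 28214.42, 1, 2),
        ("Martin", "May 16", 28091.99, 3, 0),
    ],
)

# Keep a row only if fewer than 3 rows of the same employee have higher sales.
top3 = conn.execute(
    """
    SELECT *
    FROM EmployeePerformance e
    WHERE (SELECT COUNT(*)
           FROM EmployeePerformance ee
           WHERE ee.Name = e.Name AND ee.Sales > e.Sales) < 3
    ORDER BY e.Sales DESC
    """
).fetchall()

for row in top3:
    print(row)
```

Note that if an employee has tied sales figures at the cut-off, this query can return more than three rows for that employee.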
I have managed to find an answer myself. It seems to work by pairing each record up with all of the records from the same person that were equal or greater, and then choosing only the (left) records that had no more than 3 greater-or-equal pairings.
SELECT P.Name, P.Month, P.Sales, P.Commendations, P.Absences
FROM Performance P
LEFT JOIN Performance P2 ON (P.Name = P2.Name AND P.Sales <= P2.Sales)
GROUP BY P.Name, P.Month, P.Sales, P.Commendations, P.Absences
HAVING COUNT(*) <= 3
ORDER BY P.Sales DESC;
I will give the credit to a_horse_with_no_name for adding the tag "greatest-n-per-group", as I would have had no idea what to search for otherwise, and by looking through other questions with this tag I managed to find what I wanted.
I found this question that was similar to mine... Using LIMIT within GROUP BY to get N results per group?
And I followed this link that somebody had included in a comment... https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/
...and the answer I wanted was in the first comment on that article. It's perfect as it uses only a LEFT JOIN, so it will work in SQLite.
Here is my SQL Fiddle: http://sqlfiddle.com/#!7/580f0/5/0