How to find the average by day in SQL?

I am super new to SQL and am trying to figure out how to find the average by day: year to date, what were they averaging per day? The table below is an example of the table I am working with:
Study Date | ID  | Subject
-----------+-----+--------
01/01/2018 | 123 | Math
01/01/2018 | 456 | Science
01/02/2018 | 789 | Science
01/02/2018 | 012 | History
01/03/2018 | 345 | Science
01/03/2018 | 678 | History
01/03/2018 | 921 | Art
01/03/2018 | 223 | Science
01/04/2018 | 256 | English
For instance, if I filter on just 'Science' as the Subject, the output I am looking for is: out of the 4 Science results, what are the daily average, min, and max YTD? So my max in a day would be 2 Science subjects, my min would be 1, and so on.
How can I write a query to output this information?
So far I know that to get the YTD total it would be:
select SUBJECT, count(ID)
from table
where SUBJECT = 'science' and year(Study_date) = 2022
group by SUBJECT
What would be the next step I have to take?

If you want the min/max of the daily subject count, then you need two levels of aggregation:
select subject, sum(cnt_daily) as cnt,
min(cnt_daily) as min_cnt_daily, max(cnt_daily) as max_cnt_daily
from (
select study_date, subject, count(*) as cnt_daily
from mytable
where study_date >= '2022-01-01'
group by study_date, subject
) t
group by subject
The subquery aggregates by day and by subject, and computes the number of occurrences in each group. Then the outer query groups by subject only, and computes the total count (that's the sum of the intermediate counts), along with the min/max daily values.
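Since the question also asks for the daily average, the same outer query can be extended with an AVG; a minimal sketch on top of the query above (the * 1.0 guards against integer division on engines that truncate):
select subject, sum(cnt_daily) as cnt,
       min(cnt_daily) as min_cnt_daily, max(cnt_daily) as max_cnt_daily,
       avg(cnt_daily * 1.0) as avg_cnt_daily  -- * 1.0 forces decimal division
from (
    select study_date, subject, count(*) as cnt_daily
    from mytable
    where study_date >= '2022-01-01'
    group by study_date, subject
) t
group by subject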

Alternatively, compute each day's count with a window function in a derived table, then aggregate once per subject:
select Subject
      ,count(*) as total_count
      ,min(cnt) as min_daily_count
      ,max(cnt) as max_daily_count
      ,avg(cnt * 1.0) as avg_daily_count  -- * 1.0 avoids integer division
from
(
    select *
          ,count(*) over(partition by Study_Date, Subject) as cnt
    from t
) t
group by Subject
Subject | total_count | min_daily_count | max_daily_count | avg_daily_count
--------+-------------+-----------------+-----------------+----------------
Art     | 1           | 1               | 1               | 1.000000
English | 1           | 1               | 1               | 1.000000
History | 2           | 1               | 1               | 1.000000
Math    | 1           | 1               | 1               | 1.000000
Science | 4           | 1               | 2               | 1.500000
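Note that avg(cnt) in the query above averages over rows rather than days, so a day with 2 Science results contributes its count twice: Science yields (1 + 1 + 2 + 2) / 4 = 1.5. If you want the average per distinct day instead (4 results over 3 days, about 1.33), a sketch that needs no subquery:
select Subject
      ,count(*) as total_count
      ,count(*) * 1.0 / count(distinct Study_Date) as avg_per_day  -- results per distinct day
from t
where Subject = 'Science'
group by Subject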

Related

SQL Finding sum of rows and returning count of keys

For a database table looking something like this:
id | year | stint | sv
----+------+-------+---
mk1 | 2001 | 1 | 30
mk1 | 2001 | 2 | 20
ml0 | 1999 | 1 | 43
ml0 | 2000 | 1 | 44
hj2 | 1993 | 1 | 70
I want to get the following output:
count
-------
3
with the conditions being: count the number of ids that have a total sv > 40 in a single year later than 1994. If there is more than one stint for the same id and year, add the sv points and then check whether the total is > 40.
This is what I have written so far but it is obviously not right:
SELECT COUNT(DISTINCT id),
       SUM(sv) as SV
FROM public.pitching
WHERE (year > 1994 AND sv > 40);
I know the syntax is completely wrong and some of the conditions are missing, but I'm not familiar enough with SQL and don't know how to properly sum two rows of the same table under a condition (maybe with a subquery?). Any help would be appreciated! (Using Postgres.)
You could use a nested query to get the aggregations, and wrap it to get the count. Note that the condition on the sum must be in a HAVING clause:
SELECT COUNT(id)
FROM (
    SELECT id,
           year,
           SUM(sv) as SV
    FROM public.pitching
    WHERE year > 1994
    GROUP BY id, year
    HAVING SUM(sv) > 40
) years
If an id should only count once even if it fulfils the condition in more than one year, use COUNT(DISTINCT id) instead of COUNT(id).
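That is, a minimal variant of the same query:
SELECT COUNT(DISTINCT id)
FROM (
    SELECT id,
           year,
           SUM(sv) as SV
    FROM public.pitching
    WHERE year > 1994
    GROUP BY id, year
    HAVING SUM(sv) > 40
) years
(On the sample data this returns 2 rather than 3, because ml0 qualifies in both 1999 and 2000 but is only counted once.)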
You can also try the following, using SUM as a window function partitioned by year:
select count(distinct year)
from (
    select year, sum(sv) over (partition by year) s
    from public.pitching
    where year > 1994
) t
where s > 40

Calculate time span over a number of records

I have a table that has the following schema:
ID | FirstName | Surname | TransmissionID | CaptureDateTime
1 | Billy | Goat | ABCDEF | 2018-09-20 13:45:01.098
2 | Jonny | Cash | ABCDEF | 2018-09-20 13:45:01.108
3 | Sally | Sue | ABCDEF | 2018-09-20 13:45:01.298
4 | Jermaine | Cole | PQRSTU | 2018-09-20 13:45:01.398
5 | Mike | Smith | PQRSTU | 2018-09-20 13:45:01.498
There are well over 70,000 records, and they store logs of transmissions to a web service. What I'd like to know is: how would I go about writing a script that selects the distinct TransmissionID values and also shows the timespan between the earliest CaptureDateTime record and the latest? Essentially I'd like to see the rate at which the web service is reading and writing records.
Is it even possible to do so in a single SELECT statement, or should I just create a stored procedure or a report in code? I don't know where to start, aside from SELECT DISTINCT TransmissionID, for this sort of query.
Here's what I have so far (I'm stuck on the time calculation):
SELECT DISTINCT [TransmissionID],
COUNT(*) as 'Number of records'
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
I'm not sure how to get the difference between the first and last record with the same TransmissionID. I would like to get a result set like:
TransmissionID | TimeToCompletion | Number of records |
ABCDEF | 2.001 | 5000 |
Simply GROUP BY and use the MIN / MAX functions to find the earliest/latest date in each group, then subtract them:
SELECT
TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime))
FROM yourdata
GROUP BY TransmissionID
HAVING COUNT(*) > 1
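The desired output (TimeToCompletion = 2.001) suggests sub-second precision, and DATEDIFF(second, ...) returns whole seconds only. A sketch assuming SQL Server, taking the difference in milliseconds and dividing back to seconds:
SELECT TransmissionID,
       COUNT(*) AS [Number of records],
       DATEDIFF(millisecond, MIN(CaptureDateTime), MAX(CaptureDateTime)) / 1000.0 AS TimeToCompletion  -- decimal seconds; note the millisecond unit overflows for spans over ~24 days
FROM [log_table]
GROUP BY TransmissionID
HAVING COUNT(*) > 1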
Use MIN and MAX to calculate the timespan:
SELECT [TransmissionID],
       COUNT(*) as 'Number of records',
       datediff(s, min(CaptureDateTime), max(CaptureDateTime)) as timespan
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
A method that returns the average time between records for all TransmissionIDs, even those with only 1 record (NULLIF turns the zero divisor into NULL rather than raising a division-by-zero error):
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime)) * 1.0 / NULLIF(COUNT(*) - 1, 0)
FROM yourdata
GROUP BY TransmissionID;
Note that you may not actually want the maximum of the capture date for a given transmissionId. You might want the overall maximum in the table -- so you can consider the final period after the most recent record.
If so, this looks like:
SELECT TransmissionID,
       COUNT(*),
       DATEDIFF(second,
                MIN(CaptureDateTime),
                MAX(MAX(CaptureDateTime)) OVER ()  -- inner MAX per group; outer windowed MAX takes the latest across the whole table
       ) * 1.0 / COUNT(*)
FROM yourdata
GROUP BY TransmissionID;

SQL: Values in column listed twice after pivot

When querying a specific table, I need to change the structure of the result, making it so that all values from a given year are on the same row, in separate columns that identify the category that the value belongs to.
The table looks like this (example data):
year | category | amount
1991 | A of s | 56
1992 | A of s | 55
1993 | A of s | 40
1994 | A of s | 51
1995 | A of s | 45
1991 | Total | 89
1992 | Total | 80
1993 | Total | 80
1994 | Total | 81
1995 | Total | 82
The result I need is this:
year | a_of_s | total
1991 | 56 | 89
1992 | 55 | 80
1993 | 40 | 80
1994 | 51 | 81
1995 | 45 | 82
From what I can understand I need to use pivot. However, my problem seems to be that I don't understand pivot. I've attempted to adapt the queries of solutions in similar questions where pivot seems to be part of the answer, and so far what I've come up with is this:
SELECT year, [A of s], [Total] FROM table
pivot (
max(amount)
FOR category in ([A of s], [Total])
) pvt
ORDER BY year
This returns the correct table structure, but all cells in the columns a_of_s and total are NULL, and every year is listed twice. What am I missing to get the result I need?
EDIT: After fixing the errors pointed out in the comments, the only real issue that remains is that years in the year column are listed twice.
Possibly related: Is the aggregate function I use in pivot (max, sum, min, etc) arbitrary?
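The duplicate years typically come from extra columns in the source table: PIVOT implicitly groups by every column that is neither aggregated nor pivoted, so any additional column (an id, for instance) splits each year into several rows. A sketch that restricts the source to just the needed columns (keeping the question's placeholder names):
SELECT [year], [A of s] AS a_of_s, [Total] AS total
FROM (SELECT [year], category, amount FROM [table]) src  -- only the columns PIVOT should see
PIVOT (
    MAX(amount)
    FOR category IN ([A of s], [Total])
) pvt
ORDER BY [year];
As for the aggregate function: it only matters when several source rows share the same (year, category) pair; with exactly one row per pair, MAX, MIN, and SUM all return that single value.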
I assume you don't really need to pivot your table; with the result you require, an alternative approach can achieve it. This is the query I made that returns according to your requirement:
;With cte as
(
select year, Amount from tbl
where category = 'A of s'
)
select
tbl1.year, tbl2.Amount as A_of_S, tbl1.Amount as Total
from tbl as tbl1
inner join cte as tbl2 on tbl1.year = tbl2.year
where tbl1.category = 'Total'
Much simpler answer, using conditional aggregation (CASE inside MAX) instead of PIVOT:
WITH VTE AS(
SELECT *
FROM (VALUES (1991,'A of s',56),
(1992,'A of s',55),
(1993,'A of s',40),
(1994,'A of s',51),
(1995,'A of s',45),
(1991,'Total',89),
(1992,'Total',80),
(1993,'Total',80),
(1994,'Total',81),
(1995,'Total',82)) V([year],category, amount))
SELECT [year],
MAX(CASE category WHEN 'A of s' THEN amount END) AS [A of s],
MAX(CASE category WHEN 'Total' THEN amount END) AS Total
FROM VTE
GROUP BY [year];

Impala: change the column type prior to perform the aggregation function for group by

I have a table, my_table:
transaction_id | money | team
--------------------------------------------
1 | 10 | A
2 | 20 | B
3 | null | A
4 | 30 | A
5 | 16 | B
6 | 12 | B
When I group by team, I can compute max, min through query:
select team, max(money), min(money) from my_table group by team
However, I can't do avg and sum because there is a null, i.e. the following would fail:
select team, avg(money), sum(money) from my_table group by team
Is there a way to change the column type prior to computing the avg and sum? i.e. I want the output to be:
team | avg(money) | sum(money)
--------------------------------------
A | 20 | 40
B | 16 | 48
Thanks!
Per the documentation provided by Cloudera, your query should work as-is. Both the AVG function and the SUM function ignore NULL:
SELECT team, AVG(money), SUM(money)
FROM my_table
GROUP BY team
UPDATE: Per your comment (again, I'm not familiar with Impala, but presumably standard SQL will work), your error appears to be a datatype issue:
SELECT team, AVG(CAST(money AS INT)), SUM(CAST(money AS INT))
FROM my_table
GROUP BY team
Just divide the sum by the count:
SELECT team,
       SUM(money) / COUNT(money) AS avg_money,  -- COUNT(money) skips NULLs, so this matches AVG(money)
       SUM(money)
FROM my_table
GROUP BY team
Tested here: http://sqlfiddle.com/#!9/ba381/4
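If you instead wanted NULLs treated as zero in the average (team A would then give 40 / 3 ≈ 13.33 rather than the expected 20), a sketch using COALESCE:
SELECT team, AVG(COALESCE(money, 0)) AS avg_money, SUM(money)  -- COALESCE substitutes 0 for NULL before averaging
FROM my_table
GROUP BY team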

GROUP BY and aggregate sequential numeric values

Using PostgreSQL 9.0.
Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:
Example Table:
+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google | Programmer | 2000 |
| Google | Sales | 2000 |
| Google | Sales | 2001 |
| Google | Sales | 2002 |
| Google | Sales | 2004 |
| Mozilla | Sales | 2002 |
+-----------------------------+
I'm interested in a query which would output rows similar to the following:
+-----------------------------------------+
| company | profession | year |
+---------+------------+------------------+
| Google | Programmer | [2000] |
| Google | Sales | [2000,2001,2002] |
| Google | Sales | [2004] |
| Mozilla | Sales | [2002] |
+-----------------------------------------+
The essential feature is that only consecutive years shall be grouped together.
Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).
The first step is to identify non-consecutive values for the year:
Step 1) Identify non-consecutive values
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
This returns the following result:
company | profession | year | group_cnt
---------+------------+------+-----------
Google | Programmer | 2000 | 1
Google | Sales | 2000 | 1
Google | Sales | 2001 | 0
Google | Sales | 2002 | 0
Google | Sales | 2004 | 1
Mozilla | Sales | 2002 | 1
Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:
Step 2) Define group IDs
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
This returns the following result:
company | profession | year | group_nr
---------+------------+------+----------
Google | Programmer | 2000 | 1
Google | Sales | 2000 | 2
Google | Sales | 2001 | 2
Google | Sales | 2002 | 2
Google | Sales | 2004 | 3
Mozilla | Sales | 2002 | 4
(6 rows)
As you can see, each "group" got its own group_nr, which we can finally use to aggregate over by adding yet another derived table:
Step 3) Final query
select company,
profession,
array_agg(year) as years
from (
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr
This returns the following result:
company | profession | years
---------+------------+------------------
Google | Programmer | {2000}
Google | Sales | {2000,2001,2002}
Google | Sales | {2004}
Mozilla | Sales | {2002}
(4 rows)
Which is exactly what you wanted, if I'm not mistaken.
There's much value to #a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.
And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.
That is just what the above-mentioned answer provides, and that is precisely what I think can be done in a simpler way. Here's how:
WITH MarkedForGrouping AS (
SELECT
company,
profession,
year,
year - ROW_NUMBER() OVER (
PARTITION BY company, profession
ORDER BY year
) AS seqID
FROM atable
)
SELECT
company,
profession,
array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
company,
profession,
seqID
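The trick here: within each (company, profession) partition, the years and the row numbers both increase by 1 across an unbroken run, so their difference (seqID) stays constant for the run and changes whenever a year is skipped. Worked by hand on the sample data, the CTE yields:
company | profession | year | seqID
--------+------------+------+------
Google  | Programmer | 2000 | 1999
Google  | Sales      | 2000 | 1999
Google  | Sales      | 2001 | 1999
Google  | Sales      | 2002 | 1999
Google  | Sales      | 2004 | 2000
Mozilla | Sales      | 2002 | 2001
Grouping by (company, profession, seqID) then collapses exactly the consecutive runs.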
Procedural solution with PL/pgSQL
The problem is rather unwieldy for plain SQL with aggregate / window functions. While looping is typically slower than set-based solutions in plain SQL, a procedural solution with PL/pgSQL can make do with a single sequential scan over the table (the implicit cursor of a FOR loop) and should be substantially faster in this particular case:
Test table:
CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
('Google', 'Programmer', 2000)
, ('Google', 'Sales', 2000)
, ('Google', 'Sales', 2001)
, ('Google', 'Sales', 2002)
, ('Google', 'Sales', 2004)
, ('Mozilla', 'Sales', 2002)
;
Function:
CREATE OR REPLACE FUNCTION f_periods()
RETURNS TABLE (company text, profession text, years int[])
LANGUAGE plpgsql AS
$func$
DECLARE
r tbl; -- use table type as row variable
r0 tbl;
BEGIN
FOR r IN
SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
LOOP
IF ( r.company, r.profession, r.year)
<> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row
RETURN QUERY
SELECT r0.company, r0.profession, years; -- output row
years := ARRAY[r.year]; -- start new array
ELSE
years := years || r.year; -- add to array - year can be NULL, too
END IF;
r0 := r; -- remember last row
END LOOP;
RETURN QUERY -- output last iteration
SELECT r0.company, r0.profession, years;
END
$func$;
Call:
SELECT * FROM f_periods();
Produces the requested result.