Referring to a field outside of a subselect's scope - SQL

I'm working on a piece of SQL at the moment, and I need to retrieve every row of a dataset along with an average and a median aggregated over its month.
Example:
I have the following set:
ID;month;value
and I would like to retrieve something like:
ID;month;value;average for this month;median for this month
without having to group my result.
So it would be something like:
SELECT ID,month,value,
(SELECT AVG(value) FROM myTable) as "myAVG"
FROM myTable
but I would need that average to be the average for that month specifically. So rows where month = "January" would have the average and median for "January", and so on.
The issue is that I did not find a way to refer to the outer row's month value inside my subquery:
(SELECT AVG(value) FROM myTable)
Does anyone have a clue?
P.S.: It's a Redshift database I'm working on.

You would need to select all rows from the table, and do a left join with a select statement that does group by month. This way, you would get every row, and the group by results with them for that month.
Something like this:
SELECT * FROM myTable a
LEFT JOIN
(
SELECT Month, SUM(value) AS mySum
FROM myTable
GROUP BY Month
) b
ON a.Month = b.Month
Helpful?

with myavg as
(SELECT month, AVG(value) as avgval FROM myTable group by month)
, mymed as
(select month, median(value) as medval from myTable group by month)
select m.ID, m.month, m.value, ma.avgval, mm.medval
from mytable m left join myavg ma
on m.month = ma.month
left join mymed mm
on m.month = mm.month
You can use a CTE to do this. However, you need a group by on month inside each CTE, as you are calculating an aggregate value per month.

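In Redshift you can use a window function. Partitioning by month, with no ORDER BY or frame clause, gives every row the aggregate over its whole month:
select ID, month, value,
avg(value) over (partition by month) as month_avg,
median(value) over (partition by month) as month_median
from myTable
order by month;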
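The question's original sticking point, referring to the outer row's month inside the subquery, can be handled by giving the outer table an alias. The sketch below is a minimal illustration, not Redshift itself: it runs against an in-memory SQLite database through Python so it can be executed anywhere, the sample rows are made up, and the aliases t1 and t2 are assumptions. It compares the correlated subquery with the window-function form (which needs SQLite 3.25 or later). SQLite has no MEDIAN function, so only the average is shown.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (ID INTEGER, month TEXT, value REAL)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(1, "January", 10.0), (2, "January", 20.0), (3, "February", 5.0)],
)

# Correlated subquery: the alias t1 makes the outer row's month visible
# inside the subquery, which is filtered to that month only.
correlated = conn.execute(
    """
    SELECT t1.ID, t1.month, t1.value,
           (SELECT AVG(t2.value) FROM myTable t2
            WHERE t2.month = t1.month) AS myAVG
    FROM myTable t1
    ORDER BY t1.ID
    """
).fetchall()

# Window function: the same per-month average without any subquery.
windowed = conn.execute(
    """
    SELECT ID, month, value,
           AVG(value) OVER (PARTITION BY month) AS myAVG
    FROM myTable
    ORDER BY ID
    """
).fetchall()

print(correlated)
print(windowed)
```

Both queries return one output row per input row, so no GROUP BY on the outer result is needed.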

Related

Selecting max date of each month

I have a table with a lot of cumulative columns; these columns reset to 0 at the end of each month. If I sum this data, I'll end up double counting. Instead, with Hive, I'm trying to select the max date of each month.
I've tried this:
SELECT
yyyy_mm_dd,
id,
name,
cumulative_metric1,
cumulative_metric2
FROM
mytable
WHERE
yyyy_mm_dd = last_day(yyyy_mm_dd)
mytable has daily data from the start of the year. In the output of the above, I only see the last date for January but not February. How can I select the last day of each month?
February is not over yet. Perhaps a window function does what you want:
SELECT yyyy_mm_dd, id, name, cumulative_metric1, cumulative_metric2
FROM (SELECT t.*,
MAX(yyyy_mm_dd) OVER (PARTITION BY last_day(yyyy_mm_dd)) as last_yyyy_mm_dd
FROM mytable t
) t
WHERE yyyy_mm_dd = last_yyyy_mm_dd;
This calculates the last day present in the data for each month, rather than the last calendar day.
Use a correlated subquery and the date-to-month function in Hive:
SELECT
yyyy_mm_dd,
id,
name,
cumulative_metric1,
cumulative_metric2
FROM
mytable t1
WHERE
yyyy_mm_dd = (select max(t2.yyyy_mm_dd) from mytable t2 where
month(t1.yyyy_mm_dd) = month(t2.yyyy_mm_dd))
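As a rough, runnable illustration of the idea behind both answers (take the latest date actually present in each month, then keep only the rows on that date), here is a sketch using an in-memory SQLite database through Python instead of Hive. The sample rows are invented, and the year-month key built with substr() assumes the dates are stored as 'YYYY-MM-DD' strings. Grouping on year and month together, rather than on month alone, keeps the query correct if the table later spans more than one year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (yyyy_mm_dd TEXT, id INTEGER, cumulative_metric1 INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("2020-01-30", 1, 90),
        ("2020-01-31", 1, 100),
        ("2020-02-01", 1, 3),
        ("2020-02-02", 1, 7),  # February is incomplete in this sample
    ],
)

# Inner query: latest date present per year-month.
# Outer join: keep only the rows that fall on that date.
last_rows = conn.execute(
    """
    SELECT t.yyyy_mm_dd, t.id, t.cumulative_metric1
    FROM mytable t
    JOIN (SELECT substr(yyyy_mm_dd, 1, 7) AS ym,
                 MAX(yyyy_mm_dd) AS last_day_seen
          FROM mytable
          GROUP BY substr(yyyy_mm_dd, 1, 7)) m
      ON t.yyyy_mm_dd = m.last_day_seen
    ORDER BY t.yyyy_mm_dd
    """
).fetchall()

print(last_rows)
```

Unlike the last_day() filter in the question, this also returns a row for the month that has not finished yet.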

SQL count new values only with partition by - running count with no duplicates

Based on the table below, in Presto I need a column counting all new 'rid' values. What I managed to do is the same as what I can achieve with partition by, but it's not exactly what I'm looking for (db<>fiddle demo).
The goal is to have many grouping counts, but I think this should describe the problem sufficiently.
I need data truncated by day and a column for new users every day, as shown in the example below. In simple words: if a value repeats, don't count it. I've tried to find a correlation between this and the relational division problem, but I'm just stuck.
You could use row_number() to rank the records of each rid by time; then you can aggregate and count in only the top record per group.
select
date_trunc('day', t.time) dy,
count(*) rid_count,
sum(case when t.rn = 1 then 1 else 0 end) new_rid_count
from (
select
t.*,
row_number() over(partition by t.rid order by t.time) rn
from mytable t
) t
group by date_trunc('day', t.time)
I think of this as two levels of aggregation. The inner one to get the earliest date. The outer to aggregate:
select first_day, count(*)
from (select rid, cast(date_trunc('day', min(time)) as date) as first_day
from orders o
group by rid
) r
group by 1
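To make the row_number() approach concrete, here is a small sketch that runs the same logic on an in-memory SQLite database through Python (window functions need SQLite 3.25 or later). It is not Presto: date() stands in for date_trunc('day', ...), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (rid TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [
        ("a", "2020-01-01 08:00:00"),
        ("b", "2020-01-01 09:00:00"),
        ("a", "2020-01-02 10:00:00"),  # repeat of "a": not new
        ("c", "2020-01-02 11:00:00"),
        ("a", "2020-01-02 12:00:00"),  # repeat of "a": not new
    ],
)

# rn = 1 marks the first time each rid is ever seen; summing that flag
# per day counts only the rids that are new on that day.
daily = conn.execute(
    """
    SELECT date(t.time) AS dy,
           COUNT(*) AS rid_count,
           SUM(CASE WHEN t.rn = 1 THEN 1 ELSE 0 END) AS new_rid_count
    FROM (SELECT m.*,
                 ROW_NUMBER() OVER (PARTITION BY m.rid ORDER BY m.time) AS rn
          FROM mytable m) t
    GROUP BY date(t.time)
    ORDER BY dy
    """
).fetchall()

print(daily)
```

On the second day three rows are counted in total, but only "c" appears for the first time, so new_rid_count is 1.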

Cumulative Column that starts at Zero every Year

I have data from January 1st 2008 to today ordered by date in the first column of a table Ratio.
I have values in the second column. I was able to do a cumulative third column with the following code but I don't know how to make it re-start every January 1st to have cumulative per year.
SELECT
t3.Date,
SUM(cumul) AS cumul
FROM (
SELECT
t1.Date,
t1.nb,
SUM(t2.nb) AS cumul
FROM (
SELECT
Ratio.Date,
SUM(DailyValue) AS Nb
FROM Ratio
GROUP BY Ratio.Date
)t1
INNER JOIN (
SELECT
Ratio.Date,
SUM(DailyValue) AS nb
FROM Ratio
GROUP BY Ratio.Date
) t2
ON t1.Date >= t2.Date
GROUP BY t1.Date, t1.nb
)t3
GROUP BY t3.Date, t3.nb
ORDER BY t3.Date
There is a better way, using the window function SUM partitioned by year:
select Date,
sum(sum(DailyValue)) over (
partition by year(date) order by date
) as cumul
from Ratio
group by Date
order by Date;
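A minimal sketch of the yearly reset, run on an in-memory SQLite database through Python (SQLite 3.25 or later for window functions). This is a stand-in for the original database: strftime('%Y', Date) replaces year(Date), the daily totals are computed in a CTE named daily (a name chosen here for illustration) before the window is applied, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Ratio (Date TEXT, DailyValue INTEGER)")
conn.executemany(
    "INSERT INTO Ratio VALUES (?, ?)",
    [
        ("2008-12-30", 1),
        ("2008-12-31", 2),
        ("2008-12-31", 3),  # two rows on the same date are summed first
        ("2009-01-01", 10),
        ("2009-01-02", 5),
    ],
)

# Step 1: one total per date. Step 2: running sum over those totals,
# restarted for each year by the PARTITION BY clause.
cumul = conn.execute(
    """
    WITH daily AS (
        SELECT Date, SUM(DailyValue) AS nb
        FROM Ratio
        GROUP BY Date
    )
    SELECT Date,
           SUM(nb) OVER (PARTITION BY strftime('%Y', Date) ORDER BY Date) AS cumul
    FROM daily
    ORDER BY Date
    """
).fetchall()

print(cumul)
```

The running total reaches 6 at the end of 2008 and starts again from 10 on the first day of 2009.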

Aggregates for today and the previous day depending on data

I'm having trouble putting together a query to pull the aggregate values for a given timestamp and the timestamp before it. Given the following schema:
name TEXT,
ts TIMESTAMP,
X NUMERIC,
Y NUMERIC
where there are gaps in the ts column due to gaps in data, I'm trying to construct a query to produce
name,
date_trunc('day', q1.ts),
avg(q1.X),
sum(q1.Y),
date_trunc('day', q2.ts),
avg(q2.X),
sum(q2.Y)
The first half is straightforward:
SELECT q1.name, date_trunc('day', q1.ts), avg(q1.X), sum(q1.Y)
FROM data as q1
GROUP BY 1, 2
ORDER BY 1, 2;
But I'm not sure how to generate the relation to find the previous "day" for each row. I'm trying to work with an inner join like this:
SELECT q1.name, q1.day, q1.avg, q1.sum, q2.day, q2.avg, q2.sum
FROM (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q1 INNER JOIN (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q2 ON (
q1.name = q2.name
AND q2.day = q1.day - interval '1 day'
);
The problem with this is that it doesn't cover the cases where the previous "day" with data is more than 1 day before the current day.
The special difficulty here is that you need to number days after aggregating rows. You can do this in a single query level with the window function row_number(), since window functions are applied after aggregation by GROUP BY.
Also, use a CTE to avoid executing the same subquery multiple times:
WITH q AS (
SELECT name, ts::date AS day
,avg(x) AS avg_x, sum(y) AS sum_y
,row_number() OVER (PARTITION BY name ORDER BY ts::date) AS rn
FROM data
GROUP BY 1,2
)
SELECT q1.name, q1.day, q1.avg_x, q1.sum_y
,q2.day AS day2, q2.avg_x AS avg_x2, q2.sum_y AS sum_y2
FROM q q1
LEFT JOIN q q2 ON q1.name = q2.name
AND q1.rn = q2.rn + 1
ORDER BY 1,2;
This uses the simpler cast to date (ts::date) instead of date_trunc('day', ts) to get "days".
LEFT [OUTER] JOIN (as opposed to [INNER] JOIN) is instrumental to preserve the corner case of the first row, where there is no previous day.
And ORDER BY should be applied to the outer query.
The question isn't crystal clear, but it sounds like you're actually trying to fill gaps while keeping track of leading/lagging rows.
To fill the gaps, look into generate_series() and left join it with your table:
select d
from generate_series(timestamp '2013-12-01', timestamp '2013-12-31', interval '1 day') d;
http://www.postgresql.org/docs/current/static/functions-srf.html
For previous and next row values, look into lead() and lag() window functions:
select date_trunc('day', ts) as curr_row_day,
lag(date_trunc('day', ts)) over w as prev_row_day
from data
window w as (order by ts)
http://www.postgresql.org/docs/current/static/tutorial-window.html
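One way to combine the two ideas above, aggregating per day first and then reaching back to the previous day that actually has data, is to apply lag() to the aggregated rows. The sketch below is an illustration under assumptions, not PostgreSQL itself: it runs on an in-memory SQLite database through Python (SQLite 3.25 or later), date(ts) stands in for ts::date, and the sample rows are invented with a deliberate gap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, ts TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("a", "2013-12-01 08:00:00", 1.0, 10.0),
        ("a", "2013-12-01 20:00:00", 3.0, 20.0),
        ("a", "2013-12-05 09:00:00", 5.0, 1.0),  # gap of several days
    ],
)

# q holds one aggregated row per (name, day). LAG then reads the previous
# aggregated row for the same name, however many calendar days back it is.
rows = conn.execute(
    """
    WITH q AS (
        SELECT name, date(ts) AS day, AVG(x) AS avg_x, SUM(y) AS sum_y
        FROM data
        GROUP BY name, date(ts)
    )
    SELECT name, day, avg_x, sum_y,
           LAG(day)   OVER w AS prev_day,
           LAG(avg_x) OVER w AS prev_avg_x,
           LAG(sum_y) OVER w AS prev_sum_y
    FROM q
    WINDOW w AS (PARTITION BY name ORDER BY day)
    ORDER BY name, day
    """
).fetchall()

for row in rows:
    print(row)
```

The first day for each name has no predecessor, so its prev_* columns come back as NULL (None in Python), which matches the behavior of the LEFT JOIN variant.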

Last day of the month with a twist in SQLPLUS

I would appreciate a little expert help please.
In a SQL SELECT statement, I am trying to get the last day with data per month for the last year.
For example, I am easily able to get the last day of each month and join that to my data table, but the problem is that if the last day of the month does not have data, then no data is returned. What I need is for the SELECT to return the last day with data for the month.
This is probably easy to do, but to be honest, my brain fart is starting to hurt.
I've attached the select below that works for returning the data for only the last day of the month for the last 12 months.
Thanks in advance for your help!
SELECT fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,fd.column_name
FROM super_table fd,
(SELECT TRUNC(daterange,'MM')-1 first_of_month
FROM (
select TRUNC(sysdate-365,'MM') + level as DateRange
from dual
connect by level<=365)
GROUP BY TRUNC(daterange,'MM')) fom
WHERE fd.cust_id = :CUST_ID
AND fd.coll_date > SYSDATE-400
AND TRUNC(fd.coll_date) = fom.first_of_month
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)
You probably need to group your data so that each month's data is in the group, and then within the group select the maximum date present. The sub-query might be:
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY YEAR(coll_date) * 100 + MONTH(coll_date);
This presumes that the functions YEAR() and MONTH() exist to extract the year and month from a date as an integer value. Clearly, this doesn't constrain the range of dates - you can do that, too. If you don't have the functions in Oracle, then you do some sort of manipulation to get the equivalent result.
Using information from Rhose (thanks):
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table fd
GROUP BY TO_CHAR(coll_date, 'YYYYMM');
This achieves the same net result, putting all dates from the same calendar month into a group and then determining the maximum value present within that group.
Here's another approach, if ANSI row_number() is supported:
with RevDayRanked(itemDate,rn) as (
select
cast(coll_date as date),
row_number() over (
partition by datediff(month,coll_date,'2000-01-01') -- rewrite datediff as needed for your platform
order by coll_date desc
)
from super_table
)
select itemDate
from RevDayRanked
where rn = 1;
Rows numbered 1 will be nondeterministically chosen among rows on the last active date of the month, so you don't need distinct. If you want information out of the table for all rows on these dates, use rank() over days instead of row_number() over coll_date values, so a value of 1 appears for any row on the last active date of the month, and select the additional columns you need:
with RevDayRanked(cust_id, server_name, coll_date, rk) as (
select
cust_id, server_name, coll_date,
rank() over (
partition by datediff(month,coll_date,'2000-01-01')
order by cast(coll_date as date) desc
)
from super_table
)
select cust_id, server_name, coll_date
from RevDayRanked
where rk = 1;
If row_number() and rank() aren't supported, another approach is this (for the second query above). Select all rows from your table for which there's no row in the table from a later day in the same month.
select
cust_id, server_name, coll_date
from super_table as ST1
where not exists (
select *
from super_table as ST2
where datediff(month,ST1.coll_date,ST2.coll_date) = 0
and cast(ST2.coll_date as date) > cast(ST1.coll_date as date)
)
If you have to do this kind of thing a lot, see if you can create an index over computed columns that hold cast(coll_date as date) and a month indicator like datediff(month,'2001-01-01',coll_date). That'll make more of the predicates SARGs.
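The NOT EXISTS variant can be tried out with the sketch below, which runs on an in-memory SQLite database through Python rather than on Oracle. The sample rows are invented, and the month comparison uses substr() on 'YYYY-MM-DD' strings as a stand-in for the platform-specific datediff(month, ...) call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE super_table (cust_id INTEGER, server_name TEXT, coll_date TEXT)"
)
conn.executemany(
    "INSERT INTO super_table VALUES (?, ?, ?)",
    [
        (1, "srv1", "2012-01-15"),
        (1, "srv1", "2012-01-28"),  # last day with data in January
        (1, "srv2", "2012-01-28"),
        (1, "srv1", "2012-02-29"),
    ],
)

# Keep a row only if no other row exists later in the same year-month.
last_rows = conn.execute(
    """
    SELECT st1.cust_id, st1.server_name, st1.coll_date
    FROM super_table st1
    WHERE NOT EXISTS (
        SELECT 1
        FROM super_table st2
        WHERE substr(st2.coll_date, 1, 7) = substr(st1.coll_date, 1, 7)
          AND st2.coll_date > st1.coll_date
    )
    ORDER BY st1.coll_date, st1.server_name
    """
).fetchall()

print(last_rows)
```

January has no data on the 31st in this sample, so both rows from the 28th are returned for that month.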
Putting the above pieces together, would something like this work for you?
SELECT fd.cust_id,
fd.server_name,
fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,
fd.column_name
FROM super_table fd
WHERE fd.cust_id = :CUST_ID
AND TRUNC(fd.coll_date) IN (
SELECT MAX(TRUNC(coll_date))
FROM super_table
WHERE coll_date > SYSDATE - 400
AND cust_id = :CUST_ID
GROUP BY TO_CHAR(coll_date,'YYYYMM')
)
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)