Efficient way to simultaneously calc AVG and stddev_pop in postgres - sql

stddev_pop() must be calculating AVG() as part of the full calculation of the standard deviation (unless there's a shortcut I'm unaware of).
Is there any way to access that intermediate result in order to avoid recalculating AVG()?
For context, the goal is to test for a difference of means between these two geom columns.
Here's an example query:
select
avg(st_length(cons.geom)) as source_avg_length,
avg(st_length(csn.geom)) as target_avg_length,
stddev_pop(st_length(cons.geom)) as source_std_length,
stddev_pop(st_length(csn.geom)) as target_std_length
from
received.conflation_osm_no_service cons,
received.conflation_stress_network csn ;
The output of EXPLAIN ANALYZE (not reproduced here) makes me think that if I ask for both avg() and stddev_pop(), it will only do the avg() calculation once and reuse it. Is that right?

To combine both tables in a single result you must aggregate before joining:
select *
from
(
select
avg(st_length(geom)) as source_avg_length,
stddev_pop(st_length(geom)) as source_std_length
from received.conflation_osm_no_service cons
) as src
cross join
(
select
avg(st_length(geom)) as target_avg_length,
stddev_pop(st_length(geom)) as target_std_length
from
received.conflation_stress_network csn
) as tgt;
Or, to get one row per table:
select 'source' as tablename,
avg(st_length(geom)) as avg_length,
stddev_pop(st_length(geom)) as std_length
from
received.conflation_osm_no_service cons
union all
select 'target',
avg(st_length(geom)),
stddev_pop(st_length(geom))
from
received.conflation_stress_network csn ;
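To see why the comma join in the question is expensive even though it returns the right averages, here is a small sketch using Python's built-in sqlite3 module. The tables src and tgt and the column len are made-up stand-ins for the two PostGIS tables and st_length(geom). SQLite is used only because it runs anywhere; it has no stddev_pop(), so only avg() is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (len REAL);   -- hypothetical stand-in for the source table
    CREATE TABLE tgt (len REAL);   -- hypothetical stand-in for the target table
    INSERT INTO src VALUES (1.0), (2.0), (3.0);
    INSERT INTO tgt VALUES (10.0), (20.0), (30.0), (40.0);
""")

# The comma join pairs every src row with every tgt row (3 * 4 rows).
joined_rows = conn.execute("SELECT count(*) FROM src, tgt").fetchone()[0]

# Aggregating over that product still yields the per-table averages,
# because every value is repeated the same number of times.
joined_avgs = conn.execute(
    "SELECT avg(src.len), avg(tgt.len) FROM src, tgt"
).fetchone()

# Aggregating first and joining the two one-row results scans each table once.
separate_avgs = conn.execute("""
    SELECT s.a, t.a
    FROM (SELECT avg(len) AS a FROM src) AS s
    CROSS JOIN (SELECT avg(len) AS a FROM tgt) AS t
""").fetchone()

print(joined_rows, joined_avgs, separate_avgs)
```

The results match, but the first form aggregates over the product of the two row counts, which grows quickly with real tables.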

Per the comments: I was attributing the slow execution time to the repeated average aggregation, when in reality it was caused by an unnecessary join.
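On the original worry about computing the mean twice: a population standard deviation does not need a separate pass for the mean. It can be accumulated from a running count, sum, and sum of squares, and the mean falls out of the same state. The sketch below only illustrates that idea in Python; it is not PostgreSQL's actual implementation, and the function name is made up.

```python
import math


def mean_and_stddev_pop(values):
    """Return (mean, population stddev) from a single pass over values."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    if n == 0:
        return None, None  # SQL aggregates return NULL on empty input
    mean = total / n
    # E[x^2] - E[x]^2; clamp at zero to absorb floating-point rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


lengths = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = mean_and_stddev_pop(lengths)
print(mean, std)
```

This textbook formula can lose precision when the values are large relative to their spread, so treat it as an explanation of the principle rather than a recipe.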

Related

Need a SQL query explained

I'm learning the Databricks platform at the moment, and I'm on a lesson about CTEs. This specific query defines a CTE inside another CTE definition, and the instructor in the video does not do a great job of breaking down what exactly the query is doing.
WITH lax_bos AS (
WITH origin_destination (origin_airport, destination_airport) AS (
SELECT
origin,
destination
FROM
external_table
)
SELECT
*
FROM
origin_destination
WHERE
origin_airport = 'LAX'
AND destination_airport = 'BOS'
)
SELECT
count(origin_airport) AS `Total Flights from LAX to BOS`
FROM
lax_bos;
The output of the query is 684, which I know comes from the last SELECT statement. It's mostly everything above it that I don't fully understand.
First, you choose the two columns you need from external_table and name this CTE "origin_destination":
SELECT
origin,
destination
FROM
external_table
Next, you filter it in another CTE named "lax_bos":
SELECT
*
FROM
origin_destination -- the CTE you already made
WHERE
origin_airport = 'LAX'
AND destination_airport = 'BOS'
This is the main query, where you use the CTE "lax_bos" from the previous step; here you just count the number of flights:
SELECT
count(origin_airport) AS `Total Flights from LAX to BOS`
FROM
lax_bos
Nesting CTEs is weird. Normally they form a single-level transformation pipeline, like this:
WITH origin_destination (origin_airport, destination_airport) AS
(
SELECT origin, destination
FROM external_table
), lax_bos AS
(
SELECT *
FROM origin_destination
WHERE origin_airport = 'LAX'
AND destination_airport = 'BOS'
)
SELECT count(origin_airport) AS `Total Flights from LAX to BOS`
FROM lax_bos;
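To check that the nested and the flattened forms really are equivalent, here is a small sketch using Python's built-in sqlite3 module. The four sample rows are invented, and SQLite stands in for Databricks SQL purely so the example runs anywhere; the column alias is shortened to total to avoid dialect-specific quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE external_table (origin TEXT, destination TEXT);
    -- Invented sample rows, not the course data set.
    INSERT INTO external_table VALUES
        ('LAX', 'BOS'), ('LAX', 'BOS'), ('LAX', 'JFK'), ('SFO', 'BOS');
""")

# CTE defined inside another CTE, as in the question.
nested_sql = """
    WITH lax_bos AS (
        WITH origin_destination (origin_airport, destination_airport) AS (
            SELECT origin, destination FROM external_table
        )
        SELECT * FROM origin_destination
        WHERE origin_airport = 'LAX' AND destination_airport = 'BOS'
    )
    SELECT count(origin_airport) AS total FROM lax_bos
"""

# The same two steps as a single-level pipeline.
flat_sql = """
    WITH origin_destination (origin_airport, destination_airport) AS (
        SELECT origin, destination FROM external_table
    ), lax_bos AS (
        SELECT * FROM origin_destination
        WHERE origin_airport = 'LAX' AND destination_airport = 'BOS'
    )
    SELECT count(origin_airport) AS total FROM lax_bos
"""

nested_total = conn.execute(nested_sql).fetchone()[0]
flat_total = conn.execute(flat_sql).fetchone()[0]
print(nested_total, flat_total)
```

Both queries count the same two LAX to BOS rows; only the scoping of origin_destination differs, since in the nested form it is visible only inside lax_bos.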
I do not understand why you are using a common table expression (CTE).
I am going to give you a quick overview of how this can be done without a CTE.
Always use some type of sample data set. There are plenty installed with Databricks; in fact, there is one for delayed airplane departures.
The next step is to read in the file and convert it to a temporary view.
At this point, we can use the Spark SQL magic command to query the data.
The query shows plane flights from LAX to BOS. We can remove the limit 10 option and change the '*' to "count(*) as Total" to get your answer. Thus, we solved the problem without a CTE.
The next example (an image in the original answer) uses a CTE to pull the origin, destination and delay for all flights from LAX to BOS. Then it bins the delays from -9 to 9 hours with counts.
Again, this can all be done in one SQL statement that might be cleaner.
I reserve CTEs for more complex situations, for instance calculating a complex math formula over a range of data and pairing it with the base data set.
A CTE can be a recursive query or a plain subquery. Here, they are only simple subqueries.
First, the query origin_destination is evaluated. Second, the query lax_bos is evaluated over the origin_destination result. Finally, the main query runs on the lax_bos result.

Average interval between timestamps in an array

In a PostgreSQL 9.x database, I have a column which is an array of type timestamp. Each array has between 1..n timestamps.
I'm trying to extract the average interval between all elements in each array.
I understand using a window function on the source table might be the ideal way to tackle this but in this case I am trying to do it as an operation on the array.
I've looked at several other questions that try to calculate the moving average of another column, or the average (median) date of a list of timestamps.
For example the average interval I'm looking for on an array with 3 elements like this:
'{"2012-10-09 17:04:05.710887"
,"2013-10-18 22:30:08.973749"
,"2014-10-22 22:18:18.885973"}'::timestamp[]
Would be:
-368d
Wondering if I need to unpack the array through a function?
One way of many possible: unnest, join, avg in a lateral subquery:
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
SELECT avg(a2.ts - a1.ts) AS avg_intv
FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord)
JOIN unnest(t.arr) WITH ORDINALITY a2(ts, ord) ON (a2.ord = a1.ord + 1)
) avg ON true;
The [INNER] JOIN in the subquery produces exactly the set of combinations relevant for intervals between elements.
I get 371 days 14:37:06.587543, not '-368d', btw.
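That figure can be double-checked outside the database. The sketch below, in plain Python, pairs each array element with its successor (the same pairs the self-join on ord + 1 produces) and averages the differences; the variable names are illustrative.

```python
from datetime import datetime, timedelta

arr = [
    datetime.fromisoformat("2012-10-09 17:04:05.710887"),
    datetime.fromisoformat("2013-10-18 22:30:08.973749"),
    datetime.fromisoformat("2014-10-22 22:18:18.885973"),
]

# Consecutive pairs (element i, element i + 1), in array order.
gaps = [later - earlier for earlier, later in zip(arr, arr[1:])]

# A one-element array has no gaps, mirroring the NULL the SQL query returns.
avg_intv = sum(gaps, timedelta()) / len(gaps) if gaps else None
print(avg_intv)
```

Because the intermediate timestamps cancel out, the average gap is simply (last - first) / (n - 1) when the array is sorted.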
Related, with more explanation:
PostgreSQL unnest() with element number
You can also only unnest once and use the window functions lead() or lag(), but you were trying to avoid window functions. And you need to make sure of the original order of elements in any case ...
(There is no array function you could use directly to get what you need - in case you were hoping for that.)
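For comparison, the lag() variant mentioned above can be sketched with Python's built-in sqlite3 module. This is only an illustration under several assumptions: the arrays are already unnested into a hypothetical table arr_elems(id, ord, ts), timestamps are stored as integer seconds, and the bundled SQLite is recent enough (3.25 or later) to support window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical unnested form: one row per array element.
    CREATE TABLE arr_elems (id INTEGER, ord INTEGER, ts INTEGER);
    INSERT INTO arr_elems VALUES
        (1, 1, 0), (1, 2, 100), (1, 3, 400),
        (2, 1, 50);
""")

rows = conn.execute("""
    SELECT id, avg(gap) AS avg_intv
    FROM (
        SELECT id,
               ts - lag(ts) OVER (PARTITION BY id ORDER BY ord) AS gap
        FROM arr_elems
    )
    GROUP BY id
    ORDER BY id
""").fetchall()
print(rows)
```

The first element of each array has no predecessor, so its gap is NULL and avg() ignores it; an array with a single element therefore yields NULL, just like the LEFT JOIN LATERAL version.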
Alternative with CTE
Might be appealing to still unnest only once (even while avoiding window functions):
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
WITH a AS (SELECT * FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord))
SELECT avg(a2.ts - a1.ts) AS avg_intv
FROM a a1
JOIN a a2 ON (a2.ord = a1.ord +1)
) avg ON true;
But I expect the added CTE overhead to cost more than unnesting twice. Mostly just demonstrating a WITH clause in a subquery.

Making one table with these two Queries

Hi, I am calculating retention and want to compute it for 1 day, 3 days and 7 days. I have written three separate queries; can somebody please tell me how to write these as one query?
Please help!
The simple way to combine multiple queries is generally a join. If you take each of your queries and turn them into subqueries, then join by date, this should give you the results you want.
As in:
SELECT one_day.activity_date as activity_date,
one_day.signed_up_users, one_day.logged_in_users, one_day.retention_pct,
three_day.signed_up_users, three_day.logged_in_users, three_day.retention_pct,
seven_day.signed_up_users, seven_day.logged_in_users, seven_day.retention_pct
FROM (SELECT ... ) as one_day
JOIN (SELECT ... ) as three_day
ON one_day.activity_date = three_day.activity_date
JOIN (SELECT ... ) as seven_day
ON one_day.activity_date = seven_day.activity_date
You also can use IF() clauses to do this in a non-joined fashion. You can create a column that is 1 if in the 1 day count and 0 otherwise, then compute the sum when you aggregate.
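The non-joined approach can be sketched as follows, using Python's built-in sqlite3 module. Since the original three queries are not shown, the table logins(user_id, days_after_signup) and its rows are entirely hypothetical, and CASE is used in place of IF() because it is portable across dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per user per day on which they logged in.
    CREATE TABLE logins (user_id INTEGER, days_after_signup INTEGER);
    INSERT INTO logins VALUES
        (1, 1), (1, 3), (1, 7),
        (2, 1), (2, 7),
        (3, 3),
        (4, 1);
""")

# Each CASE produces a 1/0 flag; summing the flags counts the matching rows.
row = conn.execute("""
    SELECT
        count(DISTINCT user_id) AS active_users,
        sum(CASE WHEN days_after_signup = 1 THEN 1 ELSE 0 END) AS day_1,
        sum(CASE WHEN days_after_signup = 3 THEN 1 ELSE 0 END) AS day_3,
        sum(CASE WHEN days_after_signup = 7 THEN 1 ELSE 0 END) AS day_7
    FROM logins
""").fetchone()
print(row)
```

All three retention counts come out of one scan of the table; dividing each by the number of signed-up users would give the percentages.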

BigQuery: GROUP BY clause for QUANTILES

Based on the BigQuery query reference, QUANTILES currently does not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate one quantile query per distinct group member, with the group member as a condition in the WHERE clause.
For example I use the below query for every distinct row in column-y if I want to get the desired result.
SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE
<column-y> == <each distinct row in column-y>
Does the BigQuery team plan on adding functionality to allow grouping on quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
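To make explicit what a per-group median should return, here is a small reference sketch in plain Python. The room names and readings are invented; statistics.median interpolates between the two middle values for even-sized groups, which is the behaviour one would expect from a continuous percentile at 0.5.

```python
from collections import defaultdict
from statistics import median

# Invented (room, temperature) readings.
readings = [
    ("room_a", 20.0), ("room_a", 22.0), ("room_a", 21.0),
    ("room_b", 18.0), ("room_b", 19.0),
]

# Partition by room, then take the median within each partition.
by_room = defaultdict(list)
for room, value in readings:
    by_room[room].append(value)

medians = {room: median(values) for room, values in by_room.items()}
print(medians)
```

A check like this on a small extract is a cheap way to validate the SQL result before running it on the full table.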
While there are efficient algorithms to compute quantiles, they are somewhat memory intensive; trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
If your per-group size is fixed, you may be able to hack it using a combination of order, nest and nth. For instance, if there are 9 distinct values of f2 per value of f1, for median:
select f1,nth(5,f2) within record from (
select f1,nest(f2) f2 from (
select f1, f2 from table
group by f1,f2
order by f2
) group by f1
);
Not sure if the sorted order in subquery is guaranteed to survive the second group, but it worked in a simple test I tried.
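The idea behind the nth trick is easy to state outside SQL: with a fixed, odd number of distinct values per group, the median is simply the middle element of the sorted values. A minimal Python sketch, with a made-up function name and sample values:

```python
def median_of_fixed_group(values, group_size=9):
    """Median of a group known to hold exactly group_size distinct values."""
    distinct_sorted = sorted(set(values))  # mirrors GROUP BY f1, f2 + ORDER BY f2
    if len(distinct_sorted) != group_size:
        raise ValueError("group does not have the expected number of distinct values")
    # nth(5, ...) is 1-based, so for 9 values it is index 4 here.
    return distinct_sorted[group_size // 2]


f2_values = [9, 1, 8, 2, 7, 3, 6, 4, 5]
print(median_of_fixed_group(f2_values))
```

The explicit size check matters: if a group has fewer or more values than assumed, a fixed position no longer points at the median.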

Subqueries and AVG() on a subtraction

Working on a query to return the average time from when an employee begins his/her shift and then arrives at the first home (this DB assumes they are salesmen).
What I have:
SELECT l.OFFICE_NAME, crew.EMPLOYEE_NAME -- , AVG(first arrival time)
FROM LOCAL_OFFICE l, CREW_WORK_SCHEDULE crew
WHERE l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
You can see the AVG() call is commented out. I know the time they arrive at work and the time they get to the first house, and can find the difference using this:
(SELECT MIN(c.ARRIVE)
FROM ORDER_STATUS c
WHERE c.USER_ID = crew.CREW_ID)
-(SELECT START_TIME
FROM CREW_SHIFT_CODES
WHERE WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE)
Would the best way be to simply put the above into the AVG() parentheses? Just trying to learn the best methods to create queries. If you want more info on any of the tables, etc. just ask, but hopefully they're all named so you know what they're returning.
As per my comment, the example you gave would only return one record to the AVG function, and so not do very much.
If the sub-query was returning multiple records, however, your suggestion of placing the sub-query inside the AVG() would work...
SELECT
AVG((SELECT MIN(sub.val) FROM sub WHERE sub.id = main.id GROUP BY sub.group))
FROM
main
GROUP BY
main.group
(Averaging a set of minima, and so requiring two levels of GROUP BY.)
In many cases this gives good performance, and is maintainable. But sometimes the sub-query grows large, and it can be better to reformat it using an inline view...
SELECT
main.group,
AVG(sub_query.val)
FROM
main
INNER JOIN
(
SELECT
sub.id,
sub.group,
MIN(sub.val) AS val
FROM
sub
GROUP BY
sub.id,
sub.group
)
AS sub_query
ON sub_query.id = main.id
GROUP BY
main.group
Note: Although this looks as though the inline view will calculate a lot of values that are not needed (and so be inefficient), most RDBMSs optimise this so that only the required records get processed. (The optimiser knows how the inner query is being used by the outer query, and builds the execution plan accordingly.)
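A small runnable comparison of the two shapes, using Python's built-in sqlite3 module. The tables main and sub follow the placeholder names above, except that the column is called grp because group is a reserved word; the rows are invented, and the correlated form here returns a single minimum per main row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (id INTEGER, grp TEXT);
    CREATE TABLE sub (id INTEGER, grp TEXT, val REAL);
    INSERT INTO main VALUES (1, 'A'), (2, 'A'), (3, 'B');
    INSERT INTO sub VALUES
        (1, 'x', 5), (1, 'x', 3), (2, 'x', 10), (2, 'x', 8), (3, 'y', 7);
""")

# Correlated scalar subquery inside the aggregate.
correlated = conn.execute("""
    SELECT m.grp,
           AVG((SELECT MIN(s.val) FROM sub AS s WHERE s.id = m.id))
    FROM main AS m
    GROUP BY m.grp
    ORDER BY m.grp
""").fetchall()

# Same result via an inline view that pre-computes the minima.
inline_view = conn.execute("""
    SELECT m.grp, AVG(sq.val)
    FROM main AS m
    INNER JOIN (
        SELECT id, grp, MIN(val) AS val
        FROM sub
        GROUP BY id, grp
    ) AS sq ON sq.id = m.id
    GROUP BY m.grp
    ORDER BY m.grp
""").fetchall()

print(correlated, inline_view)
```

Not every RDBMS accepts a subquery inside an aggregate, so the inline-view form is also the more portable of the two.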
Don't think in terms of subqueries: they're often quite slow. In effect, they are row-by-row (RBAR) operations rather than set-based ones.
Join all the tables together.
I've used a derived table to calculate the first arrival time.
Aggregate.
Something like:
SELECT
l.OFFICE_NAME, crew.EMPLOYEE_NAME,
AVG(os.minARRIVE - cs.START_TIME)
FROM
LOCAL_OFFICE l
JOIN
CREW_WORK_SCHEDULE crew ON l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
JOIN
CREW_SHIFT_CODES cs ON cs.WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE
JOIN
(SELECT MIN(ARRIVE) AS minARRIVE, USER_ID
FROM ORDER_STATUS
GROUP BY USER_ID
) os ON os.USER_ID = crew.CREW_ID
GROUP BY
l.OFFICE_NAME, crew.EMPLOYEE_NAME
This probably won't give correct data because of the minARRIVE grouping: there isn't enough info from ORDER_STATUS to show "which day" or "which shift". It's simply "first arrival for that user for all time"
Edit:
This will give you the average in minutes.
You can add this back to minARRIVE using DATEADD, or change it to hh:mm with some %60 (modulo) and /60 (integer divide).
AVG(
DATEDIFF(minute, cs.START_TIME, os.minARRIVE)
)
)
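The minutes-to-hh:mm conversion described in the edit can be sketched in Python as follows. The shift start and arrival times are invented, and the arithmetic is done on elapsed time, which may differ slightly from how a given database's DATEDIFF counts minute boundaries.

```python
from datetime import datetime

# Invented sample data: one shift start and three first arrivals.
shift_start = datetime(2024, 1, 8, 8, 0)
first_arrivals = [
    datetime(2024, 1, 8, 9, 30),
    datetime(2024, 1, 8, 9, 45),
    datetime(2024, 1, 8, 9, 30),
]

# Minutes from shift start to each first arrival, then their average.
minutes = [(arrival - shift_start).total_seconds() / 60 for arrival in first_arrivals]
avg_minutes = sum(minutes) / len(minutes)

# Integer divide by 60 for the hours, modulo 60 for the remaining minutes.
hours, mins = divmod(int(avg_minutes), 60)
hh_mm = f"{hours:02d}:{mins:02d}"
print(avg_minutes, hh_mm)
```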