Get average time difference grouped by another column in postgresql - sql

I have a posts table where I'm interested in calculating the average time difference between each author's posts. Here is a minimal example:
+-------------+---------------------+
| post_author | post_date           |
|-------------+---------------------|
| 0           | 2019-03-05 19:12:24 |
| 1           | 2017-11-06 18:28:43 |
| 1           | 2017-11-06 18:28:43 |
| 1           | 2017-11-06 18:28:43 |
| 1           | 2017-11-06 18:28:43 |
| 1           | 2018-02-19 18:36:36 |
| 1           | 2018-02-19 18:36:36 |
| 1           | 2018-02-19 18:36:36 |
| 1           | 2018-02-19 18:36:36 |
| 1           | 2018-02-19 18:40:09 |
+-------------+---------------------+
So for each author, I want to get the deltas of their time series, then find the average (grouped by author). The end result would look something like:
+-------------+----------------------+
| post_author | post_date_delta(hrs) |
|-------------+----------------------|
| 0           | 0                    |
| 1           | 327                  |
| 2           | 95                   |
| ...         | ...                  |
+-------------+----------------------+
I can think of how to do it in Python, but I'm struggling to write a (postgres) SQL query to accomplish this. Any help is appreciated!

You can use aggregation and arithmetic:
select post_author,
(max(post_date) - min(post_date)) / nullif(count(*) - 1, 0)
from t
group by post_author;
The average gap is the difference between the maximum and minimum dates, divided by one less than the number of posts; the nullif() keeps the query from dividing by zero for authors with a single post.
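Since the desired output is in hours (and shows 0 rather than NULL for an author with a single post), a variation on the same idea converts the interval to seconds first. This is only a sketch against the same table t; the additions are extract(epoch from ...), the 3600.0 divisor, and a coalesce():
select post_author,
       coalesce(
           extract(epoch from max(post_date) - min(post_date)) / 3600.0
               / nullif(count(*) - 1, 0),
           0) as post_date_delta_hrs  -- average gap in hours; 0 when there is only one post
from t
group by post_author;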

Related

Time Series Downsampling/Upsampling

I am trying to downsample and upsample time series data on MonetDB.
Time series database systems (TSDS) usually offer downsampling and upsampling with an operator like SAMPLE BY (1h).
My time series data looks like the following:
sql>select * from datapoints limit 5;
+----------------------------+------------+-------------+-----------+------+--------+-------------------+
| time                       | id_station | temperature | discharge | ph   | oxygen | oxygen_saturation |
+============================+============+=============+===========+======+========+===================+
| 2019-03-01 00:00:00.000000 | 0          | 407.052     | 0.954     | 7.79 | 12.14  | 12.14             |
| 2019-03-01 00:00:10.000000 | 0          | 407.052     | 0.954     | 7.79 | 12.13  | 12.13             |
| 2019-03-01 00:00:20.000000 | 0          | 407.051     | 0.954     | 7.79 | 12.13  | 12.13             |
| 2019-03-01 00:00:30.000000 | 0          | 407.051     | 0.953     | 7.79 | 12.12  | 12.12             |
| 2019-03-01 00:00:40.000000 | 0          | 407.051     | 0.952     | 7.78 | 12.12  | 12.12             |
+----------------------------+------------+-------------+-----------+------+--------+-------------------+
I tried the following query but the results are obtained by aggregating all the values from different days, which is not what I am looking for:
sql>SELECT EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "hour";
+------+-------------------+
| hour | avg_ph            |
+======+===================+
| 0    | 8.041121283524923 |
| 1    | 8.041086970785418 |
| 2    | 8.041152801724111 |
| 3    | 8.04107828783526  |
| 4    | 8.041060110153223 |
| 5    | 8.041167286877407 |
| ...  | ...               |
| 23   | 8.041219444444451 |
+------+-------------------+
I then tried to aggregate the time series data first by day and then by hour:
SELECT EXTRACT(DATE FROM time) AS "day", EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "day", "hour";
But I am getting the following exception:
syntax error, unexpected sqlDATE in: "select extract(date"
My question: how could I aggregate/downsample the data to a specific period of time (e.g. obtain an aggregated value every 2 days or 12 hours)?
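For what it's worth, a sketch of one way to approach this: the syntax error goes away if the day is taken with CAST(time AS DATE) instead of EXTRACT(DATE FROM time), and coarser buckets can be derived from the extracted hour by integer division. Exact behaviour of aliases in GROUP BY and of the division may depend on the MonetDB version, so treat this as an assumption to verify:
-- one average per day and hour
SELECT CAST(time AS DATE) AS "day",
       EXTRACT(HOUR FROM time) AS "hour",
       AVG(ph) AS avg_ph
FROM datapoints
GROUP BY "day", "hour";

-- 12-hour buckets: 0 for 00:00-11:59, 1 for 12:00-23:59 (assumes integer division of the hour)
SELECT CAST(time AS DATE) AS "day",
       EXTRACT(HOUR FROM time) / 12 AS half_day,
       AVG(ph) AS avg_ph
FROM datapoints
GROUP BY "day", half_day;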

Join on minimum date between two dates - Spark SQL

I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1  | 2019-07-31     |
| 1  | 2019-06-30     |
| 2  | 2019-10-31     |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1  | 2019-07-29 | 1          |
| 1  | 2019-07-30 | 1          |
| 1  | 2019-08-01 | 2          |
| 1  | 2019-08-02 | 2          |
| 1  | 2019-08-03 | 2          |
| 1  | 2019-06-29 | 0          |
| 1  | 2019-06-30 | 0          |
| 2  | 2019-10-30 | 5          |
| 2  | 2019-10-31 | NULL       |
| 2  | 2019-11-01 | 6          |
| 2  | 2019-11-02 | 6          |
I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days), keeping only the record with the minimum daily_date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but that seems not to be an option in Spark SQL. I'm wondering if there's a method that is similarly computationally efficient that works in Spark.
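One option that stays in plain Spark SQL (a sketch rather than a tested answer: it assumes the two tables are registered as the views month_df and daily_df, and it drops monthly rows that have no qualifying daily row) is to join on the date window, filter out null statuses, and keep the earliest daily row per monthly row with a window function:
WITH joined AS (
    SELECT m.ID,
           m.month_end_date,
           d.daily_date,
           d.new_status,
           ROW_NUMBER() OVER (
               PARTITION BY m.ID, m.month_end_date
               ORDER BY d.daily_date
           ) AS rn
    FROM month_df m
    JOIN daily_df d
      ON d.ID = m.ID
     AND d.daily_date >= m.month_end_date
     AND d.daily_date < DATE_ADD(m.month_end_date, 5)  -- 5-day buffer
     AND d.new_status IS NOT NULL
)
SELECT ID, month_end_date, daily_date, new_status
FROM joined
WHERE rn = 1;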

Summarize tables with percent of sum in single query

I have an ACTIVE_TRANSPORTATION table:
+--------+----------+--------+
| ATN_ID | TYPE     | LENGTH |
+--------+----------+--------+
| 1      | SIDEWALK | 20.6   |
| 2      | SIDEWALK | 30.1   |
| 3      | TRAIL    | 15.9   |
| 4      | TRAIL    | 40.4   |
| 5      | SIDEWALK | 35.2   |
| 6      | TRAIL    | 50.5   |
+--------+----------+--------+
It is related to an INSPECTION table via the ATN_ID:
+---------+--------+------------------+
| INSP_ID | ATN_ID | LENGTH_INSPECTED |
+---------+--------+------------------+
| 101     | 2      | 15.2             |
| 102     | 3      | 5.4              |
| 103     | 5      | 15.9             |
| 104     | 6      | 20.1             |
+---------+--------+------------------+
I want to summarize the information like this:
+----------+--------+-------------------+
| TYPE     | LENGTH | PERCENT_INSPECTED |
+----------+--------+-------------------+
| SIDEWALK | 85.9   | 36%               |
| TRAIL    | 106.8  | 23%               |
+----------+--------+-------------------+
How can I do this within a single query?
Here is the updated answer using Access 2010. Note that LENGTH is a reserved word in Access, so the alias needs to be changed to LENGTH_:
SELECT
TYPE,
SUM(LENGTH) as LENGTH_,
SUM(IIF(ISNULL(LENGTH_INSPECTED),0, LENGTH_INSPECTED))/SUM(LENGTH) as PERCENT_INSPECTED
FROM
ACTIVE_TRANSPORTATION A
LEFT JOIN INSPECTION B
ON A.ATN_ID = B.ATN_ID
GROUP BY TYPE
Here is the answer using T-SQL in SQL Server 2014 that I had originally:
SELECT SUM(LENGTH) as LENGTH,
SUM(ISNULL(LENGTH_INSPECTED,0))/SUM(LENGTH) as PERCENT_INSPECTED,
TYPE
FROM
ACTIVE_TRANSPORTATION A
LEFT JOIN INSPECTION B
ON A.ATN_ID = B.ATN_ID
GROUP BY TYPE
Let me know if you need it to be converted to percent, rounded, etc, but I'm guessing that part is easy for you.
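If you do want it as a rounded percentage string, one possible T-SQL variation (just a sketch on top of the query above):
SELECT
    TYPE,
    SUM(LENGTH) AS LENGTH,
    -- round to a whole percent and append the sign
    CAST(CAST(ROUND(100.0 * SUM(ISNULL(LENGTH_INSPECTED, 0)) / SUM(LENGTH), 0) AS INT)
         AS VARCHAR(5)) + '%' AS PERCENT_INSPECTED
FROM ACTIVE_TRANSPORTATION A
LEFT JOIN INSPECTION B
    ON A.ATN_ID = B.ATN_ID
GROUP BY TYPE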

How to generate this report?

I'm trying to set up a report based on several tables.
I have a table Actual that looks like this:
+--------+------+
| status | date |
+--------+------+
| 5      | 7/10 |
| 8      | 7/9  |
| 8      | 7/11 |
| 5      | 7/18 |
+--------+------+
Table Targets looks like this:
+--------+-------------+--------+------------+
| status | weekEndDate | target | cumulative |
+--------+-------------+--------+------------+
| 5      | 7/12        | 4      | 45         |
| 5      | 7/19        | 5      | 50         |
| 8      | 7/12        | 4      | 45         |
| 8      | 7/19        | 5      | 50         |
+--------+-------------+--------+------------+
Grouping the Actual records by which Targets.weekEndDate they fall under, I have the following aggregate query GroupActual:
+-------------+--------+--------------+--------+------------+
| weekEndDate | status | weeklyTarget | actual | cumulative |
+-------------+--------+--------------+--------+------------+
| 7/12        | 5      | 4            | 1      | 45         |
| 7/12        | 8      | 4            | 2      | 41         |
| 7/19        | 5      | 5            | 1      | 50         |
| 7/19        | 8      | 4            |        | 45         |
+-------------+--------+--------------+--------+------------+
I'm trying to create this report:
+--------+------------+------+------+
| status | category   | 7/12 | 7/19 |  ...etc for every weekEndDate entry in Targets
+--------+------------+------+------+
| 5      | actual     | 1    | 1    |
| 5      | target     | 4    | 5    |
| 5      | cumulative | 45   | 50   |
+--------+------------+------+------+
| 8      | actual     | 2    |      |
| 8      | target     | 4    | 5    |
| 8      | cumulative | 45   | 50   |
+--------+------------+------+------+
I can use a crosstab query to make the date columns, but I'm not sure how to have rows for "actual", "target", and "cumulative". They aren't values in the same table, which means (I think) that a crosstab query won't be useful for this breakdown. Should I try to change GroupActual so that it puts the data in the shape I'm looking for? Kind of confused as to where to go next with this...
EDIT: I've made some headway on the crosstabs as per PowerUser's solution, but I'm having trouble with the one for Target. I modified the wizard's generated SQL in an attempt to get what I want but it's not working out. I used a version of GroupActual that only has the weekEndDate,status, and weeklyTarget columns; here's the SQL:
TRANSFORM weeklyTarget
SELECT status
FROM TargetStatus_forCrosstab_Target
GROUP BY status,weeklyTarget
PIVOT Format([weekEndDate],"Short Date");
You're almost there. The problem is that you can't do this all in a single crosstab. You need to make 3 crosstabs (one for 'actual', one for 'target', and one for 'cumulative'), then make a Union query to combine them all.
Additional Tip: In your individual crosstabs, add a Sort column. Your 'actual' crosstab will have a Sort value of 1, 'Target' will have a Sort value of 2, and 'Cumulative' will have 3. That way, when you union them together, you can get them all in the right order.
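A rough sketch of the union step, assuming the three crosstabs are saved as queries (hypothetical names xtabActual, xtabTarget, xtabCumulative), each outputting status, a literal category column, its Sort value, and the same set of date columns (note that Access requires an aggregate in the TRANSFORM clause, e.g. TRANSFORM Sum(weeklyTarget)):
SELECT * FROM xtabActual
UNION ALL
SELECT * FROM xtabTarget
UNION ALL
SELECT * FROM xtabCumulative
ORDER BY status, Sort;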

SQL Query - Grouping Data

So every morning at work we have a stand-up meeting. We throw the nearest object to hand around the room as a method of deciding who speaks in what order. Being slightly odd, I decided it could be fun to get some data on these throws. So, every morning I memorise the order of throws (as well as other relevant things, like who dropped the ball/strange sponge object that was probably once a ball too, and who threw to someone who'd already been or just gave an atrocious throw), and record this data in a table:
+---------+-----+------------+----------+---------+----------+--------+--------------+
| throwid | day | date       | thrownum | thrower | receiver | caught | correctthrow |
+---------+-----+------------+----------+---------+----------+--------+--------------+
| 1       | 1   | 10/01/2012 | 1        | dan     | steve    | 1      | 1            |
| 2       | 1   | 10/01/2012 | 2        | steve   | alice    | 1      | 1            |
| 3       | 1   | 10/01/2012 | 3        | alice   | matt     | 1      | 1            |
| 4       | 1   | 10/01/2012 | 4        | matt    | justin   | 1      | 1            |
| 5       | 1   | 10/01/2012 | 5        | justin  | arif     | 1      | 1            |
| 6       | 1   | 10/01/2012 | 6        | arif    | pete     | 1      | 1            |
| 7       | 1   | 10/01/2012 | 7        | pete    | greg     | 0      | 1            |
| 8       | 1   | 10/01/2012 | 8        | greg    | alan     | 1      | 1            |
| 9       | 1   | 10/01/2012 | 9        | alan    | david    | 1      | 1            |
| 10      | 1   | 10/01/2012 | 10       | david   | dan      | 1      | 1            |
| 11      | 2   | 11/01/2012 | 1        | dan     | david    | 1      | 1            |
| 12      | 2   | 11/01/2012 | 2        | david   | alice    | 1      | 1            |
| 13      | 2   | 11/01/2012 | 3        | alice   | steve    | 1      | 1            |
| 14      | 2   | 11/01/2012 | 4        | steve   | arif     | 1      | 1            |
| 15      | 2   | 11/01/2012 | 5        | arif    | pete     | 0      | 1            |
| 16      | 2   | 11/01/2012 | 6        | pete    | justin   | 1      | 1            |
| 17      | 2   | 11/01/2012 | 7        | justin  | alan     | 1      | 1            |
| 18      | 2   | 11/01/2012 | 8        | alan    | dan      | 1      | 1            |
| 19      | 2   | 11/01/2012 | 9        | dan     | greg     | 1      | 1            |
+---------+-----+------------+----------+---------+----------+--------+--------------+
I've now got quite a few days' worth of data for this, and I'm starting to run some queries on it for my own purposes (I've not told the rest of the team yet...wouldn't like to influence the results). I've done a few with no issues, but I'm stuck trying to get a certain result out.
What I'm looking for is the number of times each person has been the last team member to receive the ball. Now, as you can see on the table, due to absences etc the number of throws per day is not always constant, so I can't simply select the receiver by thrownum.
In the case for the data above, it would return:
+--------+-------------------+
| person | LastReceiverTotal |
+--------+-------------------+
| dan    | 1                 |
| greg   | 1                 |
+--------+-------------------+
I've got this far:
SELECT MAX(thrownum) AS LastThrowNum, day FROM Throws GROUP BY day
Now, this returns some useful data. I get the highest thrownum for each and every day. It would seem like all I need to do is get the receiver for this value, and then get a count grouped by receiver to get my answer. This doesn't work, though, because the resultset isn't what it seems due to the above query using aggregate functions.
I suspect there's a much better way of designing tables to store the data to be honest, but equally I'm also sure there's a way to get this information with the tables as they are - some kind of inner query? I can't figure out how it would work. Can anyone shed some light on how this would be done?
The query that you have gives you the biggest thrownum for each day.
With that, you just do an inner join with your table and get the receiver and the number of times he appears.
select t.receiver as person, count(t.day) as LastReceiverTotal
from Throws t
inner join (
    select max(thrownum) as LastThrowNum, day
    from Throws
    group by day
) a
    on a.LastThrowNum = t.thrownum
   and a.day = t.day
group by t.receiver