How to Query Data Associated With Minimum/Maximum in Pig

How to Query Data Associated With Minimum/Maximum in Pig - apache-pig

I'm looking for the coldest hour for each day. My data looks like this:
(2015/12/27,12AM,32.0)
(2015/12/27,12PM,34.0)
(2015/12/28,10AM,26.1)
(2015/12/28,10PM,28.0)
(2015/12/28,11AM,27.0)
(2015/12/28,11PM,28.9)
(2015/12/28,12AM,25.0)
(2015/12/28,12PM,26.100000000000005)
(2015/12/29,10AM,22.45)
(2015/12/29,10PM,26.1)
(2015/12/29,11AM,24.1)
(2015/12/29,11PM,25.0)
(2015/12/29,12AM,28.9)
I grouped on each day to find the Min Temp with this code:
minTemps = FOREACH gdate2 GENERATE group as day,MIN(removeDash.temp) as minTemp;
which gives this output:
(2015/12/18,17.1)
(2015/12/19,12.9)
(2015/12/20,23.0)
(2015/12/21,32.0)
(2015/12/22,30.899999999999995)
(2015/12/23,36.05)
(2015/12/24,30.45)
(2015/12/25,26.55)
(2015/12/26,28.899999999999995)
(2015/12/27,26.1)
(2015/12/28,23.55)
(2015/12/29,21.0)
My problem:I also need the hour at which the minimum temp occurred. How can I get the hour as well?

If I'm understanding your question correctly, grouping by (day, hour) won't work because this finds the coldest temperature for each hour, not the coldest hour and temperature for each day.
Instead, use a nested foreach:
B = GROUP A BY day;
C = FOREACH B {
orderd = ORDER A BY temp ASC;
limitd = LIMIT orderd 1;
GENERATE FLATTEN(limitd) AS (day, hour, temp);
};
Group by day as you did before, then order all the hours within the same day by temperature and select only the top record. Just be aware that if there is a tie between two or more hours, only one of these hours will be selected.

Yes, you are on the right track.Modify your group statement to group by day and hour.Finally use FLATTEN on your group decouple the keys.
gdate2 = GROUP removeDash by (day,hour);
minTemps = FOREACH gdate2 GENERATE FLATTEN(group) as (day,hour),MIN(removeDash.temp) as minTemp;

Related

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
MAX(DATE_TRUNC('DAY', (time))) as last_day,
SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I am in need of an equivalent query based on window functions to report the total value of the column named balance for all the days up to and including the given date .
Here is what I've tried so far, but without any valid result:
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(sum(balance) FILTER ( WHERE is_spent_column is NULL ) ) OVER ( ORDER BY DATE_TRUNC('DAY', (time)) ) AS total_value_per_day
FROM tbl
group by 1
order by 1 desc
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above on the entire dataset up to and including the desired day. For example, for day 2009-01-31, the result is 97.13522530000000000000, or for day 2009-01-15 when we filter time as time < '2009-01-16 00:00:00' it returns 24.446144000000000000.
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The reason for differences in result sets of the queries was on the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.

Consider calculating running total via window function after aggregating data to day level. And since you aggregate with a single condition, FILTER condition can be converted to basic WHERE:
SELECT daily,
SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(balance) AS total_balance
FROM tbl
WHERE is_spent_column IS NULL
GROUP BY 1
) AS daily_agg
ORDER BY daily

Saving only unique datapoints in SQL

For simplicity: We have a table with 2 columns, value and date.
Every second a new data is received and we want to save it with it's timestamp. Since the data can be similar, to lower usage, if data is the same as previous entry, we don't save it.
Question: Given that same value was received during 24 hours, only the first value & date pair is saved. If we want to query 'Average value in last 1 hour', is there a way to have the db (PostgreSQL) see that no values are saved in last hour and search for last existing value entry?

It is not as easy as it may seem, and it is not just about retrieving the latest data point when there is none available within the last hour. You want to calculate an average, so you need to rebuild the time-series data of the period, second per second, filling the gaps with the latest available data point.
I think the simplest approach is generate_series() to build the rows, and then a lateral join to recover the data:
select avg(d.value) avg_last_hour
from generate_series(now() - interval '1 hour', now(), '1 second') t(ts)
cross join lateral (
select d.*
from data d
where d.date <= t.ts
order by d.date desc
limit 1
) t

Hmmm . . . if you simply want the average of values in the most recent hour in the data, you can use:
select date_trunc('hour', date) as ddhh, avg(value)
from t
group by ddhh
order by ddhh desc
limit 1;
If you have a lot of data being collected, it might be faster to add an index on date and use:
select avg(value)
from t
where date >= date_trunc('hour', (select max(t2.date) from t t2));

How to group timestamps into islands (based on arbitrary gap)?

Consider this list of dates as timestamptz:
I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.
I'm trying to measure how much a given user studied, by looking at when they performed an action (the data is when they finished studying a sentence.) e.g.: on the yellow block, I'd consider the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.
I see how I could group these dates with a programming language by going through all of the dates and looking for the gap between two rows.
My question is: how would go about grouping dates in this way with Postgres?
(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)

SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
FROM tbl
) sub
ORDER BY done;
The subquery sub returns step = true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.
fiddle
Related:
Query to find all timestamps more than a certain interval apart
How to label groups in postgresql when group belonging depends on the preceding line?
Select longest continuous sequence
Grouping or Window
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
Conditional lead/lag function PostgreSQL?

Building up on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:
My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands only have one timestamp in them (leading to a 0-duration estimate.) I'm accounting for both by calculating the average review time and adding it to the total duration of islands.
This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.
-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT( EPOCH FROM logged) + countofislands * avgreviewtime) as totalstudytime, avgreviewtime -- add total logged time to estimate for first-review-in-island and 1-review islands
FROM
(
SELECT -- get the three key values that will let us calculate total time spent
sum(duration) as logged
, count(island) as countofislands
, EXTRACT( EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval) )/( sum(reviews) FILTER (WHERE duration != '00:00:00'::interval) - count(reviews) FILTER (WHERE duration != '00:00:00'::interval)) as avgreviewtime
FROM
(
SELECT island, age( max(done), min(done) ) as duration, count(island) as reviews -- calculate the duration of islands
FROM
(
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island -- give a unique number to each island
FROM (
SELECT -- detect the beginning of islands
done,
(
lag(done) OVER (ORDER BY done) <= done - interval '2 min'
) AS step
FROM review
WHERE clicker_id = 71 AND "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00' -- keep the queries small and fast for now
) sub
ORDER BY done
) grouped
GROUP BY island
) sessions
) summary

SQL query getting multiple where-claused aliases

Hoping you can help with this issue.
I have an energymanagement software running on a system. The data logged is the total value, logged in the column Value. This is done every hour. Along is some other data, here amongst a boolean called Active and an integer called Day.
What I'm going for, is one query that gets me the a list of sorted days, the total powerusage of the day, and the peak-powerusage of the day.
The peak-power usage is counted by using Max/Min of the value where Active is present. Somedays, however, the Active bit isn't set, and the result of this query alone would yield NULL.
This is my query:
SELECT
A.Day, A.Forbrug, B.Peak
FROM
(SELECT
Day, Max(Value) - Min(Value) AS Forbrug
FROM
EL_HT1_K
WHERE
MONTH = 8 AND YEAR = 2016
GROUP By Day) A,
(SELECT
Day, Max(Value) - Min(Value) AS Peak
FROM
EL_HT1_K
WHERE
Month = 8 AND Year = 2016 AND Active = 1
GROUP BY Day) B
WHERE
A.Day = B.Day
Which only returns the result where query B (Peak-usage) would yield results.
What I want, is that the rest of the results from inner query A, still is shown, even though query B yields 0/null for that day.
Is this possible, and how?
FYI. The reason I need this to be in one query, is that the scada system has some difficulties handling multiple queries.

I think you just want conditional aggregation. Based on your description, this seems to be the query you want:
SELECT Day, SUM(Value) as total,
MAX(CASE WHEN Active = 1 THEN Value END) as Peak,
FROM EL_HT1_K
WHERE Month = 8 AND Year = 2016
GROUP BY Day;

PL SQL - newbie - MAX fron table and avg

I have a table of account numbers by date and 24 hourly intervals
ACCT# ; Date ; hour1, hour2, hour3......hour24
there could be as many as 365 days per account# I would like to find the average of all the intervals as well as the as max interval for each acct#. I tried to add sample data but since it is my first post I can't attach it (need+10 posts) –
"customer_number" "date" "est_hb_0000" "est_hb_0100" "est_hb_0200" "est_hb_0300" "est_hb_0400" "est_hb_0500" "est_hb_0600" "est_hb_0700" "est_hb_0800" "est_hb_0900" "est_hb_1000" "est_hb_1100" "est_hb_1200" "est_hb_1300" "est_hb_1400" "est_hb_1500" "est_hb_1600" "est_hb_1700" "est_hb_1800" "est_hb_1900" "est_hb_2000" "est_hb_2100" "est_hb_2200" "est_hb_2300"
Thanks in advance for your help

This SQL should get you close, after you have restructured and populated the table.
Select Customer, Date, HourOfDay, NumAccts
From ActivityStats
Where Customer = 'zzz'
and Date = 'Dt'
and RowNumber < 11
Order By NumAccts Desc

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas