How to select most recent values? - sql

I have a logging table collecting values from many probes:
CREATE TABLE [Log]
(
[LogID] int IDENTITY (1, 1) NOT NULL,
[Minute] datetime NOT NULL,
[ProbeID] int NOT NULL DEFAULT 0,
[Value] FLOAT(24) NOT NULL DEFAULT 0.0,
CONSTRAINT Log_PK PRIMARY KEY([LogID])
)
GO
CREATE INDEX [Minute_ProbeID_Value] ON [Log]([Minute], [ProbeID], [Value])
GO
Typically, each probe generates a value every minute or so. Some example output:
LogID Minute ProbeID Value
====== ================ ======= =====
873875 2014-07-27 09:36 1972 24.4
873876 2014-07-27 09:36 2001 29.7
873877 2014-07-27 09:36 3781 19.8
873878 2014-07-27 09:36 1963 25.6
873879 2014-07-27 09:36 2002 22.9
873880 2014-07-27 09:36 1959 -30.1
873881 2014-07-27 09:36 2005 20.7
873882 2014-07-27 09:36 1234 23.8
873883 2014-07-27 09:36 1970 19.9
873884 2014-07-27 09:36 1991 22.4
873885 2014-07-27 09:37 1958 1.7
873886 2014-07-27 09:37 1962 21.3
873887 2014-07-27 09:37 1020 23.1
873888 2014-07-27 09:38 1972 24.1
873889 2014-07-27 09:38 3781 20.1
873890 2014-07-27 09:38 2001 30
873891 2014-07-27 09:38 2002 23.4
873892 2014-07-27 09:38 1963 26
873893 2014-07-27 09:38 2005 20.8
873894 2014-07-27 09:38 1234 23.7
873895 2014-07-27 09:38 1970 19.8
873896 2014-07-27 09:38 1991 22.7
873897 2014-07-27 09:39 1958 1.4
873898 2014-07-27 09:39 1962 22.1
873899 2014-07-27 09:39 1020 23.1
What is the most efficient way to get just the latest reading for each Probe?
Example of the desired output (note: the "Value" is the latest reading, not e.g. a Max() or an Avg()):
LogID Minute ProbeID Value
====== ================= ======= =====
873899 27-Jul-2014 09:39 1020 23.1
873894 27-Jul-2014 09:38 1234 23.7
873897 27-Jul-2014 09:39 1958 1.4
873880 27-Jul-2014 09:36 1959 -30.1
873898 27-Jul-2014 09:39 1962 22.1
873892 27-Jul-2014 09:38 1963 26
873895 27-Jul-2014 09:38 1970 19.8
873888 27-Jul-2014 09:38 1972 24.1
873896 27-Jul-2014 09:38 1991 22.7
873890 27-Jul-2014 09:38 2001 30
873891 27-Jul-2014 09:38 2002 23.4
873893 27-Jul-2014 09:38 2005 20.8
873889 27-Jul-2014 09:38 3781 20.1

This is another approach
select *
from log l
where minute =
(select max(x.minute) from log x where x.probeid = l.probeid)
You can compare the execution plans with a fiddle - http://sqlfiddle.com/#!3/1d3ff/3/0

Try this:
SELECT T1.*
FROM Log T1
INNER JOIN (SELECT Max(Minute) Minute,
ProbeID
FROM Log
GROUP BY ProbeID)T2
ON T1.ProbeID = T2.ProbeID
AND T1.Minute = T2.Minute
You can play around with it on SQL Fiddle

Your question is: "What is the most efficient way to get just the latest reading for each Probe?"
To really answer this question, you need to test different solutions. I would generally go with the row_number() method suggested by @jyparask. However, the following might have better performance:
select l.*
from log l
where not exists (select 1
from log l2
where l2.probeid = l.probeid and
l2.minute > l.minute
);
For performance, you want an index on log(probeid, minute).
Although not exactly your problem, here is an example of where not exists performs better than other methods on SQL Server.
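To try the NOT EXISTS query without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration under some assumptions: SQLite stands in for SQL Server, Minute is stored as ISO-formatted text (so string comparison matches chronological order), the index name is made up, and only a handful of rows from the question are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Log (LogID INTEGER PRIMARY KEY, Minute TEXT, ProbeID INTEGER, Value REAL)"
)
# Index suggested in the answer; the name is hypothetical.
conn.execute("CREATE INDEX Log_ProbeID_Minute ON Log (ProbeID, Minute)")

# A few rows copied from the question's sample output.
rows = [
    (873875, "2014-07-27 09:36", 1972, 24.4),
    (873880, "2014-07-27 09:36", 1959, -30.1),
    (873885, "2014-07-27 09:37", 1958, 1.7),
    (873888, "2014-07-27 09:38", 1972, 24.1),
    (873897, "2014-07-27 09:39", 1958, 1.4),
]
conn.executemany("INSERT INTO Log VALUES (?, ?, ?, ?)", rows)

# Keep a row only if no later row exists for the same probe.
latest = conn.execute(
    """
    SELECT l.LogID, l.Minute, l.ProbeID, l.Value
    FROM Log l
    WHERE NOT EXISTS (SELECT 1
                      FROM Log l2
                      WHERE l2.ProbeID = l.ProbeID
                        AND l2.Minute > l.Minute)
    ORDER BY l.ProbeID
    """
).fetchall()

for row in latest:
    print(row)
```

Note that if two rows for the same probe share the same latest Minute, this query returns both of them.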

;WITH MyCTE AS
(
SELECT LogID,
Minute,
ProbeID,
Value,
ROW_NUMBER() OVER(PARTITION BY ProbeID ORDER BY Minute DESC) AS rn
FROM LOG
)
SELECT LogID,
Minute,
ProbeID,
Value
FROM MyCTE
WHERE rn = 1

Related

Write & Apply Python Function with Grouped Pandas Data

I have data that is grouped by a column 'plant_name' and I need to write & apply a function to test for a trend on one of the columns, i.e., named "10%" or '90%' for example.
My data looks like this -
plant_name year count mean std min 10% 50% 90% max
0 ARIZONA I 2005 8760.0 8.25 2.21 1.08 5.55 8.19 11.09 15.71
1 ARIZONA I 2006 8760.0 7.87 2.33 0.15 4.84 7.82 10.74 16.75
2 ARIZONA I 2007 8760.0 8.31 2.25 0.03 5.52 8.27 11.23 16.64
3 ARIZONA I 2008 8784.0 7.67 2.46 0.21 4.22 7.72 10.78 15.73
4 ARIZONA I 2009 8760.0 6.92 2.33 0.23 3.79 6.95 9.96 14.64
5 ARIZONA I 2010 8760.0 8.07 2.21 0.68 5.51 7.85 11.14 17.31
6 ARIZONA I 2011 8760.0 7.54 2.38 0.33 4.44 7.45 10.54 17.77
7 ARIZONA I 2012 8784.0 8.61 1.92 0.33 6.37 8.48 11.07 15.84
8 ARIZONA I 2015 8760.0 8.21 2.13 0.60 5.58 8.24 10.88 16.74
9 ARIZONA I 2016 8784.0 8.39 2.27 0.46 5.55 8.32 11.34 16.09
10 ARIZONA I 2017 8760.0 8.32 2.11 0.85 5.70 8.25 11.12 17.96
11 ARIZONA I 2018 8760.0 7.94 2.28 0.07 5.17 7.72 11.04 16.31
12 ARIZONA I 2019 8760.0 7.71 2.49 0.38 4.28 7.75 10.87 15.79
13 ARIZONA I 2020 8784.0 7.57 2.43 0.50 4.36 7.47 10.78 15.69
14 CAETITE I 2005 8760.0 8.11 3.15 0.45 3.76 8.38 12.08 18.89
15 CAETITE I 2006 8760.0 7.70 3.21 0.05 3.50 7.66 12.05 19.08
16 CAETITE I 2007 8760.0 8.64 3.18 0.01 4.05 8.83 12.63 18.57
17 CAETITE I 2008 8784.0 7.87 3.09 0.28 3.75 7.80 11.92 18.54
18 CAETITE I 2009 8760.0 7.31 3.02 0.17 3.46 7.21 11.40 19.46
19 CAETITE I 2010 8760.0 8.00 3.24 0.34 3.63 8.03 12.29 17.27
I'm using this function from here -
import pymannkendall as mk
and you apply the function like this:
mk.original_test(dataframe)
I need the final dataframe to look like this which is the result of the series columns returned by the function (mk.original_test):
trend, h, p, z, Tau, s, var_s, slope, intercept = mk.original_test(data)
plant_name trend h p z Tau s var_s slope intercept
0 ARIZONA I no trend False 0.416 0.812 xxx x x x x
1 CAETITE I increasing True 0.002 3.6 xxx x x x x
I just am not sure how to use groupby to group by plant_name column and then apply the mk function by plant_name to either of the columns in the data shown. Thank you,
For a given column, you can run the test in a GroupBy.apply() and return the result as a Series indexed by result._fields:
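import pandas as pd
import pymannkendall as mk
def mktest(x):
    # original_test returns a namedtuple; _fields supplies the result names
    result = mk.original_test(x)
    return pd.Series(result, index=result._fields)
column = '10%'
df.groupby('plant_name', as_index=False)[column].apply(mktest)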
plant_name trend h p z Tau s var_s slope intercept
ARIZONA I no trend False 0.956276 -0.054827 -0.021978 -2.0 332.666667 -0.003333 5.361667
CAETITE I no trend False 0.452370 -0.751469 -0.333333 -5.0 28.333333 -0.026000 3.755000

select avg for specific values of date

I have this table 'meteorecords' with date, temperature, rh and the meteo station which made the record.
rerowid date temp rh meteostid
1 2019-09-09 28.8 55.6 AITNIA2
2 2019-09-10 30.3 51.3 AITNIA2
3 2019-09-11 28.6 49.0 AITNIA2
4 2019-09-12 26.7 51.9 AITNIA2
5 2019-09-13 25.3 48.1 AITNIA2
6 2019-09-14 25.3 38.5 AITNIA2
7 2019-09-15 25.0 42.2 AITNIA2
8 2019-09-16 24.1 52.1 AITNIA2
9 2019-09-17 23.3 65.2 AITNIA2
10 2019-09-18 22.7 72.2 AITNIA2
11 2019-09-19 23.4 73.9 AITNIA2
12 2019-09-20 23.1 76.7 AITNIA2
13 2019-09-21 22.5 60.3 AITNIA2
14 2019-09-22 20.9 61.6 AITNIA2
15 2019-09-23 21.9 73.9 AITNIA2
16 2019-09-24 23.2 79.6 AITNIA2
17 2019-09-25 21.8 73.6 AITNIA2
18 2019-09-26 22.2 77.6 AITNIA2
19 2019-09-27 22.9 77.1 AITNIA2
20 2019-09-28 22.8 68.4 AITNIA2
21 2019-09-29 22.6 75.5 AITNIA2
...........................
I want to select all the fields plus the average temperature of the last 3 days.
I'm using postgresql because I have some geometric and spatial data in the db.
I tried this with no luck:
SELECT rerowid,redate,retemp,rerh,meteostid,
(SELECT AVG(retemp)
FROM meteorecords m
WHERE meteostid = m.meteostid AND m.redate BETWEEN redate-2 AND redate)
FROM meteorecords
which returns a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 22.2824
2 2019-09-10 30.3 51.3 AITNIA2 22.2824
3 2019-09-11 28.6 49.0 AITNIA2 22.2824
4 2019-09-12 26.7 51.9 AITNIA2 22.2824
5 2019-09-13 25.3 48.1 AITNIA2 22.2824
6 2019-09-14 25.3 38.5 AITNIA2 22.2824
7 2019-09-15 25.1 42.2 AITNIA2 22.2824
..................
But I want a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 28.8
2 2019-09-10 30.3 51.3 AITNIA2 29.5
3 2019-09-11 28.6 49.0 AITNIA2 29.2
4 2019-09-12 26.7 51.9 AITNIA2 28.5
5 2019-09-13 25.3 48.1 AITNIA2 26.9
6 2019-09-14 25.3 38.5 AITNIA2 25.8
7 2019-09-15 25.1 42.2 AITNIA2 25.2
..................
Use window functions. If you have one row per date, or you want the previous three dates that are present in the data:
SELECT rerowid, redate, retemp, rerh, meteostid,
AVG(retemp) OVER (PARTITION BY meteostid ORDER BY redate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;
If you want 3 chronological days, use RANGE (offsets in a RANGE frame require PostgreSQL 11 or later):
SELECT rerowid, redate, retemp, rerh, meteostid,
AVG(retemp) OVER (PARTITION BY meteostid
ORDER BY redate
RANGE BETWEEN '2 DAY' PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;
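The difference between the ROWS and RANGE frames only shows up when dates are missing. The following plain-Python sketch mimics both frames for a single station; it is not how PostgreSQL evaluates them, just an illustration. The first three readings come from the question, while the 2019-09-14 reading placed directly after 2019-09-11 is a made-up gap, and the helper names avg_rows and avg_range are hypothetical.

```python
from datetime import date, timedelta

# (date, temperature) for one station, sorted by date.
# The jump from 09-11 to 09-14 is an artificial gap for illustration.
readings = [
    (date(2019, 9, 9), 28.8),
    (date(2019, 9, 10), 30.3),
    (date(2019, 9, 11), 28.6),
    (date(2019, 9, 14), 25.3),
]

def avg_rows(data, i, n=3):
    """Like ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: the last n rows."""
    window = [temp for _, temp in data[max(0, i - n + 1): i + 1]]
    return sum(window) / len(window)

def avg_range(data, i, days=2):
    """Like RANGE BETWEEN '2 DAY' PRECEDING AND CURRENT ROW: calendar days."""
    current = data[i][0]
    start = current - timedelta(days=days)
    window = [temp for d, temp in data if start <= d <= current]
    return sum(window) / len(window)

for i, (d, temp) in enumerate(readings):
    print(d, temp, round(avg_rows(readings, i), 2), round(avg_range(readings, i), 2))
```

For the last row, the row-based window still averages three readings, whereas the calendar-based window contains only the reading from 2019-09-14 itself.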

PostgreSQL: How do I join two tables based on same start and end time (timestamp without time zone)?

Okay, I came across this relevant question but it is slightly different than my case.
Problem
I have two similar tables in my PostgreSQL 9.5 database, tbl1 and tbl2, each containing 1,274 rows. The structure and layout of table 1 is as follows:
Table 1:
id (integer) start_time end_time my_val1 (numeric)
51 1994-09-26 16:50:00 1994-10-29 13:30:00 3.7
52 1994-10-29 13:30:00 1994-11-27 12:30:00 2.4
53 1994-11-27 12:30:00 1994-12-29 09:25:00 7.6
54 1994-12-29 09:25:00 1994-12-31 23:59:59 2.9
54 1995-01-01 00:00:00 1995-02-05 13:50:00 2.9
55 1995-02-05 13:50:00 1995-03-12 11:10:00 1.6
56 1995-03-12 11:10:00 1995-04-11 09:05:00 2.2
171 1994-10-29 16:15:00 1994-11-27 19:10:00 6.9
172 1994-11-27 19:10:00 1994-12-29 11:40:00 4.2
173 1994-12-29 11:40:00 1994-12-31 23:59:59 6.7
173 1995-01-01 00:00:00 1995-02-05 15:30:00 6.7
174 1995-02-05 15:30:00 1995-03-12 09:45:00 3.2
175 1995-03-12 09:45:00 1995-04-11 11:30:00 1.2
176 1995-04-11 11:30:00 1995-05-11 15:30:00 2.7
321 1994-09-26 14:40:00 1994-10-30 14:30:00 0.2
322 1994-10-30 14:30:00 1994-11-27 14:45:00 7.8
323 1994-11-27 14:45:00 1994-12-29 14:20:00 4.6
324 1994-12-29 14:20:00 1994-12-31 23:59:59 4.1
324 1995-01-01 00:00:00 1995-02-05 14:35:00 4.1
325 1995-02-05 14:35:00 1995-03-12 11:30:00 8.2
326 1995-03-12 11:30:00 1995-04-11 09:45:00 1.2
.....
In some rows, the start_time or end_time may be the same, but the whole time window is not equal. For example,
id (integer) start_time end_time my_val1 (numeric)
54 1994-12-29 09:25:00 1994-12-31 23:59:59 2.9
173 1994-12-29 11:40:00 1994-12-31 23:59:59 6.7
Both start_time and end_time are timestamp without time zone. Each time window has to fall within a single calendar year, so whenever a window crossed from 1994 into 1995 the row was split into two rows; that is why some values repeat in the id column. Table 2 (tbl2) contains similar start_time and end_time columns (timestamp without time zone) and a my_val2 column (numeric). For each row in table 1, I need to join the corresponding row of table 2 that has the same start_time and end_time.
What I have tried:
Select
a.id,
a.start_time, a.end_time,
a.my_val1,
b.my_val2
from tbl1 a
left join tbl2 b on
b.start_time = a.start_time
order by a.id;
The query returned 3,802 rows, which is not what I want. The desired result is the 1,274 rows of table 1, each joined with my_val2. I am aware of the Postgres DISTINCT ON clause, but I need to keep all repeating ids of tbl1 and only join my_val2 from tbl2. Do I need to use a Postgres window function here? Can someone suggest how to join these two tables?
Why don't you add this condition to the ON clause?
ON b.start_time = a.start_time AND a.id = b.id
You wrote: "For each row in table 1 I need to join corresponding row of table 2 where start_time and end_time are similar."
So the SQL query should include end_time in the join condition as well:
SELECT a.id,
a.start_time,
a.end_time,
a.my_val1,
b.my_val2
FROM tbl1 a
LEFT JOIN tbl2 b
ON b.start_time = a.start_time
AND b.end_time = a.end_time
ORDER BY a.id;
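To see why joining on start_time alone multiplies rows, here is a small sketch with Python's built-in sqlite3 module. It is an illustration only: SQLite stands in for PostgreSQL, timestamps are stored as text, the four tbl1 rows are taken from the question, and the my_val2 values in tbl2 are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl1 (id INTEGER, start_time TEXT, end_time TEXT, my_val1 REAL)")
conn.execute("CREATE TABLE tbl2 (start_time TEXT, end_time TEXT, my_val2 REAL)")

windows = [
    (54, "1994-12-29 09:25:00", "1994-12-31 23:59:59", 2.9),
    (54, "1995-01-01 00:00:00", "1995-02-05 13:50:00", 2.9),
    (173, "1994-12-29 11:40:00", "1994-12-31 23:59:59", 6.7),
    (173, "1995-01-01 00:00:00", "1995-02-05 15:30:00", 6.7),
]
conn.executemany("INSERT INTO tbl1 VALUES (?, ?, ?, ?)", windows)
# tbl2 has the same time windows; the my_val2 values are made up.
conn.executemany(
    "INSERT INTO tbl2 VALUES (?, ?, ?)",
    [(start, end, 10.0 + i) for i, (_, start, end, _) in enumerate(windows)],
)

select = """
    SELECT a.id, a.start_time, a.end_time, a.my_val1, b.my_val2
    FROM tbl1 a
    LEFT JOIN tbl2 b ON {condition}
    ORDER BY a.id
"""
# Two windows share the start_time 1995-01-01 00:00:00, so each matches twice.
start_only = conn.execute(
    select.format(condition="b.start_time = a.start_time")
).fetchall()
# Matching the whole window gives exactly one tbl2 row per tbl1 row.
both = conn.execute(
    select.format(condition="b.start_time = a.start_time AND b.end_time = a.end_time")
).fetchall()

print(len(start_only), len(both))
```

If the full result still has more rows than tbl1 after adding end_time, then tbl2 contains several rows with an identical window, and another column is needed to tell them apart.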

Select values based on another cell's values, then calculate a statistic from those cells and put it in a specific cell

I have a dataset of daily stream runoff values for the past 11 years. It looks like this:
ID Year DD Apr May Jun Jul Aug Sep Oct
08HF004 2000 1 26.5 37.6 18.3 12.3 8.35 5.19 7.98
08HF004 2000 2 28.8 25.8 19.3 10.4 6.86 4.61 5.86
08HF004 2000 3 34.7 22.8 25.9 9.32 5.82 4.07 4.71
08HF004 2000 4 29.7 19.4 33.8 9.16 5.5 3.61 4.01
08HF004 2000 5 19.9 17.5 38.6 9.01 5.39 3.32 3.53
08HF004 2000 6 15 14.6 33.1 9.04 5.22 3.32 3.2
08HF004 2000 7 11.6 14.1 27 10.3 4.83 4.55 2.96
...and so forth for 400+ more lines. What I want to do is use VBA to select all the values from each month (April 2000, May 2000, etc) and figure out the average and standard deviation from each month and send them to a cell in the worksheet, or a cell in another worksheet, or, ideally, a new workbook in the directory I can just call "results".
I suggest a PivotTable (one month per table) - Year for ROWS and say April for VALUES (once as Average of and once as StdDev of or StdDevp of).
Or you might 'flatten' the data (e.g. as shown here) and use different views of a single PivotTable.
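If you would rather compute the figures outside Excel, the same per-month aggregation can be sketched in a few lines of Python. This is an alternative to the PivotTable, not the VBA the question asks for. It uses only the seven sample rows and the Apr and May columns shown in the question, and it assumes the data has already been read into tuples.

```python
from collections import defaultdict
from statistics import mean, stdev

months = ["Apr", "May"]
# (ID, Year, DD, Apr, May) taken from the sample rows in the question.
rows = [
    ("08HF004", 2000, 1, 26.5, 37.6),
    ("08HF004", 2000, 2, 28.8, 25.8),
    ("08HF004", 2000, 3, 34.7, 22.8),
    ("08HF004", 2000, 4, 29.7, 19.4),
    ("08HF004", 2000, 5, 19.9, 17.5),
    ("08HF004", 2000, 6, 15.0, 14.6),
    ("08HF004", 2000, 7, 11.6, 14.1),
]

# Collect the daily values for every (year, month) pair.
values = defaultdict(list)
for station, year, day, *monthly in rows:
    for month, value in zip(months, monthly):
        values[(year, month)].append(value)

# mean and sample standard deviation per (year, month).
stats = {key: (mean(v), stdev(v)) for key, v in values.items()}
for (year, month), (avg, sd) in sorted(stats.items()):
    print(year, month, round(avg, 2), round(sd, 2))
```

statistics.stdev is the sample standard deviation (the counterpart of StdDev in a PivotTable); statistics.pstdev would be the population version (StdDevp).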

Would like to return a fake row if there is no match to my pair (for a year)

I would like to clean up some data returned from a query. This query :
select seriesId,
startDate,
reportingCountyId,
countyId,
countyName,
pocId,
pocValue
from someTable
where seriesId = 147
and pocid = 2
and countyId in (2033,2040)
order by startDate
usually returns 2 county matches for all years:
seriesId startDate reportingCountyId countyId countyName pocId pocValue
147 2004-01-01 00:00:00.000 6910 2040 CountyOne 2 828
147 2005-01-01 00:00:00.000 2998 2033 CountyTwo 2 4514
147 2005-01-01 00:00:00.000 3000 2040 CountyOne 2 2446
147 2006-01-01 00:00:00.000 3018 2033 CountyTwo 2 5675
147 2006-01-01 00:00:00.000 4754 2040 CountyOne 2 2265
147 2007-01-01 00:00:00.000 3894 2033 CountyTwo 2 6250
147 2007-01-01 00:00:00.000 3895 2040 CountyOne 2 2127
147 2008-01-01 00:00:00.000 4842 2033 CountyTwo 2 5696
147 2008-01-01 00:00:00.000 4846 2040 CountyOne 2 2013
147 2009-01-01 00:00:00.000 6786 2033 CountyTwo 2 2578
147 2009-01-01 00:00:00.000 6817 2040 CountyTwo 2 1933
147 2010-01-01 00:00:00.000 6871 2040 CountyOne 2 1799
147 2010-01-01 00:00:00.000 6872 2033 CountyTwo 2 4223
147 2011-01-01 00:00:00.000 8314 2033 CountyTwo 2 3596
147 2011-01-01 00:00:00.000 8315 2040 CountyOne 2 1559
But note please that the first entry has only CountyOne for 2004. I would like to return a fake row for CountyTwo for a graph I am doing. It would be sufficient to fill it like CountyOne only with pocValue = 0.
thanks!!!!!!!!
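Try this (if you need a placeholder row for a missing countyId):
;WITH CTE AS
(SELECT 2033 AS countyId UNION ALL SELECT 2040),
CTE2 AS
(
SELECT seriesId, startDate, reportingCountyId,
countyId, countyName, pocId, pocValue
FROM someTable
WHERE seriesId = 147 AND pocId = 2 AND countyId IN (2033, 2040)
),
Years AS
(SELECT DISTINCT seriesId, startDate, pocId FROM CTE2)
-- One row per year and county; pairs with no data get pocValue = 0
SELECT y.seriesId, y.startDate, x2.reportingCountyId, x.countyId,
x2.countyName, y.pocId, ISNULL(x2.pocValue, 0) AS pocValue
FROM Years y
CROSS JOIN CTE x
LEFT OUTER JOIN CTE2 x2 ON x2.countyId = x.countyId AND x2.startDate = y.startDate
ORDER BY y.startDate, x.countyId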
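The general idea of building every (year, county) pair first and then left-joining the real rows can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for SQL Server (so COALESCE replaces IsNull), the table is reduced to the columns needed, and only three rows from the question are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE someTable (seriesId INTEGER, startDate TEXT, countyId INTEGER, "
    "countyName TEXT, pocId INTEGER, pocValue INTEGER)"
)
# 2004 has data for county 2040 only; 2005 has both counties.
conn.executemany(
    "INSERT INTO someTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (147, "2004-01-01", 2040, "CountyOne", 2, 828),
        (147, "2005-01-01", 2033, "CountyTwo", 2, 4514),
        (147, "2005-01-01", 2040, "CountyOne", 2, 2446),
    ],
)

query = """
WITH counties(countyId) AS (SELECT 2033 UNION ALL SELECT 2040),
     years AS (SELECT DISTINCT startDate
               FROM someTable
               WHERE seriesId = 147 AND pocId = 2)
SELECT y.startDate, c.countyId, COALESCE(t.pocValue, 0) AS pocValue
FROM years y
CROSS JOIN counties c                 -- every (year, county) pair
LEFT JOIN someTable t                 -- real data where it exists
       ON t.startDate = y.startDate
      AND t.countyId = c.countyId
      AND t.seriesId = 147
      AND t.pocId = 2
ORDER BY y.startDate, c.countyId
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The filters on seriesId and pocId sit in the ON clause rather than in a WHERE clause; putting them in WHERE would discard the unmatched pairs and turn the left join back into an inner join.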