Efficient time series querying in Postgres

I have a table in my PG db that looks somewhat like this:
id | widget_id | for_date | score
Each referenced widget has a lot of these items. It's always 1 per day per widget, but there are gaps.
What I want to get is a result that contains all the widgets for each date since X. The dates are brought in via generate_series:
SELECT date.date::date
FROM generate_series('2012-01-01'::timestamp with time zone,'now'::text::date::timestamp with time zone, '1 day') date(date)
ORDER BY date.date DESC;
If there is no entry for a date for a given widget_id, I want to use the previous one. So say widget 1337 doesn't have an entry on 2012-05-10 but does on 2012-05-08; then I want the result set to show the 2012-05-08 entry on 2012-05-10 as well:
Actual data:
widget_id | for_date | score
1312 | 2012-05-07 | 20
1337 | 2012-05-07 | 12
1337 | 2012-05-08 | 41
1337 | 2012-05-11 | 500
Desired output based on generate series:
widget_id | for_date | score
1312 | 2012-05-07 | 20
1337 | 2012-05-07 | 12
1312 | 2012-05-08 | 20
1337 | 2012-05-08 | 41
1312 | 2012-05-09 | 20
1337 | 2012-05-09 | 41
1312 | 2012-05-10 | 20
1337 | 2012-05-10 | 41
1312 | 2012-05-11 | 20
1337 | 2012-05-11 | 500
Eventually I want to boil this down into a view so I have consistent data sets per day that I can query easily.
Edit: Made the sample data and expected resultset clearer
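For reference, the eventual view is just a thin wrapper around whichever gap-filling query below gets picked. A minimal sketch, reusing the generic window-function query from one of the answers (the view name is made up, and the start date and column list would need adjusting):

CREATE VIEW widget_scores_daily AS
SELECT a.day AS for_date, a.widget_id, s.score
FROM (
    SELECT d.day, w.widget_id
          ,max(s.for_date) OVER (PARTITION BY w.widget_id ORDER BY d.day) AS effective_date
    FROM (SELECT generate_series('2012-01-01'::date, now()::date, '1 day')::date AS day) d
    CROSS JOIN (SELECT DISTINCT widget_id FROM score) w
    LEFT JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
) a
JOIN score s ON s.for_date = a.effective_date
            AND s.widget_id = a.widget_id;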

SQL Fiddle
select
    widget_id,
    for_date,
    case
        when score is not null then score
        else first_value(score) over (partition by widget_id, c order by for_date)
    end score
from (
    select
        a.widget_id,
        a.for_date,
        s.score,
        count(score) over (partition by a.widget_id order by a.for_date) c
    from (
        select widget_id, g.d::date for_date
        from (
            select distinct widget_id
            from score
        ) s
        cross join generate_series(
            (select min(for_date) from score),
            (select max(for_date) from score),
            '1 day'
        ) g(d)
    ) a
    left join score s on a.widget_id = s.widget_id and a.for_date = s.for_date
) s
order by widget_id, for_date

First of all, you can have a much simpler generate_series() table expression. Equivalent to yours (except for the descending order, which contradicts the rest of your question anyway):
SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
The type date is coerced to timestamptz automatically on input. The return type is timestamptz either way. I use a subquery below so I can cast the output to date right away.
Next, max() as window function returns exactly what you need: the highest value since frame start ignoring NULL values. Building on that, you get a radically simple query.
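To see that behaviour in isolation, here is a throwaway example (not tied to the score table): the running max simply carries the last non-NULL value across the NULL rows.

SELECT i, v, max(v) OVER (ORDER BY i) AS carried
FROM (VALUES (1, 10), (2, NULL), (3, NULL), (4, 40)) t(i, v);
-- carried comes out as 10, 10, 10, 40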
For a given widget_id
Most likely faster than involving CROSS JOIN or WITH RECURSIVE:
SELECT a.day, s.*
FROM (
    SELECT d.day
          ,max(s.for_date) OVER (ORDER BY d.day) AS effective_date
    FROM (
        SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
    ) d(day)
    LEFT JOIN score s ON s.for_date = d.day
                     AND s.widget_id = 1337  -- "for a given widget_id"
) a
LEFT JOIN score s ON s.for_date = a.effective_date
                 AND s.widget_id = 1337
ORDER BY a.day;
->sqlfiddle
With this query you can put any column from score you like into the final SELECT list. I put s.* for simplicity. Pick your columns.
If you want to start your output with the first day that actually has a score, simply replace the last LEFT JOIN with JOIN.
Generic form for all widget_id's
Here I use a CROSS JOIN to produce a row for every widget on every date:
SELECT a.day, a.widget_id, s.score
FROM (
    SELECT d.day, w.widget_id
          ,max(s.for_date) OVER (PARTITION BY w.widget_id
                                 ORDER BY d.day) AS effective_date
    FROM (SELECT generate_series('2012-05-05'::date
                                ,'2012-05-15'::date, '1d')::date AS day) d
    CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
    LEFT JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
) a
JOIN score s ON s.for_date = a.effective_date
            AND s.widget_id = a.widget_id  -- instead of LEFT JOIN
ORDER BY a.day, a.widget_id;
->sqlfiddle
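Either variant hits the score table twice with (widget_id, for_date) lookups, so on a non-trivial table a matching multicolumn index is worth having. This is an assumption about the workload and the index name is made up:

CREATE INDEX score_widget_id_for_date_idx ON score (widget_id, for_date);
-- supports "s.widget_id = ? AND s.for_date = ?" in both the LEFT JOIN and the final join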

Using your table structure, I created the following Recursive CTE which starts with your MIN(For_Date) and increments until it reaches the MAX(For_Date). Not sure if there is a more efficient way, but this appears to work well:
WITH RECURSIVE nodes_cte(widgetid, for_date, score) AS (
    -- First Widget Using Min Date
    SELECT
        w.widgetId,
        w.for_date,
        w.score
    FROM widgets w
    INNER JOIN (
        SELECT widgetId, Min(for_date) min_for_date
        FROM widgets
        GROUP BY widgetId
    ) minW ON w.widgetId = minW.widgetid
          AND w.for_date = minW.min_for_date
    UNION ALL
    SELECT
        n.widgetId,
        n.for_date + 1 for_date,
        coalesce(w.score, n.score) score
    FROM nodes_cte n
    INNER JOIN (
        SELECT widgetId, Max(for_date) max_for_date
        FROM widgets
        GROUP BY widgetId
    ) maxW ON n.widgetId = maxW.widgetId
    LEFT JOIN widgets w ON n.widgetid = w.widgetid
                       AND n.for_date + 1 = w.for_date
    WHERE n.for_date + 1 <= maxW.max_for_date
)
SELECT *
FROM nodes_cte
ORDER BY for_date
Here is the SQL Fiddle.
And the returned results (format the date however you'd like):
WIDGETID FOR_DATE SCORE
1337 May, 07 2012 00:00:00+0000 12
1337 May, 08 2012 00:00:00+0000 41
1337 May, 09 2012 00:00:00+0000 41
1337 May, 10 2012 00:00:00+0000 41
1337 May, 11 2012 00:00:00+0000 500
Please note, this assumes your For_Date field is a Date; if it includes a Time, then you may need to use INTERVAL '1 day' in the query above instead.
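For illustration, the difference in arithmetic is just this (standalone Postgres snippet, not part of the answer above):

SELECT DATE '2012-05-07' + 1                            AS next_day,  -- date + integer = date
       TIMESTAMP '2012-05-07 00:00' + INTERVAL '1 day'  AS next_ts;   -- timestamps need an interval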
Hope this helps.

The data:
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE widget
( widget_id INTEGER NOT NULL
, for_date DATE NOT NULL
, score INTEGER
, PRIMARY KEY (widget_id,for_date)
);
INSERT INTO widget(widget_id , for_date , score) VALUES
(1312, '2012-05-07', 20)
, (1337, '2012-05-07', 12)
, (1337, '2012-05-08', 41)
, (1337, '2012-05-11', 500)
;
The query:
SELECT w.widget_id AS widget_id
, cal::date AS for_date
-- , w.for_date AS org_date
, w.score AS score
FROM generate_series( '2012-05-07'::timestamp , '2012-05-11'::timestamp
, '1day'::interval) AS cal
-- "half cartesian" Join;
-- will be restricted by the NOT EXISTS() below
LEFT JOIN widget w ON w.for_date <= cal
WHERE NOT EXISTS (
SELECT * FROM widget nx
WHERE nx.widget_id = w.widget_id
AND nx.for_date <= cal
AND nx.for_date > w.for_date
)
ORDER BY cal, w.widget_id
;
The result:
widget_id | for_date | score
-----------+------------+-------
1312 | 2012-05-07 | 20
1337 | 2012-05-07 | 12
1312 | 2012-05-08 | 20
1337 | 2012-05-08 | 41
1312 | 2012-05-09 | 20
1337 | 2012-05-09 | 41
1312 | 2012-05-10 | 20
1337 | 2012-05-10 | 41
1312 | 2012-05-11 | 20
1337 | 2012-05-11 | 500
(10 rows)
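On Postgres 9.3 or later, the same "latest row at or before each day" lookup can also be expressed with a LATERAL subquery instead of the NOT EXISTS anti-join. A sketch against the same tmp.widget table (not part of the original answer, untested):

SELECT cal::date AS for_date, ids.widget_id, w.score
FROM generate_series('2012-05-07'::timestamp, '2012-05-11'::timestamp, '1 day') AS cal
CROSS JOIN (SELECT DISTINCT widget_id FROM widget) ids
LEFT JOIN LATERAL (
    SELECT score
    FROM   widget
    WHERE  widget_id = ids.widget_id
    AND    for_date <= cal
    ORDER  BY for_date DESC
    LIMIT  1
) w ON true
ORDER BY for_date, ids.widget_id;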

Related

Redshift: Add Row for each hour in a day

I have a table that contains item-wise quantity at different hours of a date. I'm trying to add a row for each hour (24 entries in a day) carrying the previously available quantity forward. For example, for hours 2-10 it will be 5.
I created a table with hour entries (1-24) and did a full join with the shared table.
How can I add the previous available entry? Need suggestions.
item_id| date | hour| quantity
101 | 2022-04-25 | 2 | 5
101 | 2022-04-25 | 10 | 13
101 | 2022-04-25 | 18 | 67
101 | 2022-04-25 | 23 | 27
You can try to use generate_series to generate the hour numbers and let it be the OUTER JOIN base table,
then use a correlated subquery to get your expected quantity column:
SELECT t1.*,
       (SELECT quantity
        FROM T tt
        WHERE t1.item_id = tt.item_id
          AND t1.date = tt.date
          AND t1.hour >= tt.hour
        ORDER BY tt.hour DESC
        LIMIT 1) quantity
FROM (
    SELECT DISTINCT item_id, date, v.hour
    FROM generate_series(1, 24) v(hour)
    CROSS JOIN T
) t1
ORDER BY t1.hour
Provided the table of integers 1..24 is all24(hour), you can use lead and a join:
select t.item_id, t.date, all24.hour, t.quantity
from all24
join (
    select *,
           lead(hour, 1, 25) over (partition by item_id, date order by hour) - 1 nxt_h
    from tbl
) t on all24.hour between t.hour and t.nxt_h
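If an all24 table doesn't exist yet, one common way to materialize it is to number rows from any table that has at least 24 rows (a sketch; some_big_table and some_column are placeholders, so adapt to whatever is handy on your cluster):

CREATE TABLE all24 AS
SELECT row_number() OVER (ORDER BY some_column) AS hour  -- produces 1 .. 24
FROM   some_big_table
LIMIT  24;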

LEFT JOIN match. If no match, need to match on most recent date

My current SQL code:
SELECT
[Date], [Count]
FROM
Calendar_Table pdv
LEFT JOIN
(SELECT
COUNT([FILE NAME]) AS [Count], [CLOSE DT]
FROM
Production_Table
GROUP BY
[CLOSE DT]) [Group] ON [pdv].[Date] = [Group].[CLOSE DT]
ORDER BY
[Date]
Please see my code above. Calendar_Table is a simple table, 1 row for every date. Production_Table gives the products sold each day. If the left join produces a NULL, I want the most recent non-NULL value instead.
Current output:
Date | Count
-----------+--------
9/4/2019 | NULL
9/5/2019 | 1
9/6/2019 | 4
9/7/2019 | NULL
9/8/2019 | 7
9/9/2019 | 11
9/10/2019 | NULL
9/11/2019 | 14
9/12/2019 | NULL
9/13/2019 | 19
Desired output:
Date | Count
-----------+--------
9/4/2019 | 0
9/5/2019 | 1
9/6/2019 | 4
9/7/2019 | 4
9/8/2019 | 7
9/9/2019 | 11
9/10/2019 | 11
9/11/2019 | 14
9/12/2019 | 14
9/13/2019 | 19
One option is a lateral join:
select c.date, p.*
from calendar_table c
outer apply (
select top (1) count(file_name) as cnt, close_dt
from production_table p
where p.close_dt <= c.date
group by p.close_dt
order by p.close_dt desc
) p
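Since the desired output shows 0 rather than NULL for dates before the first sale, the lateral version can wrap the count in COALESCE; a small variation on the query above:

select c.date, coalesce(p.cnt, 0) as [Count]
from calendar_table c
outer apply (
    select top (1) count(file_name) as cnt, close_dt
    from production_table p
    where p.close_dt <= c.date
    group by p.close_dt
    order by p.close_dt desc
) p
order by c.date;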
As an alternative, we can use an equi-join to bring in the matching dates, as in your original query, and then fill the gaps with window functions. The basic idea is to build groups that reset every time a match is met.
select date, coalesce(max(cnt) over (partition by grp), 0) as cnt
from (
    select c.date, p.cnt,
           sum(case when p.close_dt is null then 0 else 1 end) over (order by c.date) as grp
    from calendar_table c
    left join (
        select close_dt, count(file_name) as cnt
        from production_table p
        group by close_dt
    ) p on p.close_dt = c.date
) t
Depending on your data, one solution or the other may perform better.

Select Top 20 Distinct Rows in Each Category

I have a database table in the following format.
Product | Date | Score
A | 01/01/18 | 99
B | 01/01/18 | 98
C | 01/01/18 | 97
--------------------------
A | 02/01/18 | 99
B | 02/01/18 | 98
C | 02/01/18 | 97
--------------------------
D | 03/01/18 | 99
A | 03/01/18 | 98
B | 03/01/18 | 97
C | 03/01/18 | 96
I want to pick the first from every month such that there are no repeat products. For example, the output of the above table should be
Product | Date | Score
A | 01/01/18 | 99
B | 02/01/18 | 98
D | 03/01/18 | 99
How do I get this result with a single sql query? The actual table is much bigger than this and I want top 20 from every month without repetition.
This is a hard problem -- a type of subgraph problem that isn't really well suited to SQL. There is a brute-force approach:
with jan as (
      select *
      from t
      where date = '2018-01-01'
      order by score desc
      limit 1
     ),
     feb as (
      select *
      from t
      where date = '2018-02-01' and
            product not in (select product from jan)
      order by score desc
      limit 1
     ),
     mar as (
      select *
      from t
      where date = '2018-03-01' and
            product not in (select product from jan) and
            product not in (select product from feb)
      order by score desc
      limit 1
     )
select *
from jan
union all
select *
from feb
union all
select *
from mar;
You can generalize this with additional CTEs. But there is no guarantee that a month will have a product -- even when it could have had one.
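For example, adding April means appending one more CTE and one more UNION ALL branch to the query above (a fragment, not a standalone query; table t as before):

, apr as (
      select *
      from t
      where date = '2018-04-01' and
            product not in (select product from jan) and
            product not in (select product from feb) and
            product not in (select product from mar)
      order by score desc
      limit 1
     )
-- ... and at the end:
union all
select *
from apr;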
It is possible by using row_number.
select *
from (
    select row_number() over (partition by Product order by Product) as rno, *
    from Products
) as t
where t.rno <= 20
If you want the top 20 records every month without repeating products, then the solution below will work:
select *
into #temp
from (values
      ('A', '01/01/18', '99')
    , ('B', '01/01/18', '98')
    , ('C', '01/01/18', '97')
    , ('A', '02/01/18', '99')
    , ('B', '02/01/18', '98')
    , ('C', '02/01/18', '97')
    , ('D', '03/01/18', '99')
    , ('A', '03/01/18', '98')
    , ('B', '03/01/18', '97')
    , ('C', '03/01/18', '96')
) as VTE (Product, Date, Score)

select *
from (
    select *, ROW_NUMBER() over (partition by date, product order by score) as rn
    from #temp
) A
where rn < 20

Find max, min, avg, percentile of count(*) per mmdd PostgreSQL

Postgres version 9.4.18, PostGIS Version 2.2.
Here are the tables I'm working with (and can unlikely make significant changes to the table structure):
Table ltg_data (spans 1988 to 2018):
Column | Type | Modifiers
----------+--------------------------+-----------
intensity | integer | not null
time | timestamp with time zone | not null
lon | numeric(9,6) | not null
lat | numeric(8,6) | not null
ltg_geom | geometry(Point,4269) |
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")
Size of ltg_data (~800M rows):
ltg=# select pg_relation_size('ltg_data');
pg_relation_size
------------------
149729288192
Table counties:
Column | Type | Modifiers
-----------+-----------------------------+--------------------------------- -----------------------
gid | integer | not null default
nextval('counties_gid_seq'::regclass)
objectid_1 | integer |
objectid | integer |
state | character varying(2) |
cwa | character varying(9) |
countyname | character varying(24) |
fips | character varying(5) |
time_zone | character varying(2) |
fe_area | character varying(2) |
lon | double precision |
lat | double precision |
the_geom | geometry(MultiPolygon,4269) |
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)
I have a query that calculates the total number of rows per day of the year (month-day) spanning the 30 years. With the help of Stackoverflow, the query to get these counts is working fine. Here's the query and results, using the following function.
Function:
CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
$$SELECT to_char($1, 'MMDD')::int$$;
Query:
SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd -- ignoring the year
FROM generate_series(timestamp '2018-01-01' -- any dummy year
, timestamp '2018-12-31'
, interval '1 day') d
) d
LEFT JOIN (
SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
FROM counties c
JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
) ct USING (mmdd)
ORDER BY 1;
Results:
mmdd | total_count
725 | 2126
726 | 558
727 | 2
728 | 2
729 | 2
730 | 0
731 | 0
801 | 0
802 | 10
Desired Results: I'm trying to find other statistical information about the counts for the days of the year. For instance, I know that on July 25 (725 in the table below) the total count over all the years in the table is 2126. What I'm looking for, per day of year, is the max daily count, the year that max occurred, the min, the percent of years where count(*) is not zero, and percentiles (10th, 25th, 50th, 75th and 90th; stddev would be useful too). If there haven't been any counts for that day in any of the years, the year_max_daily would be blank or zero.
mmdd | total_count | max_daily | year_max_daily | percent_years_count_not_zero | 10th_percentile_daily | 90th_percentile_daily
725  | 2126        | 1000      | 1990           | 30                           | 15                    | 900
726  | 558         | 120       | 1992           | 20                           | 10                    | 80
727  | 2           | 1         | 1991           | 2                            | 0                     | 1
728  | 2           | 1         | 1990           | 2                            | 0                     | 1
729  | 2           | 1         | 1989           | 2                            | 0                     | 1
730  | 0           | 0         |                | 0                            | 0                     | 0
731  | 0           | 0         |                | 0                            | 0                     | 0
801  | 0           | 0         |                | 0                            | 0                     | 0
802  | 10          | 10        | 1990           | 0                            | 1                     | 8
What I've tried thus far just isn't working; it returns the same results as the totals above. I think that's because I'm trying to take an average after the totals have already been calculated, so I'm not really looking at the counts for each day of each year and averaging those.
Attempt:
SELECT AVG(CAST(total_count as FLOAT)), day
FROM
(
SELECT d.mmdd as day, COALESCE(ct.ct, 0) as total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd
FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 day') d
) d
LEFT JOIN (
SELECT mmdd, avg(q.ct) FROM (
SELECT f_mmdd((time at time zone 'utc+12')::date) as mmdd, count(*) as ct
FROM counties c
JOIN ltg_data d on ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
)
) as q
ct USING (mmdd)
ORDER BY 1
Thanks for any help!
I haven't included calculations for all requested stats - there is too much in one question, but I hope that you'd be able to extend the query below and add extra stats that you need.
I'm using CTEs below to make the query readable. If you want, you can put it all in one huge query. I'd recommend running the query step by step, CTE by CTE, and examining the intermediate results to understand how it works.
CTE_Dates is a simple list of all possible dates for 30 years.
CTE_DailyCounts is a list of basic counts for each day for 30 years (I took your existing query for that).
CTE_FullStats is again a list of all dates together with some stats calculated for each (month, day) using window functions partitioned by month and day. ROW_NUMBER there is used to find, within each (month, day), the date whose count was the largest, i.e. the year of the maximum daily count.
The final query keeps only that one row per (month, day) with the largest count, along with the rest of the information.
I didn't try to run the query, because the question doesn't have sample data, so there may be some typos.
WITH
CTE_Dates
AS
(
SELECT
d::date AS dt
,EXTRACT(MONTH FROM d::date) AS dtMonth
,EXTRACT(DAY FROM d::date) AS dtDay
,EXTRACT(YEAR FROM d::date) AS dtYear
FROM
generate_series(timestamp '1988-01-01', timestamp '2018-12-31', interval '1 day') AS d
-- full range of possible dates
)
,CTE_DailyCounts
AS
(
SELECT
time::date AS dt
,count(*) AS ct
FROM
counties c
INNER JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY time::date
)
,CTE_FullStats
AS
(
SELECT
CTE_Dates.dt
,CTE_Dates.dtMonth
,CTE_Dates.dtDay
,CTE_Dates.dtYear
,CTE_DailyCounts.ct
,SUM(CTE_DailyCounts.ct) OVER (PARTITION BY dtMonth, dtDay) AS total_count
,MAX(CTE_DailyCounts.ct) OVER (PARTITION BY dtMonth, dtDay) AS max_daily
,SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay) AS nonzero_day_count
,COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS years_count
,100.0 * SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay)
/ COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS percent_years_count_not_zero
,ROW_NUMBER() OVER (PARTITION BY dtMonth, dtDay ORDER BY CTE_DailyCounts.ct DESC) AS rn
FROM
CTE_Dates
LEFT JOIN CTE_DailyCounts ON CTE_DailyCounts.dt = CTE_Dates.dt
)
SELECT
dtMonth
,dtDay
,total_count
,max_daily
,dtYear AS year_max_daily
,percent_years_count_not_zero
FROM
CTE_FullStats
WHERE
rn = 1
ORDER BY
dtMonth
,dtDay
;
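To get the percentile and standard-deviation columns from the question, one option (a sketch that reuses CTE_Dates and CTE_DailyCounts defined above, assumes Postgres 9.4's ordered-set aggregates, and treats missing days as zero counts) is to aggregate per month/day instead of windowing:

SELECT
    d.dtMonth
   ,d.dtDay
   ,SUM(COALESCE(c.ct, 0))                                          AS total_count
   ,MAX(COALESCE(c.ct, 0))                                          AS max_daily
   ,PERCENTILE_CONT(0.1) WITHIN GROUP (ORDER BY COALESCE(c.ct, 0))  AS percentile_10_daily
   ,PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY COALESCE(c.ct, 0))  AS percentile_90_daily
   ,STDDEV_SAMP(COALESCE(c.ct, 0))                                  AS stddev_daily
FROM CTE_Dates d
LEFT JOIN CTE_DailyCounts c ON c.dt = d.dt
GROUP BY d.dtMonth, d.dtDay
ORDER BY d.dtMonth, d.dtDay;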

Get Monthly Totals from Running Totals

I have a table in a SQL Server 2008 database with two columns that hold running totals called Hours and Starts. Another column, Date, holds the date of a record. The dates are sporadic throughout any given month, but there's always a record for the last hour of the month.
For example:
ContainerID | Date | Hours | Starts
1 | 2010-12-31 23:59 | 20 | 6
1 | 2011-01-15 00:59 | 23 | 6
1 | 2011-01-31 23:59 | 30 | 8
2 | 2010-12-31 23:59 | 14 | 2
2 | 2011-01-18 12:59 | 14 | 2
2 | 2011-01-31 23:59 | 19 | 3
How can I query the table to get the total number of hours and starts for each month between two specified years? (In this case 2011 and 2013.) I know that I need to take the values from the last record of one month and subtract it by the values from the last record of the previous month. I'm having a hard time coming up with a good way to do this in SQL, however.
As requested, here are the expected results:
ContainerID | Date | MonthlyHours | MonthlyStarts
1 | 2011-01-31 23:59 | 10 | 2
2 | 2011-01-31 23:59 | 5 | 1
Try this:
SELECT c1.ContainerID,
c1.Date,
c1.Hours-c3.Hours AS "MonthlyHours",
c1.Starts - c3.Starts AS "MonthlyStarts"
FROM Containers c1
LEFT OUTER JOIN Containers c2 ON
c1.ContainerID = c2.ContainerID
AND datediff(MONTH, c1.Date, c2.Date)=0
AND c2.Date > c1.Date
LEFT OUTER JOIN Containers c3 ON
c1.ContainerID = c3.ContainerID
AND datediff(MONTH, c1.Date, c3.Date)=-1
LEFT OUTER JOIN Containers c4 ON
c3.ContainerID = c4.ContainerID
AND datediff(MONTH, c3.Date, c4.Date)=0
AND c4.Date > c3.Date
WHERE
c2.ContainerID is null
AND c4.ContainerID is null
AND c3.ContainerID is not null
ORDER BY c1.ContainerID, c1.Date
Using a recursive CTE and a somewhat 'creative' JOIN condition, you can fetch the next month's value for each ContainerID:
WITH CTE_PREP AS
(
--RN will be 1 for last row in each month for each container
--MonthRank will be sequential number for each subsequent month (to increment easier)
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY ContainerID, YEAR(Date), MONTH(DATE) ORDER BY Date DESC) RN
,DENSE_RANK() OVER (ORDER BY YEAR(Date),MONTH(Date)) MonthRank
FROM Table1
)
, RCTE AS
(
--"Zero row", last row in decembar 2010 for each container
SELECT *, Hours AS MonthlyHours, Starts AS MonthlyStarts
FROM CTE_Prep
WHERE YEAR(date) = 2010 AND MONTH(date) = 12 AND RN = 1
UNION ALL
--for each next row just join on MonthRank + 1
SELECT t.*, t.Hours - r.Hours, t.Starts - r.Starts
FROM RCTE r
INNER JOIN CTE_Prep t ON r.ContainerID = t.ContainerID AND r.MonthRank + 1 = t.MonthRank AND t.Rn = 1
)
SELECT ContainerID, Date, MonthlyHours, MonthlyStarts
FROM RCTE
WHERE Date >= '2011-01-01' --to eliminate "zero row"
ORDER BY ContainerID
SQLFiddle DEMO (I have added some data for February and March in order to test on different lengths of months)
Old version fiddle
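For completeness: on SQL Server 2012 or later (not the 2008 instance in the question), LAG() makes the month-over-month subtraction much shorter. A sketch, using the Containers table name from the first answer:

WITH month_end AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ContainerID, YEAR([Date]), MONTH([Date])
                              ORDER BY [Date] DESC) AS rn
    FROM Containers
)
SELECT ContainerID,
       [Date],
       Hours  - LAG(Hours)  OVER (PARTITION BY ContainerID ORDER BY [Date]) AS MonthlyHours,
       Starts - LAG(Starts) OVER (PARTITION BY ContainerID ORDER BY [Date]) AS MonthlyStarts
FROM   month_end
WHERE  rn = 1  -- keep only the last reading of each month
ORDER BY ContainerID, [Date];
-- the first month per container has no predecessor, so its Monthly* values come back NULL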