Find max, min, avg, percentile of count(*) per mmdd in PostgreSQL

Postgres version 9.4.18, PostGIS Version 2.2.
Here are the tables I'm working with (I'm unlikely to be able to make significant changes to the table structure):
Table ltg_data (spans 1988 to 2018):
Column | Type | Modifiers
----------+--------------------------+-----------
intensity | integer | not null
time | timestamp with time zone | not null
lon | numeric(9,6) | not null
lat | numeric(8,6) | not null
ltg_geom | geometry(Point,4269) |
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")
Size of ltg_data (~800M rows):
ltg=# select pg_relation_size('ltg_data');
pg_relation_size
------------------
149729288192
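For readability, pg_size_pretty can render the same figure (about 139 GB):
ltg=# select pg_size_pretty(pg_relation_size('ltg_data'));
pg_size_pretty
----------------
139 GB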
Table counties:
Column | Type | Modifiers
-----------+-----------------------------+--------------------------------------------------------
gid | integer | not null default nextval('counties_gid_seq'::regclass)
objectid_1 | integer |
objectid | integer |
state | character varying(2) |
cwa | character varying(9) |
countyname | character varying(24) |
fips | character varying(5) |
time_zone | character varying(2) |
fe_area | character varying(2) |
lon | double precision |
lat | double precision |
the_geom | geometry(MultiPolygon,4269) |
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)
I have a query that calculates the total number of rows per day of the year (month-day) spanning the 30 years. With the help of Stack Overflow, the query to get these counts is working fine. Here's the query and results, using the following function.
Function:
CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
$$SELECT to_char($1, 'MMDD')::int$$;
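For reference, the function simply maps any date to an integer month-day key, ignoring the year:
ltg=# select f_mmdd(date '2018-07-25');
f_mmdd
--------
725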
Query:
SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd -- ignoring the year
FROM generate_series(timestamp '2018-01-01' -- any dummy year
, timestamp '2018-12-31'
, interval '1 day') d
) d
LEFT JOIN (
SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
FROM counties c
JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
) ct USING (mmdd)
ORDER BY 1;
Results:
mmdd | total_count
725 | 2126
726 | 558
727 | 2
728 | 2
729 | 2
730 | 0
731 | 0
801 | 0
802 | 10
Desired Results: I'm trying to find other statistical information about the counts for the days of the year. For instance, I know that on July 25 (725 in the table below) the total count over the many years in the table is 2126. What I'm looking for, per day of year, is: the max daily count (and the year it occurred), the min, the percent of years where the count is not zero, and percentiles (10th, 25th, 50th, 75th, and 90th); stddev would be useful too. If there haven't been any counts for that day in all the years, year_max_daily would be blank or zero.
mmdd | total_count | max_daily | year_max_daily | percent_years_count_not_zero | 10th_percentile_daily | 90th_percentile_daily
725 | 2126 | 1000 | 1990 | 30 | 15 | 900
726 | 558 | 120 | 1992 | 20 | 10 | 80
727 | 2 | 1 | 1991 | 2 | 0 | 1
728 | 2 | 1 | 1990 | 2 | 0 | 1
729 | 2 | 1 | 1989 | 2 | 0 | 1
730 | 0 | 0 | | 0 | 0 | 0
731 | 0 | 0 | | 0 | 0 | 0
801 | 0 | 0 | | 0 | 0 | 0
802 | 10 | 10 | 1990 | 0 | 1 | 8
What I've tried thus far just isn't working. It returns the same results as the total count. I think it's because I'm just taking an average after the totals have already been calculated, so I'm not really looking at the counts for each day of each year and averaging those.
Attempt:
SELECT AVG(CAST(total_count as FLOAT)), day
FROM
(
SELECT d.mmdd as day, COALESCE(ct.ct, 0) as total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd
FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 day') d
) d
LEFT JOIN (
SELECT mmdd, avg(q.ct) FROM (
SELECT f_mmdd((time at time zone 'utc+12')::date) as mmdd, count(*) as ct
FROM counties c
JOIN ltg_data d on ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
)
) as q
ct USING (mmdd)
ORDER BY 1
Thanks for any help!

I haven't included calculations for all the requested stats (there is too much for one question), but I hope you'll be able to extend the query below and add the extra stats you need.
I'm using CTEs below to make the query readable. If you want, you can put it all in one huge query. I'd recommend running the query step by step, CTE by CTE, and examining the intermediate results to understand how it works.
CTE_Dates is a simple list of all possible dates for 30 years.
CTE_DailyCounts is a list of basic counts for each day for 30 years (I took your existing query for that).
CTE_FullStats is again a list of all dates together with some stats calculated for each (month, day) using window functions partitioned by month and day. ROW_NUMBER there is used to find, for each (month, day), the date (and therefore the year) with the largest count.
The final query selects only the one row per (month, day) with the largest count, along with the rest of the information.
I didn't try to run the query, because the question doesn't have sample data, so there may be some typos.
WITH
CTE_Dates
AS
(
SELECT
d::date AS dt
,EXTRACT(MONTH FROM d::date) AS dtMonth
,EXTRACT(DAY FROM d::date) AS dtDay
,EXTRACT(YEAR FROM d::date) AS dtYear
FROM
generate_series(timestamp '1988-01-01', timestamp '2018-12-31', interval '1 day') AS d
-- full range of possible dates
)
,CTE_DailyCounts
AS
(
SELECT
time::date AS dt
,count(*) AS ct
FROM
counties c
INNER JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY time::date
)
,CTE_FullStats
AS
(
SELECT
CTE_Dates.dt
,CTE_Dates.dtMonth
,CTE_Dates.dtDay
,CTE_Dates.dtYear
,CTE_DailyCounts.ct
,SUM(COALESCE(CTE_DailyCounts.ct, 0)) OVER (PARTITION BY dtMonth, dtDay) AS total_count -- COALESCE: days missing from CTE_DailyCounts come through the LEFT JOIN as NULL
,MAX(COALESCE(CTE_DailyCounts.ct, 0)) OVER (PARTITION BY dtMonth, dtDay) AS max_daily
,SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay) AS nonzero_day_count
,COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS years_count
,100.0 * SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay)
/ COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS percent_years_count_not_zero
,ROW_NUMBER() OVER (PARTITION BY dtMonth, dtDay ORDER BY CTE_DailyCounts.ct DESC NULLS LAST) AS rn -- NULLS LAST so empty days never win the max
FROM
CTE_Dates
LEFT JOIN CTE_DailyCounts ON CTE_DailyCounts.dt = CTE_Dates.dt
)
SELECT
dtMonth
,dtDay
,total_count
,max_daily
,dtYear AS year_max_daily
,percent_years_count_not_zero
FROM
CTE_FullStats
WHERE
rn = 1
ORDER BY
dtMonth
,dtDay
;
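For the percentiles and stddev, a simpler route is to aggregate once per (month, day) with GROUP BY instead of window functions. Here is an untested sketch reusing the same two CTEs; Postgres 9.4 supports percentile_cont(...) WITHIN GROUP (ordered-set aggregates) and stddev_pop, and the COALESCE turns missing days into 0 before aggregating. Column names are illustrative:
WITH CTE_Dates AS (
    SELECT d::date AS dt
         , EXTRACT(MONTH FROM d)::int AS dtMonth
         , EXTRACT(DAY FROM d)::int AS dtDay
    FROM generate_series(timestamp '1988-01-01', timestamp '2018-12-31', interval '1 day') AS d
)
,CTE_DailyCounts AS (
    SELECT time::date AS dt, count(*) AS ct
    FROM counties c
    INNER JOIN ltg_data l ON ST_Contains(c.the_geom, l.ltg_geom)
    WHERE cwa = 'MFR'
    GROUP BY time::date
)
SELECT dtMonth
     , dtDay
     , SUM(COALESCE(ct, 0)) AS total_count
     , MAX(COALESCE(ct, 0)) AS max_daily
     , MIN(COALESCE(ct, 0)) AS min_daily
     , STDDEV_POP(COALESCE(ct, 0)) AS stddev_daily
     , 100.0 * AVG(CASE WHEN COALESCE(ct, 0) > 0 THEN 1 ELSE 0 END) AS percent_years_count_not_zero
     , PERCENTILE_CONT(0.1)  WITHIN GROUP (ORDER BY COALESCE(ct, 0)) AS pctl10_daily
     , PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY COALESCE(ct, 0)) AS pctl25_daily
     , PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY COALESCE(ct, 0)) AS pctl50_daily
     , PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY COALESCE(ct, 0)) AS pctl75_daily
     , PERCENTILE_CONT(0.9)  WITHIN GROUP (ORDER BY COALESCE(ct, 0)) AS pctl90_daily
FROM CTE_Dates
LEFT JOIN CTE_DailyCounts USING (dt)
GROUP BY dtMonth, dtDay
ORDER BY dtMonth, dtDay;
percentile_disc could be used instead of percentile_cont if you want an actually-occurring daily count rather than an interpolated one; the year of the max is easiest to keep from the window-function version above.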

Related

Select most popular hour per country based on number of sales

I want to get the most popular hour for each country based on the max value of count(id), which tells how many purchases were made.
I've tried getting the max value of purchases and converting the timestamp into hours, but it always returns every hour for each country, when I want only a single hour (the one with the most purchases) per country.
The table is like:
id | country | time
1 | AE | 19:20:00.00000
1 | AE | 20:13:00.00000
3 | GB | 23:17:00.00000
4 | IN | 10:23:00.00000
6 | IN | 02:01:00.00000
7 | RU | 05:54:00.00000
2 | RU | 16:34:00.00000
SELECT max(purchases), country, tss
FROM (
SELECT time_trunc(time, hour) AS tss,
count(id) as purchases,
country
FROM spending
WHERE dt > date_sub(current_date(), interval 30 DAY)
GROUP BY tss, country
)
GROUP BY tss, country
Expected output:
amount of purchases | Country | Most popular Hour
34 | GB | 16:00
445 | US | 21:00
You can use window functions along with GROUP BY. Note that it uses the RANK function, so if one particular country has the same number of sales at, say, 11 AM and 2 PM, it'll return both hours for that country.
WITH cte AS (
SELECT country
, time_trunc(time, hour) AS hourofday
, COUNT(id) AS purchases
, RANK() OVER(PARTITION BY country ORDER BY COUNT(id) DESC) AS rnk
FROM t
GROUP BY country, time_trunc(time, hour)
)
SELECT *
FROM cte
WHERE rnk = 1
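If you need exactly one row per country even when two hours tie, an untested variant swaps RANK for ROW_NUMBER with an explicit tie-breaker (here the earlier hour, an arbitrary choice), mirroring the dialect used above:
WITH cte AS (
  SELECT country
       , time_trunc(time, hour) AS hourofday
       , COUNT(id) AS purchases
       -- ROW_NUMBER never produces ties; the second ORDER BY key decides between equal counts
       , ROW_NUMBER() OVER(PARTITION BY country
                           ORDER BY COUNT(id) DESC, time_trunc(time, hour)) AS rn
  FROM t
  GROUP BY country, time_trunc(time, hour)
)
SELECT country, hourofday, purchases
FROM cte
WHERE rn = 1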

Querying DAU/MAU over time (daily)

I have a daily sessions table with columns user_id and date. I'd like to graph out DAU/MAU (daily active users / monthly active users) on a daily basis. For example:
Date MAU DAU DAU/MAU
2014-06-01 20,000 5,000 20%
2014-06-02 21,000 4,000 19%
2014-06-03 20,050 3,050 17%
... ... ... ...
Calculating daily active users is straightforward, but calculating the monthly active users, i.e. the number of distinct users who logged in during the 30 days up to each day, is causing problems. How is this achieved without a left join for each day?
Edit: I'm using Postgres.
Assuming you have values for each day, you can get the total counts using a subquery and range between:
with dau as (
select date, count(user_id) as dau
from dailysessions ds
group by date
)
select date, dau,
sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;
Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much more difficult, especially because Postgres doesn't support count(distinct) as a window function.
I think you have to do some sort of self join for this. Here is one method:
with dau as (
select date, count(distinct user_id) as dau
from dailysessions ds
group by date
)
select date, dau,
(select count(distinct ds.user_id)
from dailysessions ds
where ds.date between dau.date - 29 * interval '1 day' and dau.date
) as mau
from dau;
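The same correlated count can also be phrased as a LATERAL join (Postgres 9.3+). This is just a sketch of the identical idea, assuming the date column is of type date (so date - 29 is simple day arithmetic):
with dau as (
      select date, count(distinct user_id) as dau
      from dailysessions
      group by date
)
select d.date, d.dau, m.mau
from dau d
cross join lateral (
      -- distinct users over the 30-day window ending on d.date
      select count(distinct ds.user_id) as mau
      from dailysessions ds
      where ds.date between d.date - 29 and d.date
) m;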
This one uses COUNT DISTINCT to get the rolling 30 days DAU/MAU:
(calculating reddit's user engagement in BigQuery - but the SQL is standard enough to be used on other databases)
SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
FROM (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
FROM [fh-bigquery:reddit_comments.2015_09]
WHERE subreddit='AskReddit') a
JOIN (
SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
CROSS JOIN (
SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
FROM [fh-bigquery:reddit_comments.2015_09]
GROUP BY 1
) b
WHERE subreddit='AskReddit'
AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
GROUP BY 1
) b
ON a.day=b.stopday
GROUP BY 1
)
ORDER BY 1
I went further at How to calculate DAU/MAU with BigQuery (engagement)
I've written about this on my blog.
The DAU is easy, as you noticed. You can solve the MAU by first creating a view with boolean values for when a user activates and de-activates, like so:
CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
, LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
, CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
, CASE
WHEN LEAD("date") OVER w IS NULL THEN true
WHEN LEAD("date") OVER w - "date" > 30 THEN true
ELSE false
END AS "churned"
, CASE
WHEN LAG("date") OVER w IS NULL THEN false
WHEN "date" - LAG("date") OVER w <= 30 THEN false
WHEN row_number() OVER w > 1 THEN true
ELSE false
END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "date")
This creates boolean values per user per day when they become active, when they churn and when they re-activate.
Then do a daily aggregate of the same:
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "date"
FROM "vw_login"
GROUP BY "date"
;
And finally calculate running totals of active MAUs by calculating the cumulative sums over the columns. You need to join the vw_activity twice, since the second one is joined to the day when the user becomes inactive (i.e. 30 days since their last login).
I've included a date series in order to ensure that all days are present in your dataset. You can do without it too, but you might skip days in your dataset.
SELECT
d."date"
, SUM(COALESCE(a.activated::int,0)
- COALESCE(a2.churned::int,0)
+ COALESCE(a.resurrected::int,0)) OVER w AS "active"
, a."activated", a2."churned", a."resurrected" FROM
generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) AS d("date")
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
You can of course do this in a single query, but this helps understand the structure better.
You didn't show us your complete table definition, but maybe something like this:
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;
To get the percentage without repeating the window functions, just wrap this in a derived table:
select date,
dau,
mau,
dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
select date,
count(*) over (partition by date_trunc('day', date) order by date) as dau,
count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
) t
order by date;
Here is an example output:
postgres=> select * from sessions;
session_date | user_id
--------------+---------
2014-05-01 | 1
2014-05-01 | 2
2014-05-01 | 3
2014-05-02 | 1
2014-05-02 | 2
2014-05-02 | 3
2014-05-02 | 4
2014-05-02 | 5
2014-06-01 | 1
2014-06-01 | 2
2014-06-01 | 3
2014-06-02 | 1
2014-06-02 | 2
2014-06-02 | 3
2014-06-02 | 4
2014-06-03 | 1
2014-06-03 | 2
2014-06-03 | 3
2014-06-03 | 4
2014-06-03 | 5
(20 rows)
postgres=> select session_date,
postgres-> dau,
postgres-> mau,
postgres-> round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(> select session_date,
postgres(> count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(> count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(> from sessions
postgres(> ) t
postgres-> order by session_date;
session_date | dau | mau | pct
--------------+-----+-----+------
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-01 | 3 | 3 | 1.00
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-05-02 | 5 | 8 | 0.63
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-01 | 3 | 3 | 1.00
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-02 | 4 | 7 | 0.57
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
2014-06-03 | 5 | 12 | 0.42
(20 rows)
postgres=>

Get Monthly Totals from Running Totals

I have a table in a SQL Server 2008 database with two columns that hold running totals called Hours and Starts. Another column, Date, holds the date of a record. The dates are sporadic throughout any given month, but there's always a record for the last hour of the month.
For example:
ContainerID | Date | Hours | Starts
1 | 2010-12-31 23:59 | 20 | 6
1 | 2011-01-15 00:59 | 23 | 6
1 | 2011-01-31 23:59 | 30 | 8
2 | 2010-12-31 23:59 | 14 | 2
2 | 2011-01-18 12:59 | 14 | 2
2 | 2011-01-31 23:59 | 19 | 3
How can I query the table to get the total number of hours and starts for each month between two specified years? (In this case 2011 and 2013.) I know that I need to take the values from the last record of one month and subtract from them the values from the last record of the previous month. I'm having a hard time coming up with a good way to do this in SQL, however.
As requested, here are the expected results:
ContainerID | Date | MonthlyHours | MonthlyStarts
1 | 2011-01-31 23:59 | 10 | 2
2 | 2011-01-31 23:59 | 5 | 1
Try this:
SELECT c1.ContainerID,
c1.Date,
c1.Hours-c3.Hours AS "MonthlyHours",
c1.Starts - c3.Starts AS "MonthlyStarts"
FROM Containers c1
LEFT OUTER JOIN Containers c2 ON
c1.ContainerID = c2.ContainerID
AND datediff(MONTH, c1.Date, c2.Date)=0
AND c2.Date > c1.Date
LEFT OUTER JOIN Containers c3 ON
c1.ContainerID = c3.ContainerID
AND datediff(MONTH, c1.Date, c3.Date)=-1
LEFT OUTER JOIN Containers c4 ON
c3.ContainerID = c4.ContainerID
AND datediff(MONTH, c3.Date, c4.Date)=0
AND c4.Date > c3.Date
WHERE
c2.ContainerID is null
AND c4.ContainerID is null
AND c3.ContainerID is not null
ORDER BY c1.ContainerID, c1.Date
Using a recursive CTE and some 'creative' JOIN conditions, you can fetch the next month's value for each ContainerID:
WITH CTE_PREP AS
(
--RN will be 1 for last row in each month for each container
--MonthRank will be sequential number for each subsequent month (to increment easier)
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY ContainerID, YEAR(Date), MONTH(DATE) ORDER BY Date DESC) RN
,DENSE_RANK() OVER (ORDER BY YEAR(Date),MONTH(Date)) MonthRank
FROM Table1
)
, RCTE AS
(
--"Zero row", last row in decembar 2010 for each container
SELECT *, Hours AS MonthlyHours, Starts AS MonthlyStarts
FROM CTE_Prep
WHERE YEAR(date) = 2010 AND MONTH(date) = 12 AND RN = 1
UNION ALL
--for each next row just join on MonthRank + 1
SELECT t.*, t.Hours - r.Hours, t.Starts - r.Starts
FROM RCTE r
INNER JOIN CTE_Prep t ON r.ContainerID = t.ContainerID AND r.MonthRank + 1 = t.MonthRank AND t.Rn = 1
)
SELECT ContainerID, Date, MonthlyHours, MonthlyStarts
FROM RCTE
WHERE Date >= '2011-01-01' --to eliminate "zero row"
ORDER BY ContainerID
SQLFiddle DEMO (I have added some data for February and March in order to test on different lengths of months)
Old version fiddle
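For what it's worth, on SQL Server 2012 or later (not the 2008 instance in the question) LAG makes this much simpler. An untested sketch:
WITH month_end AS (
    -- keep only the last row of each month per container
    SELECT ContainerID, Date, Hours, Starts,
           ROW_NUMBER() OVER (PARTITION BY ContainerID, YEAR(Date), MONTH(Date)
                              ORDER BY Date DESC) AS rn
    FROM Containers
)
SELECT ContainerID, Date,
       -- difference against the previous month-end row; the first month per
       -- container comes out NULL, so filter on Date >= '2011-01-01' if needed
       Hours  - LAG(Hours)  OVER (PARTITION BY ContainerID ORDER BY Date) AS MonthlyHours,
       Starts - LAG(Starts) OVER (PARTITION BY ContainerID ORDER BY Date) AS MonthlyStarts
FROM month_end
WHERE rn = 1
ORDER BY ContainerID, Date;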

Efficient time series querying in Postgres

I have a table in my PG db that looks somewhat like this:
id | widget_id | for_date | score |
Each referenced widget has a lot of these items. It's always 1 per day per widget, but there are gaps.
What I want to get is a result that contains all the widgets for each date since X. The dates are brought in via generate series:
SELECT date.date::date
FROM generate_series('2012-01-01'::timestamp with time zone,'now'::text::date::timestamp with time zone, '1 day') date(date)
ORDER BY date.date DESC;
If there is no entry for a date for a given widget_id, I want to use the previous one. So, say widget 1337 doesn't have an entry on 2012-05-10 but does on 2012-05-08; then I want the result set to show the 2012-05-08 entry on 2012-05-10 as well:
Actual data:
widget_id | for_date | score
1312 | 2012-05-07 | 20
1337 | 2012-05-07 | 12
1337 | 2012-05-08 | 41
1337 | 2012-05-11 | 500
Desired output based on generate series:
widget_id | for_date | score
1312 | 2012-05-07 | 20
1337 | 2012-05-07 | 12
1312 | 2012-05-08 | 20
1337 | 2012-05-08 | 41
1312 | 2012-05-09 | 20
1337 | 2012-05-09 | 41
1312 | 2012-05-10 | 20
1337 | 2012-05-10 | 41
1312 | 2012-05-11 | 20
1337 | 2012-05-11 | 500
Eventually I want to boil this down into a view so I have consistent data sets per day that I can query easily.
Edit: Made the sample data and expected resultset clearer
SQL Fiddle
select
widget_id,
for_date,
case
when score is not null then score
else first_value(score) over (partition by widget_id, c order by for_date)
end score
from (
select
a.widget_id,
a.for_date,
s.score,
count(score) over(partition by a.widget_id order by a.for_date) c
from (
select widget_id, g.d::date for_date
from (
select distinct widget_id
from score
) s
cross join
generate_series(
(select min(for_date) from score),
(select max(for_date) from score),
'1 day'
) g(d)
) a
left join
score s on a.widget_id = s.widget_id and a.for_date = s.for_date
) s
order by widget_id, for_date
First of all, you can have a much simpler generate_series() table expression. Equivalent to yours (except for the descending order, which contradicts the rest of your question anyway):
SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
The type date is coerced to timestamptz automatically on input. The return type is timestamptz either way. I use a subquery below so I can cast the output to date right away.
Next, max() as a window function returns exactly what you need: the highest value since frame start, ignoring NULL values. Building on that, you get a radically simple query.
For a given widget_id
Most likely faster than involving CROSS JOIN or WITH RECURSIVE:
SELECT a.day, s.*
FROM (
SELECT d.day
,max(s.for_date) OVER (ORDER BY d.day) AS effective_date
FROM (
SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
) d(day)
LEFT JOIN score s ON s.for_date = d.day
AND s.widget_id = 1337 -- "for a given widget_id"
) a
LEFT JOIN score s ON s.for_date = a.effective_date
AND s.widget_id = 1337
ORDER BY a.day;
->sqlfiddle
With this query you can put any column from score you like into the final SELECT list. I put s.* for simplicity. Pick your columns.
If you want to start your output with the first day that actually has a score, simply replace the last LEFT JOIN with JOIN.
Generic form for all widget_id's
Here I use a CROSS JOIN to produce a row for every widget on every date ..
SELECT a.day, a.widget_id, s.score
FROM (
SELECT d.day, w.widget_id
,max(s.for_date) OVER (PARTITION BY w.widget_id
ORDER BY d.day) AS effective_date
FROM (SELECT generate_series('2012-05-05'::date
,'2012-05-15'::date, '1d')::date AS day) d
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
) a
JOIN score s ON s.for_date = a.effective_date
AND s.widget_id = a.widget_id -- instead of LEFT JOIN
ORDER BY a.day, a.widget_id;
->sqlfiddle
Using your table structure, I created the following Recursive CTE which starts with your MIN(For_Date) and increments until it reaches the MAX(For_Date). Not sure if there is a more efficient way, but this appears to work well:
WITH RECURSIVE nodes_cte(widgetid, for_date, score) AS (
-- First Widget Using Min Date
SELECT
w.widgetId,
w.for_date,
w.score
FROM widgets w
INNER JOIN (
SELECT widgetId, Min(for_date) min_for_date
FROM widgets
GROUP BY widgetId
) minW ON w.widgetId = minW.widgetid
AND w.for_date = minW.min_for_date
UNION ALL
SELECT
n.widgetId,
n.for_date + 1 for_date,
coalesce(w.score,n.score) score
FROM nodes_cte n
INNER JOIN (
SELECT widgetId, Max(for_date) max_for_date
FROM widgets
GROUP BY widgetId
) maxW ON n.widgetId = maxW.widgetId
LEFT JOIN widgets w ON n.widgetid = w.widgetid
AND n.for_date + 1 = w.for_date
WHERE n.for_date + 1 <= maxW.max_for_date
)
SELECT *
FROM nodes_cte
ORDER BY for_date
Here is the SQL Fiddle.
And the returned results (format the date however you'd like):
WIDGETID FOR_DATE SCORE
1337 May, 07 2012 00:00:00+0000 12
1337 May, 08 2012 00:00:00+0000 41
1337 May, 09 2012 00:00:00+0000 41
1337 May, 10 2012 00:00:00+0000 41
1337 May, 11 2012 00:00:00+0000 500
Please note, this assumes your For_Date field is a Date; if it includes a Time, you may need to use interval '1 day' in the query above instead.
Hope this helps.
The data:
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE widget
( widget_id INTEGER NOT NULL
, for_date DATE NOT NULL
, score INTEGER
, PRIMARY KEY (widget_id,for_date)
);
INSERT INTO widget(widget_id , for_date , score) VALUES
(1312, '2012-05-07', 20)
, (1337, '2012-05-07', 12)
, (1337, '2012-05-08', 41)
, (1337, '2012-05-11', 500)
;
The query:
SELECT w.widget_id AS widget_id
, cal::date AS for_date
-- , w.for_date AS org_date
, w.score AS score
FROM generate_series( '2012-05-07'::timestamp , '2012-05-11'::timestamp
, '1day'::interval) AS cal
-- "half cartesian" Join;
-- will be restricted by the NOT EXISTS() below
LEFT JOIN widget w ON w.for_date <= cal
WHERE NOT EXISTS (
SELECT * FROM widget nx
WHERE nx.widget_id = w.widget_id
AND nx.for_date <= cal
AND nx.for_date > w.for_date
)
ORDER BY cal, w.widget_id
;
The result:
widget_id | for_date | score
-----------+------------+-------
1312 | 2012-05-07 | 20
1337 | 2012-05-07 | 12
1312 | 2012-05-08 | 20
1337 | 2012-05-08 | 41
1312 | 2012-05-09 | 20
1337 | 2012-05-09 | 41
1312 | 2012-05-10 | 20
1337 | 2012-05-10 | 41
1312 | 2012-05-11 | 20
1337 | 2012-05-11 | 500
(10 rows)

Statistical Mode with postgres

I have a table that has this schema:
create table mytable (creation_date timestamp,
value int,
category int);
I want the maximum occurrence of a value for each hour per category, only on weekdays. I have made some progress; I have a query like this now:
select category,foo.h as h,value, count(value) from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value;
as result I got something like this:
category | h | value | count
---------+----------+---------+-------
1 | 00:00:00 | 2 | 1
1 | 01:00:00 | 2 | 1
1 | 02:00:00 | 2 | 6
1 | 03:00:00 | 2 | 31
1 | 03:00:00 | 3 | 11
1 | 04:00:00 | 2 | 21
1 | 04:00:00 | 3 | 9
1 | 13:00:00 | 1 | 14
1 | 14:00:00 | 1 | 10
1 | 14:00:00 | 2 | 7
1 | 15:00:00 | 1 | 52
For example, at 04:00 I have two values, 2 and 3, with counts of 21 and 9 respectively. I only need the value with the highest count, which would be the statistical mode.
BTW I have more than 2M records
This can be simpler:
SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
category
, extract(hour FROM creation_date)::int AS h
, count(*)::int AS max_ct
, value
FROM mytable
WHERE extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP BY 1,2,4
ORDER BY 1,2,3 DESC;
Basically these are the steps:
Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
Extract hour from timestamp as h.
Group by category, h and value.
Count the rows per combination of the three; cast to integer - we don't need bigint.
Order by category, h and the highest count (DESC).
Only pick the first row (highest count) per (category, h), with its corresponding value.
I am able to do this in one query level, because DISTINCT is applied after the aggregate function.
The result will hold no rows for any (category, h) with no entries at all. If you need to fill in the blanks, LEFT JOIN to this:
SELECT c.category, h.h
FROM cat_tbl c
CROSS JOIN (SELECT generate_series(0, 23) AS h) h
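Putting the two together, an untested sketch; cat_tbl stands in for wherever your list of categories lives:
SELECT g.category, g.h, m.value, COALESCE(m.max_ct, 0) AS max_ct
FROM (
    -- full grid: every category crossed with every hour 0..23
    SELECT c.category, h.h
    FROM cat_tbl c
    CROSS JOIN (SELECT generate_series(0, 23) AS h) h
) g
LEFT JOIN (
    -- per-hour mode, exactly as in the DISTINCT ON query above
    SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
           category
         , extract(hour FROM creation_date)::int AS h
         , count(*)::int AS max_ct
         , value
    FROM mytable
    WHERE extract(isodow FROM creation_date) < 6
    GROUP BY 1, 2, 4
    ORDER BY 1, 2, 3 DESC
) m ON m.category = g.category AND m.h = g.h
ORDER BY g.category, g.h;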
Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query on that to finalise the results.
Assuming you called the temporary table "summary_table", the following query should do it.
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s2.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value)));
If you don't want to create a table, you could use a WITH clause to attach your query to this one.
with summary_table as (
select category,foo.h as h,value, count(value) as count from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value)
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s2.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value)));