How can I do this in SQL?

Today, I need your help.
I have a stats website, and I get data from the game's web services.
I want to implement a new feature but I don't know how:
I want to guess players' connection hours.
I have a script which collects data every hour and stores it in a table.
Imagine that I have a table with: player_id, score, the hour (an integer, just H), and the day number of the month.
Then, for example, if the score between hour 17 and hour 18 differs, the player has logged into his account in that interval.
To simplify, imagine that the table has every day from 1 to 31 and every hour from 0 to 23 for each day.
At the end of the month I need to run a query that calculates, for each hour, the number of days the player was connected during that hour.
Example :
0 => 31 The player has been connected between 23 and 0: every day
1 => 3 The player has been connected between 0 and 1: 3 days a month
2 => 5 The player has been connected between 1 and 2: 5 days a month
3 => 10 The player has been connected between 2 and 3: 10 days a month
...
23 => 4
I think I can ORDER BY day, hour and player_id, from day 1 hour 0 to day 31 hour 23,
and do a first SELECT with a CASE like:
SELECT
table.*,
(CASE WHEN current_row.score != previous_row.score THEN 1 ELSE 0 END) AS active
FROM table
to know, for each row, whether the player was connected.
And then it's simple to do a GROUP BY and a SUM for each hour.
But I don't know how to compare the previous row with the current one.
Do you have any idea or hint how to do this? Is PL/pgSQL better for this?
Note: I'm using PostgreSQL.
Thanks

You can access the previous row of the table with the LAG window function.
Try something like:
SELECT player_id, hh, count(CASE WHEN score > prev_score THEN 1 END) AS active_days
FROM (
    SELECT player_id, score, mm, hh,
           LAG(score) OVER (PARTITION BY player_id ORDER BY mm, hh) AS prev_score
    FROM your_table
) sub
GROUP BY player_id, hh
Additional advice - store full timestamps instead of separate day and hour fields. You can always extract the day and hour from a timestamp with functions.
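For example, assuming a hypothetical captured_at timestamp column, the mm/hh fields used above can be derived on read:

```sql
-- Sketch only: captured_at is an assumed timestamp column, not from the question.
SELECT player_id,
       score,
       EXTRACT(DAY  FROM captured_at)::int AS mm,  -- day of month, 1-31
       EXTRACT(HOUR FROM captured_at)::int AS hh   -- hour of day, 0-23
FROM your_table;
```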
Manual on window functions: one, two

The problem here is that we're not checking when the player "was connected"
but when the player "earned points", which may or may not coincide;
and only at a granularity of one hour (three logins within one hour count as one).
Likewise, a player who stays logged in for three hours and accrues points over that
period will show up as "logged in" at one, two or three data points, depending.
With those caveats, we can JOIN the score table with itself:
SELECT a.player_id, a.day, a.hour, a.score - b.score AS chg
FROM cdata AS a
JOIN cdata AS b
ON (
(a.player_id = b.player_id AND a.score != b.score)
AND (
(a.hour > 0 AND a.day = b.day AND b.hour = a.hour-1)
OR
(a.hour = 0 AND a.day = b.day+1 AND b.hour = 23)
)
)
This will yield a series of rows for the player, with the day and hour when his
score changed.
You can use this in a collecting subSELECT
SELECT player_id, hour, COUNT(player_id) FROM ( ... ) AS changes
GROUP BY player_id, hour
ORDER BY player_id, hour;
and 'changes' will be a number between 1 and 31 for each hour. Hours with no logins will
not appear in the result.
I have attempted to provide a test case in this SQLFiddle. The above is not PostgreSQL-specific; you could optimize the inner query using PostgreSQL window functions.
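As a sketch of that optimization (reusing the cdata names from the query above), LAG can replace the self-join in PostgreSQL:

```sql
-- Sketch: per-hour count of days on which the score changed vs. the
-- previous hourly sample. Assumes cdata(player_id, day, hour, score)
-- with one row per player, day and hour.
SELECT player_id, hour, COUNT(*) AS days_active
FROM (
    SELECT player_id, day, hour, score,
           LAG(score) OVER (PARTITION BY player_id
                            ORDER BY day, hour) AS prev_score
    FROM cdata
) s
WHERE prev_score IS NOT NULL
  AND score <> prev_score
GROUP BY player_id, hour
ORDER BY player_id, hour;
```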

Related

Flag 2 actual vs benchmark readings every rolling 12 hours in SQL Developer Query

Looking for some help with code in a SQL Developer query to flag any 2 temperature readings - within every rolling 12 hours - that are greater than the acceptable benchmark of 101 deg F.
The given data fields are:
Temp Recorded (DT/TM data type; down to seconds)
Reading Value (number data type)
Patient ID
There are multiple readings taken throughout a patient's stay, at random times.
Logically, we can check whether two adjacent times fall within 12 hours of each other and EACH of their temp readings is >101, but I'm not sure how to put that into a CASE statement (unless there's a better SQL construct).
Will really appreciate it if a SQL-only solution can be recommended.
Many thanks
Giving the code below, including the CASE statement as provided by @Gordon Linoff underneath. The sample code is part of a bigger query joining multiple tables:
SELECT CE.PatientID, CE.ReadingValue, CE.TempRecorded_DT_TM,
(case when sum(case when readingvalue > 101 then 1 else 0 end) over (
partition by patientid
order by CE.TempRecorded_DT_TM
range between interval '12' hour preceding and current row
) >= 2
then 'Y' else 'N'
end) as temp_flag
FROM
edw.se_clinical_event CE
WHERE
CE.PatientID = '176660214'
AND
CE.TempRecorded_DT_TM >= '01-JAN-20'
ORDER BY
TempRecorded_DT_TM
If you want two readings in 12 hours that are greater than 101, then you can use a rolling sum with a window frame:
select t.*,
(case when sum(case when readingvalue > 101 then 1 else 0 end) over (
partition by patientid
order by dt
range between interval '12' hour preceding and current row
) >= 2
then 'Y' else 'N'
end) as temp_flag
from t;
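One practical note, sketched below with the same assumed names: a window function can't be referenced directly in a WHERE clause, so if you only want the flagged readings, wrap the query in a subquery and filter on the flag:

```sql
-- Sketch: keep only readings where the rolling 12-hour window
-- contains at least two values above 101.
SELECT *
FROM (
    SELECT t.*,
           (CASE WHEN SUM(CASE WHEN readingvalue > 101 THEN 1 ELSE 0 END) OVER (
                     PARTITION BY patientid
                     ORDER BY dt
                     RANGE BETWEEN INTERVAL '12' HOUR PRECEDING AND CURRENT ROW
                 ) >= 2
                 THEN 'Y' ELSE 'N'
            END) AS temp_flag
    FROM t
) flagged
WHERE temp_flag = 'Y';
```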

7-day user count: Big-Query self-join to get date range and count?

My Google Firebase event data is integrated with BigQuery, and I'm trying to compute one of the metrics Firebase gives me automatically: the 1-day, 7-day and 28-day user counts.
The 1-day count is quite straightforward:
SELECT
"1-day" as period,
events.event_date,
count(distinct events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events
WHERE events.event_name = "session_start"
group by events.event_date
with a neat result like
period event_date uid
1-day 20190609 5
1-day 20190610 7
1-day 20190611 5
1-day 20190612 7
1-day 20190613 37
1-day 20190614 73
1-day 20190615 52
1-day 20190616 36
But it gets complicated when I try to count, for each day, how many unique users I had in the previous 7 days.
From the above query, I know my target value for day 20190616 is 142, obtained by filtering to those 7 days and removing the GROUP BY.
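That manual check might look like the following sketch (here a "7-day" window spans the date and the 6 days before it; the exact boundary convention is an assumption):

```sql
-- Sketch: distinct users over the 7 calendar days ending 20190616,
-- without grouping by date.
SELECT
  "7-day" AS period,
  "20190616" AS event_date,
  COUNT(DISTINCT user_pseudo_id) AS uid
FROM `your_path.events_*`
WHERE event_name = "session_start"
  AND PARSE_DATE("%Y%m%d", event_date)
      BETWEEN DATE_SUB(PARSE_DATE("%Y%m%d", "20190616"), INTERVAL 6 DAY)
          AND PARSE_DATE("%Y%m%d", "20190616")
```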
The solution I tried is a direct self-join (plus variations that didn't change the result):
SELECT
"7-day" as period,
events.event_date,
count(distinct user_events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events,
`your_path.events_*` as user_events
WHERE user_events.event_name = "session_start"
and PARSE_DATE("%Y%m%d", events.event_date) between DATE_SUB(PARSE_DATE("%Y%m%d", user_events.event_date), INTERVAL 7 DAY) and PARSE_DATE("%Y%m%d", user_events.event_date) #one day in the first table should correspond to 7 days worth of events in the second
and events.event_date = "20190616" #fixed date to check
group by events.event_date
Now, I know I'm barely setting any join conditions, but if anything I expected that to produce a cross join and huge counts. Instead, the count comes out as 70, which is much lower than expected. Furthermore, I can set INTERVAL 2 DAY and the result does not change.
I'm clearly doing something very wrong here, but I also suspect my approach is rudimentary and there must be a smarter way to accomplish this.
I have checked Calculating a current day 7 day active user with BigQuery?, but the explicit cross join there is with event_dim, whose definition I'm unsure about.
I also checked the solution provided at Rolling 90 days active users in BigQuery, improving performance (DAU/MAU/WAU), as suggested in a comment.
The solution seemed sound at first, but it has problems for the more recent days. Here's the COUNT(DISTINCT) query adapted to my case:
SELECT DATE_SUB(event_date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT user_pseudo_id) unique_90_day_users
, COUNT(DISTINCT IF(i<29,user_pseudo_id,null)) unique_28_day_users
, COUNT(DISTINCT IF(i<8,user_pseudo_id,null)) unique_7_day_users
, COUNT(DISTINCT IF(i<2,user_pseudo_id,null)) unique_1_day_users
FROM (
SELECT PARSE_DATE("%Y%m%d",event_date) as event_date, user_pseudo_id
FROM `your_path_here.events_*`
WHERE EXTRACT(YEAR FROM PARSE_DATE("%Y%m%d",event_date))=2019
GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
and here is the result for the latest days (keep in mind the data starts on May 23rd), where you can see that the result is wrong:
row_num date_grp 90-day 28-day 7-day 1-day
114 2019-06-16 273 273 273 210
115 2019-06-17 78 78 78 78
So on the last day the 90-day, 28-day and 7-day counts only consider that very day instead of all the days before.
It's not possible for the 90-day count on June 17th to be 78 if the 1-day count on June 16th was higher.
This is an answer to my own question.
My means are rudimentary, as I'm not extremely familiar with BigQuery shortcuts and some advanced functions, but the result is correct.
I hope others will be able to contribute better queries.
#standardSQL
WITH dates AS (
SELECT i as event_date
FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE(), INTERVAL 1 DAY)) i
)
, ptd_dates as (
SELECT DISTINCT "90-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 90)) i
UNION ALL
SELECT distinct "28-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 29)) i
UNION ALL
SELECT distinct "7-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 7)) i
UNION ALL
SELECT distinct "1-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",event_date) as ptd_date
FROM dates
)
SELECT event_date,
sum(IF(day_category="90-day",unique_ptd_users,null)) as count_90_day ,
sum(IF(day_category="28-day",unique_ptd_users,null)) as count_28_day,
sum(IF(day_category="7-day",unique_ptd_users,null)) as count_7_day,
sum(IF(day_category="1-day",unique_ptd_users,null)) as count_1_day
from (
SELECT ptd_dates.day_category
, ptd_dates.event_date
, COUNT(DISTINCT user_pseudo_id) unique_ptd_users
FROM ptd_dates,
`your_path_here.events_*` events,
unnest(events.event_params) e_params
WHERE ptd_dates.ptd_date = events.event_date
GROUP BY ptd_dates.day_category
, ptd_dates.event_date)
group by event_date
order by 1,2,3
As per the suggestion from ECris, I first defined a calendar table to use: it contains 4 categories of PTDs (periods to date), each generated from basic elements; this should scale linearly, and since it doesn't query the event dataset it has no gaps.
Then this is joined with the events; the join condition shows how, for each date, I count distinct users across all related days in the period.
The results are correct.

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: User_ID and fail_date. Each time somebody's card is rejected, they are logged in the table; the card is automatically tried again 3 days later, and if it fails again, another entry is added. I am trying to write a query that counts unique failures by month, so I only want to count the first entry, not the 3-day retries, if they exist. My data set looks like this:
user_id fail_date
222 01/01
222 01/04
555 02/15
777 03/31
777 04/02
222 10/11
so my desired output would be something like this:
month unique_fails
jan 1
feb 1
march 1
april 0
oct 1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies - just help with how to approach this problem, as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are at most three days apart, it's a follow-up; mark the row as such. Then you can filter to exclude the follow-ups.
It might look something like:
SELECT month,
count(*) unique_fails
FROM (SELECT month(fail_date) month,
CASE
WHEN datediff(day,
lag(fail_date) OVER (PARTITION BY user_id
ORDER BY fail_date),
fail_date) <= 3 THEN
1
ELSE
0
END follow_up
FROM elbat) x
WHERE follow_up = 0
GROUP BY month;
I'm not so sure about the exact syntax in Vertica, so it might need some adaptation. I also don't know whether fail_date is actually a date/time type or just a string. If it's a string, the date/time-specific functions won't work on it; it would have to be converted before being passed to them.
If the data spans several years, you might also want to include the year in addition to the month, to keep months from different years apart. In the inner SELECT add a column year(fail_date) year, and add year to the column list and the GROUP BY of the outer SELECT.
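With the year added as described, the whole query might look like this sketch (Vertica syntax unverified):

```sql
-- Sketch: same follow-up filter as above, grouped by year and month.
SELECT year, month, count(*) AS unique_fails
FROM (SELECT year(fail_date) AS year,
             month(fail_date) AS month,
             CASE
               WHEN datediff(day,
                             lag(fail_date) OVER (PARTITION BY user_id
                                                  ORDER BY fail_date),
                             fail_date) <= 3
               THEN 1 ELSE 0
             END AS follow_up
      FROM elbat) x
WHERE follow_up = 0
GROUP BY year, month;
```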
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
then 0 else 1
end) as first_failure_flag
from t;
Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'), -- should always include the year
sum(first_failure_flag)
from (select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
then 0 else 1
end) as first_failure_flag
from t
) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
In a derived table, determine the previous fail_date (prev_fail_date) for each user_id and fail_date, using a correlated subquery.
Using the derived table dt, count the failure if the number of days between the current fail_date and prev_fail_date is greater than 3.
The DATEDIFF() function together with the IF() function is used to single out the cases which are not repeated tries.
To group this result by month, you can use the MONTH function.
But the data can span multiple years, so you need to separate them out by year as well: do a multi-level GROUP BY using the YEAR function.
Try the following (written for MySQL - the idea carries over to other RDBMS):
SELECT YEAR(dt.fail_date) AS year_fail_date,
MONTH(dt.fail_date) AS month_fail_date,
COUNT( IF(DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3, user_id, NULL) ) AS unique_fails
FROM (
SELECT
t1.user_id,
t1.fail_date,
(
SELECT t2.fail_date
FROM your_table AS t2
WHERE t2.user_id = t1.user_id
AND t2.fail_date < t1.fail_date
ORDER BY t2.fail_date DESC
LIMIT 1
) AS prev_fail_date
FROM your_table AS t1
) AS dt
GROUP BY
year_fail_date,
month_fail_date
ORDER BY
year_fail_date ASC,
month_fail_date ASC

Count occurrences of combinations of columns

I have daily time series (actually business days) for different companies, and I work with PostgreSQL. There is also an indicator variable (called flag) taking the value 0 most of the time, and 1 on some rare event days. If the indicator variable takes the value 1 for a company, I want to further investigate that company's entries from two days before to one day after the event. Let me refer to that as the [-2,1] window, with the event day being day 0.
I am using the following query
CREATE TABLE test AS
WITH cte AS (
SELECT *
, MAX(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN 1 preceding AND 2 following) Lead1
FROM mytable)
SELECT *
FROM cte
WHERE Lead1 = 1
ORDER BY day,company
The query takes, for the company experiencing the event, the entries ranging from 2 days before to one day after the event.
It does that for all events.
This is a small section of the resulting table.
day company flag
2012-01-23 A 0
2012-01-24 A 0
2012-01-25 A 1
2012-01-25 B 0
2012-01-26 A 0
2012-01-26 B 0
2012-01-27 B 1
2012-01-30 B 0
2013-01-10 A 0
2013-01-11 A 0
2013-01-14 A 1
Now I want to do further calculations for every [-2,1] window separately, so I need a variable that identifies each window. The idea is to number the windows per company with a variable "occur", so that in further calculations I can use the clause
GROUP BY company, occur
Therefore my desired output looks like that:
day company flag occur
2012-01-23 A 0 1
2012-01-24 A 0 1
2012-01-25 A 1 1
2012-01-25 B 0 1
2012-01-26 A 0 1
2012-01-26 B 0 1
2012-01-27 B 1 1
2012-01-30 B 0 1
2013-01-10 A 0 2
2013-01-11 A 0 2
2013-01-14 A 1 2
In the example, company B occurs only once (occur = 1), but company A occurs twice: first from 2012-01-23 to 2012-01-26, and a second time from 2013-01-10 to 2013-01-14. The second range of company A does not contain all four days surrounding the event day (-2,-1,0,1), since the company leaves the dataset before the end of that range.
As I said, I'm working with business days. I don't care about holidays; I have data from Monday to Friday. Earlier I wrote the following function:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$BODY$
WITH alldates AS (
SELECT i,
$1 + (i * CASE WHEN $2 < 0 THEN -1 ELSE 1 END) AS date
FROM generate_series(0,(ABS($2) + 5)*2) i
),
days AS (
SELECT i, date, EXTRACT('dow' FROM date) AS dow
FROM alldates
),
businessdays AS (
SELECT i, date, d.dow FROM days d
WHERE d.dow BETWEEN 1 AND 5
ORDER BY i
)
-- adding business days to a date --
SELECT date FROM businessdays WHERE
CASE WHEN $2 > 0 THEN date >=$1 WHEN $2 < 0
THEN date <=$1 ELSE date =$1 END
LIMIT 1
offset ABS($2)
$BODY$
LANGUAGE 'sql' VOLATILE;
It can add/subtract business days to/from a given date and works like this:
select * from addbusinessdays('2013-01-14', -2)
returns 2013-01-10. So in Jakub's approach we can change the second- and third-to-last lines to
w.day BETWEEN addbusinessdays(t1.day, -2) AND addbusinessdays(t1.day, 1)
to handle business days.
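Putting that together, the adapted query might read as follows (a sketch, reusing table1 and the rank from Jakub's answer; the rank is aliased to occur to match the desired output):

```sql
-- Sketch: Jakub's query with business-day arithmetic substituted
-- for the plain day intervals.
WITH windows AS (
    SELECT t1.day,
           t1.company,
           rank() OVER (PARTITION BY company ORDER BY day) AS rank
    FROM table1 t1
    WHERE flag = 1)
SELECT t1.day, t1.company, t1.flag, w.rank AS occur
FROM table1 AS t1
JOIN windows AS w
  ON t1.company = w.company
 AND w.day BETWEEN addbusinessdays(t1.day, -2)
               AND addbusinessdays(t1.day, 1)
ORDER BY t1.day, t1.company;
```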
Function
While using the function addbusinessdays(), consider this instead:
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$func$
SELECT day
FROM (
SELECT i, $1 + i * sign($2)::int AS day
FROM generate_series(0, ((abs($2) * 7) / 5) + 3) i
) sub
WHERE EXTRACT(ISODOW FROM day) < 6 -- truncate weekend
ORDER BY i
OFFSET abs($2)
LIMIT 1
$func$ LANGUAGE sql IMMUTABLE;
Major points
Never quote the language name sql. It's an identifier, not a string.
Why was the function VOLATILE? Make it IMMUTABLE for better performance in repeated use and more options (like using it in a functional index).
(abs($2) + 5) * 2 is way too much padding. ((abs($2) * 7) / 5) + 3 is enough.
Multiple levels of CTEs were useless cruft.
ORDER BY in last CTE was useless, too.
As mentioned in my previous answer, extract(ISODOW FROM ...) is more convenient to truncate weekends.
Query
That said, I wouldn't use the above function for this query at all. Build a complete grid of relevant days once, instead of calculating the range of days for every single row.
Based on this assertion in a comment (should be in the question, really!):
two subsequent windows of the same firm can never overlap.
WITH range AS ( -- only with flag
SELECT company
, min(day) - 2 AS r_start
, max(day) + 1 AS r_stop
FROM tbl t
WHERE flag <> 0
GROUP BY 1
)
, grid AS (
SELECT company, day::date
FROM range r
,generate_series(r.r_start, r.r_stop, interval '1d') d(day)
WHERE extract('ISODOW' FROM d.day) < 6
)
SELECT *, sum(flag) OVER(PARTITION BY company ORDER BY day
ROWS BETWEEN UNBOUNDED PRECEDING
AND 2 following) AS window_nr
FROM (
SELECT t.*, max(t.flag) OVER(PARTITION BY g.company ORDER BY g.day
ROWS BETWEEN 1 preceding
AND 2 following) in_window
FROM grid g
LEFT JOIN tbl t USING (company, day)
) sub
WHERE in_window > 0 -- only rows in [-2,1] window
AND day IS NOT NULL -- exclude missing days in [-2,1] window
ORDER BY company, day;
How?
Build a grid of all business days: CTE grid.
To keep the grid to its smallest possible size, extract minimum and maximum (plus buffer) day per company: CTE range.
LEFT JOIN actual rows to it. Now the frames for the ensuing window functions work with static numbers.
To get distinct numbers per flag and company (window_nr), just count flags from the start of the grid (taking buffers into account).
Only keep days inside your [-2,1] windows (in_window > 0).
Only keep days with actual rows in the table.
Voilà.
SQL Fiddle.
Basically the strategy is to first enumerate the flag days and then join the others with them:
WITH windows AS(
SELECT t1.day
,t1.company
,rank() OVER (PARTITION BY company ORDER BY day) as rank
FROM table1 t1
WHERE flag =1)
SELECT t1.day
,t1.company
,t1.flag
,w.rank
FROM table1 AS t1
JOIN windows AS w
ON
t1.company = w.company
AND
w.day BETWEEN
t1.day - interval '2 day' AND t1.day + interval '1 day'
ORDER BY t1.day, t1.company;
Fiddle.
However, there is a problem with work days, as those can mean anything (do holidays count?).

Count over rows in previous time range partitioned by a specific column

My dataset consists of daily (actually business-day) time series for different companies from different industries, and I work with PostgreSQL. I have an indicator variable in my dataset taking the values 1, -1 and, most of the time, 0. For better readability I refer to days where the indicator variable is nonzero as indicator events.
For every indicator event that is preceded by another indicator event for the same industry within the previous three business days, the indicator variable shall be updated to zero.
We can think of the following example dataset:
day company industry indicator
2012-01-12 A financial 1
2012-01-12 B consumer 0
2012-01-13 A financial 1
2012-01-13 B consumer -1
2012-01-16 A financial 0
2012-01-16 B consumer 0
2012-01-17 A financial 0
2012-01-17 B consumer 0
2012-01-17 C consumer 0
2012-01-18 A financial 0
2012-01-18 B consumer 0
2012-01-18 C consumer 1
So the indicator values that shall be updated to zero are the entry for company A on 2012-01-13 and the entry for company C on 2012-01-18, because each is preceded by another indicator event in the same industry within 3 business days.
I tried to accomplish it in the following way:
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
SELECT day, industry
FROM (
SELECT industry, day,
COUNT(CASE WHEN indicator <> 0 THEN 1 END)
OVER (PARTITION BY industry ORDER BY day
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) As cnt
FROM test
) alias
WHERE cnt >= 2)
My idea was to count the indicator events over the current day and the 3 preceding days, partitioned by industry. If it counts more than 1, it updates the indicator value to zero.
The weak spot is that, so far, it counts over the three preceding rows (partitioned by industry) instead of the three preceding business days. So in the example data it cannot update company C on 2012-01-18, because it counts over the last three rows where industry = consumer instead of over all rows where industry = consumer within the last three business days.
I tried different methods, like adding another subquery to the third-to-last line of the code, or adding a WHERE EXISTS clause after it, to ensure that the count runs over the three preceding dates. But nothing worked; I really can't figure out how to do that (I'm just learning to work with PostgreSQL).
Do you have any ideas how to fix it?
Or maybe I'm thinking in a completely wrong direction and you know another approach to solve my problem?
DB design
First off, your table should be normalized: industry should be a small foreign-key column (typically integer) referencing industry_id of an industry table. Maybe you have that already and only simplified for the sake of the question; your actual table definition would go a long way.
Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:
CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE indicator <> 0;
Equality first, range last.
This assumes that indicator is defined NOT NULL. If industry were an integer, this index would be perfectly efficient.
Query
This query identifies rows to be reset:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
SELECT industry, day
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('ISODOW' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) sub
WHERE incident
AND ct > 1
ORDER BY 1, 2;
SQL Fiddle.
ISODOW as an extract() parameter is convenient for truncating weekends.
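A quick way to convince yourself of the ISODOW numbering (Monday = 1 through Sunday = 7, so < 6 keeps exactly Monday through Friday):

```sql
-- List a few consecutive days with their ISODOW value;
-- weekend days come out as 6 (Saturday) and 7 (Sunday).
SELECT d::date AS day, EXTRACT(ISODOW FROM d) AS isodow
FROM generate_series('2012-01-13'::date, '2012-01-16'::date, interval '1 day') d;
```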
Integrate this in your UPDATE:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
UPDATE tbl t
SET indicator = 0
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('isodow' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) u
WHERE u.incident
AND u.ct > 1
AND t.industry = u.industry
AND t.day = u.day;
This should be substantially faster than your solution with correlated subqueries and a function call for every row. Even though that solution is based on my own previous answer, it's not a perfect fit for this case.
In the meantime I found one possible solution myself (I hope this isn't against the etiquette of the forum).
Please note that this is only one possible solution. You are very welcome to comment on it or to develop improvements if you want to.
For the first part, the function addbusinessdays, which can add (or subtract) business days to a given date, I am referring to:
http://osssmb.wordpress.com/2009/12/02/business-days-working-days-sql-for-postgres-2/
(I just slightly modified it because I only care about weekends, not holidays.)
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$BODY$
with alldates as (
SELECT i,
$1 + (i * case when $2 < 0 then -1 else 1 end) AS date
FROM generate_series(0,(abs($2) + 5)*2) i
),
days as (
select i, date, extract('dow' from date) as dow
from alldates
),
businessdays as (
select i, date, d.dow from days d
where d.dow between 1 and 5
order by i
)
select date from businessdays where
case when $2 > 0 then date >=$1 when $2 < 0 then date <=$1 else date =$1 end
limit 1
offset abs($2)
$BODY$
LANGUAGE 'sql' VOLATILE
COST 100;
ALTER FUNCTION addbusinessdays(date, integer) OWNER TO postgres;
For the second part, I am applying Erwin Brandstetter's correlated-subquery approach from this related question: Window Functions or Common Table Expressions: count previous rows within range
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
SELECT day, industry
FROM (
SELECT industry, day,
(SELECT COUNT(CASE WHEN indicator <> 0 THEN 1 END)
FROM test t1
WHERE t1.industry = t.industry
AND t1.day between addbusinessdays(t.day,-3) and t.day) As cnt
FROM test t
) alias
WHERE cnt >= 2)